What went wrong on July 19, 2024?
On July 19, 2024, the cybersecurity company CrowdStrike pushed a mass update to its CrowdStrike Falcon sensor on millions of computers worldwide. Updates like this normally help the sensor identify new threats and improve its ability to prevent cyberattacks. This particular update went badly wrong, crashing Windows systems everywhere.
Let’s Dive a Little into the Technical Side
I’m not a Windows expert, but the Internet is full of them, and luckily, it’s not so hard to find some information on why this happened from a Windows point of view.
The Kernel
All operating systems have something in common: there is user space and there is kernel space. User space is everything that runs outside the operating system’s kernel, basically all the everyday applications. Code running in user space has no direct access to the computer’s hardware (CPU, memory, etc.) and is much more limited in what it can do than anything running in kernel mode. When an application fails, it just shuts down without affecting other applications. When the kernel fails, the whole system comes down.
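Not kernel code, obviously, but here is a rough user-space analogy in Python: a crashing child process takes only itself down, while the parent keeps running. There is no equivalent safety net when kernel-mode code fails.

```python
# User-space isolation in miniature: the child process crashes, the parent
# survives. Kernel-mode code has no such boundary around it.
import subprocess
import sys

crashing_app = "raise RuntimeError('simulated application crash')"

result = subprocess.run([sys.executable, "-c", crashing_app])
print(f"Child exited with code {result.returncode}; this process is still running.")
```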
CrowdStrike’s Kernel-Mode Boot-Start Driver
The CrowdStrike sensor needs to run in kernel mode to prevent some threats. Because of this, CrowdStrike created a driver that is signed as safe to run in the kernel.
How could the driver still be considered safe when the changes proved to be so lethal? The driver itself was not changed and, therefore, was still digitally signed as safe.
Signing the driver means having it certified by Microsoft, a non-trivial process that takes time. To stay ahead of new threats by updating the sensor as fast as possible, CrowdStrike devised a strategy to change what the driver does without modifying the driver itself.
The driver reads external files, called channel files, containing the information it needs to detect and prevent attacks. Instead of updating the driver, which would mean going through certification again, CrowdStrike updates these channel files. That way they don’t have to wait for a new certificate and can still react quickly to keep systems secure.
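Here is a minimal sketch of that “fixed engine, swappable content” idea, in Python for readability. The directory, JSON format, and rule structure are all made up for illustration; real channel files are proprietary binary blobs, not JSON.

```python
import json
from pathlib import Path

CHANNEL_DIR = Path("channel_files")  # hypothetical location for detection content

def load_rules() -> list[dict]:
    """Re-read detection content from disk; the engine code itself never changes."""
    rules: list[dict] = []
    for channel_file in sorted(CHANNEL_DIR.glob("*.json")):
        rules.extend(json.loads(channel_file.read_text()))
    return rules

def looks_malicious(event: str, rules: list[dict]) -> bool:
    """Fixed matching logic: flag the event if any rule's pattern appears in it."""
    return any(rule["pattern"] in event for rule in rules)

# Shipping a new detection means dropping a new file into CHANNEL_DIR,
# not modifying (and re-certifying) the code above.
print(looks_malicious("suspicious-process --encoded-payload", load_rules()))
```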
Since CrowdStrike wants to make sure its driver is always loaded at system boot, the driver is flagged as a boot-start driver. Windows loads boot-start drivers very early in the boot process and treats them as required for booting. What happens if the driver fails to load? The system can’t boot.
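On Windows, a driver’s start type is stored in the registry under its service key, and a value of 0 means boot-start: the boot loader loads it before almost anything else. A small Windows-only sketch to check that value, using a made-up service name:

```python
import winreg  # Windows-only module from the standard library

SERVICE_KEY = r"SYSTEM\CurrentControlSet\Services\ExampleSensor"  # hypothetical name

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVICE_KEY) as key:
    start_type, _ = winreg.QueryValueEx(key, "Start")

# 0 = boot, 1 = system, 2 = automatic, 3 = manual, 4 = disabled
print("Boot-start driver" if start_type == 0 else f"Start type: {start_type}")
```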
The July 19, 2024 Update
On July 19, 2024, CrowdStrike rolled out a mass update containing a new channel file. This file made the driver crash (a kernel panic, which on Windows shows up as the infamous blue screen), bringing the system down, with the additional problem that the machine could not boot again afterwards.
CrowdStrike noticed the issue and pushed a corrected update about an hour later, but any computer that had already crashed couldn’t download the new file version.
The solution? Physically going to the computer, booting into Safe Mode (which loads only a very limited set of drivers), and deleting the file.
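In script form, the fix amounts to something like the sketch below, run from Safe Mode or a recovery environment. The directory and file pattern are placeholders, not the actual names CrowdStrike published; the point is simply “find the bad channel file and remove it”.

```python
from pathlib import Path

# Placeholder path and pattern, purely illustrative.
channel_dir = Path(r"C:\ExampleSensor\ChannelFiles")

for bad_file in channel_dir.glob("channel-update-*.bin"):
    print(f"Removing {bad_file}")
    bad_file.unlink()
```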
According to people who got access to the file and looked at it, the file was just zeros. Certainly not the file that CrowdStrike really wanted to deploy.
“You’re correct. Full of zeros at least.” (christian_taillon, @christian_tail, on X, July 19, 2024: pic.twitter.com/PJcCsUb9Vc)
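If you had one of these files in hand, checking that claim is trivial. A small Python helper that reads the file in chunks and reports whether every byte is zero (the file name is a placeholder):

```python
from pathlib import Path

def is_all_zeros(path: Path, chunk_size: int = 1 << 20) -> bool:
    """Return True if every byte in the file is 0x00."""
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            if chunk.count(0) != len(chunk):
                return False
    return True

print(is_all_zeros(Path("suspect_channel_file.bin")))  # hypothetical file name
```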
What Can Be Done to Prevent Another July 19?
Everybody who works in the IT industry knows that bugs are impossible to avoid, and without internal knowledge of how things work in CrowdStrike, it seems unfair to comment on their engineering practices or quality.
This, however, doesn’t stop me from commenting on how I think the problem could have been avoided (it’s easy to talk with Monday’s newspaper, as we say in Uruguay):
- Make sure you have a deployment pipeline that includes automated tests, and that those tests actually cover what matters. TDD is a good practice to help accomplish this. It’s not enough on its own, but it is a strong starting point.
- When working with multiple systems, always roll out updates incrementally. This is called a canary deployment: push the update to a small percentage of systems first, make sure everything went as expected, and only then continue with the rest. If you are going to update millions of systems, start with 1, then 100, then 1,000, and so on (see the rollout sketch after this list).
- Avoid Friday deployments. You want to ensure that someone will be around to troubleshoot any issues that come up, especially in global environments.
- Make sure your software handles errors gracefully. You don’t want corrupt content or config files with the right name crashing your application in a way that takes down the whole system (see the validation sketch after this list).
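To make the canary point concrete, here is a minimal staged-rollout sketch. `deploy_to` and `is_healthy` are hypothetical stand-ins for whatever your fleet management and monitoring actually provide, and the stage sizes and bake time are arbitrary.

```python
import random
import time

BAKE_TIME_SECONDS = 30 * 60  # how long each stage runs before it is judged

def deploy_to(host: str) -> None:
    """Hypothetical: push the update (e.g., a new channel file) to one host."""
    print(f"Deploying to {host}")

def is_healthy(host: str) -> bool:
    """Hypothetical: ask the host, or your monitoring, whether it is still fine."""
    return True

def staged_rollout(hosts: list[str], stages: tuple[int, ...] = (1, 100, 1000)) -> None:
    random.shuffle(hosts)                 # don't always canary the same machines
    deployed = 0
    for target in (*stages, len(hosts)):  # finish with the whole fleet
        for host in hosts[deployed:target]:
            deploy_to(host)
        deployed = max(deployed, min(target, len(hosts)))
        time.sleep(BAKE_TIME_SECONDS)
        if not all(is_healthy(h) for h in hosts[:deployed]):
            raise RuntimeError(f"Unhealthy hosts found; rollout halted at {deployed}")
    print("Rollout complete")
```

Real rollout tooling would add automatic rollback, per-region staging, and better health signals, but even this naive version would have stopped a fleet-wide crash at the first stage.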
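And for the last point, a sketch of “validate before you trust”: refuse to load a content file that is empty, all zeros, or structurally wrong, and fall back to the last known-good version instead of crashing. The magic header and function names are invented for the example.

```python
from pathlib import Path

MAGIC = b"CHNL"  # hypothetical header a valid file would start with

def load_channel_file(path: Path, last_known_good: Path) -> bytes:
    """Return the file's contents, or the previous good version if it looks broken."""
    try:
        data = path.read_bytes()
        if not data or data.count(0) == len(data):
            raise ValueError("file is empty or all zeros")
        if not data.startswith(MAGIC):
            raise ValueError("missing expected header")
        return data
    except (OSError, ValueError) as err:
        # Log and degrade gracefully instead of taking the whole system down.
        print(f"Rejecting {path}: {err}; using last known-good file instead")
        return last_known_good.read_bytes()
```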
An additional question that we could answer in a separate post is: What could CrowdStrike customers have done to prevent this from happening?