Last Friday, July 19, thousands of fliers across the United States fought their way through TSA security, only to reach their gates and find flights delayed or canceled. In restaurants, mobile ordering options failed, and millions of PC users opened their computers to find nothing but a blue screen.
So, what happened? CrowdStrike, a cloud-based computer security firm, pushed a “Rapid Response Content” update with an uncaught bug that crashed 8.5 million machines running Windows, leaving them stuck on the error screen colloquially known as the “Blue Screen of Death.” This widespread outage devastated a myriad of companies and organizations, including major airlines, banks and even 911 call centers.
Unlike most software bugs, this was not the result of a code change. CrowdStrike’s Rapid Response Content is delivered as “template instances,” essentially configuration for the company’s security software, the Falcon platform. The platform runs its own internal validation on each template instance and then uses it as a pattern for the behavior to detect or prevent. Unfortunately, one of the template instances CrowdStrike pushed on Friday morning contained problematic data but still passed Falcon’s internal validation because of a pre-existing bug in that validation.
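To make the idea of validating configuration before it ships more concrete, here is a minimal Python sketch. It is not CrowdStrike's code: the `TemplateInstance` class, the `validate` and `safe_to_ship` functions and the two checks they perform are all hypothetical, chosen only to illustrate how a validator is supposed to stop bad data. A bug in any one of these checks would let a malformed instance through, which is roughly the kind of failure described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical stand-in for one piece of configuration content."""
    name: str
    fields: tuple[str, ...]


def validate(instance: TemplateInstance, expected_fields: int) -> list[str]:
    """Return a list of problems; an empty list means the instance passed."""
    problems = []
    # Check 1: the instance must supply exactly as many fields as expected.
    if len(instance.fields) != expected_fields:
        problems.append(
            f"{instance.name}: expected {expected_fields} fields, "
            f"got {len(instance.fields)}"
        )
    # Check 2: no field may be empty or whitespace-only.
    for index, value in enumerate(instance.fields):
        if not value.strip():
            problems.append(f"{instance.name}: field {index} is empty")
    return problems


def safe_to_ship(instances, expected_fields: int) -> bool:
    """Ship only if every instance passes every check."""
    return all(not validate(i, expected_fields) for i in instances)


# Illustrative data only; these rule names and fields are made up.
good = TemplateInstance("rule-a", ("process", "path", "action"))
bad = TemplateInstance("rule-b", ("process", "", "action"))

for problem in validate(bad, expected_fields=3):
    print(problem)
```

In this sketch a single failing instance blocks the whole batch, a deliberately conservative choice: holding back a good update costs far less than shipping a bad one.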
Usually, when large companies find errors in recently updated code, they perform a rollback, where they revert their codebase to the last stable version. However, the unique nature of this error meant that the affected machines could not even start up, much less receive a fix from CrowdStrike. Every single affected machine had to be repaired and rebooted manually.
This devastating error is a direct result of CrowdStrike breaking a cardinal rule of software development: do not deploy on Fridays. While it may seem silly, this rule helps IT professionals keep a work-life balance instead of working long weekends to fix an avoidable mistake. Bugs and errors are unavoidable when writing code, but they are easy to mitigate with a well-structured workflow. CrowdStrike’s global outage has highlighted the importance of hiring teams of trained quality assurance professionals. Since the COVID-19 pandemic, many companies have been downsizing their IT and QA teams, but the need for human involvement is as great as ever.
Another major takeaway that many individuals and companies have drawn from this outage is that the modern technical age is not as redundant and decentralized as some might like to think. With data centers all over the world and “cloud” computing the norm, mass outages like this one have become less common. But while the total number of servers and data centers might have gone up, the number of cloud provider options has only gone down.
When looking at cloud computing options, the only names in town are Amazon Web Services, Microsoft Azure, Google Cloud and the occasional smaller provider. The fewer companies that offer cloud services, or applications that run on them as CrowdStrike does, the more centralized our modern infrastructure becomes. Because each company holds such a massive slice of the market, a Microsoft-specific outage turns into a worldwide disaster. In the age before cloud computing, each company or individual looking to build back-end or even front-end infrastructure for a website or application would have their own servers, and self-hosting was the norm.
If one company had a security failure, software bug or crash, the damage would be limited to just that one organization, not spread across a massive number of them. The centralization of cloud computing and layoffs of QA staff and field technicians will only continue to make the modern world less reliable.