CrowdStrike Outage: a Wake-up Call to Reimagine Tech Resilience

In July, a global tech outage triggered by an update to CrowdStrike's software affected Microsoft Windows systems across a wide range of regulated critical industries, including finance, air travel, health care, logistics and transportation.

The fiasco resulted from a preventable yet extremely common pattern of failure, one familiar to cybersecurity experts. Developers relied on ad hoc, hand-written code to check whether everything about the complex data ingested by the system was as expected. When such rough-and-ready practices are employed, complex data inevitably exceeds the programmers' capability to make input checks complete and correct, with disastrous results for their clients.

CrowdStrike recently published two documents that reveal some technical details about the global Windows outage caused by its software, which purports to protect Windows systems via a methodology known as Endpoint Detection and Response (EDR).

The first of these documents describes the flaw as a "logic error". The second calls it an "undetected error" in a "content configuration update", which "passed validation despite containing problematic content data" due to a "bug in the Content Validator". The second document then lays out a plan of action to prevent such flaws from happening again.

Although CrowdStrike's analysis is still ongoing, these descriptions suggest a deeper pattern of failure that the proposed mitigations fail to address.

For the last decade, Dartmouth's Institute for Security, Technology and Society has researched exactly this pattern of failure in an effort to raise awareness of the problem.

Nature of the Failure Pattern

The nature of this fallible approach is relatively simple: unless data is defined in an unambiguous, machine-readable form, its complexity inevitably exceeds the programmers' capability to ensure that everything having to do with it is as expected.

This reality leads to code acting on assumptions about data that have not been checked. The results are both a lack of resilience, as in CrowdStrike's case, and a lack of security.

Intuitively, the more complex the data, the more rules it is expected to obey, and the more potential combinations of bits and bytes in the input must be checked so that they do not surprise and crash the receiving code.

Experience shows that a human programmer's ability to verify that all of these rules are exactly obeyed, and that all of the thousands or millions of potential combinations of basic data are accounted for, is quite limited. That is, errors of inadvertently omitted checks are bound to happen.

This is exactly what happened in CrowdStrike's outage. The data update surprised the code. The new data was not as the code expected. Consuming it brought the code into an unexpected state, which led to a catastrophic crash. 
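To make the pattern concrete, here is a minimal hypothetical sketch in Python of a hand-written input check with an inadvertently omitted case. It is not CrowdStrike's actual code; the record format, field names and field count are invented purely for illustration:

```python
# Hypothetical sketch of an ad hoc, hand-written input check.
# The record format and field count are invented for illustration.

def load_rule(line: str) -> dict:
    fields = line.strip().split(",")
    # The hand-written check covers the case the programmer thought of...
    if not fields[0]:
        raise ValueError("rule id must not be empty")
    # ...but never verifies the field count. A data update that ships
    # a record with one field fewer sails past the check and crashes
    # below, even though the code itself never changed.
    return {
        "id": fields[0],
        "pattern": fields[1],
        "severity": fields[2],
        "action": fields[3],
        "comment": fields[4],  # IndexError on a short record
    }

# A routine "content update" that violates the unchecked assumption:
load_rule("R100,*.sys,high,block")  # crashes with IndexError
```

The unchanged code, fed a new shape of data, fails in exactly the way described above: not because the check that exists is wrong, but because a check that should exist was never written.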

New data rather than new code was the culprit, as CrowdStrike stressed in its very first technical communication.

Use of Language-Theoretic Security

The key insight of Dartmouth's research of the last ten years, known as language-theoretic security or LangSec [note], is that the distinction between input data and code is a distinction without a difference. It is, in fact, a long-standing fallacy responsible for a majority of vulnerabilities.

It is tempting to think of data as a passively consumed medium, in contrast to code, which forces action and makes things happen. 

In this frame of thought, code, as the active agent, naturally needs and receives more testing, whereas data updates presumably present less risk and can be tested faster, ostensibly enabling the "rapid" deployment suggested by CrowdStrike's second document.

LangSec analysis, however, shows that this frame of thought is fatally wrong.

LangSec argues that sufficiently complex data is indistinguishable from code in its effects. Moreover, because these effects are indirect, they are actually harder to analyze, test, and control.

The only way to do all three is to generate the input-checking code automatically from machine-readable data descriptions that account for all of the expectations placed on the data. Anything else is a boon to attackers or, in CrowdStrike's case, a resilience risk that can be triggered with no actual malice.
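As a toy sketch of what this looks like in practice, the hypothetical record format from the earlier example can be given a single machine-readable description, with the checking code derived from it mechanically rather than hand-written. The schema format and field rules below are invented for illustration; real systems use far richer description languages:

```python
# Minimal sketch of generating input checks from a machine-readable
# description. The schema format and field rules are invented.

import re

RULE_SCHEMA = [
    ("id",       r"R\d+"),
    ("pattern",  r"\S+"),
    ("severity", r"low|medium|high"),
    ("action",   r"allow|block"),
    ("comment",  r".*"),
]

def make_parser(schema):
    """Derive a validating parser from the description: the field count
    and every per-field rule are enforced before any consumer runs."""
    compiled = [(name, re.compile(rx)) for name, rx in schema]

    def parse(line: str) -> dict:
        fields = line.strip().split(",")
        if len(fields) != len(compiled):
            raise ValueError(
                f"expected {len(compiled)} fields, got {len(fields)}")
        for (name, rx), value in zip(compiled, fields):
            if not rx.fullmatch(value):
                raise ValueError(f"field {name!r} rejected: {value!r}")
        return {name: value
                for (name, _), value in zip(compiled, fields)}

    return parse

parse_rule = make_parser(RULE_SCHEMA)

try:
    parse_rule("R100,*.sys,high,block")  # the same short record as before
except ValueError as err:
    print("update rejected at the boundary:", err)
```

Here the field count check that the hand-written version omitted cannot be forgotten: it falls out of the description itself, and the malformed update is rejected cleanly at the boundary instead of crashing the consumer.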

Changing Code or Changing Data?

Notably, CrowdStrike stressed in its very first technical communication that it had not changed its product's code, and that the only change was in the configuration data that this unchanged code consumed.

CrowdStrike's second communication further emphasized this distinction, stressing that code updates are less frequent, whereas data updates are "rapid" and could occur "several times a day."

This statement strongly suggests an expectation that the effects of the new data could be tested much faster than those of the new code, within hours rather than months. The fallout of this apparent assumption is now history.

It gets worse. CrowdStrike's description of the flaw as a "logic error" is not conducive to understanding why and where the fateful flaw occurred and what might be done to correct it preemptively. 

A "logic flaw" can occur anywhere on the path from translating the descriptions of a business logic into code to the electronic circuits of a microchip. Writ large, a logic error is not preventable or predictable, a "glitch" can occur anywhere. Both recent evidence and cumulative experience, however, suggest otherwise.

MITRE, a nonprofit research organization that manages a key part of cybersecurity research for the U.S. Government, reports that over 80% of all vulnerabilities are in input-handling (a.k.a. input "parsing") code, suggesting that this is the dominant cause of both vulnerability and instability, and one that cannot be offset by current testing practices.

A new approach is required to help developers manage the many unexpected combinations of input data without crashing or leaving abundant overlooked corner cases for adversaries to exploit.

Dealing with Mission Critical Software

It's time to recognize that today's software practices cannot create code that deals with complex data and is sufficiently resilient and secure for mission-critical industries. 

The crucial fault of these practices is that programmers hand-write the logic to check the input data, based on English-language descriptions that span many pages of text and harbor passages that can be understood differently by different individuals.

It is too easy for programmers to miss a check or a so-called corner case (an unusual combination of data items and conditions). There is no way, either empirical or theoretical, to ensure that the resulting checks are complete and correct, or that checks written by different programmers agree. At the same time, an erroneous or omitted check can be devastating.

A new approach to creating input-handling code is needed and has been successfully demonstrated in a variety of applications. 

This approach focuses on precise, mechanized descriptions of the data itself, automatically generates input-handling code from those descriptions, and automatically reasons about that code's safety.
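What "automatically reasons about its safety" can mean is illustrated, in miniature, by mechanically checking that a description-derived parser honors a simple contract: on any input whatsoever, it either accepts or cleanly rejects, and never fails in any other way. The sketch below uses property-based random testing as a modest stand-in for the formal verification used by real systems; the schema and names are again invented for illustration:

```python
# Toy illustration of mechanically checking a parser's safety contract.
# Random property testing stands in for formal verification here.

import random
import re
import string

SCHEMA = [r"R\d+", r"\S+", r"low|medium|high", r"allow|block", r".*"]
COMPILED = [re.compile(rx) for rx in SCHEMA]

def parse(line: str) -> list:
    fields = line.split(",")
    if len(fields) != len(COMPILED):
        raise ValueError("wrong field count")
    for rx, value in zip(COMPILED, fields):
        if not rx.fullmatch(value):
            raise ValueError("field rejected")
    return fields

# The contract: on arbitrary input the parser either returns a result
# or raises its one documented error, never anything else.
for _ in range(10_000):
    junk = "".join(random.choice(string.printable)
                   for _ in range(random.randint(0, 80)))
    try:
        parse(junk)          # acceptance is fine
    except ValueError:
        pass                 # clean rejection is fine
    # any other exception would propagate and fail this check

print("parser held its accept-or-reject contract on all sampled inputs")
```

Real LangSec tooling goes much further than this sketch, but the division of labor is the same: the description carries the expectations, and the machine, not the programmer, enforces and audits them.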

The era of hand-written input-handling code and ambiguous or ad hoc data specifications must come to an end, especially for mission-critical software, and even more so for the kinds of security software mandated for critical or regulated industries.

In fact, a decade ago, Microsoft internally shifted to injecting automated reasoning directly into its own developers' process for creating Windows kernel drivers. More recently, Microsoft embarked on leveraging mathematically unambiguous data descriptions for generating and securing code in its critical environments, a project known as EverParse.

Applications of this project include the Azure cloud's core infrastructure. EverParse researchers credit Dartmouth's LangSec and have keynoted LangSec workshops with their global infrastructure-protection success stories.

Although the rigors of this process were likely seen as impossible to enforce on third-party vendors such as CrowdStrike, the recent outage strongly suggests that business as usual in data consumption by critical applications is unsustainable, just as Dartmouth's LangSec research has long highlighted.

The Way Forward

Since the outage, multiple suggestions have been published on how the global meltdown could have been avoided by a limited, gradual roll-out of the changes, which would have allowed CrowdStrike to curtail their global effects.

Although plausible, these measures might or might not have mitigated the global impact of the flaw, because they would merely have collected a statistical sample of the flaw's impact rather than prevented the flaw from happening in the first place.

Experience shows that the effects of input-processing flaws on complex systems tend to exceed what random statistical sampling predicts even when the flaws are triggered at random (Murphy's Law), and all the more so when they are driven by well-informed malice (Machiavelli rather than Murphy).

Neither is an acceptable line to hold in national security infrastructure. Dartmouth's LangSec research shows the way to improve these systems and to suss out and fix their existing weaknesses.

LangSec was co-founded by Len Sassaman and Meredith L. Patterson, who realized there was a strong connection between input data complexity and dominant classes of programming flaws and, together with Internet legend Dan Kaminsky, used this insight to demonstrate massive flaws in the World Wide Web's trust and e-commerce infrastructure. Soon after, LangSec found an academic home at Dartmouth. This year, the LangSec community celebrated the tenth anniversary of its dedicated workshop at the IEEE Security & Privacy Symposium.