Epic sepsis model missed patients and swamped staff


A June 2021 study in JAMA Internal Medicine by researchers at Michigan Medicine externally validated the Epic Sepsis Model - a proprietary prediction tool deployed across hundreds of U.S. hospitals - and found it missed two-thirds of actual sepsis cases while generating so many false alarms that clinicians would need to investigate 109 alerts to find one real patient. The model's AUC of 0.63 fell well short of the 0.76 to 0.83 range Epic had cited in internal documentation, and the study found the tool only caught 7 percent of sepsis cases that clinicians themselves had missed. Epic later overhauled the algorithm and began recommending hospitals train the model on their own patient data before clinical deployment.

Incident Details

Severity: Facepalm
Company: Epic Systems
Perpetrator: Vendor
Incident Date: June 2021
Blast Radius: Clinicians drowned in useless alerts, real sepsis patients slipped through, and health systems had to audit Epic’s black-box thresholds and workflows to keep patients safe.

Epic Systems, the largest electronic health records vendor in the United States, ships a sepsis prediction tool called the Epic Sepsis Model (ESM) as part of its EHR platform. The model ingests roughly 80 clinical data elements - vital signs, lab results, comorbidities, demographics - and outputs a risk score intended to flag patients developing sepsis before clinicians would otherwise notice. By the time University of Michigan researchers published their external validation in June 2021, the ESM was already running at hundreds of hospitals nationwide.
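In workflow terms, the tool reduces to a scoring function plus a threshold: compute a risk score from the patient's current data, and page someone when the score crosses a cutoff. A minimal sketch of that shape - the field names, score, and cutoff below are illustrative assumptions, since the actual ESM is proprietary:

```python
# Minimal sketch of a threshold-based sepsis alert. The ESM itself is
# proprietary; the field names, score value, and cutoff here are assumptions
# chosen only to show the general workflow.

def should_alert(risk_score: float, threshold: float = 6.0) -> bool:
    """Fire an alert when the model's risk score reaches the threshold."""
    return risk_score >= threshold

# A handful of the ~80 clinical elements the article describes
# (vitals, labs, comorbidities, demographics) - hypothetical values.
patient_features = {
    "heart_rate": 112,
    "temperature_c": 38.6,
    "wbc_count_k_per_uL": 14.2,
    "age": 67,
    "has_chronic_kidney_disease": True,
}

# In production the score comes from the vendor's model; here it is invented.
risk_score = 7.4
print(should_alert(risk_score))  # True -> the care team gets paged
```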

The problem was that nobody outside Epic had published a rigorous, independent check of whether the thing actually worked.

The Wong et al. study

Andrew Wong, Karandeep Singh, and colleagues at Michigan Medicine conducted a retrospective cohort study covering every adult patient admitted to their hospital system between December 6, 2018 and October 20, 2019. Sepsis occurred in about 7 percent of those hospitalizations. They ran the ESM against the data and measured how well its predictions matched reality.

The findings were bleak. At a score threshold of 6 or higher - within Epic's own recommended range - the model produced an area under the receiver operating characteristic curve (AUC) of 0.63. For context, a coin flip yields an AUC of 0.5, so 0.63 represents discrimination only modestly above chance. (On the combined sensitivity-plus-specificity scale the accompanying editorial would later invoke, a coin flip scores 1.0 and a perfect test scores 2.0.) Epic's internal documentation and vendor-adjacent studies had reported AUC figures between 0.76 and 0.83. The gap between the marketed performance and the independently measured performance was substantial.
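External validation of this kind is conceptually simple: score every historical encounter, compare the score against the recorded outcome, and compute the AUC. A minimal sketch with scikit-learn on fabricated data - nothing below comes from the Michigan Medicine dataset; it only shows what a low-0.6s AUC looks like mechanically:

```python
# Minimal sketch of an external-validation AUC calculation with scikit-learn.
# The labels and scores are fabricated for illustration; they are not the
# study's data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# 1 = hospitalization that met the sepsis definition, 0 = did not.
# Roughly 7% prevalence, mirroring the cohort described above.
y_true = rng.random(10_000) < 0.07

# Hypothetical model scores: weakly separated distributions produce an AUC
# in the low 0.6s, i.e. only modestly better than a coin flip.
y_score = rng.normal(loc=0.0, scale=1.0, size=10_000) + 0.45 * y_true

print(f"AUC: {roc_auc_score(y_true, y_score):.2f}")
```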

The model missed roughly two-thirds of sepsis cases. At the same time, it fired off enough false alarms that a clinician following up on every alert would need to work through 109 of them to find a single actual sepsis patient. That tradeoff - high miss rate combined with high false alarm rate - is about the worst outcome a screening tool can deliver. Clinicians get buried in noise while genuinely sick patients slip through.
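Those two headline figures map directly onto standard screening metrics: 109 alerts per true positive is a positive predictive value of roughly 0.9 percent, and missing two-thirds of cases is a sensitivity of roughly 33 percent. The arithmetic, spelled out:

```python
# Back-of-the-envelope translation of the study's headline numbers into
# standard screening metrics. Only the "109 alerts per true positive" and
# "missed two-thirds of cases" figures come from the article; the rest is
# generic arithmetic.

alerts_per_true_positive = 109          # number needed to evaluate (NNE)
ppv = 1 / alerts_per_true_positive      # positive predictive value
miss_rate = 2 / 3                       # fraction of sepsis cases not flagged
sensitivity = 1 - miss_rate             # fraction of sepsis cases flagged

print(f"PPV  ~ {ppv:.1%}")          # ~0.9%: about 1 in 109 alerts is real
print(f"Sens ~ {sensitivity:.1%}")  # ~33%: two out of three cases missed
```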

Only catching what doctors already knew

One of the more damning findings concerned the model's relationship with clinical practice. The ESM flagged only 7 percent of the sepsis patients whom clinicians had missed, with "missed" defined by whether antibiotics had been ordered in a timely fashion. In other words, when the model did fire a correct alert, the treating team had almost always already recognized the problem and started treatment.

A prior study had found that the ESM produced alerts at a median of 7 hours after the first lactate level was measured - a lab test that clinicians typically order when they already suspect sepsis. The Wong study's own sensitivity analysis supported this interpretation: when they included predictions made up to 3 hours after the sepsis event (rather than only before), the AUC jumped to 0.80. The model looked much better at detecting sepsis that had already happened than at predicting sepsis that was about to happen.

This pointed to a specific design flaw. The model used antibiotic administration as one of its prediction variables. Since antibiotics are ordered when a clinician suspects sepsis, the model was partly detecting the clinician's own suspicion rather than providing independent early warning. The alerts were, in a sense, an echo of judgment calls that had already been made at the bedside.
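That kind of leakage is easy to reproduce on synthetic data. The sketch below is not a reconstruction of the ESM - the features, prevalence, and correlations are made up - but it shows how a variable that sits downstream of the clinician's own suspicion (here, whether antibiotics were ordered) can make a model look far more discriminative than it actually is as an early-warning tool:

```python
# Sketch of how a "leaky" feature inflates apparent performance. Entirely
# synthetic data; this is not the ESM or its training set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 20_000

sepsis = rng.random(n) < 0.07                       # outcome (~7% prevalence)
lactate = rng.normal(1.5, 0.5, n) + 0.3 * sepsis    # genuinely predictive lab
# Antibiotics are usually ordered *because* a clinician already suspects
# sepsis - so this feature encodes the outcome almost directly.
antibiotics_ordered = ((rng.random(n) < 0.85) & sepsis) | (rng.random(n) < 0.03)

X_honest = np.column_stack([lactate])
X_leaky = np.column_stack([lactate, antibiotics_ordered])

for name, X in [("without antibiotics feature", X_honest),
                ("with antibiotics feature", X_leaky)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, sepsis, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"AUC {name}: {auc:.2f}")  # the leaky feature pushes AUC far higher
```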

The editorial response

An accompanying editorial in the same issue of JAMA Internal Medicine, authored by Habib, Lin, and Grant, laid out the broader implications. They noted that models with a combined specificity-plus-sensitivity below 1.5 "must be incorporated into care with caution, particularly when a validation study is not published, as Epic failed to do." The editorial specifically criticized Epic for deploying a proprietary algorithm to hundreds of hospitals without publishing an external validation study - a gap that left hospitals unable to assess the tool's reliability for their own patient populations.

The editorial framed the problem in terms that should be familiar to anyone working with production software: you cannot trust a vendor's internal benchmarks, run on the vendor's own data, as proof that the product will work in your environment. External validation is the minimum bar, and Epic had not cleared it before widespread deployment.

The financial incentive structure

A separate investigation by STAT News in July 2021 added another layer to the story. The reporting revealed that Epic provided financial incentives - reportedly as much as one million dollars - to hospitals that adopted its predictive algorithms. The Verge corroborated this detail: "EHR giant Epic gives financial incentives to hospitals and health systems that use its artificial intelligence algorithms, which can provide false predictions."

This created an awkward incentive structure. Hospitals were being paid to deploy a tool whose effectiveness had never been independently verified in a peer-reviewed study. The financial relationship between vendor and customer complicated the question of whether hospitals were making deployment decisions based on clinical evidence or economic considerations.

The aftermath and overhaul

The 2021 study set off a chain of reported investigations and follow-up research. STAT News published a series of reports documenting problems with the ESM, including the antibiotics-as-prediction-variable issue and the pattern of late alerts arriving after clinicians had already acted.

By October 2022, STAT reported that Epic had overhauled the sepsis algorithm. Corporate documents obtained by the outlet showed that Epic was now recommending that the model be trained on a hospital's own data before clinical use - a substantial departure from the original one-size-fits-all approach. The recommendation acknowledged what the Wong study had demonstrated: a model trained on one patient population does not necessarily generalize to another.
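What "train on your own data" looks like in practice will differ by site and by vendor tooling, but the general pattern is to fit the model on the hospital's own historical encounters, measure performance on a local holdout, and only enable alerts if that locally measured number clears an agreed-upon bar. A generic scikit-learn sketch of that pattern - the data, model class, and threshold below are placeholders, not Epic's workflow:

```python
# Rough sketch of site-specific training with a local holdout, assuming a
# hospital has its own labeled historical encounters in X_local / y_local.
# This is a generic scikit-learn pattern, not Epic's tooling.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n, n_features = 5_000, 10
X_local = rng.normal(size=(n, n_features))   # placeholder local features
y_local = rng.random(n) < 0.07               # placeholder local labels

X_train, X_test, y_train, y_test = train_test_split(
    X_local, y_local, test_size=0.3, random_state=0
)

model = GradientBoostingClassifier().fit(X_train, y_train)
local_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC on the site's own held-out encounters: {local_auc:.2f}")

# Only go live if the locally measured performance clears a pre-agreed bar.
# With these random placeholder features the AUC hovers near 0.5, so the
# gate (correctly) blocks deployment.
DEPLOYMENT_AUC_FLOOR = 0.75  # hypothetical threshold set by the health system
print("deploy" if local_auc >= DEPLOYMENT_AUC_FLOOR else "do not deploy")
```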

A later study published in JAMA Internal Medicine, covering more than 800,000 patient encounters across 9 hospitals between January 2020 and June 2022, found that the model's performance varied by hospital. Facilities with higher rates of sepsis, more patients with multiple chronic conditions, and more oncology patients saw worse performance from the algorithm. The variability confirmed that the single-model approach was inadequate for the diversity of clinical settings where Epic's software runs.
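Catching that kind of site-to-site variation is mechanically straightforward once scores and outcomes are pooled: group by facility and compute the metric per group. A sketch with invented column names and data:

```python
# Sketch of a per-site performance check: compute AUC separately for each
# hospital. Column names and data are made up for illustration.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 30_000
df = pd.DataFrame({
    "hospital_id": rng.choice(["A", "B", "C"], size=n),
    "sepsis": rng.random(n) < 0.07,
})
# Simulate scores that separate cases better at some sites than at others.
separation = df["hospital_id"].map({"A": 0.8, "B": 0.4, "C": 0.2})
df["risk_score"] = rng.normal(size=n) + separation * df["sepsis"]

# The same model can look acceptable at one site and poor at another.
for site, group in df.groupby("hospital_id"):
    auc = roc_auc_score(group["sepsis"], group["risk_score"])
    print(f"hospital {site}: AUC {auc:.2f}")
```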

What this illustrates

The Epic sepsis model saga is a case study in what happens when a predictive tool is deployed at scale without independent scrutiny. The model was proprietary, meaning hospitals could not inspect its internals. It was distributed as part of a larger EHR platform, meaning adoption was tied to a preexisting vendor relationship rather than a standalone clinical evaluation. And its development data and validation were not published in peer-reviewed literature before deployment, meaning the healthcare institutions using it were relying on the vendor's own claims about effectiveness.

The 0.63 AUC figure from Michigan Medicine is not just a disappointing number. It represents real clinical workflows disrupted by false alarms and real sepsis patients whose deterioration was not flagged in time. Sepsis kills roughly 270,000 Americans per year, according to the CDC. A prediction tool that misses two-thirds of cases while simultaneously drowning staff in false alerts is not a marginal improvement over standard care - it is a drag on the clinical process it was supposed to enhance.

Epic's eventual decision to recommend local training data was the right call, but it came after the model had already been running in production at hundreds of hospitals for years. The gap between deployment and external validation is the core of the story.
