Meltdown: Why our systems fail and what we can do about it

Today, we are in the golden age of meltdowns. More and more of our systems are in the danger zone, but our ability to manage them hasn’t quite caught up.

Ceasar Medina died because of a computer glitch.

Though he was shot in a botched robbery attempt, his killer—a convicted felon named Jeremiah Smith—should have been behind bars at the time. But Smith was one of thousands of inmates that the Washington State Department of Corrections accidentally released because of a software problem: a bug in the DOC’s computer code that, for over a decade, miscalculated prisoner sentences.

Surprising meltdowns like the one at the DOC happen all the time. At UCSF—one of the world’s best hospitals—a sophisticated pharmacy robot and a high-tech prescription system confused a doctor, lulled a pharmacist into approving a massive overdose of a routine antibiotic, and automatically packaged 38 pills, instead of the single pill the doctor intended. A nurse, comforted by the barcode scanner that confirmed the dosage, gave the pills one by one to her patient, a 16-year-old boy, who nearly died as a result.

In 2012, Wall Street giant Knight Capital unintentionally traded billions of dollars of stock and lost nearly $500 million in just half an hour because of a software glitch. It was a stunning meltdown that couldn’t have happened a decade earlier, when humans still controlled trading.

And at the airlines, technological glitches, combined with ordinary human mistakes, have caused outages in reservation and ticketing systems, grounded thousands of flights, and accidentally given pilots vacation during the busy holiday season. These issues cost the airlines hundreds of millions of dollars and delayed nearly a million passengers.

To understand why these kinds of failures keep happening, we turn to an unexpected source: a 93-year-old sociologist named Charles Perrow. After the Three Mile Island nuclear meltdown in 1979, Perrow became interested in how simple human errors spiral out of control in complex technological systems. For Perrow, Three Mile Island was a wake-up call. The meltdown wasn’t caused by a massive external shock like an earthquake or a terrorist attack. Instead, it emerged from the interaction of small failures—a plumbing glitch, a maintenance crew’s oversight, a stuck-open valve, and a series of confusing indicators in the control room.

The official investigation blamed the plant’s staff. But Perrow thought that was a cheap shot since the accident could only be understood in retrospect. That was a scary conclusion. Here was one of the worst nuclear accidents in history, but it wasn’t due to obvious human errors or a big external shock. It somehow just emerged from small mishaps that came together in a weird way.

Over the next four years, Perrow trudged through the details of hundreds of accidents. He discovered that a combination of two things cause systems to exhibit the kind of wild, unexpected behaviors that occurred at Three Mile Island.

The first element is complexity. For Perrow, complexity wasn’t a buzzword; it had a specific definition. A complex system is more like an elaborate web than an assembly line; many of its parts are intricately linked and can easily affect one another. Complexity also means that we need to rely on indirect indicators to assess most situations. We can’t go in to take a look at what’s happening in the belly of the beast. In a nuclear power plant, for example, we can’t just send someone to see what’s happening in the core. We need to piece together a full picture from small slivers—pressure indications, water flow measurements, and the like.

The second part of Perrow’s theory has to do with how much slack there is in a system. He borrowed a term from engineering: tight coupling. When a system is tightly coupled, there is little buffer among its parts. The margin for error is thin, and the failure of one part can easily affect the others. Everything happens quickly, and we can’t just turn off the system while we deal with a problem.

In Perrow’s analysis, it’s the combination of complexity and tight coupling that pushes systems into the danger zone. Small errors are inevitable in complex systems, and once things begin to go south, such systems produce baffling symptoms. No matter how hard we try, we struggle to make a diagnosis and might even make things worse by solving the wrong problem. And if the system is also tightly coupled, we can’t stop the falling dominoes. Failures spread quickly and uncontrollably.

When Perrow came up with his framework in the early 1980s, the danger zone he described was sparse: it included exotic systems like nuclear facilities and space missions. But in the intervening years, we’ve steadily added complexity and tight coupling to many mundane systems. These days, computers—often connected to the internet—run everything from cars to cash registers and from pharmacies to prisons. And as we add new features to existing technologies—such as mobile apps to airline reservation systems—we continue to increase complexity. Tight coupling, too, is on the rise, as the drive for lean operations removes slack and leaves little margin for error.

This doesn’t necessarily imply that things are worse than they used to be. What it does suggest, though, is that we are facing a different kind of challenge, one where massive failures come not from external shocks or bad apples, but from combinations of technological glitches and ordinary human mistakes.

We can’t turn back the clock and return to a simpler world. Airlines shouldn’t switch back to paper tickets and traders shouldn’t abandon computers. Instead, we need to figure out how to manage these new systems. Fortunately, an emerging body of research reveals how we can overcome these challenges.  

The first step is to recognize that the world has changed. But that’s a surprisingly hard thing to do, even in an era where businesses seem to celebrate new technologies like blockchain and AI. When we interviewed the former CEO of Knight Capital years after the firm’s technological meltdown, he said, “We weren’t a technology company—we were a broker that used technology.” Thinking of technology as a support function, rather than the core of a company, has worked for years. But it doesn’t anymore.

We need to assess our projects or businesses through the lens of complexity and tight coupling. If we are operating in the danger zone, we can try to simplify our systems, increase transparency, or introduce more slack. But even when we can’t change our systems, we can change how we manage them.

Consider a climbing expedition to Mount Everest. There are many hidden risks, from crevasses and falling rocks to avalanches and sudden weather changes. Altitude sickness causes blurred vision, and overexposure to UV rays leads to snow blindness. And when a blizzard hits, nothing is visible at all. The mountain is a complex and tightly coupled system, and there isn’t much we can do about that.         

But we can still take steps to make climbing Everest safer. In the past, for example, logistical problems plagued several Everest expeditions: delayed flights, customs issues, problems with supply deliveries, and digestive ailments.

In combination, these small issues caused delays, put stress on team leaders, took time away from planning, and prevented climbers from acclimating themselves to high altitudes. And then, during the final push to the summit, these failures interacted with other problems. Distracted team leaders and exhausted climbers missed obvious warning signs and made mistakes they wouldn’t normally make. And when the weather turns bad on Everest, a worn-out team that’s running behind schedule stands little chance.

Once we realize that the real killer isn’t the mountain but the interaction of many small failures, we can see a solution: rooting out as many logistical problems as possible. And that’s what the best mountaineering companies do. They treat the boring logistical issues as critical safety concerns. They pay a lot of attention to some of the most mundane aspects of an expedition, from hiring logistical staff who take the burden off team leaders to setting up well- equipped base camp facilities. Even cooking is a big deal. As one company’s brochure put it, “Our attention to food and its preparation on Everest and mountains around the world has led to very few gastrointestinal issues for our team members.”

You don’t need to be a mountain climber to appreciate this lesson. After a quality control crisis, for example, managers at pharmaceutical giant Novo Nordisk realized that the firm’s manufacturing had become too complex and unforgiving to manage in traditional ways. In response, they came up with a new approach to finding and addressing small issues that might become big problems.

First, the company created a department of about twenty people who scan for new challenges that managers might ignore or simply not have the time to think about. They talk with non-profits, environmental groups, and government officials about emerging technologies and changing regulations. The goal is to make sure that the company doesn’t ignore small signs of brewing trouble.

Novo Nordisk also uses facilitators to make sure important issues don’t get stuck at the bottom of the hierarchy (as they did before the quality control crisis). The facilitators—around two dozen people recruited from among the company’s most respected managers—work with every unit at least once every few years, evaluating whether there are concerns unit managers may be ignoring. “We go around and find a number of small issues,” a facilitator explained. “We don’t know if they would develop into something bigger if we ignored them. But we don’t run the risk. We follow up on the small stuff.”  

Other organizations use a different approach to manage this kind of complexity. NASA’s Jet Propulsion Laboratory (JPL) does some of the most complex engineering work in the world. Its mission statement is “Dare Mighty Things” or, less formally, “If it’s not impossible, we’re not interested.”

Over the years, JPL engineers have had their share of failures. In 1999, for example, they lost two spacecraft destined for Mars—one because of a software problem onboard the Mars Polar Lander and the other because of confusion about whether a calculation used the English or the metric system.

After these failures, JPL managers began to use outsiders to help them manage the risk of missions. They created risk review boards made up of scientists and engineers who worked at JPL, NASA, or contractors—but who weren’t associated with the missions they reviewed and didn’t buy into the same assumptions as mission insiders.

But JPL’s leaders wanted to go even further. Every mission that JPL runs has a project manager responsible for pursuing ground-breaking science while staying within a tight budget and meeting an ambitious schedule. Project managers walk a delicate line. When under pressure, they might be tempted to take shortcuts when designing and testing critical components. So senior leaders created the Engineering Technical Authority (ETA), a cadre of outsiders within JPL. Every project is assigned an ETA engineer, who makes sure that the project manager doesn’t make decisions that put the mission at risk. 

If an ETA engineer and a project manager can’t agree, they take their issue to Bharat Chudasama, the manager who runs the ETA program. When an issue lands on his desk, Chudasama tries to broker a technical solution. He can also try to get project managers more money, time, or people. And if he can’t resolve the issue, he brings it to his boss, JPL’s chief engineer. Such channels for skepticism are indispensable in the danger zone because the ability of any one individual to know what’s going on is limited, and the cost of being wrong is just too high.

This approach isn’t rocket science. In fact, the creation of outsiders within an organization has a long history. For centuries, when the Roman Catholic Church was considering whether to declare a person a saint, it was the job of the Promoter of the Faith, popularly known as the Devil’s Advocate, to make a case against the candidate and prevent any rash decisions. The Promoter of the Faith wasn’t involved in the decision-making process until he presented his objections, so he was an outsider free from the biases of those who had made the case for a candidate in the first place.

The sports writer Bill Simmons proposed something similar for sports teams. “I’m becoming more and more convinced that every professional sports team needs to hire a Vice President of Common Sense,” Simmons wrote. “One catch: the VP of CS doesn’t attend meetings, scout prospects, watch any film or listen to any inside information or opinions; he lives the life of a common fan. They just bring him in when they’re ready to make a big decision, lay everything out and wait for his unbiased reaction.”  

These solutions might sound obvious, and yet we rarely use them in practice. We don’t realize that many of our decisions contribute to complexity and coupling, resulting in increasingly vulnerable systems. We tend to focus on big, external shocks while ignoring small problems that can combine into surprising meltdowns. And we often marginalize skeptics instead of creating roles for them.

Today, we are in the golden age of meltdowns. More and more of our systems are in the danger zone, but our ability to manage them hasn’t quite caught up. And we can see the results all around us. The good news is that smart organizations are finding ways to navigate this new world, and we can all learn from them.


Excerpted from MELTDOWN by Chris Clearfield and András Tilcsik. Reprinted by arrangement with Penguin Press, a member of Penguin Group (USA) LLC, A Penguin Random House Company. Copyright © Christopher Clearfield and András Tilcsik, 2018.

Cambridge scientists create a successful "vaccine" against fake news

A large new study uses an online game to inoculate people against fake news.

University of Cambridge
Politics & Current Affairs
  • Researchers from the University of Cambridge use an online game to inoculate people against fake news.
  • The study sample included 15,000 players.
  • The scientists hope to use such tactics to protect whole societies against disinformation.
Keep reading Show less

Yale scientists restore brain function to 32 clinically dead pigs

Researchers hope the technology will further our understanding of the brain, but lawmakers may not be ready for the ethical challenges.

Still from John Stephenson's 1999 rendition of Animal Farm.
Surprising Science
  • Researchers at the Yale School of Medicine successfully restored some functions to pig brains that had been dead for hours.
  • They hope the technology will advance our understanding of the brain, potentially developing new treatments for debilitating diseases and disorders.
  • The research raises many ethical questions and puts to the test our current understanding of death.

The image of an undead brain coming back to live again is the stuff of science fiction. Not just any science fiction, specifically B-grade sci fi. What instantly springs to mind is the black-and-white horrors of films like Fiend Without a Face. Bad acting. Plastic monstrosities. Visible strings. And a spinal cord that, for some reason, is also a tentacle?

But like any good science fiction, it's only a matter of time before some manner of it seeps into our reality. This week's Nature published the findings of researchers who managed to restore function to pigs' brains that were clinically dead. At least, what we once thought of as dead.

What's dead may never die, it seems

The researchers did not hail from House Greyjoy — "What is dead may never die" — but came largely from the Yale School of Medicine. They connected 32 pig brains to a system called BrainEx. BrainEx is an artificial perfusion system — that is, a system that takes over the functions normally regulated by the organ. The pigs had been killed four hours earlier at a U.S. Department of Agriculture slaughterhouse; their brains completely removed from the skulls.

BrainEx pumped an experiment solution into the brain that essentially mimic blood flow. It brought oxygen and nutrients to the tissues, giving brain cells the resources to begin many normal functions. The cells began consuming and metabolizing sugars. The brains' immune systems kicked in. Neuron samples could carry an electrical signal. Some brain cells even responded to drugs.

The researchers have managed to keep some brains alive for up to 36 hours, and currently do not know if BrainEx can have sustained the brains longer. "It is conceivable we are just preventing the inevitable, and the brain won't be able to recover," said Nenad Sestan, Yale neuroscientist and the lead researcher.

As a control, other brains received either a fake solution or no solution at all. None revived brain activity and deteriorated as normal.

The researchers hope the technology can enhance our ability to study the brain and its cellular functions. One of the main avenues of such studies would be brain disorders and diseases. This could point the way to developing new of treatments for the likes of brain injuries, Alzheimer's, Huntington's, and neurodegenerative conditions.

"This is an extraordinary and very promising breakthrough for neuroscience. It immediately offers a much better model for studying the human brain, which is extraordinarily important, given the vast amount of human suffering from diseases of the mind [and] brain," Nita Farahany, the bioethicists at the Duke University School of Law who wrote the study's commentary, told National Geographic.

An ethical gray matter

Before anyone gets an Island of Dr. Moreau vibe, it's worth noting that the brains did not approach neural activity anywhere near consciousness.

The BrainEx solution contained chemicals that prevented neurons from firing. To be extra cautious, the researchers also monitored the brains for any such activity and were prepared to administer an anesthetic should they have seen signs of consciousness.

Even so, the research signals a massive debate to come regarding medical ethics and our definition of death.

Most countries define death, clinically speaking, as the irreversible loss of brain or circulatory function. This definition was already at odds with some folk- and value-centric understandings, but where do we go if it becomes possible to reverse clinical death with artificial perfusion?

"This is wild," Jonathan Moreno, a bioethicist at the University of Pennsylvania, told the New York Times. "If ever there was an issue that merited big public deliberation on the ethics of science and medicine, this is one."

One possible consequence involves organ donations. Some European countries require emergency responders to use a process that preserves organs when they cannot resuscitate a person. They continue to pump blood throughout the body, but use a "thoracic aortic occlusion balloon" to prevent that blood from reaching the brain.

The system is already controversial because it raises concerns about what caused the patient's death. But what happens when brain death becomes readily reversible? Stuart Younger, a bioethicist at Case Western Reserve University, told Nature that if BrainEx were to become widely available, it could shrink the pool of eligible donors.

"There's a potential conflict here between the interests of potential donors — who might not even be donors — and people who are waiting for organs," he said.

It will be a while before such experiments go anywhere near human subjects. A more immediate ethical question relates to how such experiments harm animal subjects.

Ethical review boards evaluate research protocols and can reject any that causes undue pain, suffering, or distress. Since dead animals feel no pain, suffer no trauma, they are typically approved as subjects. But how do such boards make a judgement regarding the suffering of a "cellularly active" brain? The distress of a partially alive brain?

The dilemma is unprecedented.

Setting new boundaries

Another science fiction story that comes to mind when discussing this story is, of course, Frankenstein. As Farahany told National Geographic: "It is definitely has [sic] a good science-fiction element to it, and it is restoring cellular function where we previously thought impossible. But to have Frankenstein, you need some degree of consciousness, some 'there' there. [The researchers] did not recover any form of consciousness in this study, and it is still unclear if we ever could. But we are one step closer to that possibility."

She's right. The researchers undertook their research for the betterment of humanity, and we may one day reap some unimaginable medical benefits from it. The ethical questions, however, remain as unsettling as the stories they remind us of.

5 facts you should know about the world’s refugees

Many governments do not report, or misreport, the numbers of refugees who enter their country.

David McNew/Getty Images
Politics & Current Affairs

Conflict, violence, persecution and human rights violations led to a record high of 70.8 million people being displaced by the end of 2018.

Keep reading Show less