Meltdown: Why our systems fail and what we can do about it

Today, we are in the golden age of meltdowns. More and more of our systems are in the danger zone, but our ability to manage them hasn’t quite caught up.

Cover image of Meltdown by Chris Clearfield and Andras Tilcsik
Cover image of Meltdown by Chris Clearfield and Andras Tilcsik

Ceasar Medina died because of a computer glitch.

Though he was shot in a botched robbery attempt, his killer—a convicted felon named Jeremiah Smith—should have been behind bars at the time. But Smith was one of thousands of inmates that the Washington State Department of Corrections accidentally released because of a software problem: a bug in the DOC’s computer code that, for over a decade, miscalculated prisoner sentences.

Surprising meltdowns like the one at the DOC happen all the time. At UCSF—one of the world’s best hospitals—a sophisticated pharmacy robot and a high-tech prescription system confused a doctor, lulled a pharmacist into approving a massive overdose of a routine antibiotic, and automatically packaged 38 pills, instead of the single pill the doctor intended. A nurse, comforted by the barcode scanner that confirmed the dosage, gave the pills one by one to her patient, a 16-year-old boy, who nearly died as a result.

In 2012, Wall Street giant Knight Capital unintentionally traded billions of dollars of stock and lost nearly $500 million in just half an hour because of a software glitch. It was a stunning meltdown that couldn’t have happened a decade earlier, when humans still controlled trading.

And at the airlines, technological glitches, combined with ordinary human mistakes, have caused outages in reservation and ticketing systems, grounded thousands of flights, and accidentally given pilots vacation during the busy holiday season. These issues cost the airlines hundreds of millions of dollars and delayed nearly a million passengers.

To understand why these kinds of failures keep happening, we turn to an unexpected source: a 93-year-old sociologist named Charles Perrow. After the Three Mile Island nuclear meltdown in 1979, Perrow became interested in how simple human errors spiral out of control in complex technological systems. For Perrow, Three Mile Island was a wake-up call. The meltdown wasn’t caused by a massive external shock like an earthquake or a terrorist attack. Instead, it emerged from the interaction of small failures—a plumbing glitch, a maintenance crew’s oversight, a stuck-open valve, and a series of confusing indicators in the control room.

The official investigation blamed the plant’s staff. But Perrow thought that was a cheap shot since the accident could only be understood in retrospect. That was a scary conclusion. Here was one of the worst nuclear accidents in history, but it wasn’t due to obvious human errors or a big external shock. It somehow just emerged from small mishaps that came together in a weird way.

Over the next four years, Perrow trudged through the details of hundreds of accidents. He discovered that a combination of two things cause systems to exhibit the kind of wild, unexpected behaviors that occurred at Three Mile Island.

The first element is complexity. For Perrow, complexity wasn’t a buzzword; it had a specific definition. A complex system is more like an elaborate web than an assembly line; many of its parts are intricately linked and can easily affect one another. Complexity also means that we need to rely on indirect indicators to assess most situations. We can’t go in to take a look at what’s happening in the belly of the beast. In a nuclear power plant, for example, we can’t just send someone to see what’s happening in the core. We need to piece together a full picture from small slivers—pressure indications, water flow measurements, and the like.

The second part of Perrow’s theory has to do with how much slack there is in a system. He borrowed a term from engineering: tight coupling. When a system is tightly coupled, there is little buffer among its parts. The margin for error is thin, and the failure of one part can easily affect the others. Everything happens quickly, and we can’t just turn off the system while we deal with a problem.

In Perrow’s analysis, it’s the combination of complexity and tight coupling that pushes systems into the danger zone. Small errors are inevitable in complex systems, and once things begin to go south, such systems produce baffling symptoms. No matter how hard we try, we struggle to make a diagnosis and might even make things worse by solving the wrong problem. And if the system is also tightly coupled, we can’t stop the falling dominoes. Failures spread quickly and uncontrollably.

When Perrow came up with his framework in the early 1980s, the danger zone he described was sparse: it included exotic systems like nuclear facilities and space missions. But in the intervening years, we’ve steadily added complexity and tight coupling to many mundane systems. These days, computers—often connected to the internet—run everything from cars to cash registers and from pharmacies to prisons. And as we add new features to existing technologies—such as mobile apps to airline reservation systems—we continue to increase complexity. Tight coupling, too, is on the rise, as the drive for lean operations removes slack and leaves little margin for error.

This doesn’t necessarily imply that things are worse than they used to be. What it does suggest, though, is that we are facing a different kind of challenge, one where massive failures come not from external shocks or bad apples, but from combinations of technological glitches and ordinary human mistakes.

We can’t turn back the clock and return to a simpler world. Airlines shouldn’t switch back to paper tickets and traders shouldn’t abandon computers. Instead, we need to figure out how to manage these new systems. Fortunately, an emerging body of research reveals how we can overcome these challenges.  

The first step is to recognize that the world has changed. But that’s a surprisingly hard thing to do, even in an era where businesses seem to celebrate new technologies like blockchain and AI. When we interviewed the former CEO of Knight Capital years after the firm’s technological meltdown, he said, “We weren’t a technology company—we were a broker that used technology.” Thinking of technology as a support function, rather than the core of a company, has worked for years. But it doesn’t anymore.

We need to assess our projects or businesses through the lens of complexity and tight coupling. If we are operating in the danger zone, we can try to simplify our systems, increase transparency, or introduce more slack. But even when we can’t change our systems, we can change how we manage them.

Consider a climbing expedition to Mount Everest. There are many hidden risks, from crevasses and falling rocks to avalanches and sudden weather changes. Altitude sickness causes blurred vision, and overexposure to UV rays leads to snow blindness. And when a blizzard hits, nothing is visible at all. The mountain is a complex and tightly coupled system, and there isn’t much we can do about that.         

But we can still take steps to make climbing Everest safer. In the past, for example, logistical problems plagued several Everest expeditions: delayed flights, customs issues, problems with supply deliveries, and digestive ailments.

In combination, these small issues caused delays, put stress on team leaders, took time away from planning, and prevented climbers from acclimating themselves to high altitudes. And then, during the final push to the summit, these failures interacted with other problems. Distracted team leaders and exhausted climbers missed obvious warning signs and made mistakes they wouldn’t normally make. And when the weather turns bad on Everest, a worn-out team that’s running behind schedule stands little chance.

Once we realize that the real killer isn’t the mountain but the interaction of many small failures, we can see a solution: rooting out as many logistical problems as possible. And that’s what the best mountaineering companies do. They treat the boring logistical issues as critical safety concerns. They pay a lot of attention to some of the most mundane aspects of an expedition, from hiring logistical staff who take the burden off team leaders to setting up well- equipped base camp facilities. Even cooking is a big deal. As one company’s brochure put it, “Our attention to food and its preparation on Everest and mountains around the world has led to very few gastrointestinal issues for our team members.”

You don’t need to be a mountain climber to appreciate this lesson. After a quality control crisis, for example, managers at pharmaceutical giant Novo Nordisk realized that the firm’s manufacturing had become too complex and unforgiving to manage in traditional ways. In response, they came up with a new approach to finding and addressing small issues that might become big problems.

First, the company created a department of about twenty people who scan for new challenges that managers might ignore or simply not have the time to think about. They talk with non-profits, environmental groups, and government officials about emerging technologies and changing regulations. The goal is to make sure that the company doesn’t ignore small signs of brewing trouble.

Novo Nordisk also uses facilitators to make sure important issues don’t get stuck at the bottom of the hierarchy (as they did before the quality control crisis). The facilitators—around two dozen people recruited from among the company’s most respected managers—work with every unit at least once every few years, evaluating whether there are concerns unit managers may be ignoring. “We go around and find a number of small issues,” a facilitator explained. “We don’t know if they would develop into something bigger if we ignored them. But we don’t run the risk. We follow up on the small stuff.”  

Other organizations use a different approach to manage this kind of complexity. NASA’s Jet Propulsion Laboratory (JPL) does some of the most complex engineering work in the world. Its mission statement is “Dare Mighty Things” or, less formally, “If it’s not impossible, we’re not interested.”

Over the years, JPL engineers have had their share of failures. In 1999, for example, they lost two spacecraft destined for Mars—one because of a software problem onboard the Mars Polar Lander and the other because of confusion about whether a calculation used the English or the metric system.

After these failures, JPL managers began to use outsiders to help them manage the risk of missions. They created risk review boards made up of scientists and engineers who worked at JPL, NASA, or contractors—but who weren’t associated with the missions they reviewed and didn’t buy into the same assumptions as mission insiders.

But JPL’s leaders wanted to go even further. Every mission that JPL runs has a project manager responsible for pursuing ground-breaking science while staying within a tight budget and meeting an ambitious schedule. Project managers walk a delicate line. When under pressure, they might be tempted to take shortcuts when designing and testing critical components. So senior leaders created the Engineering Technical Authority (ETA), a cadre of outsiders within JPL. Every project is assigned an ETA engineer, who makes sure that the project manager doesn’t make decisions that put the mission at risk. 

If an ETA engineer and a project manager can’t agree, they take their issue to Bharat Chudasama, the manager who runs the ETA program. When an issue lands on his desk, Chudasama tries to broker a technical solution. He can also try to get project managers more money, time, or people. And if he can’t resolve the issue, he brings it to his boss, JPL’s chief engineer. Such channels for skepticism are indispensable in the danger zone because the ability of any one individual to know what’s going on is limited, and the cost of being wrong is just too high.

This approach isn’t rocket science. In fact, the creation of outsiders within an organization has a long history. For centuries, when the Roman Catholic Church was considering whether to declare a person a saint, it was the job of the Promoter of the Faith, popularly known as the Devil’s Advocate, to make a case against the candidate and prevent any rash decisions. The Promoter of the Faith wasn’t involved in the decision-making process until he presented his objections, so he was an outsider free from the biases of those who had made the case for a candidate in the first place.

The sports writer Bill Simmons proposed something similar for sports teams. “I’m becoming more and more convinced that every professional sports team needs to hire a Vice President of Common Sense,” Simmons wrote. “One catch: the VP of CS doesn’t attend meetings, scout prospects, watch any film or listen to any inside information or opinions; he lives the life of a common fan. They just bring him in when they’re ready to make a big decision, lay everything out and wait for his unbiased reaction.”  

These solutions might sound obvious, and yet we rarely use them in practice. We don’t realize that many of our decisions contribute to complexity and coupling, resulting in increasingly vulnerable systems. We tend to focus on big, external shocks while ignoring small problems that can combine into surprising meltdowns. And we often marginalize skeptics instead of creating roles for them.

Today, we are in the golden age of meltdowns. More and more of our systems are in the danger zone, but our ability to manage them hasn’t quite caught up. And we can see the results all around us. The good news is that smart organizations are finding ways to navigate this new world, and we can all learn from them.


Excerpted from MELTDOWN by Chris Clearfield and András Tilcsik. Reprinted by arrangement with Penguin Press, a member of Penguin Group (USA) LLC, A Penguin Random House Company. Copyright © Christopher Clearfield and András Tilcsik, 2018.

U.S. Navy controls inventions that claim to change "fabric of reality"

Inventions with revolutionary potential made by a mysterious aerospace engineer for the U.S. Navy come to light.

U.S. Navy ships

Credit: Getty Images
Surprising Science
  • U.S. Navy holds patents for enigmatic inventions by aerospace engineer Dr. Salvatore Pais.
  • Pais came up with technology that can "engineer" reality, devising an ultrafast craft, a fusion reactor, and more.
  • While mostly theoretical at this point, the inventions could transform energy, space, and military sectors.
Keep reading Show less

China's "artificial sun" sets new record for fusion power

China has reached a new record for nuclear fusion at 120 million degrees Celsius.

Credit: STR via Getty Images
Technology & Innovation

This article was originally published on our sister site, Freethink.

China wants to build a mini-star on Earth and house it in a reactor. Many teams across the globe have this same bold goal --- which would create unlimited clean energy via nuclear fusion.

But according to Chinese state media, New Atlas reports, the team at the Experimental Advanced Superconducting Tokamak (EAST) has set a new world record: temperatures of 120 million degrees Celsius for 101 seconds.

Yeah, that's hot. So what? Nuclear fusion reactions require an insane amount of heat and pressure --- a temperature environment similar to the sun, which is approximately 150 million degrees C.

If scientists can essentially build a sun on Earth, they can create endless energy by mimicking how the sun does it.

If scientists can essentially build a sun on Earth, they can create endless energy by mimicking how the sun does it. In nuclear fusion, the extreme heat and pressure create a plasma. Then, within that plasma, two or more hydrogen nuclei crash together, merge into a heavier atom, and release a ton of energy in the process.

Nuclear fusion milestones: The team at EAST built a giant metal torus (similar in shape to a giant donut) with a series of magnetic coils. The coils hold hot plasma where the reactions occur. They've reached many milestones along the way.

According to New Atlas, in 2016, the scientists at EAST could heat hydrogen plasma to roughly 50 million degrees C for 102 seconds. Two years later, they reached 100 million degrees for 10 seconds.

The temperatures are impressive, but the short reaction times, and lack of pressure are another obstacle. Fusion is simple for the sun, because stars are massive and gravity provides even pressure all over the surface. The pressure squeezes hydrogen gas in the sun's core so immensely that several nuclei combine to form one atom, releasing energy.

But on Earth, we have to supply all of the pressure to keep the reaction going, and it has to be perfectly even. It's hard to do this for any length of time, and it uses a ton of energy. So the reactions usually fizzle out in minutes or seconds.

Still, the latest record of 120 million degrees and 101 seconds is one more step toward sustaining longer and hotter reactions.

Why does this matter? No one denies that humankind needs a clean, unlimited source of energy.

We all recognize that oil and gas are limited resources. But even wind and solar power --- renewable energies --- are fundamentally limited. They are dependent upon a breezy day or a cloudless sky, which we can't always count on.

Nuclear fusion is clean, safe, and environmentally sustainable --- its fuel is a nearly limitless resource since it is simply hydrogen (which can be easily made from water).

With each new milestone, we are creeping closer and closer to a breakthrough for unlimited, clean energy.

The science of sex, love, attraction, and obsession

The symbol for love is the heart, but the brain may be more accurate.

  • How love makes us feel can only be defined on an individual basis, but what it does to the body, specifically the brain, is now less abstract thanks to science.
  • One of the problems with early-stage attraction, according to anthropologist Helen Fisher, is that it activates parts of the brain that are linked to drive, craving, obsession, and motivation, while other regions that deal with decision-making shut down.
  • Dr. Fisher, professor Ted Fischer, and psychiatrist Gail Saltz explain the different types of love, explore the neuroscience of love and attraction, and share tips for sustaining relationships that are healthy and mutually beneficial.

Sex & Relationships

There never was a male fertility crisis

A new study suggests that reports of the impending infertility of the human male are greatly exaggerated.