Just before 1 a.m. local time on Friday, a systems administrator for a West Coast company that handles funeral and burial services suddenly woke up to find his computer screen flashing. He checked his company phone and was inundated with messages about what his colleagues were calling network problems. The company’s entire infrastructure was down, threatening to disrupt funerals and burials.
It quickly became clear that the massive disruption was caused by an outage at CrowdStrike. The security company accidentally caused chaos around the world on Friday and into the weekend by pushing a faulty update to its Falcon security platform, disrupting operations at airlines, hospitals, and other businesses large and small.
The administrator, who requested anonymity because he wasn’t authorized to speak publicly about the outage, sprang into action immediately. He ended up working nearly 20-hour days, driving from morgue to morgue and resetting dozens of computers himself to fix the problem. The situation, the administrator explained, was urgent: The computers needed to be back online to avoid disrupting funeral schedules and communications between hospitals and morgues.
“With a problem as big as what we had with the CrowdStrike outage, it makes sense for us to function well and make sure these families have service and can be with their families,” the systems administrator said. “People are grieving.”
CrowdStrike’s flawed update disabled some 8.5 million Windows computers worldwide, sending them into a frightening blue screen of death (BSOD) spiral. “The trust we’ve built over the years was wiped away in a matter of hours, dealing a huge blow,” Shawn Henry, CrowdStrike’s chief security officer, wrote on LinkedIn early Monday. “But this pales in comparison to the pain we’ve caused our customers and partners. We’ve let down the people we promised to protect.”
Cloud platform outages and other software issues, including malicious cyberattacks, have caused major IT outages and global disruption before. But last week’s incident was particularly notable for two reasons. First, it was caused by a mistake in software that was meant to help and defend networks, not harm them. And second, fixing the problem required physical access to each affected machine, which meant manually booting each computer into Windows Safe Mode and applying the fix.
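For the admins walking staff through that reset, the widely circulated workaround amounted to booting each machine into Safe Mode or the Windows Recovery Environment and removing the faulty Falcon content file. The sketch below, written in Python purely for illustration, assumes CrowdStrike's published guidance (deleting files matching C-00000291*.sys from the Falcon driver directory); it is not an official remediation tool, and the exact path and pattern are taken from that public guidance rather than from the reporting above.

```python
# Illustrative sketch of the manual CrowdStrike workaround, assuming the
# machine has already been booted into Safe Mode or the Windows Recovery
# Environment. Paths and file pattern follow the publicly circulated guidance.
from pathlib import Path

# Directory where Falcon sensor "channel files" live on Windows hosts.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")


def remove_faulty_channel_file() -> None:
    """Delete the faulty channel file(s) that triggered the BSOD loop."""
    for channel_file in DRIVER_DIR.glob("C-00000291*.sys"):
        print(f"Deleting {channel_file}")
        channel_file.unlink()


if __name__ == "__main__":
    remove_faulty_channel_file()
    print("Done. Reboot the machine normally.")
```

In practice, many IT teams had to repeat the equivalent of these steps by hand, machine by machine, since the affected computers could not boot far enough to accept a remote fix.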
IT is often an unglamorous, thankless job, but the CrowdStrike fiasco took it to the next level. Some IT pros had to coordinate with remote employees and multiple international locations to explain manual device reset procedures. A junior systems administrator for an Indonesia-based fashion brand had to figure out how to overcome language barriers. “It was daunting,” he said.
“Unless something goes wrong, we don’t know,” a systems administrator at a Maryland healthcare facility told WIRED.
The administrator woke up just before 1 a.m. EDT to find the screens at the organization’s physical site had gone blue and become unresponsive. The team spent the early morning hours bringing the servers back online, then had to manually repair more than 5,000 other devices across the company. The outage cut off calls to the hospital and disrupted the system that dispenses medications, so staff had to write everything down by hand and walk it to the pharmacy.