Facebook has said an error during routine maintenance of its network of data centers caused a cascade of problems that took down its platforms for more than six hours on Monday.
In a blogpost published on Tuesday, Santosh Janardhan, vice-president of engineering, said the global outage that saw Facebook, Instagram and WhatsApp go dark for billions of users had begun when the company’s engineers issued a command that unintentionally disconnected Facebook data centers from the rest of the world.
Janardhan described the error as originating within the company’s “global backbone” of fiber-optic cables and data centers.
“This outage was triggered by the system that manages our global backbone network capacity,” Janardhan wrote. “The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers.”
“During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” Janardhan said.
The company said its systems were designed to audit commands to prevent mistakes, but the audit tool had encountered a bug and had failed to stop the command that caused the outage. The outage also knocked out the tools that engineers would normally use to investigate and repair such failures, making the task even more difficult.
The outage was the largest that Downdetector, a web monitoring firm, said it had ever seen.
Facebook said the outage had not been caused by malicious activity.
While users lost access to one of the world’s most popular messaging apps – WhatsApp has more than 2 billion users – employees were also blocked from internal tools.
The company said it had sent a team of engineers to its data centers to try to debug and restart the systems.
However, getting engineers inside to work on the servers took extra time because of the physical and system security in place.
Even after network connectivity was restored to the data centers, Facebook said it worried a surge in traffic would cause its websites and apps to crash.
But because the company had run drills to prepare for such situations, access to its services returned relatively quickly.
“Every failure like this is an opportunity to learn and get better,” Janardhan wrote. “From here on out, our job is to … make sure events like this happen as rarely as possible.”
The outage came during a difficult week for Facebook, as the US Senate held a hearing with a former employee turned whistleblower who accused the social network of putting profits before people’s safety, a claim that Facebook disputes.