What could make the world’s biggest social media platform go down?
IT outages happen all the time, and as consumers, we don’t really think about them that much. But if you’re running a business, an IT outage is very bad news. For the hours or days that your site is down, you’re unable to engage with your customers. Every minute that a customer can’t access your services or buy your product directly cuts into your ability to do business and generate revenue.
This is exactly what happened in October 2021, when Facebook, arguably the world’s biggest social media platform, went down. In roughly 6 hours, the company lost an estimated $100 million in revenue, and about 3.5 billion users around the world were affected.
So, what happened?
What happened at Facebook?
As it turned out, the cause was a change so seemingly inconsequential that maintenance teams had overlooked the possibility of it causing damage.
During a routine maintenance check, somebody on the back-end team entered the wrong command, which led to an error in the system. Normally, this error would have been corrected by a fail-safe, but on this occasion, the fail-safe… failed.
Before anyone knew what had occurred, the issue had snowballed, spreading from network to network until the entire Facebook ecosystem came crashing down: not just the social media website but also Facebook Messenger, WhatsApp, and Instagram.
But how does something so small create so much chaos?
Incident management, the process of detecting an issue and correcting it, is actually exceedingly complex. A small pebble can cause ripples that spread across an entire pond. Similarly, a minute error can affect an entire ecosystem of applications, networks, and systems.
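To make the ripple effect concrete, here is a minimal Python sketch of how a single failure can spread through dependent systems. The component names and the dependency map are hypothetical and purely illustrative; they are not a description of Facebook's actual architecture. The sketch walks the map breadth-first and collects every component that would be affected.

```python
from collections import deque

# Hypothetical dependency map: each key is a component, and its value lists
# the components that depend on it directly. Names are illustrative only.
DEPENDENTS = {
    "backbone": ["dns", "internal-tools"],
    "dns": ["web", "messaging", "photos"],
    "internal-tools": [],
    "web": [],
    "messaging": [],
    "photos": [],
}


def blast_radius(failed, dependents):
    """Return every component affected by a failure, in breadth-first order."""
    affected = [failed]
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        # Anything that depends on a failed component is treated as failed too.
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


if __name__ == "__main__":
    print(blast_radius("backbone", DEPENDENTS))
```

In this toy model, a failure in a leaf component such as "web" affects only itself, while a failure in "backbone" reaches every component in the map. Real systems are rarely this tidy, which is part of what makes root-cause analysis so hard.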
I can imagine the panic that must have spread across the different dev, maintenance, security, and operations teams as they scrambled to find the root cause. Is it malware in the code? Was there a cybersecurity breach? Maybe it’s QA’s fault. Maybe the dev team was responsible.
Between the cascading failures and the mounting pressure of unhappy users, the techies at Facebook would have been going through hell.
We can’t know exactly what was happening at Facebook that day, but imagine what it would be like if an unexpected outage were to occur in your organization.
To understand why even minor incidents can pose a real challenge to an SRE team, let’s take a closer look at what incidents are, why they occur, and why they can be so confusing.
What Are Incidents, and Why Do They Occur?
In site reliability engineering, an incident refers to any unexpected event or condition that disrupts the normal operation of a system. SRE teams need to attend to it and resolve it immediately in order to restore the system and prevent further complications.
The challenge in dealing with incidents lies in their unpredictable nature and the interwoven complexities of modern IT systems. This is why SRE and DevOps professionals usually gauge incident severity by time-taken-to-resolve rather than a