Facebook Outage: Explained

What resulted in Mark Zuckerberg losing $6 Billion in 6 hours?

The 4th of October Facebook Outage

Internet giant Facebook recently suffered its worst outage ever. The outage lasted several hours and affected millions of users and businesses worldwide who rely on Facebook and its various platforms for everything from birthday greetings to ecommerce sales. When Facebook's networks went down, chaos spread across the digital world: cellular companies reported a 75% jump in network traffic, and CEOs of rival internet brands took a dig. The outage escaped no one's attention.

What caused the Facebook Outage?

The Facebook Outage lasted for over 6 hours.

Given Facebook's scale, it is no wonder the company runs a proprietary network custom-built to its needs, including its own data centers and a backbone that spans the globe. These systems are fine-tuned to deliver lightning-fast speed.

This state-of-the-art infrastructure is the backbone of all of Facebook's offerings, so a great deal of effort goes into its maintenance and upkeep.

According to Facebook, during one of these maintenance activities a command was issued that was intended to assess the availability of capacity on the backbone network. Instead, it unintentionally took down all the connections in the backbone, disconnecting Facebook's data centers globally.

Facebook does have systems in place to audit commands like this and block mistakes, but the check failed to stop this one. Making matters worse, the outage also took down Facebook's own internal services and tools, which made the problem harder to diagnose and fix.
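Facebook has not published the exact command or the tooling around it, so the safeguard can only be pictured in rough, hypothetical terms. The short Python sketch below (keywords, function names, and commands are all invented for illustration) shows the general idea of a pre-flight audit that vets a maintenance command before it may touch the backbone; in the real incident, the equivalent check failed to block the command.

```python
# Purely hypothetical illustration -- not Facebook's actual tooling.
# The idea: a maintenance command is vetted by an audit step before it
# is allowed to run against the production backbone.

DISRUPTIVE_KEYWORDS = ("shutdown", "disable", "withdraw")

def audit_allows(command: str) -> bool:
    """Return True if the command looks safe to run on the backbone."""
    return not any(word in command.lower() for word in DISRUPTIVE_KEYWORDS)

def run_maintenance(command: str) -> None:
    """Run a command only if the audit step approves it."""
    if audit_allows(command):
        print(f"running: {command}")
    else:
        print(f"blocked: {command}")

run_maintenance("assess backbone capacity")   # should be allowed
run_maintenance("shutdown backbone links")    # should be blocked
```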

Additionally, the Border Gateway Protocol (BGP), which advertises the routes connecting Facebook's geographically distributed systems to the wider internet, stopped announcing those routes. As a result, Facebook was disconnected from the rest of the internet.

Simply put, BGP advertises the routes that tell the rest of the internet how to reach Facebook's servers, including the DNS servers that act as the address book for Facebook's domains. As a health-check safeguard, Facebook's DNS servers withdraw those BGP advertisements when they cannot reach the data centers, and with the backbone down, that is exactly what happened. So even though the DNS servers themselves were still running, the rest of the internet had no route to them, and traffic could not be directed to Facebook.
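From the outside, the failure simply looked like a DNS resolution error. The following minimal Python sketch (the hostname and function name are illustrative only) shows the kind of check a user or monitoring script might have run; during the outage, a lookup like this would raise an error because Facebook's authoritative DNS servers were unreachable.

```python
import socket

def check_resolution(hostname):
    """Try to resolve a hostname and report whether the DNS lookup succeeds."""
    try:
        addr = socket.gethostbyname(hostname)
        print(f"{hostname} resolved to {addr}")
    except socket.gaierror as err:
        # During the outage, this is effectively what users hit:
        # the name servers for facebook.com could not be reached at all.
        print(f"{hostname} could not be resolved: {err}")

check_resolution("facebook.com")
```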

Response to the Facebook Outage

The problem was so big that a simple reset from a remote computer was not enough. Facebook had to physically send personnel to its data centers to reset the systems on site. That created a new problem: Facebook anticipated that switching everything back on at once would cause a surge in traffic large enough to crash the network again.

To counter this, Facebook fell back on its internal procedures and drills that simulate exactly this kind of large-scale failure, bringing services back gradually. This worked, and Facebook was able to restore its service.
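Facebook has not detailed the exact restoration procedure, but the basic idea of bringing traffic back in stages can be sketched roughly as below. The figures, the step size, and the utilisation() helper are all invented for illustration: traffic is readmitted in small increments, and the process pauses whenever a health metric runs too hot, rather than switching everything back on at once.

```python
import random
import time

TOTAL_TRAFFIC = 100    # percentage of normal traffic to bring back (assumed)
STEP = 10              # percentage readmitted per stage (assumed)
MAX_UTILISATION = 0.9  # pause if the systems run hotter than this (assumed)

def utilisation():
    """Stand-in for a real health metric such as power draw or error rate."""
    return random.uniform(0.5, 1.0)

restored = 0
while restored < TOTAL_TRAFFIC:
    if utilisation() > MAX_UTILISATION:
        print("load too high, waiting before the next stage...")
        time.sleep(1)
        continue
    restored += STEP
    print(f"restored {restored}% of normal traffic")
```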

A great deal of speculation circulated on the internet, with some reports going so far as to suggest a malicious attack on Facebook's infrastructure. Facebook, however, put that anxiety to rest by publishing an explanation on its engineering blog, engineering.fb.com.
