How a BGP error knocked Facebook, WhatsApp and Instagram offline for millions

Facebook has apologised for the biggest outage in its history, which took place yesterday and also affected Instagram and WhatsApp.

The service blackout affected millions of users around the globe for several hours on Monday afternoon and evening. Users in the UK were reporting problems as early as 4.30pm and weren’t able to get back on until around 10pm.

Facebook hasn’t given any detailed technical information about what caused the outage, but reports suggest it had to scramble engineers to its data centre to uncover the source of the problem.

The company said a ‘faulty configuration change’ was to blame for the problem, but technical experts believe the issue lay with something called the Border Gateway Protocol (BGP), which is one of the systems the internet uses to direct traffic between networks.

Unfortunately for Facebook, many of its engineers relied on internal logins to access the communication tools they needed to resolve the problem. With the entire system down, they had to resort to alternative methods of communication – like holding conversations over email.

‘BGP is a technology which ISPs (Internet Service Providers) [use to] share information about which providers are responsible for routing Internet traffic to which specific groups of Internet addresses,’ explained PJ Norris, principal systems engineer at Tripwire.

The best way to think of BGP is as an air-traffic controller for the internet. It does not carry the data itself, but tells networks which routes their packets of data should take to reach the right servers efficiently. Because the routes around the web are always changing, BGP is an automated way of keeping things going in the right direction.

When those routes are withdrawn or changed incorrectly, other networks no longer know how to reach the destination, so requests from your computer suddenly have nowhere to go.

‘In other words, Facebook inadvertently removed the ability to tell the world where it lives,’ PJ said.
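For readers who want to see the idea in concrete terms, here is a minimal sketch in Python of what happens when a network stops announcing its routes. It is a toy model, not real BGP and not Facebook's configuration: the `RoutingTable` class, the network name `ExampleSocial` and the address range `192.0.2.0/24` (a range reserved for documentation) are all made up for illustration.

```python
import ipaddress


class RoutingTable:
    """Toy view of the routes one network has learned from its neighbours.

    Hypothetical illustration only; real BGP involves sessions, path
    attributes and policy that are not modelled here.
    """

    def __init__(self):
        # Maps an address range (prefix) to the network that announced it.
        self.routes = {}

    def announce(self, prefix, origin):
        """Record that `origin` says it can deliver traffic for `prefix`."""
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        """Forget a previously announced prefix, if it is known."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the destination is unreachable
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "ExampleSocial")
print(table.lookup("192.0.2.10"))  # ExampleSocial

# A faulty change withdraws the announcement...
table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.10"))  # None
```

Once the announcement is withdrawn, the lookup returns nothing. The servers may still be running, but no other network knows a route to them.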

As well as having to use alternative tools to communicate, Facebook was also grappling with the fact that not everyone was in the same place. As at many tech companies, the pandemic meant a number of employees were working remotely.

‘Those who were onsite at the data centres and offices who were trying to back out the change, were unable to access the environments as the door access control system was down due to the impact of the outage,’ PJ said.

‘So the question always comes down to, “could this have been avoided?” It’s evident at this early stage that Facebook had a single point of failure that cascaded into a significant and costly outage for the technology giant.

‘Any changes, especially to critical services, should be tested, and double checked before implementation. It’s unclear around the circumstances of this change to the BGP at this point in time, so it’s speculative on how this happened.’

Facebook, which owns both WhatsApp and Instagram, said in a statement: ‘Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centres caused issues that interrupted this communication.

‘This disruption to network traffic had a cascading effect on the way our data centres communicate, bringing our services to a halt.

‘We want to make clear at this time we believe the root cause of this outage was a faulty configuration change. We also have no evidence that user data was compromised as a result of this downtime.’
