The Woeful Network: what Facebook’s latest outage means for 3.5 billion users

Posted: Monday 18 October 2021 by Brian Coombs

Tags: BGP, Cloudflare, DNS, Enterprise, Facebook, Instagram, Keycloak, Mark Zuckerberg, Oculus, outage, Red Hat, single sign-on, Slack, Three, WhatsApp

Categories: BSS/OSS, Cloud, E-Commerce

In case you hadn’t noticed – Facebook was hit by a vast service outage two weeks ago. Why might such an outage become more likely in the future, and what should businesses do to mitigate the risk? Brian Coombs explains.

To paraphrase a certain movie’s tagline, you don’t get to 3.5 billion users without suffering a few outages.

For nearly six hours, the Facebook family, including WhatsApp, Instagram and Oculus VR, plus any apps requiring login via the Facebook API, was hit by a massive outage, presumably forcing its users to read a book or watch TV without any handheld distractions.

Telcos reported record levels of SMS messages sent during the outage, with Three UK saying it had been busier than the last three New Year’s Eves combined as people resorted to plain old text messages.

Such was the scale of the outage that it hobbled Facebook’s own ability to respond, crashing everything from its internal messaging platform to the access passes required to reach and fix its servers.

From trusted source: Person on FB recovery effort said the outage was from a routine BGP update gone wrong. But the update blocked remote users from reverting changes, and people with physical access didn't have network/logical access. So blocked at both ends from reversing it.
— briankrebs (@briankrebs) October 4, 2021

Being knocked offline for a few hours isn’t uncommon for any app or website. Facebook last suffered a major outage in 2019, which took down its apps for over 14 hours, and before that a full outage in 2008, when it had a mere 150 million users.

Just days before the recent blackout, a similar DNS error had brought down Slack for approximately 1% of users, but these types of issues are usually localised problems that are quickly resolved, affecting only a minority of customers or a select few services.

However, Facebook’s latest outage resulted in Downdetector receiving over 10.6 million reports from around the world, the most it had ever received for a single incident.

The lockout extended to remote access for engineers still working from home, forcing the company to send a team to fix the problem in person at its Santa Clara data centre. Reports that engineers had to cut through server cages with an angle grinder have since been corrected. The finger of blame has been pointed at Facebook’s new peering automation service, introduced in May, a hypothesis supported, if not confirmed, by the company’s own statement.

This catastrophe couldn’t have come at a worse time for Facebook, striking a day after the Facebook Files whistleblower Frances Haugen, a former product manager on its Civic Misinformation team, publicly revealed her identity. Moreover, the outage has highlighted another problem for Facebook: excessive reliance on a single platform, and the hazards of placing all digital communication eggs in one basket.

While the fact that you couldn’t spend the evening scrolling through Facebook, or using other apps that depend on it for authentication, won’t have troubled too many consumers, there is a more serious side to this sort of outage. For businesses running their communications through WhatsApp or Messenger, or selling on Facebook Marketplace, an outage like this will seriously hurt sales and customer service, driving them to expand their channels as a result.

Such outages could also become more likely as the company realises its ambitions to consolidate the ostensibly separate WhatsApp and Instagram into Facebook, whilst it continues to subsume an increasing number of other start-ups and services.

This, plus other high-profile events such as June’s Fastly outage, has raised questions about the concentration of essential Internet services in the hands of just a few organisations.

In recent years, “Login with Facebook/Google/Microsoft” has become ubiquitous across the web. It has definite advantages for users, eliminating one step in the registration process and leaving one less password to remember, but it means that if the identity provider is down, so too is your website or service. For companies using this as their sole method of authentication, including Facebook itself, this means a total business shutdown.

How can your business avoid the same fate?

The obvious answer is to use a configurable authentication system that can easily support multiple sign-in options. For example, in the Cerillion Enterprise BSS/OSS suite we use Red Hat Single Sign-On (or Keycloak, as the community version is known), which supports a traditional, distinct username and password. With a quick configuration change, you can also enable your users to authenticate via Facebook, Google or Microsoft or, more likely on the enterprise side, Azure Active Directory. If one of these is down, your users can still use one of the other methods to get into your application, ensuring a strong degree of resilience; certainly more than Facebook, at the very least.
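
To make the idea concrete, here is a minimal Python sketch of the principle rather than of any real product: it is not Keycloak's API, and the `SignInMethod` type, the probe functions and the provider names are all hypothetical, invented purely for illustration. It assumes each sign-in option comes with a simple health check, and shows that a login page offering several options still has something to offer when one identity provider is unreachable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SignInMethod:
    """One way a user can authenticate (hypothetical model, not a Keycloak type)."""
    name: str
    is_available: Callable[[], bool]  # health probe supplied by the caller


def available_methods(methods: List[SignInMethod]) -> List[str]:
    """Return the names of the sign-in methods whose probe currently succeeds."""
    usable = []
    for method in methods:
        try:
            if method.is_available():
                usable.append(method.name)
        except Exception:
            # A probe that raises is treated the same as a provider that is down.
            continue
    return usable


def facebook_probe() -> bool:
    # Simulates the identity provider being unreachable during an outage.
    raise ConnectionError("identity provider unreachable")


# Hypothetical configuration: two social logins plus a local username/password.
methods = [
    SignInMethod("facebook", facebook_probe),
    SignInMethod("google", lambda: True),
    SignInMethod("local-password", lambda: True),
]

print(available_methods(methods))  # ['google', 'local-password']
```

With only the Facebook entry configured, the same call returns an empty list, which is the "total business shutdown" scenario described above.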

Businesses must weigh up the benefits of tying themselves into an ecosystem such as Facebook for the ease of access and reach it brings. Ultimately, what is worse for your business: being locked into a closed system, or being locked out in the event of downtime?

Those that do depend on these apps would greatly benefit from diversifying their means of commerce and communications to effectively manage their operations and deliver the best possible user experience. This isn’t something that you should have to handle yourself; there’s plenty of software out there that can manage this for you.


About the author

Brian Coombs

Product Director, Cerillion

