How is Greggs, renowned for crafting the UK’s favorite sausage roll, connected to tech giants Apple and Meta?
In March and April 2024, IT outages left customers struggling to access everything from baked goods and Big Macs to WhatsApp messages.
According to experts, these recent instances of outages are not mere coincidences; rather, they are occurring with increasing frequency.
These notable cases have brought a particular website to the forefront: Downdetector, a platform that monitors web outages and provides insight into the scope of recent challenges faced by companies.
On April 3 alone, Downdetector recorded over 1.75 million user-reported issues globally for WhatsApp, with tens of thousands also reported for the App Store and Apple TV.
Despite inquiries from the BBC, neither Meta nor Apple responded regarding the cause of their respective outages.
Brennen Smith, vice president of technology at Downdetector’s parent company, Ookla, highlights that these incidents align with their observations of more frequent outages and a higher volume of user reports as they occur.
“The internet is not becoming more stable,” he commented to the BBC.
To understand this trend, it helps to look at the structure of the internet itself, which, like software, is built in layers. Each time a regulatory change is mandated, seamless data access is demanded, or a new feature such as an AI chatbot is integrated, another layer is added.
The introduction of more layers and complexity heightens the risk of malfunctions.
“There’s currently a drive for tech giants to swiftly incorporate groundbreaking technology into their offerings,” Mr. Smith noted. “While this push for innovation may accelerate progress, it also carries the potential risk of system disruptions.”
Moving Parts and Thundering Herds: Navigating the Dynamics of Today’s Tech Landscape
When considering the internet, it’s important to acknowledge its vulnerability to various factors. Typos in code, hardware malfunctions, power disruptions, and cyber attacks are just a few examples that can lead to service disruptions. Additionally, severe weather conditions like heatwaves and storms can impact data centers, which house the servers crucial for online services.
According to Sam Kirkman from cybersecurity firm NetSPI, the complexity of the internet’s infrastructure means that even a minor issue can cause significant problems. Over the past decade, many companies have transitioned from managing their servers in-house to utilizing cloud services, enabling them to operate more efficiently. However, this reliance on cloud providers also means that an outage in one location can have widespread consequences across multiple platforms and companies.
Notably, service interruptions have affected major industry players like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud, as well as smaller providers like Fastly and Cloudflare. Events such as Black Friday or bank holidays can exacerbate outages due to increased demand and reduced staffing.
The theory that more outages happen on Fridays remains speculative. Nevertheless, many companies avoid deploying updates or changes on Fridays to minimize risk. During an outage, engineers must fix the underlying fault while also coping with a surge of users repeatedly trying to reach the service, a scenario referred to as a “thundering herd.”
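One widely used way to soften a thundering herd is for client software to wait a randomized, steadily growing interval before each retry, so that reconnection attempts are spread out rather than arriving all at once. The sketch below is purely illustrative and is not a technique attributed to any company or expert quoted here; the function name `backoff_delays` and its parameters `base` and `cap` are hypothetical.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait time (in seconds) per retry attempt.

    Hypothetical helper: each wait is drawn uniformly between 0 and an
    exponentially growing ceiling, which is itself limited to `cap`.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles with every attempt but never exceeds `cap`.
        ceiling = min(cap, base * 2 ** attempt)
        # Picking a random point below the ceiling keeps clients out of step.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    for number, wait in enumerate(backoff_delays(5), start=1):
        print(f"retry {number}: wait {wait:.2f}s")
```

Because every client draws its own random waits, a million users retrying after an outage no longer hit the recovering service in the same instant.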
In summary, the internet’s susceptibility to a range of factors underscores the need for robust infrastructure and contingency plans to mitigate disruptions and ensure reliable service delivery.
‘Technical debt’
At the core of this is another foundational reality of the digital realm: while advancements in services and products continue to progress, the underlying infrastructure often remains outdated.
To put it differently, as Mr. Kirkman emphasizes, the contemporary internet relies heavily on “a framework of antiquated technology.” He cites the Border Gateway Protocol (BGP) as a prime example, pointing to Meta’s six-hour outage in October 2021 as evidence.
Misconfigured BGP updates from Facebook essentially cut off the company’s communication with the broader internet, leaving users unable to connect with loved ones or conduct business operations.
According to Mr. Kirkman, the ongoing maintenance of BGP presents a significant challenge due to its inherent complexity and the difficulty of implementing updates. Even minor adjustments can have far-reaching consequences, potentially leading to the collapse of entire platforms.
This underscores what he describes as “technical debt,” a persistent issue that has implications for the broader internet infrastructure. While these challenges are not novel, the increasing reliance on online services magnifies their significance, posing a growing obstacle for companies striving to mitigate such risks.
Mr. Smith echoes this sentiment, noting a heightened sense of urgency among stakeholders. He stresses the importance of keeping services resilient enough to stay online while still introducing new features to meet evolving demands.