On November 2, 2023, Cloudflare's customer-facing interface, including its website and API, along with logging and analytics, stopped working properly. That was bad.
More than 7.5 million websites use Cloudflare, and 3,280 of the world's 10,000 most popular websites rely on its content delivery network (CDN) services. The good news is that the CDN didn't go down. The bad news is that the Cloudflare dashboard and its related application programming interface (API) were down for about two days.
This kind of thing doesn't just happen — or it shouldn't, anyway — to major Internet service companies. So, the multi-million dollar question is: 'What happened?' The answer, according to Cloudflare CEO Matthew Prince, was a power-related incident at one of the company's primary data centers in Oregon, a facility operated by Flexential, that cascaded into one problem after another. Thirty-six hours later, Cloudflare was finally back to normal.
Prince didn’t pussyfoot around the issue:
To begin with, it should never have happened. We believe that we have high availability systems in place that should have prevented such an outage, even when one of our primary data center providers failed catastrophically. And, although many systems remained online as designed, some critical systems had non-obvious dependencies that made them unavailable. I am saddened and embarrassed by this incident and the pain it has caused our customers and our team.
He's right: this should never have happened. Cloudflare's control plane and analytics systems run on servers in three data centers around Hillsboro, Oregon. The facilities are all independent of each other; each has multiple utility power feeds and multiple redundant, independent Internet connections.
The three data centers are far enough apart that a natural disaster is unlikely to knock them all out at once, yet close enough together that they can run active-active redundant data clusters. So, by design, if any one facility goes offline, the remaining ones should pick up the load and keep working.
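The routing idea behind that design can be sketched in a few lines of Python. This is only an illustration, not Cloudflare's actual tooling; the facility names and the health map are invented. The point is simply that any surviving facility can serve any request:

```python
# Illustrative active-active failover: send each request to the first
# healthy facility. Facility names and the health map are hypothetical.
FACILITIES = ["pdx-a", "pdx-b", "pdx-c"]  # three independent data centers

def pick_facility(healthy: dict) -> str:
    """Return the first facility reporting healthy, or raise if none are."""
    for name in FACILITIES:
        if healthy.get(name, False):
            return name
    raise RuntimeError("all facilities offline")

# Normal operation: all three are up, traffic can land anywhere.
assert pick_facility({"pdx-a": True, "pdx-b": True, "pdx-c": True}) == "pdx-a"

# One facility fails: the remaining two absorb the load.
assert pick_facility({"pdx-a": False, "pdx-b": True, "pdx-c": True}) == "pdx-b"
```

The scheme only works, of course, if every service a facility hosts really can run from any of the three sites — which is exactly where things went wrong.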
Sounds great, right? However, that is not what happened.
The first thing that happened was an unexpected power disruption at the Flexential facility. Portland General Electric (PGE) was forced to shut down one of its independent power feeds to the building. The data center has multiple somewhat-independent feeds that can power the facility, but Flexential responded by firing up its generators to run alongside the remaining feed.
For those of you who don't know data center best practices, this approach is a no-no: you don't run utility power and generators at the same time. Adding insult to injury, Flexential didn't tell Cloudflare that it had switched to generator power.
Then, there was a ground fault in a PGE transformer feeding the data center. And when I say ground fault, I don't mean the kind that sends you down to the basement to flip a breaker. I mean a 12,470-volt bad boy that took out the utility connections and all the generators in less time than it took you to read this sentence.
In theory, a bank of UPS batteries should have kept the servers running for 10 minutes, which should have been enough time to restart the generators. Instead, the UPSes started dying after about four minutes, and the generators never came back up in time.
Someone on site might have been able to save the situation, but with the overnight staff "consisting of security and an unaccompanied technician who had only been on the job for a week," it was hopeless.
Meanwhile, Cloudflare discovered the hard way that some critical systems and newer services had not yet been integrated into its high-availability setup. Furthermore, Cloudflare's decision to keep logging systems outside the high-availability clusters, on the assumption that analytics latency would be acceptable, proved to be wrong. Because Cloudflare's staff couldn't get a good look at the logs to see what was going wrong, the outage dragged on.
It turned out that, while the three data centers were "mostly" redundant, they weren't completely redundant. The other two data centers in the region did take over the high-availability clusters and kept critical services online.
So far, so good. However, a subset of services that were supposed to be on the high-availability cluster had dependencies running exclusively in the dead data center.
Specifically, two important services that process logs and power Cloudflare's analytics, Kafka and ClickHouse, were only available in the offline data center. So, when services on the high-availability cluster called Kafka and ClickHouse, they failed.
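This failure mode is easy to reproduce in miniature. In the sketch below, all names and the topology are invented for illustration: a service is itself replicated across every facility, but its dependency is pinned to a single one, so losing that one facility takes the "highly available" service down with it:

```python
# Illustrative hidden-dependency failure. The service runs in every
# facility, but its dependency (think Kafka or ClickHouse) was only
# ever deployed to one. All names here are hypothetical.

SERVICE_FACILITIES = {"pdx-a", "pdx-b", "pdx-c"}   # replicated everywhere
DEPENDENCY_FACILITIES = {"pdx-a"}                  # pinned to one facility

def service_available(online: set) -> bool:
    """The service works only if both it and its dependency have a facility online."""
    return bool(SERVICE_FACILITIES & online) and bool(DEPENDENCY_FACILITIES & online)

# Lose a facility the dependency doesn't live in: the service survives.
assert service_available({"pdx-a", "pdx-c"}) is True

# Lose the one facility hosting the dependency: the "redundant" service fails.
assert service_available({"pdx-b", "pdx-c"}) is False
```

The lesson is that redundancy is only as good as the least-replicated thing in the call chain, which is why auditing transitive dependencies matters as much as replicating the services themselves.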
Cloudflare admits it was "too lax about requiring new products and their associated databases to integrate with the high-availability cluster." Moreover, too many of its services depend on the availability of its core facilities.
Many companies operate this way. However, Prince admitted, this "doesn't play to Cloudflare's strengths. We are good at distributed systems. Throughout this incident, our global network continued to perform as expected," but many systems fail if the core is unavailable. "We need to use the distributed systems products that we make available to all our customers for all our services so they continue to function mostly as normal even if our core facilities are disrupted."
Hours later, everything was finally back up and running, and getting there wasn't easy. For example, almost all of the facility's circuit breakers were fried, and Flexential had to go out and buy more to replace them.
Anticipating more power surges, Cloudflare also decided "the only safe recovery process was to follow a complete bootstrap of the entire facility." That meant rebuilding and rebooting all the servers, which took hours.
The incident, which lasted until November 4, was finally resolved. Looking ahead, Prince concluded: "We have the right systems and procedures in place to be able to withstand the cascading string of failures we saw at our data center provider, but we need to be more rigorous about enforcing that they are followed and tested for unknown dependencies. This will have my full attention and the attention of a large part of our team through the balance of the year. And the pain of the last few days will make us better."