The 2019 Cloudflare outage highlighted the importance of treating configuration as code. This article explores why configuration should be managed with the same rigor as software code, discusses common pitfalls in deployment practices, and emphasizes the need for thorough testing to ensure system stability and performance.
You probably experienced it yourself, or at least heard about it. The internet broke for half an hour on July 2nd, 2019. What really happened was that Cloudflare, a major CDN (content delivery network) provider, had a global outage. But it effectively broke the internet, because a large share of websites either use Cloudflare directly or depend on a service that uses it.
This shows us how the whole web is getting more and more centralized and if one of the major players has issues, everybody feels it. But we’re not here to rant about centralization and how bad it is. It’s bad.
No. I want to talk about a tangential topic, which (probably) played a role in this outage.
And one more thing, before we begin. In no way do I think Cloudflare are incompetent or did anything wrong. If anything, they handled the incident openly and quickly, as it should be. Good job!
Code is king
Developers like code. Sometimes they even fall in love with their code, which is an issue of its own. They learned to treat code well a long time ago. We have amazing version control systems and tools. We have great testing frameworks, and any developer worth anything writes automated tests that run on every commit.
Long gone are the days where we FTPd our code to servers from local machines. We learned to automatically package and deploy the code, with things like blue/green deployments, canary deploys, rolling upgrades and so on. All the good stuff.
Configuration is, ehm, …
At runtime, all but the simplest software requires some form of configuration. To make things more complicated, there are several types of configuration:
- internal app configuration - things like which modules should do what in Nette, or how Spring does it. I’d argue that this should not be called configuration and should be strictly part of the app, in version control and deterministically packaged.
- infrastructure configuration - you know, things like whether debugging should be enabled, connection strings to the database, credentials. This is most often environment specific.
- runtime configuration - this is usually also environment specific and includes things like feature flags, various limit values and constants, but also business rules. Think VAT rates, discount calculation values, which URL should redirect where and so on. Many people mistake this for “data”, but it’s not. It’s configuration.
Now, how do we treat this configuration? The internal app configuration most probably is treated like code, which is a good thing. It starts to get messy when this type of config is mixed with other kinds of configuration in one file. This should be avoided, but that is generally not too complicated.
The infrastructure configuration lives in config files of various sorts: INI files, YAML files, JSON files or even XML files, god forbid. Since these files are not shipped with the application, they have to be generated or updated by some other means. Infrastructure management tools like Chef or Puppet try to automate this task, but then again, you have to maintain the configuration values in some other form within these tools. Chef, for example, stores the config values in environment files, which are usually JSON. These can be versioned, so at least there is that, but good luck running any kind of automated tests of the values themselves, or of the actual files that will be generated.
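Hard does not mean impossible, though. Here is a minimal sketch in Python of what a test of the values themselves could look like. The keys `debug`, `db_url` and `pool_size` and the function `validate_environment` are made up for illustration; they are not Chef's schema or anybody's real config.

```python
import json
from urllib.parse import urlparse


def validate_environment(raw_json: str, environment: str) -> list[str]:
    """Return a list of problems; an empty list means the config passed.

    The keys checked here (debug, db_url, pool_size) are hypothetical.
    """
    config = json.loads(raw_json)
    problems = []

    # Debugging must never be switched on in production.
    if environment == "production" and config.get("debug", False):
        problems.append("debug must be disabled in production")

    # The connection string must at least parse into a scheme and a host.
    db_url = urlparse(config.get("db_url", ""))
    if not db_url.scheme or not db_url.hostname:
        problems.append("db_url must contain a scheme and a host")

    # Limits should stay inside a sane range.
    pool_size = config.get("pool_size")
    if not isinstance(pool_size, int) or not 1 <= pool_size <= 100:
        problems.append("pool_size must be an integer between 1 and 100")

    return problems
```

Run something like this in the same pipeline that runs your unit tests, once per environment file, and fail the build when the list is not empty.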
And the runtime configuration? As I wrote earlier, most people consider this data. And where do we store data? In a database, of course! And how do you do a canary deployment of a changed value in a database row? You don’t. Unless you specifically build for it, which is almost unheard of.
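Building for it does not have to be a huge project. A rough sketch in Python, assuming each config row keeps a `stable` value, an optional `candidate` value and a `canary_percent` (all three names, and both functions, are hypothetical): every server or user is hashed into a stable bucket, and only the buckets below the percentage see the new value.

```python
import hashlib


def in_canary(unit_id: str, setting: str, canary_percent: int) -> bool:
    """Deterministically place a server (or user) into the canary group for one setting."""
    digest = hashlib.sha256(f"{setting}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent


def resolve(setting: str, row: dict, unit_id: str):
    """Pick the candidate value for canary members, the stable value for everyone else."""
    if "candidate" in row and in_canary(unit_id, setting, row.get("canary_percent", 0)):
        return row["candidate"]
    return row["stable"]


# Example: roll a changed VAT rate out to roughly 5% of servers first.
row = {"stable": 0.20, "candidate": 0.21, "canary_percent": 5}
value = resolve("vat_rate", row, "server-42")
```

Raising `canary_percent` step by step gives you a gradual rollout, and setting it back to 0 is the rollback. The same unit always lands in the same bucket, so a given server does not flip between values from one request to the next.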
What does it have to do with Cloudflare?
So why am I even writing about this in connection with the Cloudflare incident? If you read the preliminary report, you will find this sentence:
The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules.
Cloudflare are not amateurs. They are a highly successful and competent company. You can find a lot of interesting info about their infrastructure on the interwebs. They do canary deploys, they do dark launching, they do rolling upgrades. Except when they change the filtering rules. This is just a configuration change. Probably a record in a database somewhere, which then gets replicated to their POPs. It would seem they do not treat the configuration as code. And they probably have several good reasons for that. One of them, and probably the most important, is that it is faster.
You see, the building, the testing, the canary deploys, the gradual rollouts, these things take time. If you add or change a rule in a filterset, you want that to happen immediately. Heck, you want that with software deploys also, but we sort of accepted the fact that if you want to do it safely, it takes a bit of time. But just a config change? Nah, that’s just silly, let’s simply change it.
Configuration as code
This gets me to the root of this post: Configuration is code. It may not look like it, but it changes the way our software operates. It fulfills all the characteristics of code, except it “looks” much simpler. Looks can be deceiving.
You should ask all the same questions as with code. Is it in version control? Has it passed code review? Is it covered by tests? Is the deploy automated? Do we have documentation?
Simply, treat your configuration as you treat your code.
PS: Dark launching is not free
One other interesting nugget of wisdom came out of the Cloudflare incident. Read the following sentence from the report:
These rules were being deployed in a simulated mode where issues are identified and logged by the new rule but no customer traffic is actually blocked so that we can measure false positive rates and ensure that the new rules do not cause problems when they are deployed into full production.
So, again, Cloudflare know what they are doing. They do not release a change before running it in a sort of “inactive mode”. If you don’t know, dark launching is a technique for releasing new functionality into production without actually affecting users. Maybe you release a new search algorithm, but you enable it only for users with a specially crafted cookie. Or you send your search queries to the new engine and compare the results with the old one, but you still serve the users from the old one. This technique is generally better than just running it in your test environment, because for any non-trivial application, simulating real users in test is near impossible. But as we have seen, it is not free. If you make a certain kind of error in implementing this dark feature, it can still affect your production. So do not consider dark launching a replacement for testing of any kind. It’s not. It’s a release strategy.
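To make the search example concrete, here is a minimal sketch in Python. The function `search_with_shadow` and both engine parameters are hypothetical; this shows the general shape of the technique, not how Cloudflare or anyone else implements it.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("dark_launch")


def search_with_shadow(
    query: str,
    old_engine: Callable[[str], List[str]],
    new_engine: Callable[[str], List[str]],
) -> List[str]:
    """Serve results from the old engine while exercising the new one in the dark."""
    served = old_engine(query)
    try:
        shadow = new_engine(query)
        if shadow != served:
            logger.info("shadow mismatch for query %r", query)
    except Exception:
        # A crash in the dark path must never reach the user.
        logger.exception("shadow engine failed for query %r", query)
    return served
```

Notice what the `try`/`except` does not protect you from. The new engine still runs inline, on the same machine, for every request. If it is slow or burns CPU, your users feel it even though they never see its results. That is the "not free" part.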
PS2: Performance testing
The report is a gift that keeps giving. Let’s take a look at one more part:
Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide.
I would hazard a guess that Cloudflare has some sort of syntax checking on the regular expressions, so that they do not release an invalid one. But apparently, they were not checking the performance of the expressions. And I don’t blame them, because that is really hard.
I am not bashing regular expressions here. They are a great tool, but a sharp one. Adjust your usage of them accordingly.
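Catching the worst offenders does not have to be fancy, though. A rough sketch in Python, assuming a made-up helper `regex_within_budget` and an arbitrary two-second budget: Python's `re` module has no timeout of its own, so the candidate expression runs against a nasty input in a child process, and the check fails if it does not finish in time. The pattern `(a+)+$` used below is a textbook example of catastrophic backtracking, not the expression from the Cloudflare rule.

```python
import subprocess
import sys

# One-liner executed in the child process: pattern and input arrive via argv.
_CHILD = "import re, sys; re.search(sys.argv[1], sys.argv[2])"


def regex_within_budget(pattern: str, text: str, budget_seconds: float = 2.0) -> bool:
    """Return True if matching finishes within the time budget, False if it times out.

    An invalid pattern makes the child exit with an error, which is raised
    here as subprocess.CalledProcessError.
    """
    try:
        subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            timeout=budget_seconds,  # the child is killed when this expires
            check=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return True


# A short input is enough to expose exponential backtracking.
adversarial = "a" * 40 + "b"
harmless_ok = regex_within_budget(r"a+$", adversarial)
nested_ok = regex_within_budget(r"(a+)+$", adversarial)
```

The hard part is not the harness, it is coming up with the adversarial inputs. A check like this only tells you about the inputs you thought of, so treat it as a safety net and not as proof that an expression is safe.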
What I want to talk about is performance testing as part of the release process. Which, apparently, they did not have for these rules. And let’s be honest: almost nobody really does, except maybe Cloudflare now after this incident. I’d argue that at least some basic performance tests should be part of any mature continuous integration pipeline. It’s not easy, but it can be done.
Or if that’s not possible, at least set up daily performance tests, using your Selenium server or what have you. There are several companies that offer synthetic performance test suites. I especially like SpeedCurve, but there are others. Point them at your production but also at your test site and have the tests run several times a day. And make sure you set up performance budgets and alerts for when you go over them.
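A performance budget itself can be very simple. A minimal sketch in Python, assuming you already collect load times per page from somewhere; the function `over_budget`, the page paths and the numbers are all invented for illustration:

```python
from statistics import median


def over_budget(
    samples_ms: dict[str, list[float]],
    budgets_ms: dict[str, float],
) -> dict[str, float]:
    """Return the pages whose median load time exceeds their budget."""
    violations = {}
    for page, budget in budgets_ms.items():
        samples = samples_ms.get(page, [])
        if samples and median(samples) > budget:
            violations[page] = median(samples)
    return violations


# Hypothetical measurements from today's runs, in milliseconds.
samples = {"/": [800, 900, 1000], "/search": [1500, 2500, 2600]}
budgets = {"/": 1000, "/search": 2000}
violations = over_budget(samples, budgets)
```

I use the median here so that a single slow run does not wake anybody up at night. Whatever comes back non-empty is what you alert on.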
UPDATE July 13, 2019: Cloudflare published an extensive post-mortem on their blog.