I am sure this has never happened to you, but try to imagine. Your on-call phone just got the message: The site is down! You open your Nagios dashboard and you are greeted by a sea of red. None of the Apache servers are responding. The database connections are exhausted. Load balancers are screaming that they have no healthy backends. Everything else looks dead quiet. Everybody starts throwing around guesses about what is going on.
The conversations and events described below are entirely fictional and any resemblance to actual persons, living or dead, or actual events is purely coincidental.
Team lead: Did someone deploy something?
Dev: We last deployed the site four hours ago. It’s the database probably, I saw some alert saying the connection pool is exhausted, call the DBA!
DBA: The database is fine. There are just 10x the number of normal open connections from application servers, but it’s not doing anything. I made one mistake two years ago and now it’s always the database you blame. Maybe it’s the network?
Network admin: If it was the network, nothing would work. But there is almost no traffic. It must be something in the code. Or maybe the load balancers went crazy, it says here they are offline?
Operations: The load balancers are offline, because the backend servers are timing out so they have nowhere to send the traffic. Are you sure you did not deploy anything?
Manager: Are we under attack? I bet it’s DDoS. Someone said there are 10x the number of connections than is usual, it looks like an attack.
… 10 minutes of similar conversations pass quickly …
SRE: It’s probably nothing, but you know that widget that we have on the product page? The one that shows how many people are currently viewing that page? We have a microservice for that, right? Well, the Redis instance, which holds the data for it, looks like it has a load of 120. That is probably not good, right? But surely this cannot be the reason the whole site is down, right? I mean, the widget is pretty, but not crucial for the site to function.
Dev: Sure, if the call fails, we just don’t display it and everything works fine. We planned for this. That cannot be it.
SRE: See, that is the thing. The service works, but the response takes 10 seconds. But you guys have, like, very short timeouts, when you call the service on the product page, right?
How did we miss that?
Maybe you have not experienced something like this, but I have. And more times than I would care to admit. Even though timeouts have been a known resilience technique since forever, they are (or more precisely the lack of them is) the most common source of downtime in any sufficiently complex system I have seen. There are several reasons for that.
Insane defaults. Most network calls and primitives have timeouts built-in. But these timeouts have default values set in the last millennium. Let’s take curl, the most popular library used to make HTTP calls. Its default connection timeout is 300 seconds (5 minutes). The total request timeout defaults to 0, meaning the request never times out. You have to deliberately set it to something sane yourself. The situation is the same with most timeouts in the TCP/IP stack and in many libraries. In practice, the defaults are usually so high that it’s almost as if there was no timeout at all.
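Here is a minimal Python sketch of setting the timeout deliberately. It uses the standard library's urllib.request rather than curl (in libcurl the corresponding options are CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT). The fetch helper and the 2 second value are assumptions for illustration only. Note that urllib also uses no socket timeout by default, and that its timeout applies to each blocking socket operation, not to the request as a whole.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout_s=2.0):
    """Return the response body, or None if the call failed or timed out.

    timeout_s is an illustrative value, not a recommendation. It bounds the
    connect and every individual read, not the total duration of the request.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except (urllib.error.URLError, socket.timeout):
        # A timeout while connecting arrives wrapped in URLError, a timeout
        # while waiting for the response arrives as socket.timeout.
        return None
```

The caller gets None back after roughly timeout_s instead of hanging for minutes, and can decide what to do with the rest of the page.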
Imperfect mental models. When we reason about a complex system, we use a mental model. In a perfect world, everyone would have the same mental model, which would exactly replicate the real thing. Oh, and we would also have a beautiful up-to-date Visio schematic of such a system printed on every wall. Alas, we do not live in a perfect world. The documentation is usually scattered over wikis and README files in the repo, if it exists at all. Everyone works with an imperfect mental model, and how far that model is removed from the real thing depends on seniority. Most people will not know about every moving piece in the system, let alone know how they interact. In such a situation, they stick to what they know and look for problems there. In a complete meltdown, the problems are manifesting everywhere and it’s hard to tell cause from consequence.
Difficult testing. Most commonly used testing techniques will not cover cases where a call that normally takes 10ms suddenly takes 10s. This is nearly impossible to cover with a unit test, and even integration tests usually deal with the call either going through normally or failing completely. I have yet to see a static analysis tool that would tell you “You are using a curl call, but you did not set a timeout for the connection” (I’d be more than happy if someone could correct me on this). It takes a manual process to check all the places where external calls are made and make sure reasonable timeouts are in place.
Timeouts are not a cure for everything. But they have two main benefits. They help prevent cascading failures, like the one we described in the fictional incident. And they allow you to make choices, when there is still time to salvage the situation.
Even if people are aware and have set timeouts, often they are too long and when sh*t hits the fan they do little to help the situation. When I ask them, why is there a 5 second timeout on a call that usually takes 100ms, they explain there are reasons. To name a few, the reasons might be:
“During a maintenance window at night, the database gets overloaded and the call takes much longer than usual. When we had the timeout lower, we used to get notifications at night. Increasing the timeout fixed that.”
“We have that one customer who has 1000 items on his wishlist and if we lower the timeout on the wishlist service call to 100ms, the wishlist will stop working for him.”
You get the idea. All of these are sort of valid, but simply show a bad design or capacity issues.
If your database gets overloaded during maintenance, you are doing it wrong. Maybe you can split the maintenance into more frequent but smaller batches? Or do not produce garbage that needs cleaning up in the first place.
And about that customer… Have you checked with your sales people to see if this customer is actually bringing any profit, or is it just your competition using the wishlist to monitor your prices? And even if it was a legit profit-making customer, are you sure he wants to see all the 1000 products on his wishlist on one page? Probably not, so introduce paging and limit the single request to 10 items or something like that.
People are often afraid of small failures and in the process of preventing them sacrifice the overall system resilience, which in turn allows huge, stop-the-world, outages to happen. I say: Be aggressive with your timeouts. If your system use case has outliers, like I mentioned above, which prevent you from using aggressive timeouts, deal with them. More often than not they are canaries, which point to design weaknesses in your services. They help you uncover technical debt.
But how do you set reasonable timeouts in a complex system? One rendered web page can consist of several calls to other system components and services, be they databases, caches, or internal or external APIs. Even accessing files on the file system can be considered calling an external service, because these days the file often does not sit on the local server hard drive anyway and is susceptible to network problems, same as any other service call.
It helps to use the concept of performance budgets. Let’s say you set an internal target, that a page must be rendered in less than 5 seconds or is considered a failure. Obviously your target is not to take 5 seconds to generate every page, but it’s the absolute limit until you fail. It could be a timeout of an upstream load balancer or CDN, for example. This gives you a time budget, where you need to fit everything, including failures. Now you can start considering which external calls you make and divide your budget between them.
Let’s say you have a call to a service that is absolutely essential for the page. Without the information from this service the page cannot function. It means you probably want to set the timeout for this service to such a value, that you will have enough time in your budget to retry the call once or twice if the first one fails or times out.
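The retry idea can be sketched in Python as follows. The call_with_retries helper, its parameters, and the convention that the client function takes a timeout in seconds and raises TimeoutError are all assumptions made for this example. The point is that the per-attempt timeout is derived from the budget, so that the retries still fit inside it.

```python
import time


def call_with_retries(call, budget_s, attempts=3):
    """Try call(timeout_s) up to `attempts` times without exceeding budget_s.

    `call` is a hypothetical client function that accepts a timeout in
    seconds and raises TimeoutError when it gives up.
    """
    deadline = time.monotonic() + budget_s
    per_attempt_s = budget_s / attempts
    for _ in range(attempts):
        remaining_s = deadline - time.monotonic()
        if remaining_s <= 0:
            break  # budget is spent, do not start another attempt
        try:
            # Never hand out more time than is left in the budget.
            return call(min(per_attempt_s, remaining_s))
        except TimeoutError:
            continue
    raise TimeoutError("budget exhausted")
```

With a budget of 1.5 seconds and three attempts, each attempt gets at most 500ms, which is what "retry once or twice" costs in budget terms.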
Then you might have a service which is not that essential, like the one we mentioned in our initial drama. You are even OK when 5% of the page views do not contain the information from this service. In this case, you can set the timeout to the 95th percentile of the typical call duration (assuming you have such data, which is essential for any kind of performance budgeting). That way, you save your performance budget for things you really need.
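If you have the call durations as raw samples, deriving such a timeout is a few lines of Python. This sketch uses the nearest-rank definition of a percentile, which is only one of several in use. The percentile helper and the sample values are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical call durations in milliseconds, with one slow outlier.
durations_ms = [100] * 19 + [5000]
timeout_ms = percentile(durations_ms, 95)
```

In this made-up data set the outlier does not drag the timeout up with it, which is exactly the behaviour you want for a non-essential call.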
Assuming you use caching for some type of service call, there is one more thing you can do. The typical scenario is, you get some value from a service, which is expensive to calculate and you store it for a while in some sort of cache. This could be memcached, Redis, doesn’t matter. You set the time you want to cache it for as a timeout or expiration on the key. That way, you first look into the cache, if it’s there you use the value, if it’s not, you call the service to get the fresh value. But you can be smarter than that. You can make the expiration time part of the data you store in the cache and set the actual key expiration for a longer period, say 3x your desired cache period. This allows you to use this slightly stale data if the call to the service fails or times out. It may not be up to your freshness standard, but depending on your use case, still better than completely failing. This is especially useful in cases where you absolutely need the data and the service call is expensive, so you do not have time in your budget for a retry, if it times out.
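A minimal sketch of this stale-on-failure pattern in Python, with a plain dict standing in for Redis or memcached. The class name, field names, and the injectable clock are assumptions made for the example. In a real cache the hard expiration would be the TTL on the key and the cache would evict the entry by itself.

```python
import time


class StaleTolerantCache:
    def __init__(self, fresh_s, stale_factor=3, clock=time.monotonic):
        self.fresh_s = fresh_s                     # desired cache period
        self.hard_ttl_s = fresh_s * stale_factor   # actual key expiration
        self.clock = clock
        self._store = {}                           # stand-in for Redis/memcached

    def get(self, key, fetch):
        """Return a fresh value if possible, a stale one if fetch() fails."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now >= entry["hard_expires_at"]:
            del self._store[key]  # simulates the key TTL running out
            entry = None
        if entry is not None and now < entry["fresh_until"]:
            return entry["value"]
        try:
            value = fetch()  # the expensive service call
        except Exception:
            if entry is not None:
                return entry["value"]  # stale, but better than failing
            raise
        self._store[key] = {
            "value": value,
            "fresh_until": now + self.fresh_s,
            "hard_expires_at": now + self.hard_ttl_s,
        }
        return value
```

The freshness deadline travels with the data, so the reader of the cache, not the cache itself, decides whether a slightly old value is still good enough.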
If you add all the budgets for the calls together, you should still fit into your overall budget. Or you can go a bit over, obviously. You know, like in real life, where you hope that not all the budgeted expenses materialize. But it presents a risk and you should be at least aware of it.
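The arithmetic is simple enough to keep in a small script next to the code. In this Python sketch the 5 second page budget comes from the example above, while the call names and their individual numbers are invented, and the calls are assumed to run one after another.

```python
PAGE_BUDGET_MS = 5000

# Hypothetical worst-case budgets per call, in milliseconds.
CALL_BUDGETS_MS = {
    "product_db": 3 * 800,    # essential: first try plus two retries
    "pricing_api": 2 * 600,   # essential: first try plus one retry
    "viewers_widget": 150,    # non-essential: 95th percentile, no retry
    "render": 500,            # time spent building the page itself
}


def budget_report(call_budgets_ms, page_budget_ms):
    """Return (total, headroom); negative headroom means you are over."""
    total = sum(call_budgets_ms.values())
    return total, page_budget_ms - total


total_ms, headroom_ms = budget_report(CALL_BUDGETS_MS, PAGE_BUDGET_MS)
```

Here the worst case adds up to 4250ms and leaves 750ms of headroom. A negative headroom would be the risk mentioned above, made visible.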
The individual service call budgets are also useful in conversations with the people whose service you are calling. If you are the only consumer of the service, they can use it as their own top level budget and repeat the process.
So we know how things should be built, but how do you actually make sure that they really are?
It helps if you specifically mention these design patterns in your coding standards or development handbook or some such thing, if you have it. It also helps if you have a code review process in place and you specifically mention checking for proper use of timeouts in your review guidelines. Put it right beside checking for SQL injection attack vectors and other such review criteria you might have.
This still does not prevent people from forgetting to put the timeouts in or from setting them too long. We can however use a technique from the field of chaos engineering.
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Specifically, we will randomly introduce delays into our responses. It’s really simple to implement. As a service maintainer, all you have to do is put a sleep() call into your service and invoke it randomly with a pre-set probability. Let’s say your service normally takes 100ms to respond. Now, in 0.01% of calls (1 in 10000), you pause for 5 seconds and thus make the call last 5100ms. I suggest you do that and then wait. Also, now would be a good time to create a dashboard which plots the number of such events. Or maybe a pretty histogram of call duration, hmm?
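In Python, one way to sketch this is a decorator around the request handler. The inject_delay name, its parameters, and the viewers_count handler are hypothetical. The random source and the sleep function are passed in only so that the behaviour can be checked without actually waiting.

```python
import functools
import random
import time


def inject_delay(probability, delay_s, rng=random.random, sleep=time.sleep):
    """Pause for delay_s before handling a call, with the given probability."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                sleep(delay_s)  # the deliberately slow response
            return handler(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical handler: 1 call in 10000 takes an extra 5 seconds.
@inject_delay(probability=0.0001, delay_s=5.0)
def viewers_count(product_id):
    return {"product_id": product_id, "viewers": 0}  # placeholder payload
```

The response itself stays correct. Only the latency changes, which is what exposes callers with missing or overly generous timeouts.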
After a while, some colleagues might show up at your desk, explaining that suddenly something weird started appearing in their dashboards and asking questions. That is a good thing and you can have a conversation about your performance budgets and how they can set reasonable timeouts and failure policies.
If nobody shows up, then you really need to start worrying. Either nobody uses your service, which might be good or bad news, or, more likely, they are not even aware your service sometimes takes that long to respond. At that point, you are the one who should pay them a visit and start the conversation. You can even send them a link to this article (wink wink).
I suggest you have the failure probability configurable per environment. Start with introducing it in test. Crank up the fail rate to 10% or more and increase the delay, because usually testing environments do not get that much traffic and people are used to things not working smoothly in test. You have to really fail miserably in order for someone to notice. Once you believe you have uncovered everything, it’s time to step up your game and introduce the failures to production. Obviously you want it to happen much less frequently and ideally only during working hours, when everyone is fresh and ready to respond. Oh, and tell people you are doing that, please.
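A per-environment configuration for such injected delays could be sketched in Python like this. The environment names, the numbers, and the definition of working hours (weekdays, 9 to 17) are assumptions for illustration, not recommendations.

```python
import datetime

# Hypothetical settings: loud in test, rare and office-hours-only in production.
CHAOS_SETTINGS = {
    "test": {"probability": 0.10, "delay_s": 10.0, "working_hours_only": False},
    "production": {"probability": 0.0001, "delay_s": 5.0, "working_hours_only": True},
}


def delay_probability(env, now):
    """Return the probability of injecting a delay in `env` at time `now`."""
    settings = CHAOS_SETTINGS.get(env)
    if settings is None:
        return 0.0  # unknown environment: never inject anything
    if settings["working_hours_only"]:
        is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
        if not (is_weekday and 9 <= now.hour < 17):
            return 0.0
    return settings["probability"]
```

Called as delay_probability("production", datetime.datetime.now()), it returns zero outside working hours, so nobody gets paged at night because of an experiment.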
If you were not convinced before, I hope you are now, that timeouts are not evil and in fact are absolutely necessary if you want to build a resilient system. Even though the concept is trivial, the proper use is not. And checking for proper use is even harder. Nobody gets it right the first time, but if you identify at least half the places where timeouts should be and put them in place, you’ll have increased the resilience of your system. Resilience is not binary, it’s a continuous scale.
Please let me know in the comments if you found the information helpful or if you found better ways to deal with timeouts in your systems.
The topic of system resilience is dear to me and if you’d like to learn more on this topic I suggest you read my previous articles: Four levels of resilience in systems and Implementing Circuit Breaker Pattern.