One of the editions of Resilience Roundup, a newsletter dedicated to the topic of systems resilience compiled by Thai Wood, covers a paper by MIT's Nancy Leveson called “The Role of Software in Spacecraft Accidents”. In the paper, the author studies five prominent software-related accidents in the aerospace industry. Thai Wood does a much better job of summarizing the paper than I would, so I will not go over the whole paper here; go read the newsletter linked above. I'd just like to focus on three of the themes discussed in the paper.
Failure of systems
One thing all the studied incidents have in common is their systemic nature: the components, both software and hardware, operated according to their specifications and did not fail individually. Instead, the disasters were caused by unforeseen interactions between the components and their combined behavior.
This is especially true for software components, which usually have much more “freedom of movement” than hardware: there is much higher variability in input data, code paths and configuration. In most cases this is what we want, and it's why we solve problems in software rather than in hardware. But it's also by far the leading cause of system failures.
… we are building systems where the interactions among the components (often controlled by software) cannot all be planned, understood, anticipated, or guarded against …
Redundancy in hardware and software
When we design systems, the tendency is to focus on individual components and their failures. That's only logical, since it's much easier to reason about the function of an individual component. As engineers, we've been doing this for a long time, and in the realm of hardware it's often solved by redundancy.
The aerospace industry is a stellar example of this. Almost all components, certainly the critical ones, are installed in pairs or even in triplicate. What can usually be observed in this kind of redundancy is that the components are highly independent. All airliners have two (or more) engines; these engines are operated and controlled independently, and the plane can take off, fly and safely land with just one of them operating.
Coming back to computer hardware, let's take the example of power supplies. If your server requires 500 watts of power, it often has two 500-watt power supplies. If one of them fails, the server can still operate on the remaining one. With this redundancy added, the power sub-system is not significantly more complex than the non-redundant version.
Now let's look at the situation in the world of software. It's not uncommon that adding redundancy to a software system increases its complexity by an order of magnitude. Let's demonstrate this with a simple example: say you are building an application that allows users to upload pictures and later view them.
The non-redundant version is super simple. When the user uploads an image, you store it on a local disk; later you just load it back from disk and display it. It can probably be written in an afternoon and runs on your laptop the same as on a data center server.
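To show just how simple, here is a minimal sketch of that single-disk version. All names here (STORAGE_DIR, save_image, load_image) are hypothetical, invented for illustration:

```python
import shutil
from pathlib import Path

# Hypothetical storage location for the non-redundant version.
STORAGE_DIR = Path("uploads")

def save_image(image_id: str, source: Path) -> Path:
    """Copy an uploaded image onto the local disk under its id."""
    STORAGE_DIR.mkdir(parents=True, exist_ok=True)
    dest = STORAGE_DIR / image_id
    shutil.copyfile(source, dest)
    return dest

def load_image(image_id: str) -> bytes:
    """Read the stored image back for display."""
    return (STORAGE_DIR / image_id).read_bytes()
```

That's the whole storage layer: one directory, two functions, no failure modes beyond a full or broken disk.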
To make the system resilient against any single component failure, the story suddenly gets really complicated. You have to install the application on multiple servers. But now you need to somehow share the uploaded image files between those servers. One option is to sync the files between the servers; that might be doable for two servers with rsync, but it gets really messy with three or more. You could also use some kind of dedicated data store. It could be a relational database, but then you need to make that redundant with some sort of clustering or replication. Or you could use a clustered object store such as MinIO or Ceph. Sure, you could use a cloud service like S3 or Google's Cloud Storage. Oh, and remember that you need to configure your application servers to point to the data store, so you need a properly configured network. And of course, you need to secure all of this.
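To make the jump in complexity concrete, here is a toy sketch of just the storage layer of a redundant version. This is not how you'd build it for real (the class names and the whole design are invented for illustration); it exists to show the questions that appear the moment a second copy of the data exists:

```python
from pathlib import Path

class DiskBackend:
    """One storage node. In reality this would be a remote server."""
    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        (self.root / key).write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

class ReplicatedStore:
    """Writes fan out to every node; reads fall back to the next node."""
    def __init__(self, backends):
        self.backends = backends

    def put(self, key: str, data: bytes) -> None:
        failures = 0
        for backend in self.backends:
            try:
                backend.put(key, data)
            except OSError:
                failures += 1
        if failures == len(self.backends):
            raise OSError("write failed on every node")
        # Partial failure: the data is now inconsistent across nodes,
        # and nothing here repairs it. Real systems need anti-entropy,
        # quorums or consensus -- this is where the complexity explodes.

    def get(self, key: str) -> bytes:
        for backend in self.backends:
            try:
                return backend.get(key)
            except OSError:
                continue
        raise OSError("read failed on every node")
```

Even this toy already has to answer questions the single-disk code never faced: what happens when a write succeeds on one node but fails on another, and how do nodes that were down catch up?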
The point I am trying to make here is this: attempting to make a system more resilient by adding redundancy is often the actual cause of that system's downtime. Distributed systems are just hard.
Reflecting on the many incidents I have seen in my career, I believe the number of incidents caused by various database clustering solutions outnumbers the number of failures of individual database nodes. Therefore, if you run a well-maintained database server, you might very well have better uptime than with a fancy cluster. Of course, sometimes you might not have any other option. If you really need to maintain 100% uptime, or you need more capacity than can be provided by a single box, you just have to go this way. But be prepared to pay for it. If it's just about potentially losing some revenue when your server goes down, you should make a serious calculation of the costs and benefits of redundancy.
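That calculation can be a back-of-the-envelope one. The sketch below uses made-up figures (replace them with your own); the question it asks is whether the downtime the cluster saves is worth more than the cluster costs to run:

```python
def downtime_cost(outage_hours_per_year: float, revenue_per_hour: float) -> float:
    """Yearly revenue lost to downtime."""
    return outage_hours_per_year * revenue_per_hour

# Purely illustrative assumptions -- substitute your own numbers:
single_node_outage = 8.0    # hours/year for a well-run single server
cluster_outage = 3.0        # hours/year, incl. cluster-induced incidents
revenue_per_hour = 500.0    # revenue lost per hour of downtime
cluster_extra_cost = 4000.0 # yearly extra hardware + ops cost of the cluster

savings = (downtime_cost(single_node_outage, revenue_per_hour)
           - downtime_cost(cluster_outage, revenue_per_hour))
print(f"Downtime avoided is worth {savings:.0f}/year, "
      f"against {cluster_extra_cost:.0f}/year of extra cost")
```

With these invented numbers, the avoided downtime is worth less than the cluster costs, so the single well-maintained server wins. Your numbers may of course say the opposite; the point is to actually run them.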
Idealized testing scenarios
The last theme deals with proper testing. Software testing is a well-studied discipline. Unfortunately, most of the testing I have seen involves functional testing, that is, verifying that the software performs pre-defined scenarios correctly under normal circumstances. Rarely does it involve testing the system under non-standard conditions and under realistic loads. At best, we limit ourselves to seeing how the system behaves when one of the components fails completely (by turning it off). That's what I call idealized testing scenarios.
A general principle in testing aerospace systems is to fly what you test and test what you fly.
When we are load testing our e-commerce systems for that Black Friday campaign, how many of us do so with one of the two database servers down? We might be quite ready for the case where a service we depend on is completely down. But do we also test for the case where the service responds with a 10-second delay on every fifth request?
I understand why that is. Simulating these things is also hard. It is especially hard under circumstances where the test team is separate from the development team, and both are separate from the operations team. Simulating service latencies can be done by inserting a few lines of code into the application, but if the test team is under different management, that might not be a priority. Deliberately failing a database node has to be done by administrators, and if they are incentivised by system uptime, that might be difficult to sell to them.
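Those “few lines of code” can really be few. Below is one hedged way to do it, a hypothetical wrapper around a dependency call that delays every nth invocation, mimicking the slow-every-fifth-request scenario above (fetch_user stands in for whatever client call your application actually makes):

```python
import time

def with_latency_fault(func, every_nth: int = 5, delay_seconds: float = 10.0):
    """Wrap a dependency call so every nth invocation is artificially slow."""
    count = 0
    def wrapper(*args, **kwargs):
        nonlocal count
        count += 1
        if count % every_nth == 0:
            time.sleep(delay_seconds)  # simulate a degraded backend response
        return func(*args, **kwargs)
    return wrapper

def fetch_user(user_id):
    """Placeholder for a real service client call."""
    return {"id": user_id}

# During a resilience test run, swap the real call for the wrapped one.
slow_fetch_user = with_latency_fault(fetch_user, every_nth=5, delay_seconds=0.1)
```

The same wrapper shape works for injecting errors or timeouts instead of sleeps; the hard part, as the text says, is organizational, not technical.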
This is actually one of the conclusions of the paper, and though the author does not call it by this name, we already know what it's called: DevOps. The lack of communication channels and poor information flow is cited as one of the major factors in all the disasters. Cross-team communication is hard; that's why we should strive to limit its necessity by putting most roles into one functional team.