Determining the order of events in a distributed system isn’t straightforward, as the concept of “now” is relative. This article explores the challenges of achieving consensus in distributed systems and why relying on proven algorithms like Raft or Paxos may be essential.
It turns out that knowing the chronological order of two events is kind of a useful thing.
If you sell cinema tickets and there are two buyers for one seat, you need to know who was first. You sell the ticket to that person and deny the other. If you have just one ticket office window, it’s easy. The buyers form a queue and the order of ticket sales is thus determined. The trouble with this approach is that it does not scale: if you want to fill a large cinema or you run a multiplex, you are in trouble. The queue would be too long.
Decentralize
You need more ticket offices. You need a distributed system. Unfortunately, it turns out that determining a common order of events in a distributed system, one that everybody agrees on, is pretty hard.
You could split the available tickets between the offices and, for example, only sell odd-numbered rows at one office and even-numbered rows at the other. That’s called sharding, and it has its own set of problems. An example of the kind of problem you will need to solve: what happens if one customer wants two tickets, one in an odd row and one in an even row, but only if both are available? The name fits. Shards have sharp edges and you can easily cut yourself.
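To make the cross-shard problem concrete, here is a minimal Go sketch of parity-based routing. The function and office names are hypothetical, made up for illustration; the point is only that a single-seat purchase is easy to route, while an "only if both are available" purchase spans two shards that can’t decide alone.

```go
package main

import "fmt"

// office returns which ticket office owns a given seat row.
// Purely illustrative parity-based sharding, not a real box-office API.
func office(row int) string {
	if row%2 == 1 {
		return "office-odd"
	}
	return "office-even"
}

func main() {
	// Routing single-seat purchases is trivial...
	fmt.Println("row 7  ->", office(7))  // office-odd
	fmt.Println("row 12 ->", office(12)) // office-even

	// ...but buying row 7 AND row 12 "only if both are available"
	// now spans two shards, and neither office can decide it alone.
}
```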
Alternatively, you can have multiple offices but just one ticket book, where you record the purchases. All ticket agents pass this book around. That’s called locking, or using semaphores. It’s a bit better than a single queue, as you can handle the payment and talk to customers in parallel, but effectively you can still serve only one customer at any given moment; that moment is at least shortened.
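A rough Go sketch of the "one ticket book passed around" idea, assuming a single shared map guarded by a mutex (the agent and seat names are invented for the example): the conversation and payment happen in parallel, only the moment of recording the sale is serialized.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// The shared ticket book and the lock that gets "passed around".
var (
	book     = map[string]string{} // seat -> buyer
	bookLock sync.Mutex
)

func sell(agent, seat, buyer string, wg *sync.WaitGroup) {
	defer wg.Done()

	// Talking to the customer and taking payment happens in parallel.
	time.Sleep(10 * time.Millisecond)

	// Only recording the sale is serialized by the lock.
	bookLock.Lock()
	defer bookLock.Unlock()
	if _, taken := book[seat]; taken {
		fmt.Printf("%s: seat %s already sold, refusing %s\n", agent, seat, buyer)
		return
	}
	book[seat] = buyer
	fmt.Printf("%s: sold seat %s to %s\n", agent, seat, buyer)
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go sell("agent-1", "A1", "Joe", &wg)
	go sell("agent-2", "A1", "Mary", &wg)
	wg.Wait()
}
```

Whichever goroutine grabs the lock first gets the seat; the other is refused. Exactly one customer at a time ever holds the book.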
Timing it
One obvious approach to determining the order of events is recording the precise time each event happened and later using that time to sort the records into one common journal. If event A has an earlier timestamp than event B, it means A happened first and B second. Alas, there are two flaws in this approach.
The first problem is that it only helps you in hindsight. In our example with ticket sales, knowing that Joe bought the ticket first and Mary second solves the problem only partially. You will let only Joe in, but you will have to refund Mary, and she’s not gonna be happy, because you ruined her date.
The second and probably more important problem is that we are not able to measure time precisely enough. Certainly not precisely enough for all actors in the distributed system to have the same time. “But we have NTP,” you say. “It was supposed to keep the time in the network in sync,” you say. Well, it does, but not really. At least not well enough for the purpose of determining the true order of events. If you read Google’s famous Spanner paper, one of the nuggets of wisdom from their research is that even in a highly synchronized cluster, the clock drift between nodes is typically between 1 and 7 milliseconds. That’s a lot, considering a modern CPU can perform about one hundred million instructions in each millisecond.
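Here is a toy Go illustration (not a measurement) of what a few milliseconds of drift does to timestamp ordering. The events and the 5 ms skew are made up; the point is that sorting by wall-clock timestamps can put the later event first.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type event struct {
	name  string
	stamp time.Time
}

func main() {
	trueTime := time.Now()

	// Event A happens first, on a node whose clock is accurate.
	a := event{"A (Joe buys the ticket)", trueTime}

	// Event B happens 2 ms later, but its node's clock runs 5 ms behind,
	// so the recorded timestamp lands *before* A's.
	b := event{"B (Mary buys the ticket)",
		trueTime.Add(2 * time.Millisecond).Add(-5 * time.Millisecond)}

	events := []event{a, b}
	sort.Slice(events, func(i, j int) bool {
		return events[i].stamp.Before(events[j].stamp)
	})

	fmt.Println("order by timestamp:")
	for _, e := range events {
		fmt.Println(" ", e.name)
	}
	// Prints B before A, even though A really happened first.
}
```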
So recording the time is useless for determining the correct order of things. And we don’t just need to record the true order for later review; we need to know that one thing has happened before we allow the other thing to happen. What are we left with? Basically, the Beverly Hills High gossip system.
Something has happened
Let’s take a look at an exchange required to record reliably that an event took place in a distributed system of three agents. As is customary in these kinds of scenarios, we’ll name them Alice, Bob and Charlie.
- X has happened and only Alice knows it.
- Alice sends a message to Bob and Charlie that X has happened. So Alice knows X and also knows that she messaged Bob and Charlie.
- Bob receives the message and records X in his journal. Now Alice and Bob know of X and Alice knows she told Bob and Charlie.
- Bob sends a receipt confirmation message to Alice. Now Alice and Bob know of X and Alice knows she told Bob and Charlie and Bob knows that he sent a confirmation to Alice.
- Alice receives the confirmation from Bob. Now Alice and Bob know of X and Alice knows she told Charlie and Alice knows that Bob knows of X. Bob currently does not know that Alice knows that he knows X.
- For brevity, let’s say that Charlie does the same thing as Bob in steps 3 to 5. So now Alice, Bob and Charlie all know of X and Alice already knows that everyone knows of X.
- Now Alice can tell Charlie that she knows that Bob already knows of X. Alternatively, Bob can tell Charlie directly and vice versa. Of course, Charlie has to confirm to Alice that he now knows that Bob knows of X.
- Alice sends a confirmation of the receipt confirmation back to Bob. After Bob receives it, he can be sure that Alice knows that he knows of X.
- After the same kind of exchange between Alice and Charlie we finally can get to an end state that we actually desire in a group: Everyone knows X has happened and also everyone knows that everyone knows about it.
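To get a feel for how chatty even the happy path is, here is a back-of-the-envelope Go sketch that just counts the messages in the exchange above. It assumes a perfectly reliable network and collapses the two peers into a loop; it models nothing about lost or duplicated messages, which is exactly what a real protocol has to handle.

```go
package main

import "fmt"

func main() {
	peers := []string{"Bob", "Charlie"}
	msgs := 0

	// Round 1: Alice announces X and collects receipts.
	for _, p := range peers {
		msgs++ // Alice -> p: "X has happened"
		msgs++ // p -> Alice: "got it"
		fmt.Printf("Alice knows that %s knows X\n", p)
	}

	// Now Alice knows that everyone knows X,
	// but Bob and Charlie don't know that yet.

	// Round 2: Alice spreads that knowledge and collects receipts again.
	for _, p := range peers {
		msgs++ // Alice -> p: "everyone knows X"
		msgs++ // p -> Alice: "got it"
		fmt.Printf("%s knows that everyone knows X\n", p)
	}

	fmt.Printf("%d messages, assuming none were lost or duplicated\n", msgs)
}
```

Eight messages for one fact among three participants, on a network that never drops or repeats anything.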
You get the idea. It’s complicated. There is no “now” in distributed systems. Or rather, if you define “now” as the border between things that were and things that will be, “now” can vary among the participants. Oh, and because the network is not reliable, any of these messages can be lost or received multiple times.
Simplified flow
The above flow is already simplified, as it does not describe all the messages required, but it can be reduced a bit further to these basic milestones:
- Someone knows X
- Everyone knows X
- Someone knows that everyone knows X
- Everyone knows that everyone knows X
And that, kids, is why we can’t have nice things. Distributed consensus systems are hard and ugly. Don’t even think about coming up with one yourself. Just as you don’t roll your own crypto, you should rely on proven consensus algorithms like Raft or Paxos, and ideally use a respectable library that implements them. Or use ZooKeeper.
This is actually a continuation of the theme I touched on in my previous article, Resilience lessons from aerospace disasters. Probably the first thing you should ask yourself is “do I really need a distributed system?” Or more precisely, does it make business sense to pay for one? Sometimes the answer is yes. But only sometimes.