Lamport starts talking about distributed systems, Lamport clocks, and state machines around minute 15.
Transcript
His most cited paper
I wanted to talk about your most cited paper, the one titled Time, Clocks, and the Ordering of Events in Distributed Systems. What's the story behind the paper and the problem you were solving with it?
The origin was simple. Somebody sent me a paper on building distributed databases, where you have multiple copies of the data in different places and you need to keep them synchronized in some way. I looked at it and realized that their solution had a problem: it had the property that things would be executed as if they occurred in sequence, but that sequence could be different from the sequence in which they actually happened.
The notion of what “happening before” means is not obvious to most people, but I happened to have learned about special relativity—specifically the space-time view, where space and time are considered together as one four-dimensional thing. Einstein wrote his paper in 1905, and later this four-dimensional view clarified what it means for one event to occur before another.
That notion is: one event happens before another if a signal, traveling no faster than the speed of light, could have been sent from the first event and arrived before the second event occurred.
I realized there was an obvious analogy. The notion of “happens before” in a distributed system is exactly the same, except instead of signals traveling at the speed of light, it's about whether one event could have affected another through messages actually sent in the system.
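This relation can be computed mechanically. A minimal sketch, with invented event names and data structures (not from the paper): each process's local execution order plus every message's send-to-receive pair form the direct edges, and "happens before" is their transitive closure.

```python
def happens_before(events, messages):
    """Compute the "happens before" partial order.

    events:   dict mapping each process to its events in local execution order.
    messages: list of (send_event, receive_event) pairs.
    Returns the set of pairs (a, b) such that a happens before b."""
    # Direct edges: consecutive events on one process, plus every message.
    edges = set(messages)
    for seq in events.values():
        edges.update(zip(seq, seq[1:]))
    # "Happens before" is the transitive closure of those edges.
    order = set(edges)
    while True:
        extra = {(a, c) for a, b in order for b2, c in order if b == b2} - order
        if not extra:
            return order
        order |= extra

# Two processes; P's first event sends a message received at Q's second event.
order = happens_before({"P": ["p1", "p2"], "Q": ["q1", "q2"]},
                       [("p1", "q2")])
```

Here `("p1", "q2")` is in `order`, while neither `("p2", "q2")` nor `("q2", "p2")` is: p2 and q2 are concurrent, since no chain of messages connects them.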
What really surprised people was this definition of “happens before.” It was also one of the first papers in distributed systems that I would call scientific.
I may have made the mistake—something I had been warned against—of putting two ideas in one paper. The other idea was that there exists an algorithm that produces an ordering satisfying this condition: if one event happens before another, then it appears before it in the ordering.
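That algorithm is what became known as Lamport's logical clocks. A minimal sketch of the standard rules, with process IDs breaking ties so the partial order extends to a total one (the class and method names are my own):

```python
class LamportClock:
    """Logical clock: the update rules guarantee that if event a happens
    before event b, then a's timestamp is smaller than b's."""

    def __init__(self, process_id):
        self.process_id = process_id
        self.counter = 0

    def local_event(self):
        # Each event advances the local counter.
        self.counter += 1
        return (self.counter, self.process_id)

    def send(self):
        # The send event's timestamp is attached to the message.
        return self.local_event()

    def receive(self, message_timestamp):
        # Jump past the sender's clock so the receive is ordered after the send.
        self.counter = max(self.counter, message_timestamp[0])
        return self.local_event()

p, q = LamportClock("P"), LamportClock("Q")
t_send = p.send()           # (1, "P")
t_recv = q.receive(t_send)  # (2, "Q"): strictly later than the send
```

Sorting all events by their (counter, process_id) pairs yields a total order in which every event appears after everything that happened before it.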
I realized that if you had such an algorithm, you could use it to provide the synchronization needed for any distributed system. You could describe the system in terms of a state machine.
A state machine, as I described it, has a state, and processes execute commands in order. Each command changes the state and produces a value. So you can describe the system entirely in terms of how commands affect the state and what outputs they produce.
This seemed obvious to me, but in practice, that was the most important idea in the paper. It showed that distributed systems can be built by thinking in terms of state machines and reasoning about concurrent systems in that way.
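In code, the point is that a deterministic state machine fed the same commands in the same order always reaches the same state, so replicas that agree on the command order stay synchronized. A toy sketch (the key-value command set is my invented example):

```python
class StateMachine:
    """Deterministic state machine: each command updates the state and
    produces an output; behavior depends only on state and command."""

    def __init__(self):
        self.state = {}

    def execute(self, command):
        op, key, *rest = command
        if op == "set":
            self.state[key] = rest[0]
            return "ok"
        if op == "get":
            return self.state.get(key)
        raise ValueError(f"unknown command: {op}")

# Two replicas applying the same agreed-upon log end in the same state.
log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]
a, b = StateMachine(), StateMachine()
for cmd in log:
    a.execute(cmd)
    b.execute(cmd)
```

All the synchronization work reduces to agreeing on the order of commands in the log; the state machine itself needs no knowledge of the other replicas.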
However, that part was almost completely ignored. In fact, twice I spoke to people about the paper and they said there was nothing in it about state machines. I had to go back and reread it to make sure I wasn’t imagining things—it really was there.
This idea is also important for another reason. If you're trying to understand a concurrent program, most are written assuming atomic actions—so you think of execution as a sequence of events.
But to understand why a program produces the correct answer, you don't focus on the initial input—that quickly becomes irrelevant. What matters is the current state. The only thing that determines what the program does next is its state.
So the way to understand a program is to identify a property of the state at each point that guarantees the final result will be correct. That property is a boolean function of the state called an invariant.
Understanding the invariant is the key to understanding the system.
I realized that this applies equally to concurrent systems. People often try to reason about behavior in terms of sequences of events, but the number of possible sequences grows exponentially, making it easy to miss cases.
In contrast, reasoning using invariants is much more manageable. While the number of possible executions is exponential, the complexity of invariant-based reasoning grows much more slowly—roughly quadratically with the number of processes.
That’s why invariants are a better method. Still, for a long time, people in distributed systems theory focused on methods based on partial orderings. While those approaches can work in some cases—like the bakery algorithm—they are the exception.
In practice, the method that reliably works is the use of invariants.
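The invariant method can even be mechanized: enumerate the reachable states and check a boolean predicate on each, rather than reasoning about every interleaving. A toy sketch (the two-incrementer model is my illustrative example, not Lamport's):

```python
def check_invariant(initial, next_states, invariant):
    """Explore every reachable state and check the invariant on each.
    Returns a violating state, or None if the invariant always holds."""
    seen, frontier = {initial}, [initial]
    while frontier:
        state = frontier.pop()
        if not invariant(state):
            return state
        for successor in next_states(state):
            if successor not in seen:
                seen.add(successor)
                frontier.append(successor)
    return None

# Model: two processes each atomically increment a shared counter once.
# State: (counter value, process 1 done, process 2 done).
def next_states(state):
    x, done1, done2 = state
    steps = []
    if not done1:
        steps.append((x + 1, True, done2))
    if not done2:
        steps.append((x + 1, done1, True))
    return steps

# Invariant: the counter always equals the number of completed increments.
invariant = lambda s: s[0] == int(s[1]) + int(s[2])
violation = check_invariant((0, False, False), next_states, invariant)
# violation is None: the invariant holds in every reachable state
```

The invariant is checked once per reachable state, whereas an argument about execution sequences would have to consider every interleaving of the two processes.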
The "Byzantine Generals" problem
I think that's something we hear about and learn in computer science courses in college. The name is great, and I want to know the story behind that problem.
After I wrote that Time, Clocks paper, which tells you how to build a distributed system assuming no failures, it was obvious that one reason for distributed systems is that you have multiple computers so that if one fails, you can keep going.
In particular, that was the problem being solved at SRI when I joined, but before I got there I had already started working on it. I didn’t have any clear idea of what failures to assume, so I assumed the worst possible case: that a failed process might do absolutely anything.
I came up with an algorithm that would implement a state machine under that assumption. The algorithm used digital signatures. It relied on the fact that a faulty process might do anything, but it could not forge the signature of another process. That means a message can be trusted to have come from a particular sender. You can relay messages, and recipients can verify that the relayed message is exactly the one originally sent.
When I got to SRI, I realized that people there were trying to solve the same problem, but there were two differences. First, this was around 1975, and very few people knew about digital signatures. I happened to know about them because Whitfield Diffie—one of the authors of the Diffie-Hellman paper—was a friend of mine.
At one point we were at a coffee house, and he was describing the problem of building digital signatures, which hadn’t yet been solved. I said it seemed straightforward, and I literally wrote the first digital signature algorithm on a napkin. It wasn’t practical at the time because it required something like 128 bits to sign a single bit of data. That can be improved by signing a hash instead of the full document, assuming the hash is secure and can’t be reversed or forged.
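The napkin scheme is generally identified with what is now called a Lamport one-time signature, which also explains the cost he mentions: every bit of the (hashed) message consumes one of two per-bit secrets. A minimal sketch using SHA-256, with key sizes and names of my choosing:

```python
import hashlib
import secrets

def keygen(bits=256):
    """For each message bit, two random secrets; the public key is
    their hashes. The key pair must be used for only one signature."""
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(bits)]
    pk = [(hashlib.sha256(a).digest(), hashlib.sha256(b).digest()) for a, b in sk]
    return sk, pk

def message_bits(message, n):
    # Sign a hash of the document rather than the document itself.
    digest = hashlib.sha256(message).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(n)]

def sign(message, sk):
    # Reveal, for each bit, the secret corresponding to that bit's value.
    return [sk[i][b] for i, b in enumerate(message_bits(message, len(sk)))]

def verify(message, sig, pk):
    # Each revealed secret must hash to the matching half of the public key.
    return all(hashlib.sha256(s).digest() == pk[i][b]
               for i, (s, b) in enumerate(zip(sig, message_bits(message, len(pk)))))

sk, pk = keygen()
sig = sign(b"attack at dawn", sk)
```

A forger who sees one signature learns only the secrets for that message's bit pattern; flipping any bit of the message would require a preimage of the unrevealed hash.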
So digital signatures were part of my toolkit. The people at SRI didn’t have that, but they had a nicer abstraction. Instead of agreeing on a sequence of commands, they focused on reaching agreement on a single command, and then repeating that process. This was a cleaner way of describing the problem.
The first paper we published included both approaches. Since they didn’t use digital signatures, their algorithm required more processes: to tolerate one faulty process, you needed four processes. With digital signatures, you only needed three.
The algorithm without digital signatures was more complicated. The general solution for multiple faults was a work of genius—almost incomprehensible. You had to read through a very complex proof showing that to tolerate n faulty processes, you needed 3n + 1 processes, whereas with digital signatures n + 2 were enough. The simpler case with a single fault wasn’t too hard, but the general case—developed by Marshall Pease—was brilliant.
Later, I found a simpler inductive proof, but the original result was remarkable.
We published the paper, and I realized that this idea of Byzantine faults—where a process can behave arbitrarily—was fundamental. I had assumed arbitrary failures simply because I didn’t know what else to assume, but the people at SRI had a concrete motivation: they were building systems for aircraft control.
They needed to handle the possibility that a component could behave unpredictably or even maliciously. Every time you tried to design an algorithm assuming fewer processes—say three for one fault—you could always find some plausible failure scenario that broke it. That’s why four were necessary.
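The four-process bound can be seen in a small simulation of the single-fault oral-messages algorithm (OM(1)) from the Byzantine Generals paper: the commander sends its value, every lieutenant relays what it received to the others, and each lieutenant takes the majority of the three values it holds. Modeling the traitor as sending random values is just one illustrative choice; a real traitor may do anything.

```python
import random
from collections import Counter

def om1(commander_value, traitor):
    """OM(1) with four generals, one of whom (index `traitor`) may lie.
    General 0 is the commander; 1..3 are lieutenants.
    Returns each loyal-or-not lieutenant's decision."""
    def send(sender, value):
        # A traitor may send anything; loyal generals send faithfully.
        return random.choice(["attack", "retreat"]) if sender == traitor else value

    # Round 1: the commander sends its value to every lieutenant.
    received = {lt: send(0, commander_value) for lt in range(1, 4)}
    # Round 2: each lieutenant relays the value it received to the others.
    relayed = {lt: [send(peer, received[peer])
                    for peer in range(1, 4) if peer != lt]
               for lt in range(1, 4)}
    # Decision: majority of own value plus the two relayed values.
    return {lt: Counter([received[lt]] + relayed[lt]).most_common(1)[0][0]
            for lt in range(1, 4)}
```

With a traitorous lieutenant, the two loyal lieutenants still decide on the commander's value; with a traitorous commander, all three lieutenants still agree with each other. With only three generals the same relay-and-vote scheme can be defeated, which is the content of the impossibility result.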
At the time, digital signatures were expensive, so that version of the algorithm wasn’t widely used. Today, computation is cheap, so that might be different.
I remember speaking with an engineer at Boeing, and he told me that when they read the paper, their reaction was simply: “Oh, we need four computers.”
I realized this was an important result that needed to be widely understood. I had learned something from Edsger Dijkstra. He wrote a paper called The Dining Philosophers Problem. The underlying problem wasn’t particularly important, but it had a memorable story—a group of philosophers sharing forks—which made it popular.
So I decided our work needed a good story. I invented the Byzantine generals scenario: a group of generals must agree whether to attack or retreat. If all attack, they win. If too few attack, they lose. But one of the generals might be a traitor.
The problem becomes: how can they reach agreement despite that possibility? That framing made the problem intuitive and memorable.
Originally, there was a related problem called the “Chinese Generals Problem,” described by Jim Gray as an impossibility result. That inspired the idea of generals.
At first, I considered calling them Albanian generals, since Albania at the time was very isolated. But someone pointed out that there are Albanians in the world. I realized that “Byzantine” was better—there are no Byzantines anymore—and that made it the perfect name.
So while this wasn’t the first formulation of the problem, it was the first time it had a memorable name and clear framing.
Why was the problem interesting? Because it was obvious that computers would eventually control airplanes. This was during the oil crisis in the 1970s, and engineers knew they could make more fuel-efficient planes by reducing control surface size—but that made planes aerodynamically unstable. Humans couldn’t manage the adjustments fast enough, but computers could.
So it was clear that computers would be flying airplanes. People thought that to tolerate one failure, you needed three computers. They didn’t realize that with arbitrary failures, you need four. That made the result extremely important.
Throughout my career, I worked in industry rather than academia. Many of the problems I worked on came from practical needs—engineers coming to me with real issues. Paxos, for example, came from someone needing an algorithm to solve a real problem.