The CAP Theorem: A Simple Guide to a Hard Distributed Systems Problem |

Hey readers,

Imagine you’re building a massive application with users all over the world. To make it fast, you have two identical database servers: one in London and one in New York. Your goal is simple: keep the data on both servers perfectly in sync at all times.

One day, a deep-sea cable is damaged, and the network connection between London and New York is severed.

Now you have a serious problem. A user in London updates their profile. A moment later, a user in New York requests that same profile. What should your system do? Should the New York server return the old, out-of-date data? Or should it return an error because it can’t be sure it has the latest version?

This dilemma is the heart of the CAP Theorem. It’s a fundamental principle in distributed systems, first proposed by Eric Brewer, and it states that any distributed data store can only provide two of the following three guarantees: Consistency, Availability, and Partition Tolerance.

To understand this, let’s imagine our two database servers are two librarians, Alice in London and Bob in New York, who must keep their library catalogs perfectly in sync.

1. Consistency

Consistency means that every client who reads from the database gets the most recent write. Every user sees the same data at the same time.

In our library analogy, this means Alice and Bob have the exact same version of the library catalog at all times. If a new book is added in London, it is instantly reflected in New York. A patron asking for the newest book will get the same answer from both Alice and Bob.

2. Availability

Availability means that every request receives a non-error response. The system is always up and running.

For our librarians, this means the library is always open for business. When a patron asks a question, they always get an answer. The answer might not be perfectly up-to-date if the phone lines are down, but they will get an answer. The system never just returns an error message.

3. Partition Tolerance

Partition Tolerance means the system continues to operate despite a “partition” (a communication break) between nodes.

This is the reality of our scenario: the phone line between Alice in London and Bob in New York is cut. They can no longer communicate. In any real-world distributed system, you have to assume that the network will fail at some point. Therefore, Partition Tolerance is a must-have. You cannot sacrifice it.

4. The Choice: You Can Only Pick Two

Since we must accept that network partitions (P) will happen, the real choice is between Consistency and Availability. When a partition occurs, what do you sacrifice?

System 1: Prioritize Consistency over Availability (CP)

If the phone line between Alice and Bob breaks, and we want to maintain perfect consistency, we have to make a hard choice. To prevent their catalogs from becoming out of sync, we might tell Bob in New York to stop accepting any new book checkouts or returns until the phone line is restored.

The system chose Consistency. The data remains perfectly consistent between London and New York (because no new data is being written in New York).
The system sacrificed Availability. The New York library is effectively “down” for any write operations. Patrons in New York who want to check out a book get an error.
Examples: Traditional relational databases like PostgreSQL or MySQL in a multi-master configuration often lean towards being CP.

System 2: Prioritize Availability over Consistency (AP)

If the phone line breaks, and we want to maintain availability, we let both Alice and Bob continue to operate their libraries independently. Alice adds new books in London, and Bob adds new books in New York.

The system chose Availability. Both libraries are fully operational and serving all patrons.
The system sacrificed Consistency. For a while, Alice’s catalog and Bob’s catalog are out of sync. They have different views of the truth. Once the phone line is restored, they will need a process to reconcile their differences. This is known as eventual consistency.
Examples: Many NoSQL databases like Cassandra, DynamoDB, and CouchDB are designed to be AP systems, prioritizing high availability even in the face of network failures.

What’s the next move?

Challenge: Look back at the “SQL vs. NoSQL” article we published a few weeks ago. Based on what you learned there about ACID transactions and BASE guarantees, which side of the CAP theorem do traditional SQL databases usually fall on? What about common NoSQL databases?

Understanding the CAP Theorem is about understanding trade-offs. There is no “right” answer, only the right answer for the specific needs of your application.

Thanks for reading!

Bou~codes and Naima from 10xdev blog.