CAP Patterns

The CAP Theorem dictates that only two of its three characteristics can be guaranteed at any given time.

Intro to CAP

  • Consistency
    • Every read will be based off of the latest write
  • Availability
    • Every request will be given a response, although the response data might be stale
  • Partition Tolerance
    • It can handle network partitions or network failures

MTV's The Real World

If your service is in the cloud, the P in Partitioning has to always be accounted for. This is because networks aren't reliable - network outages, router malfunctions, etc.

This means that in reality, we have to choose P and either C or A in the CAP theorem for our service.

That is:

CP - Consistency + Partition Tolerance

AP - Availability + Partition Tolerance

So in production, our choices are pretty limited. We can only choose between CP or AP.

CP - Consistency/Partition Tolerance

You enforce consistency with this pattern. Waiting for a response from the partitioned node might result in a timeout error. CP is a good choice if your business needs require atomic reads and writes.

AP - Availability/Partition Tolerance

You enforce availability with this pattern. Your service won't time out, but you might give back outdated data. If your business allows for eventual consistency, go for it.

Consistency Types

For consistency, we can achieve one of the following options, with pros/cons for each:

  • Weak Consistency
    • After a write, new reads may or may not see it
    • ex: Memcached
  • Eventual Consistency
    • After a write, new reads should see it soon, usually within \(X\) milliseconds.
    • ex: DynamoDB
  • Strong Consistency
    • After a write, new reads will see it (atomic transaction)
    • ex: RDBMS

Availability Types

For availability, we can use one of the following patterns

Failover/Redundancy

Failover is commonly used for having load balancers that route to additional load balancers, to prevent single points of failure in availability.

  • Active-Passive
    • Requests hit the active host and the passive host maintains a heartbeat with the active host. If the heartbeat gets lost, then the passive host takes over and becomes the active host.
  • Active-Active
    • Requests hit both active hosts and traffic gets managed between both hosts.

But Active-Passive and Active-Active can be also used to provide redundancy for services and are not limited to just load balancers.

For example, if we want to enforce Active-Passive failover for a service cluster, we can use Apache Zookeeper to do Service Discovery and Leader Election, so that if the current active host (with ID = 0) dies, the next passive host (with ID + 1) can get notified and take over as the new leader. All passive hosts could get notified of the active host dying, but that can lead to the thundering herd problem, where it causes a spike of traffic due to the many concurrent notification pings. That is bad for production!

For an example Zookeeper project, take a look at example-node-docker-zoo.

Replication

Replication is a common pattern used for managing RDBMS at scale.

  • Master-Master Replication
    • All master hosts handle writes and reads and coordinates with each other. If either master goes down, the other master will take over.
    • Violates ACID since data is loosely consistent now.
    • Replication lag is higher
  • Master-Slave Replication
    • The master host handles writes. When a write comes in, it gets replicated to slave hosts
    • The slave hosts handle reads. A slave host can also replicate to other slave hosts
    • If the master goes down, the system can still enter a read-only mode while a slave host becomes promoted to a master

It is worth noting that adding more replicas increases latency, complexity and costs.

But consistency...

In a distributed system with Master-Master hosts or Master-Slave hosts, ACID properties are not easy to come by. To have strong consistency, you would need to apply the Quorum Consensus algorithm to ensure that there is at least one host between the Read hosts and the Write hosts that takes care of both reads AND writes.

Where \(R\) is a Read host and \(W\) is a Write host:

Strongly Consistent: \(R + W > 5\)
Eventually Consistent: \(R + W <= 5\)

For example, if there are \(5\) hosts, \(3\) could be for writes, \(3\) could be for reads. With Quorum Consensus, a versioning number is added for every database table row. This versioning number can be verified across a cluster of Read hosts, giving priority to whichever row has the highest version number among the Read hosts.

Availability Measurement

Availability is generally measured with the number of 9 digits. For example, 99.9% would be three 9's, and 99.99% would be four 9's.

When measuring availability for services that depend on other services, the availability calculation is made as follows:

Sequential

If service Foo depends on service Bar to start first, then we can say that the availability in sequence is determined as follows:

Availability (Total) = Availability (Foo) * Availability (Bar)

The more sequential components there are in a system, the lower the availability.

Parallel

If service Foo can function independently of service Bar, then we can say that the availability in parallel is determined as follows:

Availability (Total) = 1 - (1 - Availability (Foo)) * (1 - Availability (Bar))

The more parallel components there are in a system, the higher the availability.

Related

RDBMS Optimization