@rkenmi - Distributed scaling with Relational Databases

# Background

A lot of articles will talk about how to scale databases. Typically, they will talk about the purpose and the general idea of sharding and replication, but often times these topics are explained separately and not so much in conjunction. What if you wanted to combine both? Can you combine both? In the real-world, these ideas are commonly used together so it is beneficial to understand how these techniques can be used in practice.

With that in mind, I personally believe that the subject matter of database scaling and in particular, scaling SQL, is a confusing topic. This is primarily due to two reasons. One is due to how sharding and replication is commonly explained (above), especially since the idea of sharding or replication isn't exactly tied to a specific database. In addition, the ideas of sharding or replication aren't really limited to just databases - sharding is very similar to load balancing conceptually and replication can be done on just about any kind of service. The other reason is due to the ACID compliant nature of a single-server SQL database and how that becomes much more fuzzier when you transition over to SQL on distributed systems and you want to keep the same ACID guarantees such as atomicity or strong consistency.

Instead, most articles will just point you to the alternatives to SQL (i.e. NoSQL) for situations where we need high write throughputs, or to store big data (petabytes and beyond) or have incredibly low latency (tens of milliseconds) on reads/writes. Or maybe you just want to stick to NoSQL because it handles the general job of scaling for you under the hood. All of these are valid reasons.

But transitioning from one database to another can be one heck of a change that will require lots of coordination across different teams. It is a huge commitment of time and money, so most companies will try to optimize what they already have (since SQL is a solid database choice for most use cases) and only when there are hard bottlenecks should companies really start to invest in such a large-scale transition.

This detail, unfortunately, is the material that is neglected more often than not, missing from a lot of system design explanations that talk only about general purpose sharding algorithms and replication methods. The focus of this article is to talk about optimizing SQL databases in the context of distributed systems, before we decide to move on to NoSQL alternatives. There are various techniques to optimize SQL before going into the distributed systems space, which are covered here.

# Replicas for higher read throughput

The easiest optimization with SQL by far is adding replicas. This is used for replication and they offer a few amazing benefits:

• Fault Tolerance
• In the case of leader-follower replications, if the leader dies, a follower or replica node can take over via failover
• More Read Throughput
• Theoretically, if you had one node vs. three nodes all with the exact same copy of data, and you process one read per second with one node, then you should be able to handle 3 reads per second with three nodes. This is because only one of the nodes is needed to handle a single read request, so 3x the nodes means 3x the concurrent reads.
• Lower Latency Reads
• This benefit is slightly different from above. Replicas aren't limited to the same geolocation as the write node. This means you can have a read replica spread across various regions - when a read request comes in, the closest read replica (by region) can be chosen, for faster reads.

Most storage engines come with at least two replication modes: single-leader replication and multi-leader replication. The replication is done asynchronously by default in traditional implementations (since synchronous replication will sacrifice availability by blocking subsequent writes until the replication first completes). This means some replications may be fast, others might be slow, but eventually the states of all the replicas will converge to the same state. This is also known as eventual consistency.

Single-leader replication is probably the most popular leader-follower implementation and it simply means you have one node dedicated to writes and $$n$$ follower (or replica) nodes that are there primarily for reads. Majority of SQL systems use single-leader replication with great success, especially websites that often have orders of magnitude more reads than writes. InnoDB, a storage engine for MySQL, does replication asynchronously by default.

Multi-leader replication involves having more than one node that can handle writes - multiple writers. Each write request will be picked up by any one of the leader nodes, but the changes will be asynchronously replicated to other leader nodes as well as follower nodes. Multi-leader is often used for systems spanning different data centers across regions, i.e. one leader node for us-east-1 region, and another leader node for us-west-1.

Since the network delay between data centers is considerably high, it does not make sense to synchronously replicate updates between leader nodes, each representing a separate data center. For this reason, multi-leader replication is done asynchronously in practice.

# Strong Consistency and Distributed Nodes

Quorum consensus is a popular formula to determine how many writes or reads (minimum) it takes to say that a transaction is valid. However quorum consensus is generally used in systems with leaderless replication, such as Cassandra or DynamoDB where any number of nodes could be writers or readers.

With leader-follower replication, you have a dedicated number of writers and readers - 1 writer and $$n$$ readers for single-leader replication. Having strong consistency, which is arguably the most attractive guarantee from ACID, is slightly more complicated with this setup but is achievable.

In general, with single-leader replication, the leader will broadcast and write down the sequence order of the transactions it executes to all of the read replicas, so that the order will be the same across every node. However, most leader-follow replications take place asynchronously, so that nodes will be highly available (and not block on any pending replication). For this reason, any incoming reads right after a database entry update might give you back up-to-date results or stale results. Outdated results can be seen when the read request hits a replica that has not finished the replication process yet.

This introduces a race condition where transaction $$A$$ had come before transaction $$B$$, but some nodes that are slow at replication can receive transaction $$B$$ first, then $$A$$. In addition, if the leader node happens to die while the replication is going on, we need to have a new leader node somehow.

To solve the incorrect ordering race condition, there are two ways.

The first way is 2PC, or Two-Phase Protocol, which is not recommended for reasons listed here.

The other way is consensus, where all participant nodes can use a consensus algorithm to a. have majority of the nodes agree on the correct ordering of transactions to execute, and b. to elect a new leader and maintain leadership across the nodes if any of the nodes happen to die. If majority of the nodes agree on a certain order, then every node must write the transactions in that exact order. If a leader node dies, there will always be a failover mechanism among the nodes. These changes enable the system to be linearizable (strongly consistent reads) and fault-tolerant since the system will now always be highly available and total ordering of operations is guaranteed by having a majority vote from the quorum.

Note: Guaranteeing the total order with single-leader replication is achievable, but it is based on the assumption that a single leader node can sufficiently handle all of the write-throughput.

To actually implement consensus, an external service like Apache Zookeeper can be used to elect leaders or followers, or detect health changes on any of the nodes. Alternatively, the gossip protocol can be used to detect which nodes are down - each node will incur more storage overhead and logic by keeping track of the health of other nodes via periodic heartbeats. Do note that, with the use of consensus, there will be a performance penalty as quorum consensus may always take more time than no consensus at all - this is naturally the case as there are more network calls needed to achieve consensus.

Multi-leader replication or leaderless replication on the other hand, has added complexity since these replication methods are generally asynchronous. Consensus algorithms, such as leader election above, doesn't really work in this context when there are multiple leaders or no leaders at all. In practice, many multi-leader / leaderless replications do not guarantee linearizable reads.

# Higher write throughput

Single-leader replication at the partition (shard) level

Remember when I mentioned that sharding is often combined with replication? You heard me right - you can have single-leader replication and have good write throughput too by combining sharding.

There is a trade-off however, which is that since writes are now partitioned, we cannot have the total order guarantees, since the order of operations in two different partitions is ambiguous. This is analogous with a single partition for topic $$A$$ inside Apache Kafka having a guarantee of correct ordering, but ordering across two or more partitions for topic $$A$$ is ambiguous.

The sharding idea is simple - you can partition your keys based on some hash so that some rows go to one partition, and other rows go to some other partition. This means that a single node may contain more than one partition. Similar to single-leader replication without scaling, one node is associated to one leader. However, the leader/follower association is now at the partition level, rather than the node level. For example, in the diagram above, each node has 3 distinct partitions, 1 for the leader and 2 for the followers.

In a single-leader replication setup, a single node can now be a leader of some partition, and followers of other partitions. This in turn gives multiple read copies for any given partition, but now you also have lots of data or write throughput be distributed equally among the nodes, if your partitioning algorithm is chosen carefully.

One thing to keep in mind is that for any kind of sharding or partitioning on SQL, SQL databases do not auto-magically take care of it for you. Any kind of sharding or partitioning is done at the application-level - most SQL engines leave it up to the clients to choose how they want to shard/partition. This is different from NoSQL databases where sharding is done internally under the hood.

## Auto-balancing Partitions

Developers will need to come up with a partitioning scheme themselves, and the application will make the decision to route the new writes to the correct database partition nodes. We will need our own algorithms for partitioning schemes, consistent hashing and virtual nodes to ensure that your application can always consistently write to or read from a valid partition, and ensure that some type of failover exists if a partition is dead. I won't get too deep into the details for these techniques here, but their information is widely available on the web.

# Summary

Here is a quick recap on the optimizations we talked about.

For general optimizations:

For higher read throughput:

• Use RDBMS replications (single-leader or multi-leader)
• Most RDBMS replications are done asynchronously, which means reads will be eventually consistent
• Use single-leader for replications within the same data center
• Use multi-leader for replications across different data centers

For higher read throughput with strong consistency reads:

• Use single-leader replication with consensus algorithms to vote on the total ordering of operations
• i.e. using an external service like Apache Zookeeper
• Your single leader node must be able to handle all write traffic.
• If it cannot, you would need to shard, but at that point you cannot have linearizable reads - reads will be eventually consistent.
• Be mindful that network round-trip times for reads will be slower due to consensus operations (which involve communicating with other nodes, one way or another).

For very high read and write throughput:

• Use replication combined with partitioning.
• No strong consistency guarantees with sharding
• Needs to handle cases where nodes may be added or removed when sharding
• Needs to handle fault tolerance cases at the shard level

# A final note

As you could probably see, there is a lot of difficulty and work involved in scaling SQL, so it is easy to see why we often don't talk about scaling SQL so much nowadays. Many of these techniques, such as consistent hashing or the gossip protocol, comes out of the box from NoSQL solutions such as Cassandra. They handle scaling for you in such a way that you, the developer, have less operational things to worry about.

With that said, I argue that it is important to see how SQL can be scaled anyway. Seeing the trade-offs of SQL from a single server to multiple servers gives you good insight on what your data storage requirements are and what trade-offs you can and cannot make. Only after you understand your requirements, will you know whether or not to keep SQL as your data store.