Data Sharding: Twitter Posts

Scenario

Let's begin with a Twitter-like service that allows you to tweet new posts. The service has very high read and write traffic , we'll say ~10k read TPS, or transactions-per-second for starters. We use a relational database for writes since they are stable, atomic, and resilient. Write performance on a single master database server will not be sufficient at this scale, so we need to incorporate a multi-master setup. This means writes will have to be distributed to multiple database servers and we will have to shard the data and try to achieve equal distribution among all the servers.

Our service needs a way to:

  • Quickly find a range of tweets by creation date
  • Handle the write/read requirements at the scale mentioned

Schema

For simplicity we'll focus on a very basic schema that represents a tweet post. We'll skip going into other possible fields, such as media, relationships / foreign keys, and other types of metadata.

  • Tweet ID (longint)
  • Timestamp (longint)
  • Tweet Poster User ID (char 64)

Also for simplicity, we'll not worry too much about how these values may be generated (in the case of Tweet ID) or populated.

Tweet ID can be a simple longint value that is incremented (MySQL auto-increment fashion) so that we can keep the number of bytes down to 8. This means we wouldn't want something like a UUID, since UUIDs are 128 bits, which in hexdigest will take up \(\frac{128 bits}{4 bits} = 32\) characters, since each hexadecimal character only takes up 4 bits. A 32 character long ID can take up anywhere from 32-128 bytes in a SQL row. We'll assume that the problem of uniqueness in the sequence IDs are taken care of already (via a distributed ID generator).

Timestamp can be the epoch timestamp in milliseconds, also represented by a longint. We could alternatively use a DATETIME field that is most likely out-of-the-box with many database solutions.

We'll assume User IDs are around 64 bits in size.

Scaling

We can start off by considering a basic hash of the Tweet ID and modulo that value with the total number of servers, to determine which server to store that Tweet. This has issues with adding/removing servers when scaling up or down, so we can apply consistent hashing to allow our service to horizontally scale.

Partitioning

To determine the shard ID or partition ID, we can use various approaches. Depending on what we shard on, the read/write performance can be deeply impacted.

Hash by Tweet ID

We chose a Tweet ID for the brute force approach, which will equally distribute the tweet posts among all the database servers. However, one problem with this approach is that if we need to search for a range of tweets, then we would have to query all of the servers, collate them, and sort them (assuming that we want to search by creation date) before we give them back to the client. This can be inefficient, because each database server would have to perform a lookup for all tweets that it might have in the given range, and that lookup needs to also sort by timestamp. We can use a secondary index on the timestamp to achieve faster lookups, but this also results in slower writes.

Hash by User ID

If we choose the User ID for the hash, then we would potentially have more problems. When we search for tweets, we might still find tweets from various users, meaning we would still have to potentially query all of the servers. The sorting problem will also be present from the Tweet ID approach. To top it off, the User ID approach introduces a new problem of hotspots, where a very popular user (such as Kanye West) might have an incredibly higher number of reads vs. other servers, causing an unequal distribution of traffic.

Hash by Timestamp

If we choose the Timestamp for the hash, we will still introduce unequal distribution and hotspots since the server that contains the most recent posts will have higher traffic than the older posts.

Hash by Tweet ID + Timestamp

The bottleneck of the Tweet ID hash is the inability to efficiently sort at scale. We could instead create a Tweet ID in such a way that it incorporates both a sequence AND a timestamp, so that we can achieve uniform distribution via sharding and also have efficient lookup (with sort) by the database servers. To do this, we can use the Twitter Snowflake ID pattern to reserve at least the first 41 bits for the epoch timestamp and 12 bits for the sequence ID.

This allows database servers to fetch the latest tweets fast since the timestamp is built into the ID and is inherently sortable by its value. We would not need a secondary index nor another column (i.e. Timestamp), which reduces write speeds significantly.

We can combine the epoch timestamp and sequence number, hash the result and use that to determine where to shard. Keep in mind that we may still need to query all the servers since we are searching for a set of tweets, but overall this is more faster and efficient vs. hashing by Tweet ID alone.