Data stores in Software Architectures

Use Cases

There are many ways to store your data. In this article we'll walk through some examples of data storage in common system designs.

Reminder: There is no single best storage choice and they may vary heavily depending on things such as access patterns and scale. Take these suggestions as a frame of reference, and not as the exact solution to your architectural design!

A social network

Let's say you need to store a user's LinkedIn profile and also maintain the relationship between that user and the user's connections. In this case, a graph database like Neo4j is suitable, since we care about relationship properties at a high scale. It is worth mentioning that at lower levels of scale, SQL also does a good job of maintaining relationships (hence the term relational data), albeit the potential need for complex join queries.

Solutions: Graph Databases (ex: Neo4j)

Runner-ups: Relational Databases (ex: SQL)

Amazon items (real-time product data)

When we think of Amazon, we think of huge item inventories. Items also have many distinct attributes based on the item type. For example, a sofa on Amazon may have attributes such as wood stain, upholstery color, materials (leather), but a book on Amazon will have different attributes such as pages, author, and publishing company. These properties usually don't change once an item is cataloged, however other properties, such as ratings, reviews, and other useful metrics such as the number of times a customer purchased the item, can be very useful attributes to fetch in real-time.

For this reason, NoSQL Document Databases (MongoDB, DynamoDB*) or Columnar Databases (Cassandra) are a good choice. Both of these databases can handle read/writes of large amounts of unstructured data with very good latency and availability, at the cost of consistency. These databases are suitable for real-time operations.

Normally, this kind of data is also stored behind a distributed cache, to reduce the network calls to databases.

It is also worth mentioning that SQL could be used instead to store some of the real-time metadata (i.e. number of purchases, ratings, reviews), fronted by a cache. The unstructured data could then be stored into a different data source. While some SQL databases support JSON, performing operations to fetch data from JSON in rows is not as optimal as making read queries on Document Databases. Other major drawbacks with this approach would be overloaded writes, since some metrics, such as the number of purchases across all products, will have very high TPS. A single master SQL writer is not sufficient at the Amazon scale for this, meaning we would need a multi-master setup to horizontally scale the SQL database with sharding and consistent hashing.

Solutions: Document Databases / Columnar Databases (ex: MongoDB, DynamoDB, Cassandra) + Distributed Cache

Runner-ups: - Relational Databases (ex: SQL) + Distributed Cache

Crypto trends within the last year

In this problem, we want to figure out the statistical data of cryptocurrency within the last year, down to a minute's granularity. We have millions of data points which consist of transactions, which ultimately indicate the price, volume and timestamps. A few ideas that come to mind:

  • If we don't need real-time processing
    • We can input this data into a Hadoop cluster and have a batch ETL (extract, transform and load) job to store the data into a Data Warehouse, such as Amazon Redshift.
    • We could then schedule a daily report job with cron, which might take hours or days. Data Warehouses are a good choice for these queries, since they are optimized for reporting and analysis and not intended for frequent writes.
  • If we do need real-time processing
    • We could input this data into Spark Streaming and store the output into a database like Cassandra or a message queue like RabbitMQ for notifications.
    • Services can then query the output database or receive the messages from the message queue to return real time data back to the user

Since we have big data with respect to time, we can use something like a Time Series Database (TSDB) such as InfluxDB, rather than Data Warehouse or Cassandra. These are more optimized than your traditional databases, provided that your data has time as an axis. If your data isn't a time series, this database would not be a great choice. If your data could ever change, or you see a potential need to add different kinds of metadata (not just time series data), then we should look into other types of databases at that point.

Solutions: For offline storage - EMR/Hadoop and Data Warehouse (ex: Redshift). For real-time processing, Spark Streaming and Columnar Databases (ex: Cassandra)

Runner-ups: A TSDB (ex: Graphite, InfluxDB) for handling time series data.

Your very own "Google Drive" service

When we think about a file storage service or a media streaming service (i.e. Youtube, Netflix, Spotify), we want to think about BLOBs (binary large objects). For that, an object store is the best choice for this, such as Amazon S3. Although this reasoning is ubiquitous and widely accepted, we should also understand that object stores by themselves will have drawbacks at higher scale. To deliver objects quickly back to the user, we can also use CDNs in combination. Do note though, that CDNs are much more expensive. It might make sense to use a CDN for objects that are very popular, and rely on the object store itself for objects that are rarely fetched.

An alternative approach is to store data into actual hard disks (i.e. Amazon EBS), however there are major scaling drawbacks to this, as well as resiliency issues. For that reason, it is hard to list that as a runner-up. :(

Solutions: Object Store (ex: Amazon S3) + CDN

Runner-ups: -

Text Autocomplete Service

Text autocomplete is so popular that a solution was made specifically for it - search engines. These include well-known names like Elasticsearch and Solr, which are powered by Lucene under the hood. These come with nifty features out of the box, such as fuzzy searches and keyword scoring for relevance matches.

Although the solutions above are hard winners, it is worth noting that before these solutions existed, some relational databases (like PostgreSQL) had fuzzy search capabilities out of the box that are also quite good. For smaller scale solutions, this would be just fine.

Solutions: Search Engines (ex: Elasticsearch, Solr)

Runner-ups: SQL with Fuzzy Search capabilities (ex: PostgreSQL)

Collaborative Whiteboard

Collaborative whiteboards are 1. real-time and 2. need to account for synchronization.

For the real-time component, we want very quick latencies when we write on the board and pick up new changes. For this, WebSockets come to mind, since WebSockets allow for very fast bi-directional communication by upgrading the HTTP header. On a single WebSocket server, this is no issue.

When we have a need for a cluster of WebSocket servers to handle high traffic, we can use a load balancer to re-direct the user to one of the servers in the WebSocket cluster or layer. At this point, we'll now need to worry about state and synchronization. Since WebSockets are sticky sessions and are persistent, long-running connections, over time they will be uninformed of changes made to the WebSocket channel (i.e. a room for the current whiteboard) that were made in other WebSocket servers. To combat this, we can use a message queue like Apache Kafka or Redis (Redis can be a cache and a queue too!) to send notifications to other WebSocket servers, so that they will know when to update.

Once the WebSocket servers get a notification that they need to sync, they can then pull the latest changes from the underlying database so that the WebSocket servers can stay in sync (latest state). A NoSQL database would be a good choice if we want to write and read at very high TPS to maintain WebSocket states. [3]

Solutions: Event streams (ex: Apache Kafka) or message queues (ex: RabbitMQ) to send notifications about new WebSocket events. NoSQL database to write/read WebSocket events.

Runner-ups: -

Bank Transactions

Say you want to send money as User A to the recipient User B. The process of debiting X money from User A and crediting X money to User B must be done carefully. And by carefully, I mostly mean atomically. If a transaction fails, we should have the ability to roll back to the previously known good state. We should never be in a state where we oversend money or cause money imbalances due to a system error.

To accomplish this, we want strong consistency for our transactional data. SQL is a strong winner here, albeit the few challenges we would have with scaling. We could scale the SQL using techniques such as indexing, federation, data denormalization (to have less join queries), and moving some data out to another data source (i.e. NoSQL).

Beyond that, we can horizontally scale the SQL system by having read replicas to make reads quicker. Traditionally this can be achieved with the single leader replication approach - one leader node for writes and multiple follower nodes for reads and replication. If there are a lot of write transactions hitting the one single leader node, we might end up with a performance bottleneck.

We can have multiple workers and use sharding and consistent hashing so that we can distribute writes evenly. The drawback of this solution however is that transactions may now be eventually consistent, rather than strongly consistent. To fix that issue, we can use a quorum consensus and enforce a specific number of writes to \(W\) nodes and a specific number of reads from \(R\) nodes to achieve quorum. This is formalized as: \(W + R > N\), meaning that if there are more writes and reads going out to a number of nodes greater than \(N\), then that means at least one read node would have the latest up-to-date data - ultimately, this means that if your \(W + R > N\), then we can have strong read consistency. Conversely, if \(W + R <= N\), then we will have eventual read consistency.

Of course, the work to do all of the above is not trivial and will be very time consuming to implement from scratch. A better solution may be to take advantage of managed database services, such as Amazon Aurora, which will handle replication and scaling for you based on the configuration you supply it. Alternatively, some managed NoSQL databases can be tuned to have strong consistency, such as Cassandra or DynamoDB.

The downside of strong read consistency is that availability may have to be sacrificed, which means a solution needs to be ready to handle that scenario. One way to tackle this is to have fail-over or Active-Passive hosts so that the passive machine can replace the active machine in the case of failures.

Solutions: Relational Databases are tried and true for atomic transactions

Runner-ups: NoSQL databases that can guarantee strong consistency might be easier to scale up

What businesses are near you?

Let's talk about how to find businesses around you, given a location (latitude, longitude) and the search radius. When we talk about the map, we talk about the real world location of where you are and the geocoordinates of your current position. There are many ways to design your system to allow for quick geospatial queries, i.e. geohash, quadtrees.

Since quadtrees are a general purpose, in-memory 4-child tree data structures, there is no specific data store that supports using quadtrees for geospatial queries. Quadtrees are sub-divided into 4 children based on some threshold value \(K\), which is a variable that will differ based on your application's use case.

Geohash, on the other hand, is a specific type of algorithm used for geospatial operations, where two-dimensional coordinates are taken as input and a Base32 string encoding is returned as output. There are many libraries out there for geohashing, and Postgres also has the PostGIS plugin, which allows you to calculate geohashes from coordinates in the query body.

By setting a precision level \(N\) on your geohash, we can split the map into \(N\) subdivisions (see image below). The geohash algorithm guarantees that encoded strings with similar prefix are close together. For example, the grid represented by ja83k is guaranteed to be close in proximity to the grid at ja83b. But keep in mind, the inverse is not necessarily true; two grids with completely different strings such as zz193 and ja83k could potentially be very close to each other, based on happenstance of the largest quadrant character (i.e. the first characters, z and j) being right next to each other.

The geohash can then be used as an index to have fast lookups. Simply feed in your coordinates as input, calculate the geohash mapped for those coordinates + precision level, and then do the SQL lookup for the geohash key. Since locations of places and their geocoordinates are not something that gets written as heavily as their read, this is highly recommended for scaling read throughput. Write throughput is not as important, since the rate at which businesses change their location is relatively low. Your system can also choose to apply business updates only once a day, to lower the overall write throughput.

Since we do many more geocoordinate lookups vs. adding new locations and geocoordinates on the map, this design also lends itself well to having additional read replicas for the PostgreSQL database; something that works very well with SQL as opposed to implementing application based sharding. For this reason, PostgreSQL with PostGIS is a great solution that is battle tested.

Alternatively, we can use NoSQL solutions that offer geospatial query support, like Elasticsearch. These solutions can give super quick lookups, but have some issues with indexing performance that need to be considered at higher scale [2].

Solutions: PostgreSQL + PostGIS

Runner-ups: Search engines with Geospatial query support (ex: Elasticsearch, Solr)

Storing Front-end Website Assets

When you make a network request to a server that returns the front-end webpage as a HTML blob, other HTML/JS/CSS is downloaded from the server. This can be problematic if network requests are made from across the globe, resulting in very slow download times. To make matters worse, front-end assets in modern stacks are growing larger and larger in size. Server-side rendering is one way to alleviate the issue, so that the user can delegate the downloading of these assets to the rendering server. But a well accepted solution is to actually distribute all your static content - images, CSS styling, additional Javascript, videos, etc. to a CDN.

Solutions: CDN (ex: AWS Cloudfront, Akamai) with assets compressed

Runner-ups: Object store (ex: Amazon S3) would be suitable - geolocation can be considered

Price Alerts

Let's say you have an app that allows users to add pricing alerts when a certain item is detected at a cheap price. We can send push notifications to clients running this app, which may be of various device types (browser, android, ios). For any kind of notification system, these are things that could be done in the background asynchronously and for that, one idea comes to mind - message queues. Apache Kafka, RabbitMQ, and Amazon SQS are quite popular solutions for this. Do note that most of the data stored in these message queues will be discarded after a period of time, and the message queues may not be resilient. A service can then subscribe to the message queue and process the message, removing the message out of the queue. If a message fails to process, you can add a resiliency layer by adding a Dead Letter Queue for all messages that failed processing. These messages can then be processed at a later time by a Retry Service.

Solutions: Message Queue (ex: Apache Kafka, Amazon SQS, RabbitMQ)

Runner-ups: 3rd party Notification API Services (i.e. Mailgun) can be a simpler alternative for the sole purpose of delivering notifications to your clients.

Afterword

Most of these solutions are derived from plenty of credible sources (e.g., System Design Primer, System Design Interview) as well as watching Youtube videos on various architecture solutions. Again, there is no true one size fits all solution, and it always depends on the constraint of your problem. If you have any suggestions for updating or correcting the article, feel free to leave a comment!

References

[1]: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0262-8

[2]: https://medium.com/@sandermulders/performing-geospatial-queries-at-scale-7b64795d7704

[3]: https://hackernoon.com/scaling-websockets-9a31497af051