NoSQL - the Radical Databases

NoSQL

NoSQL is a category of databases that aren't relational. For example, MySQL is a relational database, whereas MongoDB is a NoSQL database.

For decades, relational databases were the tried-and-true, prevalent, and reliable data stores. Today, however, there are so many non-relational databases that NoSQL products arguably outnumber relational ones.

Why did things escalate this way? The simplest explanation is that application needs have evolved over time. As user traffic and data volumes grew, relational databases, which traditionally scale up on a single machine, struggled to scale out across many machines to the level of big data.

NoSQL databases are generally equipped with features that let them scale easily, at the cost of weaker consistency guarantees. They typically follow the BASE principle:

BA - Basically Available (The system keeps answering reads and writes, even during partial failures)

S - Soft state (Doesn't have to be 100% write or read consistent across hosts)

E - Eventually consistent (Once new writes stop arriving, all hosts converge to the same data after a short delay)
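
To make soft state and eventual consistency concrete, here is a minimal Python sketch. It is not how any particular database replicates data: the Replica class, the version numbers, and the sync() pass are all hypothetical, and conflicts are resolved with a simple "last write wins" rule, which is only one of several possible strategies.

```python
class Replica:
    """One host holding its own copy of the data (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        # Keep the write only if it is newer than what we hold (last write wins).
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(replicas):
    """Hypothetical replication pass: every replica learns the newest writes."""
    merged = {}
    for replica in replicas:
        for key, (version, value) in replica.data.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    for replica in replicas:
        for key, (version, value) in merged.items():
            replica.write(key, value, version)


a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], version=1)  # accepted by host a only
stale = b.read("cart")                # host b has not seen the write yet
sync([a, b])                          # replication catches up
fresh = b.read("cart")                # both hosts now agree
```

Between the write and the sync, the two hosts disagree (soft state). After the sync they hold the same value (eventual consistency). The system stayed available for reads the whole time, it just returned stale data from host b.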

Categories

There are four categories of NoSQL databases - that is a lot! Relational databases are generally similar enough to be lumped together into one category, but we can't say the same for NoSQL databases.

Key-value store

What it's like: A hash map

  • Some key-value stores maintain keys in lexicographic order, allowing efficient retrieval of key ranges
  • Typically \(O(1)\) time for lookups and writes
  • A hash map is pretty basic by itself, so the store relies on application code if more operations are needed
  • The basis for more complex NoSQL stores such as document stores

Examples: Memcached, Redis, DynamoDB
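
As a rough illustration of the two access patterns above, the following Python sketch pairs a hash map (for point lookups) with a sorted key list (for range scans). The SortedKeyValueStore class and the key names are made up for this example. Note the trade-off in this toy version: keeping the list sorted makes inserting a new key slower than \(O(1)\), which real stores handle with more suitable structures.

```python
import bisect


class SortedKeyValueStore:
    """Toy key-value store: hash map for lookups, sorted keys for ranges."""

    def __init__(self):
        self._values = {}       # hash map: average O(1) get/put
        self._sorted_keys = []  # kept in lexicographic order for range scans

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._sorted_keys, key)  # O(n) insert in this sketch
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end."""
        lo = bisect.bisect_left(self._sorted_keys, start)
        hi = bisect.bisect_left(self._sorted_keys, end)
        return [(k, self._values[k]) for k in self._sorted_keys[lo:hi]]


store = SortedKeyValueStore()
store.put("user:2", "Grace")
store.put("user:1", "Ada")
store.put("session:9", "expired")

# ';' is the character right after ':', so this covers every "user:" key.
users = store.range("user:", "user;")
```

Anything beyond get, put, and range (filtering by value, joins, aggregation) would have to live in application code, which is the limitation the list above describes.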

Document store

What it's like: A hash map, but with document objects as values (e.g. JSON, XML)

  • Document stores come with APIs or a query language to help work with the internal structure of the document
    • Less application code to write from scratch
  • Documents are organized or grouped together by collections, tags, metadata or directories
    • Fields from Document A and Document B can be completely different, even if they are grouped together

Examples: Elasticsearch, MongoDB, DynamoDB (partial)
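
The idea of querying inside a document can be sketched in a few lines of Python. This is only an illustration with plain dictionaries: get_path() and find() are hypothetical helpers, not the API of MongoDB or any other product, and the two documents are invented to show that documents in one collection need not share fields.

```python
# One collection holding two documents with completely different fields.
collection = {
    "doc-a": {"name": "Ada", "address": {"city": "Los Angeles", "state": "CA"}},
    "doc-b": {"title": "Invoice 7", "total": 42.5},
}


def get_path(document, path):
    """Follow a dotted path such as 'address.city'; None if any part is missing."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs, path, expected):
    """Return the ids of documents whose value at `path` equals `expected`."""
    return [doc_id for doc_id, doc in docs.items() if get_path(doc, path) == expected]


matches = find(collection, "address.city", "Los Angeles")
```

A real document store ships this kind of lookup (plus indexing) as part of its query language, which is why there is less application code to write than with a bare key-value store.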

Columnar

What it's like: A nested map. ColumnFamily<RowKey, Columns<ColKey, Value, Timestamp>>

  • Column Family = A named group of related columns. For example, the "address" column family for one row can be represented as follows:
{
    city: 'Los Angeles', 
    state: 'CA', 
    street: 'Hollywood Blvd'
}
  • Column = A key/value pair. For example, in the map above, city: 'Los Angeles' is a Column.
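  • Super Column Family = A column family whose columns are themselves maps of columns, adding one more level of nesting (a legacy Cassandra concept). The encompassing structure, where row keys, column families, and columns are present, is the table.
  • Columns from one row can be different from another row; rows are not bound to a fixed set of columns. In some stores (HBase, for example) the column families themselves are declared up front.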
  • Suitable for large volumes of data (e.g. petabytes) that can take advantage of well-chosen row keys, since rows are stored in lexicographic order and range lookups over neighboring keys are efficient
    • For example, the Webtable in the Bigtable paper stores pages under reversed hostnames (com.google.maps/index.html for maps.google.com/index.html), so pages from the same domain sit next to each other for better data locality

Examples: Bigtable, HBase, Cassandra
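
The nested-map shape and the row key trick can be sketched in Python as follows. This is a toy model that mirrors the ColumnFamily<RowKey, Columns<ColKey, Value, Timestamp>> layout; the put() and scan() helpers, the "page" column family, and the sample rows are all assumptions made for illustration. A real store keeps rows sorted on disk rather than sorting on every scan.

```python
# family -> row key -> column key -> (value, timestamp)
column_families = {}


def put(family, row_key, column, value, timestamp):
    """Store one column value under a row key inside a column family."""
    rows = column_families.setdefault(family, {})
    rows.setdefault(row_key, {})[column] = (value, timestamp)


def scan(family, prefix):
    """Return (row key, columns) pairs, in key order, for keys with the prefix."""
    rows = column_families.get(family, {})
    return [(key, rows[key]) for key in sorted(rows) if key.startswith(prefix)]


# Reversed hostnames keep pages from the same domain next to each other.
put("page", "com.google.maps/index.html", "title", "Maps", 1)
put("page", "com.google.mail/index.html", "title", "Mail", 1)
put("page", "com.google.mail/index.html", "language", "en", 2)  # extra column
put("page", "org.example.www/index.html", "title", "Example", 1)

google_rows = [key for key, _ in scan("page", "com.google.")]
```

The mail row carries a "language" column that the maps row does not have, which shows that rows in the same column family are free to hold different columns.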

Note: Hadoop and HBase are not the same thing. Hadoop consists of the Hadoop Distributed File System (HDFS), MapReduce, and a resource manager (YARN). HBase is a columnar NoSQL data store. HBase can be used on top of a Hadoop cluster for performant random reads and writes, since HDFS on its own is built for large sequential reads and append-only writes rather than low-latency random access. In a nutshell, Hadoop is suited for offline batch processing, while HBase is suited for real-time data needs.

Graph

What it's like: A graph

  • Optimized to represent complex relationships with many foreign keys or many-to-many relationships
  • Suitable if you need to analyze relationships in data. For example, if you want to see whether students from different universities had the same professor. In relational tables this takes several joins, which gets expensive as the data grows.
  • Newer and less widely adopted than the other categories, so there are fewer development tools and resources.
  • Some graph DBs are accessed mainly through REST APIs, while others provide their own query language and drivers (Neo4j, for example, uses Cypher).

Examples: Neo4j
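
The professor example can be sketched in Python with a plain edge list. This is not how a graph database stores or queries data internally; the student, university, and professor names and the ATTENDS / TAUGHT_BY labels are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Each edge is (source node, relationship label, target node).
edges = [
    ("alice", "ATTENDS", "ucla"),
    ("bob", "ATTENDS", "mit"),
    ("carol", "ATTENDS", "ucla"),
    ("alice", "TAUGHT_BY", "prof_kim"),
    ("bob", "TAUGHT_BY", "prof_kim"),
    ("carol", "TAUGHT_BY", "prof_lee"),
]

university = {}                 # student -> university
students_of = defaultdict(set)  # professor -> students taught
for source, label, target in edges:
    if label == "ATTENDS":
        university[source] = target
    elif label == "TAUGHT_BY":
        students_of[target].add(source)

# Pairs of students at different universities who share a professor.
shared = [
    (first, second, professor)
    for professor, students in sorted(students_of.items())
    for first, second in combinations(sorted(students), 2)
    if university[first] != university[second]
]
```

Here the question is answered by walking edges outward from each professor. A graph database is built around exactly that kind of traversal, whereas a relational schema would need to join the student, enrollment, and professor tables to reach the same answer.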