Secondary Index
A secondary index is used in databases to help speed up queries when we want to grab data from popular columns or if we want to do some type of key range lookup efficiently.
Secondary indices are used in relational databases (e.g. MySQL) as well as NoSQL databases (e.g. DynamoDB). For relational databases, the secondary index is used for columns. For NoSQL databases, indices can be created based on different database constructs - for example, DynamoDB can have secondary indices on item attributes.
In the following example, we take a look at a document DB (e.g. MongoDB) that maintains secondary indices on a particular document field.
In the example diagram above, a secondary index is created on the eyes field for each Document. Thus, if we want to do a look up such as SELECT * FROM table WHERE eyes == "Brown"
, we can quickly refer to the secondary index on eyes and grab the documents for which the eyes
attribute is Brown.
Local Secondary Index (LSI)
A local secondary index takes the idea above in a distributed database setting where we may have 1 or more partitions that are sharded. In this case, a secondary index is partitioned by some document ID. Each partition would now have its own secondary index that is mapped to a document ID that exists in the same partition's primary index.
The upside of this approach is that whenever you need to update a specific document by its document ID, you only need to update one of these partitions. The biggest downside here is the need for scatter-gather, which means that when you need to do reads based on the eyes
field, you need to query all partitions and merge all of the results back to the requester; this can be an expensive and time-consuming operation.
Another database with LSI support is DynamoDB, a key-value database. Conceptually, DynamoDB's partition key / sort key combination for LSIs make sense. The partition key of the LSI must be the same as the partition key of its base table - if the base table is sharded by eyes
, then so should the LSI index. The sort key here would just be eyes
or any other attribute. Typically with DynamoDB, local secondary indices are not preferred for high reads due to:
- The constraint of choosing the same partition key for the index as the base table
- Scatter-gather requirement, which can result in slow reads but fast writes.
To alleviate both of these concerns, a global secondary index (GSI) can be used instead.
Global Secondary Index (GSI)
A global secondary index is also used for distributed databases, but unlike the local secondary index, the global secondary index can be partitioned by some term (taken from search engines such as Elasticsearch). A term here would be a field in a Document database, or a column in RDBMS. The main difference here is that LSI is sharded by the same primary index key, but GSI can be sharded by other values.
hair: Black
, the row 100
is in Partition 1, even though the GSI for the term hair resides in Partition 2. For this reason, it is global and not localized to its own partition.The image above shows each partition with a GSI layer. The GSI layer contains document IDs that are from other partitions. Notice how the partition nodes have additional responsibilities - it maintains two different indices sharded by different values - the primary key index based on Document ID, and the GSI based on the term or field.
The advantage of this approach is that only one secondary index partition needs to be read to locate the documents, hence the term global. This makes reads faster overall, but writes on the other hand become more complicated. If a document is updated, now the reference to the document in each of the partitions' GSI has to be updated too.
For example, in the diagram above, if Arya changes her hair to Brown and eyes to Blue, then we would need to update the GSI for partitions 2 and 1 respectively.
In the DynamoDB world, GSIs are useful to have in a system where we need fast reads. Since we are not constrained by the same shard ID as the base table (the conceptual equivalent of primary key index in the diagrams above), we have the flexibility to index just about any DynamoDB Item attribute. However, like all indexes, an attribute should be chosen wisely to reduce 1. database operation costs and 2. the cardinality of the attribute.
Afterword
A local secondary index (LSI) has the benefit of faster writes since the secondary index is localized to its partition - if we need to update a document, we don't need to touch other partitions to do so. Reads however are slower, since reads based on a column (or field) may need to look at every other partition as well. A LSI also needs to be sharded based on the primary key index or base table shard ID, which adds further limitations if we want to index on completely different columns or attributes.
A global secondary index (GSI) on the other hand, has the benefit of faster reads since only one partition needs to be looked up for the document references. The benefit can especially be seen if the document references only a handful of nodes/partitions and the max number of partitions is very high. However, writes will be slower because every partition needs to be examined to locate if a particular document needs to be updated. A GSI has more flexibility with what column/attribute to index on.