Introduction
As illustrated in this article, indexing is one of the easiest and most effective tweaks you can add to your SQL database.
However, indexing might seem like magic, and you might also not be too sure which field to index in the first place. This article is made to help you understand how indexing works in a nutshell, and which fields you might want to index over others.
What is an index?
An index at the most basic level has at least two columns:
- A search key column that this index is based on
- The pointer column to the actual location of the row in another table
In most RDBMS implementations, you are allowed to create an index based off of a column. Once created, the search key column then is represented by the values of that column.
Example
To better understand indexing, we should understand what an index of a book is. From the wiki:
An index ( plural: usually indexes, more rarely indices; see below) is a list of words or phrases ('headings') and associated pointers ('locators') to where useful material relating to that heading can be found in a document or collection of documents. Examples are an index in the back matter of a book and an index that serves as a library catalog.
TL;DR
Basically, its a list of keywords that tell you where the page numbers of their usage (in the book) are.
The advantage of this is pretty clear. Instead of flipping through every page of the book looking for the keyword "Mangos", you can instead look up the index for "Mangos" to find the page number you're looking for almost instantaneously.
Back into the DB world...
In RDBMS, this works fairly the same way. If we make a SQL query such as the following:
SELECT * FROM table WHERE id='99999'
We would actually be looking through all rows in the table, from top to bottom, until we find the row with id = 99999
. Keep in mind, that the rows can be unsorted, so you might see rows like this being processed one by one:
id | loc | name
193 | NZ | Garfield
758 | NZ | Jones
30 | NZ | Red
1990 | CA | Smith
25 | OH | Chappelle
...
This means that it can be very slow just to find that row with id = 99999
. At the worst case this will search for every possible row, if id = 99999
happened to be at the very last row. This is \(O(n)\) time, which can be really expensive for a humongous database.
How a SQL index works
The DBMS keeps track of a index data structure, which is separate from the table itself. You could think of it as a hash map that represents the values of a field you want to index on. These values would be sorted.
id | loc | name
193 | NZ | Garfield
758 | NZ | Jones
30 | NZ | Red
99999 | CA | Jane
1990 | CA | Smith
25 | OH | Chappelle
In the example below, we create an index based on the search key id
to achieve the following index:
index (id)
25 => (Row 6)
30 => (Row 3)
193 => (Row 1)
758 => (Row 2)
1990 => (Row 5)
99999 => (Row 4)
With the index created, instead of going through each row until we find id = 99999
, we can now do a binary search on the index ids, and find the row we are looking for in \(O(log n)\) time. This works because binary search works on any sorted input.
The \(O(log n)\) is a huge advantage vs. \(O(n)\).
If we have a table with billions of values (\(10^9)\). Instead of potentially going through a billion rows, we now only have to look up 29 rows. That's huge savings!
Note: In some SQL implementations, the primary key can already be indexed for you.
Choosing the index
There are a few things to determine:
- Is the database read heavy or write heavy?
- What is the cardinality of the field?
Read vs. Writes
The primary benefit of indexing is improving query performance. If your database is write-heavy, then that means for each new entry added, the index will have to update its list to account for those new entries as well. This means that writes take longer, and takes up a lot more disk space. Therefore, if you are write-heavy, then you might want to avoid over-indexing, or be wary of which fields you index.
Field Cardinality
Cardinality can be defined as the following: \(\frac{\text{unique values}}{\text{total records}}\). For example, the field 'gender' would have only a handful of possible unique values. On the other hand, the field 'price' may have infinitely much more unique values.
If you had a billion records, and you indexed on 'gender', you would have low cardinality, which makes the index not so suitable.
For fields that have high cardinality, such as price or id, the index will be much more efficient and pragmatic.