@rkenmi - SQL: Indexing

January 24, 2021

# Introduction

However, indexing might seem like magic, and you might also not be too sure which field to index in the first place. This article is made to help you understand how indexing works in a nutshell, and which fields you might want to index over others.

# What is an index?

An index at the most basic level has at least two columns:

• A search key column that this index is based on
• The pointer column to the actual location of the row in another table

In most RDBMS implementations, you are allowed to create an index based off of a column. Once created, the search key column then is represented by the values of that column.

# Example

To better understand indexing, we should understand what an index of a book is. From the wiki:

An index ( plural: usually indexes, more rarely indices; see below) is a list of words or phrases ('headings') and associated pointers ('locators') to where useful material relating to that heading can be found in a document or collection of documents. Examples are an index in the back matter of a book and an index that serves as a library catalog.

TL;DR

Basically, its a list of keywords that tell you where the page numbers of their usage (in the book) are.

The advantage of this is pretty clear. Instead of flipping through every page of the book looking for the keyword "Mangos", you can instead look up the index for "Mangos" to find the page number you're looking for almost instantaneously.

# Back into the DB world...

In RDBMS, this works fairly the same way. If we make a SQL query such as the following:

SELECT * FROM table WHERE id='99999'

We would actually be looking through all rows in the table, from top to bottom, until we find the row with id = 99999. Keep in mind, that the rows can be unsorted, so you might see rows like this being processed one by one:

id | loc | name

193  | NZ | Garfield
758  | NZ | Jones
30   | NZ | Red
1990 | CA | Smith
25   | OH | Chappelle
...


This means that it can be very slow just to find that row with id = 99999. At the worst case this will search for every possible row, if id = 99999 happened to be at the very last row. This is $$O(n)$$ time, which can be really expensive for a humongous database.

# How a SQL index works

The DBMS keeps track of a index data structure, which is separate from the table itself. You could think of it as a hash map that represents the values of a field you want to index on. These values would be sorted.

id | loc | name

193   | NZ | Garfield
758   | NZ | Jones
30    | NZ | Red
99999 | CA | Jane
1990  | CA | Smith
25    | OH | Chappelle


In the example below, we create an index based on the search key id to achieve the following index:

index (id)

25 => (Row 6)
30 => (Row 3)
193 => (Row 1)
758 => (Row 2)
1990 => (Row 5)
99999 => (Row 4)


With the index created, instead of going through each row until we find id = 99999, we can now do a binary search on the index ids, and find the row we are looking for in $$O(log n)$$ time. This works because binary search works on any sorted input.

The $$O(log n)$$ is a huge advantage vs. $$O(n)$$.

If we have a table with billions of values ($$10^9)$$. Instead of potentially going through a billion rows, we now only have to look up 29 rows. That's huge savings!

Note: In some SQL implementations, the primary key can already be indexed for you.

# Choosing the index

There are a few things to determine:

• Is the database read heavy or write heavy?
• What is the cardinality of the field?

Cardinality can be defined as the following: $$\frac{\text{unique values}}{\text{total records}}$$. For example, the field 'gender' would have only a handful of possible unique values. On the other hand, the field 'price' may have infinitely much more unique values.