Preface
Elasticsearch is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.
Key Terms:
- Document - Serialized JSON data. A mapping of fields.
- Index - Roughly equivalent to a database table. An index just contains one or more documents.
Bulk API is not atomic
When we re-index Elasticsearch documents, documents are updated in real-time.
Recall that a transaction is a group of operations that we want to execute in an all-or-nothing fashion: if one operation inside a transaction fails, we roll back the entire transaction by undoing the operations already applied. If all operations succeed, the transaction is committed and all operations take effect atomically.
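To make the all-or-nothing idea concrete, here is a minimal sketch of a transaction over a plain Python dict, using an undo log to roll back. The `run_transaction` helper and the rule that a `None` value counts as a failed operation are assumptions made purely for illustration; this is not how any particular database implements transactions.

```python
def run_transaction(store, operations):
    """Apply (key, value) writes to `store` in an all-or-nothing fashion.

    Returns True if every write was committed, False if the batch was rolled back.
    """
    undo_log = []  # (key, existed_before, previous_value), newest entry last
    try:
        for key, value in operations:
            if value is None:  # illustrative failure condition
                raise ValueError(f"invalid value for {key!r}")
            undo_log.append((key, key in store, store.get(key)))
            store[key] = value
    except ValueError:
        # Roll back: undo the writes already applied, newest first.
        for key, existed, previous in reversed(undo_log):
            if existed:
                store[key] = previous
            else:
                del store[key]
        return False
    return True


store = {"a": 1}
first_ok = run_transaction(store, [("b", 2), ("c", 3)])
second_ok = run_transaction(store, [("a", 10), ("d", None), ("e", 5)])
```

The second batch fails on its second operation, so the write to `"a"` is undone and `store` is left exactly as the first batch committed it.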
For example, say we want to index 100 documents. Ideally, we would commit a single transaction containing 100 operations, where each operation indexes one document.
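from datetime import datetime
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch(...)
# Wipe the live index before refilling it
es.indices.delete(index=SPECIAL_INDEX)
# One placeholder action repeated 100 times; `g` is assumed to be a regex
# match object produced by parsing code that is not shown here
actions = [
    {
        "_index": SPECIAL_INDEX,
        "_type": "_doc",  # mapping types are deprecated in 7.x and removed in 8.x
        "_source": {
            "rarity": g.group(1),
            "id": g.group(2),
            "timestamp": datetime.now(),
        },
    }
] * 100
helpers.bulk(es, actions)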
Elasticsearch will store each document in SPECIAL_INDEX sequentially.
If the Bulk API call fails halfway through, the documents that have already been indexed stay indexed. There is no rollback functionality here, so the operation is not atomic. This is problematic because of side effects such as state inconsistency (i.e., which documents were written and which weren't?).
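The partial state described above can be reproduced with a toy in-memory model. This sketch does not use the Elasticsearch client: `naive_bulk`, `IndexingError`, and the rule that a document without an `id` is rejected are hypothetical stand-ins. The real Bulk API reports a result per item rather than stopping at the first error, but the end state is similarly partial.

```python
class IndexingError(Exception):
    """Raised by the toy bulk loader when one document is rejected."""


def naive_bulk(index, docs):
    """Write docs into `index` (a dict) one by one, with no rollback."""
    for doc in docs:
        if "id" not in doc:
            raise IndexingError(f"document rejected: {doc!r}")
        index[doc["id"]] = doc


index = {}
docs = [{"id": n} for n in range(100)]
docs[50] = {"rarity": "common"}  # malformed on purpose: no "id" field

try:
    naive_bulk(index, docs)
except IndexingError:
    pass  # the call failed, but the earlier writes are still in place

indexed_ids = sorted(index)
```

After the failure, `index` holds the first 50 documents and nothing else, and the caller has to work out which ones those are before retrying.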
Updating in real-time
Suppose that each of the 100 documents above is about 1 GB, and that you want to keep wiping the entire index clean and refilling it with 100 GB of fresh documents. If your hardware and network throughput aren't stellar, the deletion alone could take minutes, if not hours.
Let's also say that this index is currently visible to the world wide web, and visitors can see all of the documents in that index in real-time.
While the DELETE operation is processing, visitors are going to be shocked to find their search result counts dropping rapidly. They leave the page in frustration and come back minutes later to find new results popping up in the search engine... but this time, the result count is very low (the INSERT operation for the new documents is still processing).
Unfortunately, the search engine is too "real-time" for our visitors: while the search indexes are being updated, they cannot tell what is going on. Not only is this misleading and a bad user experience for search clients; if a fatal error occurs during the real-time update, the search experience would be completely dead, which makes the entire operation dangerous.
Solution - Aliases
To address the issue above, what we really need is an A/B switch. What if we had two search tables to begin with? One would be the original table that is live (in production), and the other would be a copy of the live table. Suppose we have a big batch of updates to the search table every day. Instead of updating the live table, what if we updated the copy and then switched over to it once the update process is complete?
The idea above is the exact use case for the Elasticsearch Aliases API. An alias is a secondary name that maps to one or more indices and can be queried as if it were an index itself. You can think of an alias as a pointer in a programming language, or as a soft link.
The key benefit of aliases here is that you can insert the documents into a separate index and, when the indexing operation is complete, simply map the alias to that new index; the alias will point to it near-instantly. This means you can have an alias return documents from whichever indices you choose, or even from a combination of all indices if you wanted to.
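The pointer analogy can be sketched with a small in-memory model. `ToySearch` and the `products_v1` / `products_v2` names are hypothetical and exist only for illustration; unlike a real alias, this toy alias points at exactly one index at a time.

```python
class ToySearch:
    """In-memory stand-in: named indices plus aliases that point at them."""

    def __init__(self):
        self.indices = {}  # index name -> list of documents
        self.aliases = {}  # alias name -> index name

    def search(self, alias):
        # Readers only ever know the alias name.
        return list(self.indices[self.aliases[alias]])


engine = ToySearch()
engine.indices["products_v1"] = [{"id": 1}, {"id": 2}]
engine.aliases["products"] = "products_v1"

# Build the replacement index off to the side; readers still see v1.
engine.indices["products_v2"] = []
for n in range(3, 8):
    engine.indices["products_v2"].append({"id": n})
    assert len(engine.search("products")) == 2

# Repoint the alias in one step, then clean up the old index.
engine.aliases["products"] = "products_v2"
del engine.indices["products_v1"]

live_ids = [doc["id"] for doc in engine.search("products")]
```

Readers querying `products` see the two old documents for the whole time the new index is being filled, and then see all five new ones as soon as the alias is repointed.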
In the diagram above, we handle the scenario of updating a pre-existing index (OLD_INDEX) that happens to be live in production. We first create a new index called NEW_INDEX, store all of the new documents inside it, then map it to the alias once everything is done. After we map it to the alias, we can then point the application code to the alias so that live traffic will hit the aliased index for new search results. We can then remove OLD_INDEX from the alias to either clean it up or delete the entire index if desired.
The code example below illustrates how aliases can be used to streamline the delete and insert operations of the documents in such a way that the visitor search UX would be mostly unaffected.
from datetime import datetime
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch(...)
# Create the placeholder index for all new documents
es.indices.create(index=NEW_INDEX)
# One placeholder action repeated 100 times; `g` is assumed to be a regex
# match object produced by parsing code that is not shown here
actions = [
    {
        "_index": NEW_INDEX,
        "_type": "_doc",  # mapping types are deprecated in 7.x and removed in 8.x
        "_source": {
            "rarity": g.group(1),
            "id": g.group(2),
            "timestamp": datetime.now(),
        },
    }
] * 100
helpers.bulk(es, actions)
# At this point, the documents are indexed into NEW_INDEX, assuming no errors
# Update the alias in real time! To end users (viewers), it will look as if the
# index refreshed its data instantaneously. SPECIAL_INDEX is now the alias name
# that the application queries.
es.indices.put_alias(name=SPECIAL_INDEX, index=NEW_INDEX)
es.indices.delete_alias(name=SPECIAL_INDEX, index=OLD_INDEX)
# Delete the old index as part of clean-up
es.indices.delete(index=OLD_INDEX)
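Note that the put_alias and delete_alias calls above are two separate requests, so for a brief moment the alias points at both indices. The `_aliases` endpoint accepts a list of actions that are applied atomically in a single request, which closes that window. Below is a minimal sketch under some assumptions: `build_alias_swap` and `swap_alias` are hypothetical helper names, the client is assumed to expose `indices.update_aliases(body=...)` as elasticsearch-py 7.x does (newer client versions may expect a different keyword), and the stub classes exist only so the example runs without a cluster.

```python
def build_alias_swap(alias, old_index, new_index):
    """Return an `_aliases` request body that moves `alias` in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def swap_alias(es, alias, old_index, new_index):
    """Send the swap through a client exposing indices.update_aliases(body=...)."""
    body = build_alias_swap(alias, old_index, new_index)
    return es.indices.update_aliases(body=body)


class _RecordingIndices:
    """Hypothetical stub that records calls instead of contacting a cluster."""

    def __init__(self):
        self.calls = []

    def update_aliases(self, body):
        self.calls.append(body)
        return {"acknowledged": True}


class _StubClient:
    """Hypothetical stand-in for an Elasticsearch client object."""

    def __init__(self):
        self.indices = _RecordingIndices()


stub = _StubClient()
response = swap_alias(stub, "special", "old_index", "new_index")
```

With a real client, the same `swap_alias(es, SPECIAL_INDEX, OLD_INDEX, NEW_INDEX)` call would replace the put_alias / delete_alias pair, followed by the index deletion as before.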
As you can see, the Alias API is powerful because it lets us swap from one index to another almost instantly.