Skip to content

FancyIndexes.md

LambdaZen edited this page Dec 13, 2017 · 1 revision

Fancy Indexes

Recent progress in indexing and search technology obsoletes old-style database indexes for human searches and lookups. Apache Lucene is the most popular indexing search technology. Lucene indexes can have documents that refer to the vertex/edge IDs in Bitsy. This section discusses how you can integrate Bitsy with Lucene (or an alternate "fancy index") while maintaining ACID guarantees.

Lucene indexes are good for human use. Bitsy indexes are good for for machine (algorithmic) use. For example, a Blueprints key-index is better to find person node by SSN and a Lucene index is better to find a person node based on his/her education background.

For this reason, it is OK to update Lucene indexes with a little bit of delay. Delayed batch updates are also preferable because Lucene performs much better with batched updates followed by commits. Please refer to the bullet on setting autoCommit=false.

Note: There are no plans to integrate Lucene with Bitsy. If you are managing large objects outside Bitsy, it makes sense to keep the indexes on these objects externally as well. Furthermore, the mapping between vertex/edge properties and documents to be indexed by Lucene is application-dependent. If there is one search box for all properties of a person, there will only be one Lucene index against all properties of the person vertex.

Ensuring consistency with Lucene

Bitsy supports ACID transactions, and so does Lucene in the sense that all commits will be fully applied or not applied at all. The problem, however, is that when committing to Lucene in batches, a crash might leave some elements/documents un-indexed.

To avoid this problem, your application can maintain one vertex per Lucene index maintained. Whenever a vertex is added with an impact on the index, you can add an edge from that vertex to the vertex representing the Lucene index. This guarantees that the database always keeps track of documents that need to be added to the indexes.

In the background, a separate thread can wake up (say every minute) and perform these steps:

  1. Find the vertex for the Lucene index managed by this thread, say V, based on a key-index or the UUID of the vertex
  2. Get the incoming edges to V. These link to vertices that have some properties or references to large objects that must be indexed.
  3. Add the document(s) corresponding to the linked vertices. This should be an idempotent operation since it may be executed more than once in case of a crash in step 5/6.
  4. Commit the Lucene index writer
  5. Remove the processed edges from the Bitsy graph
  6. Commit the changes to the Bitsy graph

The above algorithm ensures that the Lucene indexes are updated efficiently, while keeping Bitsy and Lucene consistent with one another within a configurable period of delay. In the event of a crash, the system will pick up the pending vertices to be indexed and add them to the Lucene index.

Clone this wiki locally