Lucene Index Export Service Explained
Introduction
Goal
Understand how the Lucene Index Export Service works.
Background
The optional Lucene Index Export add-on, when added to a Bloomreach Experience Manager implementation project, adds a Repository JAX-RS service which enables on-the-fly export of the Lucene index from a running production instance.
This page describes how the Lucene Index Export Service works.
Lucene Index Export Service Explained
When creating a Lucene export from a running repository, a zip with a structure similar to the example below gets created:
+ _1
+ _2
+ _3
+ ...
+ indexRevision.properties
The indexRevision.properties file contains two properties, indexRevisionBefore and indexRevisionAfter: the (database) revision of the current repository at the start of the Lucene export and the (database) revision at the end of the export. Both values are needed because the revision of the repository can advance during the Lucene export. When the index export is ready, we cannot know exactly which revision the exported index corresponds to. All we know for sure is that it lies between the start and end revision numbers.
Therefore, when we start a new repository with the exported index, we remove from the index any documents that were added after the start revision, up to and including the end revision. Then, during the rest of the repository startup process, the index is updated from the start revision to the current global (cluster-wide) revision. After this, the repository becomes available. The more recent the index export, the faster the new repository starts up.
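To make the role of the two properties concrete, here is a minimal sketch that reads them from the text of an indexRevision.properties file. The exact layout of the file is an assumption (simple `key = value` lines), and the function name `parse_index_revision` is hypothetical; it is not part of the add-on.

```python
from typing import Dict, Tuple


def parse_index_revision(text: str) -> Tuple[int, int]:
    """Return (indexRevisionBefore, indexRevisionAfter) from properties text.

    Assumption: the file uses plain 'key = value' lines.
    """
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and Java-properties style comments.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()

    before = int(props["indexRevisionBefore"])
    after = int(props["indexRevisionAfter"])
    if after < before:
        raise ValueError("indexRevisionAfter is lower than indexRevisionBefore")
    return before, after


# Illustrative values only.
sample = "indexRevisionBefore = 2500\nindexRevisionAfter = 2516\n"
before, after = parse_index_revision(sample)
```

The exported index corresponds to some revision in the closed range from `before` to `after`, which is why both values have to be kept.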
Lucene Index Export Example with Numbers
Assume the repository cluster node that will do the Lucene export is called 'node1' and we have the following database info at the start of the export:
mysql> SELECT * FROM REPOSITORY_GLOBAL_REVISION;
+-------------+
| REVISION_ID |
+-------------+
|        2559 |
+-------------+

mysql> SELECT * FROM REPOSITORY_LOCAL_REVISIONS;
+---------------+-------------+
| JOURNAL_ID    | REVISION_ID |
+---------------+-------------+
| node1         |        2500 |
| node2         |        2559 |
+---------------+-------------+
As you can see, node1 is 59 revisions behind (cluster node2 is up-to-date). During the Lucene export done by 'node1', the repository can advance its revision id. When the index export ends, we can for example have the following state:
indexRevision.properties
- indexRevisionBefore = 2500
- indexRevisionAfter = 2516
and
mysql> SELECT * FROM REPOSITORY_LOCAL_REVISIONS;
+----------------------------------------+-------------+
| JOURNAL_ID                             | REVISION_ID |
+----------------------------------------+-------------+
| node1                                  |        2559 |
| node2                                  |        2559 |
| _HIPPO_EXTERNAL_REPO_SYNC_index-backup |        2500 |
+----------------------------------------+-------------+
So, during the export, the revision advanced from 2500 to 2516. After the export, the 'node1' cluster node advanced further to 2559 and is now completely up-to-date. In addition, a new entry has been added to REPOSITORY_LOCAL_REVISIONS:
_HIPPO_EXTERNAL_REPO_SYNC_index-backup = 2500
Its value is the revision that repository node1 had at the start of the index export. See below for the explanation of this extra revision number.
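The numbers in this example can be checked with a few lines of arithmetic. This is only an illustration of the values shown above; the function name `revision_lag` is hypothetical.

```python
from typing import Dict


def revision_lag(local_revisions: Dict[str, int], global_revision: int) -> Dict[str, int]:
    """Number of revisions each cluster node is behind the global revision."""
    return {node: global_revision - rev for node, rev in local_revisions.items()}


# State at the start of the export, taken from the tables above.
lags = revision_lag({"node1": 2500, "node2": 2559}, global_revision=2559)

# Revisions that may or may not be reflected in the exported index:
# everything after indexRevisionBefore up to and including indexRevisionAfter.
index_revision_before, index_revision_after = 2500, 2516
uncertain_revisions = range(index_revision_before + 1, index_revision_after + 1)
```

Here `lags` shows node1 being 59 revisions behind, and `uncertain_revisions` covers the 16 revisions 2501 through 2516.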
When the Lucene export from the above example is used to start up an extra repository (node3), the following steps happen during startup if an index with an indexRevision.properties file is present at {storageRoot}/workspaces/default/index:
- The present Lucene index is used as the starting point
- From the database table REPOSITORY_JOURNAL, all changes from revision 2500 (exclusive) to 2516 (inclusive) are fetched and removed from the Lucene index. If some of these changes (say, revisions 2512 to 2516) were not yet reflected in the index, this is not a problem: nothing is removed for them
- The repository 'node3' will be added to the REPOSITORY_LOCAL_REVISIONS table with REVISION_ID = 2500
- The indexRevision.properties file is deleted from {storageRoot}/workspaces/default/index
- Normal Jackrabbit repository startup continues, with 'node3' updating its index up to the REVISION_ID value in REPOSITORY_GLOBAL_REVISION
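The steps above can be sketched as a small in-memory simulation. This is not the add-on's implementation: the index is modelled as a set of document ids, the journal as a mapping from revision to the ids changed in that revision, and only additions are modelled. All names (`restore_from_export`, the sample ids) are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def restore_from_export(
    exported_index: Set[str],
    journal: Dict[int, List[str]],
    before: int,
    after: int,
    global_revision: int,
) -> Tuple[Set[str], int]:
    """Simulate startup of a new node from an exported index."""
    index = set(exported_index)  # the exported index is the starting point

    # Remove every change in (before, after]; discard() is a no-op for
    # changes that never made it into the exported index.
    for rev in range(before + 1, after + 1):
        for doc_id in journal.get(rev, []):
            index.discard(doc_id)

    # The new node is registered at the start revision of the export.
    local_revision = before

    # Normal startup: replay the journal up to the global revision.
    for rev in range(local_revision + 1, global_revision + 1):
        for doc_id in journal.get(rev, []):
            index.add(doc_id)
        local_revision = rev

    return index, local_revision


# Illustrative data: 'c' (revision 2514) was not yet in the exported index.
journal = {2501: ["a"], 2510: ["b"], 2514: ["c"], 2550: ["d"]}
index, revision = restore_from_export({"x", "a", "b"}, journal, 2500, 2516, 2559)
```

Removing the whole uncertain window first and then replaying it means the result does not depend on how far the export had actually progressed between the two revisions.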
After repository 'node3' has come up and assuming the REPOSITORY_GLOBAL_REVISION did not advance, we can have for example something like:
mysql> SELECT * FROM REPOSITORY_LOCAL_REVISIONS;
+----------------------------------------+-------------+
| JOURNAL_ID                             | REVISION_ID |
+----------------------------------------+-------------+
| node1                                  |        2559 |
| node2                                  |        2559 |
| node3                                  |        2559 |
| _HIPPO_EXTERNAL_REPO_SYNC_index-backup |        2500 |
+----------------------------------------+-------------+
Now, what is the purpose of _HIPPO_EXTERNAL_REPO_SYNC_index-backup? This entry contains the REVISION_ID at the start of the last successful Lucene export. After each new successful Lucene export, its revision id is updated. The JOURNAL_ID is added to prevent Repository Maintenance from cleaning up the REPOSITORY_JOURNAL table beyond the revision id of the last successful Lucene export. If that happened, the Lucene export zip would no longer be usable, because the REPOSITORY_JOURNAL table would no longer contain the changes needed to bring the index up to date.
If for some reason a Lucene export zip is used whose indexRevision.properties contains an indexRevisionBefore value that is older than the lowest revision in the REPOSITORY_JOURNAL table, that index export cannot be used to start a new repository. The startup of the new repository then fails with an error message that contains something like:
The only way to fix this error is to either use a newer Lucene export or start the repository without a Lucene export present.
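The protective effect of the extra entry can be illustrated with a short sketch. It assumes that journal cleanup never goes beyond the lowest REVISION_ID in REPOSITORY_LOCAL_REVISIONS; that rule is inferred from the description above and is not the actual Repository Maintenance code. The function names are hypothetical.

```python
from typing import Dict

BACKUP_MARKER = "_HIPPO_EXTERNAL_REPO_SYNC_index-backup"


def safe_cleanup_bound(local_revisions: Dict[str, int]) -> int:
    """Assumed cleanup rule: the lowest local revision limits journal cleanup."""
    return min(local_revisions.values())


def export_is_usable(index_revision_before: int, lowest_journal_revision: int) -> bool:
    """An export is unusable when its start revision is older than the
    lowest revision still present in the journal."""
    return index_revision_before >= lowest_journal_revision


with_marker = {"node1": 2559, "node2": 2559, "node3": 2559, BACKUP_MARKER: 2500}
without_marker = {"node1": 2559, "node2": 2559, "node3": 2559}

bound_with = safe_cleanup_bound(with_marker)        # marker holds cleanup back
bound_without = safe_cleanup_bound(without_marker)  # cleanup may pass the export
```

With the marker present, the cleanup bound stays at the export's start revision, so `export_is_usable(2500, bound_with)` holds; without it, the journal could be trimmed up to 2559 and the same export would be rejected.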