Sep 17, 2014

The Solr search engine runs an indexing process on each document when it is first added to a collection, and again every time it is updated. This process analyzes the content of each field in the document (for example, splitting the values of text fields into tokens) according to the document structure defined in the schema.xml configuration file.

Normally, this process needs to be carried out only once for every new or updated document. Sometimes, however, the index must be regenerated for the whole collection, for instance when schema.xml is edited to modify the structure of the documents. This post explains how to carry out this re-indexing operation, and some considerations to take into account.

When is re-indexing a Solr collection required

Re-indexing a Solr collection is required mainly when the structure of the documents changes: fields are added or removed, or the characteristics of a field are modified. Such a change can be made directly in the field definition in schema.xml, or in one of the configuration files it references (e.g., stopwords.txt, spellings.txt, synonyms.txt), in a way that affects the scoring of documents in the Solr index.

How to re-index a Solr collection

Re-indexing a Solr collection actually means inserting all documents into the collection again, which forces Solr to run the indexing process for each document using the new configuration.

In most cases, the document structure includes an “id” field whose value uniquely identifies the document. In these cases, when a document with the same “id” as an existing document is inserted, the old document is removed and replaced with the new one.

If, however, no unique identifier is defined in the document structure, the collection must be cleared by removing all documents before inserting them again; otherwise duplicate documents would be added to the collection.

For these reasons, the recommended way to re-index a collection “collection1” is to use a new collection, as follows:

  1. Create a new collection “collection1_reindex”, cloning the configuration of the collection “collection1”.
  2. Edit the configuration of “collection1_reindex” as desired, adding, removing or changing the configuration of the fields in the document structure.
  3. Load all documents into the new collection “collection1_reindex”.
  4. Swap “collection1” and “collection1_reindex”, so that the new collection becomes the active collection.
  5. Delete the old collection.

Create a new collection, cloning the configuration of another collection

A new, empty collection “collection1_reindex” can be created by copying the directory tree of the original collection “collection1”, excluding the “data” directory. For instance, on a Linux system, the “rsync” command can be used to make the copy from the command line.
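A minimal sketch of such a copy is shown below. The function name `clone_core_config` and the path in the usage comment are only illustrative, and the sketch assumes that both collection directories sit side by side under the Solr home directory.

```shell
# Copy a collection's directory tree, leaving out its index data.
# clone_core_config is a hypothetical helper name, not part of Solr.
clone_core_config() {
  src="$1"   # directory of the original collection
  dst="$2"   # directory of the new collection

  # -a preserves permissions and timestamps; the trailing slash on the
  # source copies its contents; '/data/' excludes only the top-level
  # "data" directory, which holds the index.
  rsync -a --exclude '/data/' "$src/" "$dst/"
}

# Usage (the Solr home path is an assumption, adjust it as needed):
#   clone_core_config /opt/solr/example/solr/collection1 \
#                     /opt/solr/example/solr/collection1_reindex
```

Excluding “data” means the new collection starts with an empty index, which Solr creates when the collection is first loaded.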

After making the copy, the name of the collection in the core.properties file of the new directory must be changed to “collection1_reindex”.
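The change can be made in any text editor. As a sketch, assuming the copied core.properties contains a `name=collection1` line, it could also be scripted as follows (`set_core_name` is a hypothetical helper name):

```shell
# Rewrite the "name" property in a core.properties file.
# set_core_name is a hypothetical helper name, not part of Solr.
set_core_name() {
  props="$1"   # path to core.properties
  name="$2"    # new collection (core) name

  # Replace the existing name=... line; the previous version of the
  # file is kept next to it with a .bak suffix.
  sed -i.bak "s/^name=.*/name=$name/" "$props"
}

# Usage (path is an assumption):
#   set_core_name /opt/solr/example/solr/collection1_reindex/core.properties \
#                 collection1_reindex
```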

Finally, the Java container (Jetty, Tomcat, …) in which Solr runs must be restarted. If no error occurs, the new collection should appear in the Solr admin panel.
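The same check can be made from the command line through the CoreAdmin STATUS action. The sketch below assumes Solr is reachable at http://localhost:8983/solr; `core_status_url` is a hypothetical helper name.

```shell
# Base URL of the Solr instance (an assumption, override as needed).
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Build the CoreAdmin STATUS URL for one core.
core_status_url() {
  printf '%s/admin/cores?action=STATUS&core=%s&wt=json' "$SOLR_URL" "$1"
}

# Usage: print the status of the new core as JSON.
#   curl -fsS "$(core_status_url collection1_reindex)"
```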

Editing the configuration files of the new collection

The configuration of a document collection in Solr is kept in a set of configuration files located in the collection’s “conf” directory.

The two main configuration files are solrconfig.xml and schema.xml. solrconfig.xml contains generic configuration parameters, and schema.xml contains the definition of the structure of the documents in the collection.

Loading the documents into the new collection

Usually the original documents to be loaded into Solr are stored somewhere outside of Solr, for example as a set of files or in a database. The procedure for loading the documents into Solr depends on this, and is specific to each collection.

As a general rule, the best way to load the documents into the new collection is to reproduce exactly the procedure that was used to load them into the original collection. Sometimes, however, this is too costly or even impossible. In these cases, an alternative is to read the documents from the original Solr collection. This is only possible if all the fields in that collection are defined with the attribute stored="true", because Solr keeps the original value of a field, and can return it later, only when the “stored” attribute is set to “true”.
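As a rough sketch of that alternative, the script below pages through the original collection with the select handler and posts each page to the new collection as JSON. It rests on several assumptions: Solr is reachable at http://localhost:8983/solr, the `jq` tool is installed, the unique key field is called “id”, and the collection is small enough for simple start/rows paging. The function names are hypothetical. The `_version_` field is dropped because a copied value can be rejected by Solr's optimistic concurrency check.

```shell
# Base URL of the Solr instance and page size (both assumptions).
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
ROWS="${ROWS:-500}"

# Read a select response on stdin and print its documents as a JSON
# array, without the internal _version_ field.
docs_from_response() {
  jq -c '.response.docs | map(del(._version_))'
}

# Copy all stored documents from one collection to another.
copy_documents() {
  src="$1"; dst="$2"; start=0
  while :; do
    docs=$(curl -fsS "$SOLR_URL/$src/select?q=*:*&fl=*&wt=json&sort=id+asc&start=$start&rows=$ROWS" \
      | docs_from_response)
    # Empty output means the request or the JSON parsing failed.
    if [ -z "$docs" ]; then return 1; fi
    # An empty array means every page has been copied.
    if [ "$docs" = "[]" ]; then break; fi
    printf '%s' "$docs" | curl -fsS -H 'Content-Type: application/json' \
      --data-binary @- "$SOLR_URL/$dst/update" >/dev/null || return 1
    start=$((start + ROWS))
  done
  # Make the copied documents visible to searches.
  curl -fsS "$SOLR_URL/$dst/update?commit=true" >/dev/null
}

# Usage:
#   copy_documents collection1 collection1_reindex
```

Fields that are populated by copyField rules and are also stored would receive their values twice with this approach, so they may need to be removed from each document before posting.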

Swapping “collection1” and “collection1_reindex”

Once all documents have been successfully loaded into the new collection, the two collections can be swapped so that the new collection becomes the active one. This is done using the “Swap” option in the Solr admin panel:

Select “Core Admin” from the left menu, then select “collection1” in the list of collections (cores). Click the “Swap” button at the top and, in the dialog box that pops up, select the collection “collection1_reindex”. After clicking “Swap Cores”, the names of the two collections are swapped, and the new collection, now named “collection1”, becomes the active collection.
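The swap can also be requested without the admin panel, through the CoreAdmin SWAP action. The sketch below assumes Solr is reachable at http://localhost:8983/solr; `swap_url` and `swap_cores` are hypothetical helper names.

```shell
# Base URL of the Solr instance (an assumption, override as needed).
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Build the CoreAdmin SWAP URL for two cores.
swap_url() {
  printf '%s/admin/cores?action=SWAP&core=%s&other=%s' "$SOLR_URL" "$1" "$2"
}

# Ask Solr to swap the names of the two cores.
swap_cores() {
  curl -fsS "$(swap_url "$1" "$2")"
}

# Usage:
#   swap_cores collection1 collection1_reindex
```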

Note: Be aware that the “Swap” command swaps the names of the collections, but not the names of the directories that contain them in the file system. After the two collections have been swapped, the active collection is named “collection1”, but is still located under the directory “solr/collection1_reindex”. Likewise, the old collection is renamed “collection1_reindex”, but is still located under the directory “solr/collection1”.

