Jul 05, 2014
 

Our previous post, Introduction to solr, covered the basics of installing an instance of this search engine, indexing documents, and querying the indexed content.

But using solr efficiently for any purpose other than a demo requires adapting its configuration to the specific type of content that will be indexed.

This post describes the structure of schema.xml, the main configuration file in a solr collection. schema.xml specifies the fields that may appear in a document, and their data types.

A good way to get acquainted with the configuration directives in schema.xml is to study the sample schema.xml file that the installation package includes for the sample collection “collection1”.

The structure of schema.xml can be represented, in pseudo-XML code, as follows:

1. A set of data types is defined by means of <fieldType> elements.

2. Next, the fields that may appear in a document are defined as a set of <field> elements. Each field must have a field name, and a data type from the set of defined data types.

3. One of the fields may optionally be defined as the unique key, which makes it easy to update or delete individual documents.

4. Usually, data from several fields is gathered into a field that will be used to perform full-text searches.
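
The four points above can be sketched as a minimal schema. This is only an illustration, not the sample file itself: the field names “title” and “text”, the schema name and the version number are assumptions chosen for the example, and the text type is reduced to a bare tokenizer.

```xml
<schema name="example" version="1.5">

  <!-- 1. Data types -->
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_general" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <!-- 2. Fields, each with a name and one of the data types above -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

  <!-- 3. Optional unique key -->
  <uniqueKey>id</uniqueKey>

  <!-- 4. Gather data from other fields into the field used for full-text search -->
  <copyField source="title" dest="text"/>

</schema>
```

The <copyField> element is what implements point 4: at indexing time the content of the source field is also added to the destination field, so a single field can be searched for terms that originally appeared in several fields.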

Data types

There is an entry <fieldType> in schema.xml for each of the data types of fields that may appear in documents that are indexed in the collection.

The <fieldType> element must have a “name” attribute to give it a name, and a “class” attribute that specifies the Java class that implements the functionality required for handling that data type.

The <fieldType> element may also carry other attributes, which set configuration parameters specific to the data type being defined.

Primitive data types

The definition of the most common primitive data types can be found in the schema.xml file of the sample “collection1”:
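
As an illustration, definitions along the following lines appear in the sample schema of solr 4.x. They are reproduced from memory as a sketch, so attribute values in your copy of the file may differ.

```xml
<!-- Non-analyzed string, stored and indexed verbatim -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- Boolean: true / false -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>

<!-- Numeric and date types based on the Trie* classes -->
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
```

Here “sortMissingLast” and “precisionStep” are examples of the type-specific attributes mentioned earlier: the first controls where documents without a value are placed when sorting, the second tunes how numeric values are indexed for range queries.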

Data types for handling text fields

One of the main characteristics of solr is its full-text search functionality. To perform this kind of search, solr analyzes the content of the documents being indexed, in a process called “tokenization”: the text in a field is decomposed into elements called tokens. For a given query, solr can then return results that contain not only the exact terms being searched, but also synonyms, phonetically equivalent terms, or terms having the same root (stem).

This kind of analysis is performed on fields whose data type is based on the “solr.TextField” class. The definition of a text data type requires a fair amount of additional configuration parameters, to detail the specifics of the tokenization to be carried out.

In the sample schema.xml file there are many text data type definitions for different types of tokenization. For instance:

text_general
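
A general-purpose text type. In the solr 4.x sample schema its definition looks roughly like the sketch below (reproduced from memory, so details may differ in your copy; “stopwords.txt” and “synonyms.txt” are files in the collection's configuration directory).

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <!-- Analysis applied to field content when a document is indexed -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Analysis applied to the search terms of a query -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The text is split into tokens by the standard tokenizer, stop words are removed, and the remaining tokens are converted to lower case. Synonyms are expanded at query time only.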

phonetic

Performs a phonetic analysis of the text.
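
A sketch of how this type can be defined, modelled on the solr 4.x sample schema (reproduced from memory; treat the attribute values as assumptions):

```xml
<fieldType name="phonetic" class="solr.TextField" indexed="true" stored="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Replaces each token by its Double Metaphone phonetic code;
         inject="false" means the original token is not kept alongside the code -->
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldType>
```

Because the same analyzer is applied both when indexing and when querying, two words that sound alike are reduced to the same code and therefore match each other.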

Geospatial data types

solr also implements a set of classes that allow the definition of specialized data types, such as points and polygons, for the handling of geospatial data. There will be a post devoted to this topic later in this series.

 

Fields

The sample schema.xml in the installation package defines a set of fields that can appear in documents indexed in the “collection1” collection, as follows:
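
An excerpt along these lines, reproduced from memory from the solr 4.x sample schema (so your copy may differ in details), gives the idea:

```xml
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="true"/>
```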

Attributes that may be used in the definition of fields

As we see in the example above, all field definitions must include a “name” attribute giving a name to the field, and a “type” attribute assigning a data type from the set of data types previously defined. Field definitions may optionally use some additional attributes:

  • indexed: true if this field should be indexed (searchable or sortable)
  • stored: true if this field should be retrievable
  • docValues: true if this field should have doc values. Doc values are useful for faceting, grouping, sorting and function queries. Although not required, doc values will make the index faster to load, more NRT-friendly and more memory-efficient. They however come with some limitations: they are currently only supported by StrField, UUIDField and all Trie*Fields, and depending on the field type, they might require the field to be single-valued, be required or have a default value (check the documentation of the field type you’re interested in for more information)
  • multiValued: true if this field may contain multiple values per document
  • omitNorms: (expert) set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms. Norms are omitted for primitive (non-analyzed) types by default.
  • termVectors: [false] set to true to store the term vector for a given field. When using MoreLikeThis, fields used for similarity should be stored for best performance.
  • termPositions: Store position information with the term vector. This will increase storage costs.
  • termOffsets: Store offset information with the term vector. This will increase storage costs.
  • required: true if the field is required. An error is thrown if a document is added without a value for it.
  • default: a value that should be used if no value is specified when adding a document.
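
To see how these attributes combine, here is a short sketch. The field names (“isbn”, “status”, “tags”, “body”) are hypothetical and do not come from the sample schema; the types “string” and “text_general” are assumed to be defined as in the sample schema.

```xml
<!-- Hypothetical fields, for illustration only -->

<!-- Mandatory, single-valued identifier with doc values for sorting and faceting -->
<field name="isbn" type="string" indexed="true" stored="true"
       required="true" multiValued="false" docValues="true"/>

<!-- Gets the value "draft" when a document does not supply one -->
<field name="status" type="string" indexed="true" stored="true" default="draft"/>

<!-- Searchable but not retrievable; may hold several values per document -->
<field name="tags" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- Full-text field that also stores term vectors with positions and offsets -->
<field name="body" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```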

Unique key

It is advisable to include an identifier in the documents to be indexed, that will uniquely identify each document in the collection. This makes it easier to carry out updates and deletions of documents. This identifier field is usually named “id”. In the schema.xml file, the name of the field to be used as identifier is specified in the <uniqueKey> element:
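
In the solr 4.x sample schema, for instance, the declaration looks like the sketch below. The “id” field itself is defined like any other field and is shown here only for context.

```xml
<!-- The field that holds the identifier, defined like any other field -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>

<!-- Name of the field that uniquely identifies each document -->
<uniqueKey>id</uniqueKey>
```

With a unique key in place, indexing a document whose “id” matches that of an existing document replaces the existing one instead of creating a duplicate.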

 

References

http://www.solrtutorial.com/schema-xml.html

http://wiki.apache.org/solr/SchemaXml

https://cwiki.apache.org/confluence/display/solr/Field+Type+Definitions+and+Properties

Posted at 7:30 pm
