Solr: Indexing XML with Lucene and REST
by Bertrand Delacretaz
Index Management
Besides the <add> and <commit> operations, <delete> can be used to remove
documents from the index, either by using the document's unique ID:
<delete><id>MA147LL/A</id></delete>
or by using a query to (carefully) erase a range of documents:
<delete><query>category:camera</query></delete>
As with add/update operations, a <delete> must be followed by a <commit> to make the
resulting changes visible in queries.
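As a rough sketch of how a client might assemble these messages, the Python snippet below builds the two kinds of <delete> message with the standard library, so that special characters in an ID or query are escaped properly. The helper names are our own invention, not part of Solr, and actually posting the messages to the update URL is left out.

```python
import xml.etree.ElementTree as ET

# A <commit/> message must follow deletes to make the changes visible.
COMMIT_MESSAGE = "<commit/>"


def delete_by_id(doc_id: str) -> str:
    """Build a <delete> message that removes one document by unique ID."""
    root = ET.Element("delete")
    ET.SubElement(root, "id").text = doc_id
    return ET.tostring(root, encoding="unicode")


def delete_by_query(query: str) -> str:
    """Build a <delete> message that removes every document matching a query."""
    root = ET.Element("delete")
    ET.SubElement(root, "query").text = query
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Messages a client would send, in order, to the update URL.
    for message in (delete_by_id("MA147LL/A"),
                    delete_by_query("category:camera"),
                    COMMIT_MESSAGE):
        print(message)
```

Building the messages through an XML library rather than string concatenation keeps a query containing & or < from producing a malformed message.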
The last operation available at the update URL is <optimize>,
which triggers an optimization of the Lucene indexes, as explained in the Lucene
FAQ.
This must be called from time to time to speed up searches
and reduce the number of segment files created by Lucene, thereby avoiding possible "out
of file handles" errors.
These four simple operations, described in more detail on the Solr UpdateXmlMessages wiki page, are all we need to maintain our indexes.
Searching and Sorting
As shown above, querying is simply a matter of making a GET request with the appropriate query parameters.
The query language used by Solr is based on Lucene queries, with the addition of
optional sort clauses in the query. For example, the query:
video; inStock asc, price desc
searches for the word "video" in the default search field and returns results
sorted on the inStock field, ascending, and then on the price field, descending.
The default search field is specified in Solr's schema.xml configuration file, as in this example:
<defaultSearchField>text</defaultSearchField>
A query can refer to several fields. For example:
handheld AND category:camera
searches the category field in addition to the default search field.
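To make the query syntax concrete, here is a minimal Python sketch that appends a sort clause to a Lucene query and URL-encodes the result for a GET request. The select URL is an assumption (a Solr instance running locally on its usual example port), and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance exposing its select handler at this URL.
SELECT_URL = "http://localhost:8983/solr/select"


def build_query(terms, sort=()):
    """Append an optional sort clause, e.g. '; inStock asc, price desc'."""
    if not sort:
        return terms
    clause = ", ".join(f"{field} {direction}" for field, direction in sort)
    return f"{terms}; {clause}"


def select_url(terms, sort=()):
    """Return the GET URL for a query, with the q parameter URL-encoded."""
    return SELECT_URL + "?" + urlencode({"q": build_query(terms, sort)})


if __name__ == "__main__":
    print(select_url("video", [("inStock", "asc"), ("price", "desc")]))
    print(select_url("handheld AND category:camera"))
```

Encoding matters here because the semicolon, comma, and colon all have special meaning in URLs.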
Queries can use a number of other parameters for result set paging, result highlighting, field selection, and more. Specialized
query handlers can be implemented in Java, and Solr currently provides a very useful DisMaxRequestHandler to search several fields
at once, with configurable ranking weight for each one. Selecting a handler is done via a request parameter, assuming the handler has
been made available to Solr by configuration.
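The following Python sketch shows how such parameters might be combined in one request. It assumes the parameter names from Solr's query parameter documentation (start and rows for paging, fl for field selection, hl for highlighting, qt to select a handler, qf for the DisMax field weights) and a handler registered under the name "dismax"; verify these against your own Solr version and configuration. The URL, the name field, and the weights are purely illustrative.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance exposing its select handler at this URL.
SELECT_URL = "http://localhost:8983/solr/select"


def dismax_url(user_query, page=0, page_size=10):
    """Build a GET URL that asks a handler named 'dismax' for one page of results."""
    params = {
        "q": user_query,
        "qt": "dismax",                 # handler name, must exist in the configuration
        "qf": "name^2.0 text^0.5",      # hypothetical fields and ranking weights
        "start": page * page_size,      # offset of the first result to return
        "rows": page_size,              # number of results per page
        "fl": "id,name,price",          # hypothetical list of fields to return
        "hl": "true",                   # request highlighting
    }
    return SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(dismax_url("ipod", page=2))
```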
As with indexing, Solr's searching and sorting are quite powerful and very customizable. Starting with the supplied query handlers should be sufficient for the vast majority of applications, yet one can fairly easily fine-tune the querying process, should the need arise.
Field Types
In Lucene indexes, fields are created as you go; adding a document to an empty index with a numeric field named "price," for example, makes the field instantly searchable, without prior configuration.
When indexing lots of data, however, it is often a good idea to predefine a set of fields and their characteristics to ensure consistent indexing. To allow this, Solr adds a data schema on top of Lucene, where fields, data types, and content analysis chains can be precisely defined.
Here are some examples based on Solr's default schema.xml configuration file. First, a simple string type, indexed and stored as is, without any tokenizing or filtering:
<fieldtype
name="string"
class="solr.StrField"
sortMissingLast="true"/>
The sortMissingLast="true" attribute means that documents where this field is
missing will be sorted after documents with a value.
Then, a numeric and a date field type. The Solr SortableFloatField type causes floats to be converted
to formatted strings for indexing to get a natural sort order:
<fieldtype
name="sfloat"
class="solr.SortableFloatField"
sortMissingLast="true"/>
<fieldtype
name="date"
class="solr.DateField"
sortMissingLast="true"/>
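To see why floats need converting before they can be sorted as strings, consider the Python sketch below. It is not Solr's actual encoding, which we have not reproduced here. It only illustrates the general idea with one well-known trick: take the IEEE 754 bits of the float, flip them so that unsigned integer order matches numeric order, and write the result as a fixed-width string.

```python
import struct


def sortable_key(value: float) -> str:
    """Encode a float as a fixed-width hex string whose string order is numeric order."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    if bits & 0x80000000:
        bits ^= 0xFFFFFFFF   # negative: invert all bits so larger magnitudes sort first
    else:
        bits ^= 0x80000000   # non-negative: set the top bit so it sorts after negatives
    return f"{bits:08x}"


if __name__ == "__main__":
    values = [100.0, -1.0, 2.0, 0.5, -10.5, 0.0]
    # Plain decimal strings do not sort numerically ("100.0" comes before "2.0").
    print(sorted(str(v) for v in values))
    # The encoded keys sort the values in numeric order.
    print(sorted(values, key=sortable_key))
```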
For text fields, we generally need to configure some content analysis (more on that later). Here's a simple example where text is split into words on whitespace, for exact matching of words:
<fieldtype
name="text_ws"
class="solr.TextField">
<analyzer>
<tokenizer
class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldtype>
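The effect of whitespace-only analysis can be approximated in a few lines of Python. This is a simplified model of the behavior, not Solr's tokenizer: with no lowercasing or punctuation filtering, a term matches only if it equals a whole token exactly.

```python
def whitespace_tokens(text):
    """Split text on runs of whitespace, leaving case and punctuation untouched."""
    return text.split()


def matches(field_value, term):
    """Return True if the term equals one of the field's tokens exactly."""
    return term in whitespace_tokens(field_value)


if __name__ == "__main__":
    print(whitespace_tokens("digital  camera accessories"))
    print(matches("digital camera", "camera"))    # exact token match
    print(matches("digital Camera,", "camera"))   # case and comma prevent a match
```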
We can now configure the actual field names, mapping them to the field types that we have defined:
<field
name="id"
type="string"
indexed="true"
stored="true"/>
<field
name="category"
type="text_ws"
indexed="true"
stored="true"/>
<field
name="weight"
type="sfloat"
indexed="true"
stored="true"/>
To avoid losing the free-form indexing provided by Lucene, dynamic fields can also be defined, using field name patterns with wildcards to specify ranges of field names with common properties. Here's another example where all field names ending in _tws are treated as text_ws fields, and field names ending in _dt are treated as date fields.
<dynamicField name="*_tws" type="text_ws" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
The combination of strict schema-based data types with looser wildcard-based types helps in building consistent indexes, while allowing the addition of new strongly typed fields on the fly.
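As an illustration of how a field name might be mapped to a type under such a schema, the Python sketch below checks the explicitly declared fields first and then falls back to the wildcard patterns. The lookup order and the function name are assumptions made for the example; Solr's own resolution logic may differ in its details.

```python
from fnmatch import fnmatchcase

# Explicit fields and dynamic field patterns, mirroring the examples in this section.
FIELDS = {"id": "string", "category": "text_ws", "weight": "sfloat"}
DYNAMIC_FIELDS = [("*_tws", "text_ws"), ("*_dt", "date")]


def resolve_type(field_name):
    """Return the field type for a name, or None if nothing in the schema matches."""
    if field_name in FIELDS:
        return FIELDS[field_name]
    for pattern, type_name in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return type_name
    return None


if __name__ == "__main__":
    for name in ("id", "notes_tws", "created_dt", "color"):
        print(name, "->", resolve_type(name))
```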