XML.com: XML From the Inside Out

Solr: Indexing XML with Lucene and REST
by Bertrand Delacretaz

Index Management

Besides the <add> and <commit> operations, <delete> can be used to remove documents from the index, either by using the document's unique ID:

<delete><id>MA147LL/A</id></delete>

or by using a query to (carefully) erase a range of documents:

<delete><query>category:camera</query></delete>

As with add/update operations, a <delete> must be followed by a <commit> to make the resulting changes visible in queries.
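As a rough sketch of how a client might assemble these messages, the Python below builds the two forms of <delete> and the <commit> that follows them. The function names, the post_update helper, and the update URL are assumptions for illustration and are not part of Solr or of this article; adjust the URL to your own installation.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed location of the update handler; not taken from this page.
UPDATE_URL = "http://localhost:8983/solr/update"


def delete_by_id(doc_id):
    """Build a <delete> message that removes one document by its unique ID."""
    root = ET.Element("delete")
    ET.SubElement(root, "id").text = doc_id
    return ET.tostring(root, encoding="unicode")


def delete_by_query(query):
    """Build a <delete> message that removes every document matching a query."""
    root = ET.Element("delete")
    ET.SubElement(root, "query").text = query
    return ET.tostring(root, encoding="unicode")


def commit():
    """Build the <commit> message that makes earlier changes visible."""
    return ET.tostring(ET.Element("commit"), encoding="unicode")


def post_update(message, url=UPDATE_URL):
    """POST one XML message to the update URL (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# A delete only shows up in query results after the commit has been sent.
messages = [delete_by_id("MA147LL/A"), commit()]
```

Building the messages with an XML library rather than string concatenation means characters such as & or < in a query are escaped automatically.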

The last operation available at the update URL is <optimize>, which triggers an optimization of the Lucene indexes, as explained in the Lucene FAQ. Call it periodically to speed up searches and to reduce the number of segment files that Lucene creates, which helps avoid "out of file handles" errors.

These four simple operations, described in more detail on the Solr UpdateXmlMessages wiki page, are all we need to maintain our indexes.

Searching and Sorting

As shown above, querying is simply a matter of making a GET request with the appropriate query parameters.

The query language used by Solr is based on Lucene queries, with the addition of optional sort clauses in the query. Asking for video; inStock asc, price desc, for example, searches for the word "video" in the default search field and returns results sorted on the inStock field, ascending, and price field, descending.
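To illustrate the sort syntax, here is a small Python sketch that appends sort clauses to a query and URL-encodes the result for a GET request. The helper names and the select URL are assumptions made for this example, not something defined by Solr.

```python
from urllib.parse import urlencode

# Assumed location of the query handler; adjust to your installation.
SELECT_URL = "http://localhost:8983/solr/select"


def build_query(terms, sort=None):
    """Append optional sort clauses to a query string.

    sort is a list of (field, direction) pairs, e.g. [("price", "desc")].
    """
    if not sort:
        return terms
    clauses = ", ".join(f"{field} {direction}" for field, direction in sort)
    return f"{terms}; {clauses}"


def select_url(query):
    """Return the GET URL for a query, with the q parameter URL-encoded."""
    return SELECT_URL + "?" + urlencode({"q": query})


query = build_query("video", [("inStock", "asc"), ("price", "desc")])
url = select_url(query)
```

The semicolon, commas, and spaces in the query all need encoding, which is why the sketch leaves that to urlencode instead of pasting the query into the URL by hand.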

The default search field is specified in Solr's schema.xml configuration file, as in this example:

<defaultSearchField>text</defaultSearchField>

A query can also refer to several fields: handheld AND category:camera, for example, searches the category field in addition to the default search field.

Queries can use a number of other parameters for result set paging, results highlighting, field selection, and more. Specialized query handlers can be implemented in Java, and Solr currently provides a very useful DisMaxRequestHandler to search several fields at once, with configurable ranking weight for each one. Selecting a handler is done via a request parameter, assuming the handler has been made available to Solr by configuration.
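The following Python sketch shows how such parameters might be combined into a query string. The parameter names used here (start and rows for paging, fl for field selection, qt for choosing a handler) are not given in this article and should be checked against the Solr version you run; the handler name "dismax" is likewise an assumption that depends on your configuration.

```python
from urllib.parse import urlencode


def select_params(q, start=0, rows=10, fields=None, handler=None):
    """Build a query string with paging, field selection and handler choice.

    The parameter names are assumptions to verify against your Solr version.
    """
    params = {"q": q, "start": start, "rows": rows}
    if fields:
        # Return only the listed stored fields for each hit.
        params["fl"] = ",".join(fields)
    if handler:
        # Name under which the handler is registered in the configuration.
        params["qt"] = handler
    return urlencode(params)


# Second page of five results, returning only two fields per document.
query_string = select_params(
    "video", start=5, rows=5, fields=["id", "price"], handler="dismax"
)
```

Highlighting parameters are left out of the sketch on purpose, since their names are not given here.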

As with indexing, Solr's searching and sorting are quite powerful and very customizable. Starting with the supplied query handlers should be sufficient for the vast majority of applications, yet one can fairly easily fine-tune the querying process, should the need arise.

Field Types

In Lucene indexes, fields are created as you go; adding a document to an empty index with a numeric field named "price," for example, makes the field instantly searchable, without prior configuration.

When indexing lots of data, however, it is often a good idea to predefine a set of fields and their characteristics to ensure consistent indexing. To allow this, Solr adds a data schema on top of Lucene, where fields, data types, and content analysis chains can be precisely defined.

Here are some examples based on Solr's default schema.xml configuration file. First, a simple string type, indexed and stored as is, without any tokenizing or filtering:

<fieldtype 
  name="string" 
  class="solr.StrField"
  sortMissingLast="true"/>

The sortMissingLast="true" attribute means that documents where this field is missing will be sorted after documents with a value.
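The effect can be pictured with a short Python sketch that sorts plain dictionaries standing in for documents. This only models the behavior described above and is not how Solr implements it; the function name and sample data are made up for illustration.

```python
def sort_missing_last(docs, field, reverse=False):
    """Sort docs on a field, keeping docs without that field at the end."""
    present = [doc for doc in docs if doc.get(field) is not None]
    missing = [doc for doc in docs if doc.get(field) is None]
    # Only documents that have a value take part in the actual ordering.
    ordered = sorted(present, key=lambda doc: doc[field], reverse=reverse)
    return ordered + missing


docs = [
    {"id": "a", "price": 30.0},
    {"id": "b"},  # no price value
    {"id": "c", "price": 10.0},
]
ascending = [doc["id"] for doc in sort_missing_last(docs, "price")]
descending = [doc["id"] for doc in sort_missing_last(docs, "price", reverse=True)]
```

In this model the document without a price ends up last for both sort directions.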

Then, a numeric and a date field type. The Solr SortableFloatField type causes floats to be converted to formatted strings for indexing to get a natural sort order:

<fieldtype 
  name="sfloat" 
  class="solr.SortableFloatField"
  sortMissingLast="true"/>
  
<fieldtype 
  name="date" 
  class="solr.DateField"
  sortMissingLast="true"/>
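To see why floats need converting before they sort naturally as strings, consider the Python sketch below. It uses a common bit-flipping trick on the IEEE 754 representation to produce fixed-width strings whose alphabetical order matches numeric order. This is an illustration of the general idea only, not the encoding that SortableFloatField actually uses.

```python
import struct


def sortable_float_key(value):
    """Encode a float as an 8-character hex string that sorts numerically."""
    # Reinterpret the 32-bit float as an unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    if bits & 0x80000000:
        # Negative: flip every bit so larger magnitudes sort first.
        bits ^= 0xFFFFFFFF
    else:
        # Non-negative: set the sign bit so it sorts after all negatives.
        bits ^= 0x80000000
    return f"{bits:08x}"


values = [10.5, -3.0, 0.0, 2.25, -100.0, 9.0]

# Plain decimal strings sort character by character, so "10.5" lands before "9.0".
naive_order = sorted(values, key=str)

# The encoded keys sort in true numeric order.
natural_order = sorted(values, key=sortable_float_key)
```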

For text fields, we generally need to configure some content analysis (more on that later). Here's a simple example where text is split into words on whitespace, for exact matching of words:

<fieldtype 
  name="text_ws" 
  class="solr.TextField"
>
  <analyzer>
    <tokenizer 
      class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldtype>
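What whitespace-only tokenizing means for matching can be approximated in a few lines of Python. The functions below are a stand-in for illustration, not Solr's tokenizer: they split on whitespace and leave case and punctuation untouched.

```python
def whitespace_tokens(text):
    """Split text on runs of whitespace, leaving each token unchanged."""
    return text.split()


def matches(term, text):
    """Return True only if term equals one of the tokens exactly."""
    return term in whitespace_tokens(text)


tokens = whitespace_tokens("Digital  Camera, 5MP")
```

Because no lowercasing or punctuation stripping takes place, a search for "camera" would not match the token "Camera," in this model, which is what exact matching implies.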

We can now configure the actual field names, mapping them to the field types that we have defined:

<field 
  name="id" 
  type="string" 
  indexed="true" 
  stored="true"/>
  
<field 
  name="category" 
  type="text_ws" 
  indexed="true" 
  stored="true"/>
  
<field 
  name="weight" 
  type="sfloat" 
  indexed="true" 
  stored="true"/>

To avoid losing the free-form indexing provided by Lucene, dynamic fields can also be defined, using field name patterns with wildcards to specify ranges of field names with common properties. Here's another example where all field names ending in _tws are treated as text_ws fields, and field names ending in _dt are treated as date fields.

<dynamicField name="*_tws" type="text_ws" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
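A simplified model of how a field name might be resolved against such a schema is sketched below in Python: explicitly declared fields are checked first, then the wildcard patterns in order. The lookup order and the resolve_type helper are assumptions for illustration; Solr's own matching rules may differ in detail.

```python
from fnmatch import fnmatchcase

# Explicit fields and dynamic field patterns from the examples above.
FIELDS = {"id": "string", "category": "text_ws", "weight": "sfloat"}
DYNAMIC_FIELDS = [("*_tws", "text_ws"), ("*_dt", "date")]


def resolve_type(name):
    """Return the field type for a field name, or None if nothing matches."""
    if name in FIELDS:
        return FIELDS[name]
    for pattern, type_name in DYNAMIC_FIELDS:
        # Case-sensitive wildcard match on the field name.
        if fnmatchcase(name, pattern):
            return type_name
    return None


resolved = {name: resolve_type(name) for name in ["weight", "summary_tws", "released_dt"]}
```

A name such as summary_tws has never been declared, yet it still receives a well-defined type through its suffix.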

The combination of strict schema-based data types with looser wildcard-based types helps in building consistent indexes, while allowing the addition of new strongly typed fields on the fly.
