XML.com 
 Published on XML.com http://www.xml.com/pub/a/2006/08/09/solr-indexing-xml-with-lucene-andrest.html

Solr: Indexing XML with Lucene and REST
By Bertrand Delacretaz
August 09, 2006

Solr (pronounced "solar") builds on the well-known Lucene search engine library to create an enterprise search server with a simple HTTP/XML interface. Using Solr, large collections of documents can be indexed based on strongly typed field definitions, thereby taking advantage of Lucene's powerful full-text search features. This article describes Solr's indexing interface and its main features, and shows how field-type definitions are used for precise content analysis.

Solr began at CNET Networks, where it is used to provide high-relevancy search and faceted browsing capabilities. Although quite new as a public project (the code was first published in January 2006), it is already used for several high-traffic websites.

The project is currently in incubation at the Apache Software Foundation (ASF). This means that it is a candidate for becoming an official project of the ASF, after an observation phase during which the project's community and code are examined for conformity to the ASF's principles (see the incubator homepage for more info).

Solr in Ten Minutes

The tutorial provided on the Solr website gives a good overview of how Solr works and integrates with your system.

To run Solr, you'll need a Java 1.5 virtual machine and, optionally, a scripting environment (a bash shell, or Cygwin if you're running Windows) to run the provided utility and test scripts.

The HTTP/XML interface of the indexer has two main access points: the update URL, which maintains the index, and the select URL, which is used for queries. In the default configuration, they are found at:

http://localhost:8983/solr/update
http://localhost:8983/solr/select

To add a document to the index, we POST an XML representation of the fields to index to the update URL. The XML looks like the example below, with a <field> element for each field to index. Such documents represent the metadata and content of the actual documents or business objects that we're indexing. Any data is indexable as long as it can be converted to this simple format.

<add>
  <doc>
    <field name="id">9885A004</field>
    <field name="name">Canon PowerShot SD500</field>
    <field name="category">camera</field>
    <field name="features">3x optical zoom</field>
    <field name="features">aluminum case</field>
    <field name="weight">6.4</field>
    <field name="price">329.95</field>
  </doc>
</add>

The <add> element tells Solr that we want to add the document to the index (or replace it if it's already indexed), and with the default configuration, the id field is used as a unique identifier for the document. Posting another document with the same id will overwrite existing fields and add new ones to the indexed data.

Note that the added document isn't yet visible in queries. To speed up the addition of multiple documents (an <add> element can contain multiple <doc> elements), changes aren't committed after each document, so we must POST an XML document containing a <commit> element to make our changes visible.

This is all handled by the post.sh script provided in the Solr examples, which uses curl to do the POST. Clients for several languages (Ruby, PHP, Java, Python) are provided on the Solr wiki, but, of course, any language that can do an HTTP POST will be able to talk to Solr.
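
As a rough illustration of the same add-then-commit sequence in another language, here is a minimal Python sketch using only the standard library. The update URL is an assumption (it mirrors the select URL used in this article), and the function names build_add_xml, post_xml, and index_and_commit are hypothetical helpers, not part of Solr or of the clients listed on the wiki.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed update URL, mirroring the select URL used in this article.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(docs):
    """Build an <add> message from a list of documents.

    Each document is a list of (field name, value) pairs, so that a
    field such as "features" can appear more than once.
    """
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def post_xml(xml_text, url=UPDATE_URL):
    """POST an XML message to the update URL and return the response body."""
    request = urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def index_and_commit(docs, url=UPDATE_URL):
    """Add several documents in one message, then commit once."""
    post_xml(build_add_xml(docs), url)
    return post_xml("<commit/>", url)
```

Calling index_and_commit with a list of documents sends a single <add> message followed by a single <commit>, which matches the batching advice above. Building the XML with ElementTree rather than string concatenation means that characters such as & and < in field values are escaped automatically.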

Once we have indexed some data, an HTTP GET on the select URL does the querying. The example below searches for the word "video" in the default search field and asks for the name and id fields to be included in the response.

$ export URL="http://localhost:8983/solr/select/"
$ curl "$URL?indent=on&q=video&fl=name,id"

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <responseHeader>
    <status>0</status><QTime>1</QTime>
  </responseHeader>

  <result numFound="2" start="0">
   <doc>
    <str name="id">MA147LL/A</str>
    <str name="name">Apple 60 GB iPod Black</str>
   </doc>
   <doc>
    <str name="id">EN7800GTX/2DHTV/256M</str>
    <str name="name">ASUS Extreme N7800GTX</str>
   </doc>
  </result>
</response>
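
The same query can be prepared and its response decoded from a script. The following Python sketch builds the select URL and turns a response like the one above into plain dictionaries; build_select_url and parse_results are hypothetical helper names, and the parsing assumes the response layout shown in this example.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Select URL as used in the curl example above.
SELECT_URL = "http://localhost:8983/solr/select/"


def build_select_url(query, fields, base_url=SELECT_URL):
    """Build a select URL asking for the given fields in the response."""
    params = {"indent": "on", "q": query, "fl": ",".join(fields)}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_results(response_xml):
    """Return (numFound, list of {field name: value}) from a response."""
    root = ET.fromstring(response_xml)
    result = root.find("result")
    docs = [
        {field.get("name"): field.text for field in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs
```

Note that urlencode percent-encodes the comma in the fl parameter; the server decodes it back, so the request is equivalent to the curl command shown above.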

As you can imagine, there's much more to this, but those are the basics: POST an XML document to have it indexed, do another POST to commit changes, and make a GET request to query the index.

This simple and thin interface makes it easy to create a system-wide indexing service with Solr: convert the relevant parts of your business objects or documents to the simple XML required for indexing, and index all of your data in a single place--whatever its source--combining full-text and typed fields. At this point, the data-mining area of your brain should start blinking happily--at least mine does!

Now that we have the basics covered, let's examine the indexing and search interfaces in more detail.

Index Management

Besides the <add> and <commit> operations, <delete> can be used to remove documents from the index, either by using the document's unique ID:

<delete><id>MA147LL/A</id></delete>

or by using a query to (carefully) erase a range of documents:

<delete><query>category:camera</query></delete>

As with add/update operations, a <delete> must be followed by a <commit> to make the resulting changes visible in queries.

The last operation available at the update URL is <optimize>, which triggers an optimization of the Lucene indexes, as explained in the Lucene FAQ. This must be called from time to time to speed up searches and reduce the number of segment files created by Lucene, thereby avoiding possible "out of file handles" errors.

These four simple operations, described in more detail on the Solr UpdateXmlMessages wiki page, are all we need to maintain our indexes.
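
For scripts that generate these messages, a small Python sketch along the following lines may help. The helper names are hypothetical; the message formats are the ones shown in this section, and the resulting strings would be POSTed to the update URL like any other update message.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Build a <delete> message for a single document ID."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build a <delete> message for every document matching a query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


# The two remaining operations carry no content.
COMMIT = "<commit/>"
OPTIMIZE = "<optimize/>"
```

Generating the messages this way keeps IDs and queries that contain XML special characters well-formed, which is easy to get wrong with string concatenation.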

Searching and Sorting

As shown above, querying is simply a matter of making a GET request with the appropriate query parameters.

The query language used by Solr is based on Lucene queries, with the addition of optional sort clauses in the query. Asking for video; inStock asc, price desc, for example, searches for the word "video" in the default search field and returns results sorted on the inStock field, ascending, and price field, descending.

The default search field is specified in Solr's schema.xml configuration file, as in this example:

<defaultSearchField>text</defaultSearchField>

A query can refer to several fields: handheld AND category:camera, for example, searches the category field in addition to the default search field.
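
To make the query syntax concrete, here is a small Python sketch that assembles a query string with optional sort clauses and encodes it as the q parameter. The build_query function is a hypothetical helper; the syntax it produces is the one described above.

```python
import urllib.parse


def build_query(terms, sort=()):
    """Append optional sort clauses to a query.

    sort is a sequence of (field, direction) pairs, where direction
    is "asc" or "desc".
    """
    if not sort:
        return terms
    clauses = ", ".join(f"{field} {direction}" for field, direction in sort)
    return f"{terms}; {clauses}"


def encode_query(query):
    """Encode the query as a q=... request parameter."""
    return urllib.parse.urlencode({"q": query})
```

For example, build_query("video", [("inStock", "asc"), ("price", "desc")]) yields the query used in the text, and encode_query takes care of the semicolon, commas, and spaces when the query is placed in a URL.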

Queries can use a number of other parameters for result set paging, results highlighting, field selection, and more. Specialized query handlers can be implemented in Java, and Solr currently provides a very useful DisMaxRequestHandler to search several fields at once, with configurable ranking weight for each one. Selecting a handler is done via a request parameter, assuming the handler has been made available to Solr by configuration.

As with indexing, Solr's searching and sorting are quite powerful and very customizable. Starting with the supplied query handlers should be sufficient for the vast majority of applications, yet one can fairly easily fine-tune the querying process, should the need arise.

Field Types

In Lucene indexes, fields are created as you go; adding a document to an empty index with a numeric field named "price," for example, makes the field instantly searchable, without prior configuration.

When indexing lots of data, however, it is often a good idea to predefine a set of fields and their characteristics to ensure consistent indexing. To allow this, Solr adds a data schema on top of Lucene, where fields, data types, and content analysis chains can be precisely defined.

Here are some examples based on Solr's default schema.xml configuration file. First, a simple string type, indexed and stored as is, without any tokenizing or filtering:

<fieldtype
  name="string"
  class="solr.StrField"
  sortMissingLast="true"/>

The sortMissingLast="true" attribute means that documents where this field is missing will be sorted after documents with a value.
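
The effect of this attribute can be pictured with a few lines of Python. This is only an illustration of the resulting order, not of how Solr implements it, and sort_missing_last is a hypothetical name.

```python
def sort_missing_last(docs, field):
    """Sort documents on a field, placing those without it at the end."""
    present = [doc for doc in docs if field in doc]
    missing = [doc for doc in docs if field not in doc]
    return sorted(present, key=lambda doc: doc[field]) + missing
```

Documents that lack the field keep their relative order at the end of the list, whatever values the other documents hold.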

Then, a numeric and a date field type. The Solr SortableFloatField type causes floats to be converted to formatted strings for indexing to get a natural sort order:

<fieldtype
  name="sfloat"
  class="solr.SortableFloatField"
  sortMissingLast="true"/>

<fieldtype
  name="date"
  class="solr.DateField"
  sortMissingLast="true"/>
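
To see why floats need a special string form, consider that plain strings sort character by character, so "329.95" comes before "6.4". The Python sketch below shows one common way to encode a float as a string whose lexical order matches numeric order. It is an illustration of the idea only and makes no claim to reproduce the encoding that Solr actually uses; sortable_float_key is a hypothetical name.

```python
import struct


def sortable_float_key(value):
    """Encode a float as a hex string that sorts in numeric order."""
    # Reinterpret the 32-bit float as an unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    if bits & 0x80000000:
        bits ^= 0xFFFFFFFF  # negative: flip all bits to reverse the order
    else:
        bits ^= 0x80000000  # positive: set the sign bit to sort after negatives
    return f"{bits:08x}"
```

Sorting a list of weights or prices on this key gives the same order as sorting the numbers themselves, which is the property a sortable field type needs.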

For text fields, we generally need to configure some content analysis (more on that later). Here's a simple example where text is split in words on whitespace, for exact matching of words:

<fieldtype
  name="text_ws"
  class="solr.TextField"
>
  <analyzer>
    <tokenizer
      class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldtype>

We can now configure the actual field names, mapping them to the field types that we have defined:

<field 
  name="id" 
  type="string" 
  indexed="true" 
  stored="true"/>
  
<field 
  name="category" 
  type="text_ws" 
  indexed="true" 
  stored="true"/>
  
<field 
  name="weight" 
  type="sfloat" 
  indexed="true" 
  stored="true"/>

To avoid losing the free-form indexing provided by Lucene, dynamic fields can also be defined, using field name patterns with wildcards to specify ranges of field names with common properties. Here's another example where all field names ending in _tws are treated as text_ws fields, and field names ending in _dt are treated as date fields.

<dynamicField name="*_tws" type="text_ws" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
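
The lookup that such a schema implies can be sketched in a few lines of Python. The tables below copy the field and dynamic field definitions from the examples above; the rule that explicit fields are checked before wildcard patterns is an assumption made for this sketch, and resolve_field_type is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Explicit fields and wildcard patterns from the schema examples above.
FIELDS = {"id": "string", "category": "text_ws", "weight": "sfloat"}
DYNAMIC_FIELDS = [("*_tws", "text_ws"), ("*_dt", "date")]


def resolve_field_type(name):
    """Return the field type for a field name, or None if nothing matches."""
    if name in FIELDS:  # assumed: explicit definitions take precedence
        return FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None
```

With these tables, a field named created_dt resolves to the date type and summary_tws to text_ws, without either name appearing in the schema.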

The combination of strict schema-based data types with looser wildcard-based types helps in building consistent indexes, while allowing the addition of new strongly typed fields on the fly.

Content Analysis

As with most Lucene-based search engines, Solr uses configurable Analyzers, Tokenizers, and Token Filters to process each field's content before indexing it.

In Solr, analysis chains are configured as part of the <fieldtype> definitions shown above. The example configuration is fine for English text, but in my tests I had to configure the following chain for indexing French text:

<fieldtype name="text_fr" class="solr.TextField">
  <analyzer>
    <tokenizer
      class="solr.StandardTokenizerFactory"/>

    <filter
      class="solr.ISOLatin1AccentFilterFactory"/>

    <filter
      class="solr.StandardFilterFactory"/>

    <filter
      class="solr.LowerCaseFilterFactory"/>

    <filter
      class="solr.StopFilterFactory"
      words="french-stopwords.txt"
      ignoreCase="true"/>

    <filter
      class="solr.SnowballPorterFilterFactory"
      language="French"/>

  </analyzer>
</fieldtype>

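With this definition, the content of fields having the text_fr type will be processed as follows: the standard tokenizer splits the text into words; the accent filter replaces accented characters with their unaccented equivalents; the standard filter removes dots from acronyms; the lowercase filter converts all words to lowercase; the stop filter drops the common words listed in french-stopwords.txt, ignoring case; and the Snowball filter reduces the remaining words to their stems, using rules for French.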

This processing makes the index insensitive to case, accents, and singular and plural forms, and avoids indexing irrelevant words.

Solr allows you to define as many such field types as needed, using the provided analysis components, additional components from Lucene, or your own. This gives full control over the way the index is built, and field copy operations during indexing allow you to index the same field in different ways if needed--for example, to allow both case-sensitive and case-insensitive searches.

The same analyzers are applied to queries before passing them to the index, so both the indexed data and queries are "reduced" to their bare essentials to find similarities.

Figure: Solr's content analysis test page

The Field Analysis tool, part of Solr's web-based admin interface, is invaluable in testing and debugging content analysis configurations. As shown on the screenshot, entering a field name, field value, and query text displays the output of the analysis chains one step at a time, and shows where matches would be found.

For our example, we have added the following dynamic field definition to the schema.xml configuration, to have fields with names ending in _tfr use our text_fr field type definition:

<dynamicField name="*_tfr" type="text_fr" indexed="true" stored="true"/>

The screenshot example shows that querying on

A.B. leur dit: montez sur mon cheval cet été!

matches a field of type text_fr containing

Le Châtelain monta sur ses grands chevaux

due to a match on the tokenized, accent-filtered, lowercased, and stemmed words "mont" and "cheval." Both the "monta" and "montez" forms of the verb "monter" (to climb) have been reduced to "mont" by the stemmer, while the plural form "chevaux" (horses) has been reduced to the singular "cheval."
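
The way such a chain brings the query and the indexed text together can be imitated with a toy pipeline in Python. Everything in this sketch is a stand-in: the regular expression is much cruder than Lucene's standard tokenizer (it splits "A.B." into two tokens), the stop word list is a hypothetical handful of words, and toy_stem is a few hard-coded suffix rules rather than the Snowball French stemmer. It only shows the principle of applying the same steps to both sides.

```python
import re
import unicodedata

# Hypothetical, tiny stop word list; a real french-stopwords.txt is much longer.
STOP_WORDS = {"le", "la", "les", "leur", "sur", "ses", "mon", "cet"}


def strip_accents(token):
    """Remove combining accent marks, so that "été" becomes "ete"."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def toy_stem(token):
    """Crude suffix stripping, standing in for a real French stemmer."""
    if token.endswith("aux") and len(token) > 4:
        return token[:-3] + "al"          # chevaux -> cheval
    for suffix in ("ez", "a", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]  # montez, monta -> mont
    return token


def analyze(text):
    """Apply the same steps, in the same order, to any text."""
    tokens = re.findall(r"\w+", text)                    # split into words
    tokens = [strip_accents(t) for t in tokens]          # remove accents
    tokens = [t.lower() for t in tokens]                 # lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return [toy_stem(t) for t in tokens]                 # reduce to stems
```

Running analyze on the query and on the field content above and intersecting the two token sets leaves "mont" and "cheval", the same two matches that the Field Analysis tool reports for the real chain.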

Many sophisticated analyzers and filters are provided by Lucene for various human languages, and the Java interfaces required for embedding your own analysis algorithms are simple to implement, should the need arise.

Other Document Formats

This is all well and good, but we've only been indexing XML until now. How about word processor documents, spreadsheets, images, weather data, or DNA sequences?

As we've seen, the XML format used by Solr for indexing is quite simple. Extracting the relevant metadata to create these XML documents from the many formats floating around, however, is another story. Fortunately, Lucene users have the same problem and have been working on it for quite a while; the Lucene FAQ lists a number of references to parsers and filters which can be used to extract content and metadata from many common document formats.

Solr won't index spreadsheets or other formats out of the box, but that is not its role: you should see Solr as the "search engine" component of a broader "search system," where extraction of content and metadata is handled by other components. This will help to keep your search system maintainable and testable, and it helps the Solr team focus on doing one thing well.

What Next?

At this point, we've seen the basics, but there's much more to Solr. First, there's all the indexing power of Lucene under the hood, with its highly customizable analyzers, similarity searches, controlled ranking of results, faceted browsing, etc.

Also, having been designed for high-traffic systems means that Solr's performance and scalability are already up there with the best. Index replication between search servers is available, Solr's no-nonsense HTTP interface makes it possible to create search clusters using common HTTP load-balancing mechanisms, and powerful internal caches help get the most out of each Solr instance.

For now, the most complete source of information is the Solr wiki and, depending on your tastes, the source code itself, which is fairly readable and well-structured.

If you use Solr, do the team a favor: join the mailing lists, show your support, ask questions, and offer help if you have the time and skills. An important criterion for exiting incubation at the ASF is the creation of a diverse community to guarantee the project's future, so please consider joining--the more the merrier!

Acknowledgments

The author thanks Yonik Seeley (cnet.com, Solr's original author), Andrew Savory (luminas.co.uk), and Marc Chappuis (ludomedia.ch) who reviewed the article, and the Solr community for their great software and support.

XML.com Copyright © 1998-2006 O'Reilly Media, Inc.