XML.com: XML From the Inside Out

Solr: Indexing XML with Lucene and REST

August 09, 2006

Solr (pronounced "solar") builds on the well-known Lucene search engine library to create an enterprise search server with a simple HTTP/XML interface. Solr lets you index large collections of documents based on strongly typed field definitions, taking advantage of Lucene's powerful full-text search features. This article describes Solr's indexing interface and its main features, and shows how field-type definitions are used for precise content analysis.

Solr began at CNET Networks, where it is used to provide high-relevancy search and faceted browsing capabilities. Although quite new as a public project (the code was first published in January 2006), it is already used for several high-traffic websites.

The project is currently in incubation at the Apache Software Foundation (ASF). This means that it is a candidate for becoming an official project of the ASF, after an observation phase during which the project's community and code are examined for conformity to the ASF's principles (see the incubator homepage for more info).

Solr in Ten Minutes

The tutorial provided on the Solr website gives a good overview of how Solr works and integrates with your system.

To run Solr, you'll need a Java 1.5 virtual machine and, optionally, a scripting environment (a bash shell, or Cygwin if you're running Windows) to run the provided utility and test scripts.

The HTTP/XML interface of the indexer has two main access points: the update URL, which maintains the index, and the select URL, which is used for queries. In the default configuration, they are found at:

  • http://localhost:8983/solr/update
  • http://localhost:8983/solr/select

To add a document to the index, we POST an XML representation of the fields to index to the update URL. The XML looks like the example below, with a <field> element for each field to index. Such documents represent the metadata and content of the actual documents or business objects that we're indexing. Any data is indexable as long as it can be converted to this simple format.

<add>
  <doc>
    <field name="id">9885A004</field>
    <field name="name">Canon PowerShot SD500</field>
    <field name="category">camera</field>
    <field name="features">3x optical zoom</field>
    <field name="features">aluminum case</field>
    <field name="weight">6.4</field>
    <field name="price">329.95</field>
  </doc>
</add>

The <add> element tells Solr that we want to add the document to the index (or replace it if it's already indexed), and with the default configuration, the id field is used as a unique identifier for the document. Posting another document with the same id will overwrite existing fields and add new ones to the indexed data.

Note that the added document isn't yet visible in queries. To speed up the addition of multiple documents (an <add> element can contain multiple <doc> elements), changes aren't committed after each document, so we must POST an XML document containing a <commit> element to make our changes visible.

This is all handled by the post.sh script provided in the Solr examples, which uses curl to do the POST. Clients for several languages (Ruby, PHP, Java, Python) are provided on the Solr wiki, but, of course, any language that can do an HTTP POST will be able to talk to Solr.
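To make the round trip concrete, here is a rough sketch of what such a client does, in a few lines of standard-library Python. The function names (`add_xml`, `post`) and the minimal escaping are illustrative, not part of any official Solr client:

```python
import urllib.request

SOLR_UPDATE = "http://localhost:8983/solr/update"  # default update URL

def escape(text):
    """Minimal XML escaping for field values."""
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))

def add_xml(doc):
    """Build an <add> payload for one document.

    `doc` maps field names to a value, or to a list of values for
    multi-valued fields such as "features" in the example above.
    """
    parts = []
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            parts.append('<field name="%s">%s</field>' % (name, escape(str(v))))
    return "<add><doc>%s</doc></add>" % "".join(parts)

def post(url, xml):
    """POST an XML payload to Solr and return the response body."""
    req = urllib.request.Request(
        url, data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Usage (requires a running Solr instance):
#   post(SOLR_UPDATE, add_xml({"id": "9885A004",
#                              "name": "Canon PowerShot SD500",
#                              "features": ["3x optical zoom", "aluminum case"]}))
#   post(SOLR_UPDATE, "<commit/>")   # make the changes visible to queries
```

Note that the `<commit/>` message is just another POST to the same update URL, so the same helper covers both steps.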

Once we have indexed some data, an HTTP GET on the select URL does the querying. The example below searches for the word "video" in the default search field and asks for the name and id fields to be included in the response.

$ export URL="http://localhost:8983/solr/select/"
$ curl "$URL?indent=on&q=video&fl=name,id"

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <responseHeader>
    <status>0</status><QTime>1</QTime>
  </responseHeader>

  <result numFound="2" start="0">
   <doc>
    <str name="id">MA147LL/A</str>
    <str name="name">Apple 60 GB iPod Black</str>
   </doc>
   <doc>
    <str name="id">EN7800GTX/2DHTV/256M</str>
    <str name="name">ASUS Extreme N7800GTX</str>
   </doc>
  </result>
</response>

As you can imagine, there's much more to this, but those are the basics: POST an XML document to have it indexed, do another POST to commit changes, and make a GET request to query the index.
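Since the response is plain XML, any XML library can consume it. As a sketch (the parsing logic here is illustrative, using Python's standard ElementTree module), extracting the matches from the response shown above looks like this:

```python
import xml.etree.ElementTree as ET

# The select response from the example query above.
response = """\
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
  <result numFound="2" start="0">
   <doc>
    <str name="id">MA147LL/A</str>
    <str name="name">Apple 60 GB iPod Black</str>
   </doc>
   <doc>
    <str name="id">EN7800GTX/2DHTV/256M</str>
    <str name="name">ASUS Extreme N7800GTX</str>
   </doc>
  </result>
</response>"""

def parse_results(xml_text):
    """Return (total hit count, list of {field: value} dicts) from a select response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    total = int(result.get("numFound"))
    docs = [{f.get("name"): f.text for f in doc} for doc in result.findall("doc")]
    return total, docs

total, docs = parse_results(response)
print(total, docs[0]["name"])  # → 2 Apple 60 GB iPod Black
```

The `numFound` attribute gives the total number of hits, which may be larger than the number of `<doc>` elements returned when paging parameters are used.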

This simple and thin interface makes it easy to create a system-wide indexing service with Solr: convert the relevant parts of your business objects or documents to the simple XML required for indexing, and index all of your data in a single place--whatever its source--combining full-text and typed fields. At this point, the data-mining area of your brain should start blinking happily--at least mine does!

Now that we have the basics covered, let's examine the indexing and search interfaces in more detail.

