Solr: Indexing XML with Lucene and REST
Solr (pronounced "solar") builds on the well-known Lucene search engine library to create an enterprise search server with a simple HTTP/XML interface. With Solr, you can index large collections of documents against strongly typed field definitions, taking full advantage of Lucene's powerful full-text search features. This article describes Solr's indexing interface and its main features, and shows how field-type definitions are used for precise content analysis.
Solr began at CNET Networks, where it is used to provide high-relevancy search and faceted browsing capabilities. Although quite new as a public project (the code was first published in January 2006), it is already used for several high-traffic websites.
The project is currently in incubation at the Apache Software Foundation (ASF). This means that it is a candidate for becoming an official project of the ASF, after an observation phase during which the project's community and code are examined for conformity to the ASF's principles (see the incubator homepage for more info).
Solr in Ten Minutes
The tutorial provided on the Solr website gives a good overview of how Solr works and integrates with your system.
To run Solr, you'll need a Java 1.5 virtual machine and, optionally, a scripting environment (a bash shell, or Cygwin if you're running Windows) to run the provided utility and test scripts.
The HTTP/XML interface of the indexer has two main access points: the update URL, which maintains the index, and the select URL, which is used for queries. In the default configuration, they are found at:
http://localhost:8983/solr/update
http://localhost:8983/solr/select
To add a document to the index, we POST
an XML representation of the fields to index to the update URL. The XML looks like
the example below, with a <field> element for each field to index. Such documents
represent the metadata and content of the actual documents
or business objects that we're indexing. Any data is indexable as long as it can be converted to this simple format.
<add>
<doc>
<field name="id">9885A004</field>
<field name="name">Canon PowerShot SD500</field>
<field name="category">camera</field>
<field name="features">3x optical zoom</field>
<field name="features">aluminum case</field>
<field name="weight">6.4</field>
<field name="price">329.95</field>
</doc>
</add>
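As a sketch of how a client might produce this payload, the snippet below builds the same `<add>` document programmatically. The helper name `build_add_xml` is illustrative, not part of any Solr client library; it uses a list of (name, value) pairs so that repeated fields such as `features` are preserved.

```python
import xml.etree.ElementTree as ET

def build_add_xml(doc_fields):
    """Build the <add><doc>...</doc></add> payload that Solr's
    update URL expects. doc_fields is a list of (name, value)
    pairs, which allows repeated field names."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

payload = build_add_xml([
    ("id", "9885A004"),
    ("name", "Canon PowerShot SD500"),
    ("category", "camera"),
    ("features", "3x optical zoom"),
    ("features", "aluminum case"),
    ("weight", "6.4"),
    ("price", "329.95"),
])
# POSTing this payload to http://localhost:8983/solr/update
# (Content-Type: text/xml) indexes the document.
```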
The <add> element tells Solr that we want to add the document to the index (or replace it
if it's already indexed), and with the default configuration, the id field is used as a unique
identifier for the document. Posting another document with the same id
replaces the previously indexed document entirely, so any fields you want to keep must be included again in the new version.
Note that the added document isn't yet visible in queries. To speed up the addition of multiple
documents (an <add> element can contain multiple <doc> elements),
changes aren't committed after each document, so
we must POST an XML document containing a <commit> element to make our changes visible.
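The commit step is just another POST of a tiny XML body. The sketch below shows one way to do it with Python's standard library, assuming the default update URL and a running Solr instance; the `post_xml` helper is illustrative, not part of Solr.

```python
import urllib.request

UPDATE_URL = "http://localhost:8983/solr/update"  # default location

def post_xml(url, body):
    """POST an XML body to Solr and return the response text.
    Requires a running Solr instance, so it is not called here."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# After one or more <add> posts, make the changes visible:
commit_body = "<commit/>"
# post_xml(UPDATE_URL, commit_body)
```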
This is all handled by the post.sh script provided in the Solr examples, which uses
curl to
do the POST. Clients for several languages (Ruby, PHP, Java, Python) are provided on the
Solr wiki, but,
of course, any language that can do an HTTP POST will be able to talk to Solr.
Once we have indexed some data, an HTTP GET on the select
URL does the querying. The example
below searches for the word "video" in the default search field and asks
for the name
and id fields to be included in the response.
$ export URL="http://localhost:8983/solr/select/"
$ curl "$URL?indent=on&q=video&fl=name,id"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<responseHeader>
<status>0</status><QTime>1</QTime>
</responseHeader>
<result numFound="2" start="0">
<doc>
<str name="id">MA147LL/A</str>
<str name="name">Apple 60 GB iPod Black</str>
</doc>
<doc>
<str name="id">EN7800GTX/2DHTV/256M</str>
<str name="name">ASUS Extreme N7800GTX</str>
</doc>
</result>
</response>
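Since a query is just an HTTP GET with URL parameters, building one from any language is straightforward. The sketch below assembles the same request as the curl example; `build_query_url` is a hypothetical helper, and the comma in the `fl` value is URL-encoded as `%2C`, which Solr accepts.

```python
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8983/solr/select"  # default location

def build_query_url(q, fields=None, **params):
    """Build a Solr select URL: q is the query string, fields
    becomes the comma-joined fl parameter, and any extra
    keyword arguments pass through as query parameters."""
    if fields:
        params["fl"] = ",".join(fields)
    params["q"] = q
    return SELECT_URL + "?" + urlencode(params)

url = build_query_url("video", fields=["name", "id"], indent="on")
# An HTTP GET on this URL returns the XML response shown above,
# assuming the example documents have been indexed and committed.
```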
As you can imagine, there's much more to this, but those are the basics: POST an XML document to have it indexed, do another POST to commit changes, and make a GET request to query the index.
This simple and thin interface makes it easy to create a system-wide indexing service with Solr: convert the relevant parts of your business objects or documents to the simple XML required for indexing, and index all of your data in a single place--whatever its source--combining full-text and typed fields. At this point, the data-mining area of your brain should start blinking happily--at least mine does!
Now that we have the basics covered, let's examine the indexing and search interfaces in more detail.