Solr: Indexing XML with Lucene and REST
by Bertrand Delacretaz
Content Analysis
As with most Lucene-based search engines, Solr uses configurable Analyzers, Tokenizers, and Token Filters to process each field's content before indexing it.
In Solr, analysis chains are configured as part of the <fieldtype> definitions shown above.
The example configuration is fine for English text,
but in my tests I had to configure the following chain for indexing French text:
<fieldtype name="text_fr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="french-stopwords.txt"
            ignoreCase="true"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="French"/>
  </analyzer>
</fieldtype>
With this definition, the content of fields having the text_fr type will be processed as follows:
- Tokenization splits the text into words.
- Accented characters are converted to their non-accented forms.
- Dots in acronyms are removed by the StandardFilter.
- Words are lowercased.
- Stopwords ("noise words") are removed based on a supplied list of words.
- Words are "stemmed," reducing plural and singular forms to common root forms.
This processing makes the index insensitive to case, accents, and singular and plural forms, and avoids indexing irrelevant words.
Solr allows you to define as many such field types as needed, using the provided analysis components, additional components from Lucene, or your own. This gives you full control over the way the index is built, and field copy operations during indexing allow you to index the same field in different ways if needed--for example, to allow both case-sensitive and case-insensitive searches.
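To illustrate the idea of indexing one field in two ways, here is a minimal Python sketch. It does not use Solr at all: the tiny inverted index, the `build_index` and `search` helpers, and the sample documents are all hypothetical, and whitespace splitting stands in for a real analysis chain.

```python
from collections import defaultdict


def build_index(docs, analyze):
    # Map each term produced by `analyze` to the ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, term):
    # Use .get() so that a lookup never adds an empty entry to the index.
    return index.get(term, set())


# Hypothetical sample documents.
docs = {1: "Solr indexes XML", 2: "solr is fast"}

# The same source text is indexed twice, once as-is and once lowercased,
# much as a copied field with a different field type would be.
exact_index = build_index(docs, lambda text: text.split())
folded_index = build_index(docs, lambda text: text.lower().split())

case_sensitive_hits = search(exact_index, "Solr")
case_insensitive_hits = search(folded_index, "solr")
```

The case-sensitive lookup only finds the document that spells "Solr" with a capital letter, while the lowercased index finds both documents.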
The same analyzers are applied to queries before passing them to the index, so both the indexed data and queries are "reduced" to their bare essentials to find similarities.
[Figure: Solr's content analysis test page]
The Field Analysis tool, part of Solr's web-based admin interface, is invaluable in testing and debugging content analysis configurations. As shown in the screenshot, entering a field name, field value, and query text displays the output of the analysis chains one step at a time, and shows where matches would be found.
For our example, we have added the following dynamic field definition to the schema.xml configuration, to have fields with names ending in _tfr use our text_fr field type definition:
<dynamicField name="*_tfr" type="text_fr" indexed="true" stored="true"/>
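As a rough illustration of how a pattern such as *_tfr can map field names to a field type, here is a small Python sketch. It is not Solr's implementation: the `resolve_field_type` helper and the sample field names are hypothetical, and Python's `fnmatchcase` also treats `?` and `[...]` as wildcards, which goes beyond the simple `*` pattern used here.

```python
from fnmatch import fnmatchcase

# Pattern -> field type, mirroring the dynamicField declaration above.
DYNAMIC_FIELDS = {"*_tfr": "text_fr"}


def resolve_field_type(field_name, dynamic_fields=DYNAMIC_FIELDS):
    # Return the type of the first pattern that matches, or None if none does.
    for pattern, field_type in dynamic_fields.items():
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


# Hypothetical field names: the first ends in _tfr, the second does not.
french_type = resolve_field_type("description_tfr")
other_type = resolve_field_type("description")
```

With this mapping, any field whose name ends in _tfr resolves to text_fr, and other names fall through to None (in a real schema they would have to match an explicit field or another dynamic field).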
The screenshot example shows that querying on
A.B. leur dit: montez sur mon cheval cet été!
matches a field of type text_fr containing
Le Châtelain monta sur ses grands chevaux
due to a match on the tokenized, accent-filtered, lowercased, and stemmed "mont" and "cheval" words. Both the "monta" and "montez" forms of the verb "monter" (to climb) have been reduced to "mont" by the stemmer, while the plural "chevaux" (horses) has been reduced to the singular "cheval."
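The analysis steps and the resulting match can be sketched in plain Python. This is only a toy model, not Solr or Lucene code: the stopword list is a tiny hypothetical stand-in for french-stopwords.txt, and the suffix rules are a deliberately crude substitute for the Snowball French stemmer that only handles the words in this example.

```python
import re
import unicodedata

# Hypothetical, tiny stopword list; a real french-stopwords.txt is far longer.
STOPWORDS = {"le", "la", "les", "sur", "ses", "mon", "cet", "leur"}

# Toy suffix rules standing in for the Snowball French stemmer.
# They are NOT the real algorithm and only cover the words used here.
SUFFIX_RULES = [("aux", "al"), ("ez", ""), ("a", ""), ("e", ""), ("s", "")]


def tokenize(text):
    # Keep dotted acronyms such as "A.B." as one token, else split on words.
    return re.findall(r"(?:\w\.){2,}|\w+", text)


def fold_accents(token):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def stem(token):
    # Apply the first matching rule that leaves a stem of at least 3 letters.
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)] + replacement
    return token


def analyze(text):
    # Same order as the text_fr chain: tokenize, fold accents,
    # remove acronym dots, lowercase, drop stopwords, stem.
    tokens = tokenize(text)
    tokens = [fold_accents(t) for t in tokens]
    tokens = [t.replace(".", "") for t in tokens]
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [stem(t) for t in tokens]


indexed = analyze("Le Châtelain monta sur ses grands chevaux")
query = analyze("A.B. leur dit: montez sur mon cheval cet été!")

# Terms present on both sides are where a match would be found.
shared = set(indexed) & set(query)
print(sorted(shared))
```

Under these toy rules the only terms shared by the query and the field are "cheval" and "mont", which mirrors the match described above. The real Snowball stemmer applies many more rules and may produce different stems for other words.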
Many sophisticated analyzers and filters are provided by Lucene for various human languages, and the Java interfaces required for embedding your own analysis algorithms are simple to implement, should the need arise.
Other Document Formats
This is all well and good, but we've only been indexing XML until now. How about word processor documents, spreadsheets, images, weather data, or DNA sequences?
As we've seen, the XML format used by Solr for indexing is quite simple. Extracting the relevant metadata to create these XML documents from the many formats floating around, however, is another story. Fortunately, Lucene users have the same problem and have been working on it for quite a while; the Lucene FAQ lists a number of references to parsers and filters which can be used to extract content and metadata from many common document formats.
Solr won't index spreadsheets or other formats out of the box, but that is not its role: you should see Solr as the "search engine" component of a broader "search system," where extraction of content and metadata is handled by other components. This will help to keep your search system maintainable and testable, and it helps the Solr team focus on doing one thing well.
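As a sketch of what such an extraction component might hand over to Solr, the following Python snippet turns already-extracted fields into an XML document for indexing. The `to_solr_add_xml` helper and the field names ("id", "body_tfr") are assumptions made for illustration; the actual extraction from a spreadsheet or word processor file is left out entirely.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(documents):
    # Build <add><doc><field name="...">value</field>...</doc>...</add>.
    add = ET.Element("add")
    for fields in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    # encoding="unicode" returns a str; ElementTree escapes &, < and >.
    return ET.tostring(add, encoding="unicode")


# Hypothetical output of an upstream extractor: one dict per document.
extracted = [
    {"id": "doc-1", "body_tfr": "Le Châtelain monta sur ses grands chevaux"},
    {"id": "doc-2", "body_tfr": "Tom & Jerry"},
]

add_xml = to_solr_add_xml(extracted)
```

Keeping this step separate from Solr means the extractor can be tested on its own, by checking the XML it produces, without a running search server.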
What Next?
At this point, we've seen the basics, but there's much more to Solr. First, there's all the indexing power of Lucene under the hood, with its highly customizable analyzers, similarity searches, controlled ranking of results, faceted browsing, etc.
Also, Solr was designed for high-traffic systems, so its performance and scalability are already up there with the best. Index replication between search servers is available, Solr's no-nonsense HTTP interface makes it possible to create search clusters using common HTTP load-balancing mechanisms, and powerful internal caches help get the most out of each Solr instance.
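Because queries are plain HTTP requests, spreading them over several replicated servers can be as simple as rotating the base URL. The Python sketch below only builds the query URLs and sends nothing over the network; the host names are hypothetical, and a production setup would more likely rely on a dedicated HTTP load balancer than on client-side rotation.

```python
from itertools import cycle
from urllib.parse import urlencode


class RoundRobinSolr:
    """Builds select URLs, rotating over a list of Solr base URLs."""

    def __init__(self, base_urls):
        self._servers = cycle(base_urls)

    def select_url(self, query, rows=10):
        # Each call uses the next server in the rotation.
        params = urlencode({"q": query, "rows": rows})
        return f"{next(self._servers)}/select?{params}"


# Hypothetical replicated search servers.
balancer = RoundRobinSolr([
    "http://search1.example.com:8983/solr",
    "http://search2.example.com:8983/solr",
])

urls = [balancer.select_url("cheval") for _ in range(3)]
```

The third URL points back at the first server, since the rotation wraps around after the last entry.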
For now, the most complete source of information is the Solr wiki and, depending on your tastes, the source code itself, which is fairly readable and well-structured.
If you use Solr, do the team a favor: join the mailing lists, show your support, ask questions, and offer help if you have the time and skills. An important criterion for exiting incubation at the ASF is the creation of a diverse community to guarantee the project's future, so please consider joining--the more the merrier!
References
- The Solr project homepage at apache.org.
- The Solr tutorial.
- The Solr wiki.
- The list of Solr public servers.
- The Lucene project and its FAQ.
Acknowledgments
The author thanks Yonik Seeley (cnet.com, Solr's original author), Andrew Savory (luminas.co.uk), and Marc Chappuis (ludomedia.ch), who reviewed the article, and the Solr community for their great software and support.