XML.com: XML From the Inside Out

Using Python, Jython, and Lucene to Search Outlook Email

May 13, 2003

A few days ago, during three different phone calls, I had to wait while somebody fumbled around in their email looking for a message. And one of the fumblers was me. Something had to give. Last fall, I wrote about ZOË, an innovative (and free) Java-based local web service that indexes your email in a client-independent fashion, building usefully cross-linked views of messages, attachments, and addresses. I've used it with Outlook on Windows and with Mail and Mozilla on Mac OS X. But I haven't managed to integrate it with the fairly elaborate foldering and filtering that I do in Outlook.

As the comments on the ZOË article indicate, there are lots of ways to search email. Several people mentioned Lotus Notes, which indeed has always had an excellent indexed search feature. Nobody mentioned Evolution, which does too. But until Chandler becomes dogfood, Outlook seems likely to remain my primary personal information manager. I like it pretty well as a PIM, actually, and I like it a whole lot more since I started using Mark Hammond's amazing SpamBayes Outlook add-in.

The problem with using ZOË for search is that it sees my messages before my Outlook filters can move them into folders and before SpamBayes can separate the ham from the spam. Outlook itself, of course, is shockingly bad at search. I never thought I'd find myself digging around in my Outlook message store, but Mark's SpamBayes add-in -- which is written in Python -- turns out to be a great Python/MAPI tutorial. Borrowing heavily from his examples, I came up with a script to extract my Outlook mail to a bunch of files that I could feed to a standalone indexer. It relies on the standard MAPI support in PythonWin, and also on the mapi_driver module included with SpamBayes.

Here's the extractor:

import os, sys
import binascii
import win32com.client
from win32com.mapi import mapi, mapiutil
from win32com.mapi.mapitags import *
import mapi_driver
import re

home = "/jon/mail"

fields = (
  'PR_SENDER_NAME_A',
  'PR_DISPLAY_TO_A',
  'PR_SUBJECT_A',
  'PR_RECORD_KEY',
  'PR_BODY_A',
  'PR_ENTRYID',
  'PR_CREATION_TIME'
  )

skipfolders = (
  'Spam',
  'MaybeSpam',
  )

def getFields(obj):
  tags = obj.GetPropList(0)
  hr, data = obj.GetProps(tags)
  ret = []
  for tag, val in data:
    name = mapiutil.GetPropTagName(tag)
    if ( name in fields ):
      ret.append((name, val))
  return ret

def getValues(obj, longname):

  longname = re.sub ('\\\\', '/', longname)
  fldr = "Folder: %s\n" % longname

  for prop_name, prop_val in getFields(obj):

    if ( prop_name == 'PR_SENDER_NAME_A' ):
      fro = "From: %s\n" % prop_val

    if ( prop_name == 'PR_DISPLAY_TO_A' ):
      to = "To: %s\n" % prop_val

    if ( prop_name == 'PR_SUBJECT_A' ):
      subj = "Subject: %s\n" % prop_val

    if ( prop_name == 'PR_CREATION_TIME' ):
      date = "Date: %s\n" % prop_val.Format()

    if ( prop_name == 'PR_ENTRYID' ):
      eid = "EntryID: %s\n" % binascii.hexlify(prop_val)

    if ( prop_name == 'PR_BODY_A' ):
      body = "\n%s" % prop_val
      body = re.sub ( '\r+', '', body )

    if ( prop_name == 'PR_RECORD_KEY' ):
      id = binascii.hexlify(prop_val)

  try:
    ret = (id, fro + to + subj + date + fldr + eid + body)
  except:
    # A message missing any of the wanted properties leaves a name unbound; skip it.
    return False

  return ret

def scan(driver, longname, mapi_folder):
  for item in driver.GetAllItems(mapi_folder):
    l = getValues(item, longname)
    if ( l == False ):
      continue
    id, msg = l[0], l[1]
    path = "%s/%s.txt" % (home, id)
    # Only write messages that haven't already been extracted.
    if not os.path.exists(path):
      print "Adding %s" % path
      f = open(path, 'w')
      f.write(msg)
      f.close()

def enum( driver, folder, path):
  # Walk the Outlook folder tree depth-first, extracting each folder's messages.
  folders = getattr(folder, "Folders")
  # Outlook COM collections are indexed from 1, not 0.
  for i in range(1, len(folders)+1):
    subfolder = folders[i]
    name = getattr(subfolder, "Name")
    if ( name in skipfolders ):
      print "skipping %s" % name
    else:
      print "%s\\%s" % ( path, name )
      longname = path + '\\' + name
      enum( driver, subfolder, longname)
      scan (driver, longname, driver.FindFolder( longname ) )

def main():
  outlook = win32com.client.Dispatch("Outlook.Application")
  driver = mapi_driver.MAPIDriver()
  root = outlook.GetNamespace("MAPI")
  enum( driver, root, '')

if __name__ == "__main__":
  main()

There are, to be sure, other ways to haul your messages out of an Outlook PST file. There's LibPST, though it's ominously on hold due to legal issues at the moment. You can also use Mozilla. But I'm glad to see that there's a clean Python encapsulation of the hideous MAPI interface. Using it enables me to add some ersatz headers (Folder:, EntryID:) to the messages I write out to the filesystem. These extra headers make convenient hooks for navigation and search.
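As a rough illustration of how those ersatz headers can serve as hooks, here is a minimal sketch that reads one extracted message back into a dictionary of headers plus a body. It assumes the layout the extractor writes (header lines, a blank line, then the body). The function name parse_message and the sample message are made up for the example, and it runs on plain Python with no Outlook or Lucene installed.

```python
def parse_message(text):
    """Split an extracted message into (headers, body).

    Assumes 'Name: value' header lines, then a blank line, then the body,
    which is the layout the extractor script writes.
    """
    head, _, body = text.partition("\n\n")
    headers = {}
    for line in head.split("\n"):
        name, sep, value = line.partition(": ")
        if sep:  # ignore any line that is not a 'Name: value' pair
            headers[name] = value
    return headers, body


# Invented sample data, shaped like one of the extractor's output files.
sample = (
    "From: Jon Udell\n"
    "To: Mark Hammond\n"
    "Subject: SpamBayes and MAPI\n"
    "Date: 05/13/03 09:30:00\n"
    "Folder: /Personal Folders/Inbox\n"
    "EntryID: 00000000a1b2c3\n"
    "\n"
    "Thanks for the add-in.\n"
)

headers, body = parse_message(sample)
print(headers["Folder"])
print(headers["EntryID"])
```

With the headers in a dictionary, grouping results by Folder or building a link from EntryID becomes a lookup rather than a regular-expression search.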

Indexing and searching with Lucene and Jython

I've been wanting to give Lucene a try and this seemed the perfect opportunity. The project's own documentation and examples make it easy to get started. See also Otis Gospodnetic's articles for more details on indexing files with Lucene.

At the same time, I've been wanting to get familiar with Jython, the Java-based implementation of Python. In his article Tips for Scripting Java with Jython, Noel Rappin wrote:

Maybe you're writing a standalone program, and you'd like to use the large variety of tools already written for Java, but you'd also like to use a tool that makes your program 25-50 percent shorter, and easier to write and maintain.

Yup, that was me. I wanted to use the Lucene JAR files, but programming a search engine requires a lot of fiddling around with options and a lot of ancillary text processing, and I knew I'd rather script those things than write them in Java.

The sample indexer provided in \lucene-1.2\src\demo\org\apache\lucene\demo is pretty simple. In Jython, it's even simpler:

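from org.apache.lucene.analysis.standard import *
from org.apache.lucene.index import *
from org.apache.lucene.demo import *
import java.io.File
path = '/jon/mail'
# The third argument (1, i.e. true) creates a new index, replacing any existing one.
writer = IndexWriter("index", StandardAnalyzer(), 1)
dir = java.io.File (path)
list = dir.list()
for item in list:
    name = "%s/%s" % (path, item)
    file = java.io.File( name )
    # FileDocument is the demo helper that turns a file into a Lucene Document.
    writer.addDocument(FileDocument.Document(file))
print writer.docCount()
# Merge the index segments for faster searching, then close the index.
writer.optimize()
writer.close()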

It's so nice to be able to import Java classes and use them in Python, collapsing Java idioms like

String[] list = file.list();
for (int i = 0; i < list.length; i++)
  ...

into Python idioms like

list = dir.list()
for item in list:
  ...

The real payoff for me has been the search script. I've been tweaking it almost continually since I wrote it, because there are so many variables in play, including:

  • The number of hits to display, and the amount of context to show for each.

  • How to display hits containing complex terms, like "Java Python"~20 (i.e., Java within 20 words of Python). Currently, I report these hits as headers only.

  • Whether and how to do alternative sorting of results. Currently, I just take Lucene's default relevance ranking.

  • How to link from the results back into the corpus. Currently, I use outlook: URLs to point to the folders containing the found messages, and to the messages themselves.

  • How to adapt when searches produce too few or too many results. I like the idea of automatically trying multiple search strategies -- for example, boolean and proximity -- and suggesting a "best" strategy for a given search.
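The last idea in the list, trying several strategies and suggesting a "best" one, might be sketched as follows. This is only an illustration under stated assumptions: candidate_queries, pick_strategy, and the min_hits/max_hits thresholds are hypothetical names and values, and count_hits stands in for whatever function runs a query string and returns the number of hits. The query strings use Lucene's +term (required) and "phrase"~N (proximity) syntax.

```python
def candidate_queries(terms, slop=20):
    """Return (label, query string) pairs, strictest strategy first."""
    return [
        ("proximity", '"%s"~%d' % (" ".join(terms), slop)),
        ("boolean AND", " ".join("+" + t for t in terms)),
        ("boolean OR", " ".join(terms)),
    ]


def pick_strategy(terms, count_hits, min_hits=1, max_hits=50):
    """Run every candidate and return the (label, query, hits) tuple whose
    hit count is closest to the [min_hits, max_hits] range.

    count_hits is a hypothetical callable: query string -> number of hits.
    Ties go to the stricter strategy.
    """
    def distance(n):
        if n < min_hits:
            return min_hits - n
        if n > max_hits:
            return n - max_hits
        return 0

    results = [(label, q, count_hits(q)) for label, q in candidate_queries(terms)]
    return min(results, key=lambda r: distance(r[2]))


# Invented hit counts standing in for a real index.
fake_counts = {
    '"Java Python"~20': 0,
    '+Java +Python': 12,
    'Java Python': 340,
}
label, query, n = pick_strategy(["Java", "Python"], fake_counts.get)
print("%s: %s (%d hits)" % (label, query, n))
```

In a Jython script, count_hits could wrap the searcher and return hits.length() for each query string; that wiring is left out here so the sketch runs without Lucene.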

A personal search engine is a wonderful thing, because you get to play around with all these variables. You definitely want to be doing that experimentation in an agile language, though. Jython seems like an ideal way to interact with Lucene. Here's the current version of the search script.

import re, sys
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.document.Document
from org.apache.lucene.analysis.standard import *
from org.apache.lucene.search import *
from org.apache.lucene.queryParser import *

searcher = IndexSearcher ( "index" )
analyzer = StandardAnalyzer ()
term = sys.argv[1]

# Build the query from the search term, against the "contents" field
# that the demo indexer populates.
query = QueryParser.parse ( term, "contents", analyzer )
hits = searcher.search ( query )

print "<div>Searching for: %s, " % query.toString ( "contents" )
print "%s matching docs</div>" % hits.length()

for i in range ( hits.length() ):

  path = hits.doc(i).getField('path').stringValue()
  f = open ( path )
  s = f.read()
  f.close() 

  subj = re.findall ( 'Subject:.+' , s )[0]
  to   = re.findall ( 'To:.+'      , s )[0]
  date = re.findall ( 'Date:.+'    , s )[0]

  fro  = re.findall ( 'From:.+'    , s )[0]
  fro  = re.sub ( ' (.+)', ' <b>\g<1></b>' , fro )

  fldr = re.findall ( 'Folder:.+'  , s )[0]
  fldr = re.sub ( ' (.+)', 
          ' <a href=\"outlook:\g<1>">\g<1></a>' , fldr )

  eid  = re.findall ( 'EntryID:.+' , s )[0]
  eid = re.sub ( ' (.+)', 
          ' <a href=\"outlook:\g<1>">\g<1></a>' , eid )

  print "<pre>%s\n%s\n%s\n%s\n%s\n%s\n</pre>"  % (to, 
          fro, date, fldr, eid, subj)

  s = re.sub ( '[\n\r]+', ' ', s )
  s = re.sub ( '[\"]+', '', s )
  term = re.sub ( '[\"~]+', '', term )

  kwicpat = re.compile ( '\s[^\x00]{10,50}' + term + 
           '[^\x00]{10,50}\s' , re.I )

  kwic = kwicpat.findall ( s )

  if ( len(kwic) > 0 ):
      print "<ul>"
      # Avoid reusing the outer loop's index variable for the snippets.
      for snippet in kwic:
          print "<li>%s</li>" % snippet
      print "</ul>"

And here's a page of output:

Figure: A page of output from Lucene

Two nations divided by a common language

    

It's interesting to note that the two halves of this project -- the mail extractor, and the indexer/searcher -- share a common language but inhabit two very different environments. Jython can't talk directly to Outlook's COM interface, and Python can't talk directly to Lucene. In this case, I've interposed the file system between the Windows and Java environments. With a lot more work, I could (well, someone could) attach both to a web services bus that either dialect of Python could easily connect to.

A few years back, I guessed that by now we'd be seeing aggressive exposure of .NET interfaces in Office apps, better support for dynamically-typed languages (such as Python) in the .NET Common Language Infrastructure, and .NET versions of open source Java projects such as Lucene. Well, we're one for three: NLucene exists.

Maybe the web services bus is the only solution. Of course, libraries such as Lucene, and applications such as Outlook, are today delivered as JAR/DLL/EXE files, not as services. How that might change, and how (or whether) the Java and .NET virtual machines can natively support agile languages, are two questions I wish I could answer.


