Working with Bayesian Categorizers

November 19, 2003

Jon Udell

Every once in a while, but never as often as you'd wish, a technology comes along that profoundly improves your life. For me the most recent example of such a thing is SpamBayes. More specifically, it's the combination of the SpamBayes engine, an open source email categorizer with some innovative twists on Paul Graham's original plan, and Mark Hammond's Outlook add-in, which smoothly integrates training with normal use of the email client.

Months ago I wrote about how SpamBayes has solved my spam problem more effectively than I thought a pure content-based filter could. Time was the ultimate test, though. Would this razor lose its edge? It hasn't. Every day I sharpen it. A frightening majority of my messages -- on the order of several hundred per day -- are clearly spam. A depressing minority -- maybe fifteen or twenty per day -- are clearly ham (i.e., not-spam). And then there's a scattering of in-betweens, messages that SpamBayes can't confidently classify one way or the other. It routes these to a MaybeSpam folder and offers buttons for two actions: Delete as Spam, Recover from Spam.

This arrangement is a wonderful example of the kind of synergy that's possible between an automated assistant and a human overseer. Although I still review my Spam folder for false positives, it's devolved into a routine exercise that never requires thought and only very rarely requires action. Reviewing the MaybeSpam folder, on the other hand, always requires thought and action. In theory the path of least resistance should lead me to neglect that chore. In practice it doesn't, for two reasons. First, it's not much effort to click one or the other of the buttons a handful of times a day. Second, and more profoundly, the classification puzzles that SpamBayes presents in my MaybeSpam folder are interesting. Only sometimes do I think: "Why was it confused about that?" More often I think: "Yup, I can see it both ways."

If it were only a matter of the typical body-part-enlargement and Nigerian-scam messages that everyone gets, there wouldn't be so much grey area. But as a high-tech journalist I'm the target of lots of tech-oriented promotional email too. I want (actually, need) to prioritize the tech-oriented messages that are interesting to me and suppress the ones that aren't. It's a subtle discrimination because on a given day messages in both categories might arrive from the same legitimate sender, with the same trail of SMTP headers. By making a choice between Delete as Spam and Recover from Spam I teach SpamBayes about my interests and non-interests in a way that would be quite difficult to articulate, particularly since my interests and non-interests change over time.

Categorizing blog content

There's been some discussion in the blog world about using a Bayesian categorizer to enable a person to discriminate along various interest/non-interest axes. I took a run at this recently and, although my experiments haven't been wildly successful, I want to report them because I think the idea may have merit.

For starters, I looked for tools that would enable me to train and test a categorizer. I found two that were pretty easy to work with. The first is called Bow (for Bag of Words), a code library for statistical language modeling written by Carnegie Mellon's Andrew McCallum. Rainbow, a program based on the Bow library, can train and test Bayesian (or other) classifiers. This software is widely available; I used fink to install Bow and Rainbow on Mac OS X.

The second tool was Ken Williams' Perl CPAN module, AI::Categorizer, which relies on Williams' Algorithm::NaiveBayes and also on Benjamin Franz's and Jim Richardson's Lingua::Stem, a framework for word stemming that's localized for a few different languages. With all these dependencies the formidable CPAN installer had to chug for a while, but in the end it succeeded.
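
If you want to reproduce that setup, the standard CPAN incantation should be all it takes; depending on how CPAN is configured it will either follow the prerequisites automatically or ask about them first:

bash$ perl -MCPAN -e 'install AI::Categorizer'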

In order to test both tools on the same dataset, I decided to let Rainbow take the lead and adapt AI::Categorizer to Rainbow's directory-oriented style. So I started with a directory called $HOME/train, and created per-category subdirectories under it.
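
The layout is nothing fancy -- just one subdirectory per category, using names like the ones that show up later in this column:

$HOME/train/
  blogging/
  books/
  data_management/
  rss/
  ...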

For test data I started with a set of items from my weblog content. I keep a single XML file containing the entries written since I began enforcing a strict XHTML discipline -- about 150 so far. I use the file for XPath-based search, but it's handy for other things too. For this experiment, I wrote a small Python script to break out the entries into individual XHTML files, using the titles of the entries (which are long and descriptive) as filenames. This arrangement made it pretty easy to review entries in the Finder, read them in a browser when the titles weren't sufficiently descriptive, and copy them into appropriate subdirectories under the training directory.
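
The script itself isn't worth reproducing, but a minimal sketch of the idea looks something like this. (It assumes the container file wraps a series of entry elements, each with a title child; the element names in my real file may differ.)

# splitEntries.py -- a minimal sketch, not the script I actually used.
# Assumes weblog.xml wraps <entry> elements, each with a <title> child;
# the element names in the real file may differ.
import re
import xml.etree.ElementTree as ET

tree = ET.parse('weblog.xml')
for entry in tree.getroot().findall('entry'):
    title = (entry.findtext('title') or 'untitled').strip()
    # The long, descriptive title becomes the filename.
    fname = re.sub(r'[^A-Za-z0-9]+', '_', title) + '.html'
    with open(fname, 'w') as f:
        f.write(ET.tostring(entry, encoding='unicode'))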

After classifying a first batch of entries, I initialized Rainbow's training database like so:

bash$ rainbow -H -i $HOME/train/*
Class `blogging'
  Gathering stats... files : unique-words ::     12 :     1743
Class `books'
  Gathering stats... files : unique-words ::      4 :     1945
...

The -H argument tells Rainbow to skip HTML tokens, and -i tells it to index the specified set of subdirectories. Subsequently, I followed this procedure:

  1. Copy a new entry to the category (i.e. subdirectory) where I thought it belonged.

  2. Test to see if Rainbow predicted that category.

  3. Retrain.

For example, I fed Rainbow the contents of this column as I was writing it. The category I would have picked for it is data_management. After naming my working draft of the column-in-progress 'autoCategorize.html' -- and copying it to $HOME/train/data_management/autoCategorize.html -- I ran this command:

rainbow -x $HOME/train | grep autoCategorize.html

That produced a classification for each of the files under the training directory; grepping for autoCategorize.html isolated just the classification of that file:

/Users/jon/train/data_management/autoCategorize.html data_management rss:0.9999999395 data_management:6.043669638e-08 email:1.076055135e-11 swdev:4.499116623e-12 blogging:7.956561148e-18 markup:1.973268953e-26 services:8.758472658e-34 identity:0 calendaring:0 browser:0 opensource:0 voice_video_communication:0 collaboration:0 security:0 networking:0 os:0 people:0 hci:0 books:0 vm:0 policy:0 zope:0 location:0

Since the system wasn't yet trained on the file, this result was a prediction. It says that the three most likely categories are rss, data_management, and email. The first, rss, is a poor result. My choice, data_management, came second. Given the discussion of SpamBayes in this column, it seems reasonable that the email category came third. As for the remaining categories seen as having some relationship to this column, the connections are weak but plausible. Conversely the categories scoring zero are plausibly unrelated to this column, with the exception of the opensource category, which the discussion of SpamBayes -- an open source project -- should ideally have triggered. Given that there were only about 150 documents in the training set at the time, spread across 23 categories, I'm sure it's unreasonable to expect better precision. My SpamBayes database, by contrast, has only two categories to worry about, and thousands of samples in each.
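
For readers who haven't looked inside a naive Bayes classifier, scores like the ones in that dump are, in essence, normalized products of per-word probabilities. Here's a toy sketch of the arithmetic -- the miniature training set is invented, and a real implementation like Rainbow's adds refinements -- showing how a few telltale words can let one category swamp all the others:

# toy_nb.py -- a toy multinomial naive Bayes scorer, meant only to
# illustrate the arithmetic behind per-class scores like the ones in
# the Rainbow dump above. The training snippets are invented.
import math
from collections import Counter

train = {
    'rss':             ['rss feed xml aggregator subscribe'],
    'data_management': ['xml data query database xpath'],
    'email':           ['spam filter inbox message header'],
}

vocab = set()
word_counts = {}   # per-category word frequencies
totals = {}        # per-category total word count
for cat, docs in train.items():
    c = Counter(w for d in docs for w in d.split())
    word_counts[cat], totals[cat] = c, sum(c.values())
    vocab |= set(c)

def scores(text):
    """Return a normalized probability for each category, given some text."""
    logp = {}
    for cat in train:
        lp = 0.0   # uniform prior over categories
        for w in text.split():
            # Laplace-smoothed likelihood of this word in this category.
            lp += math.log((word_counts[cat][w] + 1) /
                           (totals[cat] + len(vocab)))
        logp[cat] = lp
    # Turn the log-probabilities into a distribution that sums to 1.
    m = max(logp.values())
    exps = {c: math.exp(v - m) for c, v in logp.items()}
    z = sum(exps.values())
    return {c: v / z for c, v in exps.items()}

print(scores('an rss feed stored in a database'))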

Next I retrained the system to incorporate the new file (rainbow -H -i $HOME/train/*) and reran the classification dump. Now the results for autoCategorize.html looked like this:

/Users/jon/train/data_management/autoCategorize.html data_management data_management:1 rss:0 email:0 services:0 swdev:0 blogging:0 markup:0 identity:0 calendaring:0 browser:0 collaboration:0 opensource:0 voice_video_communication:0 networking:0 os:0 security:0 people:0 hci:0 books:0 policy:0 vm:0 zope:0 location:0

In other words, the system now knows unambiguously how I would prefer to classify this column. The frequencies of words appearing in this column will influence future classification.

As you test and classify files one at a time, you get a general sense of how you're doing, but Rainbow can provide a more explicit scoreboard. For example, this command:

rainbow --test-percentage 50  --test 1 | rainbow_stats

asks Rainbow to run a single iteration (--test 1) of a test that randomly holds out half the categorized files for testing (--test-percentage 50) and trains on the rest. The report looks like this:

Correct: 24 out of 73 (32.88 percent accuracy)

 - Confusion details, row is actual, column is predicted
                   classname   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  :total
 0                  blogging   3   .   .   .   .   2   1   .   .   .   1   .   .   .   .   .   .   .   .   .   .   .   .  :  7  42.86%
 1                     books   .   .   .   .   .   1   .   .   .   .   .   .   .   .   .   .   .   .   .   2   .   .   .  :  3   0.00%
 2                   browser   .   .   .   .   .   4   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  :  4   0.00%
 3               calendaring   .   .   .   1   .   1   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  :  2  50.00%
 4             collaboration   .   .   .   .   1   1   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  :  2  50.00%
 5           data_management   .   .   .   .   .   3   .   .   1   .   2   .   .   .   .   .   2   .   .   .   .   .   .  :  8  37.50%
 6                     email   .   .   .   .   .   1   1   .   .   .   .   .   .   .   .   .   1   .   .   .   .   .   .  :  3  33.33%
 7                       hci   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   2   .   .   .  :  2   0.00%
 8                  identity   .   .   .   .   .   .   .   .   1   .   .   .   .   .   .   .   1   .   1   .   .   .   .  :  3  33.33%
 9                  location   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  :  .
10                    markup   .   .   .   .   .   3   .   .   .   .   1   .   .   .   .   .   .   .   .   .   .   .   .  :  4  25.00%
11                networking   1   .   .   .   .   1   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  :  2   0.00%
12                opensource   .   .   .   .   .   .   .   .   .   .   .   .   1   .   .   .   .   .   .   1   .   .   .  :  2  50.00%
13                        os   .   .   .   .   .   1   .   .   .   .   .   .   .   .   .   .   .   .   .   1   .   .   .  :  2   0.00%
14                    people   1   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .  :  1   0.00%
15                    policy   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   1   .   .   .   .   .   .  :  1   0.00%
16                       rss   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   6   .   .   .   .   .   .  :  6 100.00%
17                  security   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   1   .   1   .   .   .  :  2  50.00%
18                  services   .   .   .   .   .   1   .   .   .   .   .   .   .   .   .   .   1   1   .   4   .   .   .  :  7   0.00%
19                     swdev   .   .   .   .   .   1   .   .   .   .   2   .   1   .   .   .   .   .   .   4   .   .   .  :  8  50.00%
20                        vm   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   1   .   .   .  :  1   0.00%
21 voice_video_communication   1   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   1   .  :  2  50.00%
22                      zope   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   1   .   .   .  :  1   0.00%

At this point in the training, the picture varied a lot from one run to the next because of the small sample size. But it was already possible to see which categories were well-defined and which were blurry. In this run, 7 documents from the blogging category were tested. Three were correctly predicted to belong in category 0 (blogging), two incorrectly but plausibly in category 5 (data_management), and one each, incorrectly but not implausibly, in categories 6 (email) and 10 (markup). By contrast category 16 (rss) was consistently well-defined across runs.

You'd expect that the acronym RSS, appearing most frequently in the articles I've assigned to that category, would carry a lot of weight. This command, which shows which words most influence the category, proves that it does:

bash$ rainbow --print-word-weights rss | sort -r
0.0016115852 rss
0.0003317731 feed
-0.0069265488 xml
-0.0032424503 http
...

However, the acronym RSS also shows up frequently in documents assigned to other categories:

bash-2.05a$ rainbow --print-word-counts rss | sort -r
       96 /      2706  (  0.03548) rss
       16 /      3104  (  0.00515) blogging
       12 /      3436  (  0.00349) data_management
        7 /      1582  (  0.00442) markup
        6 /      1131  (  0.00531) calendaring
        4 /      1989  (  0.00201) email
        2 /       933  (  0.00214) collaboration
        2 /       775  (  0.00258) browser
        1 /      3613  (  0.00028) swdev
        1 /      2855  (  0.00035) services
        1 /      1969  (  0.00051) security
        1 /      1082  (  0.00092) voice_video_communication
        1 /       691  (  0.00145) networking
        1 /       509  (  0.00196) books

And sure enough, if we check which words most influence the blogging and data_management categories, rss ranks highly among them:

bash$ rainbow --print-word-weights blogging | sort -r | more
-0.0036468446 rss
-0.0033331176 time
-0.0032126101 people
-0.0029204316 blog
-0.0027708460 web
...
bash$ rainbow --print-word-weights data_management | sort -r | more
-0.0069353367 xml
-0.0034907475 web
-0.0031235600 data
-0.0030926130 rss
-0.0029832243 time
...

With my 150-document database, the overall accuracy scores ranged from about 20% to about 40%. That seems terrible compared to SpamBayes. Of course, the judgment as to whether the best category for an item is blogging or data_management is much fuzzier than the spam/not-spam determination. Even with many more samples, I'm not sure the accuracy score would improve by much. Still, there's clearly some signal here, and I suspect it could provide some benefits. For an author who manually categorizes items, automatic categorization -- even if overridden -- can help clarify the boundaries of emergent categories. For a reader, it could be a way to overlay a personal taxonomy on inbound items. But without an application that makes training and testing a seamless part of the blogging experience, it's hard to say whether these scenarios will really pan out.

Using AI::Categorizer

A different experiment reinforces that conclusion. Here's a Perl script that uses AI::Categorizer to classify the columns I've written for the O'Reilly Network, using the same set of training files as before.

#!/usr/bin/perl -w

use strict;
use AI::Categorizer;
use AI::Categorizer::KnowledgeSet;
use LWP::Simple;
 
my $traindir = '/Users/jon/train';

my @oracols = (
'http://www.oreillynet.com/lpt/a/52',
'http://www.openp2p.com/lpt/a/1351',
'http://www.xml.com/pub/a/ws/2002/01/01/topic_map.html',
'http://www.xml.com/pub/a/ws/2002/03/01/udell.html',
'http://www.xml.com/pub/a/ws/2002/04/01/outlining.html',
'http://www.xml.com/pub/a/ws/2002/05/03/udell.html',
'http://www.xml.com/pub/a/ws/2002/06/04/udell.html',
'http://www.xml.com/pub/a/ws/2002/07/09/udell.html',
'http://www.xml.com/pub/a/ws/2002/08/02/flashcomm.html',
'http://www.xml.com/pub/a/ws/2002/09/03/udell.html',
'http://www.oreillynet.com/lpt/a/2767',
'http://www.oreillynet.com/lpt/a/2889',
'http://www.xml.com/pub/a/ws/2002/12/09/udell.html',
'http://www.xml.com/pub/a/ws/2003/01/13/udell.html',
'http://www.xml.com/pub/a/ws/2003/02/11/udell.html',
'http://www.xml.com/pub/a/ws/2003/03/04/spring.html',
'http://www.xml.com/pub/a/ws/2003/04/15/semanticblog.html',
'http://www.xml.com/pub/a/ws/2003/05/13/email.html',
'http://www.xml.com/pub/a/ws/2003/06/10/xpathsearch.html',
'http://www.xml.com/pub/a/2003/07/09/udell.html',
'http://www.xml.com/pub/a/2003/08/13/udell.html',
'http://www.xml.com/pub/a/2003/09/17/udell.html',
'http://www.xml.com/pub/a/2003/10/08/udell.html',
);

# Walk the training directory: each subdirectory names a category, and every
# HTML file in it becomes a training document with its markup stripped out.
sub training_docs
{
  opendir (D, $traindir);
  my @l = grep (! /^\./, readdir(D));
  closedir (D);

  my $ret = {};
  
  foreach my $cat (@l)
    {
    my $d = "$traindir/$cat";
    opendir (D, $d);
    foreach my $f (grep (/html/,readdir(D)))
      {
      open (F, "$d/$f");
      my $content = join('',<F>);
      $content =~ s/<[^>]+>//g;
      close F;
      $ret->{$f} =  { 
                    categories => [$cat],
                    content => $content
                    }
      }
    closedir (D);
    }
  return $ret;
  }

my $docs = training_docs();

my $c = new AI::Categorizer(collection_weighting => 'f');

while (my ($name, $data) = each %$docs) 
    {    $c->knowledge_set->make_document(name => $name, %$data)   }

$c->learner->train( knowledge_set => $c->knowledge_set );

# Fetch each column, strip its markup, and ask the trained learner
# for its best-guess category.
foreach my $d (@oracols)
  {
  my $content = get $d;
  $content =~ m#<title>\s*([^<]+)\s*</title>#;
  my $title = $1;
  $title =~ s/ /_/g;
  $content =~ s/<[^>]+>//g;
  my $doc = AI::Categorizer::Document->new  ( content => $content );
  my $h = $c->learner->categorize( $doc );
  print sprintf ( qq(%20s | <a href="%s">%s</a>\n), $h->best_category, $d, $title);
  }
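
Because each line of output is an HTML fragment -- a category followed by a linked title -- the natural way to run the script (call it categorize.pl; the name is arbitrary) is to redirect its output to a file and open that in a browser:

bash$ perl categorize.pl > categorized.html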

In the output, I've boldfaced the categories that I would have chosen:

       blogging | Peer and Web Services are Technologies of Connection and Coordination
data_management | Googling Your Email
data_management | Speakable Web Services
data_management | Three Faces of XML in Zope
            rss | The Document is the Database
            rss | XSLT Recipes for Interacting with XML Data
data_management | Language Instincts
         markup | Interactive Microcontent
data_management | Quick and Dirty Topic Mapping
       blogging | Jon Udell: Radio UserLand 8.0 Is a Lab for Group-Forming
data_management | Jon Udell: Instant Outlining, Instant Gratification
       blogging | Blogspace Under the Microscope
       blogging | Seeing and Tuning Social Networks
       identity | Control Your Identity or Microsoft and Intel Will
data_management | Scripting Collaborative Applications with Flash Communication Server MX
       blogging | Interaction Design and Agile Methods
data_management | Scripting Groove Web Services
       services | Services and Links
       blogging | Applied Network Theory
       services | Think Spring
data_management | The Semantic Blog
          email | Using Python, Jython, and Lucene to Search Outlook Email
            rss | Structured Writing, Structured Search

Not bad, but not great either. Subtracting the hits, here's how I'd have classified the misses:

  collaboration |
          email |
       services |
           zope |
data_management |
          swdev |
         markup |

  collaboration |
  collaboration |
          swdev |
       services |

  collaboration |

         markup |

         markup |

It wasn't hard to train the system on these choices. I tweaked the script to save the files locally, again using the HTML document titles to create descriptive names. Then it took just a minute's worth of shuffling between two Finder windows to do the training. Clearly, though, this awkward procedure fails the test of normal use.

We know that autocategorization succeeds in the narrow domain of spam filtering. Whether it can succeed more generally -- for example, by helping blog authors and readers manage flows of items -- is as yet unclear. The raw tools are available, but until they're well integrated into authoring and reading software, it will be hard to get a good sense of what's possible.