XML.com: XML From the Inside Out


Article:
 Working with Bayesian Categorizers
Subject: Structured Data
Date: 2003-11-21 01:50:56
From: dave scotson

I've been looking into this recently and I can't find a good intro to using Bayes with (semi-)structured data. Most examples just break large blocks of text into tokens.


As a hypothetical example, imagine you had the title, author, abstract, and journal name for a large number of published articles and you wanted to apply a simple binary classification, e.g. interesting or not interesting to cardiology students.


The abstracts will of course hold a lot of key terms, but surely the author and journal name hold vital linking information that would be greatly diminished if you just dumped all the text into one big string.


I assume that spam filters already do this, saying *this* is text from the subject, *this* is text from the body, *this* is from header X, but I haven't seen any introductory articles on how to go about it.


Any links would be greatly appreciated.


  • Structured Data
    2003-11-21 03:11:23 dave scotson

    I think I've found the answer to my own question by reading http://www.paulgraham.com/better.html


    It appears you can just stick a prefix onto each word to indicate where it came from (e.g. "subject*FREE!" means "FREE!" within the subject line, or in my example "author*Smith") and then carry on as normal.


    Seems kind of ugly, but hey, if it works who am I to complain.
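
For anyone who wants to try the prefix trick, here is a rough sketch in Python. It assumes each article is a dict of field name to text, and it uses a plain multinomial naive Bayes with add-one smoothing, which is not the exact scoring formula from the Paul Graham essay. The names `field_tokens` and `NaiveBayes`, the word regex, and the labels in the usage note are all made up for illustration.

```python
import math
import re
from collections import Counter


def field_tokens(record):
    """Turn {"author": "Smith"} into field-prefixed tokens like "author*Smith"."""
    tokens = []
    for field, text in record.items():
        # Hypothetical word pattern: letters, digits, and a few punctuation marks.
        for word in re.findall(r"[A-Za-z0-9'!$-]+", text):
            tokens.append(f"{field}*{word}")
    return tokens


class NaiveBayes:
    """Multinomial naive Bayes over field-prefixed tokens (illustrative only)."""

    def __init__(self):
        self.token_counts = {}       # label -> Counter of prefixed tokens
        self.doc_counts = Counter()  # label -> number of training records
        self.vocab = set()           # every prefixed token seen in training

    def train(self, record, label):
        tokens = field_tokens(record)
        self.token_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, record, label):
        counts = self.token_counts[label]
        total = sum(counts.values())
        # Log prior from the share of training records with this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in field_tokens(record):
            # Add-one smoothing so unseen tokens do not zero out the score.
            score += math.log((counts[token] + 1) / (total + len(self.vocab)))
        return score

    def classify(self, record):
        return max(self.doc_counts, key=lambda label: self.log_score(record, label))
```

To use it, call `train(record, label)` once per labelled article, for example with the labels "interesting" and "not-interesting", and then `classify(record)` on a new one. Because the prefix is part of the token, "author*Smith" and "title*Smith" are counted separately, so a Smith who writes cardiology papers is not confused with a title that mentions Smith charts.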

Copyright © 2008 O'Reilly Media, Inc. | (707) 827-7000 / (800) 998-9938