In last month's installment I showed a simple search service that uses libxslt to reduce a file of XML content (my weblog writing) to just the elements matching an XPath expression. This month's challenge was to scale up to a database-backed implementation using Berkeley DB XML. Since my own XML corpus doesn't present much of a challenge to DB XML, I also decided to include all of the 200+ blogs I read. This proved interesting because only a handful of them provide well-formed XHTML content. So before I dive into the details of the DB XML implementation, here's the rundown on how I built an RSS aggregator that renders feeds as well-formed XHTML.
The short answer is that a pair of modules did all the heavy lifting. The first is HTML Tidy, originally by Dave Raggett and now maintained by "a quorum of developers" on SourceForge. The second is Mark Pilgrim's universal feed parser which, as it turns out, neatly encapsulates HTML Tidy with the help of Marc-Andre Lemburg's mxTidy.
Databasing the feeds
The complete loader script is here. First, the setup:
import string, sys, urllib, re, traceback, pickle, os, \
feedparser, mx.DateTime, libxml2
from dbxml import *
Of the non-standard modules, feedparser is trivial to install: just drop it in the current directory. Installers for mx.DateTime and mx.Tidy (used by feedparser) are available at egenix.com. Creating the Python bindings for libxml2 was, as mentioned last time, not quite a no-brainer. Likewise the Python bindings for DB XML on Windows and Mac OS X, my two everyday environments. For OS X, see Kimbro Staken's recipe. For Windows, I realized that Chandler contains the bits I needed, so I cheated and used those. Mea culpa.
More laziness: my aggregator, like everyone's first aggregator, initially ignored the two popular ways to check if a feed has changed since last fetched: ETag and "conditional GET". But feedparser makes it easy to use both methods. You just need to remember the values of the ETag and Last-Modified headers sent from the server, and hand them to feedparser's parse() method when you ask it to fetch a feed. I could have stored the headers in DB XML, but it seemed easier to just pickle a Python dictionary:
try:
os.stat(dictEtagFile)
except:
print "CreatingDictEtag"
f = open(dictEtagFile,'wb')
dictEtag = {}
pickle.dump(dictEtag,f)
f.close()
try:
f = open(dictEtagFile, 'rb')
dictEtag = pickle.load(f)
f.close()
except:
raise "CannotLoadDictEtag"
The list of feeds to fetch is avaiable from an OPML file that's rewritten whenever I add or remove an RSS subscription using my Radio UserLand feedreader. The combination of Python and libxml2 makes quick work of pulling the URLs for those feeds into a Python list:
opmldata = urllib.urlopen(opml).read()
opmlxml = libxml2.parseDoc(opmldata)
urls = opmlxml.xpathEval('//@xmlUrl')
urls = map ( libxml2.xmlNode.getContent, urls)
Using the ETag and Last-Modified headers stored in the pickled dictionary, you can avoid fetching feeds that haven't changed. But when a feed has changed, which of its items should be added to the database? RSS 2.0 provides a "guid" that signals item-level changes, but RSS .9x and 1.0 feeds don't. Yet more laziness: for now I'm just hashing the item content and keeping track that way. Items are stored in the database like so:
<item channel="Jon's Radio" hash="-125385335">
<title>...</title>
<date>...</date>
<link>...</link>
<content>...</content>
</item>
To fetch these into a Python dictionary, the script opens the database and runs an XPath query:
container = XmlContainer(None, './dbxml/blog.dbxml')
container.open(None,DB_CREATE)
context = XmlQueryContext(1,0)
result = container.queryWithXPath(None, '/item/@hash', context)
hashValues = {}
try:
for i in range(r.size()):
val = result.next().asString(None)
hashValues[val] = val
except:
print "CannotCreateHashDict"
container.close()
DB XML supports Berkeley DB transactions, but I'm not using them here, so the first argument to container.open() is None. Queries, by default, return whole matching documents, and we'll use that mode later in the search engine. But here, we just want to extract hash values, so the first argument to XmlQueryContext() -- the returnType -- is 1. (The second argument -- evaluationType -- defaults to "eager" but can be switched to "lazy," though I haven't found a reason to do that yet.)
Now here's the main loop. For each URL, it looks up the ETag and Last-Modified headers, calls the feedparser, calls a helper method to unpack the Python data structure returned by the feedparser, and -- if any new items showed up -- it plugs them into the database.
for url in urls:
print "%s " % url
try:
try:
dictEtag[url]
except KeyError:
dictEtag[url] = {}
try:
etag = dictEtag[url]['etag']
except:
etag = None
try:
mod = dictEtag[url]['mod']
except:
mod = None
result = feedparser.parse(url, etag, mod)
try:
etag = result['etag']
dictEtag[url]['etag'] = etag
except:
etag = None
pass
try:
mod = result['modified']
dictEtag[url]['mod'] = mod
except:
mod = None
pass
if ( etag == None and mod == None ):
print "%s: no etag or mod" % url
if ( result['status'] == 304 ):
continue
items = unpack ( result )
if len ( items ):
container.open(None,DB_CREATE)
for item in ( items ):
doc = XmlDocument()
doc.setContent(item)
print container.putDocument(None, doc)
container.close()
except:
print formatExceptionInfo()
continue
The unpack() method bails on items found in the dictionary of item hashes, and normalizes the data in a couple of ways. For example, the content can show up in a couple of different ways in the Python dictionary returned by feedparser.parse(). Likewise, there are a few different date formats. For now I'm converting all content to this style:
<content><body>...</body></content>
And all dates to this style:
<date>2004/02/04></date>
Finally, the script pickles the dictionary of item hashes for next time.
Searching the Feeds
As before, the server is based on Python's BaseHTTPServer. Last month, the server's query() method used libxslt's XPath engine. This month it uses DB XML's queryWithXPath(). The complete search script is here.
The default DB XML search context, which returns documents rather than just matching nodes, is the right strategy here. Each found document provides the channel, title, date, and link used to contextualize one or more hits within the document.
The hits still need to be isolated from their containing documents, though. So libxml2's xpathEval() does subqueries, within each found document, to yield one or more contextualized fragments.
The HTML output is accumulated as a Python list of date-content pairs, and then sorted by date. The reason for this is that XPath itself can't sort results.
Here's the query() method:
def query(self,q):
css = getCss()
script = getScript()
preamble = getPreamble(q)
try:
container = XmlContainer(None, db)
container.open(None)
xmlResults = container.queryWithXPath(None, q)
except:
print "QueryFailure"
container.close()
return "bad query: %s" % q
if ( xmlResults.size() == 0 ):
return """<html><head><title>XPath query of Jon's feeds</title>
<style>%s</style>
<script>%s</script></head>
<body>%s <b>no results</b></body></html> """ % (css, script, preamble)
l = []
for i in range(xmlResults.size()):
result = xmlResults.next().asString(None)
try:
xmlFragment = libxml2.parseDoc( result )
except:
print "NotWellFormed"
xmlFragment.freeDoc()
container.close()
return "bad query: %s" % q
xpathChannel = "//item/@channel"
xpathTitle = "//title"
xpathDate = "//date"
xpathLink = "//link"
xpathHits = q
try:
channel = xmlFragment.xpathEval(xpathChannel)[0].content
title = xmlFragment.xpathEval(xpathTitle)[0].content
date = xmlFragment.xpathEval(xpathDate)[0].content
link = urllib.unquote(xmlFragment.xpathEval(xpathLink)[0].content)
hits = xmlFragment.xpathEval(xpathHits)
except:
print "CannotSearchWithinFoundDocument"
xmlFragment.freeDoc()
container.close()
return "bad query: %s" % q
try:
frags = ''
for j in range(len(hits)):
frags = frags + '<div>' + hits[j].serialize() + \
'</div><hr align="left" width="20%"/>'
except:
print "CannotSerializeHits"
xmlFragment.freeDoc()
container.close()
return "bad query: %s" % q
xmlFragment.freeDoc()
xhtml = '<p><a href="%s">%s</a> (<b>%s</b>, <b>%s</b>)</p> %s' % (
link,
title,
channel,
date,
frags
)
l.append ( ( date, xhtml ) )
container.close()
l.sort( lambda x, y: cmp ( y[0], x[0] ) )
xhtml = ''
for i in range(len(l)):
xhtml = xhtml + l[i][1]
page = """
<html><head><title>XPath query of Jon's feeds</title><style>%s</style>
<script>%s</script></head><body>%s %s</body></html>
""" % (css, script, preamble, xhtml)
return page
More from Jon Udell | |
Not shown in this example is DB XML's indexing. I've tinkered with it, but the database of blog items I've built so far -- about 4000 documents, and 7MB -- doesn't really require indexes yet. As the database grows, I plan to learn more about DB XML's (somewhat bewildering) set of indexing strategies.
I'm also looking forward to using this growing repository to try out other XML-aware databases, and to explore XQuery as well as XPath implementations. It's been a bit of a revelation to see how easily random HTML scooped up from RSS feeds can be XHTML-ized and exposed to structured search. Can this approach scale to the level required to create network effects? I'd love to see Technorati, Feedster, and eventually Google give it a try.
Comment on this article in our forums.
(* You must be a member of XML.com to use this feature.)
Comment on this Article
| Titles Only | Titles Only | Oldest First |
- AA Water Damage Restoration Emergency 24 hour open 7 Days Call 1-877-364-5264
2008-10-16 10:01:07 orellytos [Reply]
Water Damage Restoration Emergency 24 hour open 7 Days Call 1-877-364-5264
When disaster strikes such as a washing machine overflow, bath
overflow, burst pipe, rainwater from balcony etc, water damage
restoration procedures must be followed.
It is not as simple as just sucking up the water and cleaning the
carpet and letting the area dry naturally.
It is not the wet carpet that is the problem it is the underlay and
the wood or concrete floor that needs to be dried.
If the underlay is not dried or replaced you will end up with mould
and odour problems.
Our procedure for wet carpet is as follows.
1/ Call out to the premises and extract water using powerful vacuum
extraction machinery.
2/ Apply browning treatment to carpet to prevent water stains.
3/ Move furniture on to foam blocks to avoid further damage.
4/ Lift corner of carpet and inspect for damage underneath carpet.
5/ Install drying equipment such as air movers and / or dehumidifiers
and leave these on site until area is dry.
6/ Call back to premises on daily basis to check drying progress.
7/ Relay carpet and underlay when dry.
8/ Steam clean and deodorise carpet on completion of restoration.
Notes on water restoration:
1/ An area such as a bedroom will normally take between 24-48 hours to
dry completely using drying equipment.
2/ When a premises has been affected by sewage overflow we do not
attempt carpet restoration. The carpet and underlay must be removed
from the premises and disposed and the flooring treated and
deodorised. USA Standards does not allow restoration of sewage
affected carpet.
Water Damage Restoration Emergency 24 hour open 7 Days Call 1-877-364-5264
- #123 Locksmith Services Company Los Angeles call 1-323-678-2704
2009-06-11 14:48:55 whats [Reply]
#1 Locksmith Services Company Los Angeles call 1-323-678-2704 Locksmith services Los Angeles, including locks installation, doors locks repair, doors locks rekey, locks and keys products or services the best value and commitment to customers 100 satisfaction guaranteed.
On this website you can find local los angeles locksmith in your area.
Commercial Locksmithing, Specializing in: Banks Office, Apartment Building, New Homes, Condominiums, Retail Stores, Banks Industrial Facilities, Locks and Padlocks, Pharmacies Grocery Stores, Restaurants, Retail stores, Schools, Storage Warehouse.
Service includes Lockout, locked out need locksmith fast response, Deadbolt Locks repair installation Changed, Installed & Repaired, Re-keys & Master Key Systems, rekey ddors locks, High Security Locks Systems, Home Security Safes, Intercom Systems Repair & Installation, Fire Proof Panic Bars Repaired & Installed, Peephole Installation, High Security Cylinder Changed & Re-Keyed, Closed Curcuit Television CCTV, Card Access Control Systems, Panic Lock Devices, Safes,Windows,Glass Doors & Gates, Padlock, Combination Lock, Electronic Key, Magnetic Keys, Electronic Keypad and Keyless Entry, File Cabinet & Lock Picking.
Residential Locksmithing, Specializing in: Levers Entrance Sets Keypad Locks Deadbolts,
Knobs and Levers Entrance Sets Deadbolts,
Levers Knobs Entrance Sets Hardware Deadbolts,
Knobs and Levers Entrance Sets Deadbolts.
master key, deadbolts, Knobs sets, Hardware, Weiser Locks, Locksets, Knobsets, Handlesets, lock kits Cam Cabinet Locks Door Closers Key Blanks Key Cabinets Key Rings Padlocks Pushbutton Locks garage door openers mailbox locks cylinders and high security locks, keyless Entry, Key Cutting & Key Replacement, Emergency Vehicle Opening, Emergency Trunk Opening, Extraction of Broken Keys Locksmith, GM VAT Keys Duplication, High Security Vehicle Key Duplication Locksmith, New Ignition key and Transponder Chip Key. that fit your lifestyle.
Automobile Locksmith for Car lockout Truck opening Re-key all doors Ignition keys made on-site Transponder keys made on-site Ignition replacement,
