Building a Better Metasearch Engine
Applied XML Tutorial
Applied XML is intended for those of you who want to learn XML by seeing examples. Over the course of several columns, I'll present examples that show you how to apply XML. The goal of each example is to suggest what you can do and to provide you with code to download. I want to show you what works! I want to keep it simple and pragmatic, for those of you who have to earn a living providing solutions now. Let me know how I'm doing.
Our first example is a metasearch engine, which demonstrates how to automate interaction with a search engine database and compile the results from a number of different sites into a single, useful output. Our examples will use ASP and VBScript but, of course, they can be easily adapted to other environments.
What's Wrong With Metasearch Engines
| Applied XML Tutorial: Metasearch | |
| Download Code: | Zip file |
| Demo (requires IE 5): | Metasearch |
Metasearch engines offer users a central point for querying multiple databases. www.metasearch.com is an example of a metasearch engine for compiling data stored in "ordinary" Internet search engines. This metasearch engine does not return a page listing the combined results; instead, it lists the queries to be submitted individually to different search engines. Submitting queries simultaneously to different search engines is not a hard problem, but processing the results consistently is. A search engine's results are formatted in HTML, and it can be very hard to extract useful information (page title, link, excerpt, and so on) automatically from these pages.
For example, look at some results from a search for "monica," which I submitted to two search engines. The illustration below shows the HTML for one of the listings, "Sista Monica: Bringing the blues to the web."
To process these different results, a metasearch engine would require a specific parser for each database it searches. Only then would it be able to sort out the data returned from several databases and display everything in a homogeneous manner.
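To make the per-engine parser requirement concrete, here is a minimal sketch. It is written in Python for brevity rather than the ASP and VBScript used in the downloadable code. The two HTML layouts, the engine names, and the excerpt text are all invented for illustration; only the listing title and URL come from the example above. Real engines use different markup and change it without notice.

```python
import re

# Hypothetical result markup from two imaginary engines (not real layouts).
ENGINE_A_HTML = (
    '<dt><a href="http://www.sistamonica.com/">'
    'Sista Monica: Bringing the blues to the web</a></dt>'
    '<dd>Example excerpt text.</dd>'
)
ENGINE_B_HTML = (
    '<p><b>Sista Monica: Bringing the blues to the web</b><br>'
    'Example excerpt text.<br>'
    '<a href="http://www.sistamonica.com/">http://www.sistamonica.com/</a></p>'
)


def parse_engine_a(html):
    """Parser tied to engine A's layout: link first, then title, then excerpt."""
    pattern = re.compile(
        r'<dt><a href="([^"]+)">(.*?)</a></dt><dd>(.*?)</dd>', re.S)
    return [{"title": title, "link": url, "excerpt": excerpt}
            for url, title, excerpt in pattern.findall(html)]


def parse_engine_b(html):
    """Parser tied to engine B's layout: title first, then excerpt, then link."""
    pattern = re.compile(
        r'<p><b>(.*?)</b><br>(.*?)<br><a href="([^"]+)">', re.S)
    return [{"title": title, "link": url, "excerpt": excerpt}
            for title, excerpt, url in pattern.findall(html)]


# One parser per engine: adding an engine means writing another function.
PARSERS = {"engine_a": parse_engine_a, "engine_b": parse_engine_b}


def merge_results(pages):
    """pages maps an engine name to its raw HTML; returns uniform records."""
    merged = []
    for engine, html in pages.items():
        for record in PARSERS[engine](html):
            record["source"] = engine
            merged.append(record)
    return merged
```

Calling `merge_results({"engine_a": ENGINE_A_HTML, "engine_b": ENGINE_B_HTML})` yields two records with identical `title`, `link`, and `excerpt` fields, but only because each pattern was hand-fitted to its engine. A small change in either layout silently breaks the corresponding parser.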
This problem of parsing HTML pages (whether static or generated from a database) was deemed important enough that www.junglee.com and www.webmethods.com set out to provide remedies. Yet despite their services (and there's much more to them than concerns us here), the remedies are only a partial cure. The underlying cause remains: the data returned is obscured by its HTML formatting.
Making Searching Easier Using XML
How much easier would our lives be if richer data were returned from search engine databases? XML can make this possible. Imagine a metasearch engine that sees the "sistamonica.com" item returned in XML, as shown below.
<page href="http://www.sistamonica.com/" last-modified="1999-01-01"
language="English" sizeKB="2">
Imagine the additional control that a metasearch engine could have in processing this information and making it available for users.
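As a rough illustration of that control, the sketch below reads `<page>` elements and then filters and sorts them by their attributes. It is in Python rather than the article's ASP and VBScript. The attribute names follow the fragment above; the `<results>` wrapper, the `<title>` child, the closing tags, and the second entry are assumptions added only so that the example is well-formed and has something to sort.

```python
import xml.etree.ElementTree as ET

# Hypothetical response. Only the first <page> start tag comes from the
# article; the wrapper, the <title> children, and the second page are invented.
XML_RESULTS = """
<results>
  <page href="http://www.sistamonica.com/" last-modified="1999-01-01"
        language="English" sizeKB="2">
    <title>Sista Monica: Bringing the blues to the web</title>
  </page>
  <page href="http://www.example.com/monica" last-modified="1998-06-15"
        language="German" sizeKB="14">
    <title>Example page</title>
  </page>
</results>
"""


def load_pages(xml_text):
    """Turn every <page> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        pages.append({
            "href": page.get("href"),
            "title": page.findtext("title", default=""),
            "last_modified": page.get("last-modified"),
            "language": page.get("language"),
            "size_kb": int(page.get("sizeKB")),
        })
    return pages


def newest_first(pages, language=None):
    """Optionally keep one language, then sort by modification date.

    ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    """
    selected = [p for p in pages
                if language is None or p["language"] == language]
    return sorted(selected, key=lambda p: p["last_modified"], reverse=True)
```

Because the fields arrive labeled, the same two functions work for any engine that returns this format; no per-engine pattern matching is needed. For instance, `newest_first(load_pages(XML_RESULTS), language="English")` keeps only the sistamonica.com entry.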
Well, I hope that's enough insight into the kinds of problems that XML is supposed to help solve. Now, let's look at writing our own metasearch engine that demonstrates the power and flexibility of an XML solution.
