Building a Better Metasearch Engine

June 8, 1999

Applied XML Tutorial

Applied XML is intended for those of you who want to learn XML by seeing examples. Over the course of several columns, I'll demonstrate examples that show you how to apply XML. The goal of each example is to suggest what you can do, and provide you with code to download. I want to show you what works! I want to keep it simple and pragmatic: for those of you who have to earn a living providing solutions now. Let me know how I'm doing.

Our first example is a metasearch engine, which demonstrates how to automate interaction with a search engine database and compile the results from a number of different sites into a single, useful output. Our examples will use ASP and VBScript but, of course, they can be easily adapted to other environments.

What's Wrong With Metasearch Engines

Applied XML Tutorial: Metasearch
Download Code:	Zip file
Demo (requires IE 5):	Metasearch

Metasearch engines offer users a central point for querying multiple databases. www.metasearch.com is an example of a metasearch engine for compiling data stored in "ordinary" Internet search engines. This metasearch engine does not return a page listing the combined results; instead, it lists the queries to be submitted individually to different search engines. It's not a hard problem to submit queries simultaneously to different search engines but processing the results consistently is. A search engine's results are formatted in HTML; it can be very hard to extract useful information (e.g. page title, link, excerpt, etc.) automatically from these pages.

For example, look at some results from a search for "monica," which I submitted to two search engines. In the illustration below is the HTML for one of the listings, "Sista Monica: Bringing the blues to the web."

To process these different results, a metasearch engine would require a specific parser for each database it searches. Only then it would be able to sort out data returned from several databases and display everything in a homogeneous manner.

This problem of parsing HTML-pages (either static or generated from a database) was even deemed so important, that www.junglee.com and www.webmethods.com set out to provide remedies. Yet despite their services - and there´s much more to them than we are concerned with right here - the remedies are only a partial cure. The underlying cause is that the data returned is obscured in an HTML format.

Making Searching Easier Using XML

How much easier would our life be if richer data was returned from search engine databases? XML can make this possible. Imagine a metasearch engine that sees the "sistamonica.com" item returned in XML, as shown below.

<page href="http://www.sistamonica.com/" last-modified="1999-01-01" 

      language="English" sizeKB="2">


      <title>Sista Monica: Bringing the blues to the web!</title>


      <abstract>She's blues! She's soul. She's rock. She's bold! Sista Monica's 

      bringing her blues to the...</abstract>


      </page>

Imagine the additional control that a metasearch engine could have in processing this information and making it available for users.

Well, I hope that's enough insight into the kinds of problems that XML is supposed to help solve. Now, let's look at writing our own metasearch engine that demonstrates the power and flexiblity of an XML solution.

Part 2: A Metasearch Engine for an Address Database

We are going to develop a metasearch engine for address data [download the code examples]. It will use XML to compile database data from several sites and then format the data using XSL. All the metasearch engine processing will be done on the client!

Let's assume that there are several sites on the Internet with databases containing lots of address data, e.g. some kind of company addresses. All sites, of course, provide their own search engines. For our sample here we´ll work with just two different sites, called All Addresses and Best of Addresses. Right now they are hosted on the same server, but they actually work on different databases. I´d think this is close enough to reality.

Below is a form for entering a query. The user can choose the database field to search on, enter a search word (yes, wildcards work), and define the order in which the results are to be listed. This, of course, is a very simple query form.

Sort by

When using the search form, both sites return their addresses as HTML. Here´s a sample record:

 

<tr>

  <td width="20%" valign="top"><small><font face="Arial">

  Firma Karl-Heinz Rosowski</font></small> </td>

  <td width="20%" valign="top"><small><font face="Arial">

  Maikstraße 14</font></small> </td>

  <td width="20%" valign="top"><small><font face="Arial">

  22041 Hamburg</font></small> 

  </td>

  <td width="20%" valign="top"><small><font face="Arial">

  721 99 64</font></small> 

  </td>

  <td width="20%" valign="top"><small><font face="Arial">

  21110111</font></small> 

  </td>

</tr>

Can you tell much about the address record? Neither could a metasearch engine. Only the code generating the result can reveal the "secret of the returned information":


<%

  dim conn, rs, sql

  

  sql = "select top " & (maxRecords + 1) & " * from addresses where 

  " & request.form("Field") & " like '" & 

  replace(request.form("SearchWord"),"*","%") 

  & "%' order by " & request.form("SortField")

  

  conn = 

  "Provider=Microsoft.Jet.OLEDB.4.0;Persist Security Info=False;Data Source="

  & dbPath

      

  set rs = createobject("ADODB.Recordset")

   rs.open sql, conn, 1, 1

  if not rs.EOF then

%>



<table border="1" width="100%">

    <tr>

        <td width="20%" bgcolor="#000000"><strong>Name</strong></font></td>

        <td width="20%" bgcolor="#000000"><strong>Stree</strong></font></td>

        ...

    </tr>

      

  <%

     Do While not rs.EOF

  %>

    <tr>

        <td width="20%" valign="top"><% = "" & rs.fields("name") %></td>

        <td width="20%" valign="top"><% = "" & rs.fields("street") %></td>

        <td width="20%" valign="top"><% = "" & rs.fields("zip") 

      & " " & rs.fields("city") %></td>

        <td width="20%" valign="top"><% = "" & rs.fields("tel") %></td>

        <td width="20%" valign="top"><% = "" & rs.fields("fax") %></td>

    </tr>

<%

    rs.MoveNext

   Loop

%>

   </table>

It´s straightforward ASP-code. A connection to a SQL database is opened, records get selected, and the table header is written. Finally, a loop runs over all selected records and outputs HTML table rows containing information like name, ZIP, telephone number. Seeing the code that produces the output helps us understand the HTML output. However, most of us don't have the opportunity to look into the back-end production system to see these details.

Returning Result Sets Using XML

Since the providers of our address data are smart, they not only store the data in a database. On top of that, they also provide an interface for metasearch engines (or any software that wants to query the site). Besides the main search pages, they offer a second page, accepting query criteria as URL parameters. The syntax is very simple:

"?whereField=" fieldname "&pattern=" search word "&orderbyField=" fieldname

If you are using Microsoft Internet Explorer 5.0, you can try this URL:

.../searchxml.asp?whereField=ZIP&pattern=220&orderbyField=ZIP

The same data is returned as in the earlier example. However, as shown below, it looks quite different because it is presented in XML.

<?xml version="1.0" encoding="ISO-8859-1"?>

      <Addresses>

        <Address id="12359">

          <Name>Firma Karl-Heinz Rosowski</Name>

          <Street>Maikstraße 14</Street>

          <ZIP>22041</ZIP>

          <City>Hamburg</City>

          <Tel>721 99 64</Tel>

          <Fax>21110111</Fax>

        </Address>

... </Addresses>

Now, it´s pretty clear what´s what. Generating this kind of result set in XML wasn´t any more difficult than the HTML page, as shown by the ASP listing below:

<%

        ...

        sql = "select * from addresses where " & whereField & " like '" 

        & replace(pattern,"*","%") & "%' order by " & orderByField

        

        Set rs = CreateObject("ADODB.Recordset")

        rs.Open sql, conn, 0, 1

        

        Dim xml

        Set xml = CreateObject("Microsoft.XMLDOM")

        

        With  xml.insertBefore(xml.createElement("Addresses"), Nothing)

          Do While Not rs.EOF

            With  .insertBefore(xml.createElement("Address"), 

        Nothing)

              ...

              .insertBefore(xml.createElement("Name"), 

        Nothing).insertBefore xml.createTextNode("" & rs.Fields("Name").Value), 

        Nothing

              ...

            End With

            rs.MoveNext

          Loop

        End With

        

        rs.Close

        

        Response.ContentType = "text/xml"

        Response.Write "<?xml version=""1.0"" encoding=""ISO-8859-1""?>" 

        & xml.xml

        %>

This example uses the XML DOM to construct a XML tree as the result set. It iterates through the record set and inserts <Address>-elements below the <Addresses>-document element; below them it puts elements for each field in the recordset, e.g. <Name>, <ZIP>. At the end, to actually return XML to the client, it sets the MIME type to text/xml and lets the XML DOM serialize itself into a string. A special character encoding is used since the address data contains special European characters like umlaute.

Compiling Result Sets

Our example uses the Microsoft MSXML XML-parser component included in Internet Explorer 5.0. This will allow us to do XML parsing on the client side. In a future column, we´ll see how to do this processing on the server instead, which is a better way to provide the same functionality across different browsers.

If you are running IE5, then you can look at how the better metasearch engine looks and feels. The user interface looks the same, and the table with the resulting data does too - but the resulting table contains addresses from both (!) address database sites sorted accross their separate resultsets.

So how does the metasearch engine accomplish this feat? At its heart are IE5's XML islands.

<html>

      

      <head>

      <meta name="GENERATOR" content="Microsoft FrontPage 3.0">

      <title>Client-side Metasearch</title>

      </head>

      

      <xml id="xml1"></xml>

      <xml id="xml2"></xml>

      ...

IE5 allows you to insert XML islands into HTML pages. We use this feature to set up placeholders (xml1 and xml2) for the XML data to be returned from the two address search engines.

Then, when the user starts the metasearch, we call the VBScript function MetaSearch and use the XML DOM behind the XML islands to transparently query the search engines. Below is the code for the MetaSearch function.


sub MetaSearch()

  xml1.xmlDocument.async = false

  xml1.xmlDocument.Load "http://www.alladdr.com/searchxml.asp?whereField=" 

  & document.forms(0).Field.value 

  & "&pattern=" & document.forms(0).SearchWord.value 

  & "&orderbyField=" & document.forms(0).SortField.value

      

  xml2.xmlDocument.async = false

  xml2.xmlDocument.Load "http://www.bestofaddr.com/findxml.asp?whereField=" 

  & document.forms(0).Field.value 

  & "&pattern=" & document.forms(0).SearchWord.value 

  & "&orderbyField=" & document.forms(0).SortField.value

      ...

The code assembles the query URLs from the data in the metasearch engine search form and instructs the XML DOM to call the query interface pages. The resultset from searchxml.asp gets loaded into XML island xml1, and the resultset from findxml.asp gets parsed into xml2.

The next step is to assemble the address data returned into just one XML tree so we can sort them together:

 Dim xml, adr, adrList

      Set xml = CreateObject("Microsoft.XMLDOM")

      

      xml.appendChild xml.createElement("SearchResult")

      

      Set adrList =  xml1.selectNodes("Addresses/Address")

      For Each adr In adrList

        xml.documentElement.appendChild adr.cloneNode(True)

      Next

For this we create a separate XML DOM and copy the data from the XML islands into it, thereby building one homogeneous data structure. To be able to copy only the <Address>-elements, we run an XQL query using MSXML´s XQL-method selectNodes; thus adrList gets filled with a list of just <Address>-elements that can readily be iterated over. The resulting XML tree looks almost like the result set data:

<SearchResult>

        <Address id="12359">

          <Name>Firma Karl-Heinz Rosowski</Name>

          <Street>Maikstraße 14</Street>

          <ZIP>22041</ZIP>

          <City>Hamburg</City>

          <Tel>721 99 64</Tel>

          <Fax>21110111</Fax>

        </Address>

      ...

      </SearchResult>

It just starts with <SearchResult> instead of <Addresses>, and it contains the addresses from all sites queried.

Transforming Result Sets

What if the sites return XML data, but the XML vocabulary used is not the same? The address search engines seem to not have worked out a standard DTD yet. Here´s an extract from data returned by the "Best of Addresses" site:

<?xml version="1.0" encoding="ISO-8859-1" ?>

      <Companylist>

        <Company id="12309">

          <Name><![CDATA[Firma Hans Lehner Tischlereibetrieb 

      GmbH]]></Name>

          <StreetAddr>

            <Street><![CDATA[Bebelallee 132]]></Street>

            <City><ZIP>22111</ZIP>Hamburg</City>

          </StreetAddr>

          <Communication>

            <Tel>511 79 17</Tel>

            <Fax>685088</Fax>

          </Communication>

      </Company>

      ...

      </Companylist>

Looks quite different, doesn´t it. Some tagnames differ, and the element hierarchy does too. And it´s generated a different way too:


<% Response.ContentType = "text/xml" %>

  <?xml version="1.0" encoding="ISO-8859-1" ?>

  <Companylist>

  <% 

  ...

  Dim conn, rs, sql

  conn = 

   "Provider=Microsoft.Jet.OLEDB.4.0;Persist Security Info=False;Data Source="

   & dbPath

  Set rs = CreateObject("ADODB.Recordset")

  sql = "select * from addresses where " & whereField & " like '" 

  & replace(pattern, "*", "%") & "%' order by " & orderByField

  rs.Open sql, conn, 0, 1

  Do While Not rs.EOF

  %>

           <Company id="<% = rs.Fields("ID").Value %>">

            <Name><![CDATA[<% = "" & rs.Fields("Name").Value 

        %>]]></Name>

            <StreetAddr>

               <Street><![CDATA[<% = "" 

        & rs.Fields("Street").Value %>]]></Street>

                ...

           </Company>

         <%

        rs.MoveNext

        Loop

        ... 

        %>

         </Companylist>

Instead of building a XML DOM, the ASP-page simply outputs XML text.

The difference in the XML vocabulary used makes it a bit more difficult for our metasearch engine to handle the sites data. But only a tad. Before it can add the data to the unified result set, our program has to transform the data into another XML vocabulary. We´ll use the XML language used by the "All Addresses" site as our target language. And as the term "transform" suggests, we´ll use XSL to do the transformation. Here´s the stylesheet to apply:

<?xml version="1.0" encoding="ISO-8859-1" ?> <xsl:stylesheet
              xmlns:xsl="http://www.w3.org/TR/WD-xsl">   <xsl:template match="/">
                  <Addresses>      
              <xsl:apply-templates/>     </Addresses>   </xsl:template>
                <xsl:template match="Companylist">     <xsl:for-each
              select="Company">       <Address>
                      <xsl:attribute
              select="id"><xsl:value-of select="@id"/></xsl:attribute>
                      <Name><xsl:value-of select="Name"
              /></Name>         <xsl:apply-templates
              select="StreetAddr" />        
              <xsl:apply-templates select="Communication" />       
              </Address>     </xsl:for-each>   </xsl:template> 
                <xsl:template match="StreetAddr">    
              <Street><xsl:value-of select="Street" /></Street>    
              <xsl:apply-templates select="City" />    </xsl:template>   
              <xsl:template match="City">     <ZIP><xsl:value-of
              select="ZIP" /></ZIP>     <City><xsl:value-of
                select="textNode()[0]" /></City>   </xsl:template>   
              <xsl:template match="Communication">     <Tel><xsl:value-of
              select="Tel" /></Tel>     <Fax><xsl:value-of select="Fax"
              /></Fax>   </xsl:template>  </xsl:stylesheet>

It´s very straightforward. The only thing to mention is the technique for extracting the city name from the <City>-element, which also contains the <ZIP>-element. It took me some 15 minutes to figure that one out before I got it working with textNode()[0].

Once the stylesheet has been worked out, transforming the deviant data is easy:

 dim xml2Transformed

      set xml2Transformed = CreateObject("Microsoft.XMLDOM")

       xml2.transformNodeToObject ssBestOfAddr.XMLDocument.documentElement, 

      xml2Transformed

       Set adrList = xml2Transformed.selectNodes("Addresses/Address")

      ...

We use the MSXML transformNodeToObject method to apply the above stylesheet (stored in another XML island, ssBestOfAddr) and get the resulting XML in a DOM of its own, xml2Transformed, which then can be handled like xml1 above.

That´s it. Both XML resultsets are now combined in one XML tree: xml. It´s especially noteworthy how easy XML made it for us to handle even data not adhering to a "standard format". Whereas current metasearch engines querying HTML pages have to implement different parsers to extract the data, we didn´t have to introduce complicated algorithms. Instead we used the declarative way of XSL. Like substituting code to query ISAM databases with SQL, it´s far easier to formulate transformations using XSL than programming a transformation routine or a whole parser.

Displaying the Results

What´s left is the easiest part: displaying the combined data. Again we use XSL. The XSL stylesheet produces HTML from XML, so it´s easy to display the resulting table by simply assigning the HTML to a <SPAN>-element.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">

        <xsl:template match="/"> 

          <table border="1" width="100%">

            <tr>

              <td width="20%" bgcolor="#C0C0C0"><strong><small><small><font 

      face="Arial">Name</font></small></small></strong></td>

              <td width="20%" bgcolor="#C0C0C0"><strong><small><small><font 

      face="Arial">Street</font></small></small></strong></td>

              ...

           </tr>

           <xsl:apply-templates/>

         </table>

        </xsl:template>

      

        <xsl:template match="SearchResult">

          <xsl:for-each select="Address" order-by="+ZIP">

            <tr>

              <td width="20%" valign="top"><small><font 

      face="Arial"><xsl:value-of select="Name" /></font></small></td>

              <td width="20%" valign="top"><small><font 

      face="Arial"><xsl:value-of select="Street" /></font></small></td>

              ...

            </tr>

          </xsl:for-each>

        </xsl:template> 

      </xsl:stylesheet>

The stylesheet has also been loaded in a XML island, ss, so it´s readily available for use in the final transformation. By using the order-by feature, we get the resulting table sorted by ZIP-code.

But what if the user chose a different sort order? We change it on the fly in the loaded stylesheet. The below code shows how: Since there´s only one element containing an order-by attribute, it´s easy to select it using XQL. We change the attribute value to the sort field name selected.

 ss.XMLDocument.selectSingleNode("//@order-by").nodeValue 

      = "+" & document.forms(0).SortField.value

      searchResults.innerHTML = xml.transformNode(ss.XMLDocument.documentElement)

Conclusion

This is how easy the future of metasearch engines could look with XML. Some XML here, a bit of XML DOM there, a spoon full of XSL, plus a trifle of XQL. It´s all standards and drafts, little coding, mostly declarations. So, be cool and brush up your database sites with a programmatic query interface returning XML data.

Next time we´ll look at how a metasearch engine can be implemented on the server. That provides a browser independent solution that is usually faster, since servers tend to be connected to the Internet with more bandwidth.

Please let me know how you liked this tutorial.