Building an XML-based Metasearch Engine on the Server
July 8, 1999
Applied XML Tutorial
In my last article I showed you how XML can make life for metasearch engines so much easier. I set up a scenario where two database-driven address directory sites (called "All Addresses" and "Best of Addresses") allowed access to their data through a simple query interface. But in addition to a regular search engine user interface, they also returned results in an XML format.
These XML formats were the foundation on which I built a client-side metasearch engine. After the user entered some search criteria, it queried the address directory sites and consolidated the returned XML data into one homogeneous result list (see Figure 1 for a layout of this scenario).
The metasearch engine worked just fine using XML, XSL and XQL (XSL pattern matching). It only had one drawback: it was very dependent on Internet Explorer 5.0 and its MSXML XML engine. Only clients running IE5 were able to use it. Today I'd like to show you how we can move the metasearch process to the server and deliver browser-independent HTML to any client. (I hope you don't mind that this solution will also rely on the MSXML component; but this time it's only needed in one place: on the server.) [Download the code samples]
1. Moving the Metasearch to the Server
First let's have a look at the server side metasearch engine. It's implemented as an ASP-page. When loading serverside.asp it displays the same user interface as the client side metasearch engine did. Let's try it:
You'll notice that some of the addresses are displayed with a yellow background. This is to distinguish the data coming from the different address directory databases. Addresses retrieved from site "Best of Addresses" are highlighted; the unmarked ones are from "All Addresses".
| Firma Karl-Heinz Rosowski | Maikstraße 14 | 22041 Hamburg | 721 99 64 | 21110111 |
| Fa. Kehlenbeck & Marquardt GmbH | Kanalstr. 47 a | 22041 Hamburg | 280 68 17 | 354827 |
| Firma Dieter Schreyack | Zum Meeschensee 65 | 22041 Hamburg | 04193/783 90 | 2514250 |
| Firma Willi H. Matschuck & Sohn | Poppenbütteler Weg 90 | 22041 Hamburg | 538 20 24/25 | 6429234 |
| Firma Hans-Jürgen Knaak | Am Schiffbeker Berg 10 | 22041 Hamburg | 732 77 44 | 6547105 |
Table 1: Sample result of the server side metasearch engine
From a user's point of view the difference between the client-side and the server-side metasearch engine is small. The user interface looks the same. So where's the difference? For one, the server-side solution can be used with any browser, since it produces browser-independent HTML, e.g. HTML 3.2. But let's have a closer look at the information flow of the server-side solution (see Figure 2).
As you can see, there's a bit more traffic between the client and the server than before, which is just what's normal for server-side database applications: the server sends a form to fill out to the client, the client sends some information back to the server, the server then does its database work and returns a result page to the client. Nothing unusual here.
What is unusual in this scenario is that, in order to produce a result to send back to the client, the server contacts other servers on the internet! The metasearch engine server thus temporarily becomes an internet client itself. Where in Figure 1 most of the traffic went on between the client and the database servers, now the traffic flows between the metasearch engine server and the database servers.
Retrieving XML Data from other Servers
The similarity in functionality and information flow between our former client-side metasearch engine and the server-side solution suggests that there should not be too much of a difference in how the server-side metasearch engine works. Let's take a look at the code:
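A sketch of the consolidation step might look like this. The element names (Addresses, Address) and the source tags are assumptions based on the description below; xml2Transformed stands for the "Best of Addresses" data after conversion to the canonical format, which we'll get to in a moment:

```
<%
' Consolidation sketch: xml1 holds the canonical-format result from
' "All Addresses"; xml2Transformed holds the converted "Best of
' Addresses" result. Element names are assumptions.
Dim xml, node, clone

Set xml = Server.CreateObject("Microsoft.XMLDOM")
xml.loadXML "<Addresses/>"

For Each node In xml1.documentElement.selectNodes("Address")
    Set clone = node.cloneNode(True)
    clone.setAttribute "source", "allAddresses"   ' tag the origin
    xml.documentElement.appendChild clone
Next

For Each node In xml2Transformed.documentElement.selectNodes("Address")
    Set clone = node.cloneNode(True)
    clone.setAttribute "source", "bestOfAddresses"
    xml.documentElement.appendChild clone
Next
%>
```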
This should look very familiar to you. We are reading XML results from the database sites into XML DOMs (xml1 and xml2) and then consolidating them in the XML DOM xml.
But as you can see, I've left out a very important point: how do we read in the XML results? On the client side we used IE5 XML data islands. But there is no (D)HTML page on the server; it's only in the process of being generated.
Instead we can use the MSXML component directly:
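A minimal sketch, assuming a hypothetical query URL on the "All Addresses" site; setting async to False makes the load synchronous, so the ASP page waits until the whole document has arrived before working with it:

```
<%
' Load XML straight from a remote URL with the MSXML component.
Dim xml1
Set xml1 = Server.CreateObject("Microsoft.XMLDOM")
xml1.async = False
xml1.load "http://www.allAddresses.com/xmlQuery.asp?name=" & _
          Server.URLEncode(Request.Form("name"))

If xml1.parseError.errorCode <> 0 Then
    Response.Write "Could not retrieve XML: " & xml1.parseError.reason
End If
%>
```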
This technique works as long as the servers we want to query provide an HTTP GET request "interface". That means as long as we can pass all query parameters as URL parameters, we can ask the MSXML parser to retrieve the XML data from a URL (instead of from a local file on the server).
Things become more complicated when the servers to be queried only offer an HTTP POST request "interface". I'll tackle a solution for that in a future article when I want to talk about more bidirectional XML communication, e.g. in B2B scenarios.
Transforming and Formatting
As you can imagine, the lack of XML islands also makes changes to the use of the XSL stylesheets necessary. We need a stylesheet for transforming the XML data from "Best of Addresses" to our "canonical" XML address format (which "All Addresses" already provides). And we need another stylesheet for sorting the consolidated data in xml as well as transforming it into plain HTML.
As in the above example, we compensate for the lack of XML islands with the explicit use of an XML DOM object: ssBestOfAddr. The metasearch engine loads the stylesheet, and xml2 applies it to itself, thereby producing an XML DOM (xml2Transformed) containing the transformed XML element tree.
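In code, that step might look like this (the stylesheet file name is an assumption; transformNodeToObject is the MSXML method that applies a stylesheet and puts the result into another DOM object):

```
<%
' Transform the "Best of Addresses" result into the canonical
' address format.
Dim ssBestOfAddr, xml2Transformed

Set ssBestOfAddr = Server.CreateObject("Microsoft.XMLDOM")
ssBestOfAddr.async = False
ssBestOfAddr.load Server.MapPath("bestOfAddr.xsl")

Set xml2Transformed = Server.CreateObject("Microsoft.XMLDOM")
xml2.transformNodeToObject ssBestOfAddr, xml2Transformed
%>
```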
The last task remaining is transforming the consolidated address list in xml to HTML. Like on the client side we do this by applying another stylesheet and sending back to the client the resulting HTML <table>:
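A sketch of that final step, including the check for an empty result (element names are assumptions):

```
<%
' Apply the HTML stylesheet to the consolidated list and send the
' resulting <table> to the client.
Dim ss
Set ss = Server.CreateObject("Microsoft.XMLDOM")
ss.async = False
ss.load Server.MapPath("serverSideAddresses.xsl")

If xml.documentElement.selectNodes("Address").length > 0 Then
    Response.Write xml.transformNode(ss)
End If
%>
```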
Of course we only need to do the transformation if there were any addresses returned from the database sites we queried. The stylesheet serverSideAddresses.xsl looks just like the stylesheet we used on the client. And as before we need to tweak it a little bit by inserting the requested sort order.
But there's a small thing I added. As you noticed in Table 1 above, the addresses are color-coded according to their origin. This is accomplished in two steps:
1. After receiving the XML address data each address "record" is tagged. The consolidation process simply adds an attribute (source) to each <Address>-element while copying it to xml.
2. Within the stylesheet for transforming xml to an HTML table the source-attribute is checked, and if it designates an address from site "Best of Addresses" the name column is highlighted by adding a background color to its <td>-element:
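A fragment in IE5's WD-xsl dialect illustrating the idea (the attribute value and the color are assumptions):

```
<td>
  <xsl:if test=".[@source = 'bestOfAddresses']">
    <xsl:attribute name="bgcolor">#FFFF99</xsl:attribute>
  </xsl:if>
  <xsl:value-of select="Name"/>
</td>
```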
With <xsl:attribute> the attribute is added to the XML output node it is located in, which is the <td> node.
We've finished moving the metasearch engine to the server. It wasn't necessary to change any of its workings, except for replacing the XML islands with explicit XML DOM objects. This demonstrates nicely how easy it is to set up client-server communication using XML, as well as server-to-server communication. Given a well-defined interface (how to pass parameters to the server plus an XML data format for the resulting data), an XML DOM component like Microsoft's MSXML COM component is sufficient for the job.
2. Paged Display of XML Result Sets
Since it was so easy to put the metasearch engine on the server, I'd like to add a little twist to it before I leave you alone with it. The question I'd like to raise is: how can we limit the addresses displayed to a certain number per page? It's a must-have feature for all search engines not to throw thousands of result items at the user, but to show just a subset of them at a time.
So far we are transforming all the address data we retrieved from the several database sites into an HTML table and sending it to the client. But how can we limit the display to a certain number of addresses at a time without sacrificing our XSL solution?
Using XSL to Display Subsets
In a "traditional" ASP-solution we'd have a recordset and a loop to iterate over it, for example:
XSL, however, is essentially descriptive, not algorithmic. Still, it provides looping constructs, and we are already using them:
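The loop in question is the <xsl:for-each> in our stylesheet, roughly like this (element names are assumptions; order-by is MSXML's pre-XSLT sorting attribute):

```
<xsl:for-each select="Addresses/Address" order-by="Name">
  <tr>
    <td><xsl:value-of select="Name"/></td>
    <td><xsl:value-of select="Street"/></td>
  </tr>
</xsl:for-each>
```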
<xsl:for-each> implicitly iterates through the list of <Address>-elements below the document element. What we now have to do is find a way to output the content of the <xsl:for-each>-element only for a specific number of records, e.g. addresses 10 to 20.
First I thought the solution would come easily by adding a twist of XQL to the select-attribute. XSL patterns provide a function (index()) to get at the index of a node in its parent nodelist. I added a filter to the query and felt very confident:
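My first attempt looked roughly like this ($lt$ is the less-than operator of IE5's XSL pattern syntax; the select path is an assumption):

```
<xsl:for-each select="Addresses/Address[index() $lt$ 10]" order-by="Name">
  ...
</xsl:for-each>
```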
The select-attribute now limits the <Address>-elements to the ones with indexes from 0 to 9. So far no problem. But when I looked at the first page, although it was limited to just a couple of addresses, it contained the wrong ones. The XSL engine had worked properly; I had simply misjudged the order in which the select- and order-by-attributes are processed. Instead of first applying the sort clause and then selecting the first couple of records, it of course worked the other way around. I was presented with a selection of addresses in unsorted order which contained only entries from the first database site. So I had to go back to the drawing board and see how I could 1. sort all addresses, and 2. select just the ones I wanted.
Rescue came by means of the <xsl:if> element.
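Reconstructed from the description that follows, the stylesheet now looks along these lines (the exact predicate syntax may differ slightly in the downloadable samples; $ge$, $lt$ and $and$ are IE5 XSL pattern operators):

```
<xsl:for-each select="Addresses/Address" order-by="Name">
  <xsl:if test="context()[index() $ge$ 0 $and$ index() $lt$ 10]">
    <tr>
      <td><xsl:value-of select="Name"/></td>
      <td><xsl:value-of select="Street"/></td>
    </tr>
  </xsl:if>
</xsl:for-each>
```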
<xsl:for-each> selects and sorts <Address>-elements as before and iterates over all (!) of them. But now we decide which ones we'll actually output by checking their index within the loop. However, since <xsl:if> and its test-attribute are independent of the elements selected by <xsl:for-each>, we have to explicitly grab that list, the context of the current XSL element, with the context()-function.
What was left was setting the range of indexes dynamically according to the page requested.
It works like setting the sort order before: find the <xsl:if>-element in the XSL stylesheet and set its test-attribute to an XSL pattern containing a range of indexes depending on the current page and the page size.
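A sketch of that manipulation (page and PAGE_SIZE are hypothetical names; ss is the stylesheet's XML DOM):

```
<%
' Insert the requested page range into the stylesheet before
' transforming.
Dim ifNode, first
first = (page - 1) * PAGE_SIZE

Set ifNode = ss.selectSingleNode("//xsl:if")
ifNode.setAttribute "test", _
    "context()[index() $ge$ " & first & _
    " $and$ index() $lt$ " & (first + PAGE_SIZE) & "]"
%>
```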
Now let's have a look at how this is working out.
MSXML and the Standards
The code presented here heavily relies on the Microsoft XML COM component MSXML, which is included with Internet Explorer 5.0. Unfortunately, the component does not implement the current XSLT draft. For example, the order-by-attribute used above is not compatible with the XSLT draft; instead the <xsl:sort>-element should be used, which is not yet implemented in MSXML. Also, Microsoft has added some proprietary methods to the XML DOM, e.g. selectSingleNode.
But still I'm using the component in my examples. Why is that, you might ask? Because it works, for the things I want to demonstrate here. This column is concerned with "How and where to use XML, XQL etc.?", not with "Which tool is the best?" or "Which tool most closely adheres to the standards?" (have a look at http://www.webstandards.org if you are concerned with this question). Please take the sample code I'm showing you as a bag of ideas. For example: using the proprietary method selectSingleNode instead of achieving the same effect with standard XML DOM methods simply helps to get the point across: "An XSL stylesheet is XML data and you can manipulate it using the XML DOM." This is what you should take home from it.
Please don't get me wrong, I'm all in favour of standards. But waiting for standards is no excuse for not learning about the benefits of the concepts and technologies to be standardized, before they get standardized. Plus, if there are working, pragmatic solutions out there, why not use them?
Keeping the Result Set across Page Calls
Only one problem remains to be solved: how do we keep the search result across page calls? Surely we don't want to requery all the database sites whenever the user just wants to flip to another page of the same result set. One way would be to store the consolidated XML DOM as raw XML in a session variable. That would be a trivial solution, but it would cost quite a bit of performance: we'd have to serialize the XML DOM and deserialize it on each page change. On the other hand there would be no threading problems, since only plain text would get stored in an ASP session variable.
Fortunately there's a much better solution. MSXML provides a free-threaded version of its XML parser component, so we can keep the whole XML DOM alive across page calls by storing an object reference in an ASP session variable.
Whenever the user issues a new query, the metasearch engine retrieves data from the other sites, consolidates it in xml, and stores xml as well as the stylesheet (ss) for transforming xml into an HTML table in session variables. Both XML DOM objects are created using the progID Microsoft.FreeThreadedXMLDOM. Created as free-threaded objects, they don't glue the ASP page to a particular execution thread within Internet Information Server, which is very important for high-performance web sites.
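The pattern, as a sketch:

```
<%
' Keep the DOMs alive across page calls. Both objects are created
' with the free-threaded progID so that storing them in the Session
' doesn't tie the application to a single thread.
Dim xml, ss
Set xml = Server.CreateObject("Microsoft.FreeThreadedXMLDOM")
Set ss  = Server.CreateObject("Microsoft.FreeThreadedXMLDOM")
' ... retrieve, consolidate, and load the stylesheet as before ...
Set Session("xml") = xml
Set Session("ss")  = ss

' On a subsequent page call:
Set xml = Session("xml")
Set ss  = Session("ss")
%>
```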
But be careful: if you use the free-threaded version of the XML DOM, you can't exchange nodes with instances of the non-free-threaded version. So be sure to use only one threading model for the XML DOM within an ASP page.
There isn't much more to server-side paged display of XML data: keep the data around across page calls and know thy stylesheet. Or maybe you want to use the XML DOM directly to generate the HTML, which can be faster, since you don't have to iterate over all the elements just to display a few.
Oh, and there is one last thing you could improve: you could sort the consolidated XML data in xml just once, before saving a reference to the object in the session variable. Then in the stylesheet you could do without the order-by-attribute and gain performance. But I'll leave that to you as an exercise ;-)
If you like, let me know if this article was of any value to you.