XML.com: XML From the Inside Out

Junglee Tries to Tame the Data Jungle

August 05, 1998

The Seybold Report on Internet Publishing
Vol. 1, No. 12

Creates unified access to information scattered across many different resources. Online classifieds are its first target.

August, 1997
Mark Walter

AT A TIME WHEN many Internet start-ups tout their ability to bring information to the Web, one company is talking about the opposite: the potential for extracting data from the Internet and delivering it to enterprise applications.

The company is Junglee, a California start-up that has just announced its initial product, Canopy, a service for creating virtual databases published to the Web. Junglee's initial customers include the Washington Post, the Wall Street Journal Interactive Edition, Knight-Ridder New Media and Westech.

Junglee's unique development is its virtual database (VDB) technology. VDB treats external data sources, such as the World Wide Web, as extensions of an enterprise's relational database. (An analogy may be drawn to virtual memory, which treats conventional storage as an extension of physical memory.) With external sources looking like part of the main database, users can more easily query and retrieve information from several sources.

The company

Junglee was founded in June 1996 by five engineers, four of whom were computer scientists at Stanford University.

According to company lore, VP of engineering Venky Harinarayan saw the need for the product while helping his girlfriend search the Internet for job listings. Harinarayan teamed with several Stanford colleagues (Junglee chief architect Anand Rajaraman and VP of engineering Ashish Gupta) to create a prototype in April 1996. That year they piqued the interest of Rakesh Mathur, an entrepreneur who already had two start-ups under his belt.

Mathur, now president and CEO, got the company off the ground, securing $6.6 million in funding from Nichimen Corp., Kyocera and the Washington Post. By early 1997, Junglee had secured its first customers, with investor Washington Post leading the list.

Junglee currently employs 40 people and expects to generate about $2 million in revenues this year.

The company's two marketing executives are Michael George and Craig Olson. George, VP of business development, is the former VP of technology at Digital Ink, the new media subsidiary of the Washington Post. A veteran newspaper systems man who had worked at both the Baltimore Sun and USA Today, George became smitten by Junglee's technology while working at Digital Ink. "I have deep roots in Maryland," George said, "but Junglee's technology was the one thing I've seen in the past 15 years in this industry compelling enough for me to uproot my family and move to California."

Olson, VP of marketing, has 15 years' experience in high-tech marketing, including stints at Silicon Graphics, Auspex and most recently BayStone Software.

The firm has some high-profile backers. The board of directors includes Fred Gibbons, founder of Software Publishing; Jeffrey Ullman, the former chairman of the computer science department at Stanford University; Tsuyoshi Taira, the past chairman of Sanyo Semiconductor; and Ralph Terkowitz, the VP of technology at the Washington Post who created its Digital Ink subsidiary.

Technology

A complete Junglee VDB system has two components, one for creating the virtual database and another for publishing it. The output of Junglee's system is an integrated table that can be fed to a database.

The core technology: VDB and the data integration system. In order to create the virtual database, Junglee creates a "wrapper" for each unique data source that makes it look like a set of relational tables to a database management system. Using the wrapper, these external data sources can be queried using SQL queries.
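To make the wrapper idea concrete, here is a minimal sketch in Python. It is not Junglee's code: the sample rows, the table name employer_jobs and the functions wrap_source and query_titles_by_city are all hypothetical, and an in-memory SQLite database stands in for the enterprise database. The point is only that once a source is presented as a table, ordinary SQL applies.

```python
import sqlite3

# Hypothetical records standing in for listings pulled from one employer's site.
EMPLOYER_SITE_ROWS = [
    {"title": "Database Engineer", "city": "Sunnyvale", "url": "http://example.com/jobs/1"},
    {"title": "Technical Writer", "city": "Boston", "url": "http://example.com/jobs/2"},
]


def wrap_source(conn, table_name, rows):
    """Expose one external source as a relational table (illustrative wrapper).

    table_name is assumed to be a trusted identifier chosen by the developer.
    """
    conn.execute(f"CREATE TABLE {table_name} (title TEXT, city TEXT, url TEXT)")
    conn.executemany(
        f"INSERT INTO {table_name} VALUES (:title, :city, :url)", rows
    )


def query_titles_by_city(conn, table_name, city):
    """Run a plain SQL query against the wrapped source."""
    cursor = conn.execute(
        f"SELECT title FROM {table_name} WHERE city = ? ORDER BY title", (city,)
    )
    return [row[0] for row in cursor.fetchall()]


conn = sqlite3.connect(":memory:")
wrap_source(conn, "employer_jobs", EMPLOYER_SITE_ROWS)
print(query_titles_by_city(conn, "employer_jobs", "Sunnyvale"))
```

A real wrapper would fetch and parse the remote pages on demand or on a schedule; here the rows are supplied directly to keep the sketch self-contained.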

For its approach to work, Junglee has to make the external data format fit the constraints of SQL. One obvious application is to convert freeform text (such as salary ranges from a paragraph listing) to fielded data (a numeric range in the salary field). For this purpose, Junglee has written an Extract Description Language in which developers can create extraction rules for converting freeform text into fielded data.
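The article does not show the syntax of Junglee's Extract Description Language, so the following Python sketch substitutes a regular expression for an extraction rule. The pattern SALARY_RANGE and the function extract_salary_range are hypothetical; they illustrate how a salary range buried in a paragraph might become a pair of numeric fields.

```python
import re

# Hypothetical rule: matches "$45,000 to $60,000", "$45,000-$60,000" or "$45K - $60K".
SALARY_RANGE = re.compile(
    r"\$(\d[\d,]*)(K?)\s*(?:-|to)\s*\$(\d[\d,]*)(K?)", re.IGNORECASE
)


def _to_dollars(digits, k_suffix):
    """Convert a captured number such as '45,000' or '45' plus 'K' to whole dollars."""
    value = int(digits.replace(",", ""))
    return value * 1000 if k_suffix else value


def extract_salary_range(text):
    """Return (low, high) in dollars, or None if the text contains no range."""
    match = SALARY_RANGE.search(text)
    if match is None:
        return None
    low = _to_dollars(match.group(1), match.group(2))
    high = _to_dollars(match.group(3), match.group(4))
    return (low, high)


print(extract_salary_range("Salary $45,000 to $60,000 plus benefits"))
```

A single pattern like this covers only a few phrasings; a production rule set would presumably need many such rules per field.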

Another typical application is to capture URL hyperlinks or the titles of Web pages into database fields.
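As a rough illustration of that kind of capture, the Python sketch below uses the standard library's html.parser to pull a page title and its hyperlinks into a dictionary whose keys could map to database columns. The class LinkAndTitleParser and the function extract_fields are hypothetical names, not part of Junglee's product.

```python
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collects the page title and every href found on anchor tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_fields(html):
    """Return the title and links of a page as field values."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


page = "<html><head><title>Jobs at Example</title></head><body><a href='/jobs/1'>Engineer</a></body></html>"
print(extract_fields(page))
```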

Because different sources are likely to have widely varying schemata and vocabularies, Junglee has also developed a mapper facility that does data transformations. For example, if most sources reported salary as dollars per month, you could convert those that are reported in dollars per week or dollars per hour into dollars per month.
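The salary example can be sketched in a few lines of Python. The conversion factors (a 40-hour week and a 52-week year) are assumptions made for illustration, as is the function name to_dollars_per_month; the article does not say what factors Junglee's mapper uses.

```python
# Assumed conversion factors; a real mapper would make these configurable.
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12


def to_dollars_per_month(amount, unit):
    """Normalize a salary quoted per hour, week or month to dollars per month."""
    if unit == "month":
        return float(amount)
    if unit == "week":
        return amount * WEEKS_PER_YEAR / MONTHS_PER_YEAR
    if unit == "hour":
        return amount * HOURS_PER_WEEK * WEEKS_PER_YEAR / MONTHS_PER_YEAR
    raise ValueError(f"unknown salary unit: {unit!r}")


print(to_dollars_per_month(1200, "week"))
print(to_dollars_per_month(30, "hour"))
```

Under these assumptions, $1,200 per week and $30 per hour both normalize to the same monthly figure, which is what lets listings from different sources be compared in one query.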

Setting up the publishing rules. The "brain" behind a VDB system is its data publishing system. The publisher sets up publishing rules to schedule data acquisition, transformation and dissemination. The wrapping, mapping and extraction processes result in a snapshot of the data that is fed to a data warehouse, which is the repository from which the publisher's customers retrieve the documents. (Because Junglee builds relational tables, the simplest way to store the snapshot is to load it into a relational database.)
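The article does not describe how publishing rules are written. As one plausible reading of the scheduling part, the Python sketch below models a rule as a source name plus a refresh interval and works out which sources are due for acquisition. PublishingRule and due_sources are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PublishingRule:
    """Hypothetical rule: how often one source should be re-acquired."""
    source: str
    interval: timedelta


def due_sources(rules, last_run, now):
    """Return the sources whose interval has elapsed since their last run.

    last_run maps a source name to the datetime of its last acquisition;
    a source with no entry has never been acquired and is always due.
    """
    due = []
    for rule in rules:
        previous = last_run.get(rule.source)
        if previous is None or now - previous >= rule.interval:
            due.append(rule.source)
    return due


rules = [
    PublishingRule("employer_a", timedelta(days=1)),
    PublishingRule("employer_b", timedelta(weeks=1)),
]
last_run = {"employer_a": datetime(1997, 8, 1), "employer_b": datetime(1997, 8, 1)}
print(due_sources(rules, last_run, now=datetime(1997, 8, 3)))
```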

For now, Junglee runs the VDB system and all of the associated administration software. With Junglee's help, the publisher creates a single user interface for users to query multiple feeds and controls how often and in what way material is captured from external sources. Standard database development tools, such as PowerBuilder, Delphi or Visual Basic, can be used to create the database applications, which are independent of the periodic loading of data that takes place through Junglee's software.

Each data source has its own data schedule, and transformations can be part of the extraction and mapping process. If a data source becomes unavailable (e.g., a Web server has crashed), Junglee's system will poll the site until it comes back up.
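The polling behavior might look something like the following Python sketch. The function poll_until_available, its parameters and the simulated source are assumptions for illustration; the article does not say how often Junglee retries or whether it ever gives up.

```python
import time


def poll_until_available(fetch, interval_seconds=60.0, max_attempts=None, sleep=time.sleep):
    """Call fetch() repeatedly until it succeeds, waiting between attempts.

    fetch is any callable that returns the source's data or raises OSError
    (the base class of ConnectionError and TimeoutError) when the source is
    down. With max_attempts=None the loop retries indefinitely; otherwise the
    last error is re-raised once the limit is reached.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return fetch()
        except OSError:
            if max_attempts is not None and attempts >= max_attempts:
                raise
            sleep(interval_seconds)


# Simulated source that is down for the first two attempts.
state = {"calls": 0}


def simulated_fetch():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("server unavailable")
    return "<html>listings</html>"


print(poll_until_available(simulated_fetch, interval_seconds=0))
```

Passing the sleep function as a parameter keeps the retry loop easy to exercise without real delays.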

Going live with listings

Junglee's aggregating VDB technology has many potential applications, but the company is focusing right now on classified listings that marry public Web site data to private repositories.

Web job board. The first application of Junglee is JobCanopy, which is being used by the Washington Post for its CareerPost online employment section. JobCanopy aggregates job listings posted on Web sites of employers and provides a query-and-result interface that integrates this information with the print classified ads that newspapers already have.

A rich source of job postings is needed to make the service compelling. Behind its firewall, Junglee has been building an inventory of jobs posted by the largest employers in the U.S. The database currently collects from several hundred firms, and by the end of September, Olson expects Junglee to have more than 1,000 sources set up.

Although it could sell the technology that points to the employer Web sites, Junglee's present business model is to sell the data feed and the integrated query mechanism to "enterprises that want to dominate the regional and national online recruitment markets." Newspapers, which dominate local recruitment ad markets and are interested in fending off potential online competitors, are a natural audience.

Revenue potential. There appears to be decent revenue potential, for both publishers and Junglee. Olson predicts that employers will pay publishers $1,000 to $5,000 a month to have their ads posted at a newspaper's Web site, in addition to buying ads for the print editions. Those numbers were confirmed by both the Post and the Wall Street Journal.

At the beginning of August, the Post was charging 46 firms an introductory price of $500 per month. Tim Ruder, director of online classifieds at the Post, said the rate would jump to the $1,000-to-$5,000 range at the end of September. He declined to mention a sales target, but clearly the Post is on track to easily make back its investment.

Ruder also confirmed that the Junglee postings have been an additional revenue stream for the Post, not a drain on conventional ad income. "We have not experienced cannibalization of the print classified ads," he said.

The Wall Street Journal Interactive launched its Junglee-powered service (careers.wsj.com) in early August with more than 20 advertisers. Thomas Baker, business director at the interactive Journal, said Dow Jones (publisher of the Journal) is charging $1,500 per month, with lower rates for long-term commitments and for advertisers who also buy print classifieds in the Journal. Baker added that working with Junglee has been a positive experience: "It's been a good partnership. They're very attuned to the classified advertising area, and they've delivered on their promises," Baker said.

Westech Expo Corp., a firm that runs high-tech job fairs in the U.S., is using Junglee to automate the collection and update processes for its Virtual Job Fair (www.vjf.com), which fields 200,000 job queries daily. The site publishes job listings for more than 600 firms, charging the employers between $300 and $3,800 per month.

For VJF, Junglee has not produced a new revenue stream, but the automation is helping to distinguish it from other online job-posting services. VP Paul Burrowes noted that in the little more than a month since Junglee has been running at VJF, its automated routines have enabled his firm to implement a flat-rate package and made it easier for the advertisers' personnel departments to prepare their listings.

Does it work? A key question, of course, is whether Junglee's technology is effective in converting free text to fielded data. Investor Terkowitz, chief technology officer at the Washington Post and acting CEO of Digital Ink during its inception, said, "It's not magic, but it does work, once the knowledge rules are set up." As an example, Terkowitz said that while testing the product, the Post noticed an unusual number of jobs with a salary of $401,000. It turned out that Junglee was inadvertently pulling 401K (a reference to U.S. retirement plans) as a salary figure. Developing and adjusting the extraction rules are critical to making Junglee's VDB system useful.
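The 401K mix-up is easy to reproduce in miniature. In the Python sketch below, both patterns are hypothetical stand-ins, since the article does not show Junglee's actual rule or its fix: a naive rule that reads any number followed by "K" as a salary produces the $401,000 figure, while a stricter rule that also requires a dollar sign does not.

```python
import re

# Naive rule: any number followed by "K" is read as a salary in thousands.
NAIVE_RULE = re.compile(r"(\d+)K\b", re.IGNORECASE)

# Stricter rule: the number must also be preceded by a dollar sign.
STRICTER_RULE = re.compile(r"\$(\d+)K\b", re.IGNORECASE)


def extract_salary(text, rule):
    """Return the first salary the rule finds, in dollars, or None."""
    match = rule.search(text)
    return int(match.group(1)) * 1000 if match else None


listing = "Great benefits including 401K and dental."
print(extract_salary(listing, NAIVE_RULE))     # picks up the retirement plan
print(extract_salary(listing, STRICTER_RULE))  # finds no salary
```

The stricter rule is only one possible adjustment; it would still miss salaries written without a dollar sign, which is the kind of trade-off that rule tuning has to settle.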

Looking at the Post's listings, there's no question that they provide a much larger job pool and more interesting search possibilities than your typical class-ad listing service. We also found that the vast majority of the listings are very clean. But anomalies still crop up. Even now, after six months of getting the kinks worked out, the Post site occasionally has listings with odd, abbreviated titles that aren't expanded into English.

Up next: more listings. Junglee plans to follow JobCanopy with Canopy products for automotive, computer, real estate and apartment rental listings.

In the computer market, for example, Junglee sees potential for creating a new aggregate catalog culled from the Web sites of computer vendors. By tapping into the vendors' sites, the publisher of such a catalog would be able to keep its own model and price information current, with the vendor doing most of the data entry.

A similar application could create parts catalogs in the military, engineering or other industries.

Products, pricing and futures

Each Canopy product contains three basic components: a license to use the VDB database Junglee has created, an end-user query application and an integrated data table, which feeds the application the updated content on a customer-defined periodic basis.

A typical installation with 50 Web input sources costs $150,000 for software and licenses, plus $150,000 annually for an integrated data table published once per week.

Although it probably won't do so before 1998, Junglee does plan to sell the VDB system through indirect sales channels, such as specialty resellers and integrators. Doing so would enable customers to build knowledge bases on topics of their choice. Again, the first target applications will be structured listings, such as jobs posted on a company intranet instead of on public Web sites.

Another future direction is to develop extractors for languages other than English in order to take the technology overseas. Nichimen has expressed interest in helping to bring it to Japan.

Conclusion

As both publishers and consumers become increasingly overwhelmed by the flow of information from disparate data sources, content aggregation is rising in prominence and value. On the Web, publishers are aggregating content to produce subject-specific services and products (MD Consult, ZD Net, BioMedNet and so forth) as well as to create general-interest sites (Infoseek, Yahoo and others). On the customer side, business consumers are showing increasing interest in aggregation services and systems (PointCast, DataChannel and so forth) that pull together information from various sources and present it to users through a single user interface. Junglee's service lends itself to publishers who want to aggregate data taken from the Web into a single database.

From a technology vantage, some of the search engines (e.g., Fulcrum) provide a way to query both full-text and relational data sources from a single user interface. In effect, these systems aggregate content from the user's point of view, without requiring that the data itself be brought together and made consistent. In other words, they provide unified access, but they do not modify the data to improve relevancy.

Junglee's system is very expensive, but so is making freeform text behave like normalized relational data. If the knowledge domain lends itself to relational tables, then the precision and efficiency of the search are much greater if the data can be pulled into the tables instead of being indexed and queried as full text. Online classifieds are a natural fit for this technology, and so far Junglee's initial customers are having some success in using the service to derive a new source of revenue for their online efforts. Anyone who is serious about online classifieds in the U.S. should take a look at whether this technology would help their business.

Will Junglee's technology spill over to other applications? Certainly there are other searches that might benefit from the VDB approach, but it won't be until the company begins putting its underlying system out in the field that its general-purpose utility will be proven.

Junglee Corp., 1250 Oakmead Parkway, Suite 310, Sunnyvale, CA 94086; Phone (408) 522-9494; Fax (408) 522-9470; www.junglee.com