The Future of Search

Stephen E. Arnold
Consultant
Postal Box 320
Harrod's Creek, Kentucky 40027
Voice: 502-228-1966
E-mail: sa@arnoldit.com

Abstract

Pay-for-placement is becoming one of the most important developments in search and retrieval. The cost of indexing and providing access to content is high. In order to generate revenue, the practice of selling key words and guaranteeing traffic to the purchaser of these key words is moving from the public Internet to other content domains. The result is that online searchers must focus on the authority, accuracy, and pedigree of the online service and the search results it displays. The future is not advanced linguistics and visualization. The future is a greater emphasis by the user on the mechanisms for indexing, determining relevancy, and providing useful pointers to information that answers the user's questions.

Introduction

The future of search has arrived. The challenge to information professionals is to accept that there is an elephant in the results.

Spidering Internet and commercial sites, indexing these content sources, and fulfilling search requests is an expensive activity. Indexing 50 million sites and providing 99.9 percent uptime for public access costs somewhere in the range of $80,000 to $200,000 per month. The variance is a result of the cost associated with bandwidth "spikes." As long as traffic remains within the contracted bandwidth, no additional telecommunications fees are assessed. When a spike in usage occurs, the cost of bandwidth is adjusted accordingly.

Now, how much does it cost to do one billion Web sites? Unfortunately, the cost does not decrease the way unit costs decline when one purchases larger and larger quantities of dish-washing detergent or other fungible products. The costs associated with search and retrieval tend to rise as the volume of content searched, indexed, and served goes up. A good rule of thumb is to take the baseline cost for 50 million Web pages (in our example, $100,000 per month for 12 months, or $1.2 million per year) and scale it to the number of pages you need to spider, index, and make searchable. The result is $24 million per year to index one billion content sources. This figure must then be multiplied by a factor of 1.3 to 1.6 (based on the data to which the Arnold IT team has had access), which yields $31.2 million to $38.4 million.
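For readers who want to check the arithmetic, the short Python sketch below walks through the same calculation. The baseline figures and the 1.3 to 1.6 multiplier are the ones cited above; the function itself is only an illustration, not a costing model.

  # Back-of-the-envelope sketch of the scaling arithmetic described above.
  # Baseline: $100,000 per month for 50 million pages; overhead multiplier
  # of 1.3 to 1.6 as cited in the text. Illustrative only.
  def annual_search_cost(pages, baseline_pages=50_000_000,
                         baseline_monthly_cost=100_000,
                         overhead_low=1.3, overhead_high=1.6):
      """Return a (low, high) estimate of yearly cost in dollars."""
      baseline_annual = baseline_monthly_cost * 12            # $1.2 million
      linear_estimate = baseline_annual * (pages / baseline_pages)
      return linear_estimate * overhead_low, linear_estimate * overhead_high

  low, high = annual_search_cost(1_000_000_000)               # one billion pages
  print(f"${low:,.0f} to ${high:,.0f} per year")              # $31,200,000 to $38,400,000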

These costs rise largely as a result of the following expenses (excluding marketing) that accompany a large commercial search infrastructure:

  • Network costs. The infrastructure must be upgraded and maintained.
  • Code tweaking. The software used to perform the functions associated with search must be tuned, enhanced, and fixed.
  • Hardware. The "plumbing" used for the service requires maintenance and upgrades.

Why are the costs of doing search and retrieval surprising to most people? The reason is that most of the experts in search, online information, and indexing have never worked at the scale required to index large volumes of content in a high-demand environment.

Most of the search engines that index the Internet are recycling content from utility services: commercial operations that provide their "hits" to licensees. Search Microsoft Network, America Online, Free.fr (France), or any of the more than 1,500 search services operating in the U.S. and Europe today, and the results come from Google, FAST Search & Retrieval, Inktomi, Overture, or a handful of other services.

The blunt fact of the matter is that Internet search is largely a story of indexing sites that answer the most frequently asked questions and the sites that draw the most traffic.

There has been a surge of interest in sites that accept a user's query, pass that query to other sites, and aggregate the results. One well-received service performing metasearching is Ixquick. Ixquick is interesting because the service uses algorithms loosely based on the type of mathematics associated with index funds and a variation of popularity voting and link analysis.
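The mechanics of this kind of aggregation are simple enough to sketch. The Python fragment below is a minimal illustration of popularity voting across engines: a result earns one vote for every underlying engine that places it in its top ten. It is a toy under my own assumptions, not Ixquick's algorithm, and the engine names and URLs are placeholders.

  from collections import defaultdict

  def fuse_results(per_engine_results, top_n=10):
      """per_engine_results maps an engine name to its ranked list of URLs."""
      votes = defaultdict(int)
      best_rank = {}
      for engine, urls in per_engine_results.items():
          for rank, url in enumerate(urls[:top_n], start=1):
              votes[url] += 1
              best_rank[url] = min(rank, best_rank.get(url, rank))
      # Sort by vote count (descending), then by best single-engine rank.
      return sorted(votes, key=lambda u: (-votes[u], best_rank[u]))

  merged = fuse_results({
      "engine_a": ["http://site1", "http://site2", "http://site3"],
      "engine_b": ["http://site2", "http://site4"],
  })
  print(merged)   # "http://site2" comes first because both engines list it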

Consultancies offer reports, sometimes for free and other times at mind-boggling prices, that review and analyze commercial search engines. Most of these reports provide profiles of Verity (the current market leader in full-text search due to the company's aggressive licensing program), Open Text (whose search technology is a hybrid of Tim Bray's original work and acquired technology such as BASIS), and some newly introduced engines such as iPhrase, ClearForest, and Stratify (a taxonomy generation and content classification service).

Some Key Differences

The difference between the search engine used at FAST Search & Retrieval and a specialist engine such as Tacit Knowledge Systems is scale and purpose. FAST's technology can handle high volumes of content, perform specialized functions such as variable crawler scheduling and real-time index updating, and support millions of queries per hour. The specialized engines focus on newer technology: associative search that finds related content (roughly what the Teoma technology offers), specialized classification (roughly similar to Vivísimo's service), visualization (the Flash-centric KartOO service I wrote about in a recent Information World Review column), free-text access to content stored in databases (EasyAsk), or advanced linguistic analysis that classifies specific content domains (the type of service provided by Applied Semantics, formerly Oingo).

The points of difference boil down to four, although some of the search engine experts will find my list woefully incomplete or just plain wrong:

  1. Scale. Most of the new services cannot index tens of thousands of servers, handle more than 40 million "pages" of content, and support thousands of simultaneous queries. They can't. These services never will. Period.
  2. Index updating. The weak link in most search and retrieval engines is building an index and making it available each time new content is discovered, parsed, and submitted to the index refresh subsystems. In the past, users of the free Architext engine (originally used at the pre-crash Excite) found that it simply could not update its base index when more than 40,000 new sites were added. There are similar index update "problems" in most of the search engines on offer today. Discovering an index refresh problem after a license has been signed is what one might call an "unfortunate circumstance."
  3. Speed. The majority of search and retrieval systems demonstrate some sluggishness. The bottlenecks are in updating the index, handling high volumes of requests and the corresponding serving of content, and chokepoints associated with maintenance of various system components. An example would be improving the performance of spidering and indexing special-purpose content such as one of the four or more types of Adobe Portable Document Format content objects. Note: Some Adobe PDF documents require converting an embedded Tagged Image File Format object so that it can be processed by an optical character recognition engine. The resulting ASCII text is then included in the index; a sketch of this workflow appears after this list. (Caution: If your search vendor has not explained which types of PDF documents the system can index, you may want to find out how much Adobe PDF content you think you are indexing but are not.)
  4. Integration. Anyone who has tried to plug a third-party classification scheme into a commercial indexing engine knows that the job is neither easy, quick, nor cost-free. Inktomi purchased Quiver to have a classification technology that could be "baked in" to the Inktomi search and retrieval system. Inktomi's previous use of Stratify was sufficient exposure to the cost and time required to "bolt on" a system without access to the classification engine's source code.
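The PDF note in point 3 deserves a concrete illustration. The Python sketch below shows the decision in its simplest form: index the text layer when one exists; otherwise convert the embedded image via OCR first. The helper functions (extract_text_layer, ocr_page_image, add_to_index) are hypothetical stand-ins for whatever parsing, OCR, and indexing tools a licensee actually uses.

  # A sketch of the PDF handling decision described in point 3. The helpers
  # named here are hypothetical placeholders, not a particular vendor's API.
  def index_pdf(pdf_pages, extract_text_layer, ocr_page_image, add_to_index):
      """pdf_pages: an iterable of page objects from a PDF parser of your choice."""
      for page in pdf_pages:
          text = extract_text_layer(page)      # empty string for image-only pages
          if not text.strip():
              # Image-only page: run the embedded TIFF through OCR first.
              text = ocr_page_image(page)
          if text.strip():
              add_to_index(page, text)         # hand the ASCII text to the indexer
          # Pages that still yield no text are skipped; they are the content
          # you think you are indexing but are not.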

Snapshot: Taxonomy

Inktomi illustrates the importance of taxonomy in commercial search engines. Two years ago, taxonomy was a remote technology province. On the cusp of 2003, search and retrieval requires a mechanism to place content into what can be compared to Yahoo!-style categories.

Most organizations are discovering or admitting that the information on their internal system is largely uncharted territory. A search engine allows one to locate information, but search engines work best when the searcher knows what is in the domain. For example, one does not look for a consultant's report on shareholder value in Compendex.

A classification of internal content allows a searcher to look at top-level categories and get a sense of what is available to explore via pointing and clicking or searching via key words. Taxonomy is one of the key technologies in Inktomi's "enterprise" search service.

With the acquisition of Quiver's classification software, Inktomi has enhanced what it markets as Inktomi Enterprise Search. Those with a long search memory will recall that the core of Enterprise Search is Infoseek's software. With the addition of taxonomy technology from Quiver, Inktomi argues that taxonomy helps bolster relevance and query features, particularly for multiple-word queries. One interesting new feature is the addition of a "use for" list of suggestions when a query does not return results.

Inktomi has added a spell-check capability that suggests alternative spellings for misspelled queries. In addition, because the dictionary is built from terms found in documents being indexed, the spell check can suggest terms that match commonly used words in the corporate network.
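A corpus-derived spell check of this kind is straightforward to approximate. The Python sketch below builds its dictionary from the terms found in the indexed documents and suggests the closest frequent term for a misspelled query. It illustrates the idea only; it is not Inktomi's implementation.

  from collections import Counter
  from difflib import get_close_matches

  def build_dictionary(indexed_documents):
      """Count every token seen during indexing; those tokens form the dictionary."""
      counts = Counter()
      for doc in indexed_documents:
          counts.update(doc.lower().split())
      return counts

  def suggest(term, dictionary, cutoff=0.75):
      candidates = get_close_matches(term.lower(), dictionary.keys(), n=3, cutoff=cutoff)
      # Prefer the candidate that occurs most often in the corpus.
      return max(candidates, key=dictionary.get) if candidates else None

  vocab = build_dictionary(["quarterly shareholder report",
                            "shareholder value analysis"])
  print(suggest("sharholder", vocab))          # -> shareholder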

Inktomi has also added what it calls a "Quick Summaries" feature: the user sees the query term in the context of the overall document being searched. The Inktomi Classifier 2.0 adds what Inktomi calls "topic directories." Like Verity, Inktomi has shifted from a "the system does it all" approach to a design that combines auto-classification with human intervention in the categorization process. Inktomi, however, has tossed in a touch of workflow, or what it calls "distributed editorial approval."
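The "Quick Summaries" idea is essentially keyword-in-context display. The short Python sketch below shows one way to produce such a snippet; it is an illustration of the concept, not Inktomi's code, and the sample text is invented.

  def quick_summary(document, query_term, window=40):
      """Return a short snippet centered on the first occurrence of query_term."""
      pos = document.lower().find(query_term.lower())
      if pos == -1:
          return document[:2 * window] + "..."     # fall back to the opening text
      start = max(0, pos - window)
      end = min(len(document), pos + len(query_term) + window)
      prefix = "..." if start > 0 else ""
      suffix = "..." if end < len(document) else ""
      return prefix + document[start:end] + suffix

  text = ("The committee reviewed the taxonomy proposal and agreed that "
          "distributed editorial approval would be required before publication.")
  print(quick_summary(text, "editorial"))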

What search and retrieval vendors are fast learning is that customers want to be able to do more than bang in a few key words and get a list of "hits." A richer, more mature software environment is now needed. This demand for tighter integration is likely to be the demise of search and retrieval as we now know it. Search is being componentized and subsumed into enterprise software that performs back-office and front-office functions.

I offer Inktomi as one example of what is needed to keep traditional search and retrieval viable. The astute reader will note, however, that I have not yet acknowledged the presence of the elephant that sits calmly at the search and retrieval conference table. "What elephant?" many ask. Let's look at her.

The Elephant

The elephant in search and retrieval is paid listings. The reason is that the economics of search and retrieval are such that money is needed to pay the bills. The leaders in pay-for-traffic are Overture, Google, Espotting, and a number of lesser contenders.

Most professional searchers accept paid listings as a fact of life. One well-known search expert who writes product reviews for search firms to use as "white papers" said to Arnold IT, "Paid listings are one way to generate income. As long as the listings are clearly identified, these results are not too much of a concern."

Arnold IT's view of paid listings is a bit different. Consider Overture. The company is profitable, netting $20 million from revenue of $288 million in 2001. It operates an infrastructure that is as large as or larger than Google's. The company provides pay-for-traffic hits to most of the major search "engines" and public portals in the world. For the few services that Overture does not have as clients, Google provides a similar service.

When a person searches Lycos (U.S. or Europe), where are the hits coming from? This is an important question to ask about any of the search engines. The reason is that what is known as "preferential placement" is making its presence felt in search and retrieval services operating behind firewalls and in government-sponsored search services.

These "objective" services are in fact moving toward the "pay for placement" model as a result of financial pressures. In the case of internal services, senior management may want to emphasize certain products or services. In the case of Extranets, the type of service that Arrow Electronics is offering invites certain vendors to "buy" placement on certain pages. In the case of government agencies, departments or governmental units with funds are already exerting pressure on certain government search services to give them eyeball real estate on the splash page or a top listing in a user's query results.

The elephant to which I am referring is the technology that permits the following functions to be implemented within a search and retrieval service:

  1. Selling words and related terms.  A marketer identifies the words or terms that are pertinent to his or her Web page or service. These words (assume airplane, vacation, holiday, and travel) are wanted by an online travel agency. "Buying" these words guarantees that when a user's query containing them is entered in a participating search engine, the top hit in the list of "relevant items" will be that travel agency. The computational horsepower needed to parse a query, pass it to a "pay for placement" module, and interleave the paid listings with the unpaid listings is now readily available; a sketch of this interleaving appears after this list.

  2. Search tracking.  Most online searchers are blissfully unaware that their queries are monitored, parsed, written to a database, and tracked. One can see this in operation by looking at the free key word tracking service offered to anyone visiting the Overture site. It is possible to see how many queries are launched for specific words such as "travel" and "discount hotel." Search and retrieval services operating behind firewalls can be monitored using native tools or supplemented with special-purpose tools that allow a system manager to know what content is searched, how often, and by whom. These data are used to "bubble up" relevant content.

  3. Slipstreaming content. A good example of this is the display of advertisements on the Google results pages. A flagged or purchased word is linked to a content object, which is a Google advertisement. This is a variation of the personalization technology in use at many sites. The idea is that when a particular term or a related term is searched, the system displays a related content object. Google differentiates the paid advertisements on its own site. However, Google licensee the BBC does not. When a query is run on the BBC site, the hits are those selected by the BBC system administrator, and the slipstreamed content is displayed in much the same way as ordinary results.
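To make the mechanics of point 1 concrete, the Python sketch below maps a parsed query against a table of purchased terms and interleaves any paid listings with the organic hits. The purchased-term table, URLs, and slot positions are hypothetical; real systems are far more elaborate.

  # Hypothetical purchased-term table; a real system would store this in a database.
  PAID_LISTINGS = {
      "travel": ["http://example-travel-agency.com"],
      "vacation": ["http://example-travel-agency.com"],
  }

  def blend_results(query, organic_hits, slots=(0, 3)):
      """Insert paid listings at fixed result positions for any purchased query terms."""
      paid = []
      for term in query.lower().split():
          for url in PAID_LISTINGS.get(term, ()):
              if url not in paid:
                  paid.append(url)
      results = list(organic_hits)
      for slot, url in zip(slots, paid):
          results.insert(slot, url)              # the paid hit "bubbles" to the top
      return results

  print(blend_results("cheap vacation flights",
                      ["http://airline-review.org", "http://budget-tips.net"]))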

Conclusion

The future of search is here. It has the following components.

First, searchers must recognize that the brutal costs associated with search and retrieval services that actually work are forcing those operating them to seek revenue sources. The revenue source that works at present is selling hits. Any list of hits from any search engine must be scrutinized to determine whether the hit list is objective.

Second, the technology for parsing queries and mapping query terms to "pay for placement" content is reliable and available. This technology will be used more and more to generate revenue or promote specific products and services. Whenever I look at results from a Dialog or Lexis-Nexis search, I look closely for preferential placement of those two very large firms' own content sources. So far, the search results seem objective, but the Web page displaying the results is more difficult for me to analyze. Will these services' results remain objective? Are the results of searches on the Financial Times, Factiva, or John Wiley sites completely objective, "untuned," and designed to answer the user's question without bias?

Finally, information professionals may want to reconsider the impact of the "pay for placement" trend. With more and more users looking only at the first page of "hits," it is possible that none of the results are objective reflections of the content the site has indexed. The likelihood that all of the hits are biased is difficult to discern when the information professional perceives "pay for placement" as a standard component of the search process.

In the days of paper-based research, the concept of the validity, accuracy, and reputation of the source was a cardinal precept. When online was in its salad days, the developer of the database and its editorial policy were useful indicators of the integrity of the data in the file. A search of ABI/INFORM, for instance, meant that the user would get abstracts and indexing of selected, high-quality articles from a carefully screened set of business and economics journals. Does today's Web search, or a query run on an online service from a commercial vendor, have a comparable pedigree?

The future of search is not visualization, linguistic processing, or automatic classification. The future of search is the responsibility of the user to understand where the results are coming from, how the query is processed and what content is automatically displayed regardless of its relevancy, and the mechanics of what is indexed, how often the index is refreshed, and what algorithms are used to determine what is displayed and how.

The future, then, is a commitment to getting back to basics: the sources, their authority, and the accuracy of the results.

List of Web sites mentioned in this essay:

MSN Search: search.msn.com
AOL Search: search.aol.com
Free: search.free.fr
FAST: www.fastsearch.com
Google: www.google.com
Inktomi: www.inktomi.com
Overture: www.overture.com
Teoma: www.teoma.com
Vivisimo: vivisimo.com
KartOO: www.kartoo.com
EasyAsk: www.easyask.com
Applied Semantics (formerly Oingo): www.appliedsemantics.com
Espotting: www.espotting.com
Dialog: www.dialog.com
Lexis-Nexis: www.lexisnexis.com/search
Financial Times: search.ft.com
Factiva: www.factiva.com
John Wiley: www.wiley.com


