Guest Editorial


The Guest Editor
Stephen E. Arnold is based at Arnold Information Technology, Kentucky, USA.

Search-and-frustrate: pseudofinding
information online 2000

The Electronic Library
Volume 18 - Number 1 - 2000 pp. 6-11
(c) MCB University Press ISSN 0264-0473

The numbers are fuzzy, of course. Year 2000 sits dead center in the midst of
Internet mania. Worldwide there are somewhere between 130 and 200 million
Internet users. The number of pages available as calendars clicked over to
01-01-2000 is in the neighborhood of one billion (one followed by nine zeros,
or 1,000 million).

With new users signing up at record rates, Year 2000 promises to launch
millions of people worldwide through an "Internet perception trajectory"
(see Figure 1).

What happens is that a person begins with little or no insight into the
Internet. Within a few weeks, that person knows that search engines do
indeed search. The problem is that the results may or may not be germane to
the word, concept, or question the user wants answered.

It is not simply a question of precision and recall. The more fundamental
problem is that the Internet contains dynamic information in many digital
alcoves. Finding the alcoves and then figuring out what is in them is a
difficult, tedious, and intellectually exhausting task.

The new user becomes increasingly demanding. In short, the user becomes a
more sophisticated customer in a very short span of time. Search engines and
other services available to help a person navigate the Internet do not
evolve as rapidly.

We have, one might say, a search engine crisis. Little wonder that
human-constructed directories like Yahoo! get more clicks than robot-built
indexes. Comprehensive, no. Understandable, yes.

It is, therefore, not surprising that users embrace e-mail enthusiastically.
Messaging traffic is skyrocketing; the firms tracking Internet behavior
suggest that personal e-mail is growing at a rate of 50 percent per year. New
users are signing up at a rate of about 21 percent per year.

Once Web newcomers have worked through the tricky parts of
pointing-and-clicking, they go through a period of surfing, following links
for the serendipity of the experience.

At some point along the trajectory, the user forms the idea that finding a
Web page on a particular topic should be a service a search engine happily
performs. Only a dullard continues clicking mindlessly in the hope of
stumbling upon a page that answers a particular question about the species
of spiders inhabiting Central America.

This third stage moves the user into more purposeful use of the Internet.
The Web has assumed some tool-like features. The user
wants to answer a specific question or meet a particular need at a specific
point in time. When the user moves from surfing to hunting and gathering, an
important transition is taking place. The user becomes keenly interested in
ways to make the process of meeting a specific need easier, faster, and more
enjoyable.

The fourth stage characterizes a user who can exploit Web resources in a
range of media for many different applications. The mature Web user does not
abandon any of the skills developed in the first three stages but combines
them, taking new Web experiences in his or her stride.

The final stage moves the Web user from a consumer of content to a creator
of content. It is true that many newcomers to the Internet will create a
home page as soon as the portal displays an icon for "Build Your Own Home
Page". But once the basics of Internet and Web interaction have been
internalized, users begin to use the Web as a rich communication medium.
Content becomes the focus of the mature Web users' actions (see Figure 2).
  

In practical terms, spiders have to dig through what is available on the
Web, determine what is new and index it, and locate changes to pages that
have already been indexed and then re-index them. These "deltas" must be
added to the index that the user searches. From a computational point of
view, most indexing companies lack the resources to undertake the technical
effort necessary to index more than one-fifth of the available pages. There
is also a disincentive: most users ask the same questions, about popular
entertainers, stock quotes, maps, and the like, so a large index is not
needed to handle most users' needs.
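
As a rough illustration of that "delta" bookkeeping, the Python sketch below
re-indexes a page only when its content hash differs from the hash recorded
on the previous pass. It is a minimal, hypothetical example, not any vendor's
actual code; the in-memory dictionaries and the simple fetch helper are
assumptions made for the sake of the sketch.

    import hashlib
    import urllib.request

    # Hypothetical in-memory state; a real spider persists this between crawls.
    page_hashes = {}   # url -> content hash recorded on the previous pass
    search_index = {}  # url -> text held in the index that users actually search

    def fetch_page(url):
        """Fetch a page and return its text, or None if the site is unavailable."""
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read().decode("utf-8", errors="replace")
        except OSError:
            return None

    def crawl(urls):
        """Index new pages and re-index changed ones; leave unchanged pages alone."""
        for url in urls:
            text = fetch_page(url)
            if text is None:
                continue                                  # unavailable: retry on a later pass
            digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
            if url not in page_hashes or page_hashes[url] != digest:
                search_index[url] = text                  # the "delta" goes into the index
            page_hashes[url] = digest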

As the Internet gets more popular and more pages are added and changed,
indexing Web sites presents some challenges:

(1) Network latency contributes to spider inefficiencies. If a site is not
available, spiders have to revisit it, which may limit the reach of the
indexing service. When a spider does not index a site, either the pointers to
it are missing from the index or they are not updated.

(2) Spider operators limit the depth or number of links a spider can follow.
As a result, large sites may never be indexed. If a spider follows three
levels of links and a site has more levels of links, the untraversed levels
are not indexed and, therefore, may not be available via a Web search.

(3) "Voting" engines like Google and Direct Hit, whose usefulness and user
acceptance are growing, build their indexes based on user behavior or on the
number of pages that point to another page. The most visited or most
linked-to sites are those included in the index. Google is extremely useful
despite the small number of pages it indexes. Indexing, therefore, does not
mean the same thing to each Internet indexing or directory service.
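
A toy Python sketch of the "voting" idea follows, assuming a link graph is
already in hand. Counting inbound links is only a crude stand-in for the far
richer signals Google and Direct Hit actually use, so treat this as an
illustration of the principle rather than either service's method.

    from collections import Counter

    def rank_by_votes(link_graph, index_size):
        """Rank pages by how many other pages point to them and keep only
        the top slice for the index -- the "most linked-to sites win" idea."""
        votes = Counter()
        for targets in link_graph.values():
            for target in set(targets):          # count each linking page once
                votes[target] += 1
        return [url for url, _ in votes.most_common(index_size)]

    # Hypothetical three-page Web: "c" collects two votes, "b" collects one.
    graph = {"a": ["b", "c"], "b": ["c"], "c": []}
    print(rank_by_votes(graph, 2))               # ['c', 'b']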

As the Internet moves to its next "evolutionary stage", several observations
are warranted (see Figure 3).



First, with a large user population, many different needs exist at any one
time. Services can no longer offer one-size-fits-all portals. Fragmentation
of user communities means that finding something becomes more and more
difficult. The fault is not a failing of indexers and search engines. The
flaw is inherent in the Internet. Vendors need to own up to these
difficulties in 2000, not hide behind meaningless statements that a
particular service indexes 220 million pages. With a billion pages out
there, which 20 percent are we talking about? What's left out? Who does
what? Specifics, not statistics, are needed.

Second, today's Web indexes serve the large English-language user base. As
non-English Web sites expand, different types of indexing strategies will be
necessary. The need for multilingual indexes will become increasingly acute.
(Alta Vista's translation service is a promising step toward language
neutrality.)

Third, spiders alone cannot do the type of indexing that makes sense to
users. Hence, human indexing is grafted onto the output of spiders or is used
to build directories that a user can apprehend and use. Having humans place
citations in an ontology makes sense despite the explosion of indexing
spiders (a minimal directory sketch follows these observations).

Fourth, fancy search-and-retrieval systems like natural language queries are
not yet among the most heavily used services. Fast, for example, uses an
interface that invites the entry of one or two words or a short phrase.

These observations reflect the size and information reach of the World Wide
Web. Figure 4 illustrates how content builds up in Web space. Servers are
taken down and content is removed but, overall, the amount of information
available to users continues to expand rapidly. The traditional rules of
records management are difficult to apply in a distributed Web space.
Similarly, the types of bibliographic and collection development strategies
that apply to paper-based archives are difficult to transfer to a dynamic
Web environment. Content, despite the rapid change in "Internet space", is
cumulative. The backfile of Internet content follows "old" archiving rules
and is rapidly generating "new" rules (see Figure 4).
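
Returning to the third observation, here is a minimal Python sketch of what
"placing citations in an ontology" can look like in practice. The two-level
category tree and the example URLs are invented; real directories are far
larger and edited by hand, so this shows only the shape of the data
structure.

    # Invented two-level ontology; editors file spidered citations under
    # categories so users can browse by topic instead of guessing keywords.
    ontology = {
        "Science": {"Biology": ["http://example.org/spiders-of-central-america"]},
        "Business": {"Markets": ["http://example.com/stock-quotes"]},
    }

    def file_citation(category, subcategory, url):
        """Place a citation under a category/subcategory pair."""
        ontology.setdefault(category, {}).setdefault(subcategory, []).append(url)

    def browse(category, subcategory):
        """Return the citations filed under a category/subcategory pair."""
        return ontology.get(category, {}).get(subcategory, [])

    file_citation("Science", "Astronomy", "http://example.org/star-atlas")
    print(browse("Science", "Astronomy"))        # ['http://example.org/star-atlas']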



The small size of spider-built indexes, the persistence of handcrafted
directories, and hybrids like Microsoft Search that combine spiders, voting
technology, and directories all suggest that there is no single solution to
the search problem. There are some powerful online search-and-retrieval
systems available from such companies as Manning & Napier Information
Services, Claritech Corporation, Inference, ConQuest, and dozens of others.
Users, however, seem to prefer a simpler way to locate meaningful sites. Has
technology failed? Have the Internet and World Wide Web become consumerized
to such a degree that they encourage the "lowest common denominator" type of
search-and-retrieval system?

Have we regressed within the last year and a half or have we progressed with
regard to search-and-retrieval? Consider the illustration. The world of
online has become considerably more complex than it was in the 1980s. One of
its chief features is the wide range of information "types" that are
available online. Text, database, audio, video, and image files are mixed
and matched in remarkably fresh ways. At times, it seems as if everyone
under the age of 25 with some computer expertise is working feverishly to
extend the reach, scope, functionality, and richness of the content on the
Web (see Figure 5).



An increase in the perceived or real velocity of information-centric actions
promises to make online search-and-retrieval more important in the months
and years ahead.

The phases of online search and retrieval:

1975-1985: Searching was confined to proprietary online systems with
command-line interfaces. Databases were rigidly structured.

1986-1991: Online access expanded to include indexes of files stored on
optical discs within organizations' networks. Commercial online services
expanded. Bulletin board systems and underground online services exploded
from zero to more than 25,000 by 1991.

1992-1995: Online went through a series of transitions. These included
commercial services chasing mythical "end" users with minimal computer savvy,
a general aimlessness in reversing declining revenues, and new services from
groupware and from companies like Dow Jones.

1996-2000: The Internet/Web explosion. Networks become Internet-centric,
leading to an explosion of innovation, content, and services.

2000-2003: The reengineering revolution, when online becomes synonymous with
everyday business functions. Search-and-retrieval will be ubiquitous
functions.

Time intervals are becoming shorter between each "phase" of online search
and retrieval.

Content: expanding in concentric rings

Search 2000's sputtering engines

Let's look at what the Web search tools do well before trading the old
buggies in for a new model:

(1) Search engines are largely free. In the early days of online, nothing
was free. One paid dearly for access to indexes. Free search services are a
net benefit.

(2) The major search engines cover most of the major sites in the USA,
Canada, and Europe. Secondary or non-English search engines provide useful
windows on other countries.

(3) A metasearch provides a very valuable way to extract the unique "hits"
offered by individual engines. Hence, searching multiple engines is a good
strategy (a simple merge of results is sketched after this list).

(4) The major search engines are constantly improving their search and
retrieval services. Some, like Microsoft, do this by blending multiple
technologies; others, like Infoseek, do it by combining spiders and human
indexes where it makes sense to offer the user a particular type of tool.
Even when these very expensive-to-operate engines are running, they are not
perfect. Most search engines are more like the 12-cylinder powerplants in
Formula One cars than the reliable four-bangers in a Toyota or Ford
econobox.
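
To make point (3) concrete, here is a small Python sketch of a metasearch
merge. The engine names and URLs are made up, and the URL normalization is
deliberately crude; real metasearch services do considerably more work to
recognize duplicate hits.

    from urllib.parse import urlsplit

    def merge_hits(results_by_engine):
        """Merge result lists from several engines, keeping the first occurrence
        of each page so each engine's unique "hits" are preserved."""
        seen = set()
        merged = []
        for engine, hits in results_by_engine.items():
            for url in hits:
                parts = urlsplit(url.lower())
                key = (parts.netloc, parts.path.rstrip("/"))   # crude duplicate test
                if key not in seen:
                    seen.add(key)
                    merged.append((url, engine))
        return merged

    # Made-up engines and result lists: the duplicate "spiders" page is kept once.
    results = {
        "engine_a": ["http://example.com/spiders", "http://example.com/maps"],
        "engine_b": ["http://example.com/spiders/", "http://example.org/atlas"],
    }
    print(merge_hits(results))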


The adjustments that are necessary include:

(1) Better handling of changed pages for sites that are among the most
visited by users of a particular directory. This discipline is called
"change monitoring" in the jargon of the spider wizards. Indexing changes
are increasingly important, particularly as sites allow users to create and
post content.

(2) More effective indexing of large sites. With spiders set to follow links
to a specified "depth", some large sites are not indexed thoroughly. As more
free Web page services become available, it is important to capture those
pages. Many of the most interesting sites are buried deep within Xoom or
Tripod and can be reached only by a link on another referring page.

(3) Specialized "clusters" of sites - for instance, Web Ring - should be
indexed in some manner. As communities form and dissolve, their Web sites
and postings often remain. Although some sites are content light, many are
extremely useful. Comments about companies, products, and events often shed
a different light on these topics.

(4) Handling of non-English Web pages must improve. The major indexing sites
must index the most heavily used or most linked-to sites in other
countries. Handling these sites and addressing the language and display
problems will be difficult. The rapid diffusion of Internet technology
creates an environment where content-rich sites will certainly proliferate.
A person looking for information should be able to tap these generally
unknown and unavailable resources.

(5) Addressing the user's need for "smarter searches". Technologies that
provide insight into user behavior are needed for search and retrieval, not
just for electronic commerce. Most online search services are "dumb": the
user enters a string, and the service makes an effort to provide some "hits"
rapidly. Intelligent search-and-retrieval is computationally expensive, but
it is needed. In closing, it is clear that search-and-retrieval in Web
spaces is an emerging discipline. No one has the answer.
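
One way to picture the "smarter search" idea in point (5) is a Python sketch
that re-ranks raw hits using a record of which pages users actually chose
for a query. The query, URLs, and click log below are invented, and real
behavior-based ranking is of course far more sophisticated.

    from collections import defaultdict

    # Invented click log: how often users picked each page for a given query.
    click_counts = defaultdict(lambda: defaultdict(int))

    def record_click(query, url):
        """Store one piece of user behavior: the page chosen for a query."""
        click_counts[query][url] += 1

    def rerank(query, hits):
        """Move pages users actually chose for this query to the top of the
        raw hit list -- a crude form of behavior-aware ranking."""
        return sorted(hits, key=lambda url: click_counts[query][url], reverse=True)

    record_click("central american spiders", "http://example.org/arachnids")
    print(rerank("central american spiders",
                 ["http://example.com/spiderman", "http://example.org/arachnids"]))
    # ['http://example.org/arachnids', 'http://example.com/spiderman']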

In the next 12 months, firms unable to develop new business models to
generate revenue from online search and retrieval will be under increasing
competitive pressure. The good news is that those using the Internet to look
for useful information will enjoy a surfeit of choice. The bad news is that
no single search-and-retrieval or Web indexing service will provide
comprehensive coverage. The landscape of search-and-retrieval will be
littered with the carcasses of business models whose owners learned that
Darwin's concept of "survival" applies in cyberspace: only the "spiders"
able to adapt will thrive in the ecosystem of the Internet.

Stephen E. Arnold
Guest Editor
Arnold Information Technology,
Kentucky, USA
sa@arnoldit.com

