Taxonomy: Search’s Hula-Hoop®
February 8, 2008
I received several thoughtful comments on my Beyond Search Web log from well-known search and content processing experts (not the search engine optimization type or the MBA analyst species). These comments addressed the topic of taxonomies. One senior manager at a leading search and content processing firm referenced David Weinberger’s quite good book, Everything is Miscellaneous. My copy has gone missing, so join me in ordering a new one from Amazon. Taxonomy and taxonomies have attained fad status in behind-the-firewall search and content processing. Every vendor has to support taxonomies. Every licensee wants to “have” a taxonomy.
This is a screen shot of the Oracle Pressroom. Notice that a “taxonomy” is used to present information by category. The center panel presents hot links by topics with the number of documents shown for each category. The outside column features a tag cloud.
A “taxonomy” is a classification of things. Let me narrow my focus to behind-the-firewall content processing. In an organization, a taxonomy provides a conceptual framework that can be used to organize the organization’s information. Synonyms for taxonomy include classification, categorization, ontology, typing, and grouping. Each of these terms can be used with broader or narrower meanings, but for my purpose, we will assume each can be used interchangeably. In my experience, most vendors and consultants toss these terms around as interchangeable Lego blocks anyway.
A fad, as you know, is an interest that is followed for some period of time with intense enthusiasm. Think Elvis, bell bottoms, and speaking Starbucks’ coffee language.
A Small Acorn
A few years ago, a consultant approached me to write about indexing content inside an organization. This individual had embarked on a consulting career and needed information for her Web site. I dipped into my files, collected some useful information about the challenges corporate jargon presented, and added some definitions of search-related terms.
I did work for hire, so my client could reuse the information to suit specific needs. Imagine my pleasant surprise when I found my information recycled multiple times and used to justify a custom taxonomy for an enterprise. I was pleased to have become a catalyst for a boom in taxonomy seminars, newsgroups, and consulting businesses. One remarkable irony was that a person who had recycled the information I sold to consultant A thousands of miles away turned up as consultant B at a company in which I was an investor. I sat in a meeting and heard my own information delivered back to me as a way to orient me about classifying an organization’s information.
Big Oak
A taxonomy revolution had taken place, and I was only partially aware. A new industry had taken root, flowered, and spread like kudzu around me.
The interest in taxonomies continues to grow. After completing the descriptions of companies offering what I call rich content processing, I can report that organizations looking for taxonomy-centric systems have many choices. Of the 24 companies profiled in the Beyond Search study, all 24 “do” taxonomies. Obviously there are greater and lesser degrees of stringency. One company has a system that supports American National Standards Institute guidelines for controlled terms and taxonomies. Other companies “discover” categories on the fly. Between these two extremes there are numerous variations. One conclusion I drew after this exhausting analysis is that it is difficult to locate a system that can’t “do” taxonomies.
What’s Behind the Fad?
Let me consider briefly a question that I don’t tackle in Beyond Search: “Why the white-hot interest in taxonomies?”
Taxonomies have a long and distinguished history in library science, philosophy, and epistemology. For those of you who are a bit rusty, “epistemology” is the theory of knowledge. Taxonomies require a grasp, no matter how weak, on knowledge. No matter how clever, a person creating a taxonomy must figure out how to organize email, proposals, legal documents, and the other effluvia of organizational existence.
I think people have enough experience with key word search to realize its strengths and limitations. Key words — either controlled terms or free text — work wonderfully when I know what’s in an electronic collection, and I know the jargon or “secret words” to use to get the information I need.
Boolean logic (implicit or explicit) is not too useful when one is trying to find information in a typical corpus today. There’s no editorial policy at work. Anything the indexing subsystem is fed is tossed into an inverted index. This is the “miscellaneous” in David Weinberger’s book.
A taxonomy becomes a way to index content so the user can browse a series of headings and subheadings. Those headings and subheadings make it possible to see the forest, not the trees. Clever systems can take the category tags and marry them to a graphical interface. With hyperlinks, it is possible to follow one’s nose — what some vendors call exploratory search or search by serendipity.
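To make the idea concrete, here is a minimal sketch in Python. The categories and documents are invented, and no particular vendor’s method is implied; it only shows how category tags assigned at indexing time can drive the point-and-click, headings-with-counts view described above.

```python
# Minimal sketch: category tags drive a browsable headings-and-counts view.
# All categories and documents are invented for illustration.
from collections import defaultdict

# Hypothetical taxonomy: heading -> sub-headings
taxonomy = {
    "Finance": ["Accounting", "Budgeting"],
    "Marketing": ["Market research", "Advertising"],
}

# Hypothetical processed documents, each carrying the tags an indexer assigned
documents = [
    {"title": "FY08 budget memo", "tags": ["Finance", "Budgeting"]},
    {"title": "Focus group results", "tags": ["Marketing", "Market research"]},
    {"title": "Ad spend proposal", "tags": ["Marketing", "Advertising"]},
]

# Invert: category -> document titles, so a user can point and click
by_category = defaultdict(list)
for doc in documents:
    for tag in doc["tags"]:
        by_category[tag].append(doc["title"])

# Render the "forest, not the trees" view: headings, sub-headings, counts
for heading, subheadings in taxonomy.items():
    print(f"{heading} ({len(by_category[heading])})")
    for sub in subheadings:
        print(f"  {sub} ({len(by_category[sub])})")
```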
Taxonomy Benefits
A taxonomy, when properly implemented, yields several payoffs:
First, users like to point-and-click to discover information without having to craft a query. Believe me, most busy people in an organization don’t like trying to outfox the search box.
Second, the categories — even when hidden behind a naked search box interface — are intuitively obvious to a user. An accountant may (as I have seen) enter the term finance and then point-and-click through results. When I ask users if they know specific taxonomy terms, I hear, “What’s a taxonomy?” Intuitive search techniques should be a part of behind-the-firewall search and content processing systems.
Third, management is willing to invest in fine-tuning a taxonomy. Unlike a controlled vocabulary, a suggestion to add categories meets with surprisingly little resistance. I think the intuitive usefulness of cataloging and categorizing is obvious to the people who tell other people to do their searching for them.
Some Pitfalls
There are some pitfalls in the taxonomy game. The standard warnings are: “Don’t expect miracles when you categorize modest volumes of content” and “Be prepared for some meetings that are more like a graduate class in logic than a working session on how to deliver what the marketing department needs in a search system.” And so on.
On the whole, the investment in a system that automatically indexes is a wise one. It becomes ever wiser when the system can use knowledge bases, word lists, taxonomies, and other information inputs to index more accurately.
Keep in mind that “smart” systems can be right most of the time and then without warning run into a ditch. At some point, you will have to hunker down and do the hard thinking that a useful taxonomy requires. If you are not sure how to proceed, try to get your hands on the taxonomies that once were available from Convera. Oracle at one time offered vertical term lists. You can also Google for taxonomies. A little work will return some useful examples.
To wrap up, I am delighted that so many individuals and organizations have an interest in taxonomies — whether a fad or something epistemologically more satisfying. The content processing industry is maturing. If you want to see a taxonomy in action, check out:
HMV, powered by Dieselpoint
Oracle’s Pressroom, powered by Siderean Software’s system
US government portal powered by Vivisimo (Microsoft)
Stephen Arnold, February 8, 2008
Simple Math = Big Challenge: MSFT & YHOO
February 4, 2008
I have only a few sections of Beyond Search to wrap up. Instead of thinking about updating my description of Access Innovations’ MAIstro, I am distracted by jibber jabber about the Microsoft (NSDQ:MSFT) Yahoo (NSDQ:YHOO) tie up.
Where We Are
First, it’s an offer, isn’t it? Maybe a trial balloon? No cash and stock have changed hands as I write this in the wee hours of Monday, February 4, 2008. Yet, many are in a frenzy over a hostile takeover. Think about this word “hostile.” It means antagonistic, unfriendly, enemy. The reason for the bold move? Google, a company that has outfoxed Microserfs and Yahooligans for almost a decade.
The number of articles in my various alerts, RSS feeds, and emails is remarkable. Worldwide, a Microsoft – Yahoo marriage (even if it is helped along with a shotgun) ignites folks’ imagination. Neither Microsoft nor Yahoo will be able to recruit tech wizards, one pundit asserts. Innovation in Silicon Valley will be forever changed, posits another. Sigh.
Sorry. I’m not that excited. I’m interested, but I’m too old, too pragmatic, and too familiar with the vagaries of acquisitions to jump up and down.
Judging from some grousing from Yahooligans, some Yahoo professionals aren’t too keen about working for Microsoft. I have had a hint that some Microsoft wizards aren’t too excited about fiddling with Yahoo’s mind-numbing array of products, services, technologies, search systems, partnerships, and research initiatives.
I think the root concern is trying to figure out how to fit two large operations together, a 1 + 1 = 3 problem. For example, there’s Yahoo Mail and Hotmail Live; Yahoo Panama and Microsoft Ad Center; and Yahoo News and Microsoft’s news services, etc., etc. One little-considered consequence is that Microsoft may end up owning more search systems than any other company. That’s a technology can of worms worthy of a separate essay.
I will tell you who is excited, and, please, keep in mind that this is my opinion. And, once I express my view, I want to offer another very simple (probably too simple for an MBA wizard) math problem. I will end this essay with my now familiar observations. Let’s begin.
Who Benefits?
This is an easy question to answer, and you will probably think that I am stating the obvious. Bear with me because the answer explains why some at Microsoft may not be able to get the right prescription for their deal bifocals. Without the right eye glasses, it’s tough to discern some smaller environmental factors obscured in the billion dollar fusillade fired at Yahoo’s board of directors’ meeting.
- Shareholders who can make some money with the Microsoft offer. When there’s money to be made, concerns about technology, culture, and market opportunity are going to finish last. Most shareholders don’t think much beyond the answers to two questions: “How much did I make?” and “What are the tax implications?”
- Investment bankers who earn money three ways on a deal of this magnitude. There are, of course, other ways for those in the financial loop to make money, but I’m going to focus on the ones that keep these professionals in blue suits, not orange jump suits. [a] Commissions. Where there is churn, there is a commission. For many investment advisors, buying and selling equals a bigger payday. [b] Bonuses. The mechanics of an investment banker’s bonus are complex. After all, it is a banker dealing with a fellow banker. Mere mortals should steer clear. The idea is simple. Generate churn or a fee, and you get more bonus money. The first three months of a calendar year are bonus and job hopping time on Wall Street. Anyone who can get a piece of the action for a big deal gets cash. [c] Involvement in a big deal acts like a huge electromagnet for more deals. Once Microsoft “thought” of the acquisition, significant positive input about the upside of the deal pours into the potential acquirer.
- Consultants. Once a big deal is announced, the consultants leap into action. The buyer needs analyses, advice, and strategic counsel. The buyer’s minions need tactical advice to answer such questions as “How can we maximize our tax benefits?” and “How can we pay for this with cheap money?” The buyer becomes hungry for advisors of every species. Blue-chip outfits like Bain, Booz Allen & Hamilton, Boston Consulting Group, and McKinsey & Co. drool in eagerness to provide guidance on lofty strategy matters such as answering the questions “How can I maximize my pay-out?” and “What are the tax consequences of my windfall profit?” Tactical advisors from these firms can provide support on human resource issues and real estate leases, among other matters. In short, buyers throw money at “the problem” in order to be prepared to negotiate or find a better deal.
These three constituencies want the deal to go through. If Microsoft is the buyer, that’s fine. If another outfit with cash shows, that’s okay too. The deal now has a life of its own. Money talks. To get the money, these constituencies have no desire to help Microsoft “see” some of the gaps and canyons that must be traversed. Let’s turn to one practical matter and the aforementioned simple math. Testosterone and money — these are two ways to cloud perception and jazz logic.
More Simple Math
Let’s do a thought experiment, what some German philosophers call Gedankenexperiment. I am not talking about the proposed Microsoft – Yahoo deal, gentle attorneys.
Accordingly, we have two companies, Company Alpha and Company Beta; hereinafter, Company A(lpha) and Company B(eta). Neither is a real company, and neither should be construed as having any similarity to any company now in existence.
Company Alpha has a dominant position in a market and wants to gain a larger share of a newer, tangential market. Company A has a proven, well-tuned, aging business model. That business model is a variation on selling subscriptions and generating annuity income from renewals. Company A’s business model works this way. Company A offers a product and then, on a periodic basis, Company A makes a change to an existing product, assessing a fee for customers to get the “new” or “enhanced” version of the product (service).
The idea is that once a subscription base is in place, Company A can predict a certain amount of revenue from standing orders and new orders. Company A has an excellent, stable cash flow based on this well-crafted business model and periodic fee increases. Although there are environmental factors that put pressure on the proven business model, the customer base is large, and the business model continues to work in Company A’s traditional markets. Company A, aware of exogenous factors — for instance, the emergence of cloud computing and other non-subscription business models — has learned through trial and error that its subscription-based business model does not work in certain new markets. These new markets are potentially lucrative, representing “new” revenue and a threat to Company A’s existing revenue stream. Company A wants to acquire a company to increase its chances for success in the new and emerging markets. Company A’s goal is to [a] protect its existing revenue, [b] generate new revenue, and [c] prevent other companies from dominating the new market(s).
Company A has performed a rational market analysis. Company A’s management has determined that one company only — our Company B — represents a mechanism for achieving Company A’s goals. Company A, by definition, has performed its analyses through Company A’s “eye glasses”; that is, Company A’s proven business model and business culture. “Walking in another person’s moccasins” is easy to say and difficult, if not impossible, to do. Everyone views the world through his own experiential frame. Hence, Company A “sees” Company B as having characteristics, attributes, and capabilities that are, despite some acceptable risks, significant benefits to Company A. Having made this decision about the upside from buying Company B, the management of Company A becomes less able to accept alternative inputs, facts, information, perceptions, and opinions. Company A’s reasoning in its decision space is closed. Company A vivifies what William James called “a certain blindness.” The idea is that each person is “blind” in some way to reality that others can perceive.
The implications of “a certain blindness” in this hypothetical acquisition warrant further discussion:
Culture
Company A has a culture built around a business model that allows incremental product enhancements so that subscription revenue is generated. Company B has a business model built around acquisitions. Company A has a more or less homogeneous atmosphere engendered by the business model or what Company A calls the agenda. Company B is more like a loose federation of separate companies — what some MBAs might call a Ling Temco Vought framework. Each entity within Company B retains its own identity, enjoys wide scope of action, and preserves its own culture. “We do our own thing” characterizes these units of Company B. Company A, therefore, has several options to consider:
- Company A can leave Company B as it is. The plus is that not much will change Company B’s operations in the short term. The downside is that the technical problems will not be resolved.
- Company A can impose its culture on Company B. You don’t need me to tell you that this will go over like the former Soviet Union’s intervention in Poland in the late 1950s.
- Company A can try to make changes gradually. (This is a variation of the option in bullet 2 and will simply postpone rebellion.)
Technology
Company A has a different and relatively homogeneous technology base. Company B has a heterogeneous technology base. Maintaining multiple heterogeneous systems is, in general, more costly than maintaining a homogeneous one. Upon inspection, the technical staff needed to maintain these different systems has specialized to deal with particular technical problems in the heterogeneous environment. Technical people can learn new skills, but this takes time and adds cost. Company A has to find a way to streamline technical operations, reduce costs, and not waste time achieving rationalization. There are at least two ways to do this:
- Shift to a single platform, ideally Company A’s
- Retrain existing staff to have broader technical skills. With Company B’s staff able to perform more generalized work, Company A can reduce headcount at Company B, thus streamlining work processes and reducing cost.
Competitive Arena
The desirable new market for Company A has taken on the characteristics of what I call a “natural monopoly.” When I reflect on notable events in American business history, I note monopolistic behavior. Some monopolies were spawned by force of will; for example, JP Morgan and finance (this guy bailed out the US Treasury) and Andrew Carnegie and steel (this fellow thought of libraries for little people after pistol-whipping his competitors and antagonists).
Other monopolies — like Bell Telephone and your local electric company — came into being because some functions are more appropriately delivered by one organization. Water and Internet search / advertising, for instance, are subject to such economies of scale, quality of service, and standardization. In short, these may be “natural monopolies” due to numerous demand and cost forces.
In our hypothetical example, Company A wants to enter a market which is coalescing and which, based on my research, now appears to be forming into a “natural monopoly.” The nameless competitor at the center of that market seems to be following a trajectory similar to that of the original Bell Telephone – AT&T life cycle.
Company A’s race, then, is against time and money. Untoward delay at any point going forward with regard to leveraging Company B means coming in second, maybe a distant second or losing out on the new market.
Instead of owning Park Place (a desirable property in the Parker Brothers’ game Monopoly), Company A ends up with Baltic and Mediterranean Avenues (really lousy properties in the Parker Brothers’ game). If Company A doesn’t get Company B, Company A is trapped in its old, deteriorating business model.
If Company A does acquire Company B, Company A has to challenge the competitor. Company B already has a five-year track record of being a day late and a dollar short. Company A, therefore, has to do everything in its power to make the Company B deal work, which appears to be an all-or-nothing proposition.
Now the math: Action by Company A = unknown, variable, escalating costs.
I told you math geeks would not like this analysis. Company A is betting the farm against long odds. Here’s why:
First, the cultures are not amenable to staff reductions or technological efficiencies; that is, using software and automation, not people, while increasing revenues. Company A, regardless of the money invested, cannot be certain of success. Company B’s culture – business model duality is investment insensitive. In short, money won’t close this gap. Company A’s resistance to cannibalizing its old, though still functioning, business model will be significant. Company A’s own employees will resist watching their money and jobs sacrificed to a greater good.
Second, the competitive space is now being captured by the increasingly monopolistic competitor. Unchallenged for some period of time, the monopolistic competitor enjoys momentum and a significant lead in refining its own business model.
In the lingo of Wall Street, Company A can’t get enough “oxygen”; that is, revenue, despite its best efforts to rein in the market leader.
Observations
If we assume a kernel of truth in my hypothetical analysis, we can now apply this hypothetical discussion to the Microsoft – Yahoo deal.
First, Microsoft’s business model (not its technology) is the company’s strength. The business model is also its Achilles’ heel. Just as IBM’s mainframe-centric view of the world made its executives blind to Microsoft, now Microsoft can’t perceive today’s world from outside the Microsoft business model. The Microsoft business model is perhaps the most efficient subscription-based revenue generator in history. But that business model has not worked in the new markets Microsoft covets, so the Yahoo deal becomes the “obvious” play to Microsoft’s management. Its obviousness makes it difficult for Microsoft to see other options.
Second, the Microsoft business model is woven into the company’s culture. Cultures are ethnocentric. Ethnocentricity often manifests itself in conflict. Microsoft will have to make prescient, correct cultural decisions quickly and repeatedly. Microsoft’s culture, however, does not typically evidence excellent, rapid-fire decision-making.
Microsoft seems to be putting the company in a situation guaranteed to spark conflict within its own walls, between itself and Yahoo, and between Microsoft and Google. This is a three-front war. Even those with little exposure to military history can see that the costs and risks of a three-front conflict will be high, open-ended, and difficult to estimate.
The hostile bid itself suggests that Microsoft could not catch Google without Yahoo. The notion that Microsoft can catch Google with the acquisition requires tremendous confidence in Microsoft’s management. I think Microsoft can make the deal work, but execution must be flawless and favorable winds must push Microsoft along.
If Google continues to race forward, Microsoft has to spend more money to implement efficiencies more quickly. The calculus of catching a moving target can trigger a cost crisis. If costs go up too quickly, Microsoft must fall back on its proven business model. Taking a step backward when resolving the calculus of catching Google is not a net positive.
As you read this essay, you are wondering, “How can this doom and gloom be real?” The buzz about the deal is mostly positive. If you don’t believe me, call your broker and ask him how much your mutual fund will benefit from the MSFT – YHOO tie up.
I’ve spent some time around money types, and I can tell you making money is akin to blood in the water for sharks.
I’ve also been acquired and done the acquiring. Regardless of being the buyer or being the bought, tie ups are tricky. The larger the stakes, the trickier the tie ups become. When the tie up is designed to halt the Google juggernaut, the calculus of time – cost is hard.
Please recall that I’m not saying a Microsoft – Yahoo tie up cannot stop Google. I am saying that making the tie up work will be difficult.
Don’t agree? That’s okay. Use the comments to set me straight. I’m willing to listen and learn. Just don’t overlook my core points; namely, business models, cultures, and technologies. One final thought: don’t factor out the Google (NSDQ:GOOG).
Stephen Arnold, February 4, 2008
Lotsa Search at Yahoo!
February 3, 2008
Microsoft’s hostile takeover of Yahoo! did not surprise me. Rumors about Micro – hoo or Ya – soft have floated around for a couple of years. I want to steer clear of the newsy part of this takeover, ignore the share-pumping behind the idea that Mr. Murdoch will step in to buy Yahoo, and sidestep Yahoo’s 11th hour “we’re not sure we want to sell” Web log posting.
I prefer to do what might be called a “catalog of search engines,” a meaningless exercise roughly equivalent to Homer’s listing of ships in The Iliad. Scholars are still arguing about why he included the information and, centuries later, continue to figure out who these guys were and why such an odd collection of vessels was necessary. You may have a similar question about Yahoo’s search fleet after you peruse this short list of Yahoo “findability” systems:
- InQuira. This is the Yahoo natural language customer support system. InQuira was formed from three smaller search outfits that ran aground. InQuira seems stable, and it provides NLP systems for customer support functions. Try it. Navigate to Yahoo. Click Help and ask a question, for example, “How do I cancel my premium mail account?” Good luck, but you have an opportunity to work with an “intelligent” agent who won’t tell you how to cancel a for-fee Yahoo service. When I learned of this deal, I asked, “Why don’t you just use Inktomi’s engine for this?” I didn’t get an answer. I don’t feel too bad. Google treats me the same way.
- Inktomi. Yahoo bought this Internet indexing company in 2002. We used the Inktomi system for the original US government search service, FirstGov.gov (now USA.gov). The system worked reasonably well, but once in the Yahooligans’ hands, not much was done with it, and Inktomi was showing its age. In 2002, Google was motoring, just drawing even with Yahoo. Yahoo seemed indifferent or unaware that search had more potential than Yahoo’s portal approach.
- Stata Labs. When Gmail entered semi-permanent beta, it offered two key features. First, there was one gigabyte of storage and, second, you could search your mail. Yahoo couldn’t search email at all. The fix was to buy Stata Labs in 2004. When you use the Yahoo mail search function, the Stata system does the work. Again I asked, “Why not use one of your Yahoo search systems to search mail?” Again, no response.
- Fast Search & Transfer. Yahoo, through the acquisition of Overture, ended up with the AllTheWeb.com Web site. The spidering and search technology are operated by Fast Search & Transfer (the same outfit that Microsoft bought for $1.2 billion in January 2008). Yahoo trumpeted the “see results as you type feature” in 2007, maybe 2006. The idea was that as you key your query, the system shows you results matching what you have typed. I find this function distracting, but you may love it. Try it yourself here. I heard that Yahoo has outsourced some data center functions to Fast Search & Transfer, which, if true, contradicts some of the pundits who assert that Yahoo has its data center infrastructure well in hand. If so, why lean on Fast Search & Transfer?
- Overture. When Yahoo acquired Overture (the original pay-for-traffic service) in 2003, it got the ad service and the Overture search engine. Overture purchased AllTheWeb.com and ad technology from Fast Search & Transfer. When Yahoo bought Overture, Yahoo inherited Overture’s Sun Microsystems’ servers with some Linux boxes running a home brew fraud detection service, the original Overture search system, and the AllTheWeb.com site. Yahoo still uses the Overture search system when you look for key words to buy. You can try it here. (Note: Google was “inspired” by the Overture system, and paid about $1.2 billion to Yahoo to avoid a messy lawsuit about its “inspiration” prior to the Google IPO in 2004. Yahoo seemed happy with the money and did little to impede Google.)
- Delicious. Yahoo bought Delicious in 2005. Delicious came with its weird url and search engine. If you have tried it, you know that it can return results with some latency. When it does respond quickly, I find it difficult to locate Web sites that I have seen. As far as I know, the Delicious system still uses the original Delicious search engine. You can try it here.
- Flickr. Yahoo bought Flickr in 2005, another cog in its social, Web 2.0 thing. The Flickr search engine runs on MySQL. At one trade show, I heard that the Flickr infrastructure and its search system were a “problem”. Scaling was tough. Based on the sketchy information I have about Yahoo’s search strategy, Flickr search is essentially the same as it was when it was purchased and is in need of refurbishing.
- Mindset. Yahoo, like Google and Microsoft, has a research and development group. You can read about their work on the recently redesigned Web site here. If you want to try Mindset, navigate to Yahoo Research and slide the controls. I’ve run some tests, and I think that Mindset is better than the “regular” Yahoo search, but it seems unchanged over the last six or seven months.
I’m going to stop my listing of Yahoo’s search systems, although I could continue with the Personals search, Groups search, News search, and more. I may comment on AltaVista.com, another oar in Yahoo’s search vessel, but that’s a topic that requires more space than I have in this essay. And I won’t beat up on Yahoo Shopping search. If I were a Yahoo merchant, I would be hopping mad. I can’t figure out how to limit my query to just Yahoo merchants. The results pages are duplicative and no longer useful to me. Yahoo has 500 million “users” but Web statistics are mushy. Yahoo must be doing something right as it continues to drift with the breeze as a variant of America Online.
In my research for my studies and journal articles, I don’t recall coming across a discussion of Yahoo’s many different search systems. No one, it seems, has noticed that Yahoo lacks an integrated, coherent approach to search. I know I’m not the only person who has observed that Yahoo cannot mount a significant challenge to Google.
As Google’s most capable competitor, Yahoo stayed out of the race. But it baffles me that a sophisticated, hip, with-it Silicon Valley outfit like Yahoo collected different search systems the way my grandmother collected weird dwarf figurines. Like Yahoo, my grandmother never did much with her collection. I may have to conclude that Yahoo hasn’t done much with its collection of search systems either. The cost of licensing, maintaining, and upgrading a fleet of search systems is not trivial. What baffles me is why on earth couldn’t Yahoo index its own email? Why couldn’t Yahoo use one of its own search systems to index Delicious bookmarks and Flickr photos? Why does Yahoo have a historical track record of operating search systems in silos, thus making it difficult to rationalize costs and simplify technical problems?
Compared to Yahoo, Google has its destroyer ship shape — if you call squishy purple pillows, dinosaur bones, and a keen desire to hire every math geek with an IQ of 165 on the planet “ship shape”. But Yahoo is still looking for the wharf. As Google churned past Yahoo, Yahoo watched Google sail without headwinds to the horizon. Over the years, I’ve been in chit-chats with some Yahoo wizards. Let me share my impressions without using the wizards’ names:
- Yahoo believes that its generalized approach is correct even as Google made search the killer app of cloud computing. Yahoo’s very smart people seem to live in a different dimension
- Yahoo believes that its technology is superior to Google’s and Microsoft’s. When I asked about a Google innovation, Yahoo’s senior technologist told me that Yahoo had “surprises for Google.” I think the surprise was the hostile take over bid last week
- Yahoo sees its future in social, Web 2.0 services. To prove this, Yahoo hired economists and other social scientists. While Yahoo was recruiting, the company muffed the Facebook deal and let Yahoo 360 run aground. Yo, Yahoo, Google is inherently social. PageRank is based on human clicks and human-created Web pages. Google’s been social since Day One.
To bring this listing of Yahoo search triremes (ancient wooden war ships) to a close, I am not sure Microsoft, if it is able to acquire Yahoo, can integrate the fleet of search systems. I don’t think Mr. Murdoch can given the MySpace glitches. Fixing the flotilla of systems at Yahoo will be expensive and time consuming. The catch is that time is running out. Yahoo appears to me to be operating on pre-Internet time. Without major changes, Yahoo will be remembered for its many search systems, leaving pundits and academics to wonder where they came from and why. Maybe these investigators will use Google to find the answer? I know I would.
Stephen Arnold, February 3, 2008
Search Frustration: 1980 and 2008
February 2, 2008
I have received two telephone calls and several emails about user satisfaction with search. The people reaching out to me did not disagree that users are often frustrated with search systems. I think the contacts were really amplifications of how complex “getting search right” is.
Instead of falling back on bell curves, standard deviations, and more exotic ways to think about populations, let’s go back in time. I want to then jump back to the present, offer some general observations, and then conclude with several of my opinions expressed as “observations”. I don’t mind push back. My purpose is to set forth facts as I understand them and stimulate discussion.
I’m quite a fan of Thucydides. If you have dipped into his sometimes stream-of-consciousness approach to history, you know that after a few hundred pages the hapless protagonists and antagonists just keep repeating their mistakes. Finally, after decades of running around the hamster wheel, resolution is achieved by exhaustion.
My hope is that with regard to search we arrive at a solution without slumping into torpor.
The Past: 1980
A database named ABI / INFORM (pronounced as three separate letters ay-bee-eye followed by the word inform) was a great online success. Its salad days are gone, but for one brief shining moment, it was white hot.
The idea for ABI (abstracted business information) originated at a university business school, maybe Wisconsin but I can’t recall. It was purchased by my friend Dennis Auld and his partner Greg Payne. There was another fellow involved early on, but I can’t dredge his name up this morning.
The database summarized and indexed journals containing information about business and management. Human SMEs (subject matter experts) read each article and wrote a 125-word synopsis. The SMEs paid particular attention to making the abstract meaty; that is, a person could read the abstract and get the gist of the argument and garner the two or three key “facts” in the source article. (Systems today perform automatic summarization, so the SMEs are out of a job.)
ABI / INFORM was designed to allow a busy person to ingest the contents of a particular journal like the Harvard Business Review quickly, or to collect some abstracts on a topic such as ESOPs (Employee Stock Ownership Plans) and learn quickly what was in the “literature” (a fancy word for current management thinking and research on a subject).
Our SMEs would write their abstracts on special forms that looked a lot like a 5″ by 8″ note card (about the amount of text on a single IBM mainframe green screen input form). SMEs would also enter the name of the author or authors, the title of the article, the source journal, and the standard bibliographic data taught in the 7th grade.
SMEs would also consult a printed list of controlled terms. A sample of a controlled term list appears below. Today, these controlled term lists are often called knowledge bases. For anyone my age, a list of words is pretty much a list of words. Flashy terminology doesn’t always make points easier to understand, which will be a theme of this essay.
Early in the production cycle, the index and abstract for each article would be typed twice: once by an SME on a typewriter and again by a data entry operator into a dumb terminal. This type of information manufacturing reflected the crude, expensive systems available a quarter century ago. Once the data had been keyed into a computer system, it was in digital form, proofed, and sent via eight-track tape to a timesharing company. We generated revenue by distributing the ABI / INFORM records via Dialog Information Services, SDC Orbit, BRS, and other systems. (Perhaps I will go into more detail about these early online “players” in another post.) Our customers used the timesharing service to “search” ABI / INFORM. We split the money with the timesharing company and generally ended up with the short end of the stick.
Below is an example of the ABI / INFORM controlled vocabulary:
There were about 15,000 terms in the vocabulary. If you look closely, you will see that some terms are marked “rt” and “uf”. These are “related terms” and “use for” terms. The idea was that a person assigning index terms would be able to select a general term like “market shares” and see that the related terms “competition” and “market erosion” would provide pertinent information. The “uf” or “use for” reminded the indexer that “share of market” was not the preferred index term. Our vocabulary could also be used by a customer or user, whom we called a “searcher” in 1980.
A person searching for information in the ABI / INFORM file (database) of business abstracts could use these terms to locate precisely the information desired. You may have heard the terms precision and recall used by search engine and content processing vendors. The idea originated with the need to allow users (then called searchers) to narrow results; that is, make them more precise. There was also a need to allow a user (searcher) to get more results if the first result set contained too few hits or did not have the information the user (searcher) wanted.
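Here is a minimal sketch, using invented entries, of how the “uf” and “rt” relationships in a controlled vocabulary serve both precision (mapping whatever the searcher typed to the preferred term) and recall (suggesting related terms when the first result set is too small). It illustrates the idea, not the actual ABI / INFORM system.

```python
# Minimal sketch of "use for" (uf) and "related term" (rt) entries.
# The vocabulary below is invented; it mimics the structure described above.
vocabulary = {
    "Market shares": {
        "uf": ["Share of market"],                 # non-preferred variants
        "rt": ["Competition", "Market erosion"],   # terms that broaden a search
    },
}

# Map every non-preferred variant back to its preferred index term
use_for = {}
for preferred, entry in vocabulary.items():
    for variant in entry["uf"]:
        use_for[variant.lower()] = preferred

def normalize(term):
    """Return the preferred index term for whatever the searcher typed."""
    return use_for.get(term.lower(), term)

def broaden(term):
    """Suggest related terms when the first result set has too few hits."""
    preferred = normalize(term)
    return vocabulary.get(preferred, {}).get("rt", [])

print(normalize("Share of market"))   # -> Market shares
print(broaden("Share of market"))     # -> ['Competition', 'Market erosion']
```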
To address this problem, we created classification codes and assigned these to the ABI / INFORM records as well. As a point of fact, ABI / INFORM was one of the first, if not the first, commercial database to reindex every record in its database, manually assigning six to eight index terms and classification codes as part of a quality assurance project.
When we undertook this time-consuming and expensive job, we had to use SMEs. The business terminology proved to be so slippery that our primitive automatic indexing and search-and-replace programs introduced too many indexing red herrings. My early experience with machine-indexing and my having to turn financial cartwheels to pay for the manual rework has made me suspicious of vendors pushing automated systems, especially for business content. Business content indexing remains challenging, eclipsed only by processing email and Web log entries. Scientific, technical, and medical content is tricky but quite a bit less complicated than general business content. (Again, that’s a subject for another Web log posting.)
Our solution to broadening a query was to make it possible for the SME indexing business abstracts to use a numerical code to indicate a general area of business; for example, marketing, and then use specific values to indicate a slightly narrower sub-category. The idea was that the controlled vocabulary was precise and narrow and the classification codes were broader and sub-divided into useful sub-categories. A snippet of the ABI / INFORM classification codes appears below:
If you look at these entries for the classification code 7000 Marketing, you will see terms such as “sn”. That’s a scope note, and it tells the indexer and the user (searcher) specific information about the code. You also see the “cd”. That means “code description”. A “code description” provides specific guidance on when and how to use the classification code, in this case “7000 Marketing”.
Notice too that the code “7100 Market research” is a sub-category of 7000 Marketing. The idea is that while 7000 Marketing is broad and appropriate for general articles about marketing, the sub-category allows the indexer or user to identify articles about “Market research.” While “Market research” is broad, it is ideally in a middle ground between the very broad classification code 7000 Marketing and the very specific terminology of the controlled vocabulary. We also had controlled term lists for geography or what today is called “geo spatial coding”, document type codes, and other specialized index categories. These are important facets of the overall indexing scheme, but not germane to the point I want to make about user satisfaction with search and content processing systems.
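The classification codes work the other way: they give the searcher a broad-to-narrow handle on the collection. Here is a minimal sketch with invented records; the prefix match at the end is the same idea behind an old Dialog-style statement such as ss cc=71?.

```python
# Minimal sketch of hierarchical classification codes (invented records).
classification = {
    "7000": {"label": "Marketing", "sn": "General articles about marketing"},
    "7100": {"label": "Market research", "sn": "Surveys, focus groups, forecasting"},
}

records = [
    {"title": "Repositioning a soft drink brand", "cc": ["7000"]},
    {"title": "Designing a customer survey", "cc": ["7100"]},
]

def by_code_prefix(prefix):
    """Return titles whose classification codes start with the given prefix."""
    return [r["title"] for r in records
            if any(code.startswith(prefix) for code in r["cc"])]

print(by_code_prefix("71"))   # narrow: market research articles only
print(by_code_prefix("7"))    # broad: everything under 7000 Marketing
```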
Let’s step back. Humans created abstracts of journal articles. Humans then completed bibliographic entries for each selected article. Then an SME would index the abstracts, selecting the terms that, in their judgment and according to the editorial policy inherent in the controlled term lists, best described the article. These index terms became the building blocks of locating a specific article among hundreds of thousands or identifying a subset of all possible articles in ABI / INFORM directly on point to the topic on which the user wanted information.
The ABI / INFORM controlled vocabulary was used at commercial organizations to index internal documents or what we would today call “behind-the-firewall content.” One customer was IBM. Another was the Royal Bank of Canada. The need for a controlled vocabulary such as ABI / INFORM’s is rooted in the nature of business terminology. When business people speak, jargon creeps into almost every message. On top of that, new terms are coined for old concepts. For example, you don’t participate in a buzz group today. You participate in a focus group. Now you know why I am such a critic of the baloney used by search and content processing vendors. Making up words (neologisms) or misappropriating a word with a specific meaning (semantic, for example) and then gluing that word with another word with a reasonably clear meaning (processing, for example) creates the jargon semantic processing. Now I ask you, “Who knows what the heck that means?” I don’t, and that’s the core problem of business information. The language is slippery, fast moving, jargon-riddled, and fuzzy.
Appreciate that creating the ABI / INFORM controlled vocabulary, capturing the editorial policy in those lists, and then applying them consistently to what was then the world’s largest index to business and management thought was a big job. Everyone working on the project was exhausted after two years of researching, analyzing, and discussing. What made me particularly proud of the entire Courier-Journal team (organized by the time we finished into a separate database unit called Data Courier) was that library and information science courses used ABI / INFORM as a reference document. At Catholic University in Washington, DC, the entire vocabulary was used as a text book for an advanced information science class. Even today, ABI / INFORM’s controlled vocabulary stands as an example of:
- The complexity of creating useful, meaningful knowledge bases
- Proof that it is possible to index content so that it can be sliced and diced with few “false drops” or what we call today an “irrelevant hit”.
- Proof that a difficult domain such as business can be organized and made more accessible via good indexing.
Now here’s the kicker, actually a knife in the heart to me and the entire ABI / INFORM team. We did user satisfaction surveys on our customers before the reindexing job and then after the reindexing job. But our users (searchers) did not use the controlled terms. Users (searchers) keyed one or two terms, hit the Enter key, and used what the system spit out.
Before the work, two-thirds of the people we polled who were known users of ABI / INFORM said our indexing was unsatisfactory. After the work, two-thirds of the people we polled who were known users of ABI / INFORM said our indexing was unsatisfactory. In short, bad indexing sucked. And better indexing sucked. User behavior was responsible for the dissatisfaction, and even today, who dares tell a user (searcher) that he / she can’t search worth a darn?
I’ve been thinking about these two benchmark studies performed by the Courier-Journal every so often for 28 years. Here’s what I have concluded:
- Inherent in the search and retrieval business is frustration with finding the information a particular user needs. This is neither a flaw in the human nor a flaw in the indexing. Users come to a database looking for information. Most of the time — two thirds to be exact — the experience disappoints.
- Investing person years of effort in constructing an almost-perfect epistemological construct in the form of controlled vocabularies is a great intellectual exercise. It just doesn’t pay huge dividends. Users (searchers) flounder around and get “good enough” information which results in the general dissatisfaction with search.
- As long as humans are involved, it is unlikely that the satisfaction scores will improve dramatically. Users (searchers) don’t want to work hard to formulate queries or don’t know how to formulate queries that deliver what’s needed. Humans aren’t going to change at least in my lifetime or what’s left of it.
What’s this mean?
Simply stated, algorithmic processes and the use of sophisticated mathematical procedures will deliver better results.
The Present: 2008
In my new study Beyond Search, I have not included much history. The reason is that today most procurement teams looking to improve an existing search system or replace one system with another want to know what’s available and what works.
The vendors of search and content processing systems have mastered the basics of key word indexing. Many have integrated entity extraction and classification functions into their content processing engines. Some have developed processes that look at documents, paragraphs, sentences, and phrases for clues to the meaning of a document.
Armed with these metatags (what I call index terms), the vendors can display the content in point-and-click interfaces. A query returns a result list, and the system also displays Use For references or what vendors call facets, hooks, or adjacent terms. The naked “search box” is surrounded with “rich interfaces”.
You know what?
Survey the users and you will find two-thirds of the users dissatisfied with the system to some degree. Users overestimate their ability and expertise in finding information. Many managers are too lazy to dig into results to find the most germane information. Search has become a “good enough” process for most users.
Rigorous search is still practiced by specialists like pharmaceutical company researchers and lawyers paid to turn over every stone in hopes of getting the client off the legal hook. But for most online users in commercial organizations, search is not practiced with diligence and thoroughness.
In May 2007, I mentioned in a talk at an iBreakfast seminar that Google had an invention called “I’m feeling doubly lucky.” The idea is that Google can look at a user’s profile (compiled automatically by the Googleplex), monitor the user’s location and movement via a geo spatial function in the user’s mobile device, and automatically formulate a query to retrieve information that may be needed by the user. So, if the user is known to be a business traveler and the geo spatial data plot his course toward La Guardia Airport, then the Google system will push to the user’s phone information about which parking lot is available and whether the user’s flight is late. The key point is that the user doesn’t have to do anything but go on about his / her life. This is “I’m feeling doubly lucky” because it raises the convenience level of the “I’m feeling lucky” button on Google pages today. Press I’m feeling lucky and the system shows you the one best hit as defined by Google’s algorithmic factory. Some details of this invention appear in my September 2007 study, Google Version 2.0.
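To show the shape of the idea, and only the shape, here is a toy sketch in which the system rather than the user formulates the query from a stored profile and a current location. This is my speculative reading of the invention, not Google’s code; every name and value is invented.

```python
# Speculative illustration of implicit, profile-driven query formulation.
# The profile fields, rules, and query strings are all invented.
def doubly_lucky_query(profile, location):
    """Build a query from what the system already knows about the user."""
    if profile.get("role") == "business traveler" and "airport" in location.lower():
        flight = profile.get("next_flight", "")
        return f"{location} parking availability {flight} departure status"
    return f"news near {location}"

profile = {"role": "business traveler", "next_flight": "DL 1422"}
print(doubly_lucky_query(profile, "La Guardia Airport"))
# The single best hit for this query would then be pushed to the user's phone.
```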
I’m convinced that automatic, implicit searching is the direction that search must go. Bear in mind that I really believe in controlled vocabularies, carefully crafted queries, and comprehensive review of results lists. But I’m a realist. Systems have to do most of the work for a user. When users have to do the searches themselves or at least most of the work, their level of dissatisfaction will remain high. The dissatisfaction is not with the controlled vocabulary, the indexing, or the particular search system. The dissatisfaction is with the work associated with finding and using the information. I think that most users are happy with the first page or first two or three results. These are good enough or at least assuage the user’s conscience sufficiently to make a decision.
The future, therefore, is going to be dominated by systems that automate, analyze, and predict what the mythical “average” user wants. These results will then be automatically refined based on what the system knows about a particular user’s wants and needs. The user profile becomes the “narrowing” function for a necessarily broad set of results.
Systems can automatically “push” information to users or at least keep it in a cache ready for near-zero latency delivery. In an enterprise, search must be hooked into work flow. The searches must be run for the user and the results displayed to the user. If not fully automatic, the user need only click a hot link and the needed information is displayed. A user can override an automatic system, but I’m not sure most users would do it or care if the override were like a knob on a hotel’s air conditioner. You feel better turning the knob. You feel out of control if you can’t turn the knob.
Observations
Let me offer several observations after this journey back in time and a look at the future of search and content processing. If you are easily upset, grab your antacid, because here we go:
- The razzle-dazzle about taxonomies, ontologies, and company-specific controlled term lists hides the fact that specific terms have to be identified and used to automatically index documents and information objects found in behind-the-firewall search systems. Today, these terms can be generated by processing a representative sample of existing documents produced by the organization; a minimal sketch of this kind of term-list generation appears after this list. The key is a good-enough term list, not doing what was done 25 years ago. Keep in mind the phrase “good enough.” There are companies that offer software systems that can make this list generation easier. You can read about some vendors in Beyond Search, or you can do a search on Google, Live.com, or Yahoo.
- Users will never be satisfied. So before you dump your existing search system because of user dissatisfaction, you may want to get some other ammunition, preferably cost and uptime data. “Opinion” data are almost useless because no system will test better than another in my experience.
- Don’t believe the business jargon thrown at you by vendors. Inherent in business itself is a tendency to create a foggy understanding. I think the tendency to throw baloney has been around since the first caveman offered to trade a super-sharp flint for a tasty banana. The flint is not sharp; it’s like a Gillette four-track razor. The banana is not just good; it is mouth-watering, by implication a great banana. You have to invest time, effort, energy, and money in figuring out which search or content processing system is appropriate for your organization. This means head-to-head bake-offs. Few do this, and the results are clear. Most people are unhappy with their vendor, with search, and with the “information problem”.
- Background processes, agent-based automatic searching, and mechanisms that watch what your information needs and actions are will make search better. You could enter ss cc=71? AND ud=9999 to get recent material about market research, but most people don’t and won’t.
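As promised in the first bullet, here is a minimal sketch of “good enough” term-list generation: count frequent words and word pairs in a small, invented sample of an organization’s documents. Commercial systems add entity extraction, statistics, and linguistics; the basic move, though, is this one.

```python
# Minimal sketch: derive candidate index terms from a sample of documents.
# The sample text and stopword list are invented for illustration.
import re
from collections import Counter

sample_docs = [
    "Quarterly market research shows our market share eroding in Europe.",
    "The market research team will run two focus groups next quarter.",
]

stopwords = {"the", "our", "in", "will", "two", "next", "shows"}

counts = Counter()
for doc in sample_docs:
    words = re.findall(r"[a-z]+", doc.lower())
    # count single words and adjacent word pairs as candidate terms
    counts.update(w for w in words if w not in stopwords)
    counts.update(f"{a} {b}" for a, b in zip(words, words[1:])
                  if a not in stopwords and b not in stopwords)

# the most frequent candidates become the starting point for a term list
print(counts.most_common(5))
```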
In closing, keep these observations in mind when trying to figure out what vendors are really squabbling about. I’m not sure they themselves know. When you listen to a sales pitch, are the vendors saying the same thing? The answer is, “Yes.” You have to rise to the occasion and figure out the differences between systems. I guarantee you the vendors don’t know, and if they do know, they sure won’t tell you.
Stephen Arnold, February 2, 2008
Transformation: An Emerging “Hot” Niche
January 25, 2008
Transformation is a five-dollar word that means changing a file from one format to another. The trade magazines and professional writers often use data integration or normalization to refer to what amounts to taking a Word 97 document with a Dot DOC extension and turning it into a structured document in XML. These big words and phrases refer to a significant gotcha in behind-the-firewall search, content processing, and plain old moving information from one system to another.
Here’s a simple example of the headaches associated with what should be a seamless, invisible process after a half century of computing. The story:
You buy a new computer. Maybe a Windows laptop or a new Mac. You load a copy of Office 2007, write a proposal, save the file, and attach it to an email that says, “I’ve drafted the proposal we have to submit tomorrow before 10 am.” You send the email and go out with some friends.
In the midst of a friendly discussion about the merits of US democratic presidential contenders, your mobile rings. You hear your boss saying over the background noise, “You sent me a file I can’t open. I need the file. Where are you? In a bar? Do you have your computer so you can resend the file? No? Just get it done now!” Click here to read what ITWorld has to say on this subject. There is also some user vitriol over the Word-to-Word compatibility hassle itself here. A work around from Tech Addict is here.
Another scenario is to have a powerful new content processing system that churns through, according to the vendor’s technical specification, “more than 200 common file types.” You set up the content processing gizmo, aim it at the marketing department’s server, and click “Index.” You go home. When you arrive the next morning at 8 am, you find that the 60,000 documents in the folders you wanted indexed have become an index of 30,000 documents. Where are the other 30,000 documents? After a bit of fiddling, you discover the exception log and find that half of the documents you wanted indexed were not processed. You look up the error code and learn that it means, “File type not supported.”
The culprit is the inability of one system to recognize and process a file. The reasons for the exceptions are many and often subtle. Let’s troubleshoot the first problem, the boss’s inability to open a Word 2007 file sent as an attachment to an email.
The problem is that the recipient is using an older version of Word. The sender saved the file in the most recent Word’s version of XML. You can recognize these files by their extension Dot DOCX. What the sender should have done is save the proposal as [a] a Dot DOC file in an older “flavor” of Word’s DOC format; [b] a file in the now long-in-the-tooth RTF (rich text format); or [c] a file in Dot TXT (ASCII) format. The fix is for the sender to resend the file in a format the recipient can view. But that one file can cost a person credibility points or the company a contract.
The second scenario is more complicated. The marketing department’s server had a combination of Word files, Adobe Portable Document Format files with Dot PDF extensions, some Adobe InDesign files, some Quark Express files, some Framemaker files, and some database files produced on a system no one knows much about except that the files came from a system no longer used by marketing. A bit of manual exploration revealed that the Adobe PDF files were password protected, so the content processing system rejected them. The content processing system lacked import filters to open the proprietary page layout and publishing program files, so it rejected them. The mysterious files from the disused system were data dumps from an IBM CICS system. The content processing system opened them, found them unreadable, and logged them as exceptions as well.
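Here is a minimal sketch of why an indexing run can quietly halve a collection. The file names, the list of supported types, and the rejection reasons are all invented; the point is only that anything the import filters cannot open lands in the exception log instead of the index.

```python
# Minimal sketch: unsupported or locked files become exceptions, not index entries.
SUPPORTED = {".doc", ".txt", ".rtf", ".xml"}   # invented filter list

files = [
    ("proposal.docx", False),       # newer format this filter set cannot read
    ("pricing.pdf", True),          # password protected
    ("datasheet.txt", False),
    ("legacy_dump.ebcdic", False),  # old mainframe export
]

indexed, exceptions = [], []
for name, locked in files:
    ext = "." + name.rsplit(".", 1)[-1]
    if locked:
        exceptions.append((name, "password protected"))
    elif ext not in SUPPORTED:
        exceptions.append((name, "file type not supported"))
    else:
        indexed.append(name)

print(f"indexed {len(indexed)} of {len(files)} files")
for name, reason in exceptions:
    print(f"exception: {name}: {reason}")
```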
Now the nettles, painful nettles:
First, fixing the problem with any one file is disruptive but usually doable. The reputation damage done may or may not be repaired. At the very least, the sender’s evening was ruined; the high-powered vice president was with a gaggle of upper crust types arguing about an election’s impact on trust funds. To “fix” the problem, she had to redo her work. It was time consuming and annoying to leave her friends. The recipient — a senior VP — had to jiggle his plans in order to meet the 10 am deadline. Instead of chilling with The Simpsons, he had to dive into the proposal and shape the numbers under the pressure of the looming deadline.
We can now appreciate the 30,000 file problem. It is a very big problem. There’s probably no way to get the passwords to open some of the PDFs, so the PDFs’ content may remain unreadable. The weird publishing formats have to be opened in the applications that created them and then exported in a file format the content processing system understands. This is a tricky problem, maybe another Web log posting. An alternative is to print out hard copies of the files, scan them, use optical character recognition software to create ASCII versions, and then feed the ASCII versions of the files to the content processing system. (Note: some vendors make paper-to-ASCII systems to handle this type of problem.) Those IBM CICS files can be recovered, but an outside vendor may be needed if the system that produced the files is no longer available in house. When the costs are added up, these 30,000 files can represent hundreds of hours of tedious work. Figure $60 per hour and a week’s work if everything goes smoothly, and you can estimate the minimum budget “hit.” No one knows the final cost because transformation is dicey. Cost naivety is the reason my blood pressure spikes when a vendor asserts, “Our system will index all the information in your organization.” That’s baloney. You don’t know what will or won’t be indexed unless you perform a thorough inventory of files and their types and then run tests on a sample of each document type. That just doesn’t happen very often in my experience.
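To make the budget arithmetic concrete, here is a back-of-the-envelope sketch using the $60 per hour figure above. The one-minute-per-file estimate is my own assumption, purely illustrative.

```python
# Back-of-the-envelope math for the 30,000-file problem, using the $60/hour rate above.
# The per-file handling time is an assumption for illustration; real jobs vary wildly.
RATE = 60                     # dollars per hour, from the example in the post
minimum = 40 * RATE           # "a week's work if everything goes smoothly"
per_file_minutes = 1          # assumed average touch time per rejected file
likely = 30_000 * per_file_minutes / 60 * RATE
print(f"Best case: ${minimum:,}; at one minute per file: ${likely:,.0f}")
# Best case: $2,400; at one minute per file: $30,000 -- before anything goes wrong.
```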
Now you know what transformation is. It is a formal process of converting lead into content gold.
One Google wizard — whose name I will withhold so Google’s legions of super-attorneys don’t flock to rural Kentucky to get the sheriff to lock me up — estimated that up to 30 percent of information technology budgets is consumed by transformation. So for a certain chicken company’s $17 million IT budget, the transformation bill could be in the $5 to $6 million range. That translates to selling a heck of a lot of fried chicken. Let’s assume the wizard is wrong by a factor of two. This means that $2 to $3 million is gnawed by transformation.
As organizations generate and absorb more digital information, what happens to transformation costs? The costs will go up. Whether the Google wizard is right or wrong, transformation is an issue that needs experienced hands minding the store.
The trigger for these two examples is a news item that the former president of Fast Search & Transfer, Ali Riaz, has started a new search company. Its USP (unique selling proposition) is data integration plus search and content processing. You can read Information Week’s take on this new company here.
In Beyond Search, I discuss a number of companies and their ability to transform and integrate data. If you haven’t experienced the thrill of a transformation job, a data integration project, or a structured data normalization task — you will. Transformation is going to be a hot niche for the next few years.
Understanding of what can be done with existing digital information is, in general, wide and shallow. Transformation demands narrow and deep understanding of a number of esoteric and almost insanely diabolical issues. Let me identify three from my own personal experience, learned at the street academy called Our Lady of Saint Transformation.
First, each publishing system has its own peculiarities about files produced by different versions of itself. InDesign 1.0 and 2.0 cannot open the most recent version’s files. There’s a workaround, but unless you are “into” InDesign, you have to climb this learning curve, and fast. I’m not picking on Adobe. The same intra-program incompatibilities plague Quark, PageMaker, the moribund Ventura, Framemaker, and some high-end professional publishing systems.
Second, data files spit out by mainframe systems can be fun for a 20-something. There are some interesting data formats still in daily use. EBCDIC, or Extended Binary-Coded Decimal Interchange Code, is something some readers can learn to love. It is either that or figuring out how to fire up an IBM mainframe, reinstall the application (good luck on that one, 20-somethings), restore the data from DASD or flat-file backup tapes (another fun task for a recent computer science grad), and then output something the zippy new search or content processing system can convert in a meaningful way. (Note: “meaningful way” is important because when a filter gets confused, it produces some interesting metadata. Some glitches can require you to reindex the content if your index restore won’t work.)
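For the adventurous 20-something, here is a minimal Python sketch that decodes a simple EBCDIC dump into readable text. The file name and the 80-byte record width are assumptions; anything with packed-decimal or binary fields needs a real copybook-driven conversion tool, not a snippet like this.

```python
# Minimal sketch: decode an EBCDIC (code page 037) mainframe dump into Unicode text.
# Assumes simple fixed-width character records; packed-decimal and binary fields
# need a copybook-aware tool, not a one-liner like this.
RECORD_LENGTH = 80  # assumed record width for illustration

with open("cics_dump.dat", "rb") as f:   # hypothetical file name
    raw = f.read()

records = [
    raw[i:i + RECORD_LENGTH].decode("cp037").rstrip()
    for i in range(0, len(raw), RECORD_LENGTH)
]
print("\n".join(records))
```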
Third, the Adobe PDFs with their two layers of security can be especially interesting. If you have one level of password, you can open the file and maybe print it and copy some content from it. Or not. If not, you print the PDFs (if printing has not been disabled) and go through the OCR-to-ASCII drill. In my opinion, PDFs are like a digital albatross. These birds hang around one’s neck. Your colleagues want to “search” for the PDFs’ content in their behind-the-firewall system. When I ask the marketing department to produce the needed passwords, I often hear something discomforting. So it is no surprise to learn that some system users are not too happy.
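A pre-flight triage pass can at least tell you which PDFs will bounce before the indexing run starts. Here is a minimal sketch using the third-party pypdf package; the folder name is hypothetical, and some protection schemes behave differently, so treat this as illustrative rather than definitive.

```python
# Pre-flight check: flag password-protected PDFs before they land in the exception log.
# Uses the third-party pypdf package (pip install pypdf); illustrative only.
from pathlib import Path
from pypdf import PdfReader

def triage_pdfs(folder: str) -> None:
    for path in Path(folder).glob("*.pdf"):
        try:
            reader = PdfReader(str(path))
            status = "locked" if reader.is_encrypted else "ok"
        except Exception as exc:            # damaged or mislabeled file
            status = f"unreadable ({exc})"
        print(f"{path.name}: {status}")

triage_pdfs("marketing_server")  # hypothetical folder name
```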
You may find this post disheartening.
No!
This post is chock full of really good news. It makes clear that companies in the business of transformation are going to have more customers in 2008 and 2009. It’s good news for off-shore conversion shops. Companies that have potent transformation tools are going to have a growing list of prospects. Young college grads get more chances to learn the mainframe’s idiosyncrasies.
The only negative in this rosy scenario is for the individual who:
- Fails to audit the file types and the amount of content in those file types (a minimal audit sketch appears after this list)
- Skips determining which content must be transformed before the new system is activated
- Ignores the budget implications of transformation
- Assumes that 200 or 300 filters will do the job
- Does not understand the implications behind a vendor’s statement along these lines: “Our engineers can create a custom filter for you if you don’t have time to do that scripting yourself.”
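Here is a minimal sketch of the audit the first bullet calls for: walk a file share, tally extensions, and flag anything outside an assumed list of supported filters. The share path and the filter list are placeholders, not any vendor’s actual capability list.

```python
# Minimal file-type audit: walk a share, count files by extension, and size the
# transformation job before the indexing system runs. Paths and the filter list
# are assumptions for illustration.
from collections import Counter
from pathlib import Path

SUPPORTED = {".doc", ".docx", ".txt", ".rtf", ".pdf", ".htm", ".html"}  # assumed filter list

def audit(root: str) -> None:
    counts = Counter(p.suffix.lower() for p in Path(root).rglob("*") if p.is_file())
    for ext, n in counts.most_common():
        flag = "" if ext in SUPPORTED else "  <-- needs transformation"
        print(f"{ext or '(no extension)'}: {n}{flag}")

audit("//marketing-server/share")  # hypothetical UNC path
```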
One final point: those 200 or more file types. Vendors talk about them with gusto. Check to see whether the vendor is licensing filters from a third party. In certain situations, the included file type filters don’t support some of the more recent applications’ file formats. Other vendors “roll their own” filters. But filters can vary in efficacy because different people write them at different times with different capabilities. Try as they might, vendors can’t squash some of the filter nits and bugs. When you do some investigating, you may be able to substantiate my data, which suggest filters work on about two thirds of the files you feed into the search or content processing system. Your investigation may prove my data incorrect. No problem. When you are processing 250,000 documents, the exception file becomes chunky even at the system’s two to three percent rejection rate; that is 5,000 to 7,500 files. A thirty percent rate, roughly 75,000 files, can be a show stopper.
Stephen E. Arnold, January 25, 2008
Two Visions of the Future from the U.K.
January 17, 2008
Two different news items offered insights about the future of online. My focus is the limitations of key word search. I downloaded both articles, I must admit, eager to see whether my research would be disproved or augmented.
Whitebread
The first report appeared on January 14, 2008, in the (London) Times online in a news story “White Bread for Young Minds, Says University Professor.” In the intervening 72 hours, numerous comments appeared. The catch phrase is the coinage of Tara Brabazon, professor of Media Studies at the University of Brighton. She allegedly prohibits her students from using Google for research. The metaphor memorably captures a statement attributed to her in the Times’s article: “Google is filling, but it does not necessarily offer nutritional content.”
The argument strikes a chord with me because [a] I am a dinosaur, preferring warm thoughts about “the way it was” as the snow of time accretes on my shoulders; [b] schools are perceived to be in decline because it seems that some young people don’t read, ignore newspapers except for the sporty pictures that enliven gray pages of newsprint, and can’t do mathematics reliably at take-away shops; and [c] I respond to the charm of a “sky is falling” argument.
Ms. Brabazon’s argument is solid. Libraries seem to be morphing into Starbuck’s with more free media on offer. Google–the icon of “I’m feeling lucky” research–allows almost anyone to locate information on a topic regardless of its obscurity or commonness. I find myself flipping my dinosaurian tail out of the way to get the telephone number of the local tire shop, check the weather instead of looking out my window, and converting worthless dollars into high-value pounds. Why remember? Google or Live.com or Yahoo are there to do the heavy lifting for me.
Educators are in the business of transmitting certain skills to students. When digital technology seeps into the process, the hegemony begins to erode, so the argument goes. Ms. Brabazon joins Neil Postman (Amusing Ourselves to Death: Public Discourse in the Age of Show Business, 1985) and more recently Andrew Keen (The Cult of the Amateur, 2007), among others, in documenting the emergence of what I call the “inattention economy.”
I don’t like the loss of what weird expertise I possessed that allowed me to get good grades the old-fashioned way, but it’s reality. The notion that Google is more than an online service is interesting. I have argued in my two Google studies that Google is indeed much more than a Web search system growing fat on advertisers’ money. My research reveals little about Google as a corrosive effect on a teacher’s ability to get students to do their work using a range of research tools. Who wouldn’t use an online service to locate a journal article or book? I remember how comfortable my little study nook was in the rat hole in which I lived as a student, then slogging through the Illinois winter, dealing with the Easter egg hunt in the library stuffed with physical books that were never shelved in sequence, and manually taking notes or feeding 10-cent coins into a foul-smelling photocopy machine that rarely produced a readable copy. Give me my laptop and a high-speed Internet connection. I’m a dinosaur, and I don’t want to go back to my research roots. I am confident that the professor who shaped my research style–Professor William Gillis, may he rest in peace–neither knew nor cared how I gathered my information, performed my analyses, and assembled the blather that whizzed me through university and graduate school.
If a dinosaur can figure out a better way, Tefloned along by Google, a savvy teen will too. Draw your own conclusions about the “whitebread” argument, but it does reinforce my research that suggests a powerful “pull” exists for search systems that work better, faster, and more intelligently than those today. Where there’s a market pull, there’s change. So, the notion of going back to the days of taking class notes on wax in wooden frames and wandering with a professor under the lemon trees is charming but irrelevant.
The Researcher of the Future
The British Library is a highly regarded, venerable institution. Some of its managers have great confidence that their perception of online in general and Google in particular is informed, substantiated by facts, and well considered. The Library’s Web site offers a summary of a new study called (and I’m not sure of the bibliographic niceties for this title): A Ciber [sic] Briefing Paper. Information Behaviour of the Researcher of the Future, 11 January 2008. My system’s spelling checker is flashing madly over the spelling of cyber as ciber, but I’m certainly not intellectually as sharp as the erudite folks at the British Library, living in rural Kentucky and working by the light of burning coal. You can download this 1.67-megabyte, 35-page document, Researcher of the Future.
The British Library’s Web site article identifies the key point of the study as “research-behaviour traits that are commonly associated with younger users — impatience in search and navigation, and zero tolerance for any delay in satisfying their information needs — are now becoming the norm for all age-groups, from younger pupils and undergraduates through to professors.” The British Library has learned that online is changing research habits. (As I noted in the first section of this essay, an old dinosaur like me figured out that doing research online is faster, easier, and cheaper than playing “Find the Info” in my university’s library.)
My reading of this weirdly formatted document, which looks as if it was a PowerPoint presentation converted to a handout, identified several other important points. Let me share my reading of this unusual study’s findings with you:
- The study was a “virtual longitudinal study”. My take on this is that the researchers did the type of work identified as questionable in the “whitebread” argument summarized in the first section of this essay. If the British Library does “Googley research”, I posit that Ms. Brabazon and other defenders of the “right way” to do research have lost their battle. Score: 1 for Google-Live.com-Yahoo. Nil for Ms. Brabazon and the British Library.
- Libraries will be affected by the shift to online, virtualization, pervasive computing, and other impedimentia of the modern world for affluent people. Score 1 for Google-Live.com-Yahoo. Nil for Mr. Brabazon, nil for the British Library, nil for traditional libraries. I bet librarians reading this study will be really surprised to hear that traditional libraries have been affected by the online revolution.
- The Google generation is supposed to consist of “expert searchers”. The reader learns instead that most people are lousy searchers. Companies developing new search systems are working overtime to create smarter search systems because most online users–forget about age, gentle reader–are really terrible searchers and researchers. The “fix” is computational intelligence in the search systems, not in the users. Score 1 more for Google-Live.com-Yahoo and any other search vendor. Nil for the British Library, nil for traditional education. Give Ms. Brabazon a bonus point because she reached her conclusion without spending money for the CIBER researchers to “validate” the change in learning behavior.
- The future is “a unified Web culture,” more digital content, eBooks, and the Semantic Web. The word unified stopped my ageing synapses. My research yielded data that suggest the emergence of monopolies in certain functions and increasing fragmentation of information and markets. Unified is not a word I can apply to the online landscape. In my Bear Stearns report published in 2007 as Google’s Semantic Web: The Radical Change Coming to Search and the Profound Implications to Yahoo & Microsoft, I revealed that Google wants to become the Semantic Web.
Wrap Up
I look forward to heated debate about Google’s role in “whitebreading” youth. (Sounds similar to waterboarding, doesn’t it?) I also hunger for more reports from CIBER, the British Library, and folks a heck of a lot smarter than I am. Nevertheless, my Beyond Search study will assert the following:
- Search has to get smarter. Most users aren’t progressing as rapidly as young information retrieval experts.
- The traditional ways of doing research, meeting people, even conversing are being altered as information flows course through thought and action.
- The future is going to be different from what big thinkers posit.
Traditional libraries will be buffeted by bits and bytes and Boards of Directors who favor quill pens and scratching on shards. Publishers want their old monopolies back. Universities want that darned trivium too. These are notions I sympathize with, but I recognize that the odds are indeed long.
Stephen E. Arnold, January 17, 2008
Google 2008 Publishing Output
January 1, 2008
If you had any doubt about Google’s publishing activities, check out “Google Blogging in 2008.” The article by Susan Straccia here provides a rundown of the GOOG’s self-publishing output. Google has more than 120 Web logs. The article crows about the number of unique visitors and tosses in some Googley references to Google fun. Pushing the baloney aside, the message is clear: Google has an effective, global publishing operation focused exclusively on promoting Google. Toss in the Google Channel on YouTube.com, and the GOOG has a communication, promotion, and distribution mechanism that few of its rivals can match. In my opinion, not even a major TV network in the US can reach as many eyeballs as quickly and cheaply as Googzilla. Competitors have to find a way to match this promotional 30mm automatic Boeing M230 chain gun.
Stephen E. Arnold, January 1, 2008