Search’s Old Chestnuts Roasted and Crunched

February 14, 2008

Once again, I’m sitting in the Seattle – Tacoma airport waiting for one of the inconvenient flights from the cold, damp Northwest to the ice-frosted coal tailings of rural Kentucky.

Chewing through the news stories that flow to me each day, I scanned Nick Patience’s article “Fast Solutions to Lost Causes.” Dated February 13, 2008, Mr. Patience provides CIO Magazine’s readers with his commentary on the Microsoft – Fast Search deal. The subtitle of the article reads, “Nick Patience examines how Microsoft’s acquisition of FAST could help CIOs get to grips with the useful storage of elusive information.”

For the last 48 hours, I have dipped in and out of the Seattle business community where chatter about Microsoft is more common than talk about the US presidential primaries or the Seattle weather. After listening to opinions ranging from “the greatest thing since sliced bread” to “the deal was done without much thought”, I was revved and ready to get the CIO Magazine’s take on this subject.

Several points strike me as old chestnuts; that is, familiar information presented as fresh vegetables. For example, let me pull out three points and then offer a slightly different take on each of them. You can judge for yourself what you want to perceive about this deal.

First, the title is “Fast Solutions to Lost Causes.” As I read the title, it seems to say, “Microsoft has a lost cause.” So, how can an acquisition offer a “fast solution” to a “lost cause”? If a cause is lost, a solution is not available. A “fast solution” to a big problem is almost a guarantee that the fast solution will amplify the original, unsolved problem. Puzzling to me. I think this is one of those catchy headlines so liked by editors. Google’s indexing robot will be challenged to tag this story correctly based on the metaphor-charged word choice. But that’s a grumpy old man’s viewpoint.

Now, the second point. Mr. Patience asserts, “But Microsoft has done very little — even after the introduction of SharePoint in 2001 — to help CIOs not only get their arms around all this unstructured information, but more pertinently, to figure out what it is, where it is, how valuable or risky it is and how most effectively to store it.” Based on the research I did for the first three editions of the Enterprise Search Report and my regular consulting business, I don’t agree. Microsoft’s apparent lack of interest in search does not match what I know. Specifically, in the early years of this decade, Microsoft relied on its partners to develop solutions using Microsoft tools. These solutions would “snap in”, amplify, and expand the ecosystem for Microsoft products, services, and certified vendors. dtSearch, Mondosoft (now part of SurfRay), Coveo, and other search systems that integrate more or less seamlessly with SharePoint are examples of this Microsoft strategy. I’m not saying Microsoft chose the optimal strategy, but to suggest that the company lacked a strategy misrepresents one of Microsoft’s methods for delivering on its “agenda”. Could Microsoft have approached behind-the-firewall search differently? Sure. Would these unexercised options have worked better than the ecosystem approach? Who knows? The acquisition of Fast Search & Transfer is a new strategy. Coveo, for example, has expanded its operating system support, added functions, and looked into non-Microsoft markets because Microsoft seemed to be shifting from one strategic approach to another. But at the same time that some vendors were decreasing their reliance on Microsoft, others like Autonomy created adapters to make their systems more compatible with SharePoint environments. This is not a grumpy old man’s view; these are the facts of one facet of Microsoft’s business model.

Third, Mr. Patience implicitly references the search initiatives at Oracle (SES 11g, Triple Hop acquisition, Google partner deal) and IBM (Omnifind, WebFountain, the iPhrase acquisition, X1 tie up, deals with Endeca and Fast) as examples of more satisfying market tactics than Microsoft’s. No grumpiness here; these strike me as stabs in the dark, at least as I perceive reality.

As I stated in the three editions of Enterprise Search Report done on my watch, none of the superplatforms implemented effective behind-the-firewall strategies. Each of these companies tried different approaches. Each of these companies pushed into behind-the-firewall search with combinations of in-house development and acquisitions. Each of these companies experienced some success. Each of these companies’ strategies remains, at this time, a work in progress. I’m not grumpy about this. This is just plain old corporate history. IBM and Oracle have been trying to crack the behind-the-firewall chestnut (metaphor intended). So far, both companies have only chipped teeth to show for their decades of effort.

I urge you to read Mr. Patience’s article and take a thorough look at other reports from the 451 Group. You will find some useful insights. Keep in mind, however, that when the chestnuts are broken open, the meat revealed may be quite different from the shell’s surface.

These three companies — IBM, Microsoft, and Oracle — have deep experience with behind-the-firewall search. Oracle’s efforts extend to the late 1980s when the database company acquired Artificial Linguistics. IBM’s efforts reach back to mainframe search with the STAIRS system, which is still available today as System Manager, and Microsoft’s search efforts have been one of the firm’s R&D centroids for many, many, many years.

Success is a different issue altogether. There are many reasons why none of these firms has emerged as the leader in behind-the-firewall search. But it is much more comforting to grab a handful of chestnuts than go find the chestnut tree, harvest the nuts, roast them, and then consume their tasty bits. What we know is that Microsoft is willing to pay $1.2 billion to try and go faster.

Stephen Arnold, February 14, 2008

Context: Popular Term, Difficult Technical Challenge

February 13, 2008

In April 2008, I’m giving a talk at Information Today’s Buying & Selling Econtent conference.

When I am designated as a keynote speaker, I want to be thought provoking and well prepared. So I try to start thinking about the topic a month or more before the event. As I was ruminating about my topic, I was popping in and out of email. I was doing what some students of human behavior might call context shifting.

The idea is that I was doing one thing (thinking about a speech) and then turning my attention to email or a telephone call. When I worked at Booz, Allen, my boss described this behavior as multi-tasking, but I don’t think I was doing two or three things at once. He was, like Einstein, not really human. I’m just a guy from a small town in Illinois, trying to do one thing and not screw it up. So I was doing one thing at a time, just jumping from one work context to another. Normal behavior for me, but I know from observation that my 86-year-old father doesn’t handle this type of function as easily as I do. I also know that my son is more adept at context shifting than I am. Obviously it’s a skill that can deteriorate as one’s mental acuity declines.

What struck me this morning was that in the space of a half hour, one email, one telephone call, and one face-to-face meeting each used the word “context”. Perhaps the Nokia announcement and its use of the word context allowed me to group these different events. I think that may be a type of meta tagging, but more about that notion in a moment.

Context seemed to be a high-frequency term in the last 24 hours. I don’t need a Markov procedure to flag the term. The Google Trends report seems to suggest that context has been in a slow decline since the fourth quarter of 2004. Maybe so, but “context” was le mot du jour for me.

What’s Context in Search?

In my insular world, most of the buzzwords I hear pertain to search and retrieval, text processing, and online. After thinking about the word context, I jotted down the different meanings the word had in each of the communications I noticed.

The first use of context referenced the term as I defined it in my 2007 contributions to Bear Stearns’ analyst note, “Google and the Semantic Web.” I can’t provide a link to this document. You will have to chase down your local Bear Stearns’ broker to get a copy. This report describes the inventions of Ramanathan Guha. The PSE or Programmable Search engine discerns and captures context for a user’s query, the information satisfying that query, and other data that provide clues to interpret a particular situation.

The second use of context was a synonym for personalization. The idea was that a user profile would provide useful information about the meaning of a query. Suppose a user looks for consumer information about gasoline mileage. When the system “knows” this fact, a subsequent query for “green fuel” is processed in the context of an automobile. In this case, “green” means environmentally friendly. Personalization, in other words, makes it possible to predict a user’s likely context from search history and implicit or explicit profile information.
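
Here is a toy sketch of that personalization idea. The profile fields, category names, and keyword hints are invented for illustration; no vendor’s actual method is implied.

```python
# Toy illustration: use a stored interest profile to bias how an
# ambiguous query is interpreted. Categories and keywords are invented.

PROFILE = {"recent_topics": ["gasoline mileage", "hybrid cars"]}

CATEGORY_HINTS = {
    "automobile": {"gasoline", "mileage", "hybrid", "cars", "fuel"},
    "finance": {"stocks", "bonds", "green", "currency"},
}

def infer_context(profile):
    """Score each category by overlap with the user's recent topics."""
    words = set()
    for topic in profile["recent_topics"]:
        words.update(topic.lower().split())
    scores = {cat: len(words & hints) for cat, hints in CATEGORY_HINTS.items()}
    return max(scores, key=scores.get)

def interpret(query, profile):
    context = infer_context(profile)
    # "green fuel" on its own is ambiguous; the profile tips it
    # toward the automobile sense of "green".
    return f"{query} (interpreted in the '{context}' context)"

print(interpret("green fuel", PROFILE))
```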

The third use of context came up in a discussion about key word search. My colleague made the point that most search engines are “pretty dumb.” “The key words entered in a search box have no context,” he opined. The search engine, therefore, has to deliver the most likely match based on whatever data are available to the query processor. A Web search engine gives you a popular result for many queries. Type Spears into Google and you get pop star hits and few manufacturing and weapon hits.

When a search engine “knows” something about a user — for example, search history, factual information provided when the user registered for a free service, or the implicit or explicit information a search system gathers from users — search results can be made more on point. The idea is that the relevance of the hits matches the user’s needs. The more the system knows about a user and his context, the more relevant the results can be.

Sometimes the word context, when used in reference to search and retrieval, means “popping up a level” in order to understand the bigger picture for the user. Context, therefore, makes it possible to “know” that a user is moving toward the airport (geo spatial input), has a history of looking at flight departure information (user search history), and making numerous data entry errors (implicit monitoring of user misspellings or query restarts). These items of information can be used to shape a results set. In a more extreme application, these context data can be used to launch a query and “push” the information to the user’s mobile device. This is the “search without search” function I discussed in my May 2007 iBreakfast briefing, which — alas! — is not available online at this time.
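
A minimal sketch of the “search without search” rule described above. The signal names and the threshold are invented; real systems weigh far more evidence than this.

```python
# Toy "search without search" rule: if enough context signals line up,
# push a result instead of waiting for a query. Signals and thresholds
# are invented for the sketch.

def should_push(context):
    score = 0
    if context.get("heading_to_airport"):                          # geospatial input
        score += 1
    if "flight departures" in context.get("search_history", []):   # search history
        score += 1
    if context.get("typo_rate", 0.0) > 0.3:                        # implicit monitoring
        score += 1
    return score >= 2

context = {
    "heading_to_airport": True,
    "search_history": ["flight departures", "parking"],
    "typo_rate": 0.4,
}

if should_push(context):
    print("Pushing flight departure times to the mobile device")
```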

Is Context Functionality Ubiquitous Today?

Yes, there are many online services that make use of context functions, systems, and methods today.

Even though context systems and methods add extra computational cycles, many companies are knee deep in context and its use. I think the low profile of context functions may be, in part, due to privacy issues becoming the target of a media blitz. In my experience, most users accept implicit monitoring if the user has a perception that their identity is neither tracked nor used. The more fuzzification — that is, statistical blurring — of a single user’s identity, the less the user’s anxiety about implicit tracking in order to use context data as a way to make results more relevant. Other vendors have not figured out how to add additional computational loads to their systems without introducing unacceptable latency, and these vendors offer dribs and drabs of context functionality. As their infrastructure becomes more robust, look for more context services.
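
One way to picture the “fuzzification” mentioned above is generalizing a profile so it describes a cohort rather than a person. The record fields and bucket sizes below are invented for the sketch.

```python
# Rough sketch of blurring a user record before it feeds context
# calculations: exact values are generalized into coarse buckets so the
# profile describes a group, not an individual. Fields are invented.

def blur(record):
    return {
        "age_band": f"{(record['age'] // 10) * 10}s",   # 37 -> "30s"
        "region": record["zip"][:3] + "xx",             # coarse geography
        "interests": sorted(record["interests"])[:3],   # cap the detail kept
    }

user = {"age": 37, "zip": "40223", "interests": ["search", "flying", "opera"]}
print(blur(user))
# {'age_band': '30s', 'region': '402xx', 'interests': ['flying', 'opera', 'search']}
```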

One company making good use of personalization-centric context is Yahoo. Its personalized MyYahoo service delivers news and information selected by the user. Yahoo announced its forthcoming OneConnect service this week at the telco conference in Barcelona, Spain. Based on the news reports I have seen, Yahoo wants to extend its personalization services to mobile devices.

Although Yahoo doesn’t talk about context, a user who logs in with a Yahoo ID will be “known” to some degree by Yahoo. The user’s mobile experience, therefore, has more context than a user not “known” to Yahoo. Yahoo’s OneConnect is a single example of context that helps an online service customize information services. Viewed from a privacy advocate’s point of view, this type of context is an intrusion, perhaps unwelcome. However, from the vantage point of a mobile device user rushing to the airport, Yahoo’s ability to “know” more about the user’s context can allow more customized information displays. Flight departure information, parking lot availability, or weather information can be “pushed” to the Yahoo user’s mobile device without the user having to push buttons or make finger gestures.

Context, when used in conjunction with search, refers to additional information about [a] a particular user or group of users identified as belonging to a cluster of users, [b] information and data in the system, [c] data about system processes, and [d] information available to Yahoo though not residing on its servers.

Yahoo and T-Mobile are not alone in their interest in this type of context sensitive search. Geo spatial functions are potential enablers of news services and targeted advertising revenue. Google and Nokia seem to be moving on a similar vector. Microsoft has a keen awareness of context and its usefulness in search, personalization, and advertising.

Context has become a key part of reducing what I call the “shackles of the search box.” Thumb typing is okay but it’s much more useful to have a device that anticipates, personalizes, and contextualizes information and services. If I’m on my way to the airport, the mobile device should be able to “know” what I will need. I know that I am a creature of habit as you probably are with regard to certain behaviors.

Context allows disambiguation. Disambiguation means figuring out which of two or more possibilities is the “right” one. A good example comes up dozens of times a day. You are in line to buy a bagel. The clerk asks you, “What kind of bagel?” with a very heavy accent, speaking rapidly and softly. You know you want a plain bagel. Without hesitation, you are able to disambiguate what the clerk uttered and reply, “Plain, please.”

Humans disambiguate in most social settings, when reading, when watching the boob tube, or just figuring out weird road signs glimpsed at 60 miles per hour. Software doesn’t have the wetware humans have. Disambiguation in search and retrieval systems is a much more complex problem than looking up string matches in an index.

Context is one of the keys to figuring out what a person means or wants. If you know a certain person looks at news about Kolmogorov axioms, next-generation search systems should know that if the user types “Plank”, that user wants information about Max Planck, even though the intrepid user mistyped the name. Google seems to be pushing forward to use this type of context information to minimize the thumb typing that plagues many mobile device users today.
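
A toy version of that disambiguation: candidate spellings are scored by string similarity, and the user’s topical history breaks the tie. The candidates and topic weights are invented; this is not how Google actually does it.

```python
# Toy context-aware correction: pick among candidate spellings using
# string similarity plus a bonus for topics the user has shown interest
# in. Candidates and topic associations are invented for the example.

from difflib import SequenceMatcher

CANDIDATES = {
    "plank": "carpentry",   # a literal plank of wood
    "planck": "physics",    # Max Planck
}

USER_TOPICS = {"physics": 3, "probability": 2}   # e.g. Kolmogorov axioms

def score(query, candidate, topic):
    similarity = SequenceMatcher(None, query, candidate).ratio()
    topic_bonus = 0.1 * USER_TOPICS.get(topic, 0)
    return similarity + topic_bonus

query = "plank"
best = max(CANDIDATES, key=lambda c: score(query, c, CANDIDATES[c]))
print(best)  # 'planck' wins: the physics bonus outweighs the lower string similarity
```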

These types of context awareness seem within reach. Though complex, many companies have technologies, systems, and methods to deliver what I call basic context metadata. Let me note that context aware services are in wide use, but rarely labeled as “context” functions. The problem with naming is endemic in search, but you can explore some of these services at their sites. You may have to register and provide some information to take advantage of the features:

  • Google ig (Individualized Google) — Personalized start page, automatic identification of possibly relevant information based on your search history, and tools for you to customize the information
  • Yahoo MyYahoo — content customization, email previews, and likely integration with the forthcoming OneConnect service
  • MyWay — IAC’s personalized start page. One can argue that IAC’s implementation is easier to use than Yahoo’s and more graphically adept than Google’s ig service.

If you are younger than I or young at heart, you will be familiar with the legions of Web 2.0 personalization services. These range from RSS (really simple syndication) feeds that you set up to NetVibes, among hundreds of other mashy, nifty, sticky services. You can explore the most interesting of these services at Tech Crunch. It’s useful to click through the Tech Crunch Top 40 here. I have set up a custom profile on Daily Rotation, a very useful service for people in the information technology market.

An Even Tougher Context Challenge

As interesting and useful as voice disambiguation and automatic adjustment of search results are, I think there is a more significant context issue. At this time, only a handful of researchers are working on this problem. It probably won’t surprise you that my research has identified Google as the leader in what I call “meta-context systems and methods.”

The term meta refers to “information about” a person, process, datum, or other information. The term has drifted a long way from its Latin meaning of a turn in a hippodrome; for example, meta prima was the first turn. Mathematicians and scientists use the term to mean related to or based upon. When a vendor talks about indexing, the term metadata is used to mean those tags or terms assigned to an information object by an automated indexing system or a human subject matter expert who assigns index terms.

The term is also stretched to reference higher levels in nested sets. So, when an index term applies to other index terms, that broader index term performs a meta-index function. For example, if you have an index of documents on your hard drive, you can index a group of documents about a new proposal as “USDA Proposal.” The term does not appear in any of the documents on your hard drive. You have created a meta-index term to refer to a grouping of information. You can also create meta-indexes automatically. Most people don’t think of naming a folder or a new directory as assigning an index term, but software that performs automatic indexing can assign these meta-index terms. Automatic classification systems can perform this function. I discuss the different approaches in Beyond Search, and I won’t rehash that information in this essay.
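
A minimal sketch of a meta-index term in action: the label “USDA Proposal” appears in none of the documents, and a hand-written keyword rule stands in for the automatic classifiers mentioned above.

```python
# Minimal meta-indexing sketch: assign a group label ("USDA Proposal")
# that appears in none of the documents themselves. The rule here is a
# hand-written keyword test standing in for an automatic classifier.

DOCS = {
    "draft_budget.txt": "cost estimates for the department of agriculture bid",
    "cover_letter.txt": "response to the agriculture department solicitation",
    "vacation.txt": "notes on a trip to the beach",
}

RULES = {
    "USDA Proposal": {"agriculture", "bid", "solicitation"},
}

def meta_index(docs, rules):
    index = {label: [] for label in rules}
    for name, text in docs.items():
        words = set(text.lower().split())
        for label, triggers in rules.items():
            if words & triggers:
                index[label].append(name)
    return index

print(meta_index(DOCS, RULES))
# {'USDA Proposal': ['draft_budget.txt', 'cover_letter.txt']}
```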

The “real context challenge” then is to create a meta context for available context data. Recognize that context data is itself a higher level of abstraction than a key word index. So we are now talking about taking multiple contexts, probably from multiple systems, and creating a way to use these abstractions in an informed way.

You, like me, get a headache when thinking about these Russian doll structures. Matryoshka (матрёшка) are made of wood or plastic. When you open one doll, you see another inside. You open each doll and find increasingly small dolls inside the largest doll. The Russian doll metaphor is a useful one. Each meta-context refers to the larger doll containing smaller dolls. The type of meta context challenge I perceive is finding a way to deal with multiple matryoshkas, each containing smaller dolls. What we need, then, is a digital basket into which we can put our matryoshkas. A single item of context data is useful, but having access to multiple items and multiple context containers opens up some interesting possibilities.

In Beyond Search, I describe one interesting initiative at Google. In 2006, Google acquired a small company that specialized in systems and methods for manipulating these types of information context abstractions. There is interesting research into this meta context challenge underway at the University of Wisconsin — Madison as well as at other universities in the U.S. and elsewhere.

Progress in context is taking place at two levels. At the lowest level, commercial services are starting to implement context functions into their products and services. Mobile telephony is one obvious application, and I think the musical chairs underway with Google, Yahoo, and their respective mobile partners is an indication that jockeying is underway. Also at this lowest level are the Web 2.0 and various personalization services that are widely available on Web sites or in commercial software bundles. In the middle, there is not much high-profile activity, but that will change as entrepreneurs sniff the big pay offs in context tools, applications, and services. The most intense activity is taking place out of sight of most journalists and analysts. Google, one of the leaders in this technology space, provides almost zero information about its activities. Even researchers at major universities have a low profile.

That’s going to change. Context systems and methods may open new types of information utility. In my April 2008 talk, I will provide more information about context and its potential for igniting new products, services, features, and functions for information-centric activities.

Stephen Arnold, February 13, 2008

Trapped by a Business Model, Not Technology

February 12, 2008

The headline “Reuters CEO sees Semantic Web in its Future” triggered an immediate mouse click. The story appeared on O’Reilly’s highly regarded Radar Web log.

Tim O’Reilly, who wrote the article, noted: “Adding metadata to make that job of analysis easier for those building additional value on top of your product is a really interesting way to view the publishing opportunity.”

Mr. O’Reilly noted that: “I don’t think he [Devin Wenig, a Reuters executive] should discount the statistical, computer-aided curation that has proven so powerful on the consumer Internet.”

Hassles I’ve Encountered

The Reuters comment about the Semantic Web did underscore the often poor indexing done by publishing and broadcasting companies. In my experience, I have had to pay for content that was in need of considerable post-processing and massaging.

For example, if you license a news feed from one of the commercial vendors, some of the feeds will:

  • Send multiple versions of the stories “down the wire”, often with tags that make it difficult to determine which is the more accurate version. Scripts can delete previous versions, but errors can occur, and when noticed, some have to be corrected by manual inspection of the feed data.
  • Deliver duplicate versions of the same story because the news feed aggregator does not de-duplicate variants of the story from different sources (see the sketch after this list). Some systems handle de-duplication gracefully and efficiently. Examples that come to mind are Google News and Vivisimo. Yahoo’s approach with tabs to different news services is workable as well, but it is not “news at a glance”. Yahoo imposes additional clicking on me.
  • Insert NewsXML plus additional tags without alerting downstream subscribers. When this happens, the scripts can crash or skip certain content. The news feed services try to notify subscribers about changes, but in my experience there are many “slips betwixt cup and lip.”
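
Here is a minimal sketch of the clean-up this pushes onto subscribers: normalize the headline, collapse duplicates, and keep only the newest version of each story. The field names and feed items are invented.

```python
# Minimal feed clean-up sketch: collapse duplicate and superseded story
# versions by a normalized headline fingerprint, keeping the newest
# item. Field names are invented for the example.

import re
from datetime import datetime

FEED = [
    {"headline": "Microsoft to buy Fast Search",  "source": "Wire A", "sent": "2008-02-13T09:00"},
    {"headline": "Microsoft to Buy FAST Search!", "source": "Wire B", "sent": "2008-02-13T09:05"},
    {"headline": "Microsoft to buy Fast Search",  "source": "Wire A", "sent": "2008-02-13T11:30"},
]

def fingerprint(headline):
    """Normalize case and punctuation so near-identical headlines collide."""
    return re.sub(r"[^a-z0-9 ]", "", headline.lower()).strip()

def dedupe(feed):
    latest = {}
    for item in feed:
        key = fingerprint(item["headline"])
        when = datetime.fromisoformat(item["sent"])
        if key not in latest or when > datetime.fromisoformat(latest[key]["sent"]):
            latest[key] = item   # keep only the newest version of the story
    return list(latest.values())

for item in dedupe(FEED):
    print(item["sent"], item["headline"])
# Only the 11:30 version of the story survives.
```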

Now the traditional powerhouses in the news business face formidable competition on multiple fronts. There are Web logs. There are government “news” services, including the remarkably productive US Department of State, the largely unknown Federal News Service, and the often useful Government Printing Office listserv. There are news services operated by trade associations. These range from the American Dental Association to the Welding Technology Institute of Australia. Most of these organizations are now Internet savvy. Many use Web logs, ping servers, and RSS (really simple syndication) to get information to constituents, users, and news robots. Podcasts are just another medium for grass roots publishers to use at low or no cost.

We are awash in news — text, audio, and video.

Balancing Three Balls

Traditional publishers and broadcasters, therefore, are trying to accomplish three goals at the same time. I recall from a lecture that the legendary president of General Motors, Alfred P. Sloan (1875 – 1966) is alleged to have said: “Two objectives is no objective.” Nevertheless, publishers like Reuters and its soon-to-be owner are trying to balance three balls on top of one another:

First, maintain existing revenues in the face of the competition from governments, associations, individual Web log operators, and ad-supported or free Internet services.

Second, create new products and services that generate new revenue. The new revenue must not cannibalize any traditional revenue.

Third, give the impression of being “with it” and on the cutting edge of innovation. This is more difficult than it seems, and it leads to some executives’ talking about an innovation that is no longer news. Could I interpret the Reuters’ comment as an example of faux hipness?

Publishers can indeed leverage the Semantic Web. There’s a published standard. Commercial systems are widely available to perform content transformation and metatagging; for example, in Beyond Search I profile two dozen companies offering different bundles of the needed technology. Some of these are well known (IBM, Microsoft); others are less well known (Bitext, Thetus). And as pre-historic as it may seem to some publishing and broadcast executives, even skilled humans are available to perform some tasks. As good as today’s semantic systems are, humans are sometimes needed to do the knowledge work required to make content more easily sliced and diced, post-processed, and “understood”.

It’s Not a Technology Problem

The fact is that traditional publishers and broadcasters have been slow to grasp that their challenge is their business model, not technology. No publisher has to be “with it” or be able to exchange tech-geek chatter with a Google, Microsoft, or Yahoo wizard.

Nope.

What’s needed is a hard look at the business models in use at most of the traditional publishing companies, including Reuters and the other companies who have their roots in professional publishing, trade publishing, newspaper publishing, and magazine publishing. While I’m making a list, I want to include radio, television, and cable broadcasting companies as well.

These organizations have engineers who know what the emerging technologies are. There may be some experiments that are underway and yielding useful insights into how traditional publishing companies can generate new revenues.

The problem is that the old business models generate predictable revenue. Even if that revenue is softening or declining, most publishing executives understand the physics of their traditional business model. Newspapers sell advertising. Advertisers pay to reach the readers. Readers pay a subscription to get the newspaper with the ads and a “news hole”. Magazine publishers rely either on controlled circulation to sell ads or on a variant of the newspaper model. Radio and other broadcast outlets sell airtime to advertisers.

These business models are deeply ingrained, have many bells and whistles, and deliver revenue reasonably well in today’s market. The problem is that the revenue efficiency in many publishing sectors is softening.

Now the publishers want to generate new revenues while preserving their traditional business models, and the executives don’t want to cannibalize existing revenues. Predictably, the cycle repeats itself. How hard is it to break the business model handcuffs of traditional publishing? Rupert Murdoch has pulled in his horns at the Wall Street Journal. Not even he can get free of the business model shackles that are confining once powerful organizations and making them sitting ducks for competitive predators.

Semantic Web — okay. I agree it’s hot. I am just finishing a 250-page look at some of the companies doing semantics now. A handful of these companies are almost a decade old. Some, like IBM, were around when Albert Einstein was wandering around Princeton in his house slippers.

I hope Reuters “goes semantic”. With the core business embedded in numeric data, I think the “semantic” push will be more useful when Reuters’ customers have the systems and methods in place to make use of richer metatagging. The Thomson Corporation has been working for a decade or more to make its content “smarter”; that is, better indexing, automated repurposing of content, and making it possible for a person in one of Thomson’s more than 100 units to find out what another person in another unit wrote about the same topic. Other publishers are genuinely confused and understandably uncertain about the Internet as an application platform. Buggy whip manufacturers could not make the shift to automotive seat covers more than 100 years ago. Publishers and broadcasters face the same challenge.

Semantic technology may well be more useful inside a major publishing or broadcasting company initially. In my experience, most of these operations have data in different formats, systems, and data models. It will be tough to go “semantic” until the existing data can be normalized and then refreshed in near real time. Long updates are not acceptable in the news business. Wait too long, and you end up with a historical archive.
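
The normalization step looks simple in a sketch; the expense is in covering every legacy format and keeping the refresh near real time. The two source systems and their field names below are invented.

```python
# Sketch of normalizing records from two hypothetical internal systems
# into one common story model. The source field names are invented.

from datetime import datetime

def from_print_archive(rec):
    return {
        "id": rec["story_id"],
        "title": rec["hed"],
        "body": rec["body_text"],
        "published": datetime.strptime(rec["pub_date"], "%d/%m/%Y"),
    }

def from_web_cms(rec):
    return {
        "id": rec["uuid"],
        "title": rec["title"],
        "body": rec["html"],   # still needs tag stripping downstream
        "published": datetime.fromisoformat(rec["published_at"]),
    }

records = [
    ({"story_id": "A1", "hed": "Rates rise", "body_text": "...", "pub_date": "12/02/2008"}, from_print_archive),
    ({"uuid": "b-77", "title": "Rates rise", "html": "<p>...</p>", "published_at": "2008-02-12T10:15"}, from_web_cms),
]

normalized = [adapter(rec) for rec, adapter in records]
for story in normalized:
    print(story["id"], story["published"].date(), story["title"])
```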

Wrap Up

To conclude, I think that new services such as The Issue, the integration of local results into Google News, and a wide range of tools that allow anyone to create a personalized news feed are going to make life very, very difficult for traditional publishers. Furthermore, most traditional publishing and broadcast companies have yet to understand the differences between TV and cable programming and what I call “YouTube” programming.

Until publishing finds a way to get free of its business model “prison”, technology — trendy or not — will not be able to work revenue miracles.

Update February 13, 2008, 8:34 am Eastern — Useful case example about traditional publishing and new media. The key point is that the local newspaper is watching the upstart without knowing how to respond. Source: Howard Downs.

Stephen Arnold, February 12, 2008

Social Search: No Panacea

February 11, 2008

I wrote a chapter for the forthcoming book of essays, Collective Intelligence. Information about the volume is at the Oss.net Web site. If you don’t see a direct link to the study, check back. The book is just in its final run up to publication.

I’m thinking about my chapter “Search Panacea or Ploy: Can Collective Intelligence Improve Findability?” As we worked on the index for my contribution, we talked about the notion of social search. Wikipedia, as you might have suspected, has a substantial entry about social search. A search for the phrase “social search” on any of the Web search engines returns thousands of entries. As of February 11, 2008, here are Yahoo’s.

Few will doubt that the notion of social search — with humans providing metatags about information — is a hot trend in search.

I can’t recycle the arguments presented in my contribution to Collective Intelligence. I can, however, ask several questions about social search to which I think more research effort should be applied:

Gaming the System

In a social setting, most people will play by the rules. A small percentage of those people will find ways to “game” or manipulate the system to suit their purposes. Online social systems are subject to manipulation. Digg.com and Reddit.com have become targets of people and their scripts. The question is, “How can a user trust the information on a social system?” This is a key issue for me. Several years ago I gave a talk at a Kroll (Marsh McLennan) officers’ meeting where the audience was keenly interested in ways to determine the reputation of people and the validity of their actions in a social online system.

Most Lurk, Two Percent Contribute

My work in social search last year revealed a surprising — to me at least — piece of data. Take a social search site with 100 users. Only two people contribute on a regular basis. I think more research is needed to understand how active individuals can shape the information available. The question is, “What is the likelihood that active participants will present information that is distorted or skewed inadvertently?” The problem is that in an online space where there is no or a lax editorial policy, distortion may be “baked into” the system. Naive users can visit a site in search of objective results, and the information, by definition, is not objective.

Locked in a Search Box

Some of the social search systems offer tag clouds or a graphic display of topics. The Mahalo.com site makes it easy for a user to get a sense of the topics covered. Click on the image below, and you will readily see that Mahalo is a consumer centric system, almost an updated version of Yahoo’s original directory:

[Screenshot: Mahalo.com topic display]

The question is, “What else is available in this system?” Most of the social search sites pose challenges to users. There’s no index to the content, and no easy way to know when the information was updated. I’ve had this issue with About.com for years. The notion of scope and currency nag at me, and the search box requires that I guess the secret combination of words before I can dig deeply into the information available.

In my contribution to Collective Intelligence, I cover a number of more complex issues. For example, Google is — at its core — a social search system. The notions of links and clicks are artifacts of human action and attention. By considering these, Google keeps its finger on the pulse of its users’ behavior. I think this aspect of Google’s system has long been understood, but Google’s potential in the social search space has not been fully appreciated in the social buzz.

Stephen Arnold, February 11, 2008

Google and Obfuscated JavaScript

February 10, 2008

Sunday mornings are generally calm in rural Kentucky. There’s the normal pop of gun fire as my neighbors hunt squirrels for burgoo, and there is the routine salvo of news stories about Google.

I zipped through the “Google to Invest in CNet” story but paused on the Google “obfuscated JavaScript” story here. A number of Web logs and Web sites are running the news item. Google Blogoscoped’s story ran on Friday, February 8, 2008, “Why Does Google Obfuscate Their [sic] Code?” Philipp Lenssen does a good job, and this post contains a number of intriguing comments from his readers. Some of these folks speculate that Google compresses JavaScript to save bandwidth; others hint that Google is intentionally creating hard-to-read code. A possible code sighting is here.

Speculating about Google’s whys and wherefores is fun but only semi-helpful. My hit-and-miss dealings with the company reveal “controlled chaos.” The way to get a look at what Google does is to dig through its technical papers (do it daily because the list can change) and read some of the company’s patent applications, patents, and, if you are in Washington, DC, the “wrappers” available to some savvy researchers.

Some hints about the JavaScript mystery appear in this document: “Method and System for Dynamically Composing Distributed Interactive Applications from High-Level Programming Languages”, US20080022267. The invention was filed in April 2005 and was published on January 24, 2008. When an application is published, Google often has examples of the document’s “invention” running or even visible for the intrepid investigator. Three years is a long time for gestation at the Google. My hunch is that the JavaScript is produced by the Googleplex’s auto-programming techniques, possibly the one disclosed in US20080022267.

I’m no attorney, and patents are difficult to analyze even for the experts. Read the document. You may find that the odd ball JavaScript is a way to eliminate manual drudgery for Googlers. US20080022267 may shed some light on what Google may do to spit out JavaScript for browsers, for instance. What do you think? I am toying with the idea that Google does automatic JavaScript to improve efficiency and eliminate some grunt work for its wizards.
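
To see why machine-generated code looks “obfuscated” without anyone trying to hide anything, here is a toy renaming pass of the kind compilers and minifiers apply. It illustrates the effect only; it has nothing to do with Google’s actual tooling.

```python
# Toy identifier-shortening pass: mechanically rename names to
# one-letter tokens, the way minifiers and code generators do. This
# only illustrates why generated output is hard to read; it is not how
# Google (or any real tool) actually works.

import re
from itertools import count
from string import ascii_lowercase

SOURCE = """
function updateTotals(orderItems, taxRate) {
    var runningTotal = 0;
    for (var itemIndex = 0; itemIndex < orderItems.length; itemIndex++) {
        runningTotal += orderItems[itemIndex].price;
    }
    return runningTotal * (1 + taxRate);
}
"""

RESERVED = {"function", "var", "for", "return", "length", "price"}

def shorten(source):
    names = {}
    fresh = (ascii_lowercase[i % 26] + (str(i // 26) if i >= 26 else "")
             for i in count())
    def rename(match):
        word = match.group(0)
        if word in RESERVED:
            return word
        if word not in names:
            names[word] = next(fresh)   # consistent but meaningless name
        return names[word]
    return re.sub(r"[A-Za-z_]\w*", rename, source)

print(shorten(SOURCE))
```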

You can obtain US20080022267 here. If you haven’t used the USPTO’s search and retrieval system, check out the sample queries. The system is sluggish, so you can try Google’s own patent service here. I’ve found that Google’s service is okay, but it’s better to go to the USPTO site, particularly for recently issued documents.

I want to conclude by saying, “I don’t think conspiracy theories are the way to think about Google.” Google’s big. It is innovative. It is — to use Google’s own term — chaotic. I think Google operates like a math club on steroids, based on my limited experience with the company.

I’m inclined to stick with the simplest explanation which seems clearly set forth in US20080022267. “Controlled chaos” is a way to disrupt monoliths, but it doesn’t lend itself to highly-targeted, human-motivated fiddling with JavaScript. Not even Google has that many spare synapse cycles.

Stephen Arnold, February 10, 2008

Is the Death Knell for SEO Going to Sound?

February 9, 2008

Not long ago, a small company wondered why its Web site was the Avis to its competitor’s Hertz. The company’s president checked Google each day, running a query to find out if the rankings had changed.

I had an opportunity to talk with several of the people at this small company. The firm’s sales did not come from the Web site. Referrals had become the most important source of new business. The Web site was — in a sense — ego-ware.

I shared some basic information about Google’s Web master guidelines, a site map, and error-free code. These suggestions were met with what I would describe as “grim acceptance.” The mechanics of getting a Web site squared away were work but not unwelcome. My comments articulated what the Web team already knew.
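
The mechanical fixes really are mechanical. For example, a bare-bones sitemap file can be generated from a list of URLs in a few lines; the URLs below are placeholders.

```python
# Bare-bones sitemap.xml generator following the sitemaps.org protocol.
# The URLs are placeholders; a real site would pull them from its CMS
# or by walking the document root.

from xml.sax.saxutils import escape

PAGES = [
    "http://www.example.com/",
    "http://www.example.com/services.html",
    "http://www.example.com/contact.html",
]

def sitemap(urls):
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>"
    )

print(sitemap(PAGES))
```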

The second part of the meeting focused on the “real” subject. The Web team wanted the Web site to be number one. I thanked the Web team and said, “I will send you the names of some experts who can assist you.” SEO work is not my cup of tea.

Then, yesterday, as Yogi Berra allegedly said, “It was déjà vu all over again.” Another local company found my name and arranged a meeting. Same script, different actors.

“We need to improve our Google ranking,” the Web master said. I probed and learned that the company’s business came within a 25 mile radius of the company’s office. Google and other search engines listed the firm’s Web site deep in the results lists.

I replayed the MP3 in my head about clean code, sitemaps, etc. I politely told the local Web team that I would email them the names of some SEO experts. SEO is definitely an issue. Is the worsening economy the reason?

Here’s a summary of my thinking about these two opportunities to bill some time and make some money:

  1. Firms want to be number one on Google and somehow have concluded that SEO tactics can do the trick.
  2. There is little resistance to mechanical fixes, but there is little enthusiasm for adding substantive content to a Web site.
  3. In the last year, interest in getting a Web site to the top of Live.com or Yahoo.com has declined, based on my observations.

Content, the backbone of a Web site, is important to site visitors. When I do a Web search, I want links to sites that have information germane to my query. Term stuffing, ripped-off content, and other “tricks” don’t endear certain sites to me.

I went in search of sources and inspiration for ranking short cuts. Let me share with you some of my more interesting findings:

You get the idea. There are some amazing assertions about getting a particular Web site to the top of the Google results list. Several observations may not be warranted, but here goes:

First, writing, even planning, high-impact, useful content is difficult. I’m not sure if it is a desire for a short cut, a lack of confidence, laziness, or inadequate training. There’s a content block in some organizations, so SEO is the way to solve the problem.

Second, Web sites can fulfill any need their owners may have. The problem is that certain types of businesses will have a heck of a time appearing at the top of a results list for a general topic. Successful, confident people expect a Web indexing system to fall prey to their charms as their clients do. Chasing a “number one on Google” ranking can be expensive and a waste of time. There are many “experts” eager to help make a Web site number one. But I don’t think the results will be worth the cost.

Third, there are several stress points in Web indexing. The emergence of dynamic sites that basic crawlers cannot index is a growing trend. Some organizations may not be aware that their content management system (CMS) generates pages that are difficult, if not impossible, for a Web spider to copy and crunch. Google’s programmable search engine is one response, and it has the potential to alter the relevance landscape if Google deploys the technology. The gold mine that SEO mavens have discovered guarantees that baloney sites will continue to plague me. Ads are sufficiently annoying. Now more and more sites in my results list are essentially valueless in terms of substantive content.

The editorial policy for most of the Web sites I visit is non-existent. The Web master wants a high ranking. The staff is eager to do mechanical fixes. Recycling content is easier than creating solid information.

The quick road to a high ranking falls off a cliff when a search system begins to slice and dice content, assign “quality” scores to the information, and build high-impact content pages. Doubt me? Take a look at this Google patent application, US20070198481, and let me know what you think.

Stephen Arnold, February 9, 2008

Taxonomy: Search’s Hula-Hoop®

February 8, 2008

I received several thoughtful comments on my Beyond Search Web log from well-known search and content processing experts (not the search engine optimization type or the MBA analyst species). These comments addressed the topic of taxonomies. One senior manager at a leading search and content processing firm referenced David Weinberger’s quite good book, Everything is Miscellaneous. My copy has gone missing, so join me in ordering a new one from Amazon. Taxonomy and taxonomies have attained fad status in behind-the-firewall search and content processing. Every vendor has to support taxonomies. Every licensee wants to “have” a taxonomy.

[Screenshot: Oracle Pressroom, February 2008]

This is a screen shot of the Oracle Pressroom. Notice that a “taxonomy” is used to present information by category. The center panel presents hot links by topics with the number of documents shown for each category. The outside column features a tag cloud.

A “taxonomy” is a classification of things. Let me narrow my focus to behind-the-firewall content processing. In an organization, a taxonomy provides a conceptual framework that can be used to organize the organization’s information. Synonyms for taxonomy include classification, categorization, ontology, typing, and grouping. Each of these terms can be used with broader or narrower meanings, but for my purpose, we will assume each can be used interchangeably. Most vendors and consultants toss these terms around as interchangeable Lego blocks in my experience.
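
In the narrow sense, a taxonomy is easy to show in code: a tree of terms in which tagging a document with a narrow term implies its broader terms. The categories below are invented for the sketch.

```python
# Minimal taxonomy sketch: a tree of terms in which tagging a document
# with a narrow term implies its broader terms. Categories are invented.

TAXONOMY = {
    "Finance": {
        "Accounts Payable": {},
        "Accounts Receivable": {},
    },
    "Legal": {
        "Contracts": {"NDAs": {}},
    },
}

def path_to(term, tree, trail=()):
    """Return the broader-to-narrower path ending at `term`, or None."""
    for node, children in tree.items():
        here = trail + (node,)
        if node == term:
            return here
        found = path_to(term, children, here)
        if found:
            return found
    return None

# Tagging a document "NDAs" implies "Legal > Contracts > NDAs".
print(" > ".join(path_to("NDAs", TAXONOMY)))
```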

A fad, as you know, is an interest that is followed for some period of time with intense enthusiasm. Think Elvis, bell bottoms, and speaking Starbuck’s coffee language.

A Small Acorn

A few years ago, a consultant approached me to write about indexing content inside an organization. This individual had embarked on a consulting career and needed information for her Web site. I dipped into my files, collected some useful information about the challenges corporate jargon presented, and added some definitions of search-related terms.

I did work for hire, so my client could reuse the information to suit specific needs. Imagine my pleasant surprise when I found my information recycled multiple times and used to justify a custom taxonomy for an enterprise. I was pleased to have become a catalyst for a boom in taxonomy seminars, newsgroups, and consulting businesses. One remarkable irony was that a person who had recycled the information I sold to consultant A thousands of miles away turned up as consultant B at a company in which I was an investor. I sat in a meeting and heard my own information delivered back to me as a way to orient me about classifying an organization’s information.

Big Oak

A taxonomy revolution had taken place, and I was only partially aware. A new industry had taken root, flowered, and spread like kudzu around me.

The interest in taxonomies continues to grow. After completing the descriptions of companies offering what I call rich content processing, organizations looking for taxonomy-centric systems have many choices. Of the 24 companies profiled in the Beyond Search study, all 24 “do” taxonomies. Obviously there are greater and lesser degrees of stringency. One company has a system that supports American National Standards Institute guidelines for controlled terms and taxonomies. Other companies “discover” categories on the fly. Between these two extremes there are numerous variations. One conclusion I drew after this exhausting analysis is that it is difficult to locate a system that can’t “do” taxonomies.

What’s Behind the Fad?

Let me consider briefly a question that I don’t tackle in Beyond Search: “Why the white-hot interest in taxonomies?”

Taxonomies have a long and distinguished history in library science, philosophy, and epistemology. For those of you who are a bit rusty, “epistemology” is the theory of knowledge. Taxonomies require a grasp, no matter how weak, on knowledge. No matter how clever, a person creating a taxonomy must figure out how to organize email, proposals, legal documents, and the other effluvia of organizational existence.

I think people have enough experience with key word search to realize its strengths and limitations. Key words — either controlled terms or free text — work wonderfully when I know what’s in an electronic collection, and I know the jargon or “secret words” to use to get the information I need.

Boolean logic (implicit or explicit) is not too useful when one is trying to find information in a typical corpus today. There’s no editorial policy at work. Anything the indexing subsystem is fed is tossed into an inverted index. This is the “miscellaneous” in David Weinberger’s book.

A taxonomy becomes a way to index content so the user can look at a series of headings and subheadings. A series of headings and sub-headings makes it possible to see the forest, not the trees. Clever systems can take the category tags and marry them to a graphical interface. With hyperlinks, it is possible to follow one’s nose — what some vendors call exploratory search or search by serendipity.
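
A sketch of the data behind such a browse view: category tags grouped and counted so the interface can show headings with document counts, much like the Oracle Pressroom example above. The tags and documents are invented.

```python
# Sketch of the data behind a category browse page: count documents per
# assigned category tag so the UI can list headings with counts.

from collections import Counter

TAGGED_DOCS = [
    ("press_release_01.html", "Acquisitions"),
    ("press_release_02.html", "Partnerships"),
    ("press_release_03.html", "Acquisitions"),
    ("press_release_04.html", "Products"),
]

counts = Counter(tag for _, tag in TAGGED_DOCS)
for category, n in counts.most_common():
    print(f"{category} ({n})")
# Acquisitions (2)
# Partnerships (1)
# Products (1)
```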

Taxonomy Benefits

A taxonomy, when properly implemented, yields payoffs:

First, users like to point-and-click to discover information without having to craft a query. Believe me, most busy people in an organization don’t like trying to outfox the search box.

Second, the categories — even when hidden behind a naked search box interface — are intuitively obvious to a user. An accountant may (as I have seen) enter the term finance and then point-and-click through results. When I ask users if they know specific taxonomy terms, I hear, “What’s a taxonomy?” Intuitive search techniques should be a part of behind-the-firewall search and content processing systems.

Third, management is willing to invest in fine-tuning a taxonomy. Unlike a controlled vocabulary, a suggestion to add categories meets with surprisingly little resistance. I think the usefulness of cataloging and categorizing is intuitively obvious to people who tell other people to do their searching for them.

Some Pitfalls

There are some pitfalls in the taxonomy game. The standard warnings are “Don’t expect miracles when you categorize modest volumes of content” and “Be prepared for some meetings that feel more like a graduate class in logic than a session on how to deliver what the marketing department needs in a search system.” Etc.

On the whole, the investment in a system that automatically indexes is a wise one. It becomes ever wiser when the system can use knowledge bases, word lists, taxonomies, and other information inputs to index more accurately.

Keep in mind that “smart” systems can be right most of the time and then without warning run into a ditch. At some point, you will have to hunker down and do the hard thinking that a useful taxonomy requires. If you are not sure how to proceed, try to get your hands on the taxonomies that once were available from Convera. Oracle, if I recall correctly, at one time offered vertical term lists. You can also Google for taxonomies. A little work will return some useful examples.

To wrap up, I am delighted that so many individuals and organizations have an interest in taxonomies — whether a fad or something epistemologically more satisfying. The content processing industry is maturing. If you want to see a taxonomy in action, check out:

HMV, powered by Dieselpoint

Oracle’s Pressroom, powered by Siderean Software’s system

US government portal powered by Vivisimo (Microsoft)

Stephen Arnold, February 8, 2008

No News. No, Really

February 8, 2008

A colleague (who shall remain nameless) sent me an email message at 0700 Eastern this morning, Friday, February 8, 2008. To brighten this gloomy day in Harrod’s Creek, he wrote: “Your site indeed does have excellent content. It could be more ‘newsy’ though, which would encourage people to come back daily.”

Blog Changes Coming

Great comment, and I have begun taking baby steps to improve this Web log. It’s not even a month old, and I know I have to invest more time in this beastie. In the next two weeks, I want to introduce a three-column format, and I will include links to news. Personally, I dislike news because it is too easy to whack out a “press release”, post it on a free news release distribution service, and sit back while the tireless Topix.net and Google.com bots index the document. Sometimes, a few seconds after the ping server spits out a packet, the “press release” turns up in one of my alert mechanisms. Instant news — sort of. Not for me, sorry.

The other change will be the inclusion of some advertising. My son, who runs a thriving Google-centric integration business, reminded me, “Dad, you are getting traffic. Use AdSense to generate some revenue.” As a docile father, I will take his suggestion. We will put the text ads in the outside column of the new “magazine” layout we will be using. I will also write an essay about what he is doing and why it is representative of the change taking place in fixing a “broken” search system. Not news. But if you are struggling with a multi-million dollar investment in a behind-the-firewall system that users find a thorn in their britches, you will want to know how to make the pain go away without major surgery. You won’t find this information on most search and content processing vendors’ Web sites. Nor will you get much guidance from the search “experts” involved in search engine optimization, shilling for vendors with deep pockets, or from analysts opining about the “size” of the market for “enterprise search”, whatever that phrase means. You will get that information here, though. No links in this paragraph. I don’t want hate mail.

Let me be perfectly clear, as our late, beloved President Richard Nixon used to say as an audio filler: “The content of this Web log is NOT news.” If you look at the posts, I have been using this Web log to contain selected information I couldn’t shoehorn into my new study Beyond Search: What to Do When Your Search System Doesn’t Work (in press at Gilbane Group now, publication date is April 2008).

Some News Will Creep In

It is true that I have commented on some information that is garnering sustained financial and media attention; for example, the issues that must be satisfactorily resolved if Microsoft succeeds in its hostile takeover of Yahoo. Unless a white knight gallops into Mountain View, California, soon, Micro-Hoo will be born. I’ve also made a few observations about the Microsoft – Fast tie up. Although only 1/36th the dollar amount of the Yahoo deal, the Microsoft – Fast buyout is interesting. The cultural, technical, social, and financial issues of this deal are significant. My angle is that Fast Search & Transfer was the company that turned its back on Web indexing and advertising at the moment Google was bursting from the starting blocks. Fast Search’s management sold its Web search site AllTheWeb.com and its advertising technology at the moment Google embraced these two initiatives. We have, therefore, a fascinating story with useful lessons about a single decision’s impact. Google’s market cap today is $157.9 billion; Fast Search is valued at $1.2 billion. I think this decision is a pivotal event in online search and content processing.

In my opinion, Fast Search was in 2003 – 2004 the one company with high-speed indexing and query processing technology comparable to Google’s. When Fast Search & Transfer withdrew from Web indexing, Google had zero significant competition. AltaVista.com had fallen off the competitive radar under Hewlett-Packard’s mismanagement.

Fast Search bet the farm on “enterprise search”. Google bet on Web search and the advertising sector. Here’s a what-if question to consider: “What if Fast Search had fought head-to-head with Google?” Perhaps a business historian will dig into this more deeply.

I have several posts in the works that are definitely non-news. Later today, I will offer some observations about today’s taxonomy fad. I have some information about social search, and I put particular emphasis on a weakness of this sub-species of information retrieval; namely, that the leader is Google. The other folks are doing “social search” roughly the same way a high school soccer team plays football against the Brazilian national team. The game is the same, but the years of experience the Brazilians have translate into an easy win. I think this is a controversial statement. Is it news? No.

Stealth Feature to Début

The revamped Web log will include a “stealth feature”. (In Silicon Valley speak, this means something anyone can do but that, when kept secret, becomes a “secret”.) I don’t want to let the digital cat out of the Web bag yet, but you will be able to get insight into how some of the major search and content processing systems developed. I will post some original information on my archive Web site and summarize the key points in a Web log posting in this forum.

We have been getting an increasing number of off-topic comments. I’m deleting these. My editorial policy is that substantive comments germane to search and content processing are fine. You may disagree with me, explain a point in a different way, or provide supplemental information. Using the comments section to get a person to buy stolen software, acquire Canadian drugs, connect with Clara (a popular spammer surname) for a good time, or the other wacko stuff is out of bounds.

Okay, Here’s Some Real News

For the news mavens, I’ve included some hot links in this announcement of non-news. Here’s one to Google’s version of Topix.net’s local news service. (Hurry, these newsy links go dead pretty quickly, which is one reason I don’t do news.) You don’t need me to tell you what this means to Topix.net. It seems that when you search for Arnoldit on Google, the Govern-ator comes up first. Now that’s an interesting twist in Google’s relevancy algorithm.

Stephen Arnold, February 8, 2008

When Turkeys Marry

February 7, 2008

In 2006, I was invited to attend a dinner with Robert Scoble in London, England. For two hours, I sat and listened. As the trajectory of the dinner took shape, I concluded that I should generally agree with his views. Now, 14 months removed from that instructive experience, I’m going to agree with him about the proposed Microsoft – Yahoo tie up.

I enjoyed “What You All Are All Missing about Google.” The best sentence in the essay is, in my opinion: “As I said on Channel 5 news on Friday night: put two turkeys together and you don’t get an eagle.” [Emphasis added.] Microsoft is one turkey. Yahoo is the other turkey. Together — no eagle. I agree with him.

I also agree in a tangential way with the statement attributed to an SEO guru: “Danny Sullivan told me that this deal is all about search.” [Emphasis added.] Let me offer what the French call pensées.

Turkeys

Turkeys, as I recall from my days near the farm yard, are one of the few birds rumored to drown in a rainstorm. The folk tale is that when a turkey looks up at the rain and forgets to button its beak, the turkey drowns. Turkeys are not the rocket scientists of the bird world, but, in my experience, turkeys aren’t that dumb. When Thanksgiving came around, my aunt and her hatchet were watched closely by the turkeys in the family’s farm yard. Turkeys knew something was about to happen.

Both firms are profitable. Both have some good products. Both companies have thousands of extremely bright people. The reason I’m uncomfortable (note that I am not disagreeing) is that each company has its own entrenched technological characteristics and its own distinctive culture. Turkey may be too strong a characterization.

Google’s View of the Deal

I agree that Google can benefit from Microsoft’s acquisition of Yahoo. Mr. Scoble says: “Google stands to gain HUGE by slowing down this deal.” To add one small nuance to Mr. Scoble’s statement: I find Google’s thoughts and actions difficult to predict. My infrequent yet interesting interactions with the company have given me sufficient data to conclude that Google is a strategy mongrel. Some of its actions follow a plan; others are manifestations of spontaneous decisions and “controlled chaos.” Perhaps Google is working, on one front, with deliberate thrusts into five or six key business sectors. On the other hand, Google is reacting quickly with the suggestion that a Microsoft – Yahoo tie-up will be the end of the Web as we know it.

Email

I agree email is not where the money is. But Google has devised several mechanisms to monetize electronic communications, of which email is one. Google may have several ways to turn “hard to monetize” email into “less hard to monetize” email. I expect rapid-fire testing and adapting. I want to wait and see what the Google does.

Instant Messaging: An Email Variant

Instant messaging has been difficult to monetize. IM is a variant of email. Perhaps the communication modes that today seem distinct will blur going forward?

Search

I agree that the Microsoft – Yahoo deal is about search. May I respectfully suggest that the major chord in this opera is the platform? Search is but one application running on that platform. I articulate this idea in my two Google studies: 2005’s The Google Legacy: How Search Became the Next Application Platform and 2007’s Google Version 2.0: The Calculating Predator. If Google is a platform, does it follow that Google could disrupt other industry sectors as it has snarled competitive search engines’ horsepower?

Gold in Google’s Pockets

I agree that Google benefits by a deal delay. I agree that Google benefits if Microsoft – Yahoo get married. Google may be vulnerable no matter what Microsoft does. Perhaps I will discuss some of these exogenous factors in another post.

Stephen Arnold, February 7, 2008

Requirements for Behind-the-Firewall Search

February 5, 2008

Last fall, I received a request from a client for a “shopping list of requirements for search.” The phrase shopping list threw me. My wife gives me a shopping list and asks me to make sure the tomatoes are the “real Italian kind”. She’s a good cook, but I don’t think she worries about my getting a San Marzano or an American genetically-engineered pomme d’amour.

Equating a shopping list with requirements for a behind-the-firewall search / content processing system gave me pause. As I beaver away, gnawing down the tasks remaining for my new study Beyond Search: What to Do When Your Search System Won’t Work, I had a mini-epiphany; to wit:

Getting the requirements wrong can
undermine a search / content processing system.

In this essay, I want to make some comments about requirements for search and content processing systems. I’m not going to repeat the more detailed discussion in The Enterprise Search Report, 1st, 2nd, and 3rd editions, nor will I recycle the information in Beyond Search. I propose to focus on the tendency of very bright people to treat search and content processing requirements like check-off items on a house inspection. Then I want to give one example of how a perceptual mismatch on requirements can cause a search and content processing budget to become a multi-year problem. To conclude the essay, I want to offer some candid advice to three constituencies: the customer who licenses a search / content processing solution, the vendor who enters into a deal with that customer, and the consultants who circle like buzzards.

Requirements

To me, a requirement is a clear, specific statement of a function a system should perform; for example, a search system should process the following file types: Lotus Notes, Framemaker, and DB2 tables.
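
To make this concrete, here is a minimal sketch, entirely my own illustration rather than anything lifted from an actual procurement, of how such a file-type requirement could be expressed as a simple, testable check against a vendor’s stated capabilities. The file types and the vendor example are hypothetical.

```python
# A hedged, hypothetical sketch: one requirement expressed as a testable check.
# The required file types and the vendor capability set are illustrative only.

REQUIRED_FILE_TYPES = {"Lotus Notes", "Framemaker", "DB2 tables"}

def meets_file_type_requirement(vendor_supported):
    """Return True only if the vendor covers every required file type."""
    return REQUIRED_FILE_TYPES.issubset(vendor_supported)

# Example: a vendor that omits DB2 tables fails this requirement.
vendor_a = {"Lotus Notes", "Framemaker", "PDF", "HTML"}
print(meets_file_type_requirement(vendor_a))  # False
```

The point is not the code; it is that a requirement phrased this concretely can be verified, while a vague wish cannot.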

How does one arrive at a requirement and then develop a list of requirements?

Most people develop requirements by combining techniques. Here’s a short list of methods that I have seen used in the last six months:

  • Ask users of a search or content processing system what they would like the search system to do
  • Look at information from vendors who seem to offer a solution similar to the one the organization thinks it wants
  • Ask a consultant, sometimes a specialist in a discipline only tangentially related to search.

The Fly-Over

My preferred way of developing requirements is more mundane, takes time, and is resistant to short cuts. The procedure is easy to understand. The number of steps can be expanded when the organization operates in numerous locations around the world, processes content in multiple languages, and has different security procedures in place for different types of work.

But let’s streamline the process and focus on the core steps. When I was younger, I guarded this information closely. I believed knowing the steps was a key ingredient for selling consulting. Now, I have a different view, and I want you to know what I do for the simple reason that you may avoid some mistakes.

First, perform a data gathering sweep. In this step you will be getting a high-level or general view of the organization. Pay particular attention to these key areas. Any one of them can become a search hot spot and burn your budget, schedule, and you with little warning:

  • Technical infrastructure. This means looking at how the organization handles enterprise applications now, what the hardware platform is, what the workload on the present technical staff is, how the organization uses contractors and outsourcing, what the present software licensing deals stipulate, and what the budget is. I gather these data by circulating a data collection form electronically or by using a variety of telephonic and in-person meetings. I like to see data centers and hardware. I can tell a lot by looking at how the cables are organized and from various log files, which I peruse on site with the customer’s engineer close at hand to explain a number or entry to me (a small sketch of the kind of quick log tally I run appears after this list). The key point of the exercise is to understand whether the organization is able to work within its existing budget and keep the existing systems alive and well.
  • User behavior. To obtain these data, I use two methods. One component is passive; that is, I walk around and observe. The other component is active; that is, I set up brief, informal meetings where people are using systems and ask them to show me what they now do. If I see something interesting, I ask, “What caused you to take that action?” I write down my observations. Note that I try to get lower-level employees’ input about needs before I talk to too many big wheels. This is an essential step. Without knowing what employees do, it is impossible to listen accurately to what top managers assert.
  • Competitive arena. Most organizations don’t know much about what their competitors do. In terms of search, however, most organizations are willing to share some basic information. I find that conversations at trade shows are particularly illuminating. But another source of excellent information is search vendors. I admit that I can get executives on the telephone or by email pretty easily, but anyone can do that with some persistence. I ask general questions about what’s happening of interest in, say, law firms or ecommerce companies. I then combine that information with data I maintain. From these two sources, I can develop a reasonable sense of what type of system is likely to be needed to keep Company A competitive with Company B.
  • Management goals. I try to get a sense of what management wants to accomplish with search and content processing. I like to hear from senior management, although most senior managers are out of touch with the actual information procedures and needs of their colleagues. Nevertheless, I endure discussions with the brass to get a broad calibration. Then I use two techniques to get data about needs from mid-level managers. One technique is a Web survey. I use an online questionnaire and make it available to any employee who wishes to participate. I’m not a fan of long surveys. A few pointed questions deliver the freight of meaning I need. More importantly, survey responses can be counted and used as objective data about needs. The second technique is various types of discussion. I like one-on-one meetings; I like small-group meetings; and I like big government-style meetings with 30 people sitting around a chunk of wood big enough to make a yacht. The trick is to have a list of questions and the ability to make everyone comment. What’s said is important, but how people react to one another can speak volumes and indicate who really has a knack for expressing a key point for his / her co-workers.
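
Here is the log-tally sketch mentioned in the infrastructure item above. It is a hedged illustration only: it assumes a generic log whose lines begin with an ISO-style date (YYYY-MM-DD), and the file name query.log is hypothetical. A real engagement works with whatever formats the customer’s systems actually produce.

```python
# A hedged, hypothetical sketch: tallying log lines per day to get a rough
# sense of system activity. The path and the date format are assumptions.

from collections import Counter

def daily_counts(log_path):
    """Count log lines per day, assuming each line starts with YYYY-MM-DD."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            date = line[:10]  # e.g., "2008-02-05"
            if len(date) == 10 and date[4] == "-" and date[7] == "-":
                counts[date] += 1
    return counts

if __name__ == "__main__":
    for day, total in sorted(daily_counts("query.log").items()):
        print(day, total)
```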

I take this information and data, read it, sort it, and analyze it. The result is the intellectual equivalent of a bookcase. The supports are the infrastructure. Each of the shelves consists of the key learnings from the high-level look at the organization. I don’t know how much content the organization has. I don’t know the file types. I don’t have a complete inventory of the enterprise applications into which the search and content processing system must integrate. What I do know is whom to call or email for the information. So drilling down to get a specific chunk of data is greatly simplified by the high-level process.

Matching

I take these learnings and the specific data, such as the list of enterprise systems to support, and begin what I call the “matching sequence.” Here’s how I do it. I maintain a spreadsheet with the requirements from my previous search and content processing jobs. Each of these carries a short comment and a code that identifies the requirement by availability, stability, and practicality. For example, many companies want NLP or natural language processing. I code this requirement as Available, Generally Stable, and Impractical. You may disagree with my assessment of NLP, but in my experience few people use it, and it can add enormous complexity to an otherwise straightforward system. In fact, when I hear or identify jargon in the fly-over process, my warning radar lights up. I’m interested in what people need to do a job or to find on-point information. I don’t often hear a person in accounting asking to run a query in the form of a complete sentence. People want information in the most direct, least complicated way possible. Writing sentences is neither easy nor speedy for many employees working on a deadline.

What I have after working through my list of requirements and the findings from the high-level process is three lists of requirements. I keep definitions or mini-specifications in my spreadsheet, so I don’t have to write boilerplate for each job. The three lists, with brief comments, are below; a small sketch of how the coding and screening might look in practice follows the list:

  • Must-have. These are the requirements that the search or content processing system must satisfy in order to meet the needs of the organization, based on my understanding of the data. A vendor unable to meet a must-have requirement is, by definition, excluded from consideration. Let me illustrate. Years ago, a major search procurement stipulated truncation, technically lemmatization. In plain English, the system had to discard inflections, a capability sometimes called rearward truncation. One vendor wrote an email saying, “We will not support truncation.” The vendor was disqualified. When the vendor complained about the disqualification, I showed the vendor the email. Silence fell.
  • Options. These are requirements that are not mandatory for the deal, but the vendor should be able to demonstrate that they can be implemented if the customer requests them. A representative option is support for double-byte languages; e.g., Chinese. The initial deployment does not require double-byte support, but the vendor should be able to implement it upon request. A vendor who does not have this capability is on notice that if he / she wins the job, a request for double-byte support may be forthcoming. The wise vendor will make arrangements to support this request. Failure to implement the option may result in a penalty, depending on the specifics of the license agreement.
  • Nice-to-have. These are the Star Trek or science fiction requirements that run through procurements like fat through a well-marbled steak. A typical Star Trek requirement is that the system deliver 99 percent precision and 99 percent recall or deliver automatic translation with 99 percent accuracy. These are well-intentioned requests but impossible with today’s technology and the budgets available to organizations. Even with unlimited money and technology, it’s tough to hit these performance levels.
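
As promised above, here is a minimal sketch of how the spreadsheet coding and the must-have screen could be mechanized. Everything in it, the tiers, the sample requirements, the vendor names, and the capability lists, is a hypothetical illustration of the approach, not data from any real procurement.

```python
# A hedged, hypothetical sketch of the matching sequence: requirements are
# coded by availability, stability, and practicality, grouped into three
# tiers, and vendors are screened against the must-have tier.

from dataclasses import dataclass

@dataclass
class Requirement:
    name: str
    tier: str          # "must-have", "option", or "nice-to-have"
    availability: str  # e.g., "Available"
    stability: str     # e.g., "Generally Stable"
    practicality: str  # e.g., "Practical" or "Impractical"

REQUIREMENTS = [
    Requirement("Truncation / lemmatization", "must-have", "Available", "Generally Stable", "Practical"),
    Requirement("Double-byte language support", "option", "Available", "Generally Stable", "Practical"),
    Requirement("99 percent precision and recall", "nice-to-have", "Not available", "Unproven", "Impractical"),
]

def screen_vendor(vendor_name, supported):
    """Disqualify any vendor that misses a must-have requirement."""
    for req in REQUIREMENTS:
        if req.tier == "must-have" and req.name not in supported:
            print(f"{vendor_name}: disqualified, missing {req.name}")
            return False
    print(f"{vendor_name}: passes the must-have screen")
    return True

# Example: a vendor that declines to support truncation is out of the running.
screen_vendor("Vendor A", {"Double-byte language support"})
screen_vendor("Vendor B", {"Truncation / lemmatization", "Double-byte language support"})
```

The real work, of course, is in deciding the tiers and the codes; the mechanics are trivial.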

Creating a Requirements Document

I write a short introduction to the requirements, create a table with the requirements and other data, and provide it to the client for review. After a period of time, it’s traditional to bat the draft back and forth, making changes on each volley. At some point, the changes become trivial, and the document is complete. There may be telephone discussions, face-to-face meetings, or more exotic types of interaction. I’ve participated in a requirements wiki, and I found the experience thrilling for the 20-somethings at the bank and enervating for me. That’s what a 40-year age difference yields: an adrenaline rush for the youngsters and a dopamine burst for the geriatric.

There are different conventions for a requirements document. The US Federal government calls a requirements document “a statement of work”. There are standard disclaimers, required headings for security, an explanation of what the purpose of the system is, the requirements, scoring, and a mind-numbing array of annexes.

For commercial organizations, the requirements document can be an email with the following information:

  • Brief description of the organization and what the goal is
  • The requirements, a definition, the metrics for performance or a technical specification for the item, and an optional comment
  • What the vendor should do with the information; that is, do a dog-and-pony show, set up an online demonstration, make a sales call, etc.
  • Whom to call for questions.

Whether you prefer the bureaucratic route or a Roman road builder method, you now have your requirements in hand.

Then What?

That is a good question. In go-go organizations, the requirements document is the guts of a request for proposal. Managing an RFP process is a topic for another post. In government entities, the RFP may be preceded by an RFI or Request for Information. When the vendors provide information, a cross-matching of the RFI information with the requirements document (SOW) may be initiated. The bureaucratic process may take so long that the fiscal year ends, funding is lost, and the project is killed. Government work is rewarding in its own way.

Whether you use the requirements to procure a search system or whether you put the project on hold, you have a reasonably accurate representation of what a search / content processing system should deliver.

The fly-over provides the framework. The follow-up questions deliver detail and metrics. The requirements emerge from the analysis of this information and these data. The requirements are segmented into three groups, with the wild and crazy requirements relegated to the “nice to have” category. The customer can talk about these, but no vendor has to be saddled with delivering something from the future today. The requirements document can be the basis of a procurement.

There are some pitfalls in the process I have described. Let me highlight three:

First, this procedure takes time, expertise, and patience. Most organizations lack adequate amounts of each ingredient. As a result, requirements are off-kilter, so the search system can list or sink. How can a licensee blame the vendor when the requirements are wacky?

Second, the analysis of the data and information is a combination of analytic and synthetic investigation. Most organizations prefer to use their existing knowledge and gut instinct. These may be outstanding resources, but in my experience the person who relies on them is guessing. In today’s business climate, guessing is not just risky. It can severely damage an organization. Think about a well-known pharmaceutical company pushing a drug to trial even though the company’s own prior research showed negative side effects. That’s one consequence of a lousy behind-the-firewall search / content processing system.

Third, requirements are technical specifications. Today, people involved in search want to talk about the user interface. The user interface manifests what is in the system’s index. The focus, therefore, should not be on the Web 2.0 color and features of the interface. The focus must be kept squarely on the engineering specifications for the system.

You can embellish my procedure. You can jiggle the sequence. You may be able to snip out a step or a sub-process. But if you jump over the hard stuff in the requirements game, you will deploy a lousy system, create headaches for your vendor, annoy, even anger, your users, and maybe lose your job. So, get the requirements right. Search is tough enough without starting off on the wrong foot.

Stephen Arnold, February 6, 2008

 
