Hit Boosting: SEO for Intranet Search Systems
February 5, 2008
The owner of a local marketing company asked me, “Is there such a thing as SEO for an in-house search system?”
After gathering more information about his client, a large health care organization, I can state without qualification, “Yes.”
Let’s define some terms, because the acronym SEO primarily refers to techniques used to get a public Web page to appear at the top of a results list. Once this definition is behind us, I want to look at three situations (not exhaustive but illustrative) in which you would want to use SEO for behind-the-firewall search. To wrap up, I will present three techniques for achieving SEO-type “lift” on an Intranet.
I’m not going to dig too deeply into the specific steps for widely-used search systems. I want to provide some broad guidance. I decided to delete this information from my new study Beyond Search: What to Do When Your Search System Doesn’t Work in order to keep the manuscript a manageable size.
SEO and its variants are becoming more and more important, and I have been considering a short monograph on this topic. I implore the SEO gurus, genii, and mavens to spare me their brilliant insights about spoofing Google, Live.com, and Yahoo. I am not interested in deceiving a public Web search engine. Anyway, my comments aren’t aimed at the public indexing systems. We’re talking about indexing information on servers that live behind a firewall.
Definition Time
SEO means “search engine optimization.” In my view, this jargon should be used exclusively for explaining how a Web master can adjust Web pages (static and dynamic) to improve a site’s ranking in a results list. The idea behind SEO is to make editorial and coding changes so a Web page buried on results page 12 appears on results page 1 even though the content on the Web page doesn’t warrant that high rank. A natural high rank can be seen with this query: go to Google and search for “arnoldit google patents”. My Web site should be at or near the top of the results list. SEO wizards want to make this high ranking happen — not by content alone — but with a short cut or trick. SEO often aims to exploit idiosyncrasies in a search system’s indexing and ranking procedures. If you want to see a list of about 100 factors that Google allegedly used in the 2004-2005 time period, get a copy of my The Google Legacy. I include a multi-page table and some examples. But the thought of distorting relevancy procedures makes me queasy.
When you want to make sure specific content appears on a colleague’s behind-the-firewall results page, you are performing hit boosting. The idea behind “hit boosting” is that certain organizational content will not appear on a colleague’s results page because it is too new, too obscure, or set forth in a manner that a behind-the-firewall content processing system cannot figure out.
An example from my files is a memo whose text reads, in its entirety, “ATTN: Fire Drill at 3 PM. Mandatory. Susan.” Not surprisingly, you would have to be one heck of a search expert to find this document even if you knew it existed. With the latency in most behind-the-firewall content processing systems, this memo may not be in the index until the fire drill is over and forgotten.
To get this message in front of your colleagues, you need “hit boosting”. Some information retrieval experts just say “boosting” to refer to this function.
What Needs Boosting?
Let me give you an example. A vice president of the United States wanted his home page to come up at the top of a results list on various Federal systems. One system — used by the 6,000 officials and staff of the US Senate — did not index the Veep’s Web content. The only way to make the site appear was to do “hit boosting.” The reason had nothing to do with relevance, timeliness, or any query. The need for hit boosting was pragmatic. A powerful person wanted to appear on certain results pages. End of story. You may find yourself in a similar situation. If you haven’t, you probably will.
A second example is an expansion of the emergency notification about the fire drill. Your colleagues in certain departments — HR, legal, and accounting in my experience — tell you that certain information must be displayed for all employees. I dislike categorical affirmatives, but these folks wallow in them. Furthermore, to these folks the distinction between a search system, a portal, and a newsfeed is “too much detail”. Some search system vendors have added components to make this news push task easier.
A third example is a very important document that cannot be located. There are many reasons for this. Some systems may perform only key word indexing. The terminology of the document may be very complex, even arcane. A person looking for this type of legal, scientific, technical, or medical document cannot locate it unless he or she knows the specific terminology used in the document. Searching by more general concepts buries the document in a lengthy result list or chops off the least relevant documents, displaying only the 20 most relevant. Some systems routinely reject documents if they exceed certain word counts, contain non-text objects, or are in an unsupported file format. Engineers are notorious for spitting out a drawing with a linked spreadsheet containing the components in the drawing and snippets of text explaining in geek-speak a new security system.
To recap, you have to use “hit boosting” to deal with requests from powerful people, display content to employees whether those employees have searched for the information or not, or manipulate certain types of information to make it findable.
In my work, the need for “hit boosting” is increasing. The need rises as the volume of digital information goes up. The days of printing out a message and putting it on the bulletin board by the cafeteria are fast disappearing.
How to Do It
There are three basic techniques for “hit boosting”. I am going to generalize, not select a single system such as the Google Search Appliance or Vivisimo’s system. Details vary by system, but the broad principles I summarize should work.
First, you create a custom search query and link it to an icon, image, or chunk of text. When the user clicks the hot link, the system runs the query and displays the content. For example, you can use the seal of the vice president, use hover text that says, “Important Information from the Vice President”, and use a hot link on text that says, “Click here.” Variations of this approach include what I call “CSS tweaking” accompanied with an iFrame. The idea is that on any results page, you force the information of the moment in front of the user. If this seems like a banner ad or an annoying Forbes splash message, you are correct. The idea is that you don’t fool around with your enterprise search system. You write code to deliver exactly what the powerful person wants. I know this is not “search”, but most powerful people don’t know search from a Queensland cassowary. When you do a demo, the powerful one sees what he / she expects to see.
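To make this first technique concrete, here is a minimal sketch in Python. The canned queries, the URL scheme, and the run_query callable are hypothetical stand-ins; every search system exposes this differently, so treat it as an illustration, not a recipe for any particular product.

```python
# Technique one, sketched: a hot link mapped to a canned query.
# BOOSTED_QUERIES, the slug scheme, and run_query are invented for
# illustration; substitute your search system's actual API.

BOOSTED_QUERIES = {
    "vp-home": '"office of the vice president" site:vp.example.gov',
    "fire-drill": 'title:"fire drill" date:today',
}

def render_boost_link(slug: str, label: str, hover: str) -> str:
    """Emit the clickable element; the user never types a query."""
    return f'<a href="/boost/{slug}" title="{hover}">{label}</a>'

def handle_boost_click(slug: str, run_query) -> list:
    """Resolve the slug to its canned query and run it verbatim.

    Because the query is pre-built, the boosted content appears on
    the results page regardless of what the user would have typed.
    """
    return run_query(BOOSTED_QUERIES[slug])
```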
Second, you read the documentation for your search engine and look for the configuration file(s) that control relevance. Some high-end search systems allow you to specify conditions or feed “rules” to handle certain content. The trick here is to program the system to make certain content relevant regardless of the user’s query. If you can’t find the config file or a specific relevance control panel, then you use the search system’s API. Write explicit instructions to get content from location A and display it at location B. You may end up with an RSS hack to refresh the boosted content pool, so expect to invest some time mucking around to get the effect you want. Because vendor documentation is often quite like a haiku, you will be doing some experimenting. (Remember: don’t do this on a production server.) You can also hack an ad display widget into your results page. With this approach, your boosted content is handled as an ad.
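Here is a minimal sketch of this second technique, assuming a post-processing hook on the results pipeline. The rule format is invented; real systems put this logic in a relevance config file, a rules feed, or API calls, each with its own syntax.

```python
# Technique two, sketched: boost rules applied to every result set.
# The (trigger, document) rule format is hypothetical.

BOOST_RULES = [
    ("fire drill", "memo-fire-drill-3pm"),  # boost when the query mentions the term
    ("*", "hr-mandatory-notice"),           # "*" boosts on every query
]

def apply_boost_rules(query: str, results: list) -> list:
    """Prepend boosted documents to the organic hits, de-duplicated."""
    boosted = [doc for term, doc in BOOST_RULES
               if term == "*" or term in query.lower()]
    organic = [doc for doc in results if doc not in boosted]
    return boosted + organic
```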
Third, you take the content object and rework it into a content type the search system can manipulate. Then you add lots of metadata to this reworked document. You are doing what SEO mavens call keyword stuffing or term stuffing. With some experimentation, you can make one or more documents appear in the context you want. Once you have figured out the right combination of terms to stuff, you can automate this process and “inject” these additional tags into any document you want to boost. (The manual hit boosting techniques should be automated as soon as you know the hack won’t cause other problems.)
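A minimal sketch of this third technique appears below. The document format and the stuffed terms are hypothetical; the point is that once experimentation tells you which terms produce lift, the injection step is mechanical and easy to automate.

```python
# Technique three, sketched: automated term stuffing before indexing.

STUFF_TERMS = ["fire drill", "emergency", "evacuation", "mandatory", "safety"]

def inject_boost_terms(doc: dict, terms: list) -> dict:
    """Append boost terms to the metadata fields the indexer weights."""
    doc.setdefault("keywords", []).extend(terms)
    doc["description"] = (doc.get("description", "") + " " + " ".join(terms)).strip()
    return doc

memo = {"title": "ATTN: Fire Drill at 3 PM", "body": "Mandatory. Susan."}
boosted = inject_boost_terms(memo, STUFF_TERMS)  # now findable by broader queries
```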
Wrap Up
Hit boosting is an important task for system administrators and the politically-savvy managers of a behind-the-firewall search system. If you have other tricks and techniques, please, post them so others can learn.
Stephen Arnold, February 5, 2008
Simple Math = Big Challenge: MSFT & YHOO
February 4, 2008
I have only a few sections of Beyond Search to wrap up. Instead of being able to think about updating my description of Access Innovations’ MAIstro, I am distracted by jibber jabber about the Microsoft (NSDQ:MSFT) Yahoo (NSDQ:YHOO) tie up.
Where We Are
First, it’s an offer, isn’t it? Maybe a trial balloon? No cash and stock have changed hands as I write this in the wee hours of Monday, February 4, 2008. Yet, many are in a frenzy over a hostile takeover. Think about this word “hostile.” It means antagonistic, unfriendly, enemy. The reason for the bold move? Google, a company that has outfoxed Microserfs and Yahooligans for almost a decade.
The number of articles in my various alerts, RSS feeds, and emails is remarkable. Worldwide, a Microsoft – Yahoo marriage (even if it is helped along with a shotgun) ignites folks’ imagination. Neither Microsoft nor Yahoo will be able to recruit tech wizards, one pundit asserts. Innovation in Silicon Valley will be forever changed, posits another. Sigh.
Sorry. I’m not that excited. I’m interested, but I’m too old, too pragmatic, and too familiar with the vagaries of acquisitions to jump up and down.
Judging from some grousing from Yahooligans, some Yahoo professionals aren’t too keen about working for Microsoft. I have had a hint that some Microsoft wizards aren’t too excited about fiddling with Yahoo’s mind-numbing array of products, services, technologies, search systems, partnerships, and research initiatives.
I think the root concern is trying to figure out how to fit two large operations together, a 1 + 1 = 3 problem. For example, there’s Yahoo Mail and Windows Live Hotmail; Yahoo Panama and Microsoft adCenter; and Yahoo News and Microsoft’s news services, etc., etc. One little-considered consequence is that Microsoft may end up owning more search systems than any other company. That’s a technology can of worms worthy of a separate essay.
I will tell you who is excited, and, please, keep in mind that this is my opinion. And, once I express my view, I want to offer another very simple (probably too simple for an MBA wizard) math problem. I will end this essay with my now familiar observations. Let’s begin.
Who Benefits?
This is an easy question to answer, and you will probably think that I am stating the obvious. Bear with me because the answer explains why some at Microsoft may not be able to get the right prescription for their deal bifocals. Without the right eye glasses, it’s tough to discern some smaller environmental factors obscured in the billion dollar fusillade fired at Yahoo’s board of directors’ meeting.
- Shareholders who can make some money with the Microsoft offer. When there’s money to be made, concerns about technology, culture, and market opportunity are going to finish last. Most shareholders don’t think about much other than the answers to two questions: “How much did I make?” and “What are the tax implications?”
- Investment bankers who earn money three ways on a deal of this magnitude. There are, of course, other ways for those in the financial loop to make money, but I’m going to focus on the ones that keep these professionals in blue suits, not orange jump suits. [a] Commissions. Where there is churn, there is a commission. For many investment advisors, buying and selling equals a bigger payday. [b] Bonuses. The mechanics of an investment banker’s bonus are complex. After all, it is a banker dealing with a fellow banker. Mere mortals should steer clear. The idea is simple. Generate churn or a fee, and you get more bonus money. The first three months of a calendar year is bonus and job hopping time on Wall Street. Anyone who can get a piece of the action for a big deal gets cash. [c] Involvement in a big deal acts like a huge electromagnet for more deals. Once Microsoft “thought” of the acquisition, significant positive input about the upside of the deal pours into the potential acquirer.
- Consultants. Once a big deal is announced, the consultants leap into action. The buyer needs analyses, advice, and strategic counsel. The buyer’s minions need tactical advice to answer such questions as “How can we maximize our tax benefits?” and “How can we pay for this with cheap money?” The buyer becomes hungry for advisors of every species. Blue-chip outfits like Bain, Booz Allen Hamilton, Boston Consulting Group, and McKinsey & Co. drool in eagerness to provide guidance on lofty strategy matters such as answering the question, “How can I maximize my pay-out?” And “What are the tax consequences of my windfall profit?” Tactical advisors from these firms can provide support on human resource issues and real estate leases, among other matters. In short, buyers throw money at “the problem” in order to be prepared to negotiate or find a better deal.
These three constituencies want the deal to go through. If Microsoft is the buyer, that’s fine. If another outfit with cash shows, that’s okay too. The deal now has a life of its own. Money talks. To get the money, these constituencies have no desire to help Microsoft “see” some of the gaps and canyons that must be traversed. Let’s turn to one practical matter and the aforementioned simple math. Testosterone and money — these are two ways to cloud perception and jazz logic.
More Simple Math
Let’s do a thought experiment, what some German philosophers call Gedankenexperiment. I am not talking about the proposed Microsoft – Yahoo deal, gentle attorneys.
Accordingly, we have two companies, Company Alpha and Company Beta; hereinafter, Company A(lpha) and Company B(eta), neither of which is a real company and neither of which should be construed as having any similarity with any company now in existence.
Company Alpha has a dominant position in a market and wants to gain a larger share of a newer, tangential market. Company A has a proven, well-tuned, aging business model. That business model is a variation on selling subscriptions and generating annuity income from renewals. Company A’s business model works this way. Company A offers a product and then, on a periodic basis, Company A makes a change to an existing product, assessing a fee for customers to get the “new” or “enhanced” version of the product (service).
The idea is that once a subscription base is in place, Company A can predict a certain amount of revenue from standing orders and new orders. Company A has an excellent, stable cash flow based on this well-crafted business model and periodic fee increases. Although there are environmental factors that put pressure on the proven business model, the customer base is large, and the business model continues to work in Company A’s traditional markets. Company A, aware of exogenous factors — for instance, the emergence of cloud computing and other non-subscription business models — has learned through trial and error that its subscription-based business model does not work in certain new markets. These new markets are potentially lucrative, representing “new” revenue and a threat to Company A’s existing revenue stream. Company A wants to acquire a company to increase its chances for success in the new and emerging markets. Company A’s goal is to [a] protect its existing revenue, [b] generate new revenue, and [c] prevent other companies from dominating the new market(s).
Company A has performed a rational market analysis. Company A’s management has determined that one company only — our Company B — represents a mechanism for achieving Company A’s goals. Company A, by definition, has performed its analyses through Company A’s “eye glasses”; that is, Company A’s proven business model and business culture. “Walking in another person’s moccasins” is easy to say and difficult, if not impossible, to do. Everyone views the world through his own experiential frame. Hence, Company A “sees” Company B as having characteristics, attributes, and capabilities that are, despite some acceptable risks, significant benefits to Company A. Having made this decision about the upside from buying Company B, the management of Company A becomes less able to accept alternative inputs, facts, information, perceptions, and opinions. Company A’s reasoning in its decision space is closed. Company A vivifies what William James called “a certain blindness.” The idea is that each person is “blind” in some way to reality that others can perceive.
The implications of “a certain blindness” in this hypothetical acquisition warrant further discussion:
Culture
Company A has a culture built around a business model that allows incremental product enhancements so that subscription revenue is generated. Company B has a business model built around acquisitions. Company A has a more or less homogeneous atmosphere engendered by the business model or what Company A calls the agenda. Company B is more like a loose federation of separate companies — what some MBAs might call a Ling-Temco-Vought framework. Each entity within Company B retains its own identity, enjoys wide scope of action, and preserves its own culture. “We do our own thing” characterizes these units of Company B. Company A, therefore, has several options to consider:
- Company A can leave Company B as it is. The plus is that not much will change in Company B’s operations in the short term. The downside is that the technical problems will not be resolved.
- Company A can impose its culture on Company B. You don’t need me to tell you that this will go over like the former Soviet Union’s intervention in Poland in the late 1950s.
- Company A can try to make changes gradually. (This is a variation of the option in bullet 2 and will simply postpone rebellion.)
Technology
Company A has a different and relatively homogeneous technology base. Company B has a heterogeneous technology base. Maintaining multiple, different systems is in general more costly than maintaining homogeneous ones. Upon inspection, the technical staff needed to maintain these different systems have specialized to deal with particular technical problems in the heterogeneous environment. Technical people can learn new skills, but this takes time and adds cost. Company A has to find a way to streamline technical operations, reduce costs, and not waste time achieving rationalization. There are at least two ways to do this:
- Shift to a single platform, ideally Company A’s
- Retrain existing staff to have broader technical skills. With Company B’s staff able to perform more generalized work, Company A can reduce headcount at Company B, thus streamlining work processes and reducing cost.
Competitive Arena
The desirable new market for Company A has taken on the characteristics of what I call a “natural monopoly.” When I reflect on notable events in American business history, I note monopolistic behavior. Some monopolies were spawned by force of will; for example, JP Morgan and finance (this guy bailed out the US Treasury) and Andrew Carnegie and steel (this fellow thought of libraries for little people after pistol-whipping his competitors and antagonists).
Other monopolies — like Bell Telephone and your local electric company — came into being because some functions are more appropriately delivered by one organization. Water and Internet search / advertising, for instance, are subject to such economies of scale, quality of service, and standardization. In short, these may be “natural monopolies” due to numerous demand and cost forces.
In our hypothetical example, Company A wants to enter a market which is coalescing and which now, based on my research, appears to be forming into a “natural monopoly”. The market’s nameless dominant competitor seems to be following a trajectory similar to that of the original Bell Telephone – AT&T life cycle.
Company A’s race, then, is against time and money. Untoward delay at any point going forward with regard to leveraging Company B means coming in second, maybe a distant second or losing out on the new market.
Instead of owning Park Place (a desirable property in the Parker Brothers’ game Monopoly), Company A ends up with Baltic and Mediterranean Avenues (really lousy properties in the Parker Brothers’ game). If Company A doesn’t get Company B, Company A is trapped in its old, deteriorating business model.
If Company A does acquire Company B, Company A has to challenge the competitor. Company B already has a five-year track record of being a day late and a dollar short. Company A, therefore, has to do everything in its power to make the Company B deal work, which appears to be an all-or-nothing proposition.
Now the math: Action by Company A = unknown, variable, escalating costs.
I told you math geeks would not like this analysis. Company A is betting the farm against long odds. Here’s why:
First, the cultures are not amenable to staff reductions or technological efficiencies; that is, using software and automation, not people, while increasing revenues. Company A, regardless of the money invested, cannot be certain of success. Company B’s culture – business model duality is investment insensitive. In short, money won’t close this gap. Company A’s resistance to cannibalizing its old, though still functioning, business model will be significant. Company A’s own employees will resist watching their money and jobs sacrificed to a greater good.
Second, the competitive space is now being captured by the increasingly monopolistic competitor. Unchallenged for some period of time, the monopolistic competitor enjoys momentum and a significant lead in refining its own business model.
In the lingo of Wall Street, Company A can’t get enough “oxygen”; that is, revenue, despite its best efforts to rein in the market leader.
Observations
If we assume a kernel of truth in my hypothetical analysis, we can now apply this hypothetical discussion to the Microsoft – Yahoo deal.
First, Microsoft’s business model (not its technology) is the company’s strength. The business model is also its Achilles’ heel. Just as IBM’s mainframe-centric view of the world made its executives blind to Microsoft, now Microsoft can’t perceive today’s world from outside the Microsoft business model. The Microsoft business model is perhaps the most efficient subscription-based revenue generator in history. But that business model has not worked in the new markets Microsoft covets, so the Yahoo deal becomes the “obvious” play to Microsoft’s management. Its obviousness makes it difficult for Microsoft to see other options.
Second, the Microsoft business model is woven into the company’s culture. Cultures are ethnocentric. Ethnocentricity often manifests itself in conflict. Microsoft will have to make prescient, correct cultural decisions quickly and repeatedly. Microsoft’s culture, however, does not typically evidence excellent, rapid-fire decision-making.
Microsoft seems to be putting the company in a situation guaranteed to spark conflict within its own walls, between itself and Yahoo, and between Microsoft and Google. This is a three-front war. Even those with little exposure to military history can see that the costs and risks of a three-front conflict will be high, open-ended, and difficult to estimate.
The hostile bid itself suggests that Microsoft could not catch Google without Yahoo; the notion that Microsoft can catch Google with the acquisition requires tremendous confidence in Microsoft’s management. I think Microsoft can make the deal work, but execution must be flawless and favorable winds must push Microsoft along.
If Google continues to race forward, Microsoft has to spend more money to implement efficiencies more quickly. The calculus of catching a moving target can trigger a cost crisis. If costs go up too quickly, Microsoft must fall back on its proven business model. Taking a step backward when resolving the calculus of catching Google is not a net positive.
As you read this essay, you are wondering, “How can this doom and gloom be real?” The buzz about the deal is mostly positive. If you don’t believe me, call your broker and ask him how much your mutual fund will benefit from the MSFT – YHOO tie up.
I’ve spent some time around money types, and I can tell you making money is akin to blood in the water for sharks.
I’ve also been acquired and done the acquiring. Regardless of being the buyer or being the bought, tie ups are tricky. The larger the stakes, the trickier the tie ups become. When the tie up is designed to halt the Google juggernaut, the calculus of time – cost is hard.
Please recall that I’m not saying a Microsoft – Yahoo tie up cannot stop Google. Making the tie up work, however, will be difficult.
Don’t agree? That’s okay. Use the comments to set me straight. I’m willing to listen and learn. Just don’t overlook my core points; namely, business models, cultures, and technologies. One final thought: don’t factor out the Google (NSDQ:GOOG).
Stephen Arnold, February 4, 2008
Lotsa Search at Yahoo!
February 3, 2008
Microsoft’s hostile takeover bid for Yahoo! did not surprise me. Rumors about Micro – hoo or Ya – soft have floated around for a couple of years. I want to steer clear of the newsy part of this takeover attempt, ignore the share-pumping behind the idea that Mr. Murdoch will step in to buy Yahoo, and sidestep Yahoo’s 11th hour “we’re not sure we want to sell” Web log posting.
I prefer to do what might be called a “catalog of search engines,” a meaningless exercise roughly equivalent to Homer’s listing of ships in The Iliad. Scholars are still arguing about why he included the information and, centuries later, continue trying to figure out who these guys were and why such an odd collection of vessels was necessary. You may have a similar question about Yahoo’s search fleet after you peruse this short list of Yahoo “findability” systems:
- InQuira. This is the Yahoo natural language customer support system. InQuira was formed from three smaller search outfits that ran aground. InQuira seems stable, and it provides NLP systems for customer support functions. Try it. Navigate to Yahoo. Click Help and ask a question, for example, “How do I cancel my premium mail account?” Good luck, but you have an opportunity to work with an “intelligent” agent who won’t tell you how to cancel a for-fee Yahoo service. When I learned of this deal, I asked, “Why don’t you just use Inktomi’s engine for this?” I didn’t get an answer. I don’t feel too bad. Google treats me the same way.
- Inktomi. Yahoo bought this Internet indexing company in 2002. We used the Inktomi system for the original US government search service, FirstGov.gov (now USA.gov). The system worked reasonably well, but once in the Yahooligans’ hands, not much was done with it, and Inktomi was showing its age. In 2002, Google was motoring, just drawing even with Yahoo. Yahoo seemed indifferent or unaware that search had more potential than Yahoo’s portal approach.
- Stata Labs. When Gmail entered semi-permanent beta, it offered two key features: first, one gigabyte of storage; second, the ability to search your mail. Yahoo couldn’t search email at all. The fix was to buy Stata Labs in 2004. When you use the Yahoo mail search function, the Stata system does the work. Again I asked, “Why not use one of your Yahoo search systems to search mail?” Again, no response.
- Fast Search & Transfer. Yahoo, through the acquisition of Overture, ended up with the AllTheWeb.com Web site. The spidering and search technology are operated by Fast Search & Transfer (the same outfit that Microsoft bought for $1.2 billion in January 2008). Yahoo trumpeted the “see results as you type feature” in 2007, maybe 2006. The idea was that as you key your query, the system shows you results matching what you have typed. I find this function distracting, but you may love it. Try it yourself here. I heard that Yahoo has outsourced some data center functions to Fast Search & Transfer, which, if true, contradicts some of the pundits who assert that Yahoo has its data center infrastructure well in hand. If so, why lean on Fast Search & Transfer?
- Overture. When Yahoo acquired Overture (the original pay-for-traffic service) in 2003, it got the ad service and the Overture search engine. Overture purchased AllTheWeb.com and ad technology from Fast Search & Transfer. When Yahoo bought Overture, Yahoo inherited Overture’s Sun Microsystems’ servers with some Linux boxes running a home brew fraud detection service, the original Overture search system, and the AllTheWeb.com site. Yahoo still uses the Overture search system when you look for key words to buy. You can try it here. (Note: Google was “inspired” by the Overture system, and paid about $1.2 billion to Yahoo to avoid a messy lawsuit about its “inspiration” prior to the Google IPO in 2004. Yahoo seemed happy with the money and did little to impede Google.)
- Delicious. Yahoo bought Delicious in 2005. Delicious came with its weird URL and search engine. If you have tried it, you know that it can return results with some latency. When it does respond quickly, I find it difficult to locate Web sites that I have seen. As far as I know, the Delicious system still uses the original Delicious search engine. You can try it here.
- Flickr. Yahoo bought Flickr in 2005, another cog in its social, Web 2.0 thing. The Flickr search engine runs on MySQL. At one trade show, I heard that the Flickr infrastructure and its search system were a “problem”. Scaling was tough. Based on the sketchy information I have about Yahoo’s search strategy, Flickr search is essentially the same as it was when it was purchased and is in need of refurbishing.
- Mindset. Yahoo, like Google and Microsoft, has a research and development group. You can read about their work on the recently redesigned Web site here. If you want to try Mindset, navigate to Yahoo Research and slide the controls. I’ve run some tests, and I think that Mindset is better than the “regular” Yahoo search, but it seems unchanged over the last six or seven months.
I’m going to stop my listing of Yahoo’s search systems, although I could continue with the Personals search, Groups search, News search, and more. I may comment on AltaVista.com, another oar in Yahoo’s search vessel, but that’s a topic that requires more space than I have in this essay. And I won’t beat up on Yahoo Shopping search. If I were a Yahoo merchant, I would be hopping mad. I can’t figure out how to limit my query to just Yahoo merchants. The results pages are duplicative and no longer useful to me. Yahoo has 500 million “users” but Web statistics are mushy. Yahoo must be doing something right as it continues to drift with the breeze as a variant of America Online.
In my research for my studies and journal articles, I don’t recall coming across a discussion of Yahoo’s many different search systems. No one, it seems, has noticed that Yahoo lacks an integrated, coherent approach to search. I know I’m not the only person who has observed that Yahoo cannot mount a significant challenge to Google.
As Google’s most capable competitor, Yahoo stayed out of the race. But it baffles me that a sophisticated, hip, with-it Silicon Valley outfit like Yahoo collected different search systems the way my grandmother collected weird dwarf figurines. My grandmother never did much with her collection, and I may have to conclude that Yahoo hasn’t done much with its collection of search systems either. The cost of licensing, maintaining, and upgrading a fleet of search systems is not trivial. What baffles me is why on earth Yahoo couldn’t index its own email. Why couldn’t Yahoo use one of its own search systems to index Delicious bookmarks and Flickr photos? Why does Yahoo have a historical track record of operating search systems in silos, thus making it difficult to rationalize costs and simplify technical problems?
Compared to Yahoo, Google has its destroyer ship shape — if you call squishy purple pillows, dinosaur bones, and a keen desire to hire every math geek with an IQ of 165 on the planet “ship shape”. But Yahoo is still looking for the wharf. As Google churned past Yahoo, Yahoo watched Google sail without headwinds to the horizon. Over the years, I’ve been in chit-chats with some Yahoo wizards. Let me share my impressions without using the wizards’ names:
- Yahoo believes that its generalized approach is correct, even as Google made search the killer app of cloud computing. Yahoo’s very smart people seem to live in a different dimension.
- Yahoo believes that its technology is superior to Google’s and Microsoft’s. When I asked about a Google innovation, Yahoo’s senior technologist told me that Yahoo had “surprises for Google.” I think the surprise was the hostile takeover bid last week.
- Yahoo sees its future in social, Web 2.0 services. To prove this, Yahoo hired economists and other social scientists. While Yahoo was recruiting, the company muffed the Facebook deal and let Yahoo 360 run aground. Yo, Yahoo, Google is inherently social. PageRank is based on human clicks and human-created Web pages. Google’s been social since Day One.
To bring this listing of Yahoo search triremes (ancient wooden war ships) to a close, I am not sure Microsoft, if it is able to acquire Yahoo, can integrate the fleet of search systems. I don’t think Mr. Murdoch can, given the MySpace glitches. Fixing the flotilla of systems at Yahoo will be expensive and time consuming. The catch is that time is running out. Yahoo appears to me to be operating on pre-Internet time. Without major changes, Yahoo will be remembered for its many search systems, leaving pundits and academics to wonder where they came from and why. Maybe these investigators will use Google to find the answer? I know I would.
Stephen Arnold, February 3, 2008
Search Frustration: 1980 and 2008
February 2, 2008
I have received two telephone calls and several emails about user satisfaction with search. The people reaching out to me did not disagree that users are often frustrated with these systems. I think the contacts were amplifications of how complex “getting search right” really is.
Instead of falling back on bell curves, standard deviations, and more exotic ways to think about populations, let’s go back in time. I want to then jump back to the present, offer some general observations, and then conclude with several of my opinions expressed as “observations”. I don’t mind push back. My purpose is to set forth facts as I understand them and stimulate discussion.
I’m quite a fan of Thucydides. If you have dipped into his sometimes stream-of-consciousness approach to history, you know that after a few hundred pages the hapless protagonists and antagonists just keep repeating their mistakes. Finally, after decades of running around the hamster wheel, resolution is achieved by exhaustion.
My hope is that with regard to search we arrive at a solution without slumping into torpor.
The Past: 1980
A database named ABI / INFORM (pronounced as three separate letters ay-bee-eye followed by the word inform) was a great online success. Its salad days are gone, but for one brief shining moment, it was white hot.
The idea for ABI (abstracted business information) originated at a university business school, maybe Wisconsin but I can’t recall. It was purchased by my friend Dennis Auld and his partner Greg Payne. There was another fellow involved early on, but I can’t dredge his name up this morning.
The database summarized and indexed journals containing information about business and management. Human SMEs (subject matter experts) read each article and wrote a 125-word synopsis. The SMEs paid particular attention to making the abstract meaty; that is, a person could read the abstract and get the gist of the argument and garner the two or three key “facts” in the source article. (Systems today perform automatic summarization, so the SMEs are out of a job.)
ABI / INFORM was designed to allow a busy person to quickly ingest the contents of a particular journal like the Harvard Business Review, or to collect some abstracts on a topic such as ESOPs (Employee Stock Ownership Plans) and quickly learn what was in the “literature” (a fancy word for current management thinking and research on a subject).
Our SMEs would write their abstracts on special forms that looked a lot like a 5″ by 8″ note card (about the amount of text on a single IBM mainframe green screen input form). SMEs would also enter the name of the author or authors, the title of the article, the source journal, and the standard bibliographic data taught in the 7th grade.
SMEs would also consult a printed list of controlled terms. A sample of a controlled term list appears below. Today, these controlled term lists are often called knowledge bases. For anyone my age, a list of words is pretty much a list of words. Flashy terminology doesn’t always make points easier to understand, which will be a theme of this essay.
Early in the production cycle, the index and abstract for each article would be typed twice: once by an SME on a typewriter and then by a data entry operator into a dumb terminal. This type of information manufacturing reflected the crude, expensive systems available a quarter century ago. Once the data had been keyed into a computer system, it was in digital form, proofed, and sent via eight-track tape to a timesharing company. We generated revenue by distributing the ABI / INFORM records via Dialog Information Services, SDC Orbit, BRS, and other systems. (Perhaps I will go into more detail about these early online “players” in another post.) Our customers used the timesharing service to “search” ABI / INFORM. We split the money with the timesharing company and generally ended up with the short end of the stick.
Below is an example of the ABI / INFORM controlled vocabulary:
There were about 15,000 terms in the vocabulary. If you look closely, you will see that some terms are marked “rt” and “uf”. These are “related terms” and “use for” terms. The idea was that a person assigning index terms would be able to select a general term like “market shares” and see that the related terms “competition” and “market erosion” would provide pertinent information. The “uf” or “use for” reminded the indexer that “share of market” was not the preferred index term. Our vocabulary could also be used by a customer or user, whom we called a “searcher” in 1980.
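For readers who think in code rather than card files, here is a minimal sketch of how the “rt” and “uf” relationships might be represented today. The entries are reconstructed from the “market shares” example above, not taken from the actual vocabulary.

```python
# Sketch of a controlled vocabulary entry with related terms (rt)
# and use-for (uf) mappings, per the "market shares" example.

VOCABULARY = {
    "market shares": {
        "rt": ["competition", "market erosion"],  # related terms to broaden a search
        "uf": ["share of market"],                # non-preferred synonyms
    },
}

def preferred_term(term: str) -> str:
    """Map a non-preferred term to its controlled term (the 'uf' lookup)."""
    for controlled, entry in VOCABULARY.items():
        if term in entry["uf"]:
            return controlled
    return term

assert preferred_term("share of market") == "market shares"
```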
A person searching for information in the ABI / INFORM file (database) of business abstracts could use these terms to locate precisely the information desired. You may have heard the terms precision and recall used by search engine and content processing vendors. The idea originated with the need to allow users (then called searchers) to narrow results; that is, make them more precise. There was also a need to allow a user (searcher) to get more results if the first result set contained too few hits or did not have the information the user (searcher) wanted.
To address this problem, we created classification codes and assigned these to the ABI / INFORM records as well. As a point of fact, ABI / INFORM was one of the first, if not the first, commercial database to reindex every record in its database, manually assigning six to eight index terms and classification codes as part of a quality assurance project.
When we undertook this time-consuming and expensive job, we had to use SMEs. The business terminology proved to be so slippery that our primitive automatic indexing and search-and-replace programs introduced too many indexing red herrings. My early experience with machine-indexing and my having to turn financial cartwheels to pay for the manual rework has made me suspicious of vendors pushing automated systems, especially for business content. Business content indexing remains challenging, eclipsed only by processing email and Web log entries. Scientific, technical, and medical content is tricky but quite a bit less complicated than general business content. (Again, that’s a subject for another Web log posting.)
Our solution to broadening a query was to make it possible for the SME indexing business abstracts to use a numerical code to indicate a general area of business; for example, marketing, and then use specific values to indicate a slightly narrower sub-category. The idea was that the controlled vocabulary was precise and narrow and the classification codes were broader and sub-divided into useful sub-categories. A snippet of the ABI / INFORM classification codes appears below:
If you look at these entries for the classification code 7000 Marketing, you will see terms such as “sn”. That’s a scope note, and it tells the indexer and the user (searcher) specific information about the code. You also see the “cd”. That means “code description”. A “code description” provides specific guidance on when and how to use the classification code, in this case “7000 Marketing”.
Notice too that the code “7100 Market research” is a sub-category of Marketing. The idea is that while 7000 Marketing is broad and appropriate for general articles about marketing, the sub-category allows the indexer or user to identify articles about “Market research.” While “Market research” is broad, it sits ideally in a middle ground between the very broad classification code 7000 Marketing and the very specific terminology of the controlled vocabulary. We also had controlled term lists for geography or what today is called “geo spatial coding”, document type codes, and other specialized index categories. These are important facets of the overall indexing scheme, but not germane to the point I want to make about user satisfaction with search and content processing systems.
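The classification scheme lends itself to the same treatment. Below is a minimal sketch of the broad-code / sub-code hierarchy with scope notes; the entries are illustrative reconstructions, not the real code list.

```python
# Sketch of the two-level classification scheme: broad codes with
# scope notes (sn) and narrower sub-codes beneath them.

CLASSIFICATION = {
    "7000": {"label": "Marketing",
             "sn": "Use for general articles about marketing.",
             "children": ["7100"]},
    "7100": {"label": "Market research",
             "sn": "Use for articles about market research.",
             "children": []},
}

def broaden(code: str) -> list:
    """Expand a code to itself plus its sub-codes, so a query on
    7000 also recalls documents indexed under 7100."""
    result = [code]
    for child in CLASSIFICATION[code]["children"]:
        result.extend(broaden(child))
    return result

assert broaden("7000") == ["7000", "7100"]
```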
Let’s step back. Humans created abstracts of journal articles. Humans then completed bibliographic entries for each selected article. Then an SME would index the abstracts, selecting terms according to his or her judgment and the editorial policy inherent in the controlled term lists. These index terms became the building blocks for locating a specific article among hundreds of thousands or for identifying a subset of all possible articles in ABI / INFORM directly on point to the topic on which the user wanted information.
The ABI / INFORM controlled vocabulary was used at commercial organizations to index internal documents or what we would today call “behind-the-firewall content.” One customer was IBM. Another was the Royal Bank of Canada. The need for a controlled vocabulary such as ABI / INFORM’s is rooted in the nature of business terminology. When business people speak, jargon creeps into almost every message. On top of that, new terms are coined for old concepts. For example, you don’t participate in a buzz group today. You participate in a focus group. Now you know why I am such a critic of the baloney used by search and content processing vendors. Making up words (neologisms) or misappropriating a word with a specific meaning (semantic, for example) and then gluing that word with another word with a reasonably clear meaning (processing, for example) creates the jargon semantic processing. Now I ask you, “Who knows what the heck that means?” I don’t, and that’s the core problem of business information. The language is slippery, fast moving, jargon-riddled, and fuzzy.
Appreciate that creating the ABI / INFORM controlled vocabulary, capturing the editorial policy in those lists, and then applying them consistently to what was then the world’s largest index to business and management thought was a big job. Everyone working on the project was exhausted after two years of researching, analyzing, and discussing. What made me particularly proud of the entire Courier-Journal team (organized by the time we finished into a separate database unit called Data Courier) was that library and information science courses used ABI / INFORM as a reference document. At Catholic University in Washington, DC, the entire vocabulary was used as a text book for an advanced information science class. Even today, ABI / INFORM’s controlled vocabulary stands as an example of:
- The complexity of creating useful, meaningful knowledge bases
- Proof that it is possible to index content so that it can be sliced and diced with few “false drops” or what we today call “irrelevant hits”
- Proof that a difficult domain such as business can be organized and made more accessible via good indexing.
Now here’s the kicker, actually a knife in the heart to me and the entire ABI / INFORM team. We did user satisfaction surveys on our customers before the reindexing job and then after the reindexing job. But our users (searchers) did not use the controlled terms. Users (searchers) keyed one or two terms, hit the Enter key, and used what the system spit out.
Before the work, two-thirds of the people we polled who were known users of ABI / INFORM said our indexing was unsatisfactory. After the work, two-thirds of the people we polled who were known users of ABI / INFORM said our indexing was unsatisfactory. In short, bad indexing sucked, and better indexing sucked. User behavior was responsible for the dissatisfaction, and even today, who dares tell a user (searcher) that he / she can’t search worth a darn?
I’ve been thinking about these two benchmark studies performed by the Courier-Journal every so often for 28 years. Here’s what I have concluded:
- Inherent in the search and retrieval business is frustration with finding the information a particular user needs. This is neither a flaw in the human nor a flaw in the indexing. Users come to a database looking for information. Most of the time — two thirds to be exact — the experience disappoints.
- Investing person years of effort in constructing an almost-perfect epistemological construct in the form of controlled vocabularies is a great intellectual exercise. It just doesn’t pay huge dividends. Users (searchers) flounder around and get “good enough” information which results in the general dissatisfaction with search.
- As long as humans are involved, it is unlikely that the satisfaction scores will improve dramatically. Users (searchers) don’t want to work hard to formulate queries or don’t know how to formulate queries that deliver what’s needed. Humans aren’t going to change at least in my lifetime or what’s left of it.
What’s this mean?
Simply stated, algorithmic processes and the use of sophisticated mathematical procedures will deliver better results.
The Present: 2008
In my new study Beyond Search, I have not included much history. The reason is that today most procurement teams looking to improve an existing search system or replace one system with another want to know what’s available and what works.
The vendors of search and content processing systems have mastered the basics of key word indexing. Many have integrated entity extraction and classification functions into their content processing engines. Some have developed processes that look at documents, paragraphs, sentences, and phrases for clues to the meaning of a document.
Armed with these metatags (what I call index terms), the vendors can display the content in point-and-click interfaces. A query returns a result list, and the system also displays Use For references or what vendors call facets, hooks, or adjacent terms. The naked “search box” is surrounded with “rich interfaces”.
You know what?
Survey the users and you will find two-thirds of the users dissatisfied with the system to some degree. Users overestimate their ability and expertise in finding information. Many managers are too lazy to dig into results to find the most germane information. Search has become a “good enough” process for most users.
Rigorous search is still practiced by specialists like pharmaceutical company researchers and lawyers paid to turn over every stone in hopes of getting the client off the legal hook. But for most online users in commercial organizations, search is not practiced with diligence and thoroughness.
In May 2007, I mentioned in a talk at an iBreakfast seminar that Google had an invention called “I’m feeling doubly lucky.” The idea is that Google can look at a user’s profile (compiled automatically by the Googleplex), monitor the user’s location and movement via a geo spatial function in the user’s mobile device, and automatically formulate a query to retrieve information that may be needed by the user. So, if the user is known to be a business traveler and the geo spatial data plot his course toward La Guardia Airport, then the Google system will push information to the user’s phone about which parking lot is available and whether the user’s flight is late. The key point is that the user doesn’t have to do anything but go on about his / her life. This is “I’m feeling doubly lucky” because it raises the convenience level of the “I’m feeling lucky” button on Google pages today. Press I’m feeling lucky and the system shows you the one best hit as defined by Google’s algorithmic factory. Some details of this invention appear in my September 2007 study, Google Version 2.0.
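To show what “doubly lucky” implies mechanically, here is a minimal sketch. The profile fields, the proximity test, and run_query are all hypothetical; Google’s actual invention is described in its patent documents, not here. The point is that the system, not the user, formulates and runs the query.

```python
# Sketch of implicit query formulation: context in, one best hit out.
# All field names and thresholds are invented for illustration.

def near_la_guardia(location: tuple) -> bool:
    """Crude proximity test against La Guardia's approximate coordinates."""
    lat, lon = location
    return abs(lat - 40.777) < 0.05 and abs(lon + 73.874) < 0.05

def doubly_lucky(profile: dict, location: tuple, run_query) -> list:
    """Formulate and run a query from context, with no user input."""
    if profile.get("role") == "business traveler" and near_la_guardia(location):
        flight = profile.get("next_flight", "")
        query = f"{flight} status OR LGA parking availability"
        return run_query(query)[:1]  # one best hit, "I'm feeling lucky" style
    return []
```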
I’m convinced that automatic, implicit searching is the direction that search must go. Bear in mind that I really believe in controlled vocabularies, carefully crafted queries, and comprehensive review of results lists. But I’m a realist. Systems have to do most of the work for a user. When users have to do the searches themselves or at least most of the work, their level of dissatisfaction will remain high. The dissatisfaction is not with the controlled vocabulary, the indexing, or the particular search system. The dissatisfaction is with the work associated with finding and using the information. I think that most users are happy with the first page or first two or three results. These are good enough or at least assuage the user’s conscience sufficiently to make a decision.
The future, therefore, is going to be dominated by systems that automate, analyze, and predict what the mythical “average” user wants. These results will then be automatically refined based on what the system knows about a particular user’s wants and needs. The user profile becomes the “narrowing” function for a necessarily broad set of results.
Systems can automatically “push” information to users or at least keep it in a cache ready for near-zero latency delivery. In an enterprise, search must be hooked into work flow. The searches must be run for the user and the results displayed to the user. If not automatically, the user need only click a hot link to display the needed information. A user can override an automatic system, but I’m not sure most users would do it or care if the override were like the knob on a hotel’s air conditioner. You feel better turning the knob. You feel a loss of control if you can’t turn it.
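A minimal sketch of this push-and-cache model, with invented scheduling and storage details:

```python
# Sketch of pre-run standing queries cached for near-zero latency
# delivery. A real system would hang this off a scheduler and tie it
# into work flow; both are elided here.

CACHE = {}  # (user, query) -> results

def refresh_cache(standing_queries: dict, run_query) -> None:
    """Pre-run every user's standing queries ahead of need."""
    for user, queries in standing_queries.items():
        for q in queries:
            CACHE[(user, q)] = run_query(q)

def deliver(user: str, query: str) -> list:
    """Serve from cache instantly; the user just clicks a hot link."""
    return CACHE.get((user, query), [])
```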
Observations
Let me offer several observations after this journey back in time and a look at the future of search and content processing. If you are easily upset, grab your antacid, because here we go:
- The razzle-dazzle about taxonomies, ontologies, and company-specific controlled term lists hides the fact that specific terms have to be identified and used to automatically index documents and information objects found in behind-the-firewall search systems. Today, these terms can be generated by processing a representative sample of existing documents produced by the organization (see the sketch after this list). The key is a good-enough term list, not doing what was done 25 years ago. Keep in mind the phrase “good enough.” There are companies that offer software systems to make this list generation easier. You can read about some vendors in Beyond Search, or you can do a search on Google, Live.com, or Yahoo.
- Users will never be satisfied. So before you dump your existing search system because of user dissatisfaction, you may want to get some other ammunition, preferably cost and uptime data. “Opinion” data are almost useless because no system will test better than another in my experience.
- Don’t believe the business jargon thrown at you by vendors. Inherent in business itself is a tendency to create a foggy understanding. I think the tendency to throw baloney has been around since the first caveman offered to trade a super-sharp flint for a tasty banana. The flint is not sharp; it’s like a Gillette four-blade razor. The banana is not just good; it is mouth-watering, by implication a great banana. You have to invest time, effort, energy, and money in figuring out which search or content processing system is appropriate for your organization. This means head-to-head bake-offs. Few do this, and the results are clear. Most people are unhappy with their vendor, with search, and with the “information problem”.
- Background processes, agent-based automatic searching, and mechanisms to watch what your information needs and actions are will make search better. You can enter ss cc=71? AND ud=9999 to get recent material about market research, but most people don’t and won’t.
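Here is the sketch promised in the first observation: generating a good-enough term list from a representative sample of documents. TF-IDF is my stand-in scoring method; commercial vendors use their own, generally undisclosed, procedures.

```python
# Sketch of good-enough term list generation from a document sample.
# TF-IDF scoring is an assumption, not any vendor's actual method.

import math
from collections import Counter

def candidate_terms(docs: list, top_n: int = 25) -> list:
    """Rank terms by TF-IDF summed across the sample corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    n_docs = len(tokenized)
    scores = Counter()
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            scores[term] += count * math.log(n_docs / doc_freq[term])
    return [term for term, _ in scores.most_common(top_n)]
```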
In closing, keep these observations in mind when trying to figure out what vendors are really squabbling about. I’m not sure they themselves know. When you listen to a sales pitch, are the vendors saying the same thing? The answer is, “Yes.” You have to rise to the occasion and figure out the differences between systems. I guarantee you the vendors don’t know, and if they do know, they sure won’t tell you.
Stephen Arnold, February 2, 2008
Search Saber Rattling
February 1, 2008
The Washington Post, January 31, 2008, ran a story “Google Slams Autonomy over Enterprise Search Claims.” The subtitle was, “Google Says Autonomy’s White Paper Contains ‘Significant Inaccuracies’ about its Search Appliance.”
The gist of the story, as I understand it, is that Autonomy wrote a white paper. The white paper contains assertions that the Google Search Appliance is not as good as Autonomy’s search engine. The Autonomy white paper is here. Google’s response is here.
What’s a White Paper?
For those of you not familiar with the lingo of high-tech marketing, a white paper is an essay, usually three or four pages to 50 pages or more. The author, usually an “expert”, opines on a particular topic, including facts, assertions, data from “objective tests”, and other sources. The idea is that a white paper presents information that supports an argument. If you want to immerse yourself in white papers, navigate to Bitpipe, and sign up. The young founders created a treasure trove of these documents after a stint at the Thomson Corporation. White papers, based on my experience, are among the favorite reads of graduate students in far-off places. Bitpipe reports heavy usage of their repository. My test a couple of years ago revealed zero substantive leads from a white paper about behind-the-firewall search. My hunch is that these documents occupy 20-something public relations experts, their superiors, and, of course, the hundreds of graduate students looking for information. Maybe some legitimate buyers order up several million dollars worth of computer gear after reading a white paper, but I think the white papers’ impact might be more limited; for example, competitors seem to read one another’s white papers. I scan them, but mostly I focus on the specifications (if any are included) and the technical diagrams (also rare as hen’s teeth).
I keep a collection of white papers published by the 52 search and content processing companies I track. I don’t want to dig into the Autonomy white papers or the mind-numbing complexities of the Google essays here.
The majority of white papers are like sonnets in the 16th century. There’s a convention, and vendors follow the convention. The individual white papers are formulaic. Most white papers argue that the sponsor’s or author’s product is not just good but really very good.
About half the white papers take implicit or explicit swipes at competitors’ products. I’m not sure these swipes are harmless, but a white paper is not going to have the impact of a story by Walt Mossberg in Rupert Murdoch’s Wall Street Journal. Furthermore, the writing of a white paper is not going to get anyone a Ph.D. or even a high grade in a first-year writing class.
The objective of a white paper is to make a sale or help a fence-sitting prospect to make the “right” decision. The sponsor or author of the white paper wants to paint a clear picture of one product. The competitors’ products are so-so. White papers are usually free, but to download one, you may have to register. You become a sales lead.
Why the Fuss?
I understand the frustration of search vendors who find their product or service criticized. Search systems are terribly complex, generally not well understood by their licensees, and almost always deemed “disappointing” by their users. Marketers can suggest and imply remarkable features of their employers’ search systems. Hyperbole sells in some situations.
I’ve made reference to a major study we conducted in 2007. The data suggested that two-thirds of a behind-the-firewall search system’s users were dissatisfied or somewhat dissatisfied with their search engine. It didn’t seem to make much difference whose system the respondent had in mind. The negative brush painted a broad swath across the best and brightest vendors in the search marketplace. In December, I learned from a colleague in Paris, France, that she found similar results in her studies of search system satisfaction.
To lay my cards on the table, I don’t like any search system all that much. You can read more about search and content processing warts in my new study, Beyond Search, available in April 2008. Task-master Frank Gilbane has me on schedule. I assume that’s one reason his firm has a good reputation. In Beyond Search, I discuss what makes people unhappy when they use commercial search systems, and I also offer some new information about fixing a broken system.
Dust Up: Not the World Wrestling Federation
The recent dust up reported by the Washington Post and dozens of other outlets is that Autonomy and Google are shaking their PR swords at one another. I find that amusing because no one outside of a handful of specialists has the foggiest idea what makes each respective company’s system work. I recall a fierce argument about Spenser’s Faerie Queene. I don’t think anyone knew what the distinguished combatants were talking about.
Autonomy
IDOL stands for Integrated Data Operating Layer. The Autonomy approach is to put a “framework” for information applications in an organization. The licensee uses the IDOL framework to acquire, process, and make available information. You can run a search, and you can process video, identify data anomalies, and output visual reports. The system, when properly configured and resourced, is amazing. Autonomy has thousands of customers, and based on the open source intelligence available to me, most are happy. You can read more about the Autonomy IDOL system at www.autonomy.com. There’s a long discussion of the IDOL framework in all four editions of the Enterprise Search Report, which I authored from 2003 to 2006; some of my thoughts linger in the 4th edition.
Google Search Appliance
The GSA is a server or servers with a Google search system pre-installed. This is a “search toaster,” purpose-built for quick deployment. You can also use a GSA for a Web site search, but the Google Custom Search Engine can do that job for free. The GSA comes with an API called the “One Box API”. In my research for the first three editions of the Enterprise Search Report, I kept readers up to date on the evolution of the Google Search Appliance. My assessment in the first edition of ESR was that the GSA was outstanding for Web site search and acceptable for certain types of behind-the-firewall requirements. Then in editions two and three of ESR, I reported on the improvements Google was making to the GSA. The Google wasn’t churning out new versions every few months, but it was making both incremental and significant improvements. Import filters improved. The GSA became more adept with behind-the-firewall security. With each upgrade to the GSA, it was evident that Google was making improvements.
When the One box API came along maybe a year or two ago, the GSA morphed from an okay solution into a search stallion for the savvy licensee. Today’s GSA and One box API duo are hampered by a Googley, non-directive sales plan. Google, in a sense, is not in a hurry. Competitors are.
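To give a feel for what coding against the One Box API involves, here is a toy sketch of the general pattern: the appliance forwards the user’s query to an external HTTP provider, which answers with XML that the appliance merges into its results page. The element names and the port below are illustrative assumptions on my part, not Google’s published schema; consult the official documentation for the real interface.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class ToyProvider(BaseHTTPRequestHandler):
    """Answers a forwarded query with a small XML payload."""
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        query = params.get("query", [""])[0]
        # Illustrative element names only; the real One Box schema differs.
        body = ("<results><query>{q}</query>"
                "<result><title>Directory entry for {q}</title></result>"
                "</results>").format(q=query).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ToyProvider).serve_forever()
```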
Differences between Autonomy IDOL and GSA
Autonomy has a solid track record of knowing what the “next big thing in search” will be. The company’s top management seem to be psychic. There was “portal in a box”. Very successful. Great timing. There was Kenjin (remember that?), a desktop search application. Industry-leading and ahead of its time. And there was IDOL itself. Autonomy invented the notion of an information operating platform. Other vendors like Fast Search & Transfer jumped on the Autonomy idea with enthusiasm. Now most search vendors offer a “platform” or a “framework”.
Let’s look at some differences in the two competitors’ systems:
| | Autonomy IDOL | GSA (One Box API) |
|---|---|---|
| Platform | On-premises installation using licensee’s infrastructure | Servers available in different configurations |
| Deployment | Custom installation | Toaster approach. Plug in the boxes |
| Features | Myriad. Mostly snap in with mild customization | Code your own via the One Box API |
| Support | Direct, partners, and third-parties not affiliated with Autonomy | About 40 partners provide support, customization, etc. |
Autonomy IDOL is a “classic” approach to enterprise systems. A licensee can mix and match features, customize almost every facet of the system, and build new applications on IDOL. IDOL runs on a range of operating systems. IDOL includes snazzy visualization and report services. The licensee has one responsibility — ensuring that the resources required by the system are appropriate. With appropriate resources, Autonomy is a very good content processing and search system.
Google’s approach is quite different. The GSA is a plug-and-play solution. A licensee can do some customization via style sheets and the GSA’s administrative utility. But for really interesting implementations, the licensee or one of the three dozen partners have to roll up their sleeves and write code. Google wants its customers “to get it.”
Neither IDOL nor GSA is objectively better than the other; both can do an outstanding job of search. Both can deliver disappointing results. Remember, none of the hundreds of search and content processing systems is able to please most of the users most of the time.
Full Circle
This brings me back to the white paper dust up. I don’t think that dueling news releases will do much to change who buys what. What I think is going on is pretty obvious. Let me make my view clear:
Google’s GSA is gaining traction in the behind-the-firewall search segment. I heard that the enterprise unit had tallied more than 8,500 GSA sales as of December 31, 2007. Information about the One Box API is beginning to diffuse among the technical crowd. Google’s own PR and marketing is — how can I phrase it? — non-directive. Google is not in-your-face when it comes to sales. Customers have to chase Google. But this white paper affair suggests that Google may have to change. Autonomy knows how to toss white paper grenades at the Google. The Google has some grenades to toss at Autonomy. For example, you can dip into this Google white paper, Algorithm für dynamische geometrische Datenströme (in English, “Algorithms for Dynamic Geometric Data Streams”), Gereon Frahling, Ausgezeichnete Informatikdissertationen 2006.
What the Saber Rattling Does Reveal
Google, despite being Googley, is no longer the elephant in the search engine play house no one sees or talks about. I also learned that:
- The battle for search mind share is no longer between and among the traditional on-premises players like Endeca, Exalead, and ISYS Search Software. Autonomy has certified Google as the competitor for 2008.
- Google, judging by its response, is beginning to realize it needs some additional marketing chutzpah.
- The media will continue to toss logs on the fire in the best tradition of “If it bleeds, it leads” journalism.
- Prospects now have reason to equate Autonomy and Google.
Let me be 100 percent clear. No search system is perfect. None of the marketing is particularly useful to potential licensees who remain confused about what system does what. Some vendors still refuse to make a price list available. (Now that’s a great idea when selling to the U.S. government.) And, none of the currently shipping systems will deliver the customer satisfaction scores Rolls-Royce enjoys.
Search is difficult. Its value proposition is fuzzy and hard to demonstrate. The resources available to a licensee make more difference than the specific system deployed. A so-so search system can deliver great results when the users’ requirements are met, the system doesn’t time out, and the needed content has been indexed. Any commercial search engine will flat out fail when the licensee doesn’t have the expertise, money, or patience to resource the search system.
Want to know which vendor has the “best” system? Get 300 identical servers, load up the same content on each, and let your users run queries. Ask the users which system is “best”. Know what? No one does this. Search is, therefore, a he-said, she-said business. The fix? Do head-to-head bake-offs. Decide for yourself. Don’t let vendors do your thinking for you.
Stephen Arnold, February 1, 2008
Study Spam: Caveat Emptor
January 31, 2008
My interest in Google is evident from my speeches, articles, and studies about the company. Not surprisingly, I receive a number of emails, telephone calls, and snail mail solicitations. Most of these are benign. I respond to the email, take the call, and toss the snail mail junk. Most people are hoping that I will review their work, maybe engage in some Google gossip, or try to sell me their studies or expertise. No problem 99 percent of the time. I post my telephone number and email (I know it’s not a great idea), but most people are courteous to me and respectful of my requests.
Recently I have been getting email solicitations from an entity known as Justyna Drozdzal from Visiongain Intelligence. The email address is justyna.drozdzal at visiongainglobal.com. Maybe you will have better luck interacting with this outfit? I certainly haven’t had any success getting Ms. Drozdzal to stop sending me email solicitations about a “new” study about Google’s Android. The study’s come-hither title is Google’s Android and Mobile Linux Report 2008: A Google-Led Initiative to Reshape the Mobile Market Environment. I didn’t think Google “led” much of anything, but that’s my Kentucky silliness surfacing. I accept Google’s “controlled chaos” theory of building its business. (This is the subject of my April 2008 column in KMWorld where some of my “official” work appears.)
The Visiongain study is about, according to the information available to me, Google’s Android — sort of. I did review the table of contents, and I drew three conclusions, but I urge you to make your own decision, not accept the opinion of a geek in rural Kentucky with the squirrels and horses. Here’s what I decided:
First, the report seems to recycle information available on Google’s own Web site and from other readily available open sources. Summaries are useful. I do them myself, but my summaries have not been able to do much to clear the shroud of secrecy that surrounds Google’s intentions. When it comes to telco, I think the Google has “won”. The company has destabilized traditional telecom operations in the US, floated this open platform notion, and managed to round up a couple dozen partners. I’m reasonably confident that Google knows exactly what it will do — be opportunistic. That’s because Google reacts to clicks and follows the data. A company with Google’s Gen-X fluidity defies conventional wisdom and makes predictions about Google wobbly. Unlike my uncertainty, Visiongain has figured out Google, Android, and telecommunications. Impressive.
Second, the study appears to have some chunks of other Visiongain reports, such as those covering the use of Linux on mobile devices. Again, my spot checks suggested that most of the information is available with routine queries passed to metasearch engines like Dogpile or consumer-oriented Web search services such as Microsoft Live.com or Yahoo. Maybe I’m jaded or plain wrong, but “experts” and “high end consultancies” with names that convey so much more meaning than ArnoldIT.com have been quick to exploit Google-mania. Is Visiongain guilty of this? I’m not sure.
Third, the study seems to suggest that Visiongain is not baffled by Google’s telecommunications’ strategy. Is Dell making a Google phone? Is Apple the next Google target? Will Google build a telecommunications company? I was unable to see through Google’s and the Web’s Sturm und Drang. My conclusion: Remove my telecommunications analysis from my study Google Version 2.0.
Now, back to Visiongain. My thought is that spamming me to buy a report in an area in which I have conducted research for several years is silly. When I asked the company to stop, I received another missive from Ms. Drozdzal asking me to buy the study. That was even sillier. My personal opinion is that this particular Visiongain report seems to be more fool’s gold than real gold.
To cite one example, there is chatter circulating that Google and Dell have teamed to produce a Google phone. Not surprisingly, neither Dell Computer nor Google is talking. Similarly, there are rumors that Google continues to bid for spectrum, and there are rumors that Google is not serious. Until the actual trajectory of these Google mobile activities is made clear by Google itself, I think spending money on reports from companies purporting to know how Google will “reshape the mobile market environment” is not for me. You make your own decision about how to spend your money.
Stephen Arnold, January 31, 2008
Autonomy: Right but for the Wrong Reasons
January 30, 2008
I try to turn a blind eye to the PR chaff wafted by software companies. An Ovum story caught my eye on January 30, 2008, as I waited for another dose of red-eye airline abuse. As I sat in Seattle’s airport, I started to think about the Ovum article “Sub-Prime Will Provide Two-Year Boost for Autonomy.” I realized that Autonomy would have a strong 2008, but I did not agree that the “sub-prime issue” would be the driver of Autonomy’s juicy 2008 revenue.
As I shivered in the empty departure hall, I replayed in my aging mind some of the conversations I had in the previous 36 hours. Let me be clear: I think Ovum does good work. As a former consultant at a blue-chip firm, I do understand the currents and eddies of maintaining client relationships, seeming “smart” on crucial issues, and surfing on big, intellectual waves.
I think Autonomy is a very good information platform. But as readers of this Web log know, I think search and content processing is a tough business, so everyone can and should improve. So, I’m okay with Autonomy IDOL and its various moving parts.
Autonomy’s Reasoning for a Great 2008
What I want to do is quote a snippet of the Ovum essay. Please, read the original. I cannot do justice to the Ovum wordsmithing. Here’s a portion that I found interesting, almost like a melody that keeps rattling around my mind when I’m trying to relax:
“After declaring record earnings for the quarter and the year, Mike Lynch CEO of Cambridge, UK-based Autonomy said in addition to a good pipeline for 2008 he is expecting a positive bounce resulting from the US sub-prime debacle through banks adopting Autonomy’s technology.”
Then a few sentences further on in the Ovum essay, I read:
“When describing the year ahead Lynch highlighted that the fall-out of the sub-prime issue in the US is that banks were having to secure and analyse very large amounts of disparate information in a very short timescale, exactly where Autonomy positions its Meaning Based Computing message, and reflected in the recently announced $70m deal … with a global bank. To illustrate how significant the demand being seen by Autonomy was, [Sir Michael] Lynch stated that the cycle time for that deal was two months, and that the company was in the first instance moving people across to support the sales and deployment activity, and it would be supporting its partners to undertake the work in the near future.”
I think I follow this line of reasoning, but it doesn’t strike to the heart of what may well be Autonomy’s best revenue-generating year in the firm’s history. Let me tell you what I think I heard, and you judge what’s more important: financial crises or the information you are about to read.
Alternate Reasoning for a Great 2008
In my chit-chats — all off the record and without attribution — in Seattle I picked up two pieces of what may be reliable information. I urge you to verify my information before concluding that I know what I am talking about. I invite you to provide corrections, additions, or emendations to these points. I had not heard these two ideas expressed before, and I find them thought provoking. I find looking at events from different viewpoints helpful. Here are the two pieces of information. Proceed at your own risk.
First point: the Microsoft deal was done by the firm’s Office group’s senior leadership. That unit concluded that it needed to take a bold step. The Fast Search & Transfer acquisition made sense because it delivered revenue (~$150 – $200 million), 2,000 customers, lots of technology, and smart people. The deal was pushed forward in the period between Thanksgiving and the New Year. When the news broke, some inside Microsoft were surprised. I thought I heard something along the lines of, “What? We own Fast Search & Transfer?”
Second point: When other units of Microsoft started pooling their knowledge of Fast Search & Transfer, there was concern that the guts of Fast Search did not share Microsoft’s DNA or the Microsoft “agenda”. Fast Search has a SharePoint adaptor, but the rest of the technology looked like Amazon, Google, or Yahoo “stuff”. I thought I heard: “That is going to be an interesting integration chore for the Office group.”
When I hear the word interesting, my ears quiver just like my dog Tyson’s when he hears me open the treat jar. Interesting can mean good things or bad things, but rarely dull things. I have some experience with Microsoft frameworks and some with Fast Search’s ESP (Enterprise Search Platform). Integrating these two frameworks is not something I could do. I’m too old and slug-like for super-wizardry.
Back to 2008
How do these two unsubstantiated pieces of information relate to Autonomy?
What I think is that Autonomy will win in those head-to-head competitions where Autonomy must sell against Fast Search & Transfer (maybe Micro-Fast?). I think that in large account face-offs, procurement teams will not know how the merger will play out. Uncertainty about the future may tip the scales in favor of Autonomy. The company is stable, not at this moment being acquired, and has oodles of mostly happy customers. It has name recognition. The Microsoft – Fast team can only offer assurances that everything will be okay. In my view, Autonomy can go beyond okay and will, therefore, win most deals.
But many search procurements involve three or more vendors. In those situations, Autonomy will win some and lose some. So its win rate won’t be much different from what it was in 2007, which, as I understand the financial reports, continued to nose upward. With or without the Microsoft – Fast deal, Autonomy has been charging forward.
Financial factors seem tangentially significant, but Autonomy’s golden goose 2008 is going to be attributable to the Microsoft – Fast merger.
The Microsoft – Fast Options (Hypothetical, Speculative, Thought Exercise)
Let’s assume that I am right and consider as a thought experiment what Microsoft can do to win sales from Autonomy and keep Fast Search’s revenues on the trail to financial health. Here are three possible scenarios that seemed insightful to me in the Seattle airport at 12:15 am Pacific time. I invite you to weigh in. Attorneys, consultants, and share churners can climb in too. This is an essay, an attempt, not much more than my opinion based on the aforementioned, unsubstantiated chit-chat. Okay? Now the options to consider:
- Microsoft – Fast leaves the two platforms separate. Microsoft provides management expertise, leadership, and marketing horsepower and goes flat out to wrest business from Autonomy in its key accounts. Microsoft ignores Autonomy’s response (price cutting, PR chaff, etc.) and uses Microsoft billions to neuter Autonomy, Virage, and any other search technology Autonomy offers.
- Microsoft – Fast hunkers down, integrates the two platforms. Using its reseller and Certified Partner network, Microsoft – Fast combines a better search solution plus slick technology plus high-powered marketing. Although Microsoft concedes some battles, when the company hits the street with its offering, it executes a Netscape-type (think free or really low license fees) strategy and becomes the dominant player in the behind-the-firewall search market.
- Microsoft – Fast does the integration work well. Instead of fighting Autonomy, Microsoft uses a variation of its customer relationship management strategy. Free trials and low cost introductory rates are used to get existing Microsoft-centric customers to remain loyal to Microsoft. Applied globally for a year or more, Autonomy may be slowly deprived of oxygen. Autonomy loses its agility, becomes weaker, and fades into a distant second place behind the Microsoft super-platform.
As I think about these hypothetical scenarios, I see a win for Microsoft in any of these paths. If the company were to mix and match strategies — for example, all-out assault and long-term oxygen deprivation — Autonomy would have its cash reserves depleted. The end game, of course, is that another super-platform steps forward to acquire Autonomy. Then two super-platforms would fight for the behind-the-firewall search customers.
Who are candidate super-platforms at a time when the US economy is teetering toward a recession? Here’s my shortlist, and may I ask, “Who are your candidates in this high-stakes poker game?”
- Oracle. This company already inked a deal with Google to hawk the Google Search Appliance. Oracle bought Triple Hop, but so far has not been able to leverage that technology. Mr. Ellison is a buyer, and I hear that Oracle has looked closely at Autonomy in the past.
- SAP. This company has the TREX search system. A shiny new search system is on the horizon. Buying Autonomy brings several thousand customers, revenue, and engineers. Microsoft tried to buy SAP, and SAP fought back. Maybe this is the next step for SAP?
- An investment bank — maybe Carlyle Group, an outstanding outfit. Carlyle could work a deal to convert Autonomy into several companies and start selling various units off to the highest bidder. There’s real money in buy outs and break ups.
- IBM. IBM has more search solutions than any other vendor I track. Buying Autonomy brings customers and revenue. IBM then implements one of the Microsoft options and goes after Microsoft. IBM still remembers the great business relationship IBM enjoyed with Microsoft in the DOS and OS/2 era.
Note that none of these hypotheticals is greatly influenced by the sub-prime tempest. The stakes are now sufficiently high in behind-the-firewall search to make secondary forces — well — secondary. I don’t want to disagree with Ovum, but I think my analysis may add some useful “color,” as the financial analysts like to say, to their look at Autonomy.
Stephen Arnold, January 31, 2008
Search: The Problem with Words and Their Misuse
January 30, 2008
I rely on several different types of alerts, including Yahoo’s service, to keep pace with developments in what I call “behind the firewall search”.
Today was particularly frustrating because the number of matches for the word “search” has been increasing, particularly since the Microsoft – Fast Search & Transfer acquisition and the Endeca cash injection from Intel and SAP. My alerts contain a large number of hits, and I realized that most of these are not about “behind the firewall” search, nor chock full of substantive information. Alerts are a necessary evil, but over the years, the primitive key word indexing offered by free services hasn’t helped me.
The problem is the word search and its use or misuse. If you know of better examples to illustrate these types of search, please, post them. I’m interested in learning about sites and their search technology.
I have a so-so understanding of language drift, ambiguity, and POM (plain old marketing) work. For someone looking for information about search, the job is not getting easier. In fact, search has become such a devalued term that locating information about a particular type of search requires some effort. I’ve just finished compiling the Glossary for “Beyond Search”, due out in April 2008 from the Gilbane Group, a high-caliber outfit in the Boston, Massachusetts area. So, terminology is at the top of my mind this morning.
Let’s look at a few terms. These are not in alphabetical order. The order is by their annoyance factor. The head of the list contains the most annoying terms to me. The foot of the list are terms that are less offensive to me. You may not agree. That’s okay.
Vertical search. Number one for 2008. Last year it was in second place. This term means that a particular topic or corpus has been indexed. The user of a vertical search engine like Sidestep.com sees only hits in the travel area. As Web search engines have done a better and better job of indexing horizontal content — that is, on almost every topic — vertical search engines narrow their focus. Think deep and narrow, not wide and shallow. As I have said elsewhere, vertical search is today’s 20-somethings rediscovering how commercial databases handled information in the late 1970s with success then but considerably less success today.
Search engine marketing. This is last year’s number one. Google and other Web engines are taking steps to make it harder to get junk sites to the top of a laundry list of results. The phrase search engine marketing is the buzzword for the entire industry of getting a site on the first page of Google results. The need to “rank high” has made some people “search gurus”. I must admit I don’t think too much of SEM, as it is called. I do a reasonable job of explaining SEM in terms of Google’s Webmaster guidelines. I believe that solid content is enough. If you match that with clean code, Web indexing bots will index the information. Today’s Web search systems do a good job of indexing, and there are value-added services such as Clusty.com that add metadata, whether the metadata exists on the indexed sites or not. When I see the term search used to mean SEM, I’m annoyed. Figuring out how to fool Google, Microsoft Live.com, or Yahoo’s indexing systems is not something that is of much interest to me. Much of the SEM experts’ guidance amounts to repeating Google’s Web master guidelines and fiddling with page elements until a site moves up in the rankings. Most sites lack substantive content and deserve to be at the bottom of the results list. Why do I want to have in my first page of results a bunch of links to sites without heft? I want links to pages significant enough to get to the top of a results list because of solid information, not SEM voodoo. For basics, check out “How Stuff Works.”
Guided, faceted, assisted, and discovery search. The idea, which is difficult to express in a word or phrase, is a system that provides point-and-click access to related information. I’ve heard a variation on these concepts expressed as drill-down search or exploratory search. These are 21st-century buzzwords for “Use For” and “See Also” references. But by the time a vendor gets done explaining taxonomies, ontologies, and controlled term lists, the notion of search is mired in confusion. Don’t get me wrong. Rich metadata and exposed links to meaningful “See Also” and “Use For” information are important. I’m just burned out with companies using these terms when their technology can’t deliver.
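As a toy illustration of the drill-down idea, the sketch below counts the values of a metadata field across a result set; a real system renders those counts as the clickable facet links beside the result list. The records and field names are made up for the example.

```python
from collections import Counter

# Made-up result set; each hit carries metadata assigned at indexing time.
results = [
    {"title": "Q3 sales report", "dept": "Sales", "year": 2007},
    {"title": "Q4 sales report", "dept": "Sales", "year": 2007},
    {"title": "Benefits FAQ", "dept": "HR", "year": 2006},
]

def facets(hits, field):
    """Count the values of one metadata field across the hits."""
    return Counter(hit[field] for hit in hits)

print(facets(results, "dept"))  # Counter({'Sales': 2, 'HR': 1})
print(facets(results, "year"))  # Counter({2007: 2, 2006: 1})
```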
Enterprise search. I do not know what “enterprise search” is. I do know that there are organizations of all types. Some are government agencies. Some are non-profit organizations. Some are publicly-traded companies. Some are privately held companies. Some are professional services corporations. Some are limited liability corporations. Each has a need to locate electronic information. There is no one-size-fits-all content processing and retrieval system. I prefer the phrase “behind the firewall search.” It may not be perfect, but it makes clear that the system must function in a specific type of setting. Enterprise search has been overused, and it is now too fuzzy to be useful from my point of view. A related annoyance is the word “all”. Some vendors say they can index “all the organization’s information.” Baloney. Effective “behind the firewall” systems deliver information needed to answer questions, not run afoul of federal regulations regarding health care information, incite dissatisfaction by exposing employee salaries, or let out vital company secrets that should be kept under wraps.
Natural language search. This term means that the user can type a question into a system. A favorite query is, “What are the car dealerships in Palo Alto?” You can run this query on Google or Ask.com. The system takes this “natural language question”, converts it to Boolean, and displays the results. Some systems don’t do anything more than display a cached answer to a frequently asked question. The fact is that most users — exceptions include lawyers and expert intelligence operatives — don’t do “natural language queries”. Most users type some words like weather 40202 and hit the Enter key. NLP sounds great and is often used in the same sentence with latent semantic indexing, semantic search, and linguistic technology. These are useful technologies, but most users type their 2.3 words and take the first hit on the results list.
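A crude sketch of the conversion step follows: drop the question words, AND the survivors. Real systems do far more (parsing, entity recognition, query expansion), so treat this as a cartoon of the idea rather than anyone’s production pipeline.

```python
STOPWORDS = {"what", "are", "the", "in", "a", "an", "of", "is", "who"}

def question_to_boolean(question):
    """Strip question words and punctuation; AND whatever remains."""
    terms = (t.strip("?.,!").lower() for t in question.split())
    keepers = [t for t in terms if t and t not in STOPWORDS]
    return " AND ".join(keepers)

print(question_to_boolean("What are the car dealerships in Palo Alto?"))
# car AND dealerships AND palo AND alto
```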
Semantic search. See natural language search. Semantic technologies are important and finally practical in everyday business operations. Used inside search systems, today’s fast processors and cheap storage make it possible to figure out some nuances in content and convert those nuances to metatags. It’s easier for vendors to bandy about the terms semantic and Semantic Web than to explain what they deliver in terms of precision and recall. There are serious semantic-centric vendors, and there are a great many who use the phrase because it helps make sales. An important vendor of semantic technology is Siderean Software. I profile others in “Beyond Search”.
Value-added search. This is a coinage that means roughly, “When our search system processes content, we find and index more stuff.” “Stuff”, obviously, is a technical word that can mean the file type or concepts and entities. A value-added search system tries to tag concepts and entities automatically. Humans used to do indexing, but there is too much data and there are not enough skilled indexers. So, value-added search means “indexing like a human used to do.” Once a result set has been generated, value-added search systems will display related information; that is, “See Also” references. An example is Internet the Best. Judge for yourself if the technique is useful.
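Here is a bare-bones sketch of the tagging step: match processed text against a controlled term list and emit metatags, roughly the work a human indexer once did. The gazetteer entries are invented for the example; commercial systems use statistical extraction, not a three-entry dictionary.

```python
# Invented controlled term list mapping phrases to entity types.
GAZETTEER = {
    "palo alto": "PLACE",
    "blackberry": "PRODUCT",
    "google": "COMPANY",
}

def tag_entities(text):
    """Emit (phrase, type) metatags for known phrases in the text."""
    lowered = text.lower()
    return [(phrase, label) for phrase, label in GAZETTEER.items()
            if phrase in lowered]

print(tag_entities("Google opened a BlackBerry kiosk in Palo Alto."))
# [('palo alto', 'PLACE'), ('blackberry', 'PRODUCT'), ('google', 'COMPANY')]
```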
Side search. I like this phrase. It sounds nifty and means nothing to most people in a vendor’s marketing presentation. What I think the vendors who use this term mean is additional processes that run to generate “Use For” and “See Also” references. The implication is that the user gets a search bonus or extra sugar in their coffee. Some vendors have described a “more like this” function as a side search. The idea is that a user sees a relevant hit. By clicking the “more like this” hot link, the system uses the relevant hit as the basis of a new, presumably more precise, query. A side search to me means any automatic query launched without the user having to type in a search box. The user may have to click the mouse button, but the heavy lifting is machine-assisted. Delicious offers a side search labeled as related terms. Just choose a tag from the list on the right side of the Web page, and you see more hits like these. The idea is that you get related information without reentering a query.
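The “more like this” flavor of side search can be sketched in a few lines: take the most frequent meaningful words from the relevant hit and resubmit them as a fresh query, no typing required. Using word length as a stand-in for meaningfulness is a deliberate simplification.

```python
from collections import Counter

def more_like_this(document, num_terms=5):
    """Turn a relevant hit into a new query from its frequent words."""
    words = [w.strip(".,").lower() for w in document.split() if len(w) > 3]
    top = Counter(words).most_common(num_terms)
    return " ".join(term for term, _ in top)

hit = "The appliance indexes appliance logs and appliance configuration files."
print(more_like_this(hit, 3))  # appliance indexes logs
```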
Sentiment search. I have just looked at a new search system called Circos. This system lets me search in “color”. The idea is that emotions or feelings can be located. People want systems that provide a way to work emotion, judgment, and nuance into their results. Lexalytics, for example, offers a useful, commercial system that can provide brand managers with data about whether customers are positive or negative toward the brand. Google, based on their engineering papers, appears to be nosing around in sentiment search as well. Worth monitoring, because using algorithms to figure out if users like or dislike a person, place, or thing can be quite significant to analysts.
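At its simplest, sentiment scoring is lexicon counting, as in the sketch below: positive words add, negative words subtract. The word lists are invented; vendors such as Lexalytics use far richer models, so this is only the skeleton of the idea.

```python
# Invented mini-lexicons; real systems use thousands of weighted terms.
POSITIVE = {"love", "great", "excellent", "reliable"}
NEGATIVE = {"hate", "broken", "awful", "slow"}

def polarity(comment):
    """Positive score: favorable chatter; negative: unfavorable."""
    words = {w.strip(".,!").lower() for w in comment.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

print(polarity("I love the brand, but support is awful and slow."))  # -1
```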
Visual search. I don’t know what this means. I have seen the term used to describe systems that allow the user to click on pictures in order to see other pictures that share some colors or shapes of the source picture. If you haven’t seen Kartoo, it’s worth a look. Inxight Software offers a “search wall”. This is a graphic representation of the information in a results list or a collection as a three-dimensional brick wall. Each brick is a content object. I liked the idea when I first saw it five or six years ago, but I find visual search functionality clunky. Flying hyperbolic maps and other graphic renderings have sizzle, but instead of steak I get boiled tofu.
Parametric search. Structured search or SQL queries with training wheels are loose synonyms for parametric search and close enough for horseshoes. The term parametric search has value, but it is losing ground to structured search. Today, structured data are fuzzed with unstructured data by vendors who say, “Our system supports unstructured information and structured data.” Structured and unstructured data are treated as twins, making it hard for a prospect to understand what processes are needed to achieve this delightful state. These data can then be queried by assisted, guided, or faceted search. Some of the newer search systems are, at their core, parametric systems. These systems are not positioned in this way. Marketers find that customers don’t want to be troubled by “what’s under the hood.” So, “fields” become metatags, and other smoothing takes place. It is no surprise to me that content processing procurement teams struggle to figure out what a vendor’s system actually does. Check out Thunderstone‘s offering and look for my Web log post about parametric (structured search) in a day or two. In Beyond Search, I profile two vendors’ systems, each with different but interesting parametric search functionality. Either of these two vendors’ solutions can help you deal with the structured – unstructured dichotomy. You will have to wait until April 2008 when my new study comes out. I’m not letting these two rabbits out of my hat yet.
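Stripped of the marketing, a parametric query is a filter over fields, as this sketch shows. The records are invented; a production system would push the same constraints into a database or a fielded index rather than loop in memory.

```python
# Invented fielded records; think rows in a product catalog.
products = [
    {"brand": "Acme", "price": 120, "color": "red"},
    {"brand": "Acme", "price": 80, "color": "blue"},
    {"brand": "Zenith", "price": 95, "color": "red"},
]

def parametric_search(records, **constraints):
    """Return records whose fields match every constraint exactly."""
    return [r for r in records
            if all(r.get(f) == v for f, v in constraints.items())]

print(parametric_search(products, brand="Acme", color="blue"))
# [{'brand': 'Acme', 'price': 80, 'color': 'blue'}]
```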
Unstructured search. This usually implies running a query against text that has been indexed for its key words because the source lacks “tags” or “field names”. Email, PDFs, and some Word documents are unstructured. A number of content processing systems can also index bound phrases like “stock market” and “white house”. Others include some obvious access points such as file types. Today, unstructured search blends into other categories. But unstructured search has less perceived value than flashier types of search or a back office ERP (enterprise resource planning) application. Navigate to ArnoldIT.com and run a query in my site’s search box. That’s an unstructured search, provided by Blossom Software, which is quite interesting to me.
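Under the hood, key word indexing of unstructured text comes down to an inverted index: a map from each word to the documents containing it. The two toy documents below echo the bound-phrase examples; note that this naive version indexes single words only.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,")].add(doc_id)
    return index

docs = {1: "stock market rally continues",
        2: "white house comments on the stock market"}
index = build_index(docs)
print(sorted(index["market"]))  # [1, 2]
print(sorted(index["white"]))   # [2]
```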
Hyperbolic search. There are many variations of this approach, which I call “buzzword fog”. Hyperbolic geometry and modular forms play an important role in some vendors’ systems. But these functions are locked away, out of sight and beyond fiddling by licensees. When you hear terms other than plain English, you are in the presence of “fog rolling in on little cat’s feet.” The difference is that this fog doesn’t move on. You are stuck in an almost-impenetrable mist. When you see the collision coming, it is almost always too late to avoid it. I think the phrase means, “Our engineers use stuff I don’t understand, but it sure sounds good.”
Intuitive search. This is a term used to suggest that the interface is easy enough for the marketer’s mother to use without someone telling her what to do. The interface is one visible piece of the search system itself. Humans like to look at interfaces and debate which color or icon is better for their users. Don’t guess on interfaces. Test different ones and use what gets the most clicks. Interfaces that generate more usage are generally better than interfaces designed by the senior vice president’s daughter who just graduated with an MFA from the University of Iowa. Design opinion is not search; it’s technology decoration. For an example, look at this interface from Yahoo. Is it intuitive to you?
Real-time search. This term means that the content is updated frequently enough to be perceived as real time. It’s not. There is latency in search systems. The word “search,” therefore, doesn’t mean real-time by definition. Feed means “near real time”. There are a lot of tricks to create the impression of real time. These include multiple indexes, caching, content boosting, and time stamp fiddling. Check out ZapTXT. Next compare Yahoo News, AllTheWeb.com news, and Google News. Okay, which is “real time”? Answer: none.
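One of the standard tricks, the separate fresh index, can be sketched as follows: keep a small index that is rebuilt every few minutes, query it alongside the big main index, and put its hits first. The dictionaries stand in for real indexes; the merge logic is the point.

```python
def search_with_freshness(term, main_index, fresh_index):
    """Fresh hits first, then main-index hits not already shown."""
    fresh_hits = fresh_index.get(term, [])
    main_hits = [h for h in main_index.get(term, []) if h not in fresh_hits]
    return fresh_hits + main_hits

main_index = {"earnings": ["doc-14", "doc-7"]}
fresh_index = {"earnings": ["wire-story-2", "doc-7"]}
print(search_with_freshness("earnings", main_index, fresh_index))
# ['wire-story-2', 'doc-7', 'doc-14']
```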
Audio, video, image search. The idea is that a vendor indexes a particular type of non-text content. The techniques range from indexing only metadata and not the information in the binary file to converting speech to ASCII, then indexing the ASCII. In Japan, I saw a demonstration of a system that allowed a user to identify a particular image — for example, a cow. The system then showed pictures the system thought contained cows. These types of search systems address a real need today. The majority of digital content is in the form of digitized audio, video, and image files. Text is small potatoes. We don’t do a great job on text. We don’t do very well at all on content objects such as audio, video, and images. I think Blinkx does a reasonably good job, not great, reasonable.
Local search. This is a variation on vertical search. Information about a city or particular geographic area is indexed and made available. This is Yellow Pages territory. It is the domain of local newspaper advertising. A number of vendors want to dominate this sector; for example, Google, Microsoft, and Yahoo. Incumbents like telcos and commercial directory firms aren’t sure what actions to take as online sites nibble away at what was a $32 billion paper directory business. Look at Ask City. Will this make sense to your children?
Intelligent search. This is the old “FOAI” or familiar old artificial intelligence. Most vendors use artificial intelligence but call it machine learning or computational intelligence. Every major search engine uses computational intelligence. Try Microsoft’s Live.com. Now try Google’s “ig” or Individualized Google service. Which is relying more on machine learning?
Key word search. This is the ubiquitous, “naked” search box. You can use Boolean operators, or you can enter free text and perform a free text search. Free text search means no explicit Boolean operators are required of a user. Enlightened search system vendors add an AND to narrow the result set. Other system vendors, rather unhelpfully, add an OR, which increases the number of results. Take a look at the key word search from Ixquick, an engine developed by a New York City investment banker and now owned by a European company. What’s it doing to your free text query?
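The implicit-operator choice is easy to see with toy postings lists, as below: AND intersects and narrows; OR unions and broadens. The postings are invented.

```python
# Invented postings: which documents contain each word.
POSTINGS = {"market": {1, 2, 3}, "research": {2, 3, 4}}

def free_text(query, operator="AND"):
    """Combine per-word document sets with the implicit operator."""
    sets = [POSTINGS.get(w, set()) for w in query.lower().split()]
    result = sets[0]
    for s in sets[1:]:
        result = result & s if operator == "AND" else result | s
    return result

print(free_text("market research", "AND"))  # {2, 3}
print(free_text("market research", "OR"))   # {1, 2, 3, 4}
```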
Search without search. Believe me, this is where the action is. The idea is that a vendor — for example, Google — will use information about information, user behavior, system processes, and other bits and pieces of data to run queries for a user automatically and in the background. Then when the user glances at his / her mobile device, the system is already displaying the information most likely to be wanted at that point in time by that user. An easy way to think of this is to imagine yourself rushing to the airport. The Google approach would look at your geo spatial coordinates, check your search history, and display flight departure delays or parking lot status. I want this service because anyone who has ridden with me knows that I can’t drive, think about parking, and locate my airline reliably. I can’t read the keyboard on my mobile phone, so I want Google to convert the search result to text, call me, and speak the information as I try to make my flight. Google has a patent application with the phrase “I’m feeling doubly lucky.” Stay tuned to Google and its competitors for more information on this type of search.
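A cartoon version of the idea looks like the sketch below: compose the query from context signals (location, history) instead of typed words. Every field name is invented; the real systems described in Google’s patent filings are vastly more sophisticated.

```python
def background_query(context):
    """Build a query from context signals instead of typed words."""
    if context.get("destination") == "airport":
        return "flight status {} departures {}".format(
            context["usual_airline"], context["home_airport"])
    return "news " + context.get("last_query", "")

context = {"destination": "airport", "usual_airline": "Delta",
           "home_airport": "SDF", "last_query": "parking rates"}
print(background_query(context))
# flight status Delta departures SDF
```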
This short list of different types of search helps explain why there is confusion about which systems do what. Search is no longer something performed only by a person trained in computer science, information science, or a similar discipline. Search is something everyone knows, right? Wrong. Search is a service that’s readily available and used by millions of people each day. Don’t confuse using an automatic teller machine with understanding finance. The same applies to search. Just because a person can locate information about a subject does not mean that person understands search.
Search is among the most complex problems in computer science, cognitive psychology, information retrieval, and many other disciplines. Search is many things, but it definitely is not easy, well understood, or widely recognized as the next application platform.
Stephen Arnold, January 30, 2008
Vivisimo’s Remix
January 29, 2008
I’ve been interested in Vivisimo since I learned about the company in 2000. Disclaimer: my son worked for Vivisimo for several years, and I was involved in evaluating the technology for the U.S. Federal government. A new function, called “Remix“, caught my attention and triggered this essay.
Background
Carnegie Mellon University ranks among the top five or six leading universities in computer science. Lycos was a product of the legendary Fuzzy and his team. Disclaimer: my partner (Chris Kitze) and I sold search technology to Lycos in the mid-1990s. Dr. David Evans has practiced his brand of innovation with several successful search-centric start ups, including a chunk of the technology now used in JustSystems‘ XML engine. (Disclaimer: I have done some work for JustSystems in Tokyo, Japan.) Vivisimo, founded by Raul Valdes-Perez and Jerome Pesenti, was among the first of the value-added processing search systems. I have been paying attention to Vivisimo since its earliest days.
I’ve been impressed with Vivisimo’s innovations, and I have appropriated Mr. Valdes-Perez’s coinage, “information overlook”, for my verbal arsenal. As I understand the term, “overlook” is a way for a person looking for information to get a broader view of the information in the results list. I think of it in terms of standing on a bluff and being able to see the lay of the land. As obvious as an overlook may be, it is a surprisingly difficult problem in information retrieval. You’ve heard the expression “We can’t tell the forest from the trees.” Information overlook attempts to get the viewer into a helicopter. From that vantage point, it’s easier to see the bigger picture.
A Demonstration Query
Vivisimo’s technology has kept that problem squarely in focus. With each iteration and incremental adjustment to the Vivisimo technology, overlook has been baked into the Vivisimo approach to search-and-retrieval. Here’s an example.
Navigate to Clusty.com, Vivisimo’s public facing search system. Note that Clusty is a metasearch system. Your query is passed to other search systems such as Live.com and Yahoo. The results are retrieved and processed before you see them. Now enter the query ArnoldIT. You will see a main results page and a list of folders in the left hand column of your screen. You can browse the main results. Note that Vivisimo removes the duplicates for you, so you are looking at unique items. Now scan the folder names.
Those names represent the main categories or topics in that query’s result list. For ArnoldIT, you can see that my Web site has information about patents, international search, and so on. Let me highlight several points about the foundation of Vivisimo:
First, I’ve been impressed with Vivisimo’s on-the-fly clustering. It’s fast, unobtrusive, and a very useful way to get a view of what topics occur in a query’s result set. I use Vivisimo when I begin a research project to help me understand what topics can be researched via the Web and which will require the use of analysts making telephone calls.
Second, in the early days of online, deduplication was impossible. Dialog and Orbit, two of the earliest online systems, manipulated fielded flat files. A field name variation made it computationally expensive to recurse through records to identify and remove duplicate entries. When I was paying for results from commercial online systems, these duplicates cost me money. When I learned about Vivisimo’s duplicate detection function, I looked at it closely. No one at Vivisimo would give me the details of the approach, but it worked and still works well. Other systems have introduced deduplication, but Vivisimo made this critical function a must-have.
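Vivisimo keeps its method to itself, so the sketch below shows only the generic shape of duplicate detection in a metasearch setting: normalize the fields that matter, hash them, and drop hits whose fingerprints repeat. The field choices and normalization rules are my assumptions, not Vivisimo’s algorithm.

```python
import hashlib

def fingerprint(record):
    """Hash normalized fields; equal fingerprints flag duplicates."""
    basis = " ".join(str(record.get(key, "")).lower().strip()
                     for key in ("title", "url"))
    return hashlib.md5(basis.encode("utf-8")).hexdigest()

def dedupe(hits):
    """Keep the first hit for each fingerprint, preserving order."""
    seen, unique = set(), []
    for hit in hits:
        fp = fingerprint(hit)
        if fp not in seen:
            seen.add(fp)
            unique.append(hit)
    return unique

hits = [{"title": "ArnoldIT Patents", "url": "arnoldit.com/patents"},
        {"title": "arnoldit patents", "url": "arnoldit.com/patents"}]
print(len(dedupe(hits)))  # 1
```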
Third, Vivisimo’s implementation of metasearch remains speedy. There are a number of interesting approaches to metasearch, including the little-known ez2Find.com system developed by a brother and sister team working in the south of France. I also admire the Devilfinder search engine that is now one of the faster metasearch systems available. But in terms of features, Vivisimo ranks at the top of the list, easily outperforming ixquick, Dogpile, and other very useful tools.
Fourth, like Exalead, Vivisimo has been engineered using the Linux tricks of low-cost scaling and clustering for high performance. These engineering approaches are becoming widely known, but many of these innovations originated at Stanford, the University of Waterloo, MIT, and Carnegie Mellon University.
The Shift to the Enterprise
Three years ago, Vivisimo made the decision to expand its presence in organizations. In effect, the company wanted to move from a specialist provider of clustering technology to delivering behind-the-firewall search. When Vivisimo’s management told me about this new direction, I explained that the market for behind-the-firewall search was a contentious, confused sector. Success would require more marketing, more sales professionals, and a tougher hide. Mr. Valdes-Perez looked at me and said, “No problem. We’re going to do it.”
The company’s first high-profile win was the contract for indexing the U.S. Federal government’s unclassified content. This contract was originally held by Inktomi from 2000 to 2001. Then Fast Search & Transfer with its partner AT&T held the contract from 2001 to 2005. When Vivisimo displaced Fast Search’s technology, the company was in a position to pursue other high-profile search deals.
Today, Vivisimo is one of the up-and-coming vendors of behind-the-firewall search solutions. I have learned that the company has just won another major search deal. I’m not able to reveal the name of the new client, but the organization touches the scientific and technical community worldwide. Based on my understanding of the information to be processed, Vivisimo will be making the research work of most US scientists and engineers more productive.
Remix
This essay is a direct result of my learning about a new Vivisimo function, Remix. You can use the Remix function when you have a result set visible in your Clusty.com results display. In our earlier sample query, ArnoldIT, you see the top 10 topics or clusters of results for that query. When you select Remix, the system, according to Vivisimo, works like this: “With a single click, remix clustering answers the question: What other, subtler topics are there? It works by clustering again the same search results, but with an added input: ignore the topics that the user just saw. Typically, the user will then see new major topics that didn’t quite make the final cut at the last round, but may still be interesting.”
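Taking Vivisimo’s description at face value, the remix step can be caricatured in a few lines: label clusters from frequent terms, then rerun the same routine over the same results while ignoring the labels already shown. This is my reading of the quoted description, not Vivisimo’s actual algorithm, and the sample results are invented.

```python
from collections import Counter

def cluster_labels(results, ignore=frozenset(), top_n=3):
    """Label clusters by frequent terms, skipping terms in 'ignore'."""
    words = Counter()
    for text in results:
        for w in text.lower().split():
            if len(w) > 3 and w not in ignore:
                words[w] += 1
    return [w for w, _ in words.most_common(top_n)]

results = ["google patents report", "google search appliance review",
           "patents in mobile search", "appliance pricing report"]
first = cluster_labels(results, top_n=2)          # ['google', 'patents']
remix = cluster_labels(results, ignore=set(first), top_n=2)
print(first, remix)                               # then ['report', 'search']
```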
The function is important for three reasons:
First, Vivisimo has made drill down easy. Some systems perform a similar function, but the user is not always aware of what’s happened or where the result list originated. Vivisimo does a good job of keeping the user in control and aware of his / her location in the results review sequence.
Second, Remix allows one-click access to categories that otherwise would not be seen by the Clusty user. The benefit of Remix is that the result sets do not duplicate any topics the user saw before clicking the Remix button. Just as Vivisimo’s original deduplication function worked invisibly, so does Remix. The function just happens.
Third, the function is speedy. Vivisimo has a number of innovations in its system to make on-the-fly processing of search results take place without latency — the annoying delays some systems impose upon me. Vivisimo’s value-added processing occurs almost immediately. Like Google, Vivisimo has focused on delivering fast response time and rocket science for the busy professional.
Some Challenges
Companies like Vivisimo will have to deal with the marketing challenges of today’s search-and-retrieval marketplace. The noise created by Microsoft’s acquisition of Fast Search and Endeca‘s injection of cash from Intel and SAP means that interesting companies like Vivisimo have to make themselves known. I don’t envy the companies trying to get traction in the search sector.
If you are looking for a behind-the-firewall system, you will want to take a look at Vivisimo’s system. In fact, you will want to spend additional time reviewing the search solutions available from the up-and-comers I profile in my new study “Beyond Search”, due out in April 2008. You will find that you can deliver a robust solution without the teeth-rattling licensing fees required by some of the higher-profile vendors.
I can’t say that any one search system will be better for you than another. In fact, when you compare ISYS Search Software, Siderean Software, and Exalead with Vivisimo, you may find that each is an exceptionally robust solution. Which system you find is best for you comes down to your requirements. The key point is that the up-and-coming systems must not be excluded from your short list because the companies are not making headlines on a daily basis.
If you have the impression that Vivisimo is not up to an enterprise-scale content processing job, you have flawed information. Give Vivisimo’s technology a test drive. Judge for yourself. I wrote about Vivisimo in the first, second, and third editions of The Enterprise Search Report. I won’t be repeating that information in Beyond Search. You can explore Vivisimo and learn more about the system from the company’s useful white papers and case studies.
Stephen E. Arnold, January 29, 2008
Will Search “Save eBay”?
January 29, 2008
The Monday, January 28, 2008, New York Times, contained a short item that originally appeared in Bits, the technology blog “updated all day at nytimes.com/bits.”
The article in question carries this provocative headline: “The Plan to Save eBay: Better Search.” The author is Saul Hansell, whose writing I admire. I was tickled to learn his Web log entry made the leap from the Times‘s Web site to its printed newspaper. This revealed two facts to me: [a] the editors read what’s on the New York Times‘s Web site, a very good sign, and [b] the Web log itself contained newsprint-worthy information.
I want to quote a small snippet in which John Donahoe, the new eBay boss, is the primary source of information for the story. (I urge you to read the original posting or newspaper article.) I added the bold because I want to reference some of these words in my discussion of eBay.
“Let’s say you wanted to buy a BlackBerry,” he [Donahoe] said. “Last time I checked, we had 25,000 BlackBerry listings. This is a fairly confusing experience. A year from now, you will be able to say, ‘I want a BlackBerry. Boom. Show me the latest models at the cheapest price.’” The same screen, he added, will show used and older models as well. “We also want to surface the six-month-old version, still brand new, that may be in an auction format because its value is less certain,” he said.
As I understand this statement, eBay is going to [a] improve search because getting 25,000 results is confusing, and I agree, [b] maybe support voice search because the phrase “you will be able to say” seems to suggest that functionality, and [c] eBay “will show used and older models as well”. The word “show” connotes some graphical representation or interface innovation to help users make sense of 25,000 BlackBerry listings.
Any one of these technology-centric fixes would be a lot of work and could take considerable time. A year seems too little time to get these innovations planned, procured, debugged, and online for customers.
I know from my search work that most users don’t feel comfortable with laundry lists of results. I generally look at the auctions closing within a few minutes or check out the “Buy It Now” products. I no longer rummage through eBay’s sorting and filtering functions. For me, those functions are too hard to find. I prefer Google’s approach to its “Sort by Price” function. I also like eCost’s sending an email with time-sensitive deals. I want what I want now with the least hassle, the lowest price at that moment, and the simplest possible interface. Let’s look at some of the words I highlighted in the article.
I’m puzzled by Mr. Donahoe’s use of the word “surface”. I am not sure about its meaning in this context. “Surface” makes me think of whales, as in “Save the whales.”
When Mr. Donahoe used the word “say”, I thought of my mobile phone’s speech recognition function. I talk to my phone now, mostly unsuccessfully. I do use Google’s voice recognition service for 411, and it’s pretty good. My mobile phone has a small screen, and I can’t figure out how eBay will be able to display some of the 25,000 results so I can read them. I use the new Opera mobile browser. I don’t like its miniature rendering of a Web page. When I want to look at something, Opera uses a zoom function that is a hassle for me to use on my mobile’s Lilliputian keyboard. eBay has to do better than Opera’s interface.
Most of the gizmos I look for on eBay or Google’s shopping service come in quantities of a couple dozen if the product is even available. For example, I recently scoured the Web for a replacement fan for one of my aging Netfinity 5500 servers. Zero hits for me on eBay the day I ran the query. I fixed the fan myself. Last week, I tried to buy a Mac Mini on eBay, but I got a better deal through Craigslist.
Enough old-guy grumpiness.
I knew that eBay’s search system was and is a work in progress. Years ago, the eBay Web site carried a Thunderstone logo. I assumed that Thunderstone, a vendor of search systems, provided search technology to eBay. Then one day the little blue Thunderstone logo vanished. No one at Thunderstone would tell me what happened. Somewhere along the line, Louis Monier (a search wizard) joined eBay. Then he jumped to Google, and I don’t know who had to fill his very big shoes. I asked eBay to fill me in, but eBay’s team did not respond to me. I call this search churn. In my experience, it’s expensive and underscores a lack of certainty about how to deliver a foundation service.
But I really wasn’t surprised at the lack of response to my email.
When I have a problem with an eBay vendor who snookers me, I have had to do time-consuming work to get to a human. Someone told me that I have a negative reputation score because I am a “bad buyer”. I don’t sell anything on eBay, but I suppose eBay rates customers who grouse when a sale goes out of bounds.
Search won’t fix a business model, customer support, or the practice of giving annoyed customers a grade of “D” in buying. To my way of thinking, search is not eBay’s only problem. Search is not eBay’s major problem.
One of my colleagues in San Francisco told me that eBay was reluctant to license his software because eBay’s system at that time was “flaky”. His word, not mine. At that time eBay was relying on Sun Microsystems’ servers. I was a Sun Catalyst reseller and a Sun cheerleader. I know that, in general, Sun boxes are reliable, fast, stable, and scalable when properly set up and resourced. Ignore Sun’s technical recommendations, and you will definitely have excitement on your hands. When I hear rumors of high-end systems being “flaky”, I’m inclined to believe that some technical and management mistakes were made. Either money or expertise is in short supply, so a problem gets a temporary fix, not a real fix.
After reading the New York Times‘s article, I asked myself, “Is eBay so sick it has to be saved?”
For example, I bought a watchband on eBay not long ago, and the search itself worked as I expected. I queried “brown watchband 20mm” and got a page or two of results. I picked a watchband, won the auction, and paid via PayPal. Actually, I tried to pay. I had registered a new credit card a few weeks before the purchase, so before I could consummate the transaction, I had to locate a secret four-digit number printed next to a $2 eBay transaction on my last credit card statement. After coming home from an 18-day trip, I had a hefty statement, and hunting down the secret code put a hitch in my get-along, but I found the number eventually. About a week later my watchband arrived. I liked its hot pink color and its 18mm width. Yep, another eBay purchase that went awry. I lived with the error. I’ve learned that when I file a negative comment about a transaction, I get emails from the offending merchant asking me to revise my opinion. I don’t need busy work.
Now let’s think about this “save eBay” effort. When I was in Australia in November 2007, the Australian government said it would take action to protect whales. I didn’t think this would help. There are more whale hunters than Australian patrols, and the Pacific Ocean is a big expanse. Whale hunters with radar, infrared sensors, Google Earth, and super-tech harpoons can find and kill whales more easily than Australia’s protective forces can find the whale killers.
If “save eBay” is like saving the whales, eBay has a thankless job ahead. Fixing search alone won’t save eBay; the business processes and the business model need attention too. eBay’s new president (the former blue-chip consultant) is putting in place a one-year program in which search plays a leading role. eBay plans to mine its “closed transaction data” to “provide the most relevant search experience.”
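What might mining “closed transaction data” for relevance look like? Here is a minimal sketch, not eBay’s actual system: it blends a text-relevance score with the sell-through rate observed in completed listings. The field names, the scores, and the 0.7/0.3 weights are my assumptions, not anything eBay has disclosed.

    # Hedged sketch: re-rank results by blending text relevance with
    # demand evidence from completed ("closed") listings. All field
    # names and weights are illustrative assumptions.
    def sell_through_rate(history):
        """Fraction of closed listings for an item that actually sold."""
        closed = [h for h in history if h["closed"]]
        if not closed:
            return 0.0
        return sum(1 for h in closed if h["sold"]) / len(closed)

    def rerank(results, transaction_history, w_text=0.7, w_sales=0.3):
        """Order results by a weighted mix of text score and sales history."""
        def score(item):
            history = transaction_history.get(item["item_id"], [])
            return w_text * item["text_score"] + w_sales * sell_through_rate(history)
        return sorted(results, key=score, reverse=True)

    # Tiny example: B2's strong sales history lifts it above A1.
    results = [
        {"item_id": "A1", "text_score": 0.90},
        {"item_id": "B2", "text_score": 0.80},
    ]
    history = {
        "A1": [{"closed": True, "sold": False}],
        "B2": [{"closed": True, "sold": True}, {"closed": True, "sold": True}],
    }
    print([r["item_id"] for r in rerank(results, history)])  # ['B2', 'A1']

The weighting is the interesting design choice: it lets a merchandising team decide how much completed-sale evidence should override a pure textual match.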
I am confident that a Bain consultant can deliver on his agenda. What bothers me is that I think his timeline lacks wiggle room. He has to clear some hurdles:
First, annoyed sellers who are looking for other ways to move products.
Second, annoyed buyers who look for other places to find the goods they want.
Third, the “new” or “better” search system.
Fourth, increasingly complex security actions that remind me that maybe eBay is not as secure as I believed it to be.
If the new president can’t revivify eBay, we might be looking at an eAmazon or a Google-Bay. If eBay swings for a search home run, it won’t be enough. eBay has to make more informed decisions about its customers, security, sellers, and business model. Otherwise, eBay may be an ecommerce whale pursued by some hungry sushi lovers.
Stephen Arnold, January 28, 2008

