Map Reduce: The Great Database Controversy
January 18, 2008
I read with interest the article “Map Reduce: A Major Step Backwards” by David DeWitt. The article appeared in “The Database Column” on January 17, 2008. I agree that Map Reduce is not a database, not a commercial alternative for products like IBM’s DB2 or any other relational database, and definitely not the greatest thing since sliced bread.
Map Reduce is one of the innovations that seems to have come from top-notch engineers Google hired from AltaVista.com. Hewlett Packard orphaned an interesting search system because it was expensive to scale in the midst of the Compaq acquisition. Search, to Hewlett Packard’s senior management, was expensive, generated no revenue, and led to a commercial dead end. But in terms of Web search, AltaVista.com was quite important because it allowed its engineers to come face to face with the potential and challenges of multi-core processors, new issues in memory management, and programming challenges for distributed, parallel systems. Google surfed on AltaVista.com’s learnings. Hewlett Packard missed the wave.
So, Map Reduce was an early Google innovation, and my research suggests it was influenced by technology that was well known among database experts. In The Google Legacy: How Search Became the Next Application Platform (Infonortics, Ltd. 2005), I tried to explain in layman’s terms how Map Reduce bundled and optimized two Lisp functions, map and reduce. The engineering wizardry of Google was making these two functions operate at scale and quickly. The engineering tricks were clever, but not Albert Einstein sitting in a patent office thinking about relativity. Google’s “invention” of Map Reduce was driven by necessity. Traditional ways to match queries with results were slow, not flawed, just turtle-like. Google needed really fast heavy lifting. The choke points that plague some query processing systems had to be removed in an economical, reliable way. Every engineering decision involves trade-offs. Google sacrificed some of the sacred cows protected by certain vendors in order to get speed and rock-bottom computational costs. (Note: I did not update my Map Reduce information in my newer Google study, Google Version 2.0 (Infonortics, Ltd. 2007). There have been a number of extensions to Map Reduce in the last three years. A search for the term MapReduce on Google will yield a treasure trove of information about this function, its libraries, its upside, and its downside.)
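To make the “two Lisp functions” point concrete, here is a minimal sketch of the pattern in Python, the classic word-count example. This is my illustration, not Google’s code; Google’s wizardry was running this pattern across thousands of commodity machines, not the primitives themselves.

```python
from collections import Counter
from functools import reduce

documents = [
    "google surfed on altavista learnings",
    "hewlett packard missed the wave",
]

# Map step: turn each document into per-document word counts.
mapped = map(lambda doc: Counter(doc.split()), documents)

# Reduce step: fold the per-document counts into one global tally.
word_counts = reduce(lambda left, right: left + right, mapped, Counter())

print(word_counts.most_common(3))
```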
I am constantly surprised at the amount of technical information Google makes available as Google Papers. Its public relations professionals and lawyers aren’t the best communicators, but I have found Google’s engineers to be remarkably communicative in technical papers and at conferences. For example, Google engineers rely on MySQL and other tools (think Oracle) to perform some data processes. Obviously Map Reduce is only one cog in the larger Google “machine.” Those of you who have followed my work about Google’s technology know that I refer to the three dozen server farms, the software, and the infrastructure as The Googleplex. Google uses this term to refer to a building, but I think it is a useful way to describe the infrastructure Google has been constructing for the last decade. Keep in mind that Map Reduce–no matter how good, derivative, or clever–is a single component in Google’s digital matryoshka.
My analyses of Map Reduce suggest that Google’s engineers obsess about scale, not breakthrough invention. I was surprised to learn that much of Google’s technology is available to anyone; for example, Hadoop. Some of Google’s technology comes from standard collections of algorithms like Numerical Recipes with Source Code CD-ROM 3rd Edition: The Art of Scientific Computing. Other bits and pieces are based on concepts that have been tested in various university computer science labs supported by U.S. government funds. And, there’s open source code kept intact but “wrapped” in a Google technical DNA for scale and speed. Remember that Google grinds through upwards of four score petabytes of data every 24 hours. What my work tells me is that Google takes well-known engineering procedures and makes them work at Google scale on Google’s infrastructure.
Google has told two of its “partners,” if my sources are correct, that the company does not have a commercial database now, nor does it plan to make a commercial database like IBM’s, Microsoft’s, or Oracle’s available. Google and most people involved in manipulating large-scale data know that traditional databases can handle almost unlimited amounts of information. But it’s hard, expensive, and tricky work. The problem is not the vendors. The problem is that Codd databases or relational database management systems (RDBMS) were not engineered to handle the data management and manipulation tasks at Google scale. Today, many Web sites and organizations face an information technology challenge because big data in some cases brings systems to their knees, exhausts engineers, and drains budgets in a nonce.
Google’s publicly-disclosed research and its acquisitions make it clear that Google wants to break free of the boundaries, costs, reliability, and performance issues of RDBMS. In my forthcoming study Beyond Search, I devote a chapter to one of Google’s most interesting engineering initiatives for the post-database era. For the data mavens among my readers, I include pointers to some of Google’s public disclosures about their approach to solving the problems of the RDBMS. Google’s work, based on the information I have been able to gather from open sources, is also not new. Like Map Reduce, the concepts have been kicking around in classes taught at the University of Illinois, the University of Wisconsin – Madison, University of California – Berkeley, and the University of Washington, among others, for about 15 years.
If Google is going to deal with its own big data challenges, it has to wrap Map Reduce and other Google innovations in a different data management framework. Map Reduce will remain for the foreseeable future one piece of a very large technology mosaic. When archeologists unearth a Roman mosaic, considerable effort is needed to reveal the entire image. Looking at a single part of the large mosaic tells us little about the overall creation. Google is like that Roman mosaic. Focusing on a single Google innovation such as Chubby, Sawzall, containers (not the XML variety), the Programmable Search Engine, the “I’m feeling doubly lucky” invention, or any one of hundreds of Google’s publicly disclosed innovations yields a distorted view.
In summary, Map Reduce is not a database. It is no more of a database than Amazon’s SimpleDB service is. It can perform some database-like functions, but it is not a database. Many in the database elite know that the “next big thing” in databases may burst upon the scene with little fanfare. In the last seven years, Map Reduce has matured and become a much more versatile animal. Map Reduce can perform tasks its original designers did not envision. What I find delightful about Google’s technology is that it often does one thing well like Map Reduce. But when mixed with other Google innovations, unanticipated functionality comes to light. I believe Google often solves one problem and then another Googler figures out another use for that engineering process.
Google, of course, refuses to comment on my analyses. I have no affiliation with the company. But I find its approach to solving some of the well-known problems associated with big data interesting. Some Google watchers may find it more useful to ask the question, “What is Google doing to resolve the data management challenges associated with crunching petabytes of information quickly?” That’s the question I want to try to answer.
Stephen E. Arnold, January 18, 2008
Autonomy: Marketing Chess
January 18, 2008
The Microsoft – Fast deal may not have the impact of the Oracle – BEA Systems deal, but dollar for dollar, search marketers will be working overtime. The specter of what Microsoft might do begs for prompt action. The principle that a good offense is the best defense seems to be at work at Autonomy plc, arguably one of the world’s leading vendors of behind-the-firewall search and various applications that leverage Autonomy’s IDOL (Intelligent Data Operating Layer) and its rocket-science mathematics.
The Microsoft – Fast Search & Transfer deal has flipped Autonomy’s ignition switch. On January 14, 2008, CBROnline reported that Autonomy’s Intelligent Data Operating Layer gets a Power Pack for Microsoft Vista. IDOL can now natively process more than 1,000 file formats. Autonomy also added support for additional third-party content feeds.
The goal of the enhancements is to make it easier for a Microsoft-centric organization to make use of the entity extraction, classification, categorization, and conceptual search capabilities. Autonomy’s tailoring of IDOL to Microsoft Windows began more than 18 months ago, possibly earlier. Microsoft SharePoint installations now have more than 65 million users. Despite some grousing about security and sluggish performance, Microsoft’s enterprise initiatives are generating revenue. The Dot Net framework keeps getting better. Companies that offer a “platform alternative” face a hard fact — Microsoft is a platform powered by a company with serious marketing and sales nitromethane. No senior manager worth his salary and bonus can ignore a market of such magnitude. Now, with the acquisition of Fast Search & Transfer, Autonomy is faced with the potential threat of direct and indirect activity by Microsoft to prevent third-party vendors like Autonomy from capturing customers. Microsoft wants the revenue, and it wants to keep other vendors’ noses out of its customers’ tents.
Autonomy has never shown reluctance about innovative, aggressive, and opportunistic marketing. (Remember the catchy “Portal in a Box” campaign?) It makes a great deal of business sense for Autonomy to inject steroids into its Vista product. I expect Autonomy to continue to enhance its support for Microsoft environments. To do less would boost Microsoft’s confidence in its ability to alter a market with an acquisition. I call this “money rattling.” The noise of the action scares off the opposition.
Other search vendors will also keep a sharp eye on Microsoft and its SharePoint customers. Among the companies offering a snap-in search or content processing solution are Coveo, dtSearch, Exalead, and ISYS Search Software. It’s difficult for me to predict with accuracy how these companies might respond to Autonomy’s sharp functional escalation of IDOL in particular and the Microsoft – Fast tie-up in general. I think that Microsoft will want to keep third-party vendors out of the SharePoint search business. Microsoft wants a homogeneous software environment, and, of course, more revenue from its customers. Let me think out loud, describing several hypothetical scenarios that Microsoft might explore:
- Microsoft reduces the license fee for Fast Search & Transfer’s SharePoint adaptor and Fast Search’s ESP (enterprise search platform). With Fast Search’s pre-sellout license fees in the $200,000 range, a price shift would have significant impact upon Autonomy and other high-end search solutions. This is the price war option, and it could wreak havoc on the fragile finances of some behind-the-firewall search system vendors.
- Microsoft leaves the list price of Fast Search unchanged but begins bundling ESP with other Microsoft applications. The cost for an enterprise search solution is rolled into a larger sale for Microsoft’s customer relationship management system or a shift from either IBM DB2 or Oracle’s database to enterprise SQL Server. Microsoft makes high-end search a functional component of a larger, enterprise-wide, higher value solution. This is the bundled feature option, and it makes a great deal of sense to a chief financial officer because one fee delivers the functionality without the additional administrative and operational costs of another enterprise system.
- Microsoft makes changes to its framework(s), requiring Microsoft Certified Partners to modify their systems to keep their certification. Increasing the pace of incremental changes could place a greater technical and support burden on some Certified Partners developing and marketing replacements for Microsoft search solutions for SharePoint. I call this Microsoft’s fast-cycle technical whipsaw option. Some vendors won’t be able to tolerate the work needed to keep their search application certified, stable, and in step with the framework.
- Microsoft does nothing different, allowing Fast Search and its competitors to raise their stress levels and (maybe) make a misstep implementing an aggressive response to … maybe no substantive action by Microsoft. I think of this as the Zen of uncertainty option. Competitors don’t know what Microsoft will or will not do. Some competitors feel compelled to prepare for some Microsoft action. These companies burn resources to get some type of insurance against an unknown future Microsoft action.
Microsoft’s actions with regard to Fast Search will, over time, have an impact on the vendors serving the SharePoint market. I don’t think that the Microsoft – Fast deal will make a substantive change in search and content processing. The reason is that most vendors are competing with substantially similar technologies. Most solutions are similar to one another. And, in my opinion, some of Fast Search’s technology is starting to become weighted down with its heterogeneous mix of original, open source, and acquired technology.
I believe that when a leap-frogging, game-changing technology becomes available, most vendors — including Autonomy, IBM, Microsoft, Oracle, and SAP, among others — will be blindsided. In today’s market, it’s received wisdom to make modest incremental changes and toot the marketing tuba. For the last five or six years, minor innovations have been positioned as revolutionary in behind-the-firewall search. I think that much of the innovation in search has been to handle sales and marketing in a more professional way. The technology has been adequate in most cases. My work suggests that most users of today’s behind-the-firewall search systems are not happy with their information access tools — regardless of vendor. Furthermore, in terms of precision and recall, there’s not been much improvement in the last few years. Most systems deliver 75 to 80 percent precision and recall upon installation. After tuning, 85 percent scores are possible. Good, but not a home run, I assert.
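For readers who want the yardstick behind those percentages, here is a minimal sketch of how precision and recall are computed for a single query. The counts are invented for illustration; no vendor’s system is being measured here.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list: 8 of the 10 documents returned are relevant,
# and 8 of the 10 truly relevant documents were found.
p, r = precision_recall(retrieved=range(10), relevant=list(range(8)) + [97, 98])
print(f"precision={p:.0%}, recall={r:.0%}")  # precision=80%, recall=80%
```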
I applaud Autonomy for enhancing IDOL for Vista. I will watch Microsoft to see if the company adopts one or more of my hypothetical options. I am also on the lookout for a search breakthrough. When that comes along, I will be among the first to jettison the tools I now use for the next big thing. I wonder how many organizations will take a similar approach? I want to be on the crest of the wave, not swamped by quotidian tweaks, unable to embrace the “search” future when it arrives.
Stephen Arnold, January 18, 2008
Two Visions of the Future from the U.K.
January 17, 2008
Two different news items offered insights about the future of online. My focus is the limitations of keyword search. I downloaded both articles, I must admit, eager to see whether my research would be disproved or augmented.
Whitebread
The first report appeared on January 14, 2008, in the (London) Times online in a news story “White Bread for Young Minds, Says University Professor.” In the intervening 72 hours, numerous comments appeared. The catchphrase is the coinage of Tara Brabazon, professor of Media Studies at the University of Brighton. She allegedly prohibits her students from using Google for research. The metaphor captures, in a memorable way, a statement attributed to her in the Times’s article: “Google is filling, but it does not necessarily offer nutritional content.”
The argument strikes a chord with me because [a] I am a dinosaur, preferring warm thoughts about “the way it was” as the snow of time accretes on my shoulders; [b] schools are perceived to be in decline because it seems that some young people don’t read, ignore newspapers except for the sporty pictures that enliven gray pages of newsprint, and can’t do mathematics reliably at take-away shops; and [c] I respond to the charm of a “sky is falling” argument.
Ms. Brabazon’s argument is solid. Libraries seem to be morphing into Starbucks with more free media on offer. Google–the icon of “I’m feeling lucky” research–allows almost anyone to locate information on a topic regardless of its obscurity or commonness. I find myself flipping my dinosaurian tail out of the way to get the telephone number of the local tire shop, check the weather instead of looking out my window, and convert worthless dollars into high-value pounds. Why remember? Google or Live.com or Yahoo are there to do the heavy lifting for me.
Educators are in the business of transmitting certain skills to students. When digital technology seeps into the process, the hegemony begins to erode, so the argument goes. Ms. Brabazon joins Neil Postman (Amusing Ourselves to Death: Public Discourse in the Age of Show Business, 1985) and more recently Andrew Keen (The Cult of the Amateur, 2007), among others, in documenting the emergence of what I call the “inattention economy.”
I don’t like the loss of what weird expertise I possessed that allowed me to get good grades the old-fashioned way, but it’s reality. The notion that Google is more than an online service is interesting. I have argued in my two Google studies that Google is indeed much more than a Web search system growing fat on advertisers’ money. My research reveals little about Google as a corrosive effect on a teacher’s ability to get students to do their work using a range of research tools. Who wouldn’t use an online service to locate a journal article or book? I remember how comfortable my little study nook was in the rat hole in which I lived as a student, then slogging through the Illinois winter, dealing with the Easter egg hunt in the library stuffed with physical books that were never shelved in sequence, and manually taking notes or feeding 10-cent coins into a foul-smelling photocopy machine that rarely produced a readable copy. Give me my laptop and a high-speed Internet connection. I’m a dinosaur, and I don’t want to go back to my research roots. I am confident that the professor who shaped my research style–Professor William Gillis, may he rest in peace–neither knew nor cared how I gathered my information, performed my analyses, and assembled the blather that whizzed me through university and graduate school.
If a dinosaur can figure out a better way, Tefloned along by Google, a savvy teen will too. Draw your own conclusions about the “whitebread” argument, but it does reinforce my research that suggests a powerful “pull” exists for search systems that work better, faster, and more intelligently than those today. Where there’s a market pull, there’s change. So, the notion of going back to the days of taking class notes on wax in wooden frames and wandering with a professor under the lemon trees is charming but irrelevant.
The Researcher of the Future
The British Library is a highly-regarded, venerable institution. Some of its managers have great confidence that their perception of online in general and Google in particular is informed, substantiated by facts, and well-considered. The Library’s Web site offers a summary of a new study called (and I’m not sure of the bibliographic niceties for this title): A Ciber [sic] Briefing Paper. Information Behaviour of the Researcher of the Future, 11 January 2008. My system’s spelling checker is flashing madly regarding the spelling of cyber as ciber, but I’m certainly not intellectually as sharp as the erudite folks at the British Library, living in rural Kentucky and working by the light of burning coal. You can download this 1.67-megabyte, 35-page document, Researcher of the Future.
The British Library’s Web site article identifies the key point of the study as “research-behaviour traits that are commonly associated with younger users — impatience in search and navigation, and zero tolerance for any delay in satisfying their information needs — are now becoming the norm for all age-groups, from younger pupils and undergraduates through to professors.” The British Library has learned that online is changing research habits. (As I noted in the first section of this essay, an old dinosaur like me figured out that doing research online is faster, easier, and cheaper than playing “Find the Info” in my university’s library.)
My reading of this weirdly formatted document, which looks as if it was a PowerPoint presentation converted to a handout, identified several other important points. Let me share my reading of this unusual study’s findings with you:
- The study was a “virtual longitudinal study”. My take on this is that the researchers did the type of work identified as questionable in the “whitebread” argument summarized in the first section of this essay. If the British Library does “Googley research”, I posit that Ms. Brabazon and other defenders of the “right way” to do research have lost their battle. Score: 1 for Google-Live.com-Yahoo. Nil for Ms. Brabazon and the British Library.
- Libraries will be affected by the shift to online, virtualization, pervasive computing, and other impedimenta of the modern world for affluent people. Score 1 for Google-Live.com-Yahoo. Nil for Ms. Brabazon, nil for the British Library, nil for traditional libraries. I bet librarians reading this study will be really surprised to hear that traditional libraries have been affected by the online revolution.
- The Google generation is comprised of “expert searchers”. The reader learns that most people are lousy searchers. Companies developing new search systems are working overtime to create smarter search systems because most online users–forget about age, gentle reader–are really terrible searchers and researchers. The “fix” is computational intelligence in the search systems, not in the users. Score 1 more for Google-Live.com-Yahoo and any other search vendor. Nil for the British Library, nil for traditional education. Give Ms. Brabazon a bonus point because she reached her conclusion without spending money for the CIBER researchers to “validate” the change in learning behavior.
- The future is “a unified Web culture,” more digital content, eBooks, and the Semantic Web. The word unified stopped my ageing synapses. My research yielded data that suggest the emergence of monopolies in certain functions, and increasing fragmentation of information and markets. Unified is not a word I can apply to the online landscape. In my Bear Stearns report published in 2007 as Google’s Semantic Web: The Radical Change Coming to Search and the Profound Implications to Yahoo & Microsoft, I revealed that Google wants to become the Semantic Web.
Wrap Up
I look forward to heated debate about Google’s role in “whitebreading” youth. (Sounds similar to waterboarding, doesn’t it?) I also hunger for more reports from CIBER, the British Library, and folks a heck of a lot smarter than I am. Nevertheless, my Beyond Search study will assert the following:
- Search has to get smarter. Most users aren’t progressing as rapidly as young information retrieval experts.
- The traditional ways of doing research, meeting people, even conversing are being altered as information flows course through thought and action.
- The future is going to be different from what big thinkers posit.
Traditional libraries will be buffeted by bits and bytes and Boards of Directors who favor quill pens and scratching on shards. Publishers want their old monopolies back. Universities want that darned trivium too. These are notions I support but recognize that the odds are indeed long.
Stephen E. Arnold, January 17, 2008
MSFT – FAST: Will It Make a Difference?
January 16, 2008
On January 15, I received a telephone call from one of the Seybold Group’s analysts. Little did I know that at the same time the call was taking place, Google’s Rodrigo Vaca, a Googler working in the Enterprise Division, posted “Make a Fast Switch to Google.”
The question posed to me by Seybold’s representative was: “Will Microsoft’s buying Fast Search & Transfer make a difference?” My answer, and I am summarizing, was: “No, certainly not in the short-term. In fact, looking 12 to 18 months out, I don’t think the behind-the-firewall market will be significantly affected by this $1.2 billion buy out.”
After I made the statement, there was a longish pause as the caller thought about what I asserted. The follow up question was, “Why do you say that?” There are three reasons, and I want to highlight them because most of the coverage of the impending deal has been interesting but uninformed. Let me offer my analysis:
- The technology for behind-the-firewall search is stable. Most of the vendors offer systems that work reasonably well when properly configured, resourced, and maintained. In fact, if I were to demonstrate three different systems to you, gentle reader, you would be hard pressed to tell me which system was which, and you would not be able to point out the strengths and weaknesses of properly deployed systems. Let me be clear. Very few vendors offer a search-and-retrieval solution significantly different from its competitors. Procurements get slowed because those on the procurement team have a difficult time differentiating among the sales pitches, the systems, and the deals offered by vendors. I’ve been doing search-related work for 35 years, and I get confused when I hear the latest briefing from a vendor.
- An organization with five or more enterprise search systems usually grandfathers an older system. Every system has its supporters, and it is a hassle to rip out and replace an existing system and convert that system’s habitual users. Behind-the-firewall search, therefore, is often additive. An organization leaves well enough alone and uses its resources to deploy the new system. Ergo: large organizations have multiple search and retrieval systems. No wonder employees are dissatisfied with behind-the-firewall search. A person looking for information must search local machines, the content management system, the enterprise accounting system, and whatever search systems are running in departments, acquired companies, and in the information technology department. Think gradualism and accretion, not radical change in search and retrieval.
- The technical professionals at an organization have an investment of time in their incumbent systems. An Oracle database administrator wants to work with Oracle products. The learning curve is reduced, and the library of expertise in the DBA’s head is useful in troubleshooting Oracle-centric software and systems. The same holds true for SharePoint-certified engineers. An IT professional who has a stable Fast Search installation, a working DB2 data warehouse, an Autonomy search stub in a BEA Systems’ application server, a Google Search Appliance in marketing, and a Stratify eDiscovery system in the legal department doesn’t want to rock the boat. Therefore, the company’s own technical team stonewalls change.
I’m not sure how many people in the behind-the-firewall business thank their lucky stars for enterprise inertia. Radical change, particularly in search and retrieval, is an oxymoron. The Seybold interviewer was surprised that I was essentially saying, “Whoever sells the customer first has a leg up. An incumbent has the equivalent of a cereal brand with shelf space in the grocery store.”
Now, let’s shift to Mr. Vaca’s assertion that “confused and concerned customers” may want to license a Google Search Appliance in order to avoid the messiness (implied) with the purchase of Fast Search by Microsoft. The idea is one of those that seems very logical. A big company like Microsoft buys a company with 2,500 corporate customers. Microsoft is, well, Microsoft. Therefore, jump to Google. I don’t know Mr. Vaca, but I have an image of a good looking, earnest, and very smart graduate of a name brand university. (I graduated from a cow college on the prairie, so I am probably revealing my own inferiority with this conjured image of Mr. Vaca.)
The problem is that the local logic of Mr. Vaca is not the real-world logic of an organization with an investment in Fast Search & Transfer technology. Most information technology professionals want to live with something that is stable, good enough, and reasonably well understood. Google seems to have made a similar offer in November 2005 when the Autonomy purchase of Verity became known. Nothing changed in 2005, and nothing will change in 2008 in terms of defectors leaving Fast Search for the welcoming arms of Google.
To conclude: the market for behind-the-firewall search is not much different from the market for other enterprise software at this time. However, two things make the behind-the-firewall sector volatile. First, because of the similar performance of the systems now on offer, customers may well be willing to embrace a solution that is larger than information retrieval. A solution for information access and data management may be sufficiently different to allow an innovator to attack from above; that is, offer a meta-solution that today’s vendors can’t see coming and can’t duplicate. Google, for example, is capable of such a meta-attack. IBM is another firm able to leapfrog a market.
Second, the pain of getting a behind-the-firewall search up and stable is significant. Remember: there is human effort, money, infrastructure, users, and the mind numbing costs of content transformation operating to prevent sudden changes of direction in the type of organization with which I am familiar.
Bottom line: the Microsoft – Fast deal is making headlines. The deal is fueling increased awareness of search, public relations, and investor frenzy. For the time being, the deal will not make a significant difference in the present landscape for behind-the-firewall search. Looking forward, the opportunity for an innovator is to break out of the search-and-retrieval circumvallation. Mergers can’t deliver the impact needed to change the market rules.
Stephen E. Arnold, January 16, 2008, 9 am
Google Responds to Jarg Allegation
January 15, 2008
Intranet Journal reported on January 14, 2008, that Google denies the Jarg allegation of patent infringement. I’m not an attorney, and claims about online processes are complex. You can read US5694593, “Distributed Computer Database System and Method” at the USPTO or Google Patents Service.
As I understand the issue, the Jarg patent covers technology that Jarg believes is used in Google’s “plumbing.” In The Google Legacy and in Google Version 2.0, I dig into some of the inner workings that allow Google to deliver the services that comprise what I call the Googleplex. Note: I borrowed this term from Google’s own jargon for its office complex in Mountain View, California.
If the Jarg allegation has merit, Google may be required to make adjustments, pay Jarg, or take some other action. I have read the Jarg patent, and I do see some parallels. In my reading of more than 250 Google patent applications and patents, the key theme is not the mechanics of operations. Most of Google’s inventions make use of technology long taught in college courses in computer science, software engineering, and mathematics.
What sets Google’s inventions apart are the engineering innovations that allow the company to operate at what I call “Google scale.” There are Google presentations, technical papers, and public comments that underscore the meaning of scale at Google. According to Googlers Jeff Dean and Sanjay Ghemawat, Google crunches upwards of 20 petabytes a day via 100,000 MapReduce jobs. A petabyte is 1,000 terabytes. What’s more interesting is that Google spawns hundreds of sub-processes for each query it receives. The millisecond response time is possible because Google has done a very good job of taking what has been available as standard procedures, blending in ideas from the scientists doing research at universities, and applying advanced mathematics to make its application platform work.
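Here is a rough, back-of-the-envelope illustration of what those figures imply; the arithmetic is mine, not Google’s.

```python
PETABYTE_IN_TB = 1_000       # terabytes per petabyte, as defined above
daily_petabytes = 20         # petabytes crunched per day (Dean and Ghemawat)
daily_jobs = 100_000         # MapReduce jobs per day

tb_per_second = daily_petabytes * PETABYTE_IN_TB / (24 * 60 * 60)
tb_per_job = daily_petabytes * PETABYTE_IN_TB / daily_jobs

print(f"~{tb_per_second:.2f} TB processed per second, sustained")  # ~0.23
print(f"~{tb_per_job:.1f} TB per MapReduce job on average")        # ~0.2
```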
Remember that search was an orphan when Google emerged from the Backrub test. Excite, Lycos, Microsoft, and Yahoo saw search as a no-brainer, a dead end of sorts when compared to the notion of a portal. University research chugged along with technology transfer programs at major institutions encouraging experimentation, commercialization, and patent applications.
What makes the Jarg allegation interesting is that most universities and their researchers tap U.S. government funds. Somewhere in the labs at Syracuse University, Stanford University, the University of California at Los Angeles, or the University of Illinois there’s government-funded activity underway. In my experience, when government money sprays over a technology, there is a possibility that the research must comply with government guidelines for any invention that evolves from these dollops of money.
When I read the original Google PageRank patent application US6285999, Method of Node Ranking in a Linked Database (September 4, 2001) I was surprised at one fact in this remarkable blend of engineering and plain old voting. That fact was that the assignee of the invention was not Mr. Page. The assignee was The Board of Trustees of the Leland Stanford Junior University. The provisional patent application was filed on January 10, 1997, and I — almost eight years later — just realized that the work was performed under a U.S. government grant.
I will be interested in the trajectory of the Jarg allegation. I wonder if any of the work disclosed in the Jarg patent has an interesting family tree. I am also curious about how the various data management practices, generally well known in the parallel computing niche, have been widely disseminated by professors teaching their students basic information and illuminating those lectures with real-life examples from the research work conducted in labs in colleges and universities in the U.S.
Litigation, in my experience as an expert witness, is a tedious, intellectually-demanding process. Engineering does not map point for point to the law. When the U.S. government began explicitly encouraging recipients of its funds to make an effort to commercialize their inventions, the technology transfer business got a jolt of adrenaline. Patent applications and patents on novel approaches from government-funded research contribute to the flood of patent work choking the desks of USPTO professionals. Figuring out what’s going on in complex inventions and then determining which function is sufficiently novel to withstand the scrutiny of cadres of lawyers and their experts is expensive, time-consuming, and often maddeningly uncertain.
Not surprisingly, most litigation is settled out of court. Sometimes one party runs out of cash or the willingness to pay the cost of operating the litigation process. Think millions of dollars. Measure the effort in person years.
As the Intranet Journal story says: “Google has responded to the patent-infringement lawsuit filed against it by semantic search vendor Jarg and Northeastern University, denying the parties’ claims of patent infringement. Google has also filed a counterclaim, asking the court to dismiss the patent in question as invalid.”
Will this be the computer scientists’ version of the OJ Simpson trial? Stay tuned.
Stephen E. Arnold, January 15, 2008, Noon eastern
Vertical Search: A Chill Blast from the Past
January 15, 2008
Two years ago, a prestigious New York investment banker asked me to attend a meeting without compensation. I knew my father was correct when he said, “Be a banker. That’s where the money is.” My father didn’t know Willie Sutton, but he had insight about money. The day I arrived, the bankers’ topic was “vertical search,” the next big money maker in search, according to the vice president who escorted me into a conference room overlooking the East River.
As I understood the notion from these financial engineers, certain parties (translation: publishers) had a goldmine of content (translation: high-value information created by staff writers and freelancers). The question asked was: “Isn’t a revenue play possible using search-and-retrieval technology and a subscription model?”
There’s only one answer that New York bankers want to hear, and that is, “I think there is an opportunity for an upside.” I repeated the catch phrase, and the five money mavens smiled. I was a good Kentucky consultant, and I had on shoes too.
My recollection is that everyone in the Park Avenue meeting room was well-groomed, scrupulously polite, and gracefully clueless about online. The folks asking me to stop by for a chat listened to me for about 60 seconds and then fired questions at me about Web 2.0 technology (which I don’t fully grasp), online stickiness (which means repeat visitors and time spent on a Web site), and online revenue growth (which I definitely understand after getting whipsawed with costs in 1993 when I was involved with The Point (Top 5% of the Internet)). Note: we sold this site to Lycos in 1995, and I vowed not to catch spreadsheet fever again. Spreadsheet fever is particularly contagious in the offices of New York banks.
This morning — Tuesday, January 15, 2008 — I read a news story about Convera’s vertical search solution. The article explained that Lloyd’s List, a portal reporting the doings in the shipping industry, was going online with a “vertical search solution.”
The idea, as I understand it, is that a new online service called Maritime Answers will become available in the future. Convera Corporation, a one-time big dog in the search-and-retrieval sled races, would use its “technical expertise to provide a powerful search tool for the shipping community.” (Note: in this essay I am not discussing the sale of Convera’s search-and-retrieval business to Fast Search & Transfer or the capturing by Autonomy of some of Convera’s key sales professionals in 2007.)
Vertical Search Defined
In my first edition of The Enterprise Search Report, I included a section about vertical search. I cut out that material in 2003 because the idea seemed outside the scope of “behind the firewall” search. In the last five years, the notion of vertical search has continued to pop up as a way to serve the needs of a specific segment or constituency in a broader market.
Vertical search means limiting the content to a specific domain. Examples include information for attorneys. Companies in the vertical search business for lawyers include Lexis Nexis (a unit of Reed Elsevier) and Westlaw (a service absorbed into the Thomson Corporation). A person with an interest in a specific topic, therefore, would turn to an online system with substantial information about a particular field. Examples range from the U.S. government’s health information available as Medline Plus to Game Trade Magazine, with tens of thousands of other examples. One could make a good case that Web logs on a specific topic with a search box are vertical search systems.
The idea is appealing because if one looks for information on a narrow topic, a search system with information only on that topic, in theory, makes it easier to find the nugget or answer the user seeks — at least to someone who doesn’t know much about the vagaries of online information. I will return to this idea in a moment.
Commercial Databases: The Origin of Vertical Search
Most readers of this Web log will have little experience with using commercial databases. The big online vendors have found themselves under siege by the Web and their own actions.
In the late 1960s when the commercial online business began with an injection of U.S. government funding, the only kind of database possible was one that was very narrow. The commercial online services offered specific collections of information on very narrow topics or information confined to a specific technical discipline. By 1980, there were some general business databases available, but these were narrowly constrained by editorial policies.
In order to make the early search-and-retrieval systems useful, database publishers (the name given to the people and companies who built databases) had to create fields, or what today might be called the elements of an XML document type definition. The database builders would pay indexers to put the name of the author, the title of the source, the key words from a controlled term list, and other data (now called metadata) into these fields.
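A minimal sketch of what such a fielded record might look like in today’s terms; the field names and the sample article are my own invention, not any publisher’s actual schema.

```python
# One bibliographic record, roughly as a database publisher's indexers built it.
record = {
    "author": "Doe, Jane",                     # hypothetical author
    "title": "Positioning a Regional Brand",   # hypothetical article title
    "source": "ABI / INFORM",                  # the file, in Dialog terms
    "update_code": "9999",                     # which update the record arrived in (UD)
    "classification_codes": ["7600"],          # controlled business-topic codes (CC)
    "controlled_terms": ["marketing", "market research"],
    "abstract": "A short, human-written summary of the article.",
}
print(record["controlled_terms"])
```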
In 1980, the user would pay a fee to get an account with an online vendor. The leaders of a quarter century ago mean very little to most online users today. The Googles and Microsofts of 1980 were Dialog Corporation, BRS, SDC, and a handful of others such as DataStar.
Every database or “file” on these systems was a vertical database. Users of these commercial systems would have to learn the editorial policy of a particular database; for example, ABI / INFORM or PROMT. When Dialog was king, the service offered more than 300 commercial databases, and most users picked a particular file and entered queries using a proprietary syntax. For example, to locate marketing information from the most recent update to the ABI / INFORM database one would enter into the Dialog command line: SS UD=9999 and CC=76?? and marketing. If a user wanted chemical information, the Chemical Abstracts service required the user to know the specific names and structures of chemicals.
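Here is a rough modern rendering of what that Dialog command line is doing, expressed against a fielded record like the one sketched above. The wildcard 76?? simply means any classification code beginning with 76; the record itself is hypothetical.

```python
import fnmatch

# A minimal fielded record (see the sketch in the previous section).
record = {
    "update_code": "9999",
    "classification_codes": ["7600"],
    "controlled_terms": ["marketing", "market research"],
}

def matches(rec, update_code, cc_pattern, term):
    """Rough modern equivalent of: SS UD=9999 and CC=76?? and marketing."""
    return (
        rec["update_code"] == update_code
        and any(fnmatch.fnmatch(cc, cc_pattern) for cc in rec["classification_codes"])
        and term in rec["controlled_terms"]
    )

print(matches(record, "9999", "76??", "marketing"))  # True
```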
Characteristics of These Original Vertical Databases
A peculiar characteristic of a collection of information on a topic or in a field is not understood by most users or investment bankers. The narrower the content collection, the greater the need for a specialized vocabulary. Let me give an example. In the ABI / INFORM file it was pointless to search for the concept via the word “management.” The entire database was “about” management. Therefore, a careless query would, in theory, return a large number of hits. We, therefore, made “management” a stop word; that is, one that would not return results. We forced users to access the content via a controlled vocabulary, complete with Use For and See Also cross references. We created a business-centric classification coding scheme so a user could retrieve the marketing information using the command CC=76??.
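A minimal sketch of the stop word decision described above, using a toy indexing routine rather than the actual ABI / INFORM production software:

```python
STOP_WORDS = {"management"}   # too common in a management database to be useful

def index_terms(text, stop_words=STOP_WORDS):
    """Return the words worth posting to the index, dropping stop words."""
    return [word for word in text.lower().split() if word not in stop_words]

print(index_terms("Management approaches to marketing strategy"))
# ['approaches', 'to', 'marketing', 'strategy']
```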
Another attribute of vertical content or deep information on a narrow subject is that the terminology shifts. When a new development occurred in oil and gas, the American Petroleum Institute had to identify the new term and take steps to map the new idea to content “about” that new subject. Let me give an example from a less specialized field than oil exploration. You know about an acquisition. The term means one company buys another. In business, however, the word takeover may be used to describe this action. In financial circles, there will be leveraged buyouts, a venture capital buyout, or a management buyout. In short, the words used to describe an acquisition evidence the power of English and the difficulty of creating a controlled vocabulary for certain fields. The paradox is that the deeper the content in detail and through time, the more complicated the jargon becomes. A failure to search for the appropriate terms means that information on the topic is not retrieved. In the search systems of yore, getting the information on acquisitions from ABI / INFORM required an explicit query with all of the relevant terms present.
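Here is a sketch of how a controlled vocabulary with Use For references tames that terminology drift. The term list is illustrative; it is not the actual ABI / INFORM thesaurus.

```python
# Preferred controlled term -> variant wordings writers actually use.
USE_FOR = {
    "acquisitions": [
        "takeover",
        "leveraged buyout",
        "management buyout",
        "venture capital buyout",
    ],
}

# Invert the table so any variant maps back to the preferred term.
VARIANT_TO_PREFERRED = {
    variant: preferred
    for preferred, variants in USE_FOR.items()
    for variant in variants
}

def normalize(term):
    """Map a searcher's wording onto the controlled vocabulary."""
    term = term.lower()
    return VARIANT_TO_PREFERRED.get(term, term)

print(normalize("Takeover"))           # acquisitions
print(normalize("management buyout"))  # acquisitions
```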
Vertical Search 2008
Convera is a company that has faced some interesting and challenging experiences. The company’s core technology was rooted in scanning paper documents, converting these documents to ASCII via optical character recognition, and then making the documents searchable via an interface. The company acquired ConQuest Software, developed by a former colleague of mine at Booz, Allen & Hamilton, for $33 million in 1995. Convera also acquired Semio’s Claude Vogel in 2002, a rocket scientist who has since left Convera. Convera took backing from Allen & Co., a New York firm, and embarked on a journey to reinvent itself. This is an intriguing case example, and I may write about it in the future.
The name “Convera” was adopted in 2000 when Excalibur Technologies landed a deal with Intel. After the Intel deal went south, at about the same time a Convera deal with the NBA ran aground, the Convera name stuck. Convera in the last eight years has worked to reduce its debt, find new sources of revenue, and finally divest itself of its search-and-retrieval business, emerging as a provider of vertical search. I have not done justice to a particularly interesting case study in the hurdles companies face when those firms try to make money without a Google-type business model.
Now Convera is in the vertical search business. It uses its content acquisition technology, or crawlers and parsers, to build indexes. Convera has word lists for specific markets such as law enforcement and health as well as technology that automatically indexes, classifies, and tags processed content. The company also has server farms that can provide hosted or managed search services to its customers.
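I have no inside knowledge of Convera’s code. The following is only a conceptual sketch of what a vertical content acquisition pipeline does: take documents fetched from a bounded list of sources, then tag each one against a market-specific word list. The site names and term list are hypothetical.

```python
# Documents already fetched from the handful of sites a customer cares about.
fetched = [
    ("https://shipping-news.example/a1", "Tanker rates rise as charterers scramble"),
    ("https://shipping-news.example/a2", "New port security rules for container cargo"),
]

# Market-specific word list used to tag processed content.
MARITIME_TERMS = {"tanker", "charterer", "port", "cargo", "container"}

def tag(text, vocabulary):
    """Return the vertical-vocabulary terms that appear in a document."""
    words = {w.strip(".,").lower() for w in text.split()}
    # crude prefix match so "charterers" still hits "charterer"
    return sorted(term for term in vocabulary if any(w.startswith(term) for w in words))

index = [{"url": url, "tags": tag(text, MARITIME_TERMS)} for url, text in fetched]
for doc in index:
    print(doc["url"], doc["tags"])
```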
Instead of competing with Google in the public Web indexing space, Convera’s business model, as I understand it, is to approach a client who wants to build a vertical content collection. Convera then indexes the content of certain Web sites and any content the customer, such as a publisher, has. The customer pays Convera for its services. The customer either gives away access to the content collection or charges users a fee to access the content.
In short, Convera is in the vertical search business. The idea is that Convera’s stakeholders get money by selling services, not licensing a search-and-retrieval engine to an enterprise. Convera’s interesting history makes clear that enterprise software and joint ventures such as those with Intel can lose big money, more than $600 million give or take a couple hundred million. Obviously Convera’s original business model lacked the lift its management teams projected.
The Value of Vertical Search
The value of vertical search depends upon several factors that have nothing to do with technology. The first factor is the desire of a customer such as a publisher like Lloyd’s List to find a new way to generate growth and zip from a long-in-the-tooth information service. Publishers are in a tough spot. Most are not very good at technical foresight. More problematic, the online options can cannibalize their existing revenues. As a business segment, traditional publishing is a hostile place for 17th-century business models.
Another factor is the skill of the marketers and sales professionals. Never underestimate the value of a smooth talking peddler. Big deals can be done on the basis of charm and a dollop of FUD, fear-uncertainty-doubt.
A third element is the environmental pressures that come from companies and innovators largely indifferent to established businesses. One example is the Google-Microsoft-Yahoo activity. Each of these companies is offering online access to information mostly without direct fees to the user. The advertisers foot the bill. All three are digitizing books, indexing Web logs or social media, and working with selected third parties to offer certain information. Even Amazon is in the game with its Kindle device, online service, and courtesy fee for certain online Web log content. Executives at these companies know about the problems publishers face, but there’s not much they can do to alter the tectonic shift underway in information access. I know I wouldn’t start a traditional magazine or newspaper even though for decades I was an executive in newspaper and publishing companies like the Courier Journal & Louisville Times and Ziff Communications.
Vertical Search: Google Style
You can create your own vertical search system now. You don’t have to pay Convera’s wizards for this service. In fact, you don’t have to know how to program or do much more than activate your browser. Google will allow anyone to create a custom search engine, which is that company’s buzzword for a vertical search system. Navigate to Google’s CSE page and explore. If you want to see the service in action, navigate to Macworld’s beta.
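For readers who want to see the underlying idea without opening the CSE control panel: restricting queries to a hand-picked list of sites is, conceptually, the familiar site: operator applied many times over. A sketch, with illustrative site names:

```python
def vertical_query(user_query, sites):
    """Build a Web query limited to a hand-picked list of sites."""
    site_filter = " OR ".join(f"site:{site}" for site in sites)
    return f"{user_query} ({site_filter})"

# Hypothetical slice of Mac-oriented sites, in the spirit of the Macworld example.
print(vertical_query("macbook battery life", ["macworld.com", "apple.com"]))
# macbook battery life (site:macworld.com OR site:apple.com)
```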
We’ve come full circle in a sense. The original online market was only vertical search; that is, very specific collections of content on a particular topic or discipline. Then we shifted to indexing the world of information. Now, the Google system allows anyone to create a very narrow domain of content.
What’s this mean? First, I am not sure the Convera for-fee approach will be as financially rewarding as the company’s stakeholders expect. Free is tough to beat. For a publisher wanting to index proprietary content, Google will license a Google Search Appliance. With the OneBox API, it is possible to integrate the Google CSE with the content processed by the GSA. Few people recognize that Google’s approach allows a technically savvy person or one who is Googley to replicate most of the functionality on offer from the hundreds of companies competing in the “beyond search” markets.
Second, a narrow collection built by spidering a subset of Web sites will, by definition, face some cost hurdles. Companies providing custom subsets via direct spidering and content processing will see their costs rise as the collections grow. Those costs can be controlled by cutting back on the volume of content spidered and processed. Alternatively, the quality of service or technical innovations will have to be scaled to match available resources. Either way, Google, Microsoft, and Yahoo may control the fate of the vertical search vendors.
Finally, the enthusiasm for vertical search may be predicated on misunderstanding available information. There is a big market for vertical search in law enforcement, intelligence, and pharmaceutical competitive intelligence. There may be a market in other sectors, but with a free service like Google’s getting better with each upgrade to the Google service array, I think secondary and tertiary markets may go with the lower-cost alternative.
Stakeholders in Convera don’t know the outcome of Convera’s vertical search play. One thing is certain. New York bankers are mercurial, and their good humor can disappear with a single disappointing earnings report. I will stick with the motto, “Surf on Google” and leave certain types of search investments to those far smarter than I.
Stephen E. Arnold
January 15, 2008, 10 am
Library Automation: SirsiDynix and Brainware
January 14, 2008
On January 9, 2008, Marshall Breeding, an industry watcher in the library automation space, posted a story called “Perceptions 2007: An International Survey of Library Automation.” I urge anyone interested in online information retrieval to pay particular attention to the data presented in Mr. Breeding’s article. One finding caught my attention. The products of SirsiDynix, Unicorn and Horizon, received low satisfaction scores from libraries responding to the survey. Unicorn, the company’s flagship ILS, performed somewhat better than Horizon. Fourteen percent of libraries running Unicorn and about half of those with Horizon indicate interest in migrating to another system, which is not surprising considering SirsiDynix’s decision not to develop Horizon into the future. Libraries running Horizon scored high interest in open source ILS alternatives. The comments provided by libraries running Horizon voiced an extremely high level of frustration with SirsiDynix as a company and its decision to discontinue Horizon. Many indicated distrust toward the company. The comments from libraries running Unicorn, the system which SirsiDynix selected as the basis for its flagship Symphony ILS, also ran strongly negative, some because of issues with the software, some because of concerns with the company.
SirsiDynix recently announced that it will use an interesting search-and-retrieval system marketed by Brainware, a company located in Northern Virginia, not far from Dulles Airport.
In my forthcoming Beyond Search study, I am profiling the Brainware technology and paying particular attention to the firm’s approach to content processing. SirsiDynix conducted a thorough search for an access technology that would handle diverse content types and deliver fast throughput. The firm selected the Brainware technology to provide its customers with a more powerful information access tool.
Mr. Breeding’s report provides some evidence that SirsiDynix may want to address some customer satisfaction issues. Innovation, or lack thereof, seems to be at the top of the list. SirsiDynix’s decision to partner with Brainware for search-and-retrieval should go a long way toward addressing its customers’ concerns in this important area. This decision is also a testament to the strength of the Brainware solution. Accordingly, Brainware warrants close consideration when intelligent content processing is required.
Most library automation vendors integrate technology in order to deliver a comprehensive solution. The vendors providing these technologies on an OEM or original equipment manufacturing basis are not able to influence in a significant way how their licensees deploy the licensed technology.
In my take on the data in Mr. Breeding’s article, the challenges SirsiDynix faces are not those of Brainware, a company enjoying 50 percent growth in 2007. In Beyond Search, I’m rating Brainware as a “Warrants a Close Look”. I respect the findings in the survey data reported by Mr. Breeding. But let me be clear: don’t mix up SirsiDynix’s business challenges with the Brainware technology. These are separate matters. SirsiDynix, like many library automation companies, faces a wide set of challenges and extraordinary demands from library customers. Brainware provides advanced content processing solutions that should address some of those demands.
Stephen E. Arnold, January 15, 2008
Search Turbocharging: A Boost for Search Company Valuations?
January 13, 2008
PCWorld ran a January 12, 2008, story, “Microsoft’s FAST Bid Signals a Shift in Search.” The story is important because it puts “behind the firewall” search in the high beams.
A Bit of History
Fast Search & Transfer got out of the online Web search and advertising business in early 2003. CNet covered the story thoroughly. Shortly after the deal, either John Lervik or Bjorn Laukli, both Fast Search senior executives, told me, “Fast Search will become the principal provider of enterprise search.” In 2003, there was little reason to doubt this assertion. Fast Search was making progress with lucrative U.S. government contracts via its partner AT&T. Google’s behind-the-firewall search efforts were modest. Autonomy and Endeca each had specific functionality that generally allowed the companies to compete in a gentlemanly way, often selling the same Fortune 1000 company their search systems. Autonomy was automatic and able to process large volumes of unstructured content; Endeca at that time was more adept at handling structured information and work flow applications. Fast Search was betting that it could attack the enterprise market and win big.
Now, slightly more than four years later, the bold bet on the enterprise market has created an interesting story. The decision to get out of Web search and advertising may prove to be one of the most interesting decisions in search and retrieval. Most of the coverage of the Microsoft offer to buy Fast Search focuses on the here and now, not the history. Fast Search suffered some financial setbacks in 2006 and 2007, but the real setback from my point of view is in the broader enterprise market.
Some Rough Numbers for Context
Specifically, after four years of playing out its enterprise strategy, Fast Search has fallen behind Autonomy. That company’s revenues are likely to be about 30 percent higher than Fast Search’s on an annualized basis, roughly $300 million versus $200 million over the last 12 months. (I’m rounding gross revenues for illustrative purposes.) Endeca is likely to hit the $90 to $100 million target in 2008, so these three companies collectively generate gross revenues of about $600 million. Now here’s the kicker. Google’s often maligned Google Search Appliance has more than 8,000 licensees. I estimate that the gross revenue from the GSA is about $350 million per year. Even if I am off in my estimates (Google refuses to answer my questions or acknowledge my existence), my research suggests that as of December 31, 2007, Google was the largest vendor of “behind the firewall” search. This estimate excludes the bundled search in the 65 million SharePoint installations and the inclusion of search in other enterprise applications.
One more data point, and again I am going to round off the numbers to make a larger point. Google’s GSA revenue is a fraction of Google’s projected $14 billion gross revenue in calendar 2007. Recall that at the time Fast Search got out of Web search and advertising, Google was generating somewhere in the $50 to $100 million per year range and Fast Search was reporting revenue of about $40 million. Since 2003, Google has caught up with Fast Search and bypassed it in revenue generated from the enterprise search market sector.
The Fast Search bet brought the company the high-octane Microsoft bid. However, revenue issues, employee rationalization, and eroding days sales outstanding figures suggest that the Fast Search vehicle has some mechanical problems. Perhaps technology is the issue? Maybe management lacked the MBA skills to keep the pit crew working at its peak? Could the market itself have changed in a fundamental way, looking for something simpler that required less tinkering? I certainly don’t know.
What’s Important in a Search Acquisition?
Now back to the PCWorld story by IDG’s Chris Kanaracus. We learn that Microsoft got a deal at $1.2 billion and solid technology. Furthermore, various pundits and industry executives focus on the “importance” of search. One type of “importance” is financial because $1.2 billion for a company with $200 million in revenue translates to six times annual revenue. Another type of importance is environmental because the underperforming “behind the firewall” search sector got some much-needed publicity.
What we learn from this article is that “behind the firewall” search is still highly uncertain. There’s nothing in the Microsoft offer that clarifies the specifics of Microsoft’s use of the Fast Search technology. The larger market remains equally murky. Search is not one thing. Search is key word indexing, text mining, classifying, and metatagging. Each of these components is complicated and tricky to set up and maintain. Furthermore, the vendors in the “behind the firewall” space can change their positioning as easily as an F-1 team switches the decals on its race car.
Another factor is that no one outside of Google knows what Google, arguably the largest vendor of “behind the firewall” search, will or will not do. Ignoring Google in the enterprise market is easy and convenient. A large number of “behind the firewall” search vendors skirt Google or dismiss the company’s technology by commenting about it in an unflattering manner.
I think it’s a mistake. Before the pundits and the vendors start calculating their big paydays from Microsoft’s interest in Fast Search & Transfer, they should remember that Google cannot be ignored; otherwise, the dip in Microsoft shares cited in the PCWorld article might look like a flashing engine warning light. Shifting into high gear is useless if the engine blows up.
Stephen E. Arnold
January 14, 2008
Computerworld’s Take on Enterprise Search
January 12, 2008
Several years ago I received a call. I’m not at liberty to reveal the names of the two callers, but I can say that both callers were employed by the owner of Computerworld, a highly-regarded trade publication. Unlike its weaker sister, InfoWorld, Computerworld remains both a print and online publication. The subject of the call was “enterprise search” or what I now prefer to label “behind-the-firewall search.”
The callers wanted my opinion about a particular vendor of search systems. I provided a few observations and said, “This particular company’s system may not be the optimal choice for your organization.” I was told, “Thanks. Goodbye.” IDG promptly licensed the system against which I cautioned. In December 2007, at the international online meeting in London, England, an acquaintance of mine who works at another IDG company complained about the IDG “enterprise search” system. When I found myself this morning (January 12, 2008) mentioned in an article authored by a professional working at an IDG unit, I invested a few moments with the article, an “FAQ” organized as questions and answers.
In general, the FAQ snugly fitted what I believe are Computerworld’s criteria for excellence. But a few of the comments in the FAQ nibbled at me. I had to work on my new study Beyond Search: What to Do When Your Search System Doesn’t Work, and I had this FAQ chewing at my attention. A Web log can be a useful way to test certain ideas before “official” publication. Even more interesting is that I know that IDG’s incumbent search system, ah, disappoints some users. Now, before the playoff games begin, I have an IDG professional cutting to the heart of search and content processing. The article “FAQ: Why Is Enterprise Search Harder Than Google Web Search?” references me. The author appears to be Eric Lai, and I don’t know him, nor do I have any interaction with Computerworld or its immediate parent, IDG, the International Data Group, the conglomerate assembled by Patrick McGovern (blue suit, red tie, all the time, anywhere, regardless of the occasion).
On the article’s three Web pages (pages I want to add that are chock full of sidebars, advertisements, and complex choices such as Recommendations and White Papers) Mr. Lai’s Socratic dialog unfurls. The subtitle is good too: “Where Format Complications Meet Inflated User Expectations”. I cannot do justice to the writing of a trained, IDC-vetted journalist backed by the crack IDG editorial resources, of course. I’m a lousy writer, backed by my boxer dog Tyson and a moonshine-swilling neighbor next hollow down in Harrods Creek, Kentucky.
Let me hit the key points of the FAQ’s Socratic approach to the thorny issues of “enterprise search”, which is, remember, “behind-the-firewall search” or Intranet search. After thumbnailing each of Mr. Lai’s points, I will offer comments. I invite feedback from IDC, IDG, or anyone who has blundered into my Beyond Search Web log.
Point 1: Function of Enterprise Search
Mr. Lai’s view is that enterprise search makes information “stored in their [users’] corporate network” available. Structured and unstructured data must be manipulated, and Mr. Lai, on the authority of Dr. Yves Schabes, Harvard professor and Teragram founder, reports that a dedicated search system executes queries more rapidly “though it can’t manipulate or numerically analyze the data.”
Beyond Search wants to add that Teragram is an interesting content processing system. In Mr. Lai’s discussion of this first FAQ point, he has created a fruit salad mixed in with his ones and zeros. The phrase “enterprise search” is used as a shorthand way to refer to the information on an organization’s computers. Although a minor point, there is no “enterprise” in “enterprise search” because indexing behind-the-firewall information means deciding what not to index or, at least, what content is available to whom under what circumstances. One of the gotchas in behind-the-firewall search, therefore, is making sure that the system doesn’t find and make available personal information, health and salary information, certain sensitive information such as what division is up for sale, and the like. A second comment I want to make is that Teragram is what I classify as a “content processing system provider”. Teragram’s technology, which has been used at the New York Times and America Online, can be an enhancement to other vendors’ technology. Finally, the “war of words” that rages between various vendors about performance of database systems is quite interesting. My view is that behind-the-firewall search and the new systems on offer from Teragram and others in the content processing sector are responding to a larger data management problem. Content processing is a first step toward breaking free of the limitations of the Codd database. We’re at an inflection point, and the swizzling of technologies presages a far larger change. Think dataspaces, not databases, for example. I discuss dataspaces in my new study out in April 2008, and I hope my discussion will put the mélange of ideas in Mr. Lai’s first Socratic question in a different context. The change from databases to dataspaces is more than a change of two consonants.
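To make the “what not to index, and who sees what” point concrete, here is a minimal sketch. The marker list, the function name, and the return values are my own assumptions for illustration, not a description of any vendor’s product:

  # Illustrative only: an index-time gate that decides whether a document should
  # be indexed at all and, if so, who may see it. Real systems rely on
  # connectors, ACL mappings, and policy engines far richer than this.
  SENSITIVE_MARKERS = ("salary", "health record", "social security", "divestiture")

  def index_decision(doc_text, acl):
      lowered = doc_text.lower()
      if any(marker in lowered for marker in SENSITIVE_MARKERS):
          # Keep the document out of the general index entirely.
          return {"index": False, "reason": "sensitive content marker"}
      # Otherwise index it, but carry the access control list along so the
      # query engine can trim result lists per user.
      return {"index": True, "visible_to": acl or ["all_employees"]}

  print(index_decision("Salary bands for the accounting department", []))
  print(index_decision("Cafeteria menu for next week", []))

Trivial as it is, the sketch shows why the hard decisions happen before a single query is ever run.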
Point 2: Google as the Model for Learning Search
Mr. Lai’s view is that a user of Google won’t necessarily be able to “easily learn” [sic] an “enterprise search” system.
I generally agree with the sentiment of the statement. In Beyond Search I take this idea and expand it to about 250 pages of information, including profiles of 24 companies offering a spectrum of systems, interfaces, and approaches to information access. Most of the vendors’ systems that I profile offer interfaces that allow the user to point and click their way to needed information. Some of the systems absolve the user of having to search for anything because work flow tools and stored queries operate in the background. Just-in-time information delivery makes the modern systems easier to use because the hapless employee doesn’t have to play the “search box guessing game.” Mr. Lai, I believe, finds query formulation undaunting. My research reveals the opposite: formulating a query is difficult for many users of enterprise information access systems. When a deadline looms, employees are uncomfortable trying to guess the key word combination that unlocks the secret to the needed information.
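Here is a minimal sketch of the “stored query” idea in Python. The toy document list, the saved terms, and the print-based delivery are stand-ins I invented for what a real work flow tool would do in the background:

  # A saved query runs in the background and pushes matches to the employee,
  # so no one has to play the search box guessing game at deadline time.
  documents = [
      {"id": 1, "text": "Smith contract draft v3, warranty clause updated"},
      {"id": 2, "text": "Cafeteria menu for next week"},
  ]
  stored_queries = {
      "jane.doe": ["smith contract", "warranty"],
  }

  def run_stored_queries():
      for user, terms in stored_queries.items():
          hits = [d for d in documents
                  if any(t in d["text"].lower() for t in terms)]
          if hits:
              # A real system would use email, a portal widget, or a feed here.
              print(f"To {user}: {len(hits)} item(s) match your saved queries")

  run_stored_queries()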
Point 3: Hard Information Types
I think Mr. Lai reveals more about his understanding of search in this FAQ segment. Citing our intrepid Luxembourgian, Dr. Schabes, we learn about eDiscovery, rich media, and the challenge of duplicate documents routinely spat out by content management systems.
The problem is the large amounts of unstructured data in an organization. Let’s rein in this line of argument. There are multiple challenges in behind-the-firewall search. What makes information “hard” (I interpret the word “hard” as meaning “complex”) involves several little-understood factors colliding in interesting ways:
- In an organization there may be many versions of documents, many copies of various versions, and different forms of those documents; for example, a sales person may have the Word version of a contract on his departmental server, but there may be an Adobe Portable Document Format version attached to the email telling the client to sign it and fax the PDF back. You may have had to sift through these variants in your own work.
- There are file types that are in wide use. Many of these may be renegades; that is, the organization’s over-worked technical staff may be able to deal with only some of them. Other file types such as iPod files, digital videos of a sales pitch captured on a PR person’s digital video recorder, or someone’s version of a document exported using Word 2007’s XML format are troublesome. Systems that process content for search and retrieval have filters to handle most common file types. The odd ducks require some special care and feeding (a sketch of this routing appears below). Translation: coding filters, manual work, and figuring out what to do with the file types for easy access.
- Results in the form of a laundry list are useful for some types of queries but not for others. The more types of content processed by the system, the less likely a laundry list will be useful. Not surprisingly, advanced content processing systems produce reports, graphic displays, suggestions, and interactive maps. When videos and audio programs are added to the mix, the system must be able to render that information. Most organizations’ networks are not set up to shove 200 megabyte video files to and fro with abandon or alacrity. You can imagine the research, planning, and thought that must go into figuring out what to do with these types of digital content.
None is “hard”. What’s difficult is the problem solving needed to make these data and information useful to an employee so work gets done quickly and in an informed manner. Not surprisingly, Mr. Lai’s Socratic approach leaves a few nuances in the tiny spaces of the recitation of what he thinks he heard Mr. Schabes suggest. Note that I know Mr. Schabes, and he’s an expert on rule-based content processing and Teragram’s original rule nesting technique, a professor at Harvard, and a respected computer scientist. So “hard” may not be Teragram’s preferred word. It’s not mine.
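As a purely illustrative sketch of the filter problem raised in the second item above (the extensions, filter names, and quarantine step are my assumptions, not any vendor’s actual pipeline), a content processing system might route files along these lines:

  # Route each file to a format-specific filter; set aside the odd ducks for
  # manual work or a custom filter.
  FILTERS = {
      ".doc": "office_filter",
      ".docx": "office_filter",
      ".pdf": "pdf_filter",
      ".html": "html_filter",
      ".xml": "xml_filter",
  }

  def route(filename):
      ext = filename[filename.rfind("."):].lower() if "." in filename else ""
      filter_name = FILTERS.get(ext)
      if filter_name is None:
          # Renegade file type: video, audio, unusual export formats, and so on.
          return f"{filename}: quarantined for special care and feeding"
      return f"{filename}: processed by {filter_name}"

  for name in ["contract_v3.docx", "signed_contract.pdf", "sales_pitch.mov"]:
      print(route(name))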
Point 4: Enterprise Search Is No More Difficult than Web Search
Mr. Lai’s question burrows to the root of much consternation in search and retrieval. “Enterprise search” is difficult.
My view is that any type of search ranks as one of the hardest problems in computer science. There are different types of problems with each variety of search–Web, behind-the-firewall, video, question answering, discovery, etc. The reason is that information itself is a very, very complicated aspect of human behavior. Dissatisfaction with “behind-the-firewall” search is due to many factors. Some are technical. In my work, when I see yellow sticky notes on monitors or observe piles of paper next to a desk, I know there’s an information access problem. These signs signal the system doesn’t “work”. For some employees, the system is too slow. For others, the system is too complex. A new hire may not know how to finagle the system to output what’s needed. Another employee may be too frazzled to be able to remember what to do due to a larger problem which needs immediate attention. Web content is no walk in the park either. But free Web indexing systems have a quick fix for problem content. Google, Microsoft, and Yahoo can ignore the problem content. With billions of pages in the index, missing a couple hundred million with each indexing pass is irrelevant. In an organization, nothing angers a system user quicker than knowing a document has been processed or should have been processed by the search system. When the document cannot be located, the employee either performs a manual search (expensive, slow, and stress inducing) or goes ballistic (cheap, fast, and stress releasing). In either scenario or one in the middle, resentment builds toward the information access system, the IT department, the hapless colleague at the next desk, or maybe the person’s dog at home. To reiterate an earlier point. Search, regardless of type, is extremely challenging. Within each type of search, specific combinations of complexities exist. A different mix of complexities becomes evident within each search implementation. Few have internalized these fundamental truths about finding information via software. Humans often prefer to ask another human for information. I know I do. I have more information access tools than a nerd should possess. Each has its benefits. Each has its limitations. The trick is knowing what tool is needed for a specific information job. Once that is accomplished, one must know how to deal with the security, format, freshness, and other complications of information.
Point 5: Classification and Social Functions
Mr. Lai, like most search users and observers, has a nose that twitches when a “new” solution appears. Automatic classification of documents and support of social content are two of the zippiest content trends today.
Software that can suck in a Word file and automatically determine that the content is “about” the Smith contract, belongs to someone in accounting, and uses the correct flavor of warranty terminology is useful. It’s also like watching Star Trek and hoping your BlackBerry Pearl works like Captain Kirk’s communicator. Today’s systems, including Teragram’s, can index at 75 to 85 percent accuracy in most cases. This percentage can be improved with tuning. When properly set up, modern content processing systems can hit 90 percent. Human indexers, if they are really good, hit in the 85 to 95 percent range. Keep in mind that humans sometimes learn intuitively how to take short cuts. Software learns via fancy algorithms and doesn’t take short cuts. Both humans and machine processing, therefore, have their particular strengths and weaknesses. The best performing systems with which I am familiar rely on humans at certain points in system set up, configuration, and maintenance. Without the proper use of expensive and scarce human wizards, modern systems can veer into the ditch. The phrase “a manager will look at things differently than a salesperson” is spot on. The trick is to recognize this perceptual variance and accommodate it insofar as possible. A failure to deal with the intensely personal nature of some types of search issues is apparent when you visit a company where there are multiple search systems or a company where there’s one system–such as the one in use at IDC–and discover that it does not work too well. (I am tempted to name the vendor, but my desire to avoid a phone call from hostile 20-year-olds is very intense today. I want to watch some of the playoff games on my couch potato television.)
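When I cite 75 to 95 percent accuracy, I mean something that can be measured. Here is a minimal sketch of one way to measure it; the documents and index terms are invented for illustration only:

  # Compare machine-assigned index terms with a human-built "gold" set.
  gold = {
      "doc1": {"smith contract", "warranty", "accounting"},
      "doc2": {"cafeteria", "menu"},
  }
  machine = {
      "doc1": {"smith contract", "warranty", "legal"},
      "doc2": {"cafeteria", "menu"},
  }

  matches = sum(len(gold[d] & machine[d]) for d in gold)
  assigned = sum(len(machine[d]) for d in gold)
  print(f"Machine terms that agree with the human indexer: {matches / assigned:.0%}")

Real evaluations use larger samples and separate precision from recall, but the principle holds: somebody has to build the gold set, and that somebody is an expensive human.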
Point 6: Fast’s Search Better than Google’s Search
Mr. Lai raises a question that plays to America’s fascination with identifying the winner in any situation.
We’re back to a life-or-death, winner-take-all knife fight between Google and Microsoft. No search technology is necessarily better or worse than another. There are very few approaches that are radically different under the hood. Even the highly innovative approaches of companies such as Brainware, with its “associative memory” approach, or Exegy, with its juiced-up appliance hardware and terabytes of onboard RAM, share some fundamentals with other vendors’ systems. If you slogged through my jejune and hopelessly inadequate monographs, The Google Legacy (Infonortics, 2005) and Google Version 2.0 (Infonortics, 2007), and the three editions I wrote of The Enterprise Search Report (CMSWatch.com, 2004, 2005, 2006), you will know that subtle technical distinctions have major search system implications. Search is one of those areas where a minor tweak can yield two quite distinctive systems even though both share similar algorithms. A good example is the difference between Autonomy and Recommind. Both use Bayesian mathematics, but the differences are significant. Which is better? The answer is, “It depends.” For some situations, Autonomy is very solid. For others, Recommind is the system of choice. The same may be said of Coveo, Exalead, ISYS Search Software, Siderean, or Vivisimo, among others. Microsoft will have some work to do to understand what it has purchased. Once that learning is completed, Microsoft will have to make some decisions about how to implement those features into its various products. Google, on the other hand, has a track record of making the behind-the-firewall search in its Google Search Appliance better with each point upgrade. The company has made the GSA better and rolled out the useful OneBox API to make integration and function tweaking easier. The problem with trying to get Google and Microsoft to square off is that each company is playing its own game. Socratic Computerworld professionals want both companies to play one game, on a fight-to-the-death basis, now. My reading of the data I have is that a Thermopylae is not in the interest of either Google or Microsoft, now or in the near future. The companies have different agendas, different business models, and different top-of-mind problems to resolve. The future of search is that it will be invisible when it works. I don’t think that technology is available from either Google or Microsoft at this time.
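Because I mention Bayesian mathematics, a generic sketch may help show what the phrase means in practice. This is the textbook idea only, with probabilities I invented; it is not Autonomy’s or Recommind’s implementation, which is exactly where the significant differences live:

  # Score a document for a concept by summing log likelihood ratios of its
  # terms: positive leans toward the concept, negative leans away.
  import math

  term_given_concept = {"warranty": 0.08, "liability": 0.05, "menu": 0.001}
  term_baseline = {"warranty": 0.01, "liability": 0.01, "menu": 0.01}

  def log_odds(document_terms):
      score = 0.0
      for term in document_terms:
          if term in term_given_concept:
              score += math.log(term_given_concept[term] / term_baseline[term])
      return score

  print(log_odds(["warranty", "liability"]))  # leans toward the "contracts" concept
  print(log_odds(["menu"]))                   # leans away from it

How the probabilities are estimated, smoothed, and updated is where vendors differentiate themselves, which is why two systems built on the same mathematics can behave quite differently.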
Point 7: Consolidation
Mr. Lai wants to rev the uncertainty engine, I think. We learn from the FAQ that search is still a small, largely unknown market sector. We learn that big companies may buy smaller companies.
My view is that consolidation is a feature of our market economy. Mergers and acquisitions are part of the blood and bones of business, not a characteristic of the present search or content processing sector. The key point that is not addressed is the difficulty of generating a sustainable business selling a fuzzy solution to a tough problem. Philosophers have been trying to figure out information for a long time and have done a pretty miserable job as far as I can tell. Software that ventures into information is going to face some challenges. There’s user satisfaction, return on investment, appropriate performance, and the other factors referenced in this essay. The forces that will ripple through behind-the-firewall search are:
- Business failure. There are too many vendors and too few buyers willing to pay enough to keep the more than 350 companies in the sector sustainable
- Mergers. A company with customers and so-so technology is probably more valuable than a company with great technology and few customers. I have read that Microsoft was buying customers, not Fast Search & Transfer’s technology. Maybe? Maybe not.
- Divestitures and spin outs. Keep in mind that Inxight Software, an early leader in content processing, was pushed out of Xerox’s Palo Alto Research Center. The fact that it was reported as an acquisition by Business Objects emphasized the end game. The start was, “Okay, it’s time to leave the nest.”
The other factor is not consolidation; it is absorption. Information is too important to leave in a stand-alone application. That’s why Microsoft’s Mr. Raikes seems eager to point out that Fast Search would become part of SharePoint.
Net-Net
The future, therefore, is that there will be less and less enthusiasm for expensive, stand-alone “behind-the-firewall” search. Search is becoming part of larger, higher-value information access solutions.
Stephen E. Arnold
January 13, 2008
A Turning Point in Search? Is the Microsoft-FAST Deal Significant?
January 11, 2008
The interest in the Microsoft purchase of Fast Search & Transfer is remarkable. What caught my attention was the macho nature of some of the news stories in my Yahoo! Alert this morning. I found these revelatory:
First, the January 9, 2008, story, “Microsoft Goes for Google Jugular with Search Buy“. The story is good, and it presents what I call an MBA analysis of the deal, albeit in about 900 words. From my point of view, the key point is in the last paragraph, which quotes Fast Search & Transfer’s John Lervik: “‘We have simply not focused on (operational execution),’ Lervik told ComputerWire last year. Had it focused a little more on execution, it might have gone on to become a gorilla in enterprise search. Instead, it has succumbed to acquisition by a company playing catch-up in the space.”
Second, the January 8, 2008, Information Week story by Paul McDougall. The title of this story is “Microsoft’s Fast Search Bid Puts Heat on Google, IBM”. The notion of “heat” is interesting, particularly in the behind-the-firewall market. The key point for me in this analysis is: “Microsoft plans to marry Fast’s enterprise technology with its SharePoint software — a suite of server and desktop products that give workers an interface through which they can retrieve information and collaborate with colleagues. It’s a combination that Google and IBM will have to match — and analysts say that’s sure to put Autonomy and Endeca in play.” The “heat”, I believe, refers to increased intensity among the identified vendors; that is, blue-chip companies such as Autonomy, Endeca, and IBM.
The third — and I want to keep this list manageable — is Bill Snyder’s January 8, 2008, story in InfoWorld (a publication that has been trying to make a transition from print money pit to online Web log revenue engine). This story sports the headline: “Microsoft Tries an End Run around Google“. The most interesting point in this analysis for me was this sentence: “Despite all of Microsoft’s efforts, the most recent tally of search market share by Hitwise, shows Google gaining share at Microsoft’s expense. Here are the numbers. Google’s share in December was 65.98%, up from 65.10% the previous month; while Microsoft’s share (including MSN search and Live.com) dropped to 7.04% from 7.09%.” The Web search share has zero to do with enterprise search share, but that seems to be of little significance. Indifference to the distinctions within the generalization about search is a characteristic of many discussions about information access. Search is search is search. Alas, there are many flavors of search. Fuzziness does little to advance the reader’s understanding of what’s really going on in the “behind the firewall” search sector.
The Range of Opinion
These stories provide some texture to the $1.2 billion offer Microsoft made for Fast Search & Transfer. The “jugular” story underscores the growing awareness among some pundits and journalists that Microsoft has to “kill” Google. Acquisitions of complex technology with some management challenges added to spice up the marriage require some internal housekeeping. Once the union is complete, the couple can figure out how to cut another company’s jugular. Unfortunately, the headline tells us more about a perception that Microsoft has not been able to respond to Google. So, a big play is positioned as the single action that will change the “battle” between Google and Microsoft. I believe that there is no battle. Google is Googling along with its “controlled chaos” approach to many markets, not just enterprise search. In fact, Google seems almost indifferent to the so-called show downs in which it is engaged. Google went on holiday when the FCC bids were due, and the company seems somewhat unfocused with regard to its challenges to Microsoft if I read what its executives say correctly. The choice of the word “jugular”, therefore, suggests an attack. My view is that Microsoft wanted customers, a presence, engineers, and technology. If Microsoft goes for Google’s jugular, it will take some time before the knife hits Googzilla.
The second story reveals that consolidation is underway. I can’t disagree with this implication. Endeca’s initial public offering, once scheduled for 2007, failed to materialize. My sources tell me that Endeca continues to discuss partnerships, mergers, and other relationships to leverage the company’s “guided navigation” technology. You can see it in action on the U.K. newspaper The Guardian‘s Web site. My tally of companies in the search business now has more than 350 entries. I try to keep the list up to date, but with companies going out of business (Arikus, Entopia, and WiseNut, to name three I recall without consulting my online files), entering the business (ZoomItIn), repositioning (Brainware), hiding in perpetual stealth mode (PowerSet), pretending that marketing is not necessary (Linguamatics), and other swirls and eddies, keeping the tally current is a challenge. But the metaphor “heat” is acceptable. My view is that the search sector is a very complex, blurry, and evolving Petri dish of technology in the much larger “information access space”. Investors will react to the “heat” metaphor because it connotes a big payday when a deal like Microsoft’s takes place.
The third story is interesting for two reasons. First, the author has the idea that Microsoft does not want to cut through Google. I thought of a Spartan phalanx marching relentlessly toward a group of debating Athenians. Mr. Snyder or his editor uses a football analogy. The notion of Microsoft using the tactic of skirting a battle and attacking on the flank or the rear is intriguing. The issue I have is that if Google is indeed in “controlled chaos” mode, it can field disorganized, scattered groups of war fighters. As a result, there may be no battle unit to outflank or get behind. If the Googlers are consistently “chaotic”, Google’s troops may be playing one game; Microsoft, another. Second, the data quoted pertain to Web search, not Intranet search. Apples and oranges, I submit.
So what?
Six Key Moments in Search and Retrieval
The point of my looking at these three excellent examples of deal analysis is to lead up to the list below. These events are what I consider key turning points in “behind the firewall” search. The Microsoft – Fast deal, while significant as the first and, therefore, the biggest search deal of 2008, has yet to demonstrate that it will rank with these events:
- In-Q-Tel. Set up in 1999, this is the venture arm of the Central Intelligence Agency. With Federal funding and the best entrepreneurs in the U.S. government, In-Q-Tel seeded companies with promising information access and “beyond search” technology. Without In-Q-Tel and the support from other U.S. entities such as DARPA (Defense Advanced Research Projects Agency), the search market as we know it would not exist today. Even Google snagged some Federal money for its PageRank project.
- Personal Library Software. Matt Koll doesn’t get the same attention accorded to today’s wizards. His 1993 vision of Intranet and desktop search, and the products that embodied it, was revolutionary. Mr. Koll delivered desktop search and actively discussed extending the initial product to the “behind the firewall” repositories that are so plentiful today.
- The Thunderstone Appliance. EPI Thunderstone has been a developer of search tools, such as its bullet-proof stemmer, for more than a decade. The company introduced its search appliance in 2003. The company correctly anticipated the need some organizations had for a plug-and-play solution. Google was quick to roll out its own appliance not long after the Thunderstone product became available. The appliances available from Index Engines, Exegy, and Planet Technologies, among others, owe a tip of the hat to the Thunderstone engineers.
- IBM STAIRS and BRS. Most readers will not be familiar with IBM STAIRS, circa 1971. I do include some history in the first three editions of The Enterprise Search Report which I wrote between 2003 and 2006, and I won’t repeat that information here. But text retrieval received the IBM Big Blue seal of approval with this product. When development flagged at IBM, both in the U.S. and Germany, the Bibliographic Retrieval Service picked up the mantle. You can buy this system from Open Text even today.
- In 2002, Yahoo bought Inktomi. At the time, Inktomi was a solid choice for Web search. The company was a pioneer in high-speed indexing, and it was one of the first companies to find ways to make use of random access memory to speed query processing. The purchase of Inktomi, however, marked the moment when Google seized control of Web search and, shortly thereafter, Web advertising. With the cash from its dominance of the Web, Google continues its spectacular, unencumbered rise. Had Yahoo done a better job of setting priorities, Google might not have had the easy path it found. Yahoo, as you may recall, made search subordinate to what I call “the America Online” approach to information access. Others talk about the Yahoo! portal or the Yahoo! start page. This deal — not Autonomy’s buying Verity or Microsoft’s purchase of Fast Search & Transfer — is the key buy out. Yahoo dropped Alta Vista http://www.altavista.com for Inktomi. The Inktomi deal sealed the fate of Yahoo!, and the rest is Wall Street earnings history. Google has some amazing brainpower, honed at Digital – Compaq – HP. Jeffrey Dean and Sanjay Ghemawat are, I believe, graduates of the Alta Vista “Gurus’ Academy of Search.” Exalead‘s Francois Bourdoncle did a stint with Louis Monier when both were at Alta Vista. Mr. Monier is now at Google.
- Hewlett Packard Ignores Alta Vista. The media have defined search as Google. Wrong. Search is Google today, but it was Alta Vista. If Hewlett Packard had been able to understand what it acquired when it tallied the assets of Compaq Computer, it would have “owned” search. Mismanagement and ignorance allowed two key things to take place. First, the Alta Vista brain trust was able to migrate to other companies. For example, eBay, Exalead, and Google have been direct beneficiaries of the years and millions of Digital Equipment research dollars. HP let that asset get away. Even today, I don’t think HP knows how significant its failure to capitalize on Alta Vista was and continues to be. Second, by marginalizing Alta Vista, HP abandoned a promising early desktop search product that had pioneered a market. HP is a mass market PC vendor and printer ink company. It could have been Google. In 2003, HP sold what remained of Alta Vista to Overture, later acquired by Yahoo. The irony of this decision is that Yahoo had an opportunity to challenge Google, possibly on patent infringement allegations. Yahoo! did not. Google now controls more than 60 percent of the Web search market and is “the elephant in the room” in “behind the firewall” search. Yahoo’s link up with IBM may not be enough to keep the company in the “behind the firewall” search sector.
Observations
Today’s market has not been a smooth or steady progression. Most of the systems available today use mathematics centuries old. Most systems share more similarities than differences, despite the protestations of marketing professionals at each company. Most systems disappoint users, creating the unusual situation in large organizations where employees have to learn multiple ways to locate the information needed to do work.
The acquisition of Fast Search & Transfer is a financially significant event. It has yet to stand the test of time. Specifically, these questions must be answered by the marketplace:
- Will companies buy SharePoint with technology developed elsewhere, built from multiple acquisitions’ engineering, and open source code?
- Can Microsoft retain the Fast Search customer base?
- Will customers working through some of the Fast ESP (enterprise search platform) complexities have the patience to maintain their licenses in the face of uncertainty about ESP’s future?
- Will Microsoft engineers and Fast Search engineers mesh, successfully integrating Microsoft’s product-manager approach (10,000 sailboats generally heading in one direction) with Fast Search’s blend of rigid Nordic engineering and a cadre of learning-while-you-work technologists?
My view is that I don’t know enough to answer these questions. I see an upside and a downside to the deal. I do invite readers to identify the most important turning points in search and offer their ideas about the impact of the Microsoft-Fast Search tie up.
Stephen E. Arnold
11 January 2008, 3:21 pm Eastern

