The Content Acquisition Hot Spots
February 21, 2008
I want to take a closer look at behind-the-firewall search system bottlenecks. This essay examines the content acquisition hot spot. I want to provide some information, but I will not go into the detail that appears in Beyond Search.
Content acquisition is a core function of a search system. “Classic” search systems are designed to pull content from a server where the document resides to the storage device the spider uses to hold the new or changed content. Please, keep in mind that you will make a copy of a source document, move it over the Intranet to the spider, and store that content object on the storage device for new or changed content. The terms crawling or spidering have been used since 1993 to describe the processes for:
- Finding new or changed information on a server or in a folder
- Copying that information back to the search system or the crawler sub system
- Writing information about the crawler’s operation to the crawler log file.
On the surface, crawling seems simple. It’s not. Crawlers or spiders require configuration. Most vendors provide a browser-based administrative “tool” that makes it relatively easy to configure the most common settings. For example, you will want to specify how often the content acquisition sub system checks for new or changed content. You also have to “tell” the crawling sub system what servers, computers, directories, and files to acquire. In fact, the crawling sub system has a wide range of settings. Many systems allow you to create “rules” or special scripts to handle certain types of content; for example, you can set a specific spidering schedule for certain servers or folders.
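To make the idea of crawler “rules” concrete, here is a minimal sketch, in Python, of the kind of per-server settings an administrator ends up expressing one way or another. The server names, intervals, and file patterns are hypothetical, and every vendor exposes these controls through its own administrative tool or configuration files rather than through code like this.

```python
from fnmatch import fnmatch

# Hypothetical crawl rules: which servers to visit, how often, and what to skip.
# Real systems expose equivalent settings through their administrative interfaces.
CRAWL_RULES = {
    "fileserver01": {"interval_minutes": 60,   "include": ["*.doc", "*.pdf"], "exclude": ["*/archive/*"]},
    "hr-server":    {"interval_minutes": 1440, "include": ["*"],              "exclude": ["*/payroll/*"]},
}

def should_acquire(server: str, path: str) -> bool:
    """Return True if the crawler should copy this file back for document processing."""
    rules = CRAWL_RULES.get(server)
    if rules is None:
        return False  # servers not listed are never crawled
    if any(fnmatch(path, pattern) for pattern in rules["exclude"]):
        return False
    return any(fnmatch(path, pattern) for pattern in rules["include"])

if __name__ == "__main__":
    print(should_acquire("fileserver01", "/projects/plan.doc"))         # True
    print(should_acquire("fileserver01", "/projects/archive/old.doc"))  # False: excluded folder
```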
In the last three or four years, more search systems have made it easier for the content acquisition system to receive “pushed” content. The way “push” works is that you write a script or use the snippet of code provided by the search vendor to take certain content and copy it to a specific location on the storage device where the spider’s content resides. I can’t cover the specifics of each vendor’s “push” options, but you will find the details in the Help files, API documentation, or the FAQs for your search system.
Pull
Pull works pretty well when you have a modest amount of new or changed content every time slice. You determine the time interval between spider runs. You can make the spider aggressive and launch the sub system every 60 seconds. You can relax the schedule and check for changed content every seven days. In most organizations, running crawlers every minute can suck up available network bandwidth and exceed the capacity of the server or servers running the crawler sub system.
You now have an important insight into the reason the content acquisition sub system can become a hot spot. You can run out of machine resources, so you will have to make the crawler less aggressive. Alternatively, you can saturate the network and the crawler sub system by bringing back more content than your infrastructure can handle. Some search systems bring back content that exceeds available storage space. Your choices are stark — limit the number of servers and folders the crawling sub system indexes.
When you operate a behind-the-firewall search system, you don’t have the luxury a public Web indexing engine has. These systems can easily skip a server that times out or simply not revisit it until the next spidering cycle. In an organization, you have to know what must be indexed immediately or as close to immediately as you can get. You have to acquire content from servers that may time out.
The easy fixes for crawler sub system problems are likely to create some problems for users. Users don’t understand why a document may not be findable in the search system. The reason may be that the crawler sub system was unable, for any of a number of reasons, to get the document back to the search system. Believe me, users don’t care.
The key to avoiding problems with traditional spidering boils down to knowing how much new and changed content your crawler sub system must handle at peak loads. You also must know the rate of growth for new and changed content. You need the first piece of information to specify the hardware, bandwidth, storage, and RAM you need for the server or servers handling content acquisition. The second data point gives you the information you need to upgrade your content acquisition system. You have to keep the content acquisition system sufficiently robust to handle the ever-larger amount of information generated in organizations today.
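As a rough illustration of the arithmetic involved, the sketch below turns a peak-load figure and a growth rate into bandwidth and storage requirements. Every input number is an invented placeholder; substitute the values from your own document inventory and crawler logs.

```python
# Illustrative sizing for a content acquisition sub system.
# Every input value here is hypothetical; use figures from your own inventory and logs.
peak_new_or_changed_docs_per_day = 50_000
average_doc_size_mb = 0.5
annual_growth_rate = 0.40      # 40 percent more new and changed content each year
crawl_window_hours = 8         # off-peak hours available for crawling

daily_volume_gb = peak_new_or_changed_docs_per_day * average_doc_size_mb / 1024
required_mbits_per_second = (daily_volume_gb * 1024 * 8) / (crawl_window_hours * 3600)

print(f"Peak daily volume: {daily_volume_gb:.1f} GB")
print(f"Sustained transfer rate needed: {required_mbits_per_second:.1f} Mbit/s")

# Project the same figure out a year or two so the upgrade is planned, not improvised.
for year in (1, 2):
    future_gb = daily_volume_gb * (1 + annual_growth_rate) ** year
    print(f"Year {year}: roughly {future_gb:.1f} GB per day at peak")
```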
A hot spot in content acquisition is caused by:
- Insufficient resources
- Failure to balance crawler aggressiveness with machine resources
- Improper handling of high-latency response from certain systems whose content must be brought back to the search storage sub system for indexing.
The best fix is to do the up front work accurately and thoroughly. To prevent problems, design and implement a proactive upgrade path. Maintenance and tuning must be routine operations, not “we’ll do it later” procedures.
Push
Push is another way to reduce the need for the content acquisition sub system to “hit” the network at inopportune times. The idea is simple, and it is designed to operate in a way directly opposite from the “pull” approach that gave PointCast a bad reputation. PointCast “pulled” content indiscriminately, causing network congestion.
The type of “push” I am discussing falls out of the document inventory conducted before you deploy the first spider. You want to identify those content objects that can be copied from their host location to the content acquisition storage sub system using a crontab file or a script that triggers the transfer when [a] the new or changed data are available and [b] at off-peak times.
The idea is to keep the spiders from identifying certain content objects and then moving those files from their host location to the crawler storage device at inopportune moments.
In order to make “push” work, you need to know which content is a candidate for routine movement. You have to set up the content acquisition system to receive “pushed” content, which is usually handled via the graphical administrative interface. You need to create the script or customize the vendor-provided function to “wake up” when new or changed content arrives in a specific folder on the machine hosting the content. Then the script consults the rules for starting the “push”. The transfer occurs, and the script should verify in some way that the “pushed” file was received without errors.
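Here is one way the “push” plumbing might look if you scripted it yourself rather than using a vendor-supplied snippet. The folder locations, the off-peak window, and the checksum handshake are assumptions for illustration only; your search system’s API documentation or Help files define the mechanism it actually expects.

```python
import hashlib
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical locations; a real deployment takes these from the vendor's documentation.
WATCH_FOLDER = Path("/data/outbound")         # where content owners drop new or changed files
SPIDER_INTAKE = Path("/search/push_intake")   # storage the content acquisition sub system reads
OFF_PEAK_HOURS = range(22, 24)                # transfer only between 10 p.m. and midnight

def checksum(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def push_pending_files() -> None:
    """Copy waiting files to the spider's intake folder and verify each transfer."""
    if datetime.now().hour not in OFF_PEAK_HOURS:
        return  # respect the off-peak rule; the next scheduled run will try again
    for source in WATCH_FOLDER.glob("*"):
        if not source.is_file():
            continue
        target = SPIDER_INTAKE / source.name
        shutil.copy2(source, target)
        if checksum(source) == checksum(target):  # crude verification that the push succeeded
            source.unlink()                       # remove the original only after verification
        else:
            target.unlink(missing_ok=True)        # bad copy; leave the source in place for a retry

if __name__ == "__main__":
    push_pending_files()  # typically triggered by cron or a folder-watch utility
```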
Many vendors of behind-the-firewall search systems support “push”. If your system does not, you can use the API to create this feature. While not trivial, a custom “push” function is a better solution than trying to get a crashed content acquisition sub system back online. You run the risk of having to reacquire the content, which can trigger another crash or saturate the network bandwidth despite your best efforts to prevent another failure.
Why You Want to Use Both Push and Pull
The optimal content acquisition sub system will use both “push” and “pull” techniques. Push can be very effective for high-priority content that must be indexed without waiting for the crawler to run a CRC, time stamp, or file size check on content.
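For readers who have not looked inside a crawler, the check being skipped looks roughly like the sketch below: compare a file’s size, time stamp, or checksum against what was recorded on the previous pass and bring the file back only if something differs. This is a generic illustration, not any particular vendor’s implementation.

```python
import os
import zlib

def has_changed(path: str, state: dict) -> bool:
    """Return True if a file is new or differs from the fingerprint recorded on the last pass.

    state maps a file path to the (size, mtime, crc) tuple saved the previous time around.
    """
    stat = os.stat(path)
    with open(path, "rb") as handle:
        crc = zlib.crc32(handle.read())
    fingerprint = (stat.st_size, int(stat.st_mtime), crc)
    if state.get(path) == fingerprint:
        return False              # unchanged; the crawler can skip the copy
    state[path] = fingerprint     # remember the new fingerprint for the next pass
    return True
```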
The only way to make the most efficient use of your available resources is to designate certain content as “pull” and other content as “push”. You cannot guess. You must have accurate baseline data and update those data by consulting the crawler logs.
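Crawler logs only become baseline data when someone reduces them to numbers. The sketch below assumes a simple tab-delimited log with a timestamp, a byte count, and a status field; real log formats vary by vendor, so treat the field layout as a placeholder.

```python
import csv
from collections import Counter

def summarize_crawl_log(log_path: str) -> None:
    """Tally documents and megabytes acquired per day from a tab-delimited crawler log."""
    docs_per_day = Counter()
    bytes_per_day = Counter()
    with open(log_path, newline="") as handle:
        for timestamp, byte_count, status in csv.reader(handle, delimiter="\t"):
            if status != "OK":
                continue              # leave timeouts and errors out of the baseline
            day = timestamp[:10]      # assumes ISO timestamps such as 2008-02-21T03:15:00
            docs_per_day[day] += 1
            bytes_per_day[day] += int(byte_count)
    for day in sorted(docs_per_day):
        print(day, docs_per_day[day], "documents,", bytes_per_day[day] // (1024 * 1024), "MB")
```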
You will want to develop schedules for obtaining new and changed content via “push” and “pull”. You may want to take a look at the essay on this Web log about “hit boosting”, a variation on “push” content with some added zip to ensure that certain information appears in the context where you want it to show up.
Where Are the Hot Spots?
If you have a single server and your content acquisition function chokes, you know the main hot spot — available hardware. You should place the crawler sub system on a separate server or servers.
The second hot spot may be the network bandwidth or lack of it when you are running the crawlers and pushing data to the content acquisition sub system. If you run out of bandwidth, you face some specific choices. No choice is completely good or bad. The choices are shades of gray; that is, you must make trade offs. I will highlight three, and you can work through the others yourself.
First, you can acquire less content less frequently. This reduces network saturation, but it increases the likelihood that users will not find the needed information. How can they? The information has not yet been brought to the search system for document processing.
Second, you can shift to “push”, de-emphasizing “pull” or traditional crawling. The upside is that you can control how much content you move and when. The downside is that you may inadvertently saturate the network when you are “pushing”. Also, you will have to do the research to know what to “push”, and then you have to set up or code, configure, test, debug, and deploy the system. If people have to move the content to the folder the “push” script uses, you will need to do some “human engineering”. It’s better to automate the “push” function insofar as possible.
Third, you have to set up a re-crawl schedule. Skipping servers may not be an option in your organization. Of course, if no one notices missing content, you can take your chances. I suggest knuckling down and doing the job correctly the first time. Unfortunately, short cuts and outright mistakes are very common in the content acquisition piece of the puzzle.
In short, hot spots can crop up in the crawler sub system. The causes may be human, configuration, infrastructure, or a combination of the three.
Is This Really a Big Deal?
Vendors will definitely tell you the content acquisition sub system is no big deal. You may be told, “Look, we have optimized our crawler to avoid these problems” or “Look, we have made push a point-and-click option. Even my mom can set this up.”
Feel free to believe these assurances. Let me close with an anecdote. Judge for yourself about the importance of staying on top of the content acquisition sub system.
The setting is a large US government agency. The users of the system were sending search requests to an Intranet Web server. The server would ingest the request and output a list of results. No one noticed that the results were incomplete. An audit revealed that the content acquisition sub system was not correctly identifying changed content. The error caused more than four million reports to be incorrect. Remediation cost more than $10 million. Analysis of the problem brought to light that the crawler had been incorrectly configured when the system was first installed, almost 18 months before the audit. In addition to the money lost, certain staff were laterally arabesqued. Few in Federal employ get fired.
Pretty exciting for a high-profile vendor, a major US agency, and the “professionals” who created this massive problem.
Now, how important is your search system’s content acquisition sub system to you?
Beyond Search: More Information
February 21, 2008
I posted a page here that provides more information about Beyond Search: What to Do When Your Search System Won’t Work. The publisher, The Gilbane Group, has a form here which you can use to receive more specifics about the 250-page study. You can also write beyondsearch at gilbane.com.
If you find the information on this Web log useful, you may want to think about getting a copy of the study. The information I am publishing on this Web log is useful, but it wasn’t directly on target for the Beyond Search study.
Stephen Arnold, February 21, 2008
Search System Bottlenecks
February 21, 2008
In a conference call yesterday (February 19, 2008), one of the well-informed participants asked, “What’s with the performance slow downs in these behind-the-firewall search systems?”
I asked, “Is it a specific vendor’s system?”
The answer, “No, it seems like a more general problem. Have you heard anything about search slow downs on Intranet systems?”
I do hear quite a bit about behind-the-firewall search systems. People find my name on the Internet and ask me questions. Others get a referral to me. I listen to their question or comment and try to pass those with legitimate issues to someone who can help out. I’m not too keen on traveling to a big city, poking into the innards of a system, and trying to figure out what went off track. That’s a job for younger, less jaded folks.
But yesterday’s question got me thinking. I dug around in my files and discovered a dated but still useful diagram of the major components of a behind-the-firewall search system. Here’s the diagram, which I know is difficult to read. I want to call your attention to its seven principal components and then talk briefly about hot spots. I will address each specific hot spot in a separate Web log post to keep the length manageable.
This essay, then, takes a broad look at the places I have learned to examine first when trying to address a system slow down. I will try to keep the technical jargon and level of detail at a reasonable level. My purpose is to provide you with an orientation to hot spots before you begin your remediation effort.
The Bird’s Eye View of a Typical Search System
Keep in mind that each vendor implements the search sub systems in a way appropriate for its engineering. In general, if you segment the sub systems, you will see a horizontal area in the middle of this diagram surrounded by four key sub systems, the content, and, of course, the user. The search system exists for the user, a fact many vendors and procurement teams happily overlook.
This diagram has been used in my talks at public events for more than five years. You may use it for your personal use or in educational activities without restrictions. If you want to use it in a publication, please, contact me for permission.
Let’s run through this diagram and then identify the hot spots. You see some arrows. These are designed to show the pipeline through which content, queries, and results flow. In several places, you see arrows pointing in different directions in close proximity. It is obvious that at these interfaces, a glitch of any type will create a slowdown. Now let’s identify the main features.
In the upper left hand corner is a blue sphere that represents content. For our purpose, let’s just assume that the content resides behind the firewall, and it is the general collection of Word documents, email, and PowerPoints that make up much of an organization’s information. Pundits calculate that 80 percent of an organization’s information is unstructured. My research suggests that the ratio of structured to unstructured data varies sharply by type of organization. For now, let’s just deal with generalized “content”. In the upper right hand corner, you see the user. The user, like the content, can be generalized for our purposes. We will assume that the user navigates to a Web page, sees a search box, or a list of hot links, and enters a query in some way. I don’t want to de emphasize the user’s role in this system, but I want to set aside her needs, the hassle of designing an interface, and other user-centric considerations such as personalization.
Backbone or Framework
Now, let’s look at the horizontal area in the center of the diagram shown below:
You can see that there are specific sub systems within this sub system labeled storage clusters. This is the first key notion to keep in mind when thinking about the performance of a search system. The problem that manifests itself at an interface may be caused by a sub component in a sub system. Until there’s a problem, you may not have thought about your system as a series of nested boxes. What you want to keep in mind is that until you hit a performance bottleneck, the many complex moving parts were working pretty well. Don’t criticize your system vendor without appreciating how complicated a search system is. These puppies are far from trivial — including the free one you download to index documents on your Mac or PC.
In this rectangle are “spaces” — a single drive or clusters of servers — that hold content returned from the crawling sub system (described below), the outputs of the document processing sub system (described below), the index or indexes, the system that holds the “content” in some proprietary or other representation, and a component to house the “metrics” for the system. Please keep in mind that running analytics is a big job, and you will want to make sure that you have a way to store, process, and manipulate system logs. No system logs — well, then, you are pretty much lost in space when it comes to trouble shooting. One major Federal agency could not process its logs; therefore, usage data and system performance information did not exist. Not good. Not good at all.
The components in this sub system handle content acquisition, usually called crawling or spidering. I want to point out that the content acquisition sub system can be a separate server or cluster of servers. Also, keep in mind that keeping the content acquisition sub system on track requires that you fiddle with rules. Some systems like Google’s search appliance reduce this to a point-and-click exercise. Other systems require command line editing of configuration files. Rules may be check boxes or separate scripts / programs. Yes, you have to write these or pay someone to do the rule fiddling. When the volume of content grows, this sub system can choke. The result is not a slow down, but you may find that some users say, “I put the content in the folder for indexing, and I can’t find the document.” No, the user can’t. It may be snagged in an over burdened content acquisition sub system.
Document Processing / Document Transformation
Let me define what I mean by document processing. I am using this term to mean content normalization and transformation. In Beyond Search, I use the word transformation to stream line the text. In this sub system, I am not discussing indexing the content. I want to move a Word file from its native Word format to a form that can be easily ingested by the indexing sub system described in the next section of this essay.
This sub system pulls or accepts the information acquired by the spidering sub system. Each file is transformed into a representation that the indexing sub system (described below) can understand. Transformation is now a key part of many behind-the-firewall systems. The fruit cake of different document types is normalized; that is, made standard. If a document cannot be manipulated by the system, then that document cannot be indexed. An increasing number of document transformation sub systems store the outputs in an XML format. Some vendors include an XML data base or data management system with their search system. Others use a data base system and keep it buried in the “guts” of their system. This notion of transformation means that disc writes will occur. The use of a data base system “under the hood” may impose some performance penalties on the document processing sub system. Traditional data base management systems can be input-output bound. A bottle neck related to an “under the hood” third-party, proprietary, or open source data base can be difficult to speed up if resources like money for hardware are scarce.
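Stripped to its essentials, the transformation step reduces every source document to a common representation the indexing sub system can read. The sketch below emits a bare-bones XML wrapper; the element names are invented for illustration, and production systems add far richer structure, including the metadata generated during content processing.

```python
from xml.etree import ElementTree as ET

def to_normalized_xml(doc_id: str, title: str, body_text: str, source_format: str) -> str:
    """Wrap extracted text in a simple, uniform XML record for the indexing sub system."""
    record = ET.Element("document", attrib={"id": doc_id, "source_format": source_format})
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "body").text = body_text
    return ET.tostring(record, encoding="unicode")

if __name__ == "__main__":
    # The text extraction itself (Word, PDF, email filters) happens before this step.
    print(to_normalized_xml("doc-001", "Quarterly Plan", "Draft budget for the second quarter...", "doc"))
```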
Indexing
Most vendors spend significant time explaining the features and functions of their systems’ indexing. You will hear about semantic indexing, latent semantic indexing, linguistics, and statistical processes. There are very real differences between vendors’ systems. Keep in mind that any indexing sub system is a complicated beastie. Here’s a blow up from the generalized schematic above:
In this diagram, you see knowledge bases, statistical functions, “advanced processes” (linguistics / semantics), and a reference to an indexing infrastructure. Indexing performs much of the “heavy lifting” for a search system, and it is absolutely essential that the indexing sub system be properly resourced. This means bandwidth, CPU cycles, storage, and random access memory. If the indexing sub system cannot keep pace with the amount of information to be indexed and the number of queries passed against the indexes, a number of symptoms become evident to users and the system administrator. I will return to the problems of an overloaded indexing sub system in a separate essay in a day or two. Note that I have included “manual tagging” in the list of fancy processes. The notion of a fully automatic system, in my experience, is a goal, not a reality. Most indexing systems require oversight by a subject matter expert or indexing specialist. Both statistical and linguistic systems can get “lost in space.” There are many reasons, such as language drift, neologisms, and exogenous shifts. The only reliable way to get these indexing glitches resolved is to have a human make the changes to the rules, the knowledge bases, or the actual terms assigned to individual records. Few vendors like to discuss these expensive, yet essential, interventions. Little wonder that many licensees feel snookered when “surprises” related to the indexing sub system become evident and then continue to crop up like dandelions.
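Set aside the knowledge bases and the linguistics for a moment, and the core of any indexing sub system is still an inverted index: a map from each term to the documents containing it. The toy sketch below ignores stemming, stop words, and relevance weighting, all of which a real engine layers on top.

```python
from collections import defaultdict

def build_index(documents: dict) -> dict:
    """Build a toy inverted index mapping each lowercased term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

if __name__ == "__main__":
    docs = {
        "memo-1": "budget review for the search project",
        "memo-2": "crawler schedule and budget notes",
    }
    index = build_index(docs)
    print(sorted(index["budget"]))  # ['memo-1', 'memo-2']
```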
Query Processing
Query processing is a variant of indexing. Queries have to be passed against the indexes. In effect, a user’s query is “indexed”. The query is matched or passed against the index, and the results pulled out, formatted, and pushed to the user. I’m not going to talk about stored queries or what used to be called SDI (selective dissemination of information), saved searches, or filters. Let’s just talk about a key word query.
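Continuing the toy index from the previous section, query processing amounts to normalizing the query the same way the documents were normalized, intersecting the matching posting sets, and then formatting whatever comes back. Everything a commercial system adds, such as relevance ranking, clustering, and work flow hooks, happens around this simple core.

```python
def run_query(query: str, index: dict) -> list:
    """Treat the query like a tiny document: normalize it, then intersect the matching posting sets."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())   # implicit AND across the query terms
    return sorted(result)                  # a real system would rank, cluster, and format here

# Reusing the toy index built in the indexing sketch:
#   run_query("budget review", index) -> ['memo-1']
```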
The query processing sub system consists of some pre- and post-processing functions. A heavily-used system requires a robust query processing “front end.” The more users sending queries at the same time, the more important it is to be able to process those queries and get results back in an acceptable time. My tests show that a user of a behind-the-firewall system will wait as much as 15 seconds before complaining. In my tests on systems in 2007, I found an average query response time in the 20-second range, which explains in large part why employees are dissatisfied with their incumbent search system. The dissatisfaction is a result of an inadequate infrastructure for the search system itself. Dissatisfaction, in fact, does not single out a specific vendor; the vendors are equally dissatisfying. The vendors, obviously, can make their systems run faster, but the licensee has the responsibility to provide a suitable infrastructure on which to run the search system. In short, the “dissatisfaction” is a result of poor response time. Only licensees can “fix” this infrastructure problem. Blaming a search vendor for lousy performance is often a false claim.
Notice that the functions performed within the query processing sub system are complex; for example, “on the fly” clustering, relevance ranking, and formatting. Some systems include work flow components that shape queries and results to meet the needs of particular employees or tasks. The work flow component then generates the display appropriate for the work task. Some systems “inject” search results into a third-party application so the employee has the needed information on a screen display related to the work task; for instance, a person’s investments or prior contact history.
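If you want your own response-time numbers rather than mine, the measurement itself is straightforward: send a representative set of queries to the search front end and record the round-trip time for each. The endpoint and the query list below are placeholders; point the script at your own system and use queries taken from your query logs.

```python
import time
from statistics import mean
from urllib.parse import quote
from urllib.request import urlopen

# Placeholder endpoint and queries; substitute your own front end and a sample from your query logs.
SEARCH_URL = "http://intranet-search.example.com/search?q="
TEST_QUERIES = ["travel policy", "expense report form", "product roadmap 2008"]

def measure_response_times() -> None:
    timings = []
    for query in TEST_QUERIES:
        start = time.perf_counter()
        with urlopen(SEARCH_URL + quote(query), timeout=30) as response:
            response.read()                      # pull the full results page, as a user would
        elapsed = time.perf_counter() - start
        timings.append(elapsed)
        print(f"{query!r}: {elapsed:.2f} seconds")
    print(f"Average: {mean(timings):.2f} seconds")  # anything creeping toward 15 or 20 seconds means complaints

if __name__ == "__main__":
    measure_response_times()
```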
Back to Hot Spots
Let me reiterate — I am using an older, generalized diagram. I want to identify the complexities within a representative behind-the-firewall search system. The purpose of this exercise is to allow me to comment on some general hot spots as a precursor to a quick look in a subsequent essay about specific bottle necks in subsystems.
The high level points about search system slow downs are:
- A slow down in one part of the system may be caused by a deeper issue. In many cases, the problem could be buried deep within a particular component in a sub system. Glitches in search systems can, therefore, take some time to troubleshoot. In some cases, there may be no “fix”. The engineers will have to “work around” the problem, which may mean writing code. Switching to a hosted service or a search appliance may be the easiest way to avoid this problem.
- The slow down may be outside the vendor’s span of control. If you have an inadequate search system infrastructure, the vendor can advise you on what to change. But you will need the capital resources to make the change. Most slow downs in search systems are a result of the licensee’s errors in calculating CPU cycles, storage, bandwidth, and RAM. The cause of this problem is ignorance of the computational burden search systems place on their infrastructure. The fast CPUs are wonderful, but you may need clusters of servers, not one or two servers. The fix is to get outside verification of the infrastructure demands. If you can’t afford the plumbing, shift to a hosted solution or license an appliance.
- A surge in either the amount of content to index or the number of queries to process can individually bring a system to a halt. When the two coincide, the system will choke, often failing. If you don’t have log data and you don’t review it, you will not know where to begin looking for a problem. The logs are often orphans, and their data are voluminous, hard to process, and cryptic. Get over it. Most organizations have a steady increase in content to be processed and more users sending queries to the search system despite their dissatisfaction with its performance. In this case, you will have a system that will fail and then fail again. The fix is to buckle down, manage the logs, study what’s going on in the sub systems, and act in an anticipatory way. What’s this mean? You will have to continue to build out your system when performance is acceptable. If you wait until something goes wrong, you will be in a very precarious position.
To wrap up this discussion, you may be reeling from the ill-tasting medicine I have prescribed. Slow downs and hot spots are a fact of life with complex systems such as search. Furthermore, the complexity of search systems in general and their sub systems in particular is essentially not fully understood by most licensees, their IT colleagues, or their management. In the first three editions of the Enterprise Search Report, I discussed this problem at length. I touch upon it briefly in Beyond Search because it is critical to the success of any search or content processing initiative. If you have different experiences from mine, please, share them via the comments function on this Web log.
I will address specific hot spots in the next day or two.
Stephen Arnold, February 21, 2008
Power Leveling
February 20, 2008
Last week I spoke with a group of young, enthusiastic programmers. In that lecture, I used the phrase power leveling. I didn’t coin this term. In my preparation for my lecture, I came across an illustration of a maze.

What made the maze interesting was that a rat had broken through the maze’s dividers. From the start of the maze to the cheese at the exit, the rat bulldozed through the barriers. Instead of running the maze, the rat went from A to B in the shortest, most direct way.
Power leveling.
When I used the term, I was talking about solving some troublesome problems in search and retrieval. What I learned in the research for Beyond Search was that many companies get trapped in a maze. Some work very hard to figure out one part of the puzzle and fail to find the exit. Other companies solve the maze, but the process is full of starts and stops.
Two Approaches Some Vendors Take
In terms of search and retrieval, many vendors develop solutions that work in a particular way on a specific part of the search and retrieval puzzle. For example, a number of companies performing intensive content processing generate additional indexes (now called metatags) for each document processed. These companies extract entities, assign geospatial tags, and classify documents and those documents’ components. The thorough indexing is often overkill. When these systems crunch through email, which is often cryptic, the intense indexing can go off the rails. The user can’t locate the needed email using the index terms and must fall back on searching by date, sender, or subject. This type of search system is like the rat that figures out how to solve one corner of the maze and never gets to the exit and freedom.
The other approach does not go directly to the exit. These systems iterate, crunch, generate indexes, and rerun processes repeatedly. With each epoch of the indexing process, the metatags get more accurate. Instead of a blizzard of metatags, the vendor delivers useful metadata. The vendor achieves the goal with the computational equivalent of using a submachine gun to kill the wasp in the basement. As long as you have the firepower, you can fire away until you solve the problem. The collateral damage is the computational equivalent of shooting up your kitchen. Instead of an AK-47, these vendors require massive amounts of computing horsepower, equivalent storage, and sophisticated infrastructure.
Three Problems to Resolve
Power leveling is neither of these approaches. Here’s what I think more developers of search-and-retrieval systems should do. You may not agree. Share your views in the comments section of this Web log.
First, find a way around brute force solutions. The most successful systems often use techniques that are readily available in textbooks or technical journals. The trick is to find a clever way to do the maximum amount of work in the fewest cycles. Today’s processors are pretty darn quick, but you will deliver a better solution by letting software innovations do the heavy lifting. Search systems that expect me to throw iron at bottlenecks are likely to become a money pit at some point. A number of high-profile vendors are suffering from this problem. I won’t mention any names, but you can identify the brute force systems by doing some Web research.
Second, how can you or a vendor get the proper perspective on the search-and-retrieval system? It is tough to get from A to B in a nice Euclidian way if you keep your nose buried in a tiny corner of the larger problem space. In the last few days, two different vendors were thunderstruck that my write ups of their systems described their respective products more narrowly than the vendors saw the products. My perspective on the problem space was broader than theirs. These two vendors struggled and are still struggling to reconcile my narrow perception of their systems with their own broader and, I believe, inaccurate descriptions of these systems.
I have identified a third problem with search-and-retrieval systems. Vendors work hard to find an angle, a way to make themselves distinct. In this effort to be different, I identified vendors who have created systems that can be used only when certain, highly-specific requirements call for these functions. Most organizations don’t want overly narrow solutions. The need is for a system that allows the major search-and-retrieval functions to be performed at a reasonable cost on relatively modest hardware. As important, customers want a system that an information technology generalist can understand, maintain, and enhance. In my experience, most organizations don’t want rocket science. Overly complex systems are fueling interest in SaaS (software as a service). Believe me, there are search-and-retrieval vendors selling systems that are so convoluted, so mind-bogglingly complicated, that their own engineers can’t make some changes without consulting the one or two people who know the “secret trick”. Mere mortals cannot make these puppies work.
Not surprisingly, the 50 or 60 people at my lecture were surprised to hear me make suggestions that put so much emphasis on being clever, finding ways to go through certain problems, keeping the goal in sight, and keeping their egos from getting between their customers and what the customer needs to do with a system.
A Tough Year Ahead
Too many vendors find themselves in a very tough competitive situation. The prospects often have experience with search-and-retrieval systems. The reason these prospects are talking to vendors of search-and-retrieval systems is that the incumbent system doesn’t do the job.
With chief financial officers sweating bullets about costs, search-and-retrieval vendors will have to deliver systems that work, can be maintained without hefty consulting add-ons, and get the customer from point A to point B.
I think search-and-retrieval as a separate software category is in danger of being commoditized. Lucene, for example, is a good enough solution. With hundreds of companies chasing a relatively modest pool of potential buyers, the sector is ripe for a shake out and consolidation. Vendors may find themselves blocked by super platforms that bundle search and content processing with other, higher value enterprise applications.
Search-and-retrieval vendors may want to print out the power leveling illustration and tape it to their desk. Inspiration? Threat? You decide.
Stephen Arnold, February 20, 2008
Arnold’s KMWorld Essay Series
February 19, 2008
KMWorld, the newspaper covering the knowledge management market sector, published the first of a series of essays by my hand in its February 2008 issue. Unfortunately, I am not permitted to reproduce the entire essay here because the copyright has been assigned to Information Today, Inc.
In each essay, I want to look at Google’s (NASDAQ:GOOG) impact on knowledge management and closely related fields. Many people see Google as a Web indexing and advertising business that has tried to move into other businesses and failed. But Google has disrupted the telecommunications industry with its “open platform” play in the spectrum auction. Now Google is probing the shopping, banking, and entertainment sectors. Make no mistake. These probes are not happenstance. Google is a new breed of enterprise, and I want to help you understand it an essay at a time.
Here’s one snippet from my February 2008 KMWorld essay:
If we dip into Google’s more than 250 patent applications and patents, we find more than two dozen inventions related to content, embedding advertising in that content, and manipulating the content to create compilations or anthologies, as well as other “interesting” services… Just as Google disrupted the global telecommunications sector with its open platform and hosted mobile services, enterprise publishing and traditional publishing are now in the path of Googzilla**.
** That’s my coinage to refer to the powerful entity that Google has become. Google has the skeleton of a meat-eating dinosaur outside its Mountain View, California offices. Don’t believe me? Click this Google dinosaur link to see for yourself.
In the February 2008 essay titled “Probing the Knowledge Market” I talk about Google’s growing capability in enterprise content management and publishing. Most traditional publishers haven’t figured out Google’s advertising business. It comes as no surprise, then, for me to assert that Google’s potential impact on traditional publishing and CMS is essentially unperceived. JotSpot? Do you know what JotSpot’s technology can do for Google users? Most don’t. That’s a gap in your knowledge you may want to fill by reading my February column.
I’ve already completed a couple of submissions for this series. You will learn about my views on the GSA (Google Search Appliance). Unlike the GSA bashers, I think the GSA is a very good and quite useful search-and-retrieval system. Competitors and pundits have been quick to point out the GSA’s inability to duplicate the alleged functionality of some of the best-known search system vendors. The problem is, I explain, that the GSA is one piece of a larger enterprise solution. Unlike the mind-boggling complexity of some enterprise search solutions, Google’s approach is to reduce complexity and deployment time and to eliminate most of the administrative headaches that plague many “behind the firewall” search systems. Flexibility comes from the OneBox API, not a menu of poorly integrated features and functions. You can make a GSA perform most content processing tricks without eroding the basic system’s simplicity and stability.
I also tackle what I call “Google Glue”. The idea of creating a “sticky” environment is emerging as a key Google strategy. Most professionals are blissfully unaware of a series of activities over the last two years that “cement” users and developers to Google. Google is not just a search system; it is an application platform. I explain the different “molecules” in Google’s bonding agent. Some of these are “off the radar” of enterprise information technology professionals. I want to get these facts “on the radar”. My mantra is “surf on Google.” After studying Google’s technology for more than five years, I am convinced that “the Google”, as President Bush phrased it, is a game changer.
The “hook” in my KMWorld essays will be Google and its enterprise activities. I don’t work for Google, and I don’t think the management thinks too much of my work. My studies The Google Legacy: How Search Became the Next Application Platform and Google Version 2.0: The Calculating Predator presented information I obtained from open sources about Google’s larger technology capabilities and its disruptive probes into a half dozen markets. More info about these studies here.
What you will get in my essays is an analysis of open source information about the world’s most influential search, content processing, and knowledge management company best known for its free Web search and online advertising business.
Please, navigate to the KMWorld Web site. You can look for my essays there, or you can sign up to get the hard copy of the KMWorld tabloid. Once I complete the series, I will post versions of the columns. I did this with my earlier “Technology from Harrod’s Creek” essays that ran for two years in Information World Review. But I don’t post these drafts until two or three years after an essay series closes.
Stephen Arnold, February 19, 2008
How Big is the Behind-the-Firewall Search Market?
February 18, 2008
InternetNews.com ran a story by David Needle on February 5. The title was “Enterprise Search Will Top $1 Billion by 2010.” If the story is still online (news has a tendency to disappear pretty quickly), I found it here.
In January 2003, my publisher (Harry Collier, Infonortics Ltd., Tetbury, Glou.) and I collaborated on a short white paper Search Engines: Evolution and Diffusion. That document is no longer in print. We have talked about updating it. Each month the amount of information available about search and retrieval, content processing, and text analysis grows. An update is on my to-do list. I’m not sure about Mr. Collier’s task agenda.
How We Generated Our Estimate in 2003
In that essay, we calculated — actually backed into — estimates on the size of the search-and-retrieval market. Our procedure was straight forward. We identified the companies in our list of 100 vendors that were public. If the public company focused exclusively on search, we assumed the company’s revenues came from search. Autonomy (LO:AU) and Fast Search & Transfer (NASDAQ:MSFT) are involved in a number of activities that generate revenue. For our purposes, we took the gross revenue and assumed it was from search-centric activities. For super platforms such as IBM (NYSE:IBM), Microsoft (NASDAQ:MSFT), Oracle (NASDAQ:ORCL), and SAP (NYSE:SAP), we looked at the companies’ Securities & Exchange Commission filings and found that search revenue was mashed into other lines of business, not separated as a distinct line item.
We knew that at these public companies search was not a major line of business, but search certainly contributed some revenue. I had some information about Microsoft’s search business in 2002, and I used those data to make a calculation about the contribution to revenue Web search, SQLServer search, and SharePoint search made to the company. I discounted any payments by Certified Partners with search systems for SharePoint. Microsoft was in 2002 and early 2003 actively supporting some vendors’ efforts to create “snap in” SharePoint search systems (for instance, Denmark’s Mondosoft). Google had yet to introduce its Google Search Appliance in 2003, so it was not a factor in our analysis.
I had done some work for various investment banks and venture capital firms on the average revenue generated in one year by a sample of privately-held search firms. Using these data, we were able to calculate a target revenue per full time equivalent (FTE). Using the actual revenues from a dozen companies with which I was familiar, I was able to calibrate my FTE calculation and generate an estimated revenue for the privately-held firms.
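For readers who want to see the shape of that calculation, here is the revenue-per-FTE approach in miniature. The head counts and the per-employee figure are invented placeholders, not the data behind the 2003 white paper.

```python
# Illustrative revenue-per-FTE estimate for privately held search vendors.
# The per-FTE figure and the head counts are hypothetical, not the data used in the 2003 study.
revenue_per_fte = 150_000  # calibrated, in the real exercise, against firms whose revenues were known

private_vendor_headcounts = {
    "Vendor A": 40,
    "Vendor B": 85,
    "Vendor C": 22,
}

estimates = {name: fte * revenue_per_fte for name, fte in private_vendor_headcounts.items()}
for name, revenue in estimates.items():
    print(f"{name}: roughly ${revenue / 1_000_000:.1f} million")
print(f"Private segment total: ${sum(estimates.values()) / 1_000_000:.1f} million")
```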
After some number crunching without any spreadsheet fever goosing our model, we estimated that search-and-retrieval — excluding Web ad revenue — was in the $2.8 to $3.1 billion range for calendar 2003. However, we knew there was a phenomenon of claiming revenues before the search licensee actually transferred real money to the search vendor. A number of companies have been involved in certain questionable dealings regarding how search license fees were tallied. Some of these incidents have been investigated by various organizations or by investors. Others are subjects of active probes. I’m not at liberty to provide details, nor do I want to reveal the details of the “adjustments” we made for certain accounting procedures. The adjustment was that we decremented our gross revenue estimate by about one-third, pegging the “size” of the search market in 2003 at $1.7 to $2.2 billion.
The Gartner Estimate
If you have reviewed the data reported in InternetNews.com’s story, you can see that its $1.2 billion estimate is lower than our 2003 estimate. I’m not privy to the methodology used to generate this Gartner estimate. The author of the article (David Needle) did not perform the analysis. He is reporting data released by the Gartner Group (NYSE:IT), one of the giants in the technology research business. The key bit for me in the news story is this:
Total software revenue worldwide from enterprise search will reach $989.7 million this year, up 15 percent from 2007, according to Gartner. By 2010 Gartner forecasts the market will grow to $1.2 billion. While the rate of growth will slow to low double digits over the next few years, Gartner research director Tom Eid notes enterprise search is a huge market.
Usually research company predictions err on the high side. In my files, I have notes about estimates of search and retrieval hitting the $9.0 billion mark in 2007, which I don’t think happened. If one includes Google and Yahoo, the $9.0 billion is off the mark by a generous amount. Estimates of the size of the search market are all over the map.
I assert that the Gartner estimate is low. When I reviewed the data for our 2003 calculation and made adjustments for the following factors, I came up with a different estimate. Here’s a summary of my notes to myself made when I retraced my 2003 analysis and looked at the data compiled for my new study Beyond Search:
- There’s been an increase in the number of vendors offering search and retrieval, content processing, and text analysis systems. In 2003, we had a list of about 110 vendors. The list I compiled for Beyond Search contains about 300 vendors. Of these 300, about 175 are “solid” names. Some of these like Delphes and Exegy are unknown to most of the pundits and gurus tracking the search sector. Others are long shots, and I don’t want to name these vendors in my Web log.
- A market shift has been created by Google’s market penetration. I estimate that Google (NASDAQ:GOOG) has sold about 8,500 Google Search Appliances (GSA). It has about 40 resellers / partners / integrators. Based on my research and without any help from Google, I calculated that the estimated revenue from the GSA in FY2007 was in the $400 million range, making its behind-the-firewall search business larger than the revenue of Autonomy and Fast Search & Transfer combined.
- Endeca reached about $85 million in revenues in calendar 2007, a figure colored by its success in obtaining an injection of financing from Intel (NASDAQ:INTC) and SAP.
- Strong financial growth by the search vendors in my Beyond Search market sector analysis, specifically in the category called “Up and Comers”. Several of the companies profiled in Beyond Search have revenues in the $6 to $10 million range for calendar 2007. I was then able to adjust the FTE calculation.
I made some other adjustments to my model. The bottom line is that the 2007 market size as defined in Search Engines: Evolution and Diffusion was in the $3.4 to 4.3 billion range, up from $1.7 to $2.2 billion in 2003. The growth, therefore, was solid but not spectacular. Year-on-year growth of Google, for example, makes the more narrow search-and-retrieval sector look anemic. The relative buy out of Fast Search & Transfer at $1.2 billion is, based on my analysis, generous. When compared to the Yahoo buyout of more than $40 billion, it is pretty easy to make a case that Microsoft is ponying up about 7X Fast Search’s 2007 revenue.
My thought is that the Gartner estimate should be viewed with skepticism. It’s as misleading to low ball a market’s size as it is to over state it. Investors in search and retrieval have to pump money into technology based on some factor other than stellar financial performance.
Taken as a group, most companies in the search and retrieval business have a very tough time generating really big money. Look at the effort Autonomy (LO:AU), Endeca, and Fast Search (NASDAQ:MSFT) have expended to hit their revenue in FY2007. I find it remarkable that so many companies are able to convince investors to ante up big money with relatively little hard evidence that a newcomer can make search pay. Some companies have great PR but no shipping products. Other companies have spectacular trade show exhibits and insufficient revenues to remain in business (for instance, the Entopia system profiled on the Web log).
Some Revenue Trends to Watch in the Search Sector
Let me close by identifying several revenue trends that my research has brought to light. Alas, I can’t share the fundamental data in a Web log. Here are several points to keep in mind:
- Search — particularly key word search — is now effectively a commodity; therefore, look to more enterprise systems with embedded search functions that can handle broader enterprise content. This is a value add for the vendor of a database management system or a content management system. This means that it will get harder, not easier, to estimate how much of a company’s revenue comes from its search and content processing technology.
- Specialized vendors — see the Delphes case — can build a solid business by focusing on a niche and avoiding the Madison Avenue disease. This problem puts generalized brand building before one-on-one selling. Search systems need one-on-one selling. Brand advertising is, based on my research, a waste of time and money. It’s fun. Selling and making a system work is hard. More vendors need to tackle the more difficult tasks, not the distractions of building a brand. These companies, almost by definition, may be “off the radar” of the pundits and gurus who take a generalist’s view of the search sector.
- There will be some consolidation, but there will be more “going dark” situations. These shutdowns occur when the investors grow impatient and stop writing checks. I have already referenced the Entopia case, and I purposely included it in my Web log to make a point that sales and revenue have to be generated. Technology alone is not enough in today’s business environment. I believe that the next nine to 18 months will be more challenging. There are too many vendors and too few buyers to absorb what’s on offer.
- A growing number of organizations with disappointing incumbent search systems will be looking for ways to fix what’s broken fast. A smaller percentage will look for a replacement, an expensive proposition even when the swap goes smoothly. This means that “up and comers” and some vendors with technology that can slap a patch on a brand-name search system can experience significant growth. I name the up and comers and vendors to watch in Beyond Search but not in this essay.
- The geyser of US government money to fund technology to “fight terrorism” is likely to slow, possibly to a mere trickle. Not only is there a financial “problem” in the government’s checking account, but a new administration will also fiddle with priorities. Therefore, some of the vendors who derive the bulk of their revenue from government contracts will be squeezed by a revenue shortfall. The sales cycle for a search or content processing system is, unfortunately, measured in months, even years. So, a fast ramp of revenue from commercial customers is not going to allow these companies to rise above the stream of red ink.
To close, the search market has been growing. It is larger than some believe, but it is not as large as most people wish it were. In 2008, tectonic plates are moving the business in significant ways. Maybe Gartner is predicting the post-crash search market size? I will print out Mr. Needle’s story and do some comparisons in a year, maybe two.
Stephen Arnold, February 19, 2008
Blossom Software’s Dr. Alan Feuer Interviewed
February 18, 2008
You can click here to read an interview with Dr. Alan Feuer. He’s the founder of Blossom Software, a search-and-retrieval system that has carved out a lucrative niche. In the interview, Dr. Feuer says:
“Degree of magic” is a telling scale for classifying search engines. At one end are search engines that take queries very literally; at the other are systems that try to be your intimate personal assistant. Systems high on the magic scale make hidden assumptions that influence the search results. High magic usually implies low transparency. Blossom works very hard to get the user results without throwing too much pixie dust in anyone’s eyes.
Dr. Feuer is a former Bell Labs researcher, and he has been one of the leaders in providing hosted search as well as on-premises installations of the Blossom search-and-retrieval system. I used the Blossom system to index the Federal Bureau of Investigation’s public content when a much higher profile vendor’s system failed. I also used the Blossom technology for the U.S. government-funded Threat Open Source Information Gateway.
The FBI content was indexed by Blossom’s hosted service and online within 12 hours. The system accommodated the FBI’s security procedures and delivered on-point results. Once the incumbent vendor’s system had been restored to service, the Blossom hosted service was retained for one year as a hot fail over. This experience made me a believer in hosted search “Blossom style”.
Click here for the full interview. For more information about Blossom, click the Blossom link.
Stephen Arnold, February 18, 2008
Delphes: A Low-Profile Search Vendor
February 17, 2008
Now that I am in clean up mode for Beyond Search, I have been double-checking my selection of companies for the “Profiles” section of the study. In a few days, I will make public a summary of the study’s contents. The publisher — The Gilbane Group — will also post an informational page. Publication is likely to be very close to the previously announced target of April 2008.
Yesterday, I used the Entopia system as the backbone of a mini-case study. Today — Sunday, February 17, 2008 — I want to provide some information about an interesting company not included in my Beyond Search study.
The last information I received from this company arrived in 2006, but the company’s Web site explicitly copyrights its content for 2008. When I telephoned on Friday, February 15, 2008, my call went to voice mail. Therefore, I believe the company is still in business.
Delphes, in the tradition of search and content processing companies, is a variant of the name Delphi. You are probably familiar with the oracle of Delphi. I think the name of the company is intended to evoke a system that speaks with authority. As far as I know, Delphes is a private concern and concentrates its sales and marketing efforts in Canada, Francophone nations, and Spain. When I mention the name Delphes to Americans, I’m usually met with a question, “What did you say?” Delphes has a very low profile in the United States. I don’t recall seeing the company on the program of the search-and-retrieval conferences I attended in 2006 or 2007, but I go to a small number of shows. I may have overlooked the company’s sessions.
The Company’s Approach
The “guts” of the Delphes search-and-retrieval system is based on natural language processing embedded in a platform. The firm’s product is marketed as Diogene, another Greek variant. Diogenes, as you know, was a popular name in Greece. I assume the Diogenes from which Diogene is derived is Diogenes of Sinope, sometimes remembered as the Cynic. (More information about Diogenes of Sinope is here.)
Diogene extracts information using “dynamic natural language processing”. The iterative, linguistic process generates metadata, tags concepts, and classifies information processed by the system.
The company’s technology is available in enterprise, Web, and personal versions. DioWeb Enterprise is the behind-the-firewall version of the product. You can also license DioMorpho, which is for an individual user on a single workstation. Delphes works through a number of partners, and you can deal directly with the company for an on-premises license or an OEM (original equipment manufacturing) deal. Its partners include Sun Microsystems, Microsoft, and EMC, among others.
When I first looked at Delphes in 2002, the company had a good reputation in Montréal (Québec), Toronto and Ottawa (Ontario). The company’s clients now include governmental agencies, insurance companies, law firms, financial institutions, healthcare institutions, and consulting firms, among others. You can explore how some of the firm’s clients use the firm’s content processing technology by navigating to the Québec International Portal. The search and content processing for this Web site is provided by Delphes.
The company’s Web site includes a wealth of information about the architecture of the system, its features and functions, and services available from the company. The company offers a PDF that describes in a succinct way the features of what the company calls its “Intelligent Knowledge Management System”. You can download the IKMS overview document here.
Architecture
Information about the technical underpinnings of Delphes is sketchy. I have in my files a Delphes document called “The Birth of Digital Intelligence: Extranet and Internet Solutions”. This information, dated 2004, includes a high-level schematic of the Delphes system. Keep in mind that the company has enhanced its technology, but I think we can use this diagram to form a general impression of the system. Note: these diagrams were available in open sources, and are copyrighted by Delphes.
The “linguistic soul” of the system is encapsulated in two clusters of sub systems. First, there is the “advanced analysis” for content processing. This set of functions performs semantic analysis, which “understands” each processed document. The second system permits cross-language operation. Canada is officially bilingual, so for Delphes to make sales in Canadian agencies, the system must handle multiple languages and have a means to permit a user to locate information using either English or French.
The “body” of the system includes a distributed architecture, multi-index support, a federating function, and support for XML and Web services. In short, Delphes followed the innovation trajectory of Autonomy (LO:AU), Endeca, and Fast Search & Transfer (NASDAQ:MSFT). One can argue that Delphes has a system of comparable sophistication that permits the same customization and scaling.
Delphes makes a live demo available in a side-by-side comparison with Google. The content used for the demo comes from the Cisco Systems’ Web site. You can explore this live implementation in the Delphes demo here. The interface incorporates a number of functions that strike me as quite useful. The screen shot below comes from the Delphes document from which the systems diagram was extracted. Portions of the graphic are difficult to read, but I will summarize the key features. You will be able to get a notion of the default interface, which, of course, can be customized by the licensee.
The results of the query high speed access through cable appear in the main display. Note that a user can select “themes” (actually a document type) and a “category”.
Each “hit” in the results list includes an extract from the most relevant paragraph in the source document that matches the query. In this example, the query terms are not matched exactly. The Delphes system can understand “fuzzy” notions and use them to find relevant documents. Key word indexing systems typically don’t have this functionality. With a single click, the user can launch a second query within the subset. This is generally known as “search within results.” Many search systems do not make this feature available to their users.
Notice that a link is available so the user can send the document to a colleague with one click. The hit also includes a link to the source document. A link is provided so the user can jump directly to the next relevant paragraph in a hit. This feature eliminates scrolling through long documents looking for results. Finally, the hit provides a count of the number of relevant paragraphs in a source document. A long document with a single relevant paragraph may not be as useful to a user as a document with a larger number of relevant paragraphs.
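Delphes has not disclosed how it scores paragraphs, so the following is only a rough approximation of the paragraph-level features just described, using plain term overlap: count the relevant paragraphs in a document and step from one relevant paragraph to the next. The threshold and sample text are arbitrary.

```python
def relevant_paragraphs(document_text, query, threshold=2):
    """Return the indexes of paragraphs sharing at least `threshold` terms with the query."""
    query_terms = set(query.lower().split())
    hits = []
    for i, paragraph in enumerate(document_text.split("\n\n")):
        overlap = query_terms & set(paragraph.lower().split())
        if len(overlap) >= threshold:
            hits.append(i)
    return hits

def next_relevant(hits, current_index):
    """Index of the next relevant paragraph after the one being viewed, or None."""
    return next((i for i in hits if i > current_index), None)

if __name__ == "__main__":
    doc = ("Cable modems offer high speed access.\n\n"
           "Our offices are closed on holidays.\n\n"
           "Pricing for high speed access through cable varies by region.")
    hits = relevant_paragraphs(doc, "high speed access through cable")
    print("relevant paragraph count:", len(hits))   # analogous to the per-hit count
    print("jump from paragraph 0 to:", next_relevant(hits, 0))
```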
Based on my notes to myself about the Delphes system, I identified the following major functions of DioWeb. Forgive me if I blur the boundaries between DioWeb and the company’s other products; I can no longer recall exactly where each one begins and ends. Delphes, I’m confident, can set you straight if I go off track.
First, the system can perform search-and-retrieval tasks. The interface permits free text and natural language querying. The system’s ability to “understand” content eliminates the shackles of key word Boolean search technology. Users want the search box to be more understanding; Boolean systems are powerful, but most users do not understand them. Delphes describes its semantic approach as using “key linguistic differentiators”. I explain these functions briefly in Beyond Search, so I won’t define each of these concepts in this essay. Delphes uses syntax, disambiguation, lemmatization, masks, controlled term lists, and automatic language recognition, among other techniques. (A toy sketch after this list of functions illustrates a few of these ideas.)
Second, the system can federate content from different systems and further segment processed content by document type. Concepts can be used to refine a results list. Delphes defines concepts as proper nouns, dates, product names, codes, and other types of metadata.
Third, the system identifies relevant portions of a hit. A user can see only those portions of the document or browse the entire document. A navigator link allows the user to jump from relevant paragraph to relevant paragraph without the annoying scrolling imposed by some other vendors’ approaches to results viewing.
Fourth, the system can generate a “gist” or “summary” of a result. This feature extracts the most important portions of each hit and makes them available in a report. The system’s email link makes it easy to send the results to a colleague.
Fifth, Delphes includes what it calls a “knowledge manager”. I’m generally suspicious of KM or knowledge management systems. Delphes’ implementation strikes me as a variation on the “gist” or “summary” feature. The user can add comments, save the results, or perform other housekeeping functions. A complementary “information manager” function generates a display that shows what reports a user has generated. If a user sends a report to a colleague, the dashboard display of the “information manager” makes it possible to see that the colleague added a comment to a report. Again, this is useful housekeeping stuff, not the more esoteric functions described in my earlier summary of the Entopia approach.
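Here is the toy sketch promised above. Delphes keeps its linguistic pipeline proprietary, so this only illustrates, in the crudest terms, three ideas from the list: lemmatization, automatic language recognition, and refining a result list by concept facets. Every rule, stopword list, and record below is a made-up stand-in, not Delphes code.

```python
# Toy lemmatizer: strip a few common English suffixes. Real systems use
# dictionaries and morphological rules, not string surgery.
def lemmatize(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy language recognition: pick the language whose stopwords appear most often.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "for"},
    "fr": {"le", "la", "et", "de", "pour"},
}

def detect_language(text):
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

# Concept facets: each hit carries metadata such as document type and product
# name; refining the results list is simply a filter on that metadata.
def refine(hits, facet, value):
    return [hit for hit in hits if hit.get(facet) == value]

if __name__ == "__main__":
    print(lemmatize("accessing"), lemmatize("modems"))                # access modem
    print(detect_language("la vitesse de connexion pour le câble"))   # fr
    hits = [
        {"title": "Cable pricing", "doc_type": "datasheet", "product": "Modem X"},
        {"title": "Holiday notice", "doc_type": "memo", "product": None},
    ]
    print(refine(hits, "doc_type", "datasheet"))
```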
What Can We Learn?
My goal for Beyond Search was to write a study with fewer than 200 pages, minimizing the technical details to focus on “what’s in it for the licensee”. Beyond Search is going to run about 250 pages, and I had to trim some information that I thought was important to readers. Delphes is an interesting vendor, and it offers a system that has a number of high-profile, demanding licensees in Canada, Europe, and elsewhere.
The reason I wanted to provide this brief summary — fully unauthorized by the company — was to underscore what I call the visibility problem in behind-the-firewall search.
Judging from the information published by the major consultancies and pundits who “cover” this sector of the software business, Delphes is essentially invisible. However, Delphes does exist and offers a competitive system that can go toe-to-toe with Autonomy, Endeca, and Fast Search & Transfer. One can argue that Delphes can enhance a SharePoint environment and match the functionality of a custom system built from IBM’s (NYSE:IBM) WebSphere and OmniFind components.
What does this discussion of Delphes tell us?
If you rely on the consultants and pundits, you may not be getting the full story. Just as I had to chop information from Beyond Search, others exercise the same judgment. This means that when you ask, “Which system is best for my requirements?” — you may be getting at best an incomplete answer. You may be getting the wrong answer.
A search for Delphes on Exalead, Live.com (NASDAQ:MSFT), Google (NASDAQ:GOOG), and Yahoo (NASDAQ:YHOO) is essentially useless. Little of the information I provide in this essay is available to you. Part of the problem is that the word Delphes is perceived by the search systems as a variant of Delphi. You learn a lot about tourism and not too much about this system.
There are two key points to keep in mind about search-and-retrieval systems:
- The “experts” may not know about some systems that could be germane to your needs. If the “experts” don’t know about these systems, you are not going to get a well-rounded analysis. The phrase that sticks in my mind is “bright but uninformed”. This can be a critical weak spot for some “experts”.
- The public Web search systems do a pretty awful job on certain types of queries. It is worth keeping this in mind because, in the last few weeks, Google’s share of Web search has led some observers to call Web search a “game over” market. I’m not so sure. People who think the “game is over” in search are “bright but uninformed”. Don’t believe me. Run the Delphes query and let me know your impression of the results. (Don’t cheat and use the product names I include in this essay. Start with Delphes and go from there.)
In closing, contrast Entopia with Delphes. Both companies asserted similar functionality in the 2004 – 2006 period. Today, the high-profile Entopia is nowhere to be found. The lower-profile Delphes is still in business.
Make no mistake. Search is a tough business. Delphes illustrates the usefulness of focusing on a market, not lighting up the sky with marketing fireworks. I would like to ask the Delphic oracle in Greece, “What’s the future of Delphes?” I will have to wait and see. I’m not trekking to Greece to look at smoke and pigeon entrails. I do know some search engine “pundits” who may want to go. Perhaps the Delphic oracle will short cut their learning about Delphes?
Stephen Arnold, February 17, 2008
Entopia: A Look Back in Time
February 16, 2008
Periodically I browse through my notes about behind-the-firewall systems, content processing solutions, and information retrieval start ups. I think Entopia, a well-funded content processing company founded in 1999, shut down, perhaps permanently, some time in 2006.
In my “Dormant Search Vendors” folder, I keep information about companies that had interesting technology but dropped off my watch list. A small number of search vendors are intriguing. I revisit what information I have in order to see if there are any salient facts I have overlooked or forgotten.
KangarooNet and Smart Pouches
Do you remember Entopia? The company offered a system that would key word index, identify entities and concepts, and allow a licensee to access information from the bottom up. The firm opened its doors as KangarooNet. I noticed the name because it reminded me of the whimsical Purple Yogi (now Stratify). Some names lure me because they are off-beat, if not too helpful to a prospective customer. I do recall that the reference to a kangaroo was intended to evoke something called a “smart pouch”. The founders, I believe, were from Israel, not Australia. I assumed some Australian tech wizards had crafted the “smart pouch” moniker, but I was wrong.
Do you know what a “smart pouch” is? The idea is that the kangaroo has a place to keep important items such as baby kangaroos. The Entopia “smart pouch” was a way to gather important information and keep it available. Users could share “smart pouches” and collaborate on information. Delicious.com’s bookmarks provide a crude analog of a single “smart pouch” function.
I recall contacting the company in 2000, but I had a difficult time understanding how the company’s system would operate at scale in an affordable way. Infrastructure and engineering support costs seemed likely to be unacceptably high. No matter what the proposed benefits of a system, if the costs are too high, customers are unwilling to ink a deal.
Shifting Gears: New Name, New Positioning
Entopia is a company name derived from the Greek word entopizo. For those of you whose Greek is a bit rusty, the verb means to locate or bring to light. Entopia’s senior technologists stressed that their K-Bus and Quantum systems allowed a licensee to locate and make use of information that would otherwise be invisible to some decision makers.
When I spoke with representatives of the company at one of the Information Today conferences in New York in 2005, I learned that Entopia offered, according to the engineer giving me the demo, “a third-generation technology”. The idea was that Entopia’s system would supplement indexing with data about the document’s author, display Use For and See Also references, and foster collaboration.
I noted that I also spoke with Entopia’s vice president of product management, David Hickman, a quite personable man as I recall. My notes included this impression:
Entopia wants to capture social aspects of information in an organization. Relationships and social nuances are analyzed by Entopia’s system. Instead of a person looking at a list of possibly relevant documents, the user sees the information in the context of the document author, the author’s role in the organization, and the relationships among these elements.
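Entopia never published its ranking internals, so what follows is only my speculative sketch of the idea captured in those notes: annotate each hit with the author’s role and the author’s relationship to the searcher, drawn from organizational data. The names, org chart, and scores are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    title: str
    author: str
    score: float

# Hypothetical organizational data: who reports to whom, and each person's role.
MANAGER = {"alice": "carol", "bob": "carol", "carol": "dana"}
ROLE = {"alice": "analyst", "bob": "engineer", "carol": "manager", "dana": "vp"}

def relationship(user, author):
    """Very rough relationship label between the searcher and a document author."""
    if author == user:
        return "self"
    if MANAGER.get(author) == MANAGER.get(user):
        return "same team"
    if MANAGER.get(user) == author or MANAGER.get(author) == user:
        return "reporting line"
    return "other"

def contextualize(user, hits):
    """Annotate each hit with the author's role and relationship to the searcher."""
    return [
        {"title": h.title, "score": h.score, "author": h.author,
         "author_role": ROLE.get(h.author, "unknown"),
         "relationship": relationship(user, h.author)}
        for h in sorted(hits, key=lambda h: h.score, reverse=True)
    ]

if __name__ == "__main__":
    hits = [Hit("Q3 competitive review", "bob", 0.81),
            Hit("Travel policy", "dana", 0.40)]
    for row in contextualize("alice", hits):
        print(row)
```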
In my files, I found this screen shot of Entopia’s default search results display. It’s very attractive, and includes a number of features that systems now in the channel do not provide. For example, if you had access to Entopia’s system in 2006 prior to its apparent withdrawal from the market, you could:
- See concepts, people, and sources related to your query. These appear in the left hand panel on the screen shot below
- Get a results list with the creator, source, date, and relevance score for each item clearly presented. In contrast to the default displays used by some of the companies in my Beyond Search study, Entopia’s interface is significantly more advanced
- Rely on the standard search box, a hot link to advanced search functions, and one-click access to saved searches, which keep important but little-used functions front and center.
When the firm was repositioned in 2003, the core product was named, according to my handwritten notes, the “K-Bus Knowledge Extractor”. I think the “k” in K-Bus is a remnant of the original “kangaroo” notion. I wrote in my notes that Entopia was a spin out from an outfit called Omind and Global Catalyst Partners.
Other features of the Entopia system were:
- Support for knowledge bases, taxonomies, and controlled term lists
- An API and a software development kit
- Support for natural language processing
- Classification of content
- Enhanced metatagging
The K-Bus technology was enhanced with another software component called Quantum. This software created a collaborative workspace. The idea was that system users could assemble, discuss, and manipulate the information processed by the K-Bus. This is the original “smart pouch” technology, which allowed a user to gather information and keep it in a virtual workspace.
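Quantum’s internals were never documented publicly, so the following is a bare-bones sketch, under my own assumptions, of the “smart pouch” concept: a shared container in which users collect processed items and attach comments for their colleagues.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PouchItem:
    title: str
    source: str
    comments: List[str] = field(default_factory=list)

@dataclass
class SmartPouch:
    name: str
    members: List[str] = field(default_factory=list)
    items: List[PouchItem] = field(default_factory=list)

    def add_item(self, title, source):
        self.items.append(PouchItem(title, source))

    def comment(self, user, item_title, text):
        # Record who said what about which gathered item.
        for item in self.items:
            if item.title == item_title:
                item.comments.append(f"{user}: {text}")

if __name__ == "__main__":
    pouch = SmartPouch("Cable modem research", members=["alice", "bob"])
    pouch.add_item("High speed access FAQ", "support.example.com")
    pouch.comment("bob", "High speed access FAQ", "Paragraph 3 answers our question.")
    print(pouch)
```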
System Overview
In my Entopia folder, I found white papers and other materials given to me by the company. Among the illustrations was this high-level view of the Entopia system.
Several observations are warranted even though the labels in the figure are not readable. First, licensees had to embrace a comprehensive information platform. In the 2005 – 2006 period, a number of content processing vendors had added the word “platform” to their marketing collateral. Entopia, to its credit, did a good job of depicting how significant an investment was required to make good on the firm’s assertions about discovering information.
Second, it is clear that the complex interactions required to make the system work as advertised cannot tolerate bottlenecks. A slowdown in one component ripples through the rest. For instance, the horizontal gray rectangle in the center of the diagram is the “Session Facade Beans” subsystem; if these processes slow down, the Web framework in the horizontal blue box above it slows down, and user access degrades. Another hot spot is the Data Access Module, the gray rectangle below the one just referenced; a problem in this component prevents the metadata from being accessed. In short, a heck of an infrastructure of servers, storage, and bandwidth is needed to keep the system performing at acceptable levels. (A generic sketch of how one might spot such a hot spot appears after these observations.)
Finally, the complexity of the system appears to require on-site support and, in some cases, technical support from Entopia. A licensee’s existing information technology staff could require additional headcount to manage the K-Bus architecture.
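Here is the generic sketch mentioned above. It is not Entopia code; it simply shows how timing each stage of a toy pipeline makes the slowest component, the hot spot, stand out. The stage names and delays are invented.

```python
import time

def timed(stage_name, func, *args, **kwargs):
    """Run one pipeline stage and report how long it took."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{stage_name:<15} {elapsed * 1000:8.2f} ms")
    return result

# Stand-in stages; in a real deployment these would be the Web framework,
# the session layer, and the data access module.
def render(data):          return f"<html>{data}</html>"
def session_logic(data):   time.sleep(0.05); return data.upper()
def data_access(query):    time.sleep(0.20); return f"rows for {query}"  # the hot spot

if __name__ == "__main__":
    rows = timed("data access", data_access, "high speed cable")
    processed = timed("session layer", session_logic, rows)
    page = timed("web framework", render, processed)
```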
As I scanned these notes, now more than two years old, I was struck by the fact that Entopia was on the right track. The buzz about social search makes sense, particularly in an organization where one-to-one relationships form outside the formal hierarchical structure. Software can provide some context for knowledge workers who are often monads, responsible to other monads, not the organization as a whole.
Entopia wanted to blend expertise identification, content visualization, social network analysis, and content discovery into one behind-the-firewall system. I noted that the company’s system started at $250,000, and I assume the up-and-running price tag would be in the millions.
When I asked, “Who are Entopia’s customers?”, I learned that Saab, the US government, Intel, and Boeing were licensees. Those were blue-chip names, and I thought that these firms’ use of the K-Bus indicated Entopia would thrive. Entopia was among the first search vendors to integrate with Salesforce.com. The system also allowed a licensee to invoke Entopia functions within a Word document.
What Can We Learn?
Entopia seems to have gone dark quietly in the last half of 2006. My hunch is that the intellectual property of the company has been recycled. Entopia could be in operation under a different corporate name or incorporated as a proprietary component in other content processing systems. When I clicked on the Entopia.com Web address in my folder, a page of links appeared. Running queries on Live.com, Google, and Yahoo returned links to stale information. If Entopia remains in business, it is doing a great job of keeping a low profile.
If you read my essay “Power Leveling”, you know that there are two common challenges in search and content processing. The first is getting caught in a programming maze: code written to solve one particular problem fails to meet the licensee’s broader needs. The second is that when the system developer assembles these local solutions, the overall result is not efficient. Instead of driving straight from Point A to Point B, the system iterates and explores every highway and byway. Performance becomes a problem. To make the system go fast, capital investment is necessary. When licensees can’t or won’t spend more on hardware, the system remains sluggish.
Entopia, on the surface, appears to be an excellent candidate for further analysis. My cursory looks at the system in 2001, again in 2005, and finally in 2006 revealed considerable prescience about the overall direction of the content processing market. Some of the subsystems were very clever and well in advance of what other vendors had on the market. The use of social metadata in search results was quite useful. My recollection of what happened when these clever subsystems were hooked together is hazy, but I had noted that response time was sluggish. Maybe it was. Maybe it wasn’t. The point is that a complex system like the one illustrated above would require ongoing work to keep operating at peak performance.
Unfortunately, I don’t have an Entopia system to benchmark against the systems of the 24 companies profiled in Beyond Search. I wanted to include this Entopia information, but I couldn’t justify a historical look back when there was so much to communicate about systems now in the channel.
In Beyond Search, I don’t discuss the platforms available from Autonomy, Endeca, Fast Search & Transfer, IBM, and Oracle. I do mention these companies to frame the new players and little-known up-and-comers that figure in Beyond Search. I would like to conclude this essay with several broad observations about the perils of selling platforms to organizations.
First, any company selling a platform is essentially trying to obtain a controlling or central position in the licensee’s organization. A platform play is one that has a potentially huge financial payoff. A platform is a sophisticated “lock in”. Once the platform is in position, competitors have a difficult time making headway against the incumbent.
Second, the platform is the core product of IBM (NYSE:IBM), Microsoft (NASDAQ:MSFT), and Oracle (NASDAQ:ORCL). One might include SAP (NYSE:SAP) in this list, but I will omit the company because it’s in transition. These Big Three have the financial and market clout to compete with one another. Smaller outfits pushing platforms have to out-market, out-fox, and out-deliver any of the Big Three. After all, why would an Oracle DBA want another information processing platform in an all-Oracle environment? IBM and Microsoft operate with almost the same mind set. Smaller platform vendors, and perhaps we could include Autonomy (LON:AU) and Endeca in this category, are likely to face increasing pressure to mesh seamlessly with whatever a licensee has. If this is correct, Fast Search’s ESP has a better chance going forward than Autonomy. It’s too early to determine whether Endeca’s deal with SAP will pay similar dividends. You can decide for yourself if Autonomy can go toe-to-toe with the Big Three. From my observation post in rural Kentucky, Autonomy will have to shift into a higher gear in 2008.
Third, super-advanced systems are vulnerable in business environments where credit is tight, sales are in slow or low growth cycles, and a licensee’s technical team may be short-handed and overworked.
In conclusion, I think Entopia was a forward-thinking company. Its technology anticipated market needs that are now more clearly discernible. Its system was slick, anticipating some of the functionality of the Web 2.0 boom. The company demonstrated a willingness to abandon overly cute marketing for more professional product and company nomenclature. The company apparently did have one weakness: too little revenue. Entopia, if you are still out there, please let me know.
Stephen Arnold, February 16, 2008
Search Musical Chairs
February 15, 2008
Running a search business is tough. Being involved in search and retrieval is no picnic either. The game of musical chairs that dominates the news I review comes as no surprise.
For example, Yahoo’s Bradley Horowitz, head of advanced projects, now pulls into the Google parking lot. You can read his “unfortunate timing” and “I really love Yahoo” apologia at the Horowitz apologia link. The executive shifts at Microsoft are too numerous for me to try to figure out. The search wizard from Ask.com, Steve Berkowitz, has turned in his magic wand to Microsoft security. You can read more about that at the Microsoft shuffle link. The low-profile SchemaLogic lost one of its founders a month ago, although the news was largely overlooked by the technical media. Then, in Seattle on February 13, I heard that changes are afoot in Oracle’s secure enterprise search group. In short, the revolving doors in search and retrieval keep spinning.
But there are even larger games afoot (link). For example, T-Mobile embraced Yahoo. Almost simultaneously, Nokia snuggled up to Google. (Note: the links to these news stories go dark without warning, and I can’t archive the original material on my Web site due to copyright considerations.) The world of mobile search continues to be fluid, and we haven’t had the winner of the FCC spectrum auction announced yet. As these larger tie-ups play out, I want to keep my eye on telco search companies that are off the radar; for example, Fast Search & Transfer’s mobile licensees might be jetting to different climes when the Microsoft acquisition is completed. A certain large behind-the-scenes vendor of mobile search is likely to be among the first to seek a new partner.
At the next higher level, the investment banks continue to take a close look at their exposure in search and related sectors. With more than 150 companies actively marketing search, content processing, and utilities, some major financial institutions are becoming increasingly concerned. What once looked like a very large, open-ended opportunity has a very different appearance. The news that Google touches more than 60 percent of online search traffic leaves little wiggle room for online search competitors in the US and Europe. Asia seems to be a different issue, but in the lucrative US market, Google is the factor. In the behind-the-firewall sector Microsoft – Fast and Google seem destined to collide. With that looking increasingly likely, IBM and Oracle will have to crank up their respective efforts.
In short, at the executive level, sector level, and investment level, speed dating is likely to be a feature of the landscape for the next six to nine months. If someone were to ask me to run a search-centric company, I would push the button on my little gizmo that interrupts telephone calls with bursts of static. The MBAs, lawyers, and accountants who assume leadership positions in search-centric companies are wiser, braver, and younger than I. Unfortunately, as the bladder of their confidence swells, the substance behind that confidence may prove thin indeed.
I have resisted making forecasts about what the major trends in search and retrieval will be in 2008. I can make one prediction and feel quite certain that it will hold true.
The executive turnover in the ranks of search and content processing companies will churn, flip, and flop throughout 2008.
The reason? Too many companies chasing too few opportunities. The wide open spaces of search are beginning to close. Beyond Search contains a diagram showing how the forces of Lucene, up-and-coming vendors with value-priced systems, and embedded search from super platforms like IBM, Microsoft, and Oracle are going to make life very interesting for certain companies. When the pressure increases, the management chair becomes a hot seat. The investors get cranky, and the Bradley Horowitzes of the world find a nice way to say, “Adios.”
Stephen Arnold, February 15, 2008

