Lucene: Merits a Test Drive and a Close Look
January 27, 2008
On Friday, I gave an invited lecture at the Speed School, the engineering and computer science department of the University of Louisville. After the talk on Google's use of machine learning, one of the Ph.D. candidates asked me about Lucene. Lucene, as you may know, is the open source search engine authored by one of the Excite developers. If you want background on Lucene, the Wikipedia entry is a good place to start, and I didn't spot any egregious errors when I scanned it earlier today. My interest is behind-the-firewall search and content processing. My comments, therefore, reflect my somewhat narrow view of Lucene and other retrieval systems. I told the student that I would offer some comments about Lucene and provide him with a few links.
Background
Lucene’s author is Doug Cutting, who worked at Xerox’s Palo Alto Research Center and eventually landed at Excite. After Excite was absorbed into Excite@Home, he needed to learn Java. He wrote Lucene as an exercise. Lucene (his wife’s middle name) was contributed to the Apache project, and you can download a copy, documentation, and sample code here. An update — Java Version 2.3.0 — became available on January 24, 2008.
What It Does
Lucene permits key word and fielded search. You can use Boolean AND, OR, and NOT to formulate complex queries. The system permits fuzzy search, useful when searching text created by optical character recognition. You can also set up the system to display similar results, roughly the same as See Also references. You can set up the system to index documents. When a user requests a source document, that document must be retrieved over the local network. If you want to minimize the bandwidth hit, you can configure Lucene to store an archive of the processed documents. If the system processes structured content, you can search by the field tags, sort these results, and perform other manipulations. There is an administrative component, which is accessed via the command line.
In a nutshell, you can use Lucene as a search and retrieval system.
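For the curious, here is a minimal sketch of what indexing and querying look like in Lucene's 2.x Java API. The index path, field names, and query are illustrative only, and later Lucene releases changed several of these calls, so treat this as orientation rather than production code.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        String indexDir = "/tmp/lucene-index";           // illustrative path
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one document with a stored title field and a tokenized body field.
        IndexWriter writer = new IndexWriter(indexDir, analyzer, true);
        Document doc = new Document();
        doc.add(new Field("title", "Annual report 2007",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("body", "Revenue grew in the search and retrieval unit.",
                Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();

        // Query: a Boolean AND plus a fuzzy term (the trailing tilde).
        IndexSearcher searcher = new IndexSearcher(indexDir);
        QueryParser parser = new QueryParser("body", analyzer);
        Query query = parser.parse("revenue AND retreival~");
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("title"));
        }
        searcher.close();
    }
}
```

The trailing tilde is Lucene's fuzzy operator, which is what makes the system forgiving of the misspellings and OCR noise mentioned above.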
Selected Features
You will want to have adequate infrastructure to operate the system, serve queries, and process content. When properly configured, you will be able to handle collections that number hundreds of millions of documents. Lucene also delivers good relevancy when properly tuned. Like a number of search and content processing systems, the administrative tools allow the search administrator to tweak the relevance engine. Among the knobs and dials you can twirl are document weights, so you can boost or suppress certain results. As you dig through the documentation, you will find guidance for run-time term weights, length normalization, and field weights, among others. A bit of advice — run the system in the default mode on a test set of documents so you can experiment with various configuration and administrative settings. The current version improves the system's ability to handle processing in parallel. Indexing speed and query response time, when the system is properly set up and resourced, are as good as or better than those of some commercial products.
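To make one of those knobs concrete, here is a small sketch of index-time boosting in the Lucene 2.x API; the field names and boost values are invented for illustration.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostSketch {
    // Index-time weighting in the Lucene 2.x API: boost a field or a whole
    // document so it ranks higher (or lower) for matching queries.
    static Document buildBoostedDocument() {
        Document doc = new Document();
        Field title = new Field("title", "Executive summary",
                Field.Store.YES, Field.Index.TOKENIZED);
        title.setBoost(2.0f);   // title matches count double
        doc.add(title);
        doc.add(new Field("body", "Quarterly figures and commentary.",
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.setBoost(1.5f);     // lift the whole document; values below 1.0f suppress it
        return doc;
    }

    public static void main(String[] args) {
        Document doc = buildBoostedDocument();
        System.out.println("Document boost: " + doc.getBoost());
    }
}
```

The same lever is available at query time with the caret syntax, for example title:budget^3, which weights title hits three times more heavily for that one query.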
Strengths and Weaknesses
The Lucene Web site provides good insight into the strengths and weaknesses of a Lucene search system. The biggest plus is that you can download the system, install it on a Linux, UNIX, or Windows server, and stand up a stable, functional key word and fielded search system. In the last three or four years, the system has seen significant improvements: faster processing, a smaller index footprint (now about 25 percent of the source documents' size), incremental updates, support for index partitions, and other useful enhancements.
The downside of Lucene is that a non-programmer will not be able to figure out how to install, test, configure, and deploy the system. Open source programs are often quite good technically, but some lack the graphical interfaces and training wheels that are standard with some commercial search and content processing systems. You will be dependent on the Lucene community to help you resolve some issues. You may find that your request for support results in a Lucene aficionado suggesting that you use another open source tool to resolve a particular issue. You will also have to hunt around for some content filters, or you will be forced to code your own import filters. Lucene has not been engineered to deliver the type of security found in Oracle’s SES 11g system, so expect to spend some time making sure users can access only content at their clearance level.
When to Use Lucene
If you have an interest in behind-the-firewall search, you should download, install, and test the system. Lucene provides an excellent learning experience. I would have no hesitation installing Lucene in an organization where money for a brand name search system was not available. The caveat is that I am confident in my ability to download, install, debug, configure, and deploy the system. If you lack the requisite skills, you can still use Lucene. In September 2007, I met the founders of Tesuji, a company with business offices in Rome, Italy, and technical operations in Budapest, Hungary. The company provides technical support for Lucene, customization services, and a build that the company has already configured. Information about Tesuji is here. Another option is to download SOLR, which is a wrapper for Lucene. SOLR provides a number of features, but the one that is especially useful is the Web-based administrative interface. When you poke under the SOLR hood, you will find tools to replicate indexes and perform other chores.
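Because SOLR fronts Lucene with an HTTP interface, a query is just a URL. The sketch below assumes the default example host, port, and /solr/select handler from the SOLR distribution; a real deployment will use different values. It fetches results with nothing more than the standard Java library.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SolrQuerySketch {
    public static void main(String[] args) throws Exception {
        // Host, port, and path are the SOLR example defaults -- an assumption, not a given.
        String q = URLEncoder.encode("title:budget AND body:revenue", "UTF-8");
        URL url = new URL("http://localhost:8983/solr/select?q=" + q + "&rows=10");

        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);   // XML response listing the matching documents
        }
        in.close();
    }
}
```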
What surprises a number of people is the wide use of Lucene. IBM makes use of it. Siderean Software can include it in their system if the customer requires a search system as well as Siderean’s advanced content processing tools. Project Gutenberg uses it as well.
Some organizations have yet to embrace open source software. If you have the necessary expertise, give Lucene a test drive. Compared to the $300,000 or higher first-year license fees some search and content processing vendors demand, Lucene is an outstanding value.
Stephen Arnold, January 27, 2008
ESR, 4th Edition – The Definitive Atlas of Search
January 27, 2008
My colleague, Tony Byrne, publisher of The Enterprise Search Report, reminded me to inform you that the fourth edition of The Enterprise Search Report is available. It contains explanatory information essential for a successful search deployment, and you get 18 in-depth profiles of the leading vendors of enterprise search systems. I'm proud to have been part of the original study, which, since 2004, has provided information about what's turned out to be one of the hottest sectors in enterprise software. This new edition builds upon my original work for the first, second, and third editions. The fourth edition contains information useful to system administrators, procurement teams, investors, and organizations involved in any way with search and retrieval. I strongly urge you to visit CMSWatch.com, look at the sample chapter available without charge, and then order your copy. A site license is available. If you are waiting for my new study Beyond Search, buy ESR. The two studies are complementary, and in Beyond Search I refer readers to the ESR so I don't have to repeat myself. If you are serious about search, content processing, and value-adds to existing systems, you will find both studies useful.
Stephen Arnold, January 27, 2008
Up or Out? Probably Neither. Go Diagonal to the Future
January 27, 2008
I’ve been working on the introduction to Beyond Search. My fancy-dan monitor still makes my eyes tired. I’m the first to admit that next-generation technology is not without its weaknesses. To rest, I sat down and started flipping through the print magazines that accumulate each week.
Baseline is a Ziff Davis Enterprise magazine. I want you to know that I worked at Ziff Communications, and I have fond memories of Ziff Davis, one of Ziff's most important units, at its peak. ZD's products were good. Advertisers flocked to our trade shows, commercial online databases, and, of course, the magazines. I remember when Computer Shopper had so many pages that the printer complained because his binding unit wasn't designed to do what he called "telephone books." My recollection of that issue, which I saved for years, is of a newsprint magazine with more than 600 pages that month. The Baseline I'm holding has 62 pages and editorial copy on the inside back cover, not an ad.
Baseline is a computer business magazine with the tagline "where leadership meets technology." The Ziff of old was predicated on product reviews, product information, and product comparisons. This Baseline magazine doesn't follow the old Ziff formula. Times change, and managers have to adapt. The original Ziff formula was right for the go-go years of the PC industry when ad money flowed to hard copy publications. It's good that Baseline has a companion Web site. The information on the Web site is more timely than the articles in the print magazine, but maybe because of my monitor, I found the site difficult to read and confusing. Some of the news is timely and important; for example, Baseline carried the story about Google's signing up the University of Phoenix, another educational scalp in its bag. That's an important story, largely unreported and not included in the Google News index. I like the idea of a different, thoughtful approach to information technology. I also use the Baseline Web site.
The story in the January 2008 issue — “Scaling Up or Out” by David F. Carr — tackles an important subject. The question of how to scale to meet growing demand is one that many organizations now face. (I would provide a link to the article, but I could not locate it on the magazine’s Web site. The site lacks a key word search box, or I couldn’t find it. If you want to read the hard copy of this article, you will find it on pages 57, 58, 59, and 60.)
The subject addresses what IT options are available when systems get bogged down. The article correctly points out that you can buy bigger machines and consolidate activity. Traditional database bottlenecks can be reduced with big iron and big money. I think that’s scaling up. Another approach is to use more servers the way Google and many other Web sites do. I think that’s scaling out. The third option is to distribute the work over many commodity machines. But distributed processing brings some new headaches, and it is not a cure-all. There’s another option that walks a middle path. You “scale diagonally.” I think this means some “up” and some “out.” I’m sure some fancy Harvard MBA created apt terminology for this approach, but I think the phrase “technology grazing” fits the bill. The Baseline editors loved this story; the author loved it; and most readers of Baseline will love it. But when I read it, three points jabbed me in the snoot.
First, pages 58 and 59 feature pictures of three high-end servers. Most readers will not get their hands on these gizmos, but for techno-geeks these pix are better than a Sports Illustrated swimsuit issue. But no comparative data are offered. I don't think anyone at ZD saw these super-hot computers or actually used them. With starting prices in six figures and soaring effortlessly to $2 million or more for one server, some product analysis would be useful. It is clear from the article that for really tough database jobs, you will need several of these fire breathers. The three servers are the HP Integrity Superdome, the Unisys ES7000/one, and the IBM p5 595. And page 60 has a photo of the Sun SPARC Enterprise M9000. From these graphics, I knew that the article was going to make the point that for my enterprise data center, I would have to have these machines. By the way, HP, IBM, and Sun are listed as advertisers on page 8. Do you think an ad sales professional at ZD will suggest to Unisys that it too should advertise in Baseline? The annoyance: product fluff presented as management meat.
Second, the reason to buy big, fast iron is the familiar RDBMS or relational database management system. The article sidesteps the ubiquitous Codd architecture. Today, Dr. Codd's invention is asked to do more, faster. The problem is that big iron is a temporary fix. As the volume of data and transactions rises, today's hot iron won't be hot enough. I wasn't reading about a solution; I was getting a dose of the hardware sales professional's Holy Grail: guaranteed upgrades. I don't think bigger iron will resolve transaction bottlenecks with big data. The annoyance: the IT folks embracing a return to the mainframe may be exacerbating the crisis.
Third, I may be too sensitive. I came away from the article with the sense that distributed, massively parallel systems are okay for lightweight applications. Forget it for the serious computing work. For real work, you need HP Integrity Superdome, Unisys ES7000/one, IBM p5 595, or the Sun SPARC Enterprise M9000. Baseline hasn’t suggested how to remove the RDBMS handcuffs limiting the freedom of some organizations. Annoyance: no solution presented.
Google’s enterprise boss is Dave Girouard. Not long ago, I heard him make reference to a coming “crisis in IT.” This Baseline article makes it clear that IT professionals who keep their eyes firmly on the past have to embrace tried and true solutions.
In my opinion here’s what’s coming and not mentioned or even hinted at in the Baseline article. The present financial downturn is going accelerate some signficant changes in the way organizations manage and crunch their data. Economics will start the snowball rolling. The need for speed will stimulate organizations’ appetite for a better way to handle mission-critical data management tasks.
Amazon and Google are embracing different approaches to "old" data problems. If I read the Baseline article correctly, lots of transactions in real time aren't such a good idea with distributed, massively parallel architecture built on commodity hardware. What about Amazon and Google? Both run their respective $15 billion in annual turnover on this type of platform. And dozens of other companies are working like beavers to avoid a 1970s computer center using multi-core CPUs.
Finally, the problem is different from those referenced in the Baseline article. In Beyond Search, I profile one little-known Google initiative which may or may not become a product. But the research is forward-looking and aims to solve the database problem. Not surprisingly, the Google research uses commodity hardware and Googley distributed, massively parallel infrastructure. Could it be time for companies struggling with older technologies to look further ahead than having a big rack stuffed with several score multi-core CPUs? Could it be time to look for an alternative to the relational database? Could it be time to admit that the IT crisis has arrived?
Baseline seems unwilling to move beyond conventional wisdom. True, the article does advise me to “scale diagonally.” The problem is that I don’t know what this means. Do you?
Stephen Arnold, January 27, 2008
Twisty Logic
January 25, 2008
I live in a hollow in rural Kentucky. The big city newspaper is The Courier-Journal, and I look at it every day, and I even read a story or two. The January 25, 2008, edition ran a story called “Comair Passengers Blamed in Crash.” The story by Tom Loftus contained the phrase contributory negligence. This phrase might be useful or not too useful if you find yourself taking your search engine vendor to court.
The phrase contributory negligence is used in the context of a tragic aircraft mishap in Lexington, Kentucky. The attorney defending the airline in the matter "has claimed that passengers were partly to blame for their own deaths in the August 2006 crash…" The logic is that passengers knew the weather was bad and that the runway was under construction.
According to the article, the attorney did not include this argument in his defense of his client.
The notion that a vendor can blame the customer for failure is an interesting idea, and it is one that, I must admit, I had not considered. In Kentucky, this logic makes sense. But when you have a million dollar investment in a search system, you may want to make sure that you document your experience with the search system.
If you don’t or you take short cuts in the implementation, you might find yourself on the wrong side of the logic that asserts you — the customer — caused the problem. The vendor wins.
You can read the full Courier-Journal story here. The Web site rolls off older stories, so you might encounter a dead link. Hurry. The story was still live at 9:30 pm, Friday, January 25, 2008.
Stephen Arnold, January 25, 2008
Transformation: An Emerging “Hot” Niche
January 25, 2008
Transformation is a five-dollar word that means changing a file from one format to another. The trade magazines and professional writers often use data integration or normalization to refer to what amounts to taking a Word 97 document with a Dot DOC extension and turning it into a structured document in XML. These big words and phrases refer to a significant gotcha in behind-the-firewall search, content processing, and plain old moving information from one system to another.
Here’s a simple example of the headaches associated with what should be a seamless, invisible process after a half century of computing. The story:
You buy a new computer. Maybe a Windows laptop or a new Mac. You load a copy of Office 2007, write a proposal, save the file, and attach it to an email that says, “I’ve drafted the proposal we have to submit tomorrow before 10 am.” You send the email and go out with some friends.
In the midst of a friendly discussion about the merits of US Democratic presidential contenders, your mobile rings. You hear your boss saying over the background noise, "You sent me a file I can't open. I need the file. Where are you? In a bar? Do you have your computer so you can resend the file? No? Just get it done now!" Click here to read what ITWorld has to say on this subject. Also, there's some user vitriol over the Word-to-Word compatibility hassle itself here. A work around from Tech Addict is here.
Another scenario is to have a powerful new content processing system that churns through, according to the vendor's technical specification, "more than 200 common file types." You set up the content processing gizmo, aim it at the marketing department's server, and click "Index." You go home. When you arrive the next morning at 8 am, you find that the 60,000 documents in the folders containing what you wanted indexed have become an index with 30,000 documents. Where are the other 30,000 documents? After a bit of fiddling, you discover the exception log and find that half of the documents you wanted indexed were not processed. You look up the error code and learn that it means, "File type not supported."
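A pre-flight audit would have flagged the problem before the overnight run. Here is a hypothetical sketch: walk the target folder, compare each file extension against the list of types the vendor claims to support, and tally the likely exceptions before you click "Index." The supported-extension list below is made up for illustration; substitute the vendor's actual list.

```java
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FileTypeAudit {
    // Illustrative list only; substitute the file types your vendor actually supports.
    private static final Set<String> SUPPORTED = new HashSet<String>(Arrays.asList(
            "doc", "docx", "rtf", "txt", "htm", "html", "pdf", "xls", "ppt"));

    public static void main(String[] args) {
        File[] files = new File(args[0]).listFiles();
        if (files == null) {
            System.out.println("Not a readable directory: " + args[0]);
            return;
        }
        int indexable = 0, exceptions = 0;
        for (File f : files) {
            if (f.isDirectory()) continue;
            String name = f.getName().toLowerCase();
            int dot = name.lastIndexOf('.');
            String ext = dot >= 0 ? name.substring(dot + 1) : "";
            if (SUPPORTED.contains(ext)) {
                indexable++;
            } else {
                exceptions++;
                System.out.println("Likely exception: " + f.getName());
            }
        }
        System.out.println(indexable + " files look indexable; " + exceptions
                + " will probably land in the exception log.");
    }
}
```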
The culprit is the inability of one system to recognize and process a file. The reasons for the exceptions are many and often subtle. Let’s troubleshoot the first problem, the boss’s inability to open a Word 2007 file sent as an attachment to an email.
The problem is that the recipient is using an older version of Word. The sender saved the file in the most recent Word's version of XML. You can recognize these files by their extension Dot DOCX. What the sender should have done is save the proposal as [a] a Dot DOC file in an older "flavor" of Word's DOC format; [b] a file in the now long-in-the-tooth RTF (rich text format) type; or [c] a file in Dot TXT (ASCII) format. The fix is for the sender to resend the file in a format the recipient can view. But that one file can cost a person credibility points or the company a contract.
The second scenario is more complicated. The marketing department's server had a combination of Word files, Adobe Portable Document Format files with Dot PDF extensions, some Adobe InDesign files, some Quark Express files, some Framemaker files, and some database files produced on a system no one knows much about except that the files came from a system no longer used by marketing. A bit of manual exploration revealed that the Adobe PDF files were password protected, so the content processing system rejected them. The content processing system lacked import filters to open the proprietary page layout and publishing program files. So it rejected them. The mysterious files from the disused system were data dumps from an IBM CICS system. The content processing system opened them and then found them unreadable, so those were exceptions as well.
Now the nettles, painful nettles:
First, fixing the problem with any one file is disruptive but usually doable. The reputation damage done may or may not be repaired. At the very least, the sender's evening was ruined, but the high-powered vice president was with a gaggle of upper crust types arguing about an election's impact on trust funds. To "fix" the problem, she had to redo her work. Time consuming, and annoying to leave her friends. The recipient — a senior VP — had to jiggle his plans in order to meet the 10 am deadline. Instead of chilling with The Simpsons TV show, he had to dive into the proposal and shape the numbers under the pressure of the looming deadline.
We can now appreciate a 30,000-file problem. It is a very big problem. There's probably no way to get the passwords to open some of the PDFs. So, the PDFs' content may remain unreadable. The weird publishing formats have to be opened in the application that created them and then exported in a file format the content processing system understands. This is a tricky problem, maybe another Web log posting. An alternative is to print out hard copies of the files, scan them, use optical character recognition software to create ASCII versions, and then feed the ASCII versions of the files to the content processing system. (Note: some vendors make paper-to-ASCII systems to handle this type of problem.) Those IBM CICS files can be recovered, but an outside vendor may be needed if the system producing the files is no longer available in house. When the costs are added up, these 30,000 files can represent hundreds of hours of tedious work. Figure $60 per hour and a week's work if everything goes smoothly, and you can estimate the minimum budget "hit". No one knows the final cost because transformation is dicey. Cost naivety is the reason my blood pressure spikes when a vendor asserts, "Our system will index all the information in your organization." That's baloney. You don't know what will or won't be indexed unless you perform a thorough inventory of files and their types and then run tests on a sample of each document type. That just doesn't happen very often in my experience.
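To put rough numbers on that budget "hit," here is the back-of-the-envelope arithmetic; the hour counts are assumptions for illustration, not measurements.

```java
public class RemediationEstimate {
    public static void main(String[] args) {
        int ratePerHour = 60;      // the $60 per hour figure from the post
        int bestCaseHours = 40;    // "a week's work if everything goes smoothly"
        int likelyHours = 300;     // "hundreds of hours" -- an illustrative midpoint
        System.out.println("Minimum budget hit: $" + (ratePerHour * bestCaseHours)); // $2400
        System.out.println("More likely figure: $" + (ratePerHour * likelyHours));   // $18000
    }
}
```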
Now you know what transformation is. It is a formal process of converting lead into content gold.
One Google wizard — whose name I will withhold so Google’s legions of super-attorneys don’t flock to rural Kentucky to get the sheriff to lock me up — estimated that up to 30 percent of information technology budgets is consumed by transformation. So for a certain chicken company’s $17 million IT budget, the transformation bill could be in the $5 to $6 million range. That translates to selling a heck of a lot of fried chicken. Let’s assume the wizard is wrong by a factor of two. This means that $2 to $3 million is gnawed by transformation.
As organizations generate and absorb more digital information, what happens to transformation costs? The costs will go up. Whether the Google wizard is right or wrong, transformation is an issue that needs experienced hands minding the store.
The trigger for these two examples is a news item that the former president of Fast Search & Transfer, Ali Riaz, has started a new search company. Its USP (unique selling proposition) is data integration plus search and content processing. You can read Information Week‘s take on this new company here.
In Beyond Search, I discuss a number of companies and their ability to transform and integrate data. If you haven’t experienced the thrill of a transformation job, a data integration project, or a structured data normalization task — you will. Transformation is going to be a hot niche for the next few years.
Understanding of what can be done with existing digital information is, in general, wide and shallow. Transformation demands narrow and deep understanding of a number of esoteric and almost insanely diabolical issues. Let me identify three from my own personal experience, learned at the street academy called Our Lady of Saint Transformation.
First, each publishing system has its own peculiarities about files produced by different versions of itself. InDesign 1.0 and 2.0 cannot open the most recent version's files. There's a work around, but unless you are "into" InDesign, you have to climb this learning curve, and fast. I'm not picking on Adobe. The same intra-program incompatibilities plague Quark, PageMaker, the moribund Ventura, Framemaker, and some high-end professional publishing systems.
Second, data files spit out by mainframe systems can be fun for a 20-something. There are some interesting data formats still in daily use. EBCDIC or Extended Binary-Coded Decimal Interchange Code is something some readers can learn to love. It is either that or figuring out how to fire up an IBM mainframe, reinstall the application (good luck on that one, 20-somethings), restore the data from DASD or flat-file backup tapes (another fun task for a recent computer science grad), and then output something the zippy new search or content processing system can convert in a meaningful way. (Note: "meaningful way" is important because when a filter gets confused, it produces some interesting metadata. Some glitches can require you to reindex the content if your index restore won't work.)
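If the flat files survive, the character conversion itself is the easy part. Here is a minimal sketch that decodes an EBCDIC dump using the JDK's Cp037 (EBCDIC US/Canada) charset. Real mainframe extracts often also involve fixed-length records and packed-decimal fields, which this does not handle, and the code page itself is an assumption you must verify with whoever owns the system.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class EbcdicToAscii {
    public static void main(String[] args) throws Exception {
        // Slurp the raw bytes of a mainframe flat-file dump.
        InputStream in = new FileInputStream(args[0]);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            bytes.write(chunk, 0, n);
        }
        in.close();

        // Cp037 is the JDK name for the common EBCDIC US/Canada code page;
        // other shops use different code pages, so confirm before converting.
        String text = new String(bytes.toByteArray(), "Cp037");
        System.out.println(text);
    }
}
```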
Third, the Adobe PDFs with their two layers of security can be especially interesting. If you have one level of password, you can open the file and maybe print it, and copy some content from it. Or, not. If not, you print the PDFs (if printing has not been disabled) and go through the OCR-to-ASCII drill. In my opinion, PDFs are like a digital albatross. These birds hang around one's neck. Your colleagues want to "search" for the PDFs' content in their behind-the-firewall system. When asked to produce the needed passwords, I often hear something discomforting from the marketing department. So it is no surprise to learn that some system users are not too happy.
You may find this post disheartening.
No!
This post is chock full of really good news. It makes clear that companies in the business of transformation are going to have more customers in 2008 and 2009. It’s good news for off-shore conversion shops. Companies that have potent transformation tools are going to have a growing list of prospects. Young college grads get more chances to learn the mainframe’s idiosyncrasies.
The only negative in this rosy scenario is for the individual who:
- Fails to audit the file types and the amount of content in those file types
- Skips determining which content must be transformed before the new system is activated
- Ignores the budget implications of transformation
- Assumes that 200 or 300 filters will do the job
- Does not understand the implications behind a vendor's statement along these lines: "Our engineers can create a custom filter for you if you don't have time to do that scripting yourself."
One final point: those 200 or more file types. Vendors talk about them with gusto. Check to see if the vendor is licensing filters from a third party. In certain situations, the included file type filters don’t support some of the more recent applications’ file formats. Other vendors “roll their own” filters. But filters can vary in efficacy because different people write them at different times with different capabilities. Try as they might, vendors can’t squash some of the filter nits and bugs. When you do some investigating, you may be able to substantiate my data that suggest filters work on about two thirds of the files you feed into the search or content processing system. Your investigation may prove my data incorrect. No problem. When you are processing 250,000 documents, the exception file becomes chunky from the system’s two to three percent rejection rate. A thirty percent rate can be a show stopper.
Stephen E. Arnold, January 25, 2008
Google Version 2.0 Patents Available without Charge
January 24, 2008
Google Version 2.0 (Infonortics, Ltd., Tetbury, Glou., 2007) references more than 60 Google patent applications and patents. I've received a number of requests for this collection of US documents. I'm delighted to make them available to anyone at ArnoldIT.com. The patent applications and patents in this collection represent the Google inventions that provide significant insight into the company's remarkable technology. You can learn about a game designer's next-generation ad system. You can marvel at the productivity of former Digital Equipment AltaVista.com engineers and their solution to bottlenecks in traditional parallel processing systems. Some of these inventions require considerable effort to digest; for example, Ramanathan Guha's inventions regarding the Semantic Web. Others are a blend of youthful brilliance and Google's computational infrastructure; specifically, the transportation routing system Google uses to move employees around the San Francisco area. Enjoy.
Stephen E. Arnold, January 24, 2008
Reducing the “Pain” in Behind-the-Firewall Search
January 23, 2008
I received several interesting emails in the last 48 hours. I would like to share the details with you, but the threat of legal action dissuades me. The emails caused me to think about the “pain” that accompanies some behind-the-firewall search implementations. You probably have experienced some of these pains.
Item: fee-creep pain. What happens is that the vendor sends a bill that is greater than the anticipated amount. Meetings ensue, and in most cases, the licensees pay the bill. Cost overruns, in my experience, occur with depressing frequency. There are always reasonable explanations.
Item: complexity pain. Some systems get more convoluted with what some of my clients have told me is "depressing quickness." With behind-the-firewall search nudging into middle age, is it necessary for systems to become more complicated, making it very difficult, if not impossible, for the licensee's technical staff to make changes? One licensee of a well-known search system told me, "If we push in a button here, it breaks something over there. We don't know how the system's components interconnect."
Item: relevancy pain. Here's the scenario. You are sitting in your office and a user sends an email that says, "We can't find a document that we know is in the system. We type in the name of the document, and we can't find it." In effect, users are baffled about the connection between their query and what the system returns as a relevant result. Now the hard part comes. The licensee's engineer tries to tweak relevancy or hard-wire a certain hit to appear at the top of the results list. Some systems don't allow fiddling with the relevancy settings. Others offer dozens, even three score, of knobs and dials. Few or no controls, or too many controls — the happy medium is nowhere to be found.
Item: performance pain. The behind-the-firewall system churns through the training data. It indexes the identified servers. The queries come back in less than one second, blindingly fast for a behind-the-firewall network. Then performance degrades, not all at once. No, the system gets slower over time. How does one fix performance? The solution our research suggests as the preferred one is more hardware. The budget is left gasping, and in time performance degrades again.
Item: impelling. Some vendors install a system. Before the licensee knows it, the vendor's sales professional is touting an upgrade. One rarely discussed issue is that certain vendors' upgrades — how shall I phrase it — often introduce issues. The zippy new feature is not worth the time, cost, and hassle of stabilizing a system or getting it back online. Search is becoming a consumer product with "new" and "improved" bandied freely among the vendor, PR professionals, tech journalists, and the licensees. Impelling for some vendors is where the profit is. So upgrades are less about the system and more about generating revenue.
The causes for each of these pressure points are often complicated. Many times the licensees are at fault. The customer is not always right when he or she opines, "Our existing hardware can handle the load." Or, "We have plenty of bandwidth and storage." Vendors can shade facts in order to make the sale with the hope of getting lucrative training, consulting, and support work. One vendor hires an engineer at $60,000 per year and bills that person's time at a 5X multiple, counting on 100 percent billability, to pump more than $200,000 into the search firm's pockets after paying salaries, insurance, and overhead. Other vendors are marketing operations, and their executives exercise judgment when it comes to explaining what the system can and can't do under certain conditions.
What can be done about these pain points? The answer is likely to surprise some readers. You expect a checklist of six things that will convert search lemons into search lemonade. I am going to disappoint you. If you license a search system and install it on your organization’s hardware, you will experience pain, probably sooner rather than later. The reason is that most licensees underestimate the complexity, hardware requirements, and manual grunt work needed to get a behind-the-firewall system to deliver superior relevancy and precision. My advice is to budget so the search vendor does the heavy lifting. Also consider hosted or managed services. Appliances can reduce some of the aches as well. But none of these solutions delivers a trouble-free search solution.
There are organizations with search systems that work and customers who are happy. You can talk to these proud owners at the various conferences featuring case studies. Fast Search & Transfer hosts its own search conference so attendees can learn about successful implementations. You will learn some useful facts at these trade shows. But the best approach is to have search implementation notches on your belt. Installing and maintaining a search system is the best way to learn what works and what doesn’t. With each installation, you get more street smarts, and you know what you want to do to avoid an outright disaster. User groups have fallen from favor in the last decade. Customers can “wander off the reservation”, creating a PR flap in some situations. Due to the high level of dissatisfaction among users of behind-the-firewall search systems, it’s difficult to get detailed, timely information from a search vendor’s customers. Ignorance may keep some people blissfully happy.
Okay, users are unhappy. Vendors make it a bit of work to get the facts about their systems. Your IT team has false confidence in its abilities. You need to fall back on the basics. You know these as well as I do: research, plan, formulate reasonable requirements, budget, run a competitive bid process, manage, verify, assess, modify, etc. The problem is that going through these tasks is difficult and tedious work. In most organizations, people are scheduled to the max, or there are too few people to do the work due to staff rationalizations. Nevertheless, a reliable behind-the-firewall search implementation takes work, a great deal of work. Shortcuts — on the licensee's side of the fence or the vendor's patch of turf — increase the likelihood of a problem.
Another practical approach is to outsource the search function. A number of vendors offer hosted or managed search solutions. You may have to hunt for vendors who offer these services, sometimes called subscription search. Take a look at Blossom Software.
Also, consider one of the up-and-coming vendors. I’ve been impressed with ISYS Search Software and Siderean Software. You may also want to take another look at the Thunderstone – EPI appliance or the Google Search Appliance. I think both of these systems can deliver a reliable search solution. Both vendors’ appliances can be customized and extended.
But even with these pragmatic approaches, you run a good chance of turning your ankle or falling on your face if you stray too far from the basics. Pain may not be avoidable, but you can pick your way through various search obstacles if you proceed in a methodical, prudent way.
Stephen Arnold, January 23, 2008
Search, Content Processing, and the Great Database Controversy
January 22, 2008
“The Typical Programmer” posted the article “Why Programmers Don’t Like Relational Databases,” and ignited a mini-bonfire on September 25, 2007. I missed the essay when it first appeared, but a kind soul forwarded it to me as part of an email criticizing my remarks about Google’s MapReduce.
I agreed with most of the statements in the article, and I enjoyed the comments by readers. When I worked in the Keystone Steel Mill’s machine shop in college, I learned two things: [a] don’t get killed by doing something stupid and [b] use the right tool for every job.
If you have experience with behind-the-firewall search systems and content processing systems, you know that there is no one right way to handle data management tasks in these systems. If you poke around the innards of some of the best-selling systems, you will find a wide range of data management and storage techniques. In my new study "Beyond Search," I don't focus on this particular issue because most licensees don't think about data management until their systems run aground.
Let me highlight a handful of systems (without taking sides or mentioning names) and the range of data management techniques employed. I will conclude by making a few observations about one of the many crises that bedevil some behind-the-firewall search solutions available today.
The Repository Approach. IBM acquired iPhrase in 2005. The iPhrase approach to data management was similar to that used by Teratext. The history of Teratext is interesting, and the technology seems to have been folded back into the giant technical services firm SAIC. Both of these systems ingest text, store the source content in a transformed state, and create proprietary indexes that support query processing. When a document is required, the document is pulled from the repository. When I asked both companies about the data management techniques used in their systems for the first edition of The Enterprise Search Report (2003-2004), I got very little information. What I recall from my research is that both systems used a combination of technologies integrated into a system. The licensee was insulated from the mechanics under the hood. The key point is that two very large systems able to handle large amounts of data relied on data warehousing and proprietary indexes. I heard when IBM bought iPhrase that one reason for IBM's interest was that iPhrase customers were buying hardware from IBM in prodigious amounts. The reason Teratext is unknown in most organizations is that it is one of the specialized tools purpose-built to handle CIA- and NSA-grade information chores.
The Proprietary Data Management Approach. One of the "Big Three" of enterprise search has created its own database technology, its own data management solution, and its own data platform. The reason is that this company was among the first to generate a significant amount of metadata from "intelligent" software. In order to reduce latency and cope with the large temporary files iterative processing generated, the company looked for an off-the-shelf solution. Not finding what it needed, the company's engineers solved the problem, and even today the company "bakes in" its data management, database, and data manipulation components. When this system is licensed as an OEM (original equipment manufacturer) product, the company's own "database" lives within the software built by the licensee. Few are aware of this doubling up of technology, but it works reasonably well. When a corporate customer of a content management system wants to upgrade the search system included in the CMS, the upgrade is a complete version of the search system. There is no easy way to get around the need to implement a complete, parallel solution.
The Fruit Salad Approach. A number of search and content processing companies deliver a fruit salad of data solutions in a single product. (I want to be vague because some readers will want to know who is delivering these hybrid systems, and I won't reveal the information in a public forum. Period.) Poke around and you will find open source database components. MySQL is popular, but there are other RDBMS offerings available, and depending on the vendor's requirements, the best open source solution will be selected. Next, the vendor's engineers will have designed a proprietary index. In many cases, the structure and details of the index are closely guarded secrets. The reason is that the speed of query processing is often related to the cleverness of the index design. What I have found is that companies that start at the same time usually take similar approaches. I think this is because when the engineers were in university, the courses taught the received wisdom. The students then went on to their careers and tweaked what was learned in college. Despite the assertions of uniqueness, I find interesting coincidences based on this education factor. Finally, successful behind-the-firewall search and content processing companies license technology, buy it, or acquire it as the beneficiaries of a helpful venture capital firm. The company ends up with different chunks of code, and in many cases, it is easier to use whatever is there than to figure out how to make the solution work with the pears, apricots, and apples in use elsewhere in the company.
The Leap Frog. One company has designed a next-generation data management system. I talk about this technology in my column for one of Information Today's tabloids, so I won't repeat myself. This approach says, in effect: Today's solutions are not quite right for the petabyte-scale of some behind-the-firewall indexing tasks. The fix is to create something new, jumping over Dr. Codd, RDBMS, the costs of scaling, etc. When this data management technology becomes commercially available, there will be considerable pressure placed upon IBM, Microsoft, Oracle; open source database and data management solutions; and companies asserting "a unique solution" while putting old wine in new bottles.
Let me hazard several observations:
First, today’s solutions must be matched to the particular search and content processing problem. The technology, while important, is secondary to your getting what you want done within the time and budget parameters you have. Worrying about plumbing when the vendors won’t or can’t tell you what’s under the hood is not going to get your system up and running.
Second, regardless of the database, data management, or data transformation techniques used by a vendor, the key is reliability, stability, and ease of use from the point of view of the technical professional who has to keep the system up and running. You might want to have a homogeneous system, but you will be better off getting one that keeps your users engaged. When the data plumbing is flawed, look first at the resources available to the system. Any of today’s approaches work when properly resourced. Once you have vetted your organization, then turn your attention to the vendor.
Third, the leap frog solution is coming. I don’t know when, but there are researchers at universities in the U.S. and in other countries working on the problems of “databases” in our post-petabyte world. I appreciate the arguments from programmers, database administrators, vendors, and consultants. They are all generally correct. The problem, however, is that none of today’s solutions were designed to handle the types or volumes of information sloshing through networks today.
In closing, as the volume of information increases, today’s solutions — CICS, RDBMS, OODB and other approaches — are not the right tool for tomorrow’s job. As a pragmatist, I use what works for each engagement. I have given up trying to wrangle the “facts” from vendors. I don’t try to take sides in the technical religion wars. I do look forward to the solution to the endemic problems of big data. If you don’t believe me, try and find a specific version of a document. None of the approaches identified above can do this very well. No wonder users of behind-the-firewall search systems are generally annoyed most of the time. Today’s solutions are like the adult returning to college, finding a weird new world, and getting average marks with remarkable consistency.
Stephen Arnold, January 22, 2008
Search Vendors and Source Code
January 21, 2008
A reader of this Web log wrote and asked the question, “Why is software source code (e.g. programs, JCL, Shell scripts, etc.) not included with the “enterprise search” [system]?”
In my own work, I keep the source code because: [a] it's a miracle (sometimes) that the system really works, and I don't want youngsters to realize my weaknesses, [b] I don't want to lose control of my intellectual property such as it is, [c] I am not certain what might happen; for example, a client might intentionally or unintentionally use my work for a purpose with which I am not comfortable, or [d] I might earn more money if I am asked to customize the system.
No search engine vendor with whom I have worked has provided source code to the licensee unless specific contractual requirements were met. In some U.S. Federal procurements, the vendor may be asked to place a copy of a specific version of the software in escrow. The purpose of placing source code in escrow is to provide an insurance policy and peace of mind. If the vendor goes out of business — so the reasoning goes — then the government agency or consultants acting on the agency’s behalf can keep the system running.
Most of the search systems involved in certain types of government work do place their systems’ source code in escrow. Some commercial agreements with which I have familiarity have requested the source code to be placed in escrow. In my experience, the requirement is discussed thoroughly and considerable attention is given to the language regarding this provision.
I can’t speak for the hundreds of vendors who develop search and content processing systems, but I can speculate that the senior management of these firms have similar reasons to [a], [b], [c], and [d] above.
Based on my conversations with vendors and developers, there may be other factors operating as well. Let me highlight these but remember, your mileage may vary:
First, some vendors don’t develop their own search systems and, therefore, don’t have source code or at least complete source code. For example, when search and content processing companies come into being, the “system” may be a mixture of original code, open source, and licensed components. At start up, the “system” may be positioned in terms of features, not the underlying technology. As a result, no one gives much thought to the source code other than keeping it close to the vest for competitive, legal, or contractual reasons. This is a “repackaging” situation where the marketing paints one picture, and the technical reality is behind the overlay.
Second, some vendors have very complicated deals for their systems' technology. One example is vendors who may enjoy a significant market share. Some companies are early adopters of certain technology. In some cases, the expertise may be highly specialized. In the development of commercial products, some firms find themselves in interesting licensing arrangements; for example, an entrepreneur may rely on a professor or classmate for some technology. Sometimes, over time, these antecedents are merged with other technology. As a result, these companies do not make their source code available. One result is that some engineers, in the search vendor's company and at its customer locations, may have to research the solution (which can take time) or perform workarounds to meet their customers' needs (which can increase the fees for customer service).
Third, some search vendors find themselves with orphaned technology. The search vendor licensed a component from another person or company. That person or company quit business, and the source code disappeared or is mired in complex legal proceedings. As a result, the search vendor doesn’t have the source code itself. Few licensees are willing to foot the bill for Easter egg hunts or resolving legal issues. In my experience, this situation does occur, though not often.
Keep in mind that search and content processing research funded by U.S. government money may be publicly available. The process required to get access to this research work and possibly source code is tricky. Some people don’t realize that the patent for PageRank (US6285999) is held by the Stanford University Board of Trustees, not Google. Federal funding and the Federal “strings” may be partly responsible. My inquiries to Google on this matter have proven ineffectual.
Several companies, including IBM, use Lucene or pieces of Lucene as a search engine. The Lucene engine is available from Apache. You can download code, documentation, and widgets developed by the open source community. One company, Tesuji in Hungary, licenses a version of Lucene plus Lucene support services. So, if you have a Lucene-based search system, you can use the Apache version of the program to understand how the system works.
To summarize, there are many motives for keeping search system source code out of circulation. Whether it's fear of the competition or a legal consideration, I don't think search and content processing vendors will change their policies any time soon. When my team has had access to source code for due diligence conducted for a client of mine, I recall my engineers recoiling in horror or laughing in an unflattering manner. The reasons are part programmer snobbishness and part the numerous short cuts that some search system vendors have taken. I chastise my engineers, but I know only too well how time and resource constraints exact harsh penalties. I myself have embraced the policy of "starting with something" instead of "starting from scratch." That's why I live in rural Kentucky, burning wood for heat, and eating squirrels for dinner. I am at the opposite end of the intellectual spectrum from the wizards at Google and Microsoft, among other illustrious firms.
Bottom line: some vendors adopt the policy of keeping the source code to themselves. The approach allows the vendors to focus on making the customer happy and has the advantage of keeping the provenance of some technology in the background. You can always ask a vendor to provide source code. Who knows, you may get lucky.
Stephen Arnold, January 21, 2008
Sentiment Analysis: Bubbling Up as the Economy Tanks
January 20, 2008
Sentiment analysis is a sub-discipline of text mining. Text mining, as most of you know, refers to processing unstructured information and text blocks in a database to wheedle useful information from sentences, paragraphs, and entire documents. Text mining looks for entities, linguistic clues, and statistically significant high points.
The processing approach varies from vendor to vendor. Some vendors use statistics; others, semantic techniques. More and more vendors mix and match procedures to get the best of each approach. The idea is that software "reads" or "understands" text. None of the more than 100 vendors offering text mining systems and utilities does as well as a human, but the systems are improving. When properly configured, some systems outperform a human indexer. (Most people think humans are the best indexers, but for some applications, software can do a better job.) Humans are needed to resolve "exceptions" when automated systems stumble. But a human indexer often memorizes a number of terms and sometimes uses these without seeking a more appropriate term from the controlled vocabulary. Also, human indexers can get tired, and fatigue affects indexing performance. Software indexing is the only way to deal with the large volumes of information in digital form today.
Sentiment analysis “reads” and “understands” text in order to find out if the document is positive or negative. About eight years ago, my team did a sentiment analysis for a major investment fund’s start up. The start up’s engineers were heads down on another technical matter, and the sentiment analysis job came to ArnoldIT.com.
We took some short cuts because time was limited. After looking at various open source tools and the code snippets in ArnoldIT’s repository, we generated a list of words and phrases that were generally positive and generally negative. We had several collections of text, mostly from customer support projects. We used these and applied some ArnoldIT “magic”. We were able to process unstructured information and assign a positive or negative score to documents based on our ArnoldIT “magic” and the dictionary. We assigned a red icon for results that our system identified as negative. Without much originality, we used a green icon to flag positive comments. The investment bank moved on, and I don’t know what the fate of our early sentiment analysis system was. I do recall that it was useful in pinpointing negative emails about products and services.
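For readers who want to see the shape of that dictionary approach, here is a toy sketch, not the ArnoldIT system itself: score a document by counting hits from hand-built positive and negative word lists and flag it green or red. The word lists are tiny placeholders; a working lexicon runs to thousands of words and phrases.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SentimentSketch {
    // Illustrative lexicons only; a real list is far larger and domain-tuned.
    private static final Set<String> POSITIVE = new HashSet<String>(Arrays.asList(
            "excellent", "reliable", "helpful", "fast", "recommend"));
    private static final Set<String> NEGATIVE = new HashSet<String>(Arrays.asList(
            "broken", "slow", "refund", "angry", "disappointed"));

    // Returns "GREEN" for net-positive documents, "RED" for net-negative ones.
    static String classify(String document) {
        int score = 0;
        for (String token : document.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(token)) score++;
            if (NEGATIVE.contains(token)) score--;
        }
        return score >= 0 ? "GREEN" : "RED";
    }

    public static void main(String[] args) {
        System.out.println(classify("The support staff was helpful and fast."));
        System.out.println(classify("I am disappointed; the unit arrived broken."));
    }
}
```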
A number of companies offer sentiment analysis as a text mining function. Vendors include Autonomy, Corpora Software, and Fast Search & Transfer, among others. A number of companies offer sentiment analysis as a hosted service with the work more sharply focused on marketing and brands. Buzzmetrics (a unit of AC Nielsen), Summize, and Andiamo Systems compete in the consumer segment. ClearForest, before it was subsumed into Reuters (which was then bought by the Thomson Corporation), had tools that performed a range of sentiment functions.
The news that triggered my thinking about sentiment was statistics and business intelligence giant SPSS's announcement that it had enhanced the sentiment analysis functions of its Clementine content processing system. According to ITWire, Clementine has added "automated modeling to identify the best analytic models, as well as combining multiple predictions for the most accurate results." You can read more about SPSS's Clementine technology here. SPSS acquired LexiQuest, an early player in rich content processing, in 2002. SPSS has integrated its own text mining technology with the LexiQuest technology. SAS followed suit but licensed Inxight Software technology and combined that with SAS's home-grown content processing tools.
There’s growing interest in analyzing call center, customer support, and Web log content for sentiment about people, places, and things. I will be watching for more announcements from other vendors. In the behind-the-firewall search and content processing sectors, there’s a strong tendency to do “me too” announcements. The challenge is to figure out which system does what. Figuring out the differences (often very modest) between and among different vendors’ solutions is a tough job.
Will 2008 be the year for sentiment analysis? We’ll know in a few months if SPSS competitors jump on this band wagon.
Stephen E. Arnold, January 20, 2008.

