Open Search Server
June 14, 2011
TechWorld’s “Open Search Server Releases New Developer Preview” offers details on a preview of a new Apache Lucene-based search system, which is written in Java.
The article reports that:
“Open Search Server can crawl file systems, databases and websites” and supports a wide variety of document formats. The preview includes “a new screenshot feature that captures screenshots of the Web pages being crawled, similar to the preview feature of a big name public search engine.”
The open software offers companies independence in developing their information management strategies. In the “Cloud” era, these strategies – how users will be able to search and retrieve documents – will become strategic.
If you are tracking open source search vendors, add this one to your list.
Stephen E Arnold, June 14, 2011
Sponsored by ArnoldIT.com, the resource for enterprise search information and current news about data fusion
ProQuest: A Typo or Marketing?
June 10, 2011
I was poking around with the bound phrase “deep indexing.” I had a briefing from a start up called Correlation Concepts. The conversation focused on the firm’s method of figuring out relationships among concepts within text documents. If you want to know more about Correlation Concepts, you can get more information from the firm’s Web site at http://goo.gl/gnBz6.
I mentioned to Correlation Concepts Dr. Zbigniew Michalewicz’s work in mereology and genetic algorithms and also referenced the deep extraction methods developed by Dr. David Bean at Attensity. I also commented on some of the methods disclosed in Google’s open source content. But Google has become less interesting to me as new approaches have become known to me. Deep extraction requires focus, and I find it difficult to reconcile focus with the paint gun approach Google is now taking in disciplines far removed from my narrow area of interest.
A typo is a typo. An intentional mistake may be a joke or maybe disinformation. Source: http://thiiran-muru-arul.blogspot.com/2010/11/dealing-with-mistakes.html
After the interesting demo given to me by Correlation Concepts, I did some patent surfing. I use a number of tools to find, crunch, and figure out which crazily worded filing relates to other, equally crazily worded documents. I don’t think the patent system is much more than an exotic work of fiction and fancy similar to Spenser’s The Faerie Queene.
Deep indexing is important. In some cases, key word indexing does not capture the “aboutness” of a document. As metadata becomes more important, indexing outfits have to cut costs. Human indexers are like tall grass in an upscale subdivision. Someone is going to trim that surplus. In indexing, humans get pushed out for fancy automated systems. Initially more expensive than humans, the automated systems don’t require retirement, health care, or much management. The problem is that humans still index certain content better than automated systems. Toss out high quality indexing and insert algorithmic methods, and you get search results which can vary from indexing update to indexing update.
Stormy Weather for the Eucalyptus Grove?
June 10, 2011
Still feel safe in the cloud? Have you heard from Eucalyptus lately?
According to “Critical Vulnerability in Open Source Eucalyptus Clouds”, there has been another break-in. At least a theoretical one; university researchers have found a hole in the cloud. Per the article:
“An attacker can, with access to the network traffic, intercept Eucalyptus SOAP commands and either modify them or issue their own arbitrary commands. To achieve this, the attacker needs only to copy the signature from one of the XML packets sent by Eucalyptus to the user. As Eucalyptus did not properly validate SOAP requests, the attacker could use the copy in their own commands sent to the SOAP interface and have them executed as the authenticated user.”
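The weakness described in the quote can be illustrated with a deliberately simplified sketch. This is not Eucalyptus code and it does not use SOAP or XML signatures; it substitutes a plain HMAC, a made-up shared key, and hypothetical function names to show the general problem: a check that accepts any previously seen signature, without tying it to the request content, lets a copied signature authorize a different command.

```python
import hashlib
import hmac

# Hypothetical shared key, used only for this illustration.
SECRET = b"demo-shared-secret"

# Signatures the service has legitimately produced (and an eavesdropper may have seen).
issued_signatures = set()


def sign(body: str) -> str:
    """Return a hex HMAC-SHA256 signature over the request body."""
    return hmac.new(SECRET, body.encode("utf-8"), hashlib.sha256).hexdigest()


def issue(body: str) -> dict:
    """Build a signed request and remember its signature."""
    signature = sign(body)
    issued_signatures.add(signature)
    return {"body": body, "signature": signature}


def flawed_accept(request: dict) -> bool:
    """Flawed check: the signature is known, but never compared to the body."""
    return request["signature"] in issued_signatures


def strict_accept(request: dict) -> bool:
    """Stricter check: recompute the signature over the body actually received."""
    expected = sign(request["body"])
    return hmac.compare_digest(expected, request["signature"])


# A legitimate request, and a forged one that reuses the copied signature.
legit = issue("<DescribeInstances/>")
forged = {"body": "<TerminateInstances/>", "signature": legit["signature"]}

print("flawed check accepts forged request:", flawed_accept(forged))
print("strict check accepts forged request:", strict_accept(forged))
```

The forged request passes the flawed check and fails the strict one, because only the strict check recomputes the signature over the content that actually arrived.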
The platform has already provided a newer, downloadable version that corrects the issue. Eucalyptus has warned their services may be a little spotty while the rest of the system recognizes the fix.
Go ahead and tally another tick mark against the cloud. What’s worse, besides the discovered threat, users must contend with the hassle of outages related to the fix. I could be wrong, but it seems it is only a matter of time before some serious consequences arise from lax attitudes concerning data storage.
How about putting enterprise data in the cloud with a search interface? Or maybe a bank of social security numbers? Now what about a security lapse?
Sarah Rogers, June 10, 2011
Protected: Straighten Up Your SharePoint Web Parts Tables
June 9, 2011
Landscape of Search Order Form Live
June 8, 2011
Pandia.com, the publisher of “The New Landscape of Enterprise Search”, has posted an information page and a link to an order form. This new study takes a frank, objective look at the market for enterprise search systems and six leading vendors. Unlike the “pay to play” studies and conferences, the 150 page report provides the detail procurement teams and business professionals need to decide which system best suits a particular findability problem.
The report answers a number of questions which are routinely overlooked, ignored, or unknown to some of the organizations writing “pay to play” reports about vendors; for example:
- What was the status of the rewrite of Fast ESP prior to the purchase of the company by Microsoft in 2008?
- What technical methods cause certain scaling challenges in some Endeca and Vivisimo implementations?
- How do the platforms of Autonomy and Exalead compare in multi content deployments for enterprise applications?
- Why are most procurements won by a small number of vendors despite dozens, if not hundreds of lower cost options?
- What are the cost implications of Google’s GSA pricing method for the GB 7007 and GB 9009?
- What’s the outlook for search innovation in the next nine to 12 months?
This report goes beyond Stephen E Arnold’s 2008 report on content processing for the Gilbane Group, the Successful Enterprise Search Management monograph for Galatea in 2009, and his three studies of Google’s now-ageing search technology in the Google trilogy, published by Infonortics. Significant additional investigation via interviews and hands on involvement with search technology propel this report well beyond his first three editions of the Enterprise Search Report, 2004 to 2007.
If you are involved in enterprise search, you will want to get a copy of this report which discusses search solutions available from Autonomy, Endeca, Exalead, Google, Microsoft (Fast Search), and Vivisimo. The report includes a table providing brief facts about two dozen other systems, including open source options.
What sets the report apart is that its information does not duplicate what is available without charge in the Search Wizards Speak collection of more than 50 interviews with experts in search and retrieval or in Mr. Arnold’s blogs about search and content processing, Beyond Search and Inteltrax.com.
You can access the Pandia.com description of the report and the order form at http://www.pandia.com/enterprise-search/. The report costs $20 and is available as a PDF file.
Don Anderson, June 8, 2011
The post was sponsored by Stephen E Arnold
SAS Simplifies Text Analysis
June 8, 2011
Let’s face it, time is money. (Some former SEO Panda victims, assorted art history majors, and a few MBAs perceive time as opportunity to contemplate the magnitude of their student loans and monthly cash flows.)
Wading through archives to find the answers to your questions is labor intensive and more work than watching reruns on TV.
We found that “New SAS Industry Taxonomy Rules Starter Kits Enhance the Speed to Value of Text Analytics” promises an answer to some of these problems. The approach can cut search time from months to weeks by creating a structured taxonomy. The story asserted:
Building taxonomies from scratch can be daunting. But with the new SAS Industry Taxonomy Rules starter kits, organizations get a jump start, and can move more quickly from document and text chaos to value and insight from their unstructured data.
However, the use of taxonomy isn’t anything new. Many businesses recognize the value of categorizing their archives in order to save time and in the end money.
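For readers who have not worked with one, a rule-based taxonomy can be sketched in a few lines. The example below is a minimal illustration, not SAS syntax or the content of the starter kits: the category names, the trigger terms, and the `categorize` function are all invented for this sketch. Each category lists terms, and a document is filed under every category whose terms appear in its text.

```python
import re

# Hypothetical taxonomy: category path -> terms that trigger it.
TAXONOMY_RULES = {
    "Finance/Loans": ["loan", "mortgage", "interest rate"],
    "Finance/Insurance": ["policy", "claim", "premium"],
    "Legal/Contracts": ["agreement", "clause", "indemnity"],
}


def categorize(text: str) -> list:
    """Return the sorted list of categories whose terms occur in the text."""
    lowered = text.lower()
    matches = []
    for category, terms in TAXONOMY_RULES.items():
        # Match whole words or phrases only, so "claim" does not fire on "acclaimed".
        if any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms):
            matches.append(category)
    return sorted(matches)


print(categorize("The mortgage agreement includes an indemnity clause."))
```

A starter kit, in this picture, amounts to shipping a prebuilt rule table for an industry so that a team refines existing rules instead of writing them from scratch.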
The same goes for digital archives of electronic documents. SAS helps organize electronic content and save valuable time and money with integrated capabilities, pitched as the most effective on the market today, for combining structured and unstructured data. It also uses predictive analytics to remember which documents or areas are searched most, and it allows customers to customize their systems to fit individual needs.
Sounds like a win win.
Leslie Radcliff, June 8, 2011
Smoothing SharePoint Upgrades
June 7, 2011
After a whirl of conferences, I was catching up on my reading. I was interested in J. Peter Bruzzese’s article “Don’t Upgrade to SharePoint 2010 Until You Read This.” As the title suggests, this is not an update for the faint of heart. Our experience at Search Technologies has been that SharePoint upgrades are reasonably straightforward.
His warning suggests:
You may like to be hands-on with your own environment, installing all your own servers and such, but the upgrade to SharePoint 2010 should either be treated with the utmost care or turned over to an expert who’s done it a bunch of times and has it down to a science.
He continues by saying it took him “a week to research and test in-place upgrade process and the database-attach migrate process before throwing down the ‘hire somebody else’ gauntlet.”
So there it is.
His caution comes complete with neon blinking lights. His article cited some well known experts; for example, Spencer Harbar, Microsoft Enterprise Architect, and Don Holmes, Intellium consultant and trainer. The article suggests that any “headaches” that you encounter “depends more on your current environment than on SP2010 itself.”
We agree.
They claim that this upgrade “is far less of an issue than upgrading from SPs2003 to Moss2007.”
We have some suggestions. First, check with specialists. Please consider Search Technologies as a potential resource. Second, work through Microsoft’s documentation, paying particular attention to customization notes. Microsoft’s installers are thoroughly tested, but it is impossible for any vendor to upgrade every possible configuration of SharePoint. Third, make certain you have a backup, the installation discs and their keys, and any other information that Microsoft provides licensees, certified engineers, or certified SharePoint developers. Often a hiccup can be addressed easily when these essentials are at hand.
For more information, contact us via our Web site at www.searchtechnologies.com.
Iain Fletcher, June 7, 2011
Search Technologies
Update on Thetus Savanna
June 6, 2011
A Sys-Con Media article, “Thetus’ Savanna Analytical Tool,” written by two authors, provides an overview of the Thetus Savanna Analytical Tool and includes a video evaluation. We found the information interesting, but parts did look as if the Thetus marketing gene was dominant.
The Savanna Analytical tool is designed to provide search, discovery and visualization tools for analysts. The article said:
Savanna uses tools such as Kapow to scrape websites and all-source data and then pushes them through MetaCarta (for geo-spatial analysis) and Janya (for real-language textual analysis). This data is then sorted into a Savanna’s application – enabling real-time search.
After the documents go through Kapow, MetaCarta and Janya, Savanna re-renders the documents and turns the masses of text into real pages making the search and discovery of the pages much easier.
The write up added:
Savanna’s search function crawls through the document repository added, and uses socio-economic indicators to categorize. It allows analysts to take a large number of search returns and narrow them down quickly and accurately.
If only all decisions could be so simple. Real data in real life can give even sophisticated systems indigestion.
Stephen E Arnold, June 1, 2011
Microsoft Search Blog Not Updated in Months
June 2, 2011
If you have not visited the Microsoft Search Blog, you may want to check it out. We think it is a good example of the commitment Microsoft has to enterprise search. Oh, Microsoft still sells Fast Search, consulting, certifications, and add ons. However, the blog is not exactly a pivot point.
It’s about relevance, it’s about speed and it’s all about competition…ya snooze, ya lose, right?
We’re a little confused, then, by the search results we got from Google recently when we queried “enterprise search.”
Our queries for content and visits to the site over a week or so revealed that the last update seems to have been about ten months ago.
My hunch is that somewhere, in some small cubby in Redmond, there’s a person who’s supposed to be searching and updating the enterprise blog.
We try to monitor the SharePoint search world, and we are finding that the information about SharePoint search is mostly about getting a SharePoint system under control, back on track, and delivering specific functionality. You can track our SharePoint coverage at www.sharepointsemantics.com. We also cover SharePoint in Beyond Search. Just search for the category SharePoint in the search box on the blog’s splash page.
The goslings and I will try to “mind the gap”.
Stephen E Arnold, June 2, 2011
Protected: Indexing SharePoint Content through Northern Light
June 2, 2011