Traditional Entity Extraction’s Six Weaknesses
September 26, 2011
Editor’s Note: This is an article written by Tim Estes, founder of Digital Reasoning, one of the world’s leading providers of technology for entity-based analytics. You can learn more about Digital Reasoning at www.digitalreasoning.com.
Most university programming courses ignore entity extraction. Some professors talk about the challenges of identifying people, places, things, events, and Social Security numbers, and leave the rest to the students. Other professors may assign a project on parsing text and detecting anomalies or bound phrases. But most students emerging with a degree in computer science consign the challenge of entity extraction to the Miscellaneous file.
Entity extraction means processing text to identify, tag, and properly account for those elements that are the names of people, numbers, organizations, and locations, as well as expressions such as telephone numbers, among other items. An entity can consist of a single word like Cher or a bound sequence of words like White House. The challenge of figuring out names is a tough one for several reasons. Many names exist in richly varied forms: consider the naming conventions in street addresses in Madrid, Spain, or the name of the owner of a falafel shop in Tripoli.
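To make the task concrete, here is a minimal Python sketch of the kind of output an extractor produces. The sample sentence, the type labels, and the regex-based phone matcher are illustrative assumptions, not any vendor’s method:

```python
# Minimal sketch of what an entity extractor produces: text spans tagged
# with entity types. The sentence, the type labels, and the regex-based
# phone matcher are illustrative assumptions, not any vendor's method.
import re

TEXT = "Cher visited the White House and left the number 202-456-1111."

entities = []

# A single-word entity and a bound multi-word entity, found by lookup.
for name, etype in [("Cher", "PERSON"), ("White House", "ORGANIZATION")]:
    offset = TEXT.find(name)
    if offset != -1:
        entities.append((offset, etype, name))

# A pattern-based entity: a North American telephone number.
for match in re.finditer(r"\b\d{3}-\d{3}-\d{4}\b", TEXT):
    entities.append((match.start(), "PHONE_NUMBER", match.group()))

for offset, etype, span in sorted(entities):
    print(f"{offset:3d}  {etype:14s}{span}")
```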
Entities, as information retrieval experts have learned since the first DARPA conference on the subject in 1987, are quite important to certain types of content analysis. Digital Reasoning has been working for more than 11 years on entity extraction and related content processing problems. Entity-oriented analytics has become an important issue as companies deal with too much data, with the need to understand the meaning and not just the statistics of that data, and with the need to understand entities in context, which is critical to interpreting code terms and the like.
I want to walk through the six weaknesses of traditional entity extraction and contrast them with Digital Reasoning’s patented, fully automated method. Let’s look at the weaknesses.
1 Prior Knowledge
Traditional entity extraction systems assume that the system will “know” about the entities. This information is obtained via training or specialized knowledge bases. The idea is that the system first processes content similar to what it will see when fully operational. When the system locates an entity, or a human “helps” it locate one, the software “remembers” that entity. In effect, entity extraction assumes that the system either has a list of entities to identify and tag or that a human will interact with various parsing methods to “teach” the system about the entities. The obvious problem is that when a new entity appears and is mentioned only once, the system may fail to identify it.
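A minimal Python sketch of this weakness: a tagger driven by lists of already-known entities finds only what it has been given. The lists and test sentences below are invented for illustration:

```python
# Sketch of the prior-knowledge weakness: a tagger driven by lists of
# already-known entities finds only what it has been given. The lists
# and test sentences are invented for illustration.
KNOWN_PEOPLE = {"Cher", "Tim Estes"}
KNOWN_ORGS = {"White House", "DARPA"}

def tag(sentence: str) -> list[tuple[str, str]]:
    """Return (entity, type) pairs found by simple list lookup."""
    found = []
    for person in KNOWN_PEOPLE:
        if person in sentence:
            found.append((person, "PERSON"))
    for org in KNOWN_ORGS:
        if org in sentence:
            found.append((org, "ORGANIZATION"))
    return found

print(tag("Cher spoke at DARPA."))               # both entities found
print(tag("Acme Robotics was mentioned once."))  # new entity: nothing found
```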
2 Human Inputs
I have already mentioned the need for a human to interact with the system. The approach is widely used, even in the sophisticated systems associated with firms such as Hewlett Packard Autonomy and Microsoft Fast Search. The problem with relying on humans is a time and cost equation. As the volume of data to be processed goes up, more human time is needed to make sure the system is identifying and tagging correctly. In an era of data doubling every four months, the cost of coping with massive data flows makes human-intermediated entity identification impractical.
3 Slow Throughput
Most content processing systems talk about high performance, scalability, and massively parallel computing. The reality is that most of the subsystems required to manipulate content for the purpose of identifying, tagging, and performing other operations on entities are bottlenecks. What is the solution? Most vendors of entity extraction solutions push the problem back to the client. Most information technology managers solve performance problems by adding hardware to either an on premises or cloud-based solution. The problem is that adding hardware is at best a temporary fix. In the present era of big data, content volume will increase, while the appetite for adding hardware lessens in a business climate characterized by financial constraints. Not surprisingly, entity extraction systems are often “turned off” because the client cannot afford the infrastructure required to deal with the volume of data to be processed. A great system that is too expensive to run introduces its own flaws into the analytic process.
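A back-of-the-envelope Python sketch shows why. If volume doubles every four months (the figure cited above) and per-server throughput is fixed, the hardware bill doubles on the same schedule. The per-server throughput and cost figures are assumptions chosen only for illustration:

```python
# Back-of-the-envelope sketch of why "add hardware" is a temporary fix
# when data doubles every four months. Per-server throughput and server
# cost are assumptions chosen for illustration only.
DOUBLING_MONTHS = 4
DOCS_PER_SERVER_PER_MONTH = 1_000_000
SERVER_COST_DOLLARS = 10_000

volume = 1_000_000  # documents arriving per month today
for month in range(0, 25, DOUBLING_MONTHS):
    servers = -(-volume // DOCS_PER_SERVER_PER_MONTH)  # ceiling division
    print(f"month {month:2d}: {volume:>11,} docs/month -> "
          f"{servers:3d} servers (~${servers * SERVER_COST_DOLLARS:,})")
    volume *= 2
```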
Software Giants Race to Solve the Big Data Puzzle
September 23, 2011
Other software companies have also thrown their hats into the ring to come up with a potential solution to this issue. The Computing.co.uk article, Essential Guide to Big Data: Part One, states:
Microsoft, Oracle, SAP and Endeca are looking to sell enhanced database, analytics and business intelligence tools based on the big data concept, though the very definition of the term tends to be manipulated to play to individual product strengths in each case, meaning big data remains a moving target in many respects.
Sponsored by Pandia.com
Sinequa Dials In Siemens
September 16, 2011
With so much information available to companies, it is no surprise that they are dealing with search overload: too much, poorly organized information. As a result, a new industry has emerged offering clients solutions to better manage information. The article, Siemens Uses Sinequa Business Search to Find Synergies in Technology Projects of its Divisions Around the World, on Decidio, explains how one company has made use of these new search optimization services.
Siemens Corporate Technology announced they are using Sinequa Business Search to help optimize searching for technology projects. The article reports (translated):
Stefan Augustin, Principal Consultant and Project Manager for Global Information / Knowledge Management at Siemens Corporate Technology, lists other positive experiences from the first projects: a research platform that is flexible and highly reliable, with scalable performance; ready-to-use connectors to many data sources; and a high level of security that fully protects the confidentiality of documents.
Sinequa explains the problem businesses face with information overload in terms everyone understands: money. According to Sinequa’s website, employees spend an average of five hours a week sifting through data looking for usable information. That translates into $5,000 to $20,000 a year in wasted money, per employee. By offering customized, industry-specific search, Sinequa cuts that wasted time dramatically, saving money and boosting employee performance.
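The arithmetic behind that range is easy to reproduce. A short Python sketch, assuming a 48-week working year and a spread of loaded hourly rates (our assumptions, not Sinequa’s published figures):

```python
# Reproducing Sinequa's $5,000-to-$20,000 range: five hours a week of
# searching over a 48-week working year, at a spread of loaded hourly
# rates. The rates and the 48-week year are our assumptions.
HOURS_PER_WEEK = 5
WEEKS_PER_YEAR = 48

for hourly_rate in (20, 40, 80):
    wasted = HOURS_PER_WEEK * WEEKS_PER_YEAR * hourly_rate
    print(f"${hourly_rate}/hour -> ${wasted:,} wasted per employee per year")
```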
As more and more information becomes available from a variety of sources, companies of all sizes around the world are feeling the inadequacies of current commercial search engines. Demand for customized, industry-specific search optimization is exploding. As more tech companies announce successes, it will be no surprise to see the face of data search change to meet the growing demand.
Catherine Lamsfuss, September 16, 2011
Sponsored by Pandia.com, publishers of The New Landscape of Enterprise Search
Smartlogic Buys SchemaLogic: Consolidation Underway
September 15, 2011
Mergers have captured the attention of the media and for good reason. Deals which fuse two companies create new opportunities and can disrupt certain market sectors. For example, Hewlett Packard’s purchase of Autonomy has bulldozed the search landscape. Now Smartlogic has acquired SchemaLogic and is poised to have the same effect on the world of taxonomies, controlled vocabularies, and the hot business sector described as “tagging” or “metadata.”
As you know, Smartlogic has emerged as one of the leaders in content tagging, metadata, indexing, ontologies, and associated services. The company’s tag line is that its systems and methods deliver content intelligence solutions. Smartlogic supports Google search technology, open source search solutions such as Solr, and Microsoft SharePoint and Microsoft Fast Search. Smartlogic’s customers include UBS, Yell.com, Autodesk, the McClatchy Company, and many others.
With the acquisition of SchemaLogic, Smartlogic aims to become one of the leading companies, if not the leader, in the white-hot semantic content processing market. The addition of SchemaServer to the platform adds incremental functionality and extends solutions for customers. The merger adds more clients to Smartlogic’s current list of Fortune 1000 and global enterprise customers and positions the company as the leading provider of Content Intelligence software. Smartlogic chief executive Jeremy Bentley told Beyond Search:
Smartlogic has a reputation for providing innovative Content Intelligence solutions alongside an impeccable delivery record. We look forward to providing Grade A support to our new clients, and to broadening the appeal of Semaphore.
SchemaLogic was founded in 2003 by Breanna Anderson (chief technology officer), Andrei Ovchinnikov (an advisory board member and Russian martial arts expert with a love of taxonomy), and Trevor Traina (chairman and entrepreneur who sold the Compare.Net comparison shopping company to Microsoft in 1999). SchemaLogic launched its first product in November 2003. The company’s flagship product is SchemaServer. The executive lineup has changed since the company’s founding, but the focus on indexing and management of controlled term lists has remained.
A company can use the SchemaLogic products to undertake master metadata management for content destined for a search and retrieval system or a text analytics / business intelligence system. However, unlike fully automated tagging systems, SchemaLogic products can make use of available controlled term lists, knowledge bases, and dictionaries. The system includes an administrative interface and index management tools which permit the licensee to edit or link certain concepts. The idea is that SchemaServer (and MetaPoint, the SharePoint variant) provides a centralized repository which other enterprise applications can use as a source of key words and phrases. When properly resourced and configured, the SchemaLogic approach eliminates the Balkanization and inconsistency of indexing which characterize many organizations’ content processing systems.
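The design idea is simple enough to sketch in a few lines of Python. The TermRepository class and its synonym table below are hypothetical stand-ins, not SchemaServer’s actual API; they illustrate why one shared vocabulary source keeps tagging consistent across applications:

```python
# Hypothetical sketch of the centralized-vocabulary idea: enterprise
# applications ask one shared repository for the preferred form of a
# term instead of tagging inconsistently on their own. This class and
# its synonym table are illustrative, not SchemaServer's actual API.
class TermRepository:
    def __init__(self) -> None:
        # synonym -> preferred (controlled) term
        self._preferred = {
            "HP": "Hewlett-Packard",
            "Hewlett Packard": "Hewlett-Packard",
            "IBM Corp.": "International Business Machines",
        }

    def preferred_term(self, raw: str) -> str:
        """Return the controlled term, or the input if none is defined."""
        return self._preferred.get(raw, raw)

repo = TermRepository()
# Two applications tag the same entity; both get the same canonical form.
print(repo.preferred_term("HP"))               # Hewlett-Packard
print(repo.preferred_term("Hewlett Packard"))  # Hewlett-Packard
```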
Early in the company’s history, SchemaLogic focused on SharePoint. The firm added support for Linux and Unix. Today, when I think of SchemaLogic, I associate the company with Microsoft SharePoint. The MetaPoint system works well when one wants to improve the quality of SharePoint metadata. But can the system be used for eDiscovery and for applications where compliance guidelines require consistent application of terminology? Time will tell, particularly as the market for taxonomy systems continues to soften.
Three observations are warranted:
First, not since Business Objects’ acquisition of Inxight has a content processing deal had the potential to disrupt an essential and increasingly important market sector.
Second, with the combined client list and the complementary approach to semantic technology, Smartlogic is poised to move forward rapidly with value added content processing services. Work flow is one area where I expect to see significant market interest.
Third, smaller firms will now find that size does matter, particularly when offering products and services to Fortune 1000 firms.
Our view is that there will be further content-centric mergers and investments in the run-up to 2012. Attrition is becoming a feature of the search and content processing sector.
Stephen E Arnold, September 15, 2011
Sponsored by Pandia.com, publishers of The New Landscape of Enterprise Search
Powwownow Dons Autonomy IDOL
September 15, 2011
A leader in the teleconferencing industry, Powwownow, announced it is using Autonomy’s TeamSite as a content management tool. The article, Powwownow Selects Autonomy’s Content Management Tool, on Content Software Management (CSM), explains how this move will prove beneficial to the young company.
A UK-based company only seven years old, Powwownow took conference calling to a whole new level when it burst onto the market. Its premise is simple: provide a teleconferencing service with no fees, no scheduling hassles, and local phone rates. Genius!
As the company has grown rapidly since its humble beginnings, so have its technology needs. A search for a Web content management system began as the company’s requirements changed, and the final decision to go with Autonomy’s TeamSite came down to these factors:
By removing the need for manual processes and ensuring a personalized online experience for its customers, TeamSite will enable Powwownow to increase its conversions and sales revenues, said Autonomy. ‘Autonomy’s technology was the only one that could address the main goals of our WCM initiative: drive customer conversion rates, empower our marketing team and reduce involvement of IT. As a 24/7 service, we could not afford any downtime, and Autonomy’s implementation was a success in this regard,’ (Powwownow IT Director) Maguire said.
Beyond Search likes Autonomy’s simple mission: make computers sort through the millions of data items streaming into a company every day so that humans can be left to perform higher-level thinking tasks. It’s no surprise that the company is doing quite well in its efforts to provide information risk management, archiving, rich media management, and many other much-needed services.
Catherine Lamsfuss, September 15, 2011
Sponsored by Pandia.com, publishers of The New Landscape of Enterprise Search