Vertica and Cloud-Based Business Intelligence
May 15, 2008
The IDG news service reported on May 12, 2008, that Vertica Systems will offer business intelligence as a service. You can read the complete IDG story here. Please navigate to it promptly, since some IDG items become tough to locate a few days after they appear. The computing horsepower will be provided by Amazon. Vertica will use the EC2 (Elastic Compute Cloud) infrastructure introduced by Amazon in August 2006.
Vertica, another column-oriented database shop, sees an opportunity for hosted and software-as-a-service products. Smaller firms often lack the resources to install industrial-strength business intelligence systems on premises.
The pricing for the service begins at $2,000 per month for 500 gigabytes of data. You can read the Amazon Web Services catalog entry here.
In the meantime, Amazon has worked hard to build out its Web services. I’ve heard that the company has embraced Hadoop (an open source take on Google’s File System and MapReduce ideas) and Xen (the open source hypervisor). Amazon has experienced some technical hiccups but has recovered quickly.
Amazon’s putting significant effort into its Web services, and Vertica’s use of the EC2 service will be an interesting one to watch. Amazon’s cloud services have beaten Google and other firms to the punch, although one Google source pointed out to me that Google is able to learn from Amazon’s efforts. The implication is that Google can watch and wait until the market is “right” for Google to make a move. When it comes to infrastructure investments, Amazon’s spending lags behind Google’s, and Amazon has a leaner technical team. If Google enters this sector in a major way, Amazon’s technologists will have an opportunity to demonstrate their superiority to Google’s cloud-centric engineering.
I’m going to watch the Vertica service. If it succeeds, it may spark a strong run-up for Amazon. Then Vertica will have to make the math work. A typical Vertica on-premises installation costs about $125,000, so Vertica will have to make up the difference on volume, since the cloud service is likely to generate less revenue per customer. If support and customization costs rise, the math gets even trickier. Meanwhile, Google watches and learns.
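For the record, the back-of-the-envelope arithmetic looks like this. The figures are the ones cited above (the $2,000-per-month entry price and the roughly $125,000 on-premises installation); support, customization, and Amazon’s cut are ignored, so treat this as a rough sketch, not a forecast:

```python
# Rough comparison of hosted vs. on-premises Vertica revenue per customer.
# Uses only the two figures cited in the post; other costs are ignored.

MONTHLY_CLOUD_FEE = 2_000        # entry-level hosted price, USD per month
ON_PREMISES_LICENSE = 125_000    # typical on-premises installation, USD

months_to_match = ON_PREMISES_LICENSE / MONTHLY_CLOUD_FEE
annual_cloud_revenue = MONTHLY_CLOUD_FEE * 12

print(f"Months of hosted fees to equal one on-premises sale: {months_to_match:.1f}")
print(f"Hosted revenue per customer per year: ${annual_cloud_revenue:,}")
print(f"Hosted customers needed to match one license sale in a year: "
      f"{ON_PREMISES_LICENSE / annual_cloud_revenue:.1f}")
```

In round numbers, one hosted customer needs more than five years to generate what a single on-premises sale brings in up front, which is why the volume question matters.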
Stephen Arnold, May 14, 2008
Collective Intelligence Anthology Available
May 14, 2008
The ArnoldIT.com mascot admires the new collection of essays edited by Mark Tovey. Collective Intelligence: Creating a Prosperous World at Peace, published by the Earth Intelligence Network in Oakton, Virginia (ISBN-13: 978-0-9715661-6-3), contains more than 50 essays by analysts, consultants, and intelligence practitioners. You can obtain a copy from the publisher, Amazon, or your bookseller.
The ArnoldIT mascot completed reading the 600-page book with remarkable alacrity for a duck.
The collection of essays is likely to find many readers among those interested in the social phenomena of networks. Many of the essays, including the one I contributed, talk about information retrieval in our increasingly interconnected world.
This post provides a synopsis of my contribution, “Search: Panacea or Ploy? Can Collective Intelligence Improve Findability”, which I wrote shortly before completing Beyond Search: What to Do When Your Search System Doesn’t Work. My essay begins on page 375.
Social Search
The dominance of Google forces other vendors to look for a way over, under, around, or through its grip on Web search. The vendor landscape now offers search and content processing systems that arguably do a better job of manipulating XML (Extensible Markup Language) content, figuring out who knows whom (the social graph initiative), and extracting the “real” meaning of content (semantic search). There are more than 100 vendors whose technology offers, if one believes the marketing collateral and conference presentations, a way to squeeze more information from information.
Social search is the name given to an information retrieval system that incorporates one or more of these functions:
- Users can suggest useful sites. Examples: Delicious.com and StumbleUpon.com
- The system discovers relationships between and among processed documents and links. Examples: Powerset.com and Kartoo Visu
- The system analyzes information, extracts entities, and identifies individuals and their relationships. Examples: i2 Ltd (now part of ChoicePoint) and Cluuz.com
- The system monitors user behavior and uses those data to guide relevance, spidering, and other system functions. Examples: the public Web indexing companies (see the sketch after this list)
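To make the last item on that list concrete, here is a minimal sketch of one way click data could nudge relevance. The weights, the click-through blend, and the sample documents are my assumptions for illustration, not a description of any vendor’s actual ranking method:

```python
# Illustrative only: blend a document's base relevance score with observed
# click behavior. The weighting scheme and sample data are invented.

from dataclasses import dataclass

@dataclass
class Doc:
    url: str
    base_score: float   # text-match relevance from the core engine
    impressions: int    # times the document was shown in result lists
    clicks: int         # times users actually clicked it

def social_score(doc: Doc, click_weight: float = 0.3) -> float:
    """Mix the base score with the document's observed click-through rate."""
    ctr = doc.clicks / doc.impressions if doc.impressions else 0.0
    return (1 - click_weight) * doc.base_score + click_weight * ctr

docs = [
    Doc("example.org/a", base_score=0.72, impressions=500, clicks=40),
    Doc("example.org/b", base_score=0.65, impressions=500, clicks=180),
]
for d in sorted(docs, key=social_score, reverse=True):
    print(f"{d.url}: {social_score(d):.3f}")
```

The second document has a weaker text match but far more clicks, so it floats to the top, which is the whole point of folding user behavior into the system.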
There are other types of social functions, but these provide sufficient salt and pepper for this information side dish. The reason I say side dish is that social functions are not going to displace the traditional functions on which they are based. Social search has been available since i2 Ltd. introduced its workbench product to the intelligence community more than a decade ago. “Social” functions, then, are older techniques that are only now being added to the main diet in information retrieval.
Old Statistics and Cheap, Powerful Computers
What’s overlooked in the rush to find a Google “killer” is that the new companies are using some well-known technologies. For example, the inner workings of Autonomy’s “black box” are somewhat dependent on the work of a slightly unusual Englishman, Thomas Bayes. Mr. Bayes left the world a couple of centuries ago, but his math has been a staple in college statistics courses for many years. Deploying Bayesian techniques on a large scale is, therefore, not exactly a secret to the thousands of mathematicians who followed his proofs in pursuit of their baccalaureates.
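To show just how ordinary the underlying math is, here is a toy naive Bayes classifier of the sort those statistics students build. The training snippets and the two categories are invented for illustration; this is the textbook technique, not Autonomy’s implementation:

```python
# A toy naive Bayes text classifier: two-century-old math in a few lines.
# Training examples and categories are invented for illustration.

import math
from collections import Counter, defaultdict

training = [
    ("earnings revenue quarter profit", "business"),
    ("merger acquisition shares market", "business"),
    ("query index relevance ranking", "search"),
    ("crawler spider index metadata", "search"),
]

word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in training:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def classify(text: str) -> str:
    vocab = {w for counts in word_counts.values() for w in counts}
    scores = {}
    for label in class_counts:
        # log prior plus log likelihoods with add-one smoothing
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("quarterly profit and revenue news"))  # business
print(classify("improving index relevance"))          # search
```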
Commercial Intelligence: A Better Way to Do Competitive Intelligence
May 13, 2008
Business intelligence and competitive intelligence are “not really intelligence”, asserts Robert D. Steele, well-known advocate of open source information and managing director of OSS.Net. In an exclusive interview with Beyond Search, Mr. Steele says that commercial business intelligence “systems are edging toward failure. The systems aren’t very good, useful, or usable.”
The fix for the problems with today’s software-based approaches to intelligence is to mix human expertise with automated support. He says that a better approach:
…consists of requirements definition (understand the question in context, the desired outcome); collection management (know who knows), source discovery and validation (generally done by expert humans who have spent their life mastering the domain, at someone else’s expense); analysis, which can be aided by but does not necessarily require automated support; and compelling timely actionable presentation to the decision-maker.
You can read the full interview on the Interview section of the Beyond Search Web log site here.
Stephen Arnold, May 13, 2008
Former Clandestine Operative Says Automated Systems Not Good Enough
May 13, 2008
Editor’s Note: Robert Steele, former Marine Corps officer and intelligence operative, was one of the first intelligence professionals since World War II, if not the first, to question the relative value of secret sources and technologies in relation to open sources and technologies. Mr. Steele agreed to meet me near his office in suburban Washington, D.C. The full text of the interview appears below. After we spoke, Mr. Steele provided me with illustrations he referenced in our conversation. I have included these in the transcript at the points where Mr. Steele references them. You can read more about Mr. Steele at his Web site, OSS.Net.
How did you get interested in using information that’s readily available to anyone in a library, in newspapers, and online as a source of useful intelligence?
I went into the international spy program at CIA with a Master’s in International Relations, and knew quite a bit about citation analysis and primary research. What I was not expecting over the course of my clandestine career was the obsession with stealing secrets to the exclusion of all that could be known from open sources.
Robert D. Steele
The clandestine officers also refused to interact with the analysts—before leaving for my first overseas assignment, the Chief of Station took me to the analysis side of the house, and on my way there he said something along the lines of “these folks know nothing useful, and we tell them nothing.”
When the Marine Corps asked me to leave CIA to create the Marine Corps Intelligence Center in 1988, I promptly did what I thought the government wanted; that is, I spent $20 million on a codeword analysis center, including a Special Intelligence Communications (SPINTCOM) work station. I thought it would do everything except kill the terrorist.
Was I in for a shock. I had put a PC with Internet access in an isolated room, not connected to any government network. The PC had a modem. I was curious about online and bulletin board systems. In a short time, analysts were leaving their supercharged workstations to stand in line to use the PC. These professionals were looking for information that was not in the government system and not known to our officers in the field (including diplomats and commercial or defense attaches).
What a wake up call.
That is when I learned that expensive systems are only as good as their sources—narrow casting into the secret world made much of our multi-billion dollar technology virtually worthless. Analysts using the PC showed me that 80 to 90 percent of the information we needed could be obtained using the PC and public information, including direct calls to overt human experts. I also learned that useful information was available in 183 other languages that no one in the US Government can speak or understand. Even today, a large number of Washington officials don’t understand the intelligence value of open sources of information, including commercial imagery, foreign-language broadcasts that must be accessed locally, and gray literature, such as university yearbooks for a photo of a terrorist. Washington is completely out of touch with human experts who are not US citizens eligible for a secret clearance—the spies don’t want them unless they agree to commit treason, and the analysts are not allowed to talk to them by paranoid, ignorant security officials.
Almost every vendor asserts that their systems can “do” business or competitive intelligence. In your experience, is this accurate?
Look. BI and CI are not really intelligence.
BI or business intelligence is commonly used as a descriptor for what is nothing more than internal knowledge management, spiced up with a point-and-click graphics dashboard. Not only are most of these systems non-interoperable with everything else, they are as smart or as stupid as the digital data they can access.
The reality of information in most organizations is that most of what is really valuable is not digital. And, most CEOs have zero idea what intelligence (decision support) actually means.
CI or competitive intelligence focuses on competitors. What I practice, Commercial Intelligence, focuses on
- External information
- Collaborative work
- Knowledge management
- Organizational intelligence.
Commercial intelligence leverages what can be drawn from the human social networks interacting with an organization and from other sources of information. External information is not information about competitors. It includes such factors as the “true cost” of goods and next-generation “cradle to cradle” opportunities. You have to factor in the art and science of retaining Organizational Intelligence. I will send you a diagram that shows my view of this commercial intelligence space.
In my experience, today’s systems are edging toward failure. The systems aren’t very good, useful, or usable. As the Gartner Group recently said about Windows, it is untenable. I like Microsoft for its cash flow—they need to dump the legacy and launch an open source network with shared call centers and Blue Cube power processing.
Powerset Available
May 12, 2008
Navigate to Powerset.com and try out the much-publicized Web search system. Powerset is a semantic search system built with proprietary technology plus third-party components. The system differentiates itself with fact extraction (Factz, in Powerset jargon), direct links to definitions, and a summary / outline view. A big yellow sticky note says that Powerset is searching Wikipedia articles, but my test queries returned useful information in the results list in default mode; for example, the name of Tropes Zoom, a system I had heard about but never seen. A quick Google search allowed me to pinpoint Semantic Knowledge as a company with a technology of this name. I’m not sure Powerset envisioned my use of its system as a front end for Google, but that use jumped out at me. Check it out and let me know if you think it is better than Google, Hakia, or Exalead, systems that contain a dollop of semantic sauce. Hopefully the company will provide a larger content index, either by spidering the Web or via a metasearch approach like Vivisimo’s.
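As an aside, the “Factz” idea of pulling statements out of sentences can be faked with a crude pattern matcher. The sketch below is my own naive approximation for illustration; Powerset’s linguistic processing is far more sophisticated than a regular expression:

```python
# A crude subject-verb-object extractor to illustrate the "fact" idea.
# The verb list and pattern are invented; real systems use full parsing.

import re

VERBS = r"(acquired|founded|developed|released|owns)"
PATTERN = re.compile(
    rf"([A-Z][\w.]*(?:\s[A-Z][\w.]*)*)\s+{VERBS}\s+([A-Z][\w.]*(?:\s[A-Z][\w.]*)*)"
)

def extract_facts(sentence: str):
    """Return (subject, verb, object) tuples found by the naive pattern."""
    return PATTERN.findall(sentence)

print(extract_facts("Google acquired YouTube in 2006."))
# [('Google', 'acquired', 'YouTube')]
```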
Stephen Arnold, May 12, 2008
Kartoo’s Visu: Semantic Search Plus Themescape Visualization
May 11, 2008
In England in December 2007, I saw a brief demonstration of Kartoo.com’s “thematic map”, which was announced in 2005.
The genesis of the company dates to 1997 and relationships with large publishing groups, when Mr. Baleydier was working to make CD-ROMs easily searchable. Kartoo was founded in 2001 by Laurent and Nicholas Baleydier to provide a more advanced search interface. You can find out more about the company at Kartoo.net. Kartoo S.A. offers a no-charge metasearch Web system at Kartoo.com.
The original Kartoo service was one of the first to use dynamic graphics for Web search. Over the last few years, the interface has become more refined, but the system still presents links in the form of dynamic maps. Important Web sites appear as spheres, and the spheres are connected by lines. Here’s an example of the basic Kartoo interface as it looked on May 11, 2008, for the query “semantic search” run against the default of English Web sites. (The company also offers Ujiko.com, which is worth a quick look. The interface is a bit too abstract for me. You can try it here.)
The dark blue “ink blots” connect related Web sites, and the terms indicate the type of relationship between or among them. You can click on the interface to explore the result set and perform other functions. Hands-on exploration is the best way to learn the features; describing the mouse actions is not as effective as playing with the system.
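For readers who prefer data structures to pictures, the map boils down to a labeled graph: Web sites are nodes, and each “ink blot” is an edge tagged with the shared term. A minimal sketch, with sites and terms invented for illustration:

```python
# The Kartoo-style map reduced to a labeled graph. Sites and linking terms
# below are invented; the point is the structure, not the data.

from collections import defaultdict

edges = [
    ("semanticweb.example", "searchtools.example", "ontology"),
    ("searchtools.example", "nlpvendor.example", "entity extraction"),
    ("semanticweb.example", "nlpvendor.example", "RDF"),
]

graph = defaultdict(list)
for site_a, site_b, term in edges:
    graph[site_a].append((site_b, term))
    graph[site_b].append((site_a, term))

for site, neighbors in graph.items():
    links = ", ".join(f"{other} ({term})" for other, term in neighbors)
    print(f"{site} -> {links}")
```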
Another company, Datops SA, was among the first to use interesting graphic representations of results. I recall someone telling me that the spheres that once characterized Groxis.com’s results had been influenced by a French wizard. Whether justified or not, when I saw spheres and ink blots, I said to myself, “Ah, another vendor influenced by French interface design”. In talking with people who use visualizations to help their users understand a “results space”, I’ve had mixed feedback. Some people love impressionistic representations of results; others don’t. Decades ago I played a small role in the design of the F-15’s heads-up display interface. The one lesson I learned from that work was that, under pressure, interfaces that offer too many options can paralyze reaction time. In combat, that means the pilot could be killed trying to figure out what a graphic means. In other situations, where, say, a computational chemist is trying to make sense of 100,000 possible structures, a fine-grained visualization of the results may be appropriate.
Google: A Brace of Media Analyzer Inventions
May 11, 2008
On May 8, 2008, the USPTO, an outstanding organization with a stellar search system, published two Google patent applications. US2008/0107337 is “Methods and Systems for Analyzing Data in Media Material Having Layout” and US2008/0107338 is “Media Material Analysis of Continuing Article Portions”. You can download these here.
Both inventions, to which Google is the assignee, pertain to figuring out what’s important and what’s not on Web pages. Companies that scan hard copy and convert those images to machine-readable ASCII use some tricks but a great deal of brute force to figure out what’s information and what’s advertising or other dross.
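To give a feel for the brute force involved, here is a minimal rule-of-thumb sketch for classifying regions on a scanned page. The features and thresholds are invented for illustration and are not drawn from Google’s patent documents:

```python
# Toy rule-of-thumb classifier for regions on a scanned page.
# Features and thresholds are invented; real systems combine many more signals.

from dataclasses import dataclass

@dataclass
class Region:
    width: int           # pixels
    height: int          # pixels
    text_density: float  # fraction of the region covered by recognized text
    font_sizes: int      # count of distinct font sizes in the region

def classify(region: Region) -> str:
    area = region.width * region.height
    # Ads tend to be smaller, graphics-heavy blocks with mixed typography.
    if region.text_density < 0.2 and region.font_sizes >= 3:
        return "advertisement"
    # Large, text-dense, typographically uniform blocks look like article body.
    if area > 100_000 and region.text_density > 0.6 and region.font_sizes <= 2:
        return "article text"
    return "other"

print(classify(Region(800, 300, text_density=0.75, font_sizes=1)))  # article text
print(classify(Region(300, 250, text_density=0.10, font_sizes=4)))  # advertisement
```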
The inventions’ systems and methods can also be applied to other types of images converted to a machine-readable form; for example, a PDF that consists of the PDF wrapper and the TIFF image in the wrapper. I know that commercial database publishers are on top of Google’s innovations in content processing, so this is old news to the wizards at ProQuest, Reed Elsevier, and Thomson Reuters. But others in the less rarified atmosphere may find these disclosures interesting. Two patent documents stumbling through the USPTO’s hallowed halls are not an accident of fate.
Stephen Arnold, May 11, 2008
Google: Content Management for YouTube
May 9, 2008
My hobby is reading Google’s opaque, jargon-filled, and disjointed patent documents. If you are following the $1 billion legal dispute between the GOOG and the media dinosaur Viacom or you upload video to Google, you will want to take a gander at US 20080109369, “Content Management System” by eight Googlers.
The invention is a control panel that shifts certain content tasks to the person posting content to the Google system. There are references to bits of Google technical magic that make the system smarter than the clunky content management systems that most organizations use.
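As a thought experiment only (the patent text, not this sketch, defines the actual system), one can imagine the kind of declaration such a control panel would push onto the uploader. Every field name and policy option below is hypothetical:

```python
# Hypothetical illustration of an uploader-side content declaration.
# Field names and policy options are invented; see US 20080109369 for the
# actual claims.

from dataclasses import dataclass

@dataclass
class ContentDeclaration:
    uploader: str
    video_id: str
    owns_rights: bool        # uploader asserts ownership or a license
    territories: list[str]   # where the content may be shown
    policy: str              # e.g., "allow", "block", "monetize"

def validate(decl: ContentDeclaration) -> list[str]:
    """Return a list of problems; an empty list means the declaration passes."""
    problems = []
    if not decl.owns_rights:
        problems.append("rights not asserted; manual review required")
    if decl.policy not in {"allow", "block", "monetize"}:
        problems.append(f"unknown policy: {decl.policy}")
    if not decl.territories:
        problems.append("no territories specified")
    return problems

decl = ContentDeclaration("example_user", "vid123", True, ["US", "CA"], "monetize")
print(validate(decl) or "declaration accepted")
```

The burden-shifting point is visible even in this toy: the uploader, not Google, fills in the fields and owns the consequences of getting them wrong.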
In my opinion, this Google disclosure could shift the burden from Google to the person or software function posting content. You can download the document from the wonder system provided without charge by the US Patent & Trademark Office. I’m interested in your views of US 20080109369. The Viacom attorneys have undoubtedly gone over this invention with the legal acumen embodied in their sleek selves. I just read this stuff as I find it. This one’s worth a quick look if you are curious about one of Google’s systems for handling the more than one million video uploads pumped into the company every three or four weeks.
Keep in mind that the system and method in this patent document can be extended to other types of content. This invention could (note the could, please) make Google into a great big database publisher. For now, Google is just inventing, not doing, what the system and method describe. Patent applications aren’t products and services.
Stephen Arnold, May 9, 2008
Cluuz.com: Military Intelligence-Like Functions for Web Metasearch
May 8, 2008
One of my business associates in Canada sent me a link to an interesting search engine named Cluuz.com. The system–unlike the shy Powerset, a media darling developing a semantic search engine–is available for anyone to use. Navigate to Cluuz.com. Make sure you add the extra “u”, or you will be looking at a plain text page from the graphically restrained Clue Computing operation in cow country.
Cluuz.com takes results and applies semantic processes to them. Some of the company’s display options are a bit too sophisticated for my 64-year-young eyes, but I found the system quite useful. Let’s run through a basic search and take a cursory look at some of the features that I found interesting. Then I want to comment on the semantic search boom or boomlet (depending on how jaded you are), and conclude with several observations. In the last few days, the shrinking violets in the Big Name search vendors’ public relations department have reduced their flow of 30-something insights. Perhaps my comments about semantic search will “goose” them into squawking. I certainly hope so. Life’s no fun in rural Kentucky without well-groomed Ivy League wizards asserting their intellectual superiority in email speak.
A Query for Cluuz.com
Navigate to the Cluuz.com splash screen. Make certain that you have checked the option under the search box for “Charts”. We’ll look at the other options in a moment. Now enter the test query: Google +”programmable search engine”. Here’s my result for this query on May 7, 2008:
The system takes results from MSN (search.live.com) and Yahoo, processes them, and displays this map. Note that the system identifies important people and companies. The system correctly identifies the Google Forms service as related to the “programmable search engine”.
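The mechanics behind such a map can be approximated with a crude sketch: pull candidate names out of result snippets, then link any two results that share a name. The snippets and the capitalized-phrase heuristic below are my own stand-ins, not Cluuz.com’s method:

```python
# Crude approximation of entity linking over metasearch result snippets.
# The snippets are invented, and the capitalized-phrase heuristic is a
# stand-in for whatever extraction Cluuz.com actually performs.

import re
from itertools import combinations

snippets = {
    "r1.example": "Google Forms can be wired into a programmable search engine demo.",
    "r2.example": "The programmable search engine patents list Ramanathan Guha as an inventor.",
    "r3.example": "Ramanathan Guha and Google Forms both appear in the write-up.",
}

# Multi-word capitalized phrases serve as rough "entity" candidates.
entity_pattern = re.compile(r"\b(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)+)\b")

entities = {url: set(entity_pattern.findall(text)) for url, text in snippets.items()}

# Link any two results that mention the same entity.
for (url_a, ents_a), (url_b, ents_b) in combinations(entities.items(), 2):
    shared = ents_a & ents_b
    if shared:
        print(f"{url_a} <-> {url_b} via {', '.join(sorted(shared))}")
```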
The system offers other ways to view the result set. For example, you can look at the hits from the search engines to which the query is passed in a traditional laundry list. Other choices include a cluster display and a Flash display, which is, in my opinion, cluttered with sliders, controls, and options.
You can also enter a more complex query using the Cluuz.com advanced search page. In my tests, the system did a good job of dealing with specific Boolean queries. You can also set preferences, which may not be necessary for a metasearch-based approach to generating hits.
US Government Uses AdWords
May 6, 2008
By the time you read this, the estimable Financial Times will have renamed the file, moved it to a digital dungeon, and begun besieging you with advertisements. The headline that stopped me in my web-footed tracks is, “US Advertises on Google to Snare Surfers”. Click here for what I hope is the original FT link.
The idea is that traffic to a US government site–America.gov–needs to be goosed (no pun intended, dear logo). Do you think the government might use content? Do you think the US government might use backlinks from high-traffic Web sites? Do you think the government might use nifty Web 2.0 features? Keep in mind that this site’s tag line is, “Telling America’s story”.
The answer is, “No.” The US government bids for such zippy terms as terrorism. The person who clicks on an advertisement presumably gains an insight into the American government’s psyche.
The FT story said:
In recent months the US administration has quietly been running the advertisements for its America.gov site, which is intended to give foreign audiences the Washington take on US foreign policy, culture and society.
I am not doing any government work at this time. I hope someday to meet the consultant who came up with this idea. I will try to get this wizard to take me to lunch. I have a hunch this consultant made some money on this project.
Stephen Arnold, May 6, 2008