Taxonomy: Search’s Hula-Hoop®
February 8, 2008
I received several thoughtful comments on my Beyond Search Web log from well-known search and content processing experts (not the search engine optimization type or the MBA analyst species). These comments addressed the topic of taxonomies. One senior manager at a leading search and content processing firm referenced David Weinberger’s quite good book, Everything is Miscellaneous. My copy has gone missing, so join me in ordering a new one from Amazon. Taxonomy and taxonomies have attained fad status in behind-the-firewall search and content processing. Every vendor has to support taxonomies. Every licensee wants to “have” a taxonomy.
This is a screen shot of the Oracle Pressroom. Notice that a “taxonomy” is used to present information by category. The center panel presents hot links by topics with the number of documents shown for each category. The outside column features a tag cloud.
A “taxonomy” is a classification of things. Let me narrow my focus to behind-the-firewall content processing. In an organization, a taxonomy provides a conceptual framework that can be used to organize the organization’s information. Synonyms for taxonomy include classification, categorization, ontology, typing, and grouping. Each of these terms can be used with broader or narrower meanings, but for my purpose, we will assume each can be used interchangeably. In my experience, most vendors and consultants toss these terms around as interchangeable Lego blocks anyway.
A fad, as you know, is an interest that is followed for some period of time with intense enthusiasm. Think Elvis, bell bottoms, and speaking Starbucks’ coffee language.
A Small Acorn
A few years ago, a consultant approached me to write about indexing content inside an organization. This individual had embarked on a consulting career and needed information for her Web site. I dipped into my files, collected some useful information about the challenges corporate jargon presented, and added some definitions of search-related terms.
I did work for hire, so my client could reuse the information to suit specific needs. Imagine my pleasant surprise when I found my information recycled multiple times and used to justify a custom taxonomy for an enterprise. I was pleased to have become a catalyst for a boom in taxonomy seminars, newsgroups, and consulting businesses. One remarkable irony was that a person who had recycled the information I sold to consultant A thousands of miles away turned up as consultant B at a company in which I was an investor. I sat in a meeting and heard my own information delivered back to me as a way to orient me about classifying an organization’s information.
Big Oak
A taxonomy revolution had taken place, and I was only partially aware. A new industry had taken root, flowered, and spread like kudzu around me.
The interest in taxonomies continues to grow. After completing the descriptions of companies offering what I call rich content processing, I can report that organizations looking for taxonomy-centric systems have many choices. Of the 24 companies profiled in the Beyond Search study, all 24 “do” taxonomies. Obviously there are greater and lesser degrees of stringency. One company has a system that supports American National Standards Institute guidelines for controlled terms and taxonomies. Other companies “discover” categories on the fly. Between these two extremes there are numerous variations. One conclusion I drew after this exhausting analysis is that it is difficult to locate a system that can’t “do” taxonomies.
What’s Behind the Fad?
Let me consider briefly a question that I don’t tackle in Beyond Search: “Why the white-hot interest in taxonomies?”
Taxonomies have a long and distinguished history in library science, philosophy, and epistemology. For those of you who are a bit rusty, “epistemology” is the theory of knowledge. Taxonomies require a grasp, no matter how weak, on knowledge. No matter how clever, a person creating a taxonomy must figure out how to organize email, proposals, legal documents, and the other effluvia of organizational existence.
I think people have enough experience with key word search to realize its strengths and limitations. Key words — either controlled terms or free text — work wonderfully when I know what’s in an electronic collection, and I know the jargon or “secret words” to use to get the information I need.
Boolean logic (implicit or explicit) is not too useful when one is trying to find information in a typical corpus today. There’s no editorial policy at work. Anything the indexing subsystem is fed is tossed into an inverted index. This is the “miscellaneous” in David Weinberger’s book.
A taxonomy becomes a way to index content so the user can browse a series of headings and subheadings. Those headings and subheadings make it possible to see the forest, not the trees. Clever systems can take the category tags and marry them to a graphical interface. With hyperlinks, it is possible to follow one’s nose — what some vendors call exploratory search or search by serendipity.
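To make the idea concrete, here is a minimal sketch in Python. The categories and documents are invented, and no particular vendor’s method is implied; it only shows how category tags assigned at indexing time can drive the point-and-click, headings-with-counts view described above.

```python
# Minimal sketch: category tags drive a browsable headings-and-counts view.
# All categories and documents are invented for illustration.
from collections import defaultdict

# Hypothetical taxonomy: heading -> sub-headings
taxonomy = {
    "Finance": ["Accounting", "Budgeting"],
    "Marketing": ["Market research", "Advertising"],
}

# Hypothetical processed documents, each carrying the tags an indexer assigned
documents = [
    {"title": "FY08 budget memo", "tags": ["Finance", "Budgeting"]},
    {"title": "Focus group results", "tags": ["Marketing", "Market research"]},
    {"title": "Ad spend proposal", "tags": ["Marketing", "Advertising"]},
]

# Invert: category -> document titles, so a user can point and click
by_category = defaultdict(list)
for doc in documents:
    for tag in doc["tags"]:
        by_category[tag].append(doc["title"])

# Render the "forest, not the trees" view: headings, sub-headings, counts
for heading, subheadings in taxonomy.items():
    print(f"{heading} ({len(by_category[heading])})")
    for sub in subheadings:
        print(f"  {sub} ({len(by_category[sub])})")
```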
Taxonomy Benefits
A taxonomy, when properly implemented, yields several payoffs:
First, users like to point-and-click to discover information without having to craft a query. Believe me, most busy people in an organization don’t like trying to outfox the search box.
Second, the categories — even when hidden behind a naked search box interface — are intuitively obvious to a user. An accountant may (as I have seen) enter the term finance and then point-and-click through results. When I ask users if they know specific taxonomy terms, I hear, “What’s a taxonomy?” Intuitive search techniques should be a part of behind-the-firewall search and content processing systems.
Third, management is willing to invest in fine-tuning a taxonomy. Unlike a controlled vocabulary, a suggestion to add categories meets with surprisingly little resistance. I think the intuitive usefulness of cataloging and categorizing is obvious to the people who tell other people to do their searching for them.
Some Pitfalls
There are some pitfalls in the taxonomy game. The standard warnings are: “Don’t expect miracles when you categorize modest volumes of content” and “Be prepared for some meetings that are more like a graduate class in logic than a working session on how to deliver what the marketing department needs in a search system.” And so on.
On the whole, the investment in a system that automatically indexes is a wise one. It becomes ever wiser when the system can use knowledge bases, word lists, taxonomies, and other information inputs to index more accurately.
Keep in mind that “smart” systems can be right most of the time and then without warning run into a ditch. At some point, you will have to hunker down and do the hard thinking that a useful taxonomy requires. If you are not sure how to proceed, try to get your hands on the taxonomies that once were available from Convera. Oracle at one time offered vertical term lists. You can also Google for taxonomies. A little work will return some useful examples.
To wrap up, I am delighted that so many individuals and organizations have an interest in taxonomies — whether a fad or something epistemologically more satisfying. The content processing industry is maturing. If you want to see a taxonomy in action, check out:
HMV, powered by Dieselpoint
Oracle’s Pressroom, powered by Siderean Software’s system
US government portal powered by Vivisimo (Microsoft)
Stephen Arnold, February 8, 2008
Simple Math = Big Challenge: MSFT & YHOO
February 4, 2008
I have only a few sections of Beyond Search to wrap up. Instead of thinking about updating my description of Access Innovations’ MAIstro, I am distracted by jibber jabber about the Microsoft (NSDQ:MSFT) Yahoo (NSDQ:YHOO) tie up.
Where We Are
First, it’s an offer, isn’t it? Maybe a trial balloon? No cash and stock have changed hands as I write this in the wee hours of Monday, February 4, 2008. Yet, many are in a frenzy over a hostile takeover. Think about this word “hostile.” It means antagonistic, unfriendly, enemy. The reason for the bold move? Google, a company that has outfoxed Microserfs and Yahooligans for almost a decade.
The number of articles in my various alerts, RSS feeds, and emails is remarkable. Worldwide, a Microsoft – Yahoo marriage (even if it is helped along with a shotgun) ignites folks’ imagination. Neither Microsoft nor Yahoo will be able to recruit tech wizards, one pundit asserts. Innovation in Silicon Valley will be forever changed, posits another. Sigh.
Sorry. I’m not that excited. I’m interested, but I’m too old, too pragmatic, and too familiar with the vagaries of acquisitions to jump up and down.
Judging from some grousing from Yahooligans, some Yahoo professionals aren’t too keen about working for Microsoft. I have had a hint that some Microsoft wizards aren’t too excited about fiddling with Yahoo’s mind-numbing array of products, services, technologies, search systems, partnerships, and research initiatives.
I think the root concern is trying to figure out how to fit two large operations together, a 1 + 1 = 3 problem. For example, there’s Yahoo Mail and Hotmail Live; Yahoo Panama and Microsoft Ad Center; and Yahoo News and Microsoft’s news services, etc., etc. One little-considered consequence is that Microsoft may end up owning more search systems than any other company. That’s a technology can of worms worthy of a separate essay.
I will tell you who is excited, and, please, keep in mind that this is my opinion. And, once I express my view, I want to offer another very simple (probably too simple for an MBA wizard) math problem. I will end this essay with my now familiar observations. Let’s begin.
Who Benefits?
This is an easy question to answer, and you will probably think that I am stating the obvious. Bear with me because the answer explains why some at Microsoft may not be able to get the right prescription for their deal bifocals. Without the right eye glasses, it’s tough to discern some smaller environmental factors obscured in the billion dollar fusillade fired at Yahoo’s board of directors’ meeting.
- Shareholders who can make some money with the Microsoft offer. When there’s money to be made, concerns about technology, culture, and market opportunity are going to finish last. Most shareholders don’t think much beyond the answers to two questions: “How much did I make?” and “What are the tax implications?”
- Investment bankers who earn money three ways on a deal of this magnitude. There are, of course, other ways for those in the financial loop to make money, but I’m going to focus on the ones that keep these professionals in blue suits, not orange jump suits. [a] Commissions. Where there is churn, there is a commission. For many investment advisors, buying and selling equals a bigger payday. [b] Bonuses. The mechanics of an investment banker’s bonus are complex. After all, it is a banker dealing with a fellow banker. Mere mortals should steer clear. The idea is simple. Generate churn or a fee, and you get more bonus money. The first three months of a calendar year are bonus and job hopping time on Wall Street. Anyone who can get a piece of the action for a big deal gets cash. [c] Involvement in a big deal acts like a huge electromagnet for more deals. Once Microsoft “thought” of the acquisition, significant positive input about the upside of the deal pours into the potential acquirer.
- Consultants. Once a big deal is announced, the consultants leap into action. The buyer needs analyses, advice, and strategic counsel. The buyer’s minions need tactical advice to answer such questions as “How can we maximize our tax benefits?” and “How can we pay for this with cheap money?” The buyer becomes hungry for advisors of every species. Blue-chip outfits like Bain, Booz Allen & Hamilton, Boston Consulting Group, and McKinsey & Co. drool in eagerness to provide guidance on lofty strategy matters such as answering the questions “How can I maximize my pay-out?” and “What are the tax consequences of my windfall profit?” Tactical advisors from these firms can provide support on human resource issues and real estate leases, among other matters. In short, buyers throw money at “the problem” in order to be prepared to negotiate or find a better deal.
These three constituencies want the deal to go through. If Microsoft is the buyer, that’s fine. If another outfit with cash shows, that’s okay too. The deal now has a life of its own. Money talks. To get the money, these constituencies have no desire to help Microsoft “see” some of the gaps and canyons that must be traversed. Let’s turn to one practical matter and the aforementioned simple math. Testosterone and money — these are two ways to cloud perception and jazz logic.
More Simple Math
Let’s do a thought experiment, what some German philosophers call Gedankenexperiment. I am not talking about the proposed Microsoft – Yahoo deal, gentle attorneys.
Accordingly, we have two companies, Company Alpha and Company Beta; hereinafter, Company A(lpha) and Company B(eta). Neither is a real company, and neither should be construed as having any similarity to any company now in existence.
Company Alpha has a dominant position in a market and wants to gain a larger share of a newer, tangential market. Company A has a proven, well-tuned, aging business model. That business model is a variation on selling subscriptions and generating annuity income from renewals. Company A’s business model works this way. Company A offers a product and then, on a periodic basis, Company A makes a change to an existing product, assessing a fee for customers to get the “new” or “enhanced” version of the product (service).
The idea is that once a subscription base is in place, Company A can predict a certain amount of revenue from standing orders and new orders. Company A has an excellent, stable cash flow based on this well-crafted business model and periodic fee increases. Although there are environmental factors that put pressure on the proven business model, the customer base is large, and the business model continues to work in Company A’s traditional markets. Company A, aware of exogenous factors — for instance, the emergence of cloud computing and other non-subscription business models — has learned through trial and error that its subscription-based business model does not work in certain new markets. These new markets are potentially lucrative, representing “new” revenue and a threat to Company A’s existing revenue stream. Company A wants to acquire a company to increase its chances for success in the new and emerging markets. Company A’s goal is to [a] protect its existing revenue, [b] generate new revenue, and [c] prevent other companies from dominating the new market(s).
Company A has performed a rational market analysis. Company A’s management has determined that one company only — our Company B — represents a mechanism for achieving Company A’s goals. Company A, by definition, has performed its analyses through Company A’s “eye glasses”; that is, Company A’s proven business model and business culture. “Walking in another person’s moccasins” is easy to say and difficult, if not impossible, to do. Everyone views the world through his own experiential frame. Hence, Company A “sees” Company B as having characteristics, attributes, and capabilities that are, despite some acceptable risks, significant benefits to Company A. Having made this decision about the upside from buying Company B, the management of Company A becomes less able to accept alternative inputs, facts, information, perceptions, and opinions. Company A’s reasoning in its decision space is closed. Company A vivifies what William James called “a certain blindness.” The idea is that each person is “blind” in some way to reality that others can perceive.
The implications of “a certain blindness” in this hypothetical acquisition warrant further discussion:
Culture
Company A has a culture built around a business model that allows incremental product enhancements so that subscription revenue is generated. Company B has a business model built around acquisitions. Company A has a more or less homogeneous atmosphere engendered by the business model or what Company A calls the agenda. Company B is more like a loose federation of separate companies — what some MBAs might call a Ling Temco Vought framework. Each entity within Company B retains its own identity, enjoys wide scope of action, and preserves its own culture. “We do our own thing” characterizes these units of Company B. Company A, therefore, has several options to consider:
- Company A can leave Company B as it is. The plus is that not much will change Company B’s operations in the short term. The downside is that the technical problems will not be resolved.
- Company A can impose its culture on Company B. You don’t need me to tell you that this will go over like the former Soviet Union’s intervention in Poland in the late 1950s.
- Company A can try to make changes gradually. (This is a variation of the option in bullet 2 and will simply postpone rebellion.)
Technology
Company A has a different and relatively homogeneous technology base. Company B has a heterogeneous technology base. Maintaining multiple heterogeneous systems is, in general, more costly than maintaining a homogeneous one. Upon inspection, the technical staff needed to maintain these different systems has specialized to deal with particular technical problems in the heterogeneous environment. Technical people can learn new skills, but this takes time and adds cost. Company A has to find a way to streamline technical operations, reduce costs, and not waste time achieving rationalization. There are at least two ways to do this:
- Shift to a single platform, ideally Company A’s
- Retrain existing staff to have broader technical skills. With Company B’s staff able to perform more generalized work, Company A can reduce headcount at Company B, thus streamlining work processes and reducing cost.
Competitive Arena
The desirable new market for Company A has taken on the characteristics of what I call a “natural monopoly.” When I reflect on notable events in American business history, I note monopolistic behavior. Some monopolies were spawned by force of will; for example, JP Morgan and finance (this guy bailed out the US Treasury) and Andrew Carnegie and steel (this fellow thought of libraries for little people after pistol-whipping his competitors and antagonists).
Other monopolies — like Bell Telephone and your local electric company — came into being because some functions are more appropriately delivered by one organization. Water and Internet search / advertising, for instance, are subject to such economies of scale, quality of service, and standardization. In short, these may be “natural monopolies” due to numerous demand and cost forces.
In our hypothetical example, Company A wants to enter a market which is coalescing and which, based on my research, now appears to be forming into a “natural monopoly.” The nameless competitor at the center of that market seems to be following a trajectory similar to that of the original Bell Telephone – AT&T life cycle.
Company A’s race, then, is against time and money. Untoward delay at any point going forward with regard to leveraging Company B means coming in second, maybe a distant second or losing out on the new market.
Instead of owning Park Place (a desirable property in the Parker Brothers’ game Monopoly), Company A ends up with Baltic and Mediterranean Avenues (really lousy properties in the Parker Brothers’ game). If Company A doesn’t get Company B, Company A is trapped in its old, deteriorating business model.
If Company A does acquire Company B, Company A has to challenge the competitor. Company B already has a five-year track record of being a day late and a dollar short. Company A, therefore, has to do everything in its power to make the Company B deal work, which appears to be an all-or-nothing proposition.
Now the math: Action by Company A = unknown, variable, escalating costs.
I told you math geeks would not like this analysis. Company A is betting the farm against long odds. Here’s why:
First, the cultures are not amenable to staff reductions or technological efficiencies; that is, using software and automation, not people, while increasing revenues. Company A, regardless of the money invested, cannot be certain of success. Company B’s culture – business model duality is investment insensitive. In short, money won’t close this gap. Company A’s resistance to cannibalizing its old, though still functioning, business model will be significant. Company A’s own employees will resist watching their money and jobs sacrificed to a greater good.
Second, the competitive space is now being captured by the increasingly monopolistic competitor. Unchallenged for some period of time, the monopolistic competitor enjoys momentum and a significant lead in refining its own business model.
In the lingo of Wall Street, Company A can’t get enough “oxygen”; that is, revenue, despite its best efforts to rein in the market leader.
Observations
If we assume a kernel of truth in my hypothetical analysis, we can now apply this hypothetical discussion to the Microsoft – Yahoo deal.
First, Microsoft’s business model (not its technology) is the company’s strength. The business model is also its Achilles’ heel. Just as IBM’s mainframe-centric view of the world made its executives blind to Microsoft, now Microsoft can’t perceive today’s world from outside the Microsoft business model. The Microsoft business model is perhaps the most efficient subscription-based revenue generator in history. But that business model has not worked in the new markets Microsoft covets, so the Yahoo deal becomes the “obvious” play to Microsoft’s management. Its obviousness makes it difficult for Microsoft to see other options.
Second, the Microsoft business model is woven into the company’s culture. Cultures are ethnocentric. Ethnocentricity often manifests itself in conflict. Microsoft will have to make prescient, correct cultural decisions quickly and repeatedly. Microsoft’s culture, however, does not typically evidence excellent, rapid-fire decision-making.
Microsoft seems to be putting the company in a situation guaranteed to spark conflict within its own walls, between itself and Yahoo, and between Microsoft and Google. This is a three-front war. Even those with little exposure to military history can see that the costs and risks of a three-front conflict will be high, open-ended, and difficult to estimate.
The hostile bid itself suggests that Microsoft could not catch Google without Yahoo. The notion that Microsoft can catch Google with the acquisition requires tremendous confidence in Microsoft’s management. I think Microsoft can make the deal work, but execution must be flawless and favorable winds must push Microsoft along.
If Google continues to race forward, Microsoft has to spend more money to implement efficiencies more quickly. The calculus of catching a moving target can trigger a cost crisis. If costs go up too quickly, Microsoft must fall back on its proven business model. Taking a step backward when resolving the calculus of catching Google is not a net positive.
As you read this essay, you are wondering, “How can this doom and gloom be real?” The buzz about the deal is mostly positive. If you don’t believe me, call your broker and ask him how much your mutual fund will benefit from the MSFT – YHOO tie up.
I’ve spent some time around money types, and I can tell you making money is akin to blood in the water for sharks.
I’ve also been acquired and done the acquiring. Regardless of being the buyer or being the bought, tie ups are tricky. The larger the stakes, the trickier the tie ups become. When the tie up is designed to halt the Google juggernaut, the calculus of time – cost is hard.
Please recall that I’m not saying a Microsoft – Yahoo tie up cannot stop Google. I am saying that making the tie up work will be difficult.
Don’t agree? That’s okay. Use the comments to set me straight. I’m willing to listen and learn. Just don’t overlook my core points; namely, business models, cultures, and technologies. One final thought: don’t factor out the Google (NSDQ:GOOG).
Stephen Arnold, February 4, 2008
Lotsa Search at Yahoo!
February 3, 2008
Microsoft’s hostile takeover of Yahoo! did not surprise me. Rumors about Micro – hoo or Ya – soft have floated around for a couple of years. I want to steer clear of the newsy part of this takeover, ignore the share-pumping behind the idea that Mr. Murdoch will step in to buy Yahoo, and sidestep Yahoo’s 11th hour “we’re not sure we want to sell” Web log posting.
I prefer to do what might be called a “catalog of search engines,” a meaningless exercise roughly equivalent to Homer’s listing of ships in The Iliad. Scholars are still arguing about why he included the information and, centuries later, continue to figure out who these guys were and why such an odd collection of vessels was necessary. You may have a similar question about Yahoo’s search fleet after you peruse this short list of Yahoo “findability” systems:
- InQuira. This is the Yahoo natural language customer support system. InQuira was formed from three smaller search outfits that ran aground. InQuira seems stable, and it provides NLP systems for customer support functions. Try it. Navigate to Yahoo. Click Help and ask a question, for example, “How do I cancel my premium mail account?” Good luck, but you have an opportunity to work with an “intelligent” agent who won’t tell you how to cancel a for-fee Yahoo service. When I learned of this deal, I asked, “Why don’t you just use Inktomi’s engine for this?” I didn’t get an answer. I don’t feel too bad. Google treats me the same way.
- Inktomi. Yahoo bought this Internet indexing company in 2002. We used the Inktomi system for the original US government search service, FirstGov.gov (now USA.gov). The system worked reasonably well, but once in the Yahooligans’ hands, not much was done with it, and Inktomi was showing its age. In 2002, Google was motoring, just drawing even with Yahoo. Yahoo seemed indifferent or unaware that search had more potential than Yahoo’s portal approach.
- Stata Labs. When Gmail entered semi-permanent beta, it offered two key features. First, there was one gigabyte of storage and, second, you could search your mail. Yahoo couldn’t search email at all. The fix was to buy Stata Labs in 2004. When you use the Yahoo mail search function, the Stata system does the work. Again I asked, “Why not use one of your Yahoo search systems to search mail?” Again, no response.
- Fast Search & Transfer. Yahoo, through the acquisition of Overture, ended up with the AllTheWeb.com Web site. The spidering and search technology are operated by Fast Search & Transfer (the same outfit that Microsoft bought for $1.2 billion in January 2008). Yahoo trumpeted the “see results as you type feature” in 2007, maybe 2006. The idea was that as you key your query, the system shows you results matching what you have typed. I find this function distracting, but you may love it. Try it yourself here. I heard that Yahoo has outsourced some data center functions to Fast Search & Transfer, which, if true, contradicts some of the pundits who assert that Yahoo has its data center infrastructure well in hand. If so, why lean on Fast Search & Transfer?
- Overture. When Yahoo acquired Overture (the original pay-for-traffic service) in 2003, it got the ad service and the Overture search engine. Overture purchased AllTheWeb.com and ad technology from Fast Search & Transfer. When Yahoo bought Overture, Yahoo inherited Overture’s Sun Microsystems’ servers with some Linux boxes running a home brew fraud detection service, the original Overture search system, and the AllTheWeb.com site. Yahoo still uses the Overture search system when you look for key words to buy. You can try it here. (Note: Google was “inspired” by the Overture system, and paid about $1.2 billion to Yahoo to avoid a messy lawsuit about its “inspiration” prior to the Google IPO in 2004. Yahoo seemed happy with the money and did little to impede Google.)
- Delicious. Yahoo bought Delicious in 2005. Delicious came with its weird url and search engine. If you have tried it, you know that it can return results with some latency. When it does respond quickly, I find it difficult to locate Web sites that I have seen. As far as I know, the Delicious system still uses the original Delicious search engine. You can try it here.
- Flickr. Yahoo bought Flickr in 2005, another cog in its social, Web 2.0 thing. The Flickr search engine runs on MySQL. At one trade show, I heard that the Flickr infrastructure and its search system were a “problem”. Scaling was tough. Based on the sketchy information I have about Yahoo’s search strategy, Flickr search is essentially the same as it was when it was purchased and is in need of refurbishing.
- Mindset. Yahoo, like Google and Microsoft, has a research and development group. You can read about their work on the recently redesigned Web site here. If you want to try Mindset, navigate to Yahoo Research and slide the controls. I’ve run some tests, and I think that Mindset is better than the “regular” Yahoo search, but it seems unchanged over the last six or seven months.
I’m going to stop my listing of Yahoo’s search systems, although I could continue with the Personals search, Groups search, News search, and more. I may comment on AltaVista.com, another oar in Yahoo’s search vessel, but that’s a topic that requires more space than I have in this essay. And I won’t beat up on Yahoo Shopping search. If I were a Yahoo merchant, I would be hopping mad. I can’t figure out how to limit my query to just Yahoo merchants. The results pages are duplicative and no longer useful to me. Yahoo has 500 million “users” but Web statistics are mushy. Yahoo must be doing something right as it continues to drift with the breeze as a variant of America Online.
In my research for my studies and journal articles, I don’t recall coming across a discussion of Yahoo’s many different search systems. No one, it seems, has noticed that Yahoo lacks an integrated, coherent approach to search. I know I’m not the only person who has observed that Yahoo cannot mount a significant challenge to Google.
As Google’s most capable competitor, Yahoo stayed out of the race. But it baffles me that a sophisticated, hip, with-it Silicon Valley outfit like Yahoo collected different search systems the way my grandmother collected weird dwarf figurines. Like Yahoo, my grandmother never did much with her collection. I may have to conclude that Yahoo hasn’t done much with its collection of search systems either. The cost of licensing, maintaining, and upgrading a fleet of search systems is not trivial. What baffles me is why on earth couldn’t Yahoo index its own email? Why couldn’t Yahoo use one of its own search systems to index Delicious bookmarks and Flickr photos? Why does Yahoo have a historical track record of operating search systems in silos, thus making it difficult to rationalize costs and simplify technical problems?
Compared to Yahoo, Google has its destroyer ship shape — if you call squishy purple pillows, dinosaur bones, and a keen desire to hire every math geek with an IQ of 165 on the planet “ship shape”. But Yahoo is still looking for the wharf. As Google churned past Yahoo, Yahoo watched Google sail without headwinds to the horizon. Over the years, I’ve been in chit-chats with some Yahoo wizards. Let me share my impressions without using the wizards’ names:
- Yahoo believes that its generalized approach is correct even as Google made search the killer app of cloud computing. Yahoo’s very smart people seem to live in a different dimension
- Yahoo believes that its technology is superior to Google’s and Microsoft’s. When I asked about a Google innovation, Yahoo’s senior technologist told me that Yahoo had “surprises for Google.” I think the surprise was the hostile take over bid last week
- Yahoo sees its future in social, Web 2.0 services. To prove this, Yahoo hired economists and other social scientists. While Yahoo was recruiting, the company muffed the Facebook deal and let Yahoo 360 run aground. Yo, Yahoo, Google is inherently social. PageRank is based on human clicks and human-created Web pages. Google’s been social since Day One.
To bring this listing of Yahoo search triremes (ancient wooden war ships) to a close, I am not sure Microsoft, if it is able to acquire Yahoo, can integrate the fleet of search systems. I don’t think Mr. Murdoch can given the MySpace glitches. Fixing the flotilla of systems at Yahoo will be expensive and time consuming. The catch is that time is running out. Yahoo appears to me to be operating on pre-Internet time. Without major changes, Yahoo will be remembered for its many search systems, leaving pundits and academics to wonder where they came from and why. Maybe these investigators will use Google to find the answer? I know I would.
Stephen Arnold, February 3, 2008
Search Frustration: 1980 and 2008
February 2, 2008
I have received two telephone calls and several emails about user satisfaction with search. The people reaching out to me did not disagree that users are often frustrated with search systems. I think the contacts were really amplifications of how complex “getting search right” is.
Instead of falling back on bell curves, standard deviations, and more exotic ways to think about populations, let’s go back in time. I want to then jump back to the present, offer some general observations, and then conclude with several of my opinions expressed as “observations”. I don’t mind push back. My purpose is to set forth facts as I understand them and stimulate discussion.
I’m quite a fan of Thucydides. If you have dipped into his sometimes stream-of-consciousness approach to history, you know that after a few hundred pages the hapless protagonists and antagonists just keep repeating their mistakes. Finally, after decades of running around the hamster wheel, resolution is achieved by exhaustion.
My hope is that with regard to search we arrive at a solution without slumping into torpor.
The Past: 1980
A database named ABI / INFORM (pronounced as three separate letters ay-bee-eye followed by the word inform) was a great online success. Its salad days are gone, but for one brief shining moment, it was white hot.
The idea for ABI (abstracted business information) originated at a university business school, maybe Wisconsin but I can’t recall. It was purchased by my friend Dennis Auld and his partner Greg Payne. There was another fellow involved early on, but I can’t dredge his name up this morning.
The database summarized and indexed journals containing information about business and management. Human SMEs (subject matter experts) read each article and wrote a 125-word synopsis. The SMEs paid particular attention to making the abstract meaty; that is, a person could read the abstract and get the gist of the argument and garner the two or three key “facts” in the source article. (Systems today perform automatic summarization, so the SMEs are out of a job.)
ABI / INFORM was designed to allow a busy person to ingest the contents of a particular journal like the Harvard Business Review quickly, or to collect some abstracts on a topic such as ESOPs (Employee Stock Ownership Plans) and learn quickly what was in the “literature” (a fancy word for current management thinking and research on a subject).
Our SMEs would write their abstracts on special forms that looked a lot like a 5″ by 8″ note card (about the amount of text on a single IBM mainframe green screen input form). SMEs would also enter the name of the author or authors, the title of the article, the source journal, and the standard bibliographic data taught in the 7th grade.
SMEs would also consult a printed list of controlled terms. A sample of a controlled term list appears below. Today, these controlled term lists are often called knowledge bases. For anyone my age, a list of words is pretty much a list of words. Flashy terminology doesn’t always make points easier to understand, which will be a theme of this essay.
Early in the production cycle, the index and abstract for each article would be typed twice: once by an SME on a typewriter and again by a data entry operator into a dumb terminal. This type of information manufacturing reflected the crude, expensive systems available a quarter century ago. Once the data had been keyed into a computer system, it was in digital form, proofed, and sent via eight-track tape to a timesharing company. We generated revenue by distributing the ABI / INFORM records via Dialog Information Services, SDC Orbit, BRS, and other systems. (Perhaps I will go into more detail about these early online “players” in another post.) Our customers used the timesharing service to “search” ABI / INFORM. We split the money with the timesharing company and generally ended up with the short end of the stick.
Below is an example of the ABI / INFORM controlled vocabulary:
There were about 15,000 terms in the vocabulary. If you look closely, you will see that some terms are marked “rt” and “uf”. These are “related terms” and “use for” terms. The idea was that a person assigning index terms would be able to select a general term like “market shares” and see that the related terms “competition” and “market erosion” would provide pertinent information. The “uf” or “use for” reminded the indexer that “share of market” was not the preferred index term. Our vocabulary could also be used by a customer or user, whom we called a “searcher” in 1980.
A person searching for information in the ABI / INFORM file (database) of business abstracts could use these terms to locate precisely the information desired. You may have heard the terms precision and recall used by search engine and content processing vendors. The idea originated with the need to allow users (then called searchers) to narrow results; that is, make them more precise. There was also a need to allow a user (searcher) to get more results if the first result set contained too few hits or did not have the information the user (searcher) wanted.
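Here is a minimal sketch, using invented entries, of how the “uf” and “rt” relationships in a controlled vocabulary serve both precision (mapping whatever the searcher typed to the preferred term) and recall (suggesting related terms when the first result set is too small). It illustrates the idea, not the actual ABI / INFORM system.

```python
# Minimal sketch of "use for" (uf) and "related term" (rt) entries.
# The vocabulary below is invented; it mimics the structure described above.
vocabulary = {
    "Market shares": {
        "uf": ["Share of market"],                 # non-preferred variants
        "rt": ["Competition", "Market erosion"],   # terms that broaden a search
    },
}

# Map every non-preferred variant back to its preferred index term
use_for = {}
for preferred, entry in vocabulary.items():
    for variant in entry["uf"]:
        use_for[variant.lower()] = preferred

def normalize(term):
    """Return the preferred index term for whatever the searcher typed."""
    return use_for.get(term.lower(), term)

def broaden(term):
    """Suggest related terms when the first result set has too few hits."""
    preferred = normalize(term)
    return vocabulary.get(preferred, {}).get("rt", [])

print(normalize("Share of market"))   # -> Market shares
print(broaden("Share of market"))     # -> ['Competition', 'Market erosion']
```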
To address this problem, we created classification codes and assigned these to the ABI / INFORM records as well. As a point of fact, ABI / INFORM was one of the first, if not the first, commercial database to reindex every record in its database, manually assigning six to eight index terms and classification codes as part of a quality assurance project.
When we undertook this time-consuming and expensive job, we had to use SMEs. The business terminology proved to be so slippery that our primitive automatic indexing and search-and-replace programs introduced too many indexing red herrings. My early experience with machine-indexing and my having to turn financial cartwheels to pay for the manual rework has made me suspicious of vendors pushing automated systems, especially for business content. Business content indexing remains challenging, eclipsed only by processing email and Web log entries. Scientific, technical, and medical content is tricky but quite a bit less complicated than general business content. (Again, that’s a subject for another Web log posting.)
Our solution to broadening a query was to make it possible for the SME indexing business abstracts to use a numerical code to indicate a general area of business; for example, marketing, and then use specific values to indicate a slightly narrower sub-category. The idea was that the controlled vocabulary was precise and narrow and the classification codes were broader and sub-divided into useful sub-categories. A snippet of the ABI / INFORM classification codes appears below:
If you look at these entries for the classification code 7000 Marketing, you will see terms such as “sn”. That’s a scope note, and it tells the indexer and the user (searcher) specific information about the code. You also see the “cd”. That means “code description”. A “code description” provides specific guidance on when and how to use the classification code, in this case “7000 Marketing”.
Notice too that the code “7100 Market research” is a sub-category of 7000 Marketing. The idea is that while 7000 Marketing is broad and appropriate for general articles about marketing, the sub-category allows the indexer or user to identify articles about “Market research.” While “Market research” is broad, it is ideally in a middle ground between the very broad classification code 7000 Marketing and the very specific terminology of the controlled vocabulary. We also had controlled term lists for geography or what today is called “geo spatial coding”, document type codes, and other specialized index categories. These are important facets of the overall indexing scheme, but not germane to the point I want to make about user satisfaction with search and content processing systems.
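The classification codes work the other way: they give the searcher a broad-to-narrow handle on the collection. Here is a minimal sketch with invented records; the prefix match at the end is the same idea behind an old Dialog-style statement such as ss cc=71?.

```python
# Minimal sketch of hierarchical classification codes (invented records).
classification = {
    "7000": {"label": "Marketing", "sn": "General articles about marketing"},
    "7100": {"label": "Market research", "sn": "Surveys, focus groups, forecasting"},
}

records = [
    {"title": "Repositioning a soft drink brand", "cc": ["7000"]},
    {"title": "Designing a customer survey", "cc": ["7100"]},
]

def by_code_prefix(prefix):
    """Return titles whose classification codes start with the given prefix."""
    return [r["title"] for r in records
            if any(code.startswith(prefix) for code in r["cc"])]

print(by_code_prefix("71"))   # narrow: market research articles only
print(by_code_prefix("7"))    # broad: everything under 7000 Marketing
```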
Let’s step back. Humans created abstracts of journal articles. Humans then completed bibliographic entries for each selected article. Then an SME would index the abstracts, selecting the terms that, in their judgment and according to the editorial policy inherent in the controlled term lists, best described the article. These index terms became the building blocks of locating a specific article among hundreds of thousands or identifying a subset of all possible articles in ABI / INFORM directly on point to the topic on which the user wanted information.
The ABI / INFORM controlled vocabulary was used at commercial organizations to index internal documents or what we would today call “behind-the-firewall content.” One customer was IBM. Another was the Royal Bank of Canada. The need for a controlled vocabulary such as ABI / INFORM’s is rooted in the nature of business terminology. When business people speak, jargon creeps into almost every message. On top of that, new terms are coined for old concepts. For example, you don’t participate in a buzz group today. You participate in a focus group. Now you know why I am such a critic of the baloney used by search and content processing vendors. Making up words (neologisms) or misappropriating a word with a specific meaning (semantic, for example) and then gluing that word with another word with a reasonably clear meaning (processing, for example) creates the jargon semantic processing. Now I ask you, “Who knows what the heck that means?” I don’t, and that’s the core problem of business information. The language is slippery, fast moving, jargon-riddled, and fuzzy.
Appreciate that creating the ABI / INFORM controlled vocabulary, capturing the editorial policy in those lists, and then applying them consistently to what was then the world’s largest index to business and management thought was a big job. Everyone working on the project was exhausted after two years of researching, analyzing, and discussing. What made me particularly proud of the entire Courier-Journal team (organized by the time we finished into a separate database unit called Data Courier) was that library and information science courses used ABI / INFORM as a reference document. At Catholic University in Washington, DC, the entire vocabulary was used as a text book for an advanced information science class. Even today, ABI / INFORM’s controlled vocabulary stands as an example of:
- The complexity of creating useful, meaningful knowledge bases
- Proof that it is possible to index content so that it can be sliced and diced with few “false drops” or what we call today an “irrelevant hit”.
- Proof that a difficult domain such as business can be organized and made more accessible via good indexing.
Now here’s the kicker, actually a knife in the heart to me and the entire ABI / INFORM team. We did user satisfaction surveys on our customers before the reindexing job and then after the reindexing job. But our users (searchers) did not use the controlled terms. Users (searchers) keyed one or two terms, hit the Enter key, and used what the system spit out.
Before the work, two-thirds of the people we polled who were known users of ABI / INFORM said our indexing was unsatisfactory. After the work, two-thirds of the people we polled who were known users of ABI / INFORM said our indexing was unsatisfactory. In short, bad indexing sucked. And better indexing sucked. User behavior was responsible for the dissatisfaction, and even today, who dares tell a user (searcher) that he / she can’t search worth a darn?
I’ve been thinking about these two benchmark studies performed by the Courier-Journal every so often for 28 years. Here’s what I have concluded:
- Inherent in the search and retrieval business is frustration with finding the information a particular user needs. This is neither a flaw in the human nor a flaw in the indexing. Users come to a database looking for information. Most of the time — two thirds to be exact — the experience disappoints.
- Investing person years of effort in constructing an almost-perfect epistemological construct in the form of controlled vocabularies is a great intellectual exercise. It just doesn’t pay huge dividends. Users (searchers) flounder around and get “good enough” information which results in the general dissatisfaction with search.
- As long as humans are involved, it is unlikely that the satisfaction scores will improve dramatically. Users (searchers) don’t want to work hard to formulate queries or don’t know how to formulate queries that deliver what’s needed. Humans aren’t going to change at least in my lifetime or what’s left of it.
What’s this mean?
Simply stated, algorithmic processes and the use of sophisticated mathematical procedures will deliver better results.
The Present: 2008
In my new study Beyond Search, I have not included much history. The reason is that today most procurement teams looking to improve an existing search system or replace one system with another want to know what’s available and what works.
The vendors of search and content processing systems have mastered the basics of key word indexing. Many have integrated entity extraction and classification functions into their content processing engines. Some have developed processes that look at documents, paragraphs, sentences, and phrases for clues to the meaning of a document.
Armed with these metatags (what I call index terms), the vendors can display the content in point-and-click interfaces. A query returns a result list, and the system also displays Use For references or what vendors call facets, hooks, or adjacent terms. The naked “search box” is surrounded with “rich interfaces”.
You know what?
Survey the users and you will find two-thirds of the users dissatisfied with the system to some degree. Users overestimate their ability and expertise in finding information. Many managers are too lazy to dig into results to find the most germane information. Search has become a “good enough” process for most users.
Rigorous search is still practiced by specialists like pharmaceutical company researchers and lawyers paid to turn over every stone in hopes of getting the client off the legal hook. But for most online users in commercial organizations, search is not practiced with diligence and thoroughness.
In May 2007, I mentioned in a talk at an iBreakfast seminar that Google had an invention called “I’m feeling doubly lucky.” The idea is that Google can look at a user’s profile (compiled automatically by the Googleplex), monitor the user’s location and movement via a geo spatial function in the user’s mobile device, and automatically formulate a query to retrieve information that may be needed by the user. So, if the user is known to be a business traveler and the geo spatial data plot his course toward La Guardia Airport, then the Google system will push to the user’s phone information about which parking lot is available and whether the user’s flight is late. The key point is that the user doesn’t have to do anything but go on about his / her life. This is “I’m feeling doubly lucky” because it raises the convenience level of the “I’m feeling lucky” button on Google pages today. Press I’m feeling lucky and the system shows you the one best hit as defined by Google’s algorithmic factory. Some details of this invention appear in my September 2007 study, Google Version 2.0.
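To show the shape of the idea, and only the shape, here is a toy sketch in which the system rather than the user formulates the query from a stored profile and a current location. This is my speculative reading of the invention, not Google’s code; every name and value is invented.

```python
# Speculative illustration of implicit, profile-driven query formulation.
# The profile fields, rules, and query strings are all invented.
def doubly_lucky_query(profile, location):
    """Build a query from what the system already knows about the user."""
    if profile.get("role") == "business traveler" and "airport" in location.lower():
        flight = profile.get("next_flight", "")
        return f"{location} parking availability {flight} departure status"
    return f"news near {location}"

profile = {"role": "business traveler", "next_flight": "DL 1422"}
print(doubly_lucky_query(profile, "La Guardia Airport"))
# The single best hit for this query would then be pushed to the user's phone.
```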
I’m convinced that automatic, implicit searching is the direction that search must go. Bear in mind that I really believe in controlled vocabularies, carefully crafted queries, and comprehensive review of results lists. But I’m a realist. Systems have to do most of the work for a user. When users have to do the searches themselves or at least most of the work, their level of dissatisfaction will remain high. The dissatisfaction is not with the controlled vocabulary, the indexing, or the particular search system. The dissatisfaction is with the work associated with finding and using the information. I think that most users are happy with the first page or first two or three results. These are good enough or at least assuage the user’s conscience sufficiently to make a decision.
The future, therefore, is going to be dominated by systems that automate, analyze, and predict what the mythical “average” user wants. These results will then be automatically refined based on what the system knows about a particular user’s wants and needs. The user profile becomes the “narrowing” function for a necessarily broad set of results.
Systems can automatically “push” information to users or at least keep it in a cache ready for near-zero latency delivery. In an enterprise, search must be hooked into work flow. The searches must be run for the user and the results displayed to the user. If not fully automatic, the user need only click a hot link and the needed information is displayed. A user can override an automatic system, but I’m not sure most users would do it or care if the override were like a knob on a hotel’s air conditioner. You feel better turning the knob. You feel out of control if you can’t turn the knob.
Observations
Let me offer several observations after this journey back in time and a look at the future of search and content processing. If you are easily upset, grab your antacid, because here we go:
- The razzle-dazzle about taxonomies, ontologies, and company-specific controlled term lists hides the fact that specific terms have to be identified and used to automatically index documents and information objects found in behind-the-firewall search systems. Today, these terms can be generated by processing a representative sample of existing documents produced by the organization; a minimal sketch of this kind of term-list generation appears after this list. The key is a good-enough term list, not doing what was done 25 years ago. Keep in mind the phrase “good enough.” There are companies that offer software systems that can make this list generation easier. You can read about some vendors in Beyond Search, or you can do a search on Google, Live.com, or Yahoo.
- Users will never be satisfied. So before you dump your existing search system because of user dissatisfaction, you may want to get some other ammunition, preferably cost and uptime data. “Opinion” data are almost useless because no system will test better than another in my experience.
- Don’t believe the business jargon thrown at you by vendors. Inherent in business itself is a tendency to create a foggy understanding. I think the tendency to throw baloney has been around since the first caveman offered to trade a super-sharp flint for a tasty banana. The flint is not sharp; it’s like a Gillette four-track razor. The banana is not just good; it is mouth-watering, by implication a great banana. You have to invest time, effort, energy, and money in figuring out which search or content processing system is appropriate for your organization. This means head-to-head bake-offs. Few do this, and the results are clear. Most people are unhappy with their vendor, with search, and with the “information problem”.
- Background processes, agent-based automatic searching, and mechanisms that watch what your information needs and actions are will make search better. You could enter ss cc=71? AND ud=9999 to get recent material about market research, but most people don’t and won’t.
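As promised in the first bullet, here is a minimal sketch of “good enough” term-list generation: count frequent words and word pairs in a small, invented sample of an organization’s documents. Commercial systems add entity extraction, statistics, and linguistics; the basic move, though, is this one.

```python
# Minimal sketch: derive candidate index terms from a sample of documents.
# The sample text and stopword list are invented for illustration.
import re
from collections import Counter

sample_docs = [
    "Quarterly market research shows our market share eroding in Europe.",
    "The market research team will run two focus groups next quarter.",
]

stopwords = {"the", "our", "in", "will", "two", "next", "shows"}

counts = Counter()
for doc in sample_docs:
    words = re.findall(r"[a-z]+", doc.lower())
    # count single words and adjacent word pairs as candidate terms
    counts.update(w for w in words if w not in stopwords)
    counts.update(f"{a} {b}" for a, b in zip(words, words[1:])
                  if a not in stopwords and b not in stopwords)

# the most frequent candidates become the starting point for a term list
print(counts.most_common(5))
```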
In closing, keep these observations in mind when trying to figure out what vendors are really squabbling about. I’m not sure they themselves know. When you listen to a sales pitch, are the vendors saying the same thing? The answer is, “Yes.” You have to rise to the occasion and figure out the differences between systems. I guarantee you the vendors don’t know, and if they do know, they sure won’t tell you.
Stephen Arnold, February 2, 2008
Transformation: An Emerging “Hot” Niche
January 25, 2008
Transformation is a five-dollar word that means changing a file from one format to another. The trade magazines and professional writers often use data integration or normalization to refer to what amounts to taking a Word 97 document with a Dot DOC extension and turning it into a structured document in XML. These big words and phrases refer to a significant gotcha in behind-the-firewall search, content processing, and plain old moving information from one system to another.
Here’s a simple example of the headaches associated with what should be a seamless, invisible process after a half century of computing. The story:
You buy a new computer. Maybe a Windows laptop or a new Mac. You load a copy of Office 2007, write a proposal, save the file, and attach it to an email that says, “I’ve drafted the proposal we have to submit tomorrow before 10 am.” You send the email and go out with some friends.
In the midst of a friendly discussion about the merits of US democratic presidential contenders, your mobile rings. You hear your boss saying over the background noise, “You sent me a file I can’t open. I need the file. Where are you? In a bar? Do you have your computer so you can resend the file? No? Just get it done now!” Click here to read what ITWorld has to say on this subject. There is also some user vitriol over the Word-to-Word compatibility hassle itself here. A work around from Tech Addict is here.
Another scenario is to have a powerful new content processing system that churns through, according to the vendor’s technical specification, “more than 200 common file types.” You set up the content processing gizmo, aim it at the marketing department’s server, and click “Index.” You go home. When you arrive the next morning at 8 am, you find that the 60,000 documents in the folders you wanted indexed have become an index of 30,000 documents. Where are the other 30,000 documents? After a bit of fiddling, you discover the exception log and find that half of the documents you wanted indexed were not processed. You look up the error code and learn that it means, “File type not supported.”
The culprit is the inability of one system to recognize and process a file. The reasons for the exceptions are many and often subtle. Let’s troubleshoot the first problem, the boss’s inability to open a Word 2007 file sent as an attachment to an email.
The problem is that the recipient is using an older version of Word. The sender saved the file in the most recent Word’s version of XML. You can recognize these files by their extension Dot DOCX. What the sender should have done is save the proposal as [a] a Dot DOC file in an older “flavor” of Word’s DOC format; [b] a file in the now long-in-the-tooth RTF (rich text format); or [c] a file in Dot TXT (ASCII) format. The fix is for the sender to resend the file in a format the recipient can view. But that one file can cost a person credibility points or the company a contract.
The second scenario is more complicated. The marketing department’s server had a combination of Word files, Adobe Portable Document Format files with Dot PDF extensions, some Adobe InDesign files, some Quark Express files, some Framemaker files, and some database files produced on a system no one knows much about except that the files came from a system no longer used by marketing. A bit of manual exploration revealed that the Adobe PDF files were password protected, so the content processing system rejected them. The content processing system lacked import filters to open the proprietary page layout and publishing program files, so it rejected them. The mysterious files from the disused system were data dumps from an IBM CICS system. The content processing system opened them, found them unreadable, and logged them as exceptions as well.
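Here is a minimal sketch of why an indexing run can quietly halve a collection. The file names, the list of supported types, and the rejection reasons are all invented; the point is only that anything the import filters cannot open lands in the exception log instead of the index.

```python
# Minimal sketch: unsupported or locked files become exceptions, not index entries.
SUPPORTED = {".doc", ".txt", ".rtf", ".xml"}   # invented filter list

files = [
    ("proposal.docx", False),       # newer format this filter set cannot read
    ("pricing.pdf", True),          # password protected
    ("datasheet.txt", False),
    ("legacy_dump.ebcdic", False),  # old mainframe export
]

indexed, exceptions = [], []
for name, locked in files:
    ext = "." + name.rsplit(".", 1)[-1]
    if locked:
        exceptions.append((name, "password protected"))
    elif ext not in SUPPORTED:
        exceptions.append((name, "file type not supported"))
    else:
        indexed.append(name)

print(f"indexed {len(indexed)} of {len(files)} files")
for name, reason in exceptions:
    print(f"exception: {name}: {reason}")
```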
Now the nettles, painful nettles:
First, fixing the problem with any one file is disruptive but usually doable. The reputation damage done may or may not be repaired. At the very least, the sender’s evening was ruined; the high-powered vice president was with a gaggle of upper crust types arguing about an election’s impact on trust funds. To “fix” the problem, she had to redo her work. It was time consuming and annoying to leave her friends. The recipient — a senior VP — had to jiggle his plans in order to meet the 10 am deadline. Instead of chilling with The Simpsons, he had to dive into the proposal and shape the numbers under the pressure of the looming deadline.
We can now appreciate the 30,000 file problem. It is a very big problem. There’s probably no way to get the passwords to open some of the PDFs, so the PDFs’ content may remain unreadable. The weird publishing formats have to be opened in the applications that created them and then exported in a file format the content processing system understands. This is a tricky problem, maybe another Web log posting. An alternative is to print out hard copies of the files, scan them, use optical character recognition software to create ASCII versions, and then feed the ASCII versions of the files to the content processing system. (Note: some vendors make paper-to-ASCII systems to handle this type of problem.) Those IBM CICS files can be recovered, but an outside vendor may be needed if the system that produced the files is no longer available in house. When the costs are added up, these 30,000 files can represent hundreds of hours of tedious work. Figure $60 per hour and a week’s work if everything goes smoothly, and you can estimate the minimum budget “hit.” No one knows the final cost because transformation is dicey. Cost naivety is the reason my blood pressure spikes when a vendor asserts, “Our system will index all the information in your organization.” That’s baloney. You don’t know what will or won’t be indexed unless you perform a thorough inventory of files and their types and then run tests on a sample of each document type. That just doesn’t happen very often in my experience.
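To make the budget arithmetic concrete, here is a back-of-the-envelope sketch using the $60 per hour figure above. The one-minute-per-file estimate is my own assumption, purely illustrative.

```python
# Back-of-the-envelope math for the 30,000-file problem, using the $60/hour rate above.
# The per-file handling time is an assumption for illustration; real jobs vary wildly.
RATE = 60                     # dollars per hour, from the example in the post
minimum = 40 * RATE           # "a week's work if everything goes smoothly"
per_file_minutes = 1          # assumed average touch time per rejected file
likely = 30_000 * per_file_minutes / 60 * RATE
print(f"Best case: ${minimum:,}; at one minute per file: ${likely:,.0f}")
# Best case: $2,400; at one minute per file: $30,000 -- before anything goes wrong.
```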
Now you know what transformation is. It is a formal process of converting lead into content gold.
One Google wizard — whose name I will withhold so Google’s legions of super-attorneys don’t flock to rural Kentucky to get the sheriff to lock me up — estimated that up to 30 percent of information technology budgets is consumed by transformation. So for a certain chicken company’s $17 million IT budget, the transformation bill could be in the $5 to $6 million range. That translates to selling a heck of a lot of fried chicken. Let’s assume the wizard is wrong by a factor of two. This means that $2 to $3 million is gnawed by transformation.
As organizations generate and absorb more digital information, what happens to transformation costs? The costs will go up. Whether the Google wizard is right or wrong, transformation is an issue that needs experienced hands minding the store.
The trigger for these two examples is a news item that the former president of Fast Search & Transfer, Ali Riaz, has started a new search company. Its USP (unique selling proposition) is data integration plus search and content processing. You can read Information Week’s take on this new company here.
In Beyond Search, I discuss a number of companies and their ability to transform and integrate data. If you haven’t experienced the thrill of a transformation job, a data integration project, or a structured data normalization task — you will. Transformation is going to be a hot niche for the next few years.
Understanding of what can be done with existing digital information is, in general, wide and shallow. Transformation demands narrow and deep understanding of a number of esoteric and almost insanely diabolical issues. Let me identify three from my own personal experience, learned at the street academy called Our Lady of Saint Transformation.
First, each publishing system has its own peculiarities about files produced by different versions of itself. InDesign 1.0 and 2.0 cannot open the most recent version’s files. There’s a workaround, but unless you are “into” InDesign, you have to climb this learning curve, and fast. I’m not picking on Adobe. The same intra-program incompatibilities plague Quark, PageMaker, the moribund Ventura, Framemaker, and some high-end professional publishing systems.
Second, data files spit out by mainframe systems can be fun for a 20-something. There are some interesting data formats still in daily use. EBCDIC, or Extended Binary-Coded Decimal Interchange Code, is something some readers can learn to love. It is either that or figuring out how to fire up an IBM mainframe, reinstall the application (good luck on that one, 20-somethings), restore the data from DASD or flat-file backup tapes (another fun task for a recent computer science grad), and then output something the zippy new search or content processing system can convert in a meaningful way. (Note: “meaningful way” is important because when a filter gets confused, it produces some interesting metadata. Some glitches can require you to reindex the content if your index restore won’t work.)
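For the adventurous 20-something, here is a minimal Python sketch that decodes a simple EBCDIC dump into readable text. The file name and the 80-byte record width are assumptions; anything with packed-decimal or binary fields needs a real copybook-driven conversion tool, not a snippet like this.

```python
# Minimal sketch: decode an EBCDIC (code page 037) mainframe dump into Unicode text.
# Assumes simple fixed-width character records; packed-decimal and binary fields
# need a copybook-aware tool, not a one-liner like this.
RECORD_LENGTH = 80  # assumed record width for illustration

with open("cics_dump.dat", "rb") as f:   # hypothetical file name
    raw = f.read()

records = [
    raw[i:i + RECORD_LENGTH].decode("cp037").rstrip()
    for i in range(0, len(raw), RECORD_LENGTH)
]
print("\n".join(records))
```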
Third, the Adobe PDFs with their two layers of security can be especially interesting. If you have one level of password, you can open the file and maybe print it and copy some content from it. Or not. If not, you print the PDFs (if printing has not been disabled) and go through the OCR-to-ASCII drill. In my opinion, PDFs are like a digital albatross. These birds hang around one’s neck. Your colleagues want to “search” for the PDFs’ content in their behind-the-firewall system. When I ask the marketing department to produce the needed passwords, I often hear something discomforting. So it is no surprise to learn that some system users are not too happy.
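A pre-flight triage pass can at least tell you which PDFs will bounce before the indexing run starts. Here is a minimal sketch using the third-party pypdf package; the folder name is hypothetical, and some protection schemes behave differently, so treat this as illustrative rather than definitive.

```python
# Pre-flight check: flag password-protected PDFs before they land in the exception log.
# Uses the third-party pypdf package (pip install pypdf); illustrative only.
from pathlib import Path
from pypdf import PdfReader

def triage_pdfs(folder: str) -> None:
    for path in Path(folder).glob("*.pdf"):
        try:
            reader = PdfReader(str(path))
            status = "locked" if reader.is_encrypted else "ok"
        except Exception as exc:            # damaged or mislabeled file
            status = f"unreadable ({exc})"
        print(f"{path.name}: {status}")

triage_pdfs("marketing_server")  # hypothetical folder name
```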
You may find this post disheartening.
No!
This post is chock full of really good news. It makes clear that companies in the business of transformation are going to have more customers in 2008 and 2009. It’s good news for off-shore conversion shops. Companies that have potent transformation tools are going to have a growing list of prospects. Young college grads get more chances to learn the mainframe’s idiosyncrasies.
The only negative in this rosy scenario is for the individual who:
- Fails to audit the file types and the amount of content in those file types (a minimal audit sketch appears after this list)
- Skips determining which content must be transformed before the new system is activated
- Ignores the budget implications of transformation
- Assumes that 200 or 300 filters will do the job
- Does not understand the implications behind a vendor’s statement along these lines: “Our engineers can create a custom filter for you if you don’t have time to do that scripting yourself.”
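Here is a minimal sketch of the audit the first bullet calls for: walk a file share, tally extensions, and flag anything outside an assumed list of supported filters. The share path and the filter list are placeholders, not any vendor’s actual capability list.

```python
# Minimal file-type audit: walk a share, count files by extension, and size the
# transformation job before the indexing system runs. Paths and the filter list
# are assumptions for illustration.
from collections import Counter
from pathlib import Path

SUPPORTED = {".doc", ".docx", ".txt", ".rtf", ".pdf", ".htm", ".html"}  # assumed filter list

def audit(root: str) -> None:
    counts = Counter(p.suffix.lower() for p in Path(root).rglob("*") if p.is_file())
    for ext, n in counts.most_common():
        flag = "" if ext in SUPPORTED else "  <-- needs transformation"
        print(f"{ext or '(no extension)'}: {n}{flag}")

audit("//marketing-server/share")  # hypothetical UNC path
```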
One final point: those 200 or more file types. Vendors talk about them with gusto. Check to see whether the vendor is licensing filters from a third party. In certain situations, the included file type filters don’t support some of the more recent applications’ file formats. Other vendors “roll their own” filters. But filters can vary in efficacy because different people write them at different times with different capabilities. Try as they might, vendors can’t squash some of the filter nits and bugs. When you do some investigating, you may be able to substantiate my data, which suggest filters work on about two thirds of the files you feed into the search or content processing system. Your investigation may prove my data incorrect. No problem. When you are processing 250,000 documents, the exception file becomes chunky even at the system’s two to three percent rejection rate; that is 5,000 to 7,500 files. A thirty percent rate, roughly 75,000 files, can be a show stopper.
Stephen E. Arnold, January 25, 2008
Two Visions of the Future from the U.K.
January 17, 2008
Two different news items offered insights about the future of online. My focus is the limitations of key word search. I downloaded both articles, I must admit, eager to see whether my research would be disproved or augmented.
Whitebread
The first report appeared on January 14, 2008, in the (London) Times online in a news story “White Bread for Young Minds, Says University Professor.” In the intervening 72 hours, numerous comments appeared. The catch phrase is the coinage of Tara Brabazon, professor of Media Studies at the University of Brighton. She allegedly prohibits her students from using Google for research. The metaphor memorably captures a statement attributed to her in the Times’s article: “Google is filling, but it does not necessarily offer nutritional content.”
The argument strikes a chord with me because [a] I am a dinosaur, preferring warm thoughts about “the way it was” as the snow of time accretes on my shoulders; [b] schools are perceived to be in decline because it seems that some young people don’t read, ignore newspapers except for the sporty pictures that enliven gray pages of newsprint, and can’t do mathematics reliably at take-away shops; and [c] I respond to the charm of a “sky is falling” argument.
Ms. Brabazon’s argument is solid. Libraries seem to be morphing into Starbuck’s with more free media on offer. Google–the icon of “I’m feeling lucky” research–allows almost anyone to locate information on a topic regardless of its obscurity or commonness. I find myself flipping my dinosaurian tail out of the way to get the telephone number of the local tire shop, check the weather instead of looking out my window, and converting worthless dollars into high-value pounds. Why remember? Google or Live.com or Yahoo are there to do the heavy lifting for me.
Educators are in the business of transmitting certain skills to students. When digital technology seeps into the process, the hegemony begins to erode, so the argument goes. Ms. Brabazon joins Neil Postman (Amusing Ourselves to Death: Public Discourse in the Age of Show Business, 1985) and more recently Andrew Keen (The Cult of the Amateur, 2007), among others, in documenting the emergence of what I call the “inattention economy.”
I don’t like the loss of what weird expertise I possessed that allowed me to get good grades the old-fashioned way, but it’s reality. The notion that Google is more than an online service is interesting. I have argued in my two Google studies that Google is indeed much more than a Web search system growing fat on advertisers’ money. My research reveals little about Google as a corrosive effect on a teacher’s ability to get students to do their work using a range of research tools. Who wouldn’t use an online service to locate a journal article or book? I remember how comfortable my little study nook was in the rat hole in which I lived as a student, then slogging through the Illinois winter, dealing with the Easter egg hunt in the library stuffed with physical books that were never shelved in sequence, and manually taking notes or feeding 10-cent coins into a foul-smelling photocopy machine that rarely produced a readable copy. Give me my laptop and a high-speed Internet connection. I’m a dinosaur, and I don’t want to go back to my research roots. I am confident that the professor who shaped my research style–Professor William Gillis, may he rest in peace–neither knew nor cared how I gathered my information, performed my analyses, and assembled the blather that whizzed me through university and graduate school.
If a dinosaur can figure out a better way, Tefloned along by Google, a savvy teen will too. Draw your own conclusions about the “whitebread” argument, but it does reinforce my research that suggests a powerful “pull” exists for search systems that work better, faster, and more intelligently than those today. Where there’s a market pull, there’s change. So, the notion of going back to the days of taking class notes on wax in wooden frames and wandering with a professor under the lemon trees is charming but irrelevant.
The Researcher of the Future
The British Library is a highly regarded, venerable institution. Some of its managers have great confidence that their perception of online in general and Google in particular is informed, substantiated by facts, and well considered. The Library’s Web site offers a summary of a new study called (and I’m not sure of the bibliographic niceties for this title): A Ciber [sic] Briefing Paper. Information Behaviour of the Researcher of the Future, 11 January 2008. My system’s spelling checker is flashing madly over the spelling of cyber as ciber, but I’m certainly not intellectually as sharp as the erudite folks at the British Library, living in rural Kentucky and working by the light of burning coal. You can download this 1.67-megabyte, 35-page document, Researcher of the Future.
The British Library’s Web site article identifies the key point of the study as “research-behaviour traits that are commonly associated with younger users — impatience in search and navigation, and zero tolerance for any delay in satisfying their information needs — are now becoming the norm for all age-groups, from younger pupils and undergraduates through to professors.” The British Library has learned that online is changing research habits. (As I noted in the first section of this essay, an old dinosaur like me figured out that doing research online is faster, easier, and cheaper than playing “Find the Info” in my university’s library.)
My reading of this weirdly formatted document, which looks as if it was a PowerPoint presentation converted to a handout, identified several other important points. Let me share my reading of this unusual study’s findings with you:
- The study was a “virtual longitudinal study”. My take on this is that the researchers did the type of work identified as questionable in the “whitebread” argument summarized in the first section of this essay. If the British Library does “Googley research”, I posit that Ms. Brabazon and other defenders of the “right way” to do research have lost their battle. Score: 1 for Google-Live.com-Yahoo. Nil for Ms. Brabazon and the British Library.
- Libraries will be affected by the shift to online, virtualization, pervasive computing, and other impedimentia of the modern world for affluent people. Score 1 for Google-Live.com-Yahoo. Nil for Mr. Brabazon, nil for the British Library, nil for traditional libraries. I bet librarians reading this study will be really surprised to hear that traditional libraries have been affected by the online revolution.
- The Google generation is supposed to consist of “expert searchers”. The reader learns instead that most people are lousy searchers. Companies developing new search systems are working overtime to create smarter search systems because most online users–forget about age, gentle reader–are really terrible searchers and researchers. The “fix” is computational intelligence in the search systems, not in the users. Score 1 more for Google-Live.com-Yahoo and any other search vendor. Nil for the British Library, nil for traditional education. Give Ms. Brabazon a bonus point because she reached her conclusion without spending money for the CIBER researchers to “validate” the change in learning behavior.
- The future is “a unified Web culture,” more digital content, eBooks, and the Semantic Web. The word unified stopped my ageing synapses. My research yielded data that suggest the emergence of monopolies in certain functions and increasing fragmentation of information and markets. Unified is not a word I can apply to the online landscape. In my Bear Stearns report published in 2007 as Google’s Semantic Web: The Radical Change Coming to Search and the Profound Implications to Yahoo & Microsoft, I revealed that Google wants to become the Semantic Web.
Wrap Up
I look forward to heated debate about Google’s role in “whitebreading” youth. (Sounds similar to waterboarding, doesn’t it?) I also hunger for more reports from CIBER, the British Library, and folks a heck of a lot smarter than I am. Nevertheless, my Beyond Search study will assert the following:
- Search has to get smarter. Most users aren’t progressing as rapidly as young information retrieval experts.
- The traditional ways of doing research, meeting people, even conversing are being altered as information flows course through thought and action.
- The future is going to be different from what big thinkers posit.
Traditional libraries will be buffeted by bits and bytes and Boards of Directors who favor quill pens and scratching on shards. Publishers want their old monopolies back. Universities want that darned trivium too. These are notions I sympathize with, but I recognize that the odds are indeed long.
Stephen E. Arnold, January 17, 2008
Google 2008 Publishing Output
January 1, 2008
If you had any doubt about Google’s publishing activities, check out “Google Blogging in 2008.” The article by Susan Straccia here provides a rundown of the GOOG’s self-publishing output. Google has more than 120 Web logs. The article crows about the number of unique visitors and tosses in some Googley references to Google fun. Pushing the baloney aside, the message is clear: Google has an effective, global publishing operation focused exclusively on promoting Google. Toss in the Google Channel on YouTube.com, and the GOOG has a communication, promotion, and distribution mechanism that few of its rivals can match. In my opinion, not even a major TV network in the US can reach as many eyeballs as quickly and cheaply as Googzilla. Competitors have to find a way to match this promotional 30mm automatic Boeing M230 chain gun.
Stephen E. Arnold, January 1, 2008