Search Goes Down, Google Turns on the Juice
January 8, 2009
I saw several Web log posts and major media (dead tree outfits) articles about the decline in Web travel searches. A representative story is “Internet Travel Searches Drop 42 Percent” here. UK journalistic endeavors amuse me no end. Honk. Honk. Laura Dixon wrote:
Internet searches for flights were down over 40 per cent in the week after Christmas according to Hitwise, a division of Experian. Traffic to travel Web sites for the same period – up to the week ending January 3 – was also down 16 per cent.
Interesting but not the sort of data that makes me flap my wings. The addled goose thinks that a quick visit to Google.com is in order. I wonder if Ms. Dixon has entered this query in the Google search box: SFO LAX. That’s it. Two three letter strings. These abbreviations refer to airports. Here’s what the GOOG displayed for me:
My thought is that if the number of queries is down, what’s the value of appearing as one of the seven featured air ticket sources underneath the structured query insert? In fact, in the last few months, there’s been some shuffling of the featured carriers. Notice too that the Google system automatically converted the airport pair into a query. Set the dates, pick a vendor, and bingo you get a list of options. Pretty handy for mobile phone users too. I wonder if Microsoft will offer this feature on its forthcoming Verizon service.
To me the downturn in flight searches means that the Google will turn on the juice to get more revenue from advertisers who must get traffic. That’s a more interesting angle for the addled goose to consider. But I live in rural Kentucky and am not affiliated with an oh-so-excellent dead tree publication. There you go.
Stephen Arnold, January 8, 2009
Google Semantics Surfacing
January 8, 2009
ReadWriteWeb.com (January 6, 2009) ran an interesting article that tiptoes around Google’s semantic activities. You will want to read “Did Google Just Expose Semantic Data in Search Results”. Google won’t answer the question, of course. But the addled goose will, “Yep, where have you been since early 2007?” Let me point out that Marshall Kirkpatrick has done a good job of tracking down “in the wild” examples of Google’s machine-based semantic methods. These examples (and others in Google open source documents) make it clear that the semantic activities are chugging along and maturing nicely. “Semantics” as used in this write up means “figuring out what something is about.” Once one knows the “about” part of an information object, then other methods can hook these “about” metadata together. If you want to get a sense of the scope of the Google semantic system, click here. I have a checking copy of the report I wrote for BearStearns before that outfit went up in flames or down the drain. (Pick your metaphor.) My write up here does not include the detail that is in the full discussion in Google Version 2.0 here. But this draft provides some meat for the “in the wild” examples found in Mr. Kirkpatrick’s good article. How significant is the investment in semantics at Google? You can find some color on the sweep of Google’s semantic activities in the dataspace white paper Sue Feldman and I wrote (September 2008). You can get this report from IDC; it is report number 213562.
Let me close with three observations:
- Google is deeply involved in semantics, but with a Googley twist. Watching for examples in the wild is a very useful activity, especially for competitors.
- The notion of semantics is sufficiently broad to embrace metadata generation and new types of metadata so that new types of data constructs can be automatically generated by Google. Think publishing new constructs for money.
- The competitors chasing Google face the increasingly likely prospect that Google has jumped over its own present position and will land even farther ahead of the likes of IBM, Microsoft, Oracle, and SAP. Yahoo. Forget them. The smart Yahooligans are either at Google or surfing on Google.
Now I expect some push back from the anti Google crowd. Have at it. Just make sure you have internalized Google’s technical papers, Google “talks”, and the patent documentation. This goose is not too interested in uninformed challenges. You can read more about Google semantics in my forthcoming Google and Publishing study, due in the spring from my trusty publisher Infonortics Ltd., located near Oxford, England.
Stephen Arnold, January 8, 2009
Lawyers and Metadata
January 8, 2009
Now the indexing world gets something to gnaw on. Automated indexing systems beat out humans when measured by cost per item indexed, speed, and consistency. Automated indexing systems can be as good as a human for some types of content. But humans are variably bad at indexing. Software hits a sweet spot and doesn’t get significantly better or worse unless the content throws in a wrench. Now the issue of not providing metadata arises. We can automate the creation of metadata, but it is early days in the world of automatic metadata scrubbing. I quacked happily when I thought, “I wonder who knows where their metadata are?”
Jim Calloway’s “Metadata–What Is It and What Are My Ethical Duties” here breathes new life into human indexing. What I find interesting is that lawyers charge by the hour. Human indexers are paid by piece work schedules or given a flat yearly fee and maybe some benefit crumbs. The economics of human indexing is based on keeping the per record cost as low as possible whilst one maintains the “quality” of the indexing. “Quality” in the commercial database world is often defined as a metric such as “four to six index terms per bibliographic record” or “16 records per hour with required fields completed”. You may have a more academic definition, but my examples come from the soon-to-be-marginalized world of human commercial database production.
The article defines metadata in terms of a legal eagle, of course. But the story gets interesting when Mr. Calloway cites a situation in which metadata became a legal issue. Where there is a legal issue, there is the risk of a fine, jail, or losing pride of place among the brood of legal eagles. Forget the compensation. Ego may be a bigger force in the legal eagle world. Mr. Calloway nicely hooks metadata with risk.
For me, the most important comment in this useful write up was:
In this writer’s view, the key is to avoid sending out documents with metadata that could disclose confidential information. Comparing metadata to a wrongly sent fax or e-mail is questionable and the idea that lawyers will be prohibited from examining metadata while parties, law enforcement officers and private detectives will be free to do so seems artificial at best. The Colorado rule that one must disclose receiving confidential information via metadata before acting on it seems to strike a rational balance. The best rule is for law firms to develop best practices internally to keep metadata from “escaping” in the first place.
I quite like “keep metadata from escaping in the first place”. To close, let me ask several questions:
- Do you know why metadata are in the documents available for indexing on your Web site?
- Do you know how value added indexing in a dataspace can expand the access to a document in an often unrelated context?
- Do you know where metadata are in a document, in a Web page or other container housing the document, or in the dataspace created for the information objects?
If not, you will want to dig up this information yourself. Asking your attorney will result in a very large legal bill. One final question: Do you think Mr. Madoff knows about his metadata?
Stephen Arnold, January 8, 2009
Non-Techies and Metadata
January 8, 2009
The metadata quandary for legal eagles will stick like Kentucky mine run off. If you want to make sure your Word documents are metadata free, you will want to read “How to Remove the Hidden Metadata from Word Document” here. A slightly more interesting exercise is to aim a search engine’s content acquisition system at shared folders and browse what the spider catches in its digital web. If you think metadata are a liability, check out the goodies you harvest. Download any desktop indexing system that can access your network shares. Now you know why eDiscovery is so important and often quite interesting for those paid to pore through metadata.
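For the goslings who want to see what a Word file carries before a spider does, here is a minimal sketch. It assumes the newer .docx format, which is a zip package that keeps author and revision details in a part named docProps/core.xml. The function names and the sample values are made up for illustration, and older binary .doc files need a different approach.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml in the Office Open XML package format.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Properties worth a look before a document leaves the building.
FIELDS = ["dc:creator", "cp:lastModifiedBy", "dc:title", "cp:revision",
          "dcterms:created", "dcterms:modified"]


def read_core_metadata(docx):
    """Return the core properties found in a .docx (file path or file object)."""
    with zipfile.ZipFile(docx) as package:
        if "docProps/core.xml" not in package.namelist():
            return {}
        root = ET.fromstring(package.read("docProps/core.xml"))
    found = {}
    for field in FIELDS:
        node = root.find(field, NS)
        if node is not None and node.text:
            found[field] = node.text
    return found


def build_sample_docx():
    """Build a tiny in-memory package with made-up values (demonstration only)."""
    core = (
        '<cp:coreProperties xmlns:cp="%s" xmlns:dc="%s">'
        "<dc:creator>A. Lawyer</dc:creator>"
        "<cp:lastModifiedBy>B. Paralegal</cp:lastModifiedBy>"
        "</cp:coreProperties>" % (NS["cp"], NS["dc"])
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, value in read_core_metadata(build_sample_docx()).items():
        print(name, "=", value)
```

Point read_core_metadata at a real file path instead of the sample and you get a quick inventory of who touched the document. Tracked changes and comments live in other parts of the package, so this covers only a slice of what can escape.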
Stephen Arnold, January 8, 2009
Google Video Creeps Forward
January 7, 2009
Telecompaper.com reported on January 7, 2009, here that “T-Mobile Launches YouTube Channel for G1.” Google has a Google Channel on YouTube.com. How many more channels will be available for special niches? The GOOG, unlike the traditional TV crowd, generates metatags for its videos. Creating a channel is a software process, not one requiring humans sitting in dark control rooms twirling dials. Michael Hirschorn’s “End Times” here notwithstanding, the GOOG’s potential energy in another bastion of traditional energy will increase in force. Like an earthquake, a jump from a 2.0 to a 3.0 is not a linear force. Clever writing won’t do much to change the face of traditional media when Googzilla does its waltz to the Strauss tune Schatz-Walzer. There’s gold in those honking hot videos pumped to any device that can tap into the Google umbilical.
Stephen Arnold, January 7, 2009
Google and Disallow
January 7, 2009
You will want to check out “On Google Disallowing Crawling of Their Life Hosting” here. Google Blogoscoped has a good write up about this, to some, surprising development. Other search engines cannot index the Time Warner Life Magazine images. Google inserted a blocking line in its robots.txt file. I noticed that I was limited in the number of images I could browse when the service first went live. I was surprised that these images were available to me without a fee. For years, the Time crowd has noodled about its picture archive. First, Time wanted to handle the scanning itself. Then Time wanted to subcontract the work but that was too expensive. Then it was a good idea to talk with experts about what to do. Then the cycle repeated. Along came the GOOG and the rest, as someone will write after this goose is cooked, is history. Here’s what is going on in my opinion:
- Restrictive content access is going to become more visible. If you read the Guha patent applications from February 2007, you will have noted that Google’s system can operate in a discriminatory way. That translates, in my view of the world, to restrictions on what others can and cannot do with Google information. This is an important phrase: “Google information.” Please, note it, copyright lovers.
- The Life images are a big deal, and I am confident that the restrictions are probably positioned as part of the method to balance public access with protection for the assets of Time Warner. Everyone has needs, so this restriction is a nifty way of finding a middle way with Googzilla’s hands on the controls.
- The cost of getting the Life images was not trivial. I have not heard anything substantive about the financial burden of this project, but based on my prior knowledge of the magnitude of the scanning and logistics of the images, this puppy was expensive. In my view, unlike a pure academic library play, this deal has a price tag and someone has to pay at some point.
What’s ahead? Well, in my view, once Google creates metadata and populates one of its knowledgebases, those data will be protected and probably with considerable enthusiasm. Google’s programmable search engine generates data and if some data items are missing, the system beavers away until the empty cell is filled. Once those dataspaces are populated, the information is not for just anyone’s use.
I mentioned the word dataspaces in a telephone conversation today. I know I am not communicating. The person on the other end of the call asked, “What’s a dataspace?” Well, you are now disallowed from one.
Stephen Arnold, January 7, 2009
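For readers who have not looked at how a blocking line in robots.txt works, here is a small sketch using Python's standard urllib.robotparser module. The rule, the path, the crawler name, and the example.com URLs are invented for illustration; the post does not quote the line Google actually added.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The real file and path are not quoted
# in the write up, so this rule is an assumption for demonstration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /hosted/life
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if the given crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/hosted/life/photo1",
                "http://example.com/about"):
        verdict = "allowed" if can_crawl(SAMPLE_ROBOTS_TXT, "ExampleBot", url) else "blocked"
        print(url, "->", verdict)
```

Keep in mind that robots.txt is a request, not a lock. Well behaved spiders honor it; nothing in the protocol forces them to.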
Newspapers: Another Analysis of Failure
January 7, 2009
Slate’s Jack Shafer took a Tanaka ECS-3301 chain saw to traditional newspapers here. His “How Newspapers Tried to Invent the Web” was an enjoyable read for me. I don’t think the wizards at some of the formerly high flying newspaper companies were similarly affected. The hook for the article was Pablo J. Boczkowski’s 2004 book, Digitizing the News: Innovation in Online Newspapers. Armed with a fact platform, Mr. Shafer frolics through the misadventures of media mavens and the Web. The phrase I liked was “extreme suckage”. I wish this goose had thought of that. Wordsmithing aside, the comment that resonated with me was:
From the beginning, newspapers sought to invent the Web in their own image by repurposing the copy, values, and temperament found in their ink-and-paper editions. Despite being early arrivals, despite having spent millions on manpower and hardware, despite all the animations, links, videos, databases, and other software tricks found on their sites, every newspaper Web site is instantly identifiable as a newspaper Web site. By succeeding, they failed to invent the Web.
A congratulatory quack to Mr. Shafer for this write up. Read at once. Now think about a similar fate for motion picture outfits confident of their brilliance after a strong 2008. The party’s not over for that crowd. More about this in my forthcoming Google and Publishing monograph.
Stephen Arnold, January 7, 2009
Data for the 21st Century
January 6, 2009
A happy quack to Max Indelicato for his “Scalability Strategies Primer: Database Sharding” here. Mr. Indelicato has gathered very useful information about data management tactics. Unlike the IBM-Microsoft-Oracle database information, this write up delivers useful, interesting information. Download and save the article. For me, the most important comment in the write up was:
You may be wondering if there is a high amount of overhead involved in always connecting to the Index Shard and querying it to determine where the second data retrieving query should be executed. You would be correct to assume that there is some overhead, but that overhead is often insignificant in comparison to the increase in overall system performance, as a result of this strategy’s granted parallelization. It is likely, independent of most dataset scenarios encountered, that the Index Shard contains a relatively small amount of data. Having this small amount of lookup data means that the database tables holding that data are likely to be stored entirely in memory. That, coupled with the low latencies one can achieve on a typical gigabit LAN, and also the connection pooling in use within most applications, and we can safely assume that the Index Shard will not become a major bottleneck within the system (have fun cutting down this statement in the comments, I already know it’s coming 🙂
Ah, the Google legacy coming to light.
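The two step lookup in the passage above can be sketched in a few lines. This is a toy, not Mr. Indelicato's implementation: plain Python dictionaries stand in for the Index Shard and the data shards, the class names are invented, and round-robin placement is an arbitrary choice for the example.

```python
class IndexShard:
    """Small lookup table: record key -> name of the data shard holding it."""

    def __init__(self):
        self._location = {}

    def assign(self, key, shard_name):
        self._location[key] = shard_name

    def locate(self, key):
        return self._location[key]

    def knows(self, key):
        return key in self._location


class ShardedStore:
    """Toy store where every read is two queries: index shard, then data shard."""

    def __init__(self, shard_names):
        self.index = IndexShard()
        self._names = list(shard_names)
        self.shards = {name: {} for name in self._names}
        self._next = 0

    def put(self, key, record):
        if self.index.knows(key):
            # Existing key: keep it on the shard it already lives on.
            name = self.index.locate(key)
        else:
            # New key: round-robin placement (an assumption for this sketch).
            name = self._names[self._next % len(self._names)]
            self._next += 1
            self.index.assign(key, name)
        self.shards[name][key] = record

    def get(self, key):
        name = self.index.locate(key)    # query 1: ask the index shard
        return self.shards[name][key]    # query 2: fetch from the data shard


if __name__ == "__main__":
    store = ShardedStore(["shard_a", "shard_b"])
    store.put("alice", {"plan": "free"})
    store.put("bob", {"plan": "paid"})
    print(store.index.locate("bob"), store.get("bob"))
```

The index holds only keys and shard names, which is why the quoted passage expects it to fit in memory even when the data shards do not.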
Stephen Arnold, January 6, 2009
Search Pioneer Upshifts: Interview with Mike Weiner
January 6, 2009
In the 1980s I relied on a very fast search system for my personal computer. The program was Gopher from Microlytics. In the late 1990s, I met the founder of Gopher and tracked his interest in linguistic-centric search systems. I lost track of Mike Weiner, former president of Microlytics, but we spoke on the telephone a day or two ago. You can get information about Technology Innovations here. I captured his comments in an interview which is now available on the ArnoldIT.com Search Wizards Speaks sub site here.
Two comments in my conversation with Mr. Weiner struck a chord with me. Let me highlight these in this brief news item about the interview.
First, search has grown beyond the desktop. Mr. Weiner said in response to a question about desktop search:
…the desktop of today and tomorrow are connected to the “world.” So there can be very clever background processing done on your behalf that can leverage off the information you access and the information you create. The question will be, what’s useful and important to you, and can the system fetch, or generate, this, for you, and in an efficient form you can cognitively benefit from. One of the next potentials for incredible retrieval will be intelligent “information extraction.”
Second, Mr. Weiner’s new interests pivot on innovation. Technology Innovations holds patents on different facets of electronic paper or “epaper”. About the future of epaper, Mr. Weiner said:
I see epaper heavily used in educational publications, where children and learners have questions, need definitions, etc. You may see a speller and thesaurus, and translation technology coming bundled on books with electronic chips in them.
If you are interested in search and publishing in the 21st century, you will find the Mike Weiner interview interesting.
Stephen Arnold, January 6, 2009
Can You Find Crackle Videos with Crackle Search
January 6, 2009
At lunch the subject of video search came up among the Beyond Search goslings. One of the newly-hatched goslings mentioned that Sony’s Crackle was indexed thoroughly on Google Video. Furthermore, Sony uses YouTube.com to promote new, original Crackle content. For an example, click here. We fired up our baby Asus netbook and gave the flakey Verizon high speed wireless a go. Success. We were able to connect to the Crackle.com Web site and run queries on Google Video. What’s this have to do with search? Well, the search system on the Crackle.com site is not too good. The system uses a weird and hard to read blue type on black motif, returns matches on “star” and truncates the “ving” without warning, and generally seems sluggish.
I learned from the gosling that Sony bought the Grouper.com site for $65 million in 2007. Some background information is here. Renamed Crackle.com, Sony’s video site is positioned–well–out of site for me. I did explore the site via the search system. The programs like Rocketboom resonated. Sony paid a hefty sum to get the rights to distribute the quirky Net-centric video show. More information about this deal is here.
Sony is spending to be a player in video. But with the PlayStation sucking air and a global financial crisis bubbling away, one wonders if Sony can do much to boost the visibility of the Crackle.com service and have the money to fix the Crackle.com search system. One plus. Crackle.com works a lot better than the piggy Web site for the Sony electronic book.
Stephen Arnold, January 6, 2009