Google Semantics Surfacing

January 8, 2009

ReadWriteWeb.com (January 6, 2009) ran an interesting article that tiptoes around Google’s semantic activities. You will want to read “Did Google Just Expose Semantic Data in Search Results?” Google won’t answer the question, of course. But the addled goose will: “Yep, where have you been since early 2007?” Let me point out that Marshall Kirkpatrick has done a good job of tracking down “in the wild” examples of Google’s machine-based semantic methods. These examples (and others in Google open source documents) make it clear that the semantic activities are chugging along and maturing nicely. “Semantics” as used in this write up means “figuring out what something is about.” Once one knows the “about” part of an information object, then other methods can hook these “about” metadata together.

If you want to get a sense of the scope of the Google semantic system, click here. I have a checking copy of the report I wrote for Bear Stearns before that outfit went up in flames or down the drain. (Pick your metaphor.) My write up here does not include the detail that is in the full discussion in Google Version 2.0 here. But this draft provides some meat for the “in the wild” examples found in Mr. Kirkpatrick’s good article. How significant is the investment in semantics at Google? You can find some color on the sweep of Google’s semantic activities in the dataspace white paper Sue Feldman and I wrote (September 2008). You can get this report from IDC; it is report number 213562.
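
To make the “about” notion concrete, here is a minimal sketch of the linking step. The naive keyword tagger below is my stand-in for Google’s machine-based semantic methods, which Google does not disclose in this form; every name in the sketch is illustrative, not Google code:

```python
# A minimal sketch of the "aboutness" idea: figure out what each
# information object is about, then hook objects together through the
# shared "about" metadata. The keyword tagger is a toy stand-in for a
# real semantic tagging system.

KNOWN_CONCEPTS = {"semantics", "dataspace", "metadata", "search"}

def about(text):
    """Naive 'about' extractor: which known concepts does the text mention?"""
    words = {w.strip(".,").lower() for w in text.split()}
    return words & KNOWN_CONCEPTS

def link_objects(objects):
    """Connect any two objects that share at least one 'about' tag."""
    tagged = [(obj, about(obj)) for obj in objects]
    links = []
    for i, (a, tags_a) in enumerate(tagged):
        for b, tags_b in tagged[i + 1:]:
            shared = tags_a & tags_b
            if shared:
                links.append((a, b, shared))
    return links

docs = [
    "Google search relies on metadata.",
    "A dataspace hooks metadata together.",
    "Conference travel tips.",
]
for a, b, shared in link_objects(docs):
    print(shared, "links:", a, "<->", b)
```

Crude as it is, the sketch shows why the “about” step matters: once every object carries comparable metadata, the hooking-together is nearly free.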

Let me close with three observations:

  1. Google is deeply involved in semantics, but with a Googley twist. Watching for examples in the wild is a very useful activity, especially for competitors.
  2. The notion of semantics is sufficiently broad to embrace metadata generation and new types of metadata, which means new types of data constructs can be generated automatically by Google. Think publishing new constructs for money.
  3. The competitors chasing Google face the increasingly likely prospect that Google has jumped over its own present position and will land even farther ahead of the likes of IBM, Microsoft, Oracle, and SAP. Yahoo. Forget them. The smart Yahooligans are either at Google or surfing on Google.

Now I expect some push back from the anti-Google crowd. Have at it. Just make sure you have internalized Google’s technical papers, Google “talks”, and the patent documentation. This goose is not too interested in uninformed challenges. You can read more about Google semantics in my forthcoming Google and Publishing study, due in the spring from my trusty publisher Infonortics Ltd., located near Oxford, England.

Stephen Arnold, January 8, 2009

Data for the 21st Century

January 6, 2009

A happy quack to Max Indelicato for his “Scalability Strategies Primer: Database Sharding” here. Mr. Indelicato has gathered very useful information about data management tactics. Unlike the IBM-Microsoft-Oracle database literature, this write up delivers useful, interesting specifics. Download and save the article. For me, the most important comment in the write up was:

You may be wondering if there is a high amount of overhead involved in always connecting to the Index Shard and querying it to determine where the second data retrieving query should be executed. You would be correct to assume that there is some overhead, but that overhead is often insignificant in comparison to the increase in overall system performance, as a result of this strategy’s granted parallelization. It is likely, independent of most dataset scenarios encountered, that the Index Shard contains a relatively small amount of data. Having this small amount of lookup data means that the database tables holding that data are likely to be stored entirely in memory. That, coupled with the low latencies one can achieve on a typical gigabit LAN, and also the connection pooling in use within most applications, and we can safely assume that the Index Shard will not become a major bottleneck within the system (have fun cutting down this statement in the comments, I already know it’s coming 🙂
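
Mr. Indelicato’s two-step read is easier to see in code than in prose. Here is a minimal sketch of the pattern, with an in-memory dictionary standing in for the index shard’s small database table; the class names and the hash-based placement policy are my own illustrative choices, not code from the primer:

```python
# Minimal sketch of index-shard routing: a small lookup structure tells
# the application which data shard holds a given key, then the real
# query runs against that shard only.

class IndexShard:
    """Small, hot lookup table; in practice a tiny DB table held in RAM."""
    def __init__(self):
        self._locations = {}  # key -> shard id

    def register(self, key, shard_id):
        self._locations[key] = shard_id

    def locate(self, key):
        return self._locations[key]

class DataShard:
    """One partition of the full dataset."""
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self._rows = {}

    def put(self, key, row):
        self._rows[key] = row

    def get(self, key):
        return self._rows[key]

class ShardedStore:
    """Two-step read: consult the index shard, then hit one data shard."""
    def __init__(self, shard_count):
        self.index = IndexShard()
        self.shards = [DataShard(i) for i in range(shard_count)]

    def write(self, key, row):
        shard_id = hash(key) % len(self.shards)   # simple placement policy
        self.shards[shard_id].put(key, row)
        self.index.register(key, shard_id)

    def read(self, key):
        shard_id = self.index.locate(key)         # the cheap extra lookup
        return self.shards[shard_id].get(key)     # one targeted data query

store = ShardedStore(shard_count=4)
store.write("user:42", {"name": "Max"})
print(store.read("user:42"))
```

The point is the read path: one cheap lookup against a small, hot structure, then one targeted query against a single data shard instead of a scan across all of them. That is the overhead-versus-parallelization trade the quote defends.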

Ah, the Google legacy coming to light.

Stephen Arnold, January 6, 2009

Cloud Data Storage

January 5, 2009

The UK publication Sys-con.com published “Data Storage Has Been Taken for Granted” here. You may have to fight through some pop-ups and kill the sound on the auto-running commercial, but you will want to put up with this wackiness to read Dave Graham’s article. Mr. Graham does a good job of highlighting the needs of cloud data storage. This initial article will be followed by other segments, so you will want to snag each of them. In this first installment, the most important comment for me was:

Each type of content, whether it be structured or unstructured, has different influencing factors affecting its storage and retrieval.

The significance of this comment is that a vendor or storage provider will have to have a specific framework in place to handle the demands of each type of data storage and access. Why is this important? I run into quite a few people who dismiss storage as a non-issue. It is not trivial, and data management remains one of the factors that govern the performance and cost of a storage system. The phrase “garbage in, garbage out” has given way to “get data in, get data out easily, quickly, and economically.”
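
A minimal sketch of what such a framework might look like appears below. The backend names and routing policies are my illustrative assumptions, not Mr. Graham’s architecture or any vendor’s:

```python
# Toy routing framework: each content type goes to a backend suited to
# its access pattern, which is the point of the quote above. All names
# and policies here are illustrative assumptions.

POLICIES = {
    # content kind: (backend, rationale)
    "structured":   ("relational_store", "transactional, schema-bound access"),
    "unstructured": ("object_store",     "large blobs, streamed retrieval"),
    "index":        ("memory_store",     "small, hot, latency-sensitive"),
}

def route(content_kind):
    """Pick a backend for a content type; unknown kinds fall back to objects."""
    backend, rationale = POLICIES.get(content_kind,
                                      ("object_store", "default for blobs"))
    return backend, rationale

for kind in ("structured", "unstructured", "video"):
    backend, why = route(kind)
    print(f"{kind:12s} -> {backend} ({why})")
```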

Stephen Arnold, January 5, 2009

New Conference Pushes beyond Search

January 5, 2009

After watching some of the traditional search and content processing conferences fall on their swords, muffins, and self-assurance in 2008, I have rejiggled my conference plans for 2009. One new venue that caught my attention is The Rockley Group’s event in Palm Springs, California, January 29-30, 2009. You can get more information about the program here. The event organizer is Ann Rockley, who is one of the people emphasizing the importance of intelligent content.


Ann Rockley, The Rockley Group

I spoke with Ms. Rockley on January 2, 2009. The text of that conversation appears below:

Why is another conference needed?

Admittedly there are a lot of conferences around for people to attend, but not one that focuses specifically on the topic of Intelligent Content. My background is content management, structured content, and XML. There are lots of conferences that focus mainly on the technology, others that focus on the content vehicle or channel (e.g., Web), and others that focus on XML. The technology oriented conferences often lose sight of the content: who it’s for, how we can most effectively create it, and most importantly how we can optimize it for our customers. The content channel oriented conferences, e.g., Web, focus only on the vehicle and forget that content is not just about the way we distribute it; content should be optimized for each channel, yet at the same time it should be possible to repurpose and reconfigure the content for multiple channels. And XML conferences tend to be highly technical, focusing on the code and the applications and not on how we can optimize our content using XML so that we can manipulate it and transform it much the way we do data. So this conference is all about the CONTENT! Identifying how we can most effectively create it so that we can manipulate it, transform it, and deliver it in a multitude of ways personalized for a particular audience is an area of focus sadly lacking in many conferences.

With topics like Web 2.0 and Social Search I am at a loss to know what will be covered. What are the issues your conference will address?

Web 2.0 is about social networking and sharing of content and media, and it has had a tremendous influence on content. Organizations have huge volumes of content stuck in static web pages or files, and they have a growing volume of content stuck, and sometimes lost, in the masses of content being accumulated in wikis, blogs, etc. How can organizations integrate their content, share their content, and make it most useful to their customers and readers without a lot of additional work? How do we combine the best of Web 2.0 with the best of traditional content practices? Organizations don’t have the time, resources, or budget to do all the things we need and want to do for our customers, but if we create our content intelligently in the first place (structure it, tag it, store it) we can increase our ability to do so much more and increase our ability to effectively meet our customers’ needs. This conference was specifically designed to answer those questions.

Intelligent Content provides a venue for sharing information on such topics as:

  • Personalization (structured content, metadata and XQuery)
  • Intelligent publishing (dynamic multichannel delivery)
  • Hybrid content strategies (integrating Web 2.0 content with traditional customer content)
  • Dynamic messaging/personalized marketing
  • Increasing findability
  • Content/Information Management

Most attendees complain about two things: The quality of the presentations and the need for better networking with other attendees. How are you addressing these issues?

We are doing things a little differently. All the speakers have been assigned a mentor for review of their outline, drafts, and final materials. We are working with them closely to ensure that the presentations are top notch, and we have asked them all to summarize their information in Best Practices and Tips. In addition, Intelligent Content was designed to be a small intimate conference with lots of opportunities to network. We will have a luncheon with tables focused on special interests, and we are arranging “Birds of a Feather” dinners where like-minded people can get together over a great meal and chat, have fun, and network. We also have a number of panels which are designed to work interactively with the audience. And to increase the feeling of intimacy we have not chosen to hold the conference in a traditional “big box” hotel; rather, we have chosen a boutique hotel, the Parker Palm Springs (http://www.starwoodhotels.com/lemeridien/property/overview/index.html?propertyID=1911), a hotel favored by Hollywood stars from the 1930s. It is a very cool hotel with lots of character that encourages people to have fun while they interact and network.

What will you offer attendees?

The two day conference includes 16 sessions, 2 panels, breakfast, lunch, and snacks. It also includes organized networking sessions both for lunch and dinner, and opportunities to ask the experts key questions. And the conference isn’t over when it is over; we are setting up a Community of Practice including a blog, discussion forum, and webinars to continue to share and network so that every attendee will have an instant ongoing network.

I enjoy small group sessions and simple things like going to dinner with a group of people whom I don’t know. Will you include activities to help attendees make connections?

Absolutely. We deliberately designed the conference to be a small intimate learning experience so people weren’t lost in the crowd, and we have specifically created a number of luncheon and dinner networking experiences.

How can I register to attend? What is the URL for more information?

The conference information can be found at www.intelligentcontent2009.com. Contact info@intelligentcontent2009.com if you have questions. Note that the conference hotel is really a vacation destination so we can only hold the rooms at the special rate for a limited time and that expires January 12th so act quickly. And we’ve extended the early bird registration to Jan. 12 as well. If you have any other questions you can contact us at moreinfo@rockley.com.

Stephen Arnold, January 5, 2009

Google Free Security Monograph

January 3, 2009

The Google press machine has churned out a security monograph. The Browser Security Handbook is typical Google: terse, basic information designed to make life easier for Google, available without charge. You can download a copy from the Google Online Security Blog link here. Enjoy.

Stephen Arnold, January 3, 2009

Browser Share Drop for Microsoft Is Bad News

January 2, 2009

The netbooks have arrived in rural Kentucky. Beyond Search now has two of these devices. Nothing beats the IBM mainframe in my opinion, but even old geese have to adapt. Netbooks can run applications, but we find ourselves using portable applications and services available via WiFi or the Verizon wireless service. Once Firefox is up and running, we have found that over time cloud-based services such as Google Apps are good enough. As fond as we are of the MVS/TSO approach to computing, the browser or browser-like environments seem to be the future. Victor Godinez’s “Internet Explorer’s Share of the Browser Market Fell below 70% in November” here struck us as bad news for Microsoft. The article contains a nifty graphic showing the vendors’ respective market shares too. Data reported second or third hand can be wide of the mark. Let’s assume that these figures are spot on. So what? In our opinion, a decline in Internet Explorer’s share of the market means that other vendors have sucked some oxygen from the Microsoft ecosystem. Microsoft can keep on breathing, but the company needs to address the problem. Other browser developers may ramp up their attack on IE, which has lagged Chrome, Firefox, Safari, and Opera in some key features. If the shift is evident to computer users in rural Kentucky, the more informed folks in more intellectually astute areas will be even more aware of the importance of the browser and browser-like environments. Chrome, in our opinion, only looks like a browser. Chrome is a software airlock that connects a computing device to the Google mothership. If Chrome succeeds in snapping its airlock on more computers, Microsoft’s share of the browser market may continue to experience labored breathing.

Stephen Arnold, January 2, 2009

Dataspace Boomlet

December 30, 2008

First, the UK moved toward pervasive monitoring. You can read about it here. Australia made some moves in a similar direction, which you can read about here. Now The Hindu reports “Crime Scene Investigation Now to Function with Broadband.” You can read this story about POLNET here. For me the most interesting comment in the article was:

Police stations across the country will feed and upload video and still footage of the crime spot on POLNET which will then be transmitted to the Central server in Delhi and can be accessed by authorized experts.  Such analysis would help in understanding the modus operandi of criminals and terrorists and prepare a strategy to tackle the same.

These systems are baby steps toward nation state dataspaces. Unlike a database, a dataspace makes richer metadata available to investigators. I have no position on the policies these nation states are implementing. What’s important to me are these issues:

  • The idea of a dataspace, not a database, is clearly gaining traction. Obviously traditional databases cannot deliver the value that their licensees desire. (A sketch of the distinction appears after this list.)
  • The dataspace analyses will place considerable strain on the nation states’ data processing ability. The jam regarding the Bush White House digital data is an example of the data management burden that will become a major issue in 2009.
  • The company best positioned to provide cloud based processing of these data is, in my opinion, Google. If there is a dip in advertising, the GOOG can contact certain countries with an offer to use Google’s proprietary data management systems for a fee. The pricing model can be a variant of the Clearwell Systems’ approach; that is, by the gigabyte.
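
For readers who want the flavor of the distinction flagged in the first bullet, here is a minimal sketch of a dataspace-style catalog. All names are illustrative assumptions; this is not Google’s implementation or any vendor’s:

```python
# A minimal sketch of the dataspace idea: unlike a database, which
# demands one schema up front, a dataspace catalogs heterogeneous
# participants (files, feeds, databases) and accretes metadata and
# relationships about them over time. All names are illustrative.

from dataclasses import dataclass, field

@dataclass
class Participant:
    """Any data source co-existing in the dataspace, whatever its format."""
    name: str
    kind: str                      # e.g. "video", "rdbms", "log feed"
    metadata: dict = field(default_factory=dict)

@dataclass
class Dataspace:
    participants: list = field(default_factory=list)
    relations: list = field(default_factory=list)  # (source, predicate, target)

    def add(self, participant):
        self.participants.append(participant)

    def relate(self, source, predicate, target):
        # Relationships are added incrementally as they are discovered,
        # not forced into a master schema at design time.
        self.relations.append((source, predicate, target))

    def find(self, **criteria):
        return [p for p in self.participants
                if all(p.metadata.get(k) == v for k, v in criteria.items())]

ds = Dataspace()
ds.add(Participant("crime_scene_042.mp4", "video",
                   {"station": "Delhi", "case": "042"}))
ds.add(Participant("case_records", "rdbms", {"station": "Delhi"}))
ds.relate("crime_scene_042.mp4", "evidence_for", "case_records")
print([p.name for p in ds.find(station="Delhi")])
```

The point is that participants keep their native formats; the dataspace accretes metadata and relationships around them instead of forcing everything into one schema up front.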

I know that most people are blissfully unaware of dataspace technology. I can point you to the for-fee report Sue Feldman and I wrote in September 2008. I cannot reproduce that document, IDC Number 213562, here. Some dataspace information appears in my Gilbane Report Beyond Search, which is available here. In my view from my hollow in rural Kentucky, there will be some activity in the dataspace sector in 2009.

Stephen Arnold, December 30, 2008

Google Translation Nudges Forward

December 27, 2008

I recall a chipper 20 something telling me what she learned in her first class in engineering; to wit, “Patent applications are not products.” As a trophy generation member, flush with entitlement, she is generally correct, but patent applications are not accidental. They are instrumental. If you are working on translation software, you may want to check out Google’s December 25, 2008, “Machine Translation for Query Expansion.” You can find this document by searching the wonderful USPTO system for US20080319962. Once you have that document in front of you, you will learn that Google asserts that it can snag a query, generate synonyms from its statistical machine translation system, and pull back a collection. There are some other methods in the patent application. When I read it, my thought was, “Run a query in English, get back documents in other languages that match the query, and punch the Google Translate button to see the source document in English.” Your interpretation may vary. I was amused that the document appeared on December 25, 2008, when most of the US government was on holiday. I guess the USPTO is working hard to win the favor of the incoming administration.
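
Based on my reading of the claims, a toy version of the method might look like the sketch below. The tiny phrase table and the crude overlap scoring are my stand-ins for Google’s statistical machine translation system and its ranking, which the application does not spell out at this level:

```python
# Toy sketch of machine-translation query expansion as I read
# US20080319962: translate the query terms, add confident translations
# as synonyms, retrieve across languages. The phrase table is a
# stand-in for a statistical MT system.

PHRASE_TABLE = {
    "car": [("voiture", 0.9), ("auto", 0.6)],
    "insurance": [("assurance", 0.95)],
}

DOCUMENTS = [
    {"id": 1, "lang": "fr", "text": "assurance voiture pas chère"},
    {"id": 2, "lang": "en", "text": "cheap car insurance quotes"},
]

def expand_query(terms, min_prob=0.5):
    """Add translations above a confidence threshold as query synonyms."""
    expanded = set(terms)
    for term in terms:
        for translation, prob in PHRASE_TABLE.get(term, []):
            if prob >= min_prob:
                expanded.add(translation)
    return expanded

def search(terms):
    """Score documents by overlap with the expanded, cross-language query."""
    expanded = expand_query(terms)
    hits = []
    for doc in DOCUMENTS:
        score = len(expanded & set(doc["text"].split()))
        if score:
            hits.append((score, doc))
    return [doc for score, doc in sorted(hits, key=lambda h: h[0], reverse=True)]

for doc in search(["car", "insurance"]):
    print(doc["id"], doc["lang"], doc["text"])
```

Running the sketch returns both the French and the English documents for an English query, which is the behavior the claims describe; wiring the French hit to a translation step would complete the loop I imagined above.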

Stephen Arnold, December 27, 2008

Google and the Telcos: The Saga Continues

December 24, 2008

Earlier this year, Mercer Island Group and I held a series of briefings for telco executives. We reviewed Google’s considerable body of technology related to telephony. In those briefings, we encountered push back. Telcos did not understand what Google had been doing for seven or eight years. Furthermore, the telcos viewed the world of Google as one confined to looking up innocuous information on a free Web search with mostly meaningless advertisements sprinkled on the edges of the results. The Wall Street Journal reported that the GOOG allegedly communicated with some telcos to get a deal for high speed access for certain types of services. The story ran, and Googzilla showed its fangs. But the story did not die. Now you can read more by Adam Lashinsky, editor at large for Fortune Magazine, a dead tree output of the giant Time Warner. The digital version of the story, “Google Wants Something for Nothing,” is here. I don’t subscribe to paper magazines anymore, so I can’t say if this CNN version is the whole enchilada or just the crumbs. The article runs down the Wall Street Journal’s story and takes more of a “Google is doing something” approach. For me the most interesting comment was:

The bottom line here isn’t the fine points of public policy. The main thing is attitude. The Web culture thinks things should be free. Internet access is a commodity. Music videos are for the taking.

You may want to read the story to get some insight into the perils of writing about Google and then rationalizing the differences between the “Web culture” and the dead tree crowd. My thought is that neither the telcos nor outfits like New York magazine publishers have a solid understanding of the scope of Google’s services and their implications for companies with business models that no longer work very well. I want to see what the New Year brings.

Stephen Arnold, December 24, 2008

Digg: A Case Study in Cost Control

December 20, 2008

In 1993, Chris Kitze and I created The Point (Top 5% of the Internet). Most people don’t know about this Web site. We sold it to Lycos a few years after opening for business. Now Lycos has gone out of business. Chris and I learned about Internet costs, and we were able to remain unstained by the red ink that many companies find ruining their sneakers. Flash forward 15 years, and the Digg story that carries such enticing headlines as “Digg’s Miserable Business” and “Digg’s Sorry Revenue Stream” made me realize how little knowledge there is about Internet costs and the challenge of scaling a business in line with revenue or available capital. Digg, I want to point out, has venture backing and can operate at a loss for a while. But the news stories and the way the writers have addressed the fundamental issue, cost control, show how a weak grasp of some basic fundamentals can drag down an operation. My hope is that the losses or negative cash flow that Digg is experiencing will focus more attention on the challenge of Internet economics.

Chris and I learned that traditional business school cost assumptions don’t work in some Internet centric functions. For example, one expects that over time, costs for hardware, software, and maintenance would stabilize and chug along with a reasonable and predictable annual growth rate. Wrong. Exogenous factors like a crash have to be fixed at once, which is expensive. Environmental changes such as an operating system upgrade or an innovation have to be accommodated. Again the costs spike. Another example is what I call the burden of success. We started The Point, agreed to a specific amount of bandwidth, and then watched as our traffic surged on a daily basis. When the bandwidth bill came, we had to cover it. We did not expect that success translated into predatory pricing from our friendly service provider.

Digg’s financial situation is instructive. I hope that its economics influence some of the budgeting that other young companies are doing as I write this post. In a meeting on Friday, December 19, 2008, one of my clients discussed financial assumptions with my team. The Digg example carried weight, and I think the business plan costs section will be reworked. Internet economics, not technology, tolls a death knell for many promising operations. Traditional MBA thinking has demonstrated it does not work in the broader economy, and it does not work for many individual operations.
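
To make the “burden of success” arithmetic concrete, here is a toy model of the bandwidth trap Chris and I fell into. Every number in it is a hypothetical assumption for illustration, not an actual rate from our contract or from Digg’s books:

```python
# Toy illustration of the "burden of success": committed bandwidth is
# cheap, but traffic above the commitment is billed at a punitive
# overage rate, so costs spike faster than traffic. All figures are
# hypothetical assumptions.

COMMITTED_GB = 500          # bandwidth included in the monthly contract
BASE_FEE = 2_000.00         # flat monthly fee for the committed amount
OVERAGE_PER_GB = 15.00      # punitive rate for every GB past the commitment

def monthly_bill(traffic_gb):
    overage = max(0, traffic_gb - COMMITTED_GB)
    return BASE_FEE + overage * OVERAGE_PER_GB

# Traffic doubling month over month; revenue rarely doubles with it.
for month, traffic in enumerate([400, 800, 1600, 3200], start=1):
    print(f"month {month}: {traffic:5d} GB -> ${monthly_bill(traffic):,.2f}")
```

In this made-up schedule the bill goes from $2,000 to $42,500 in four months while traffic grows eightfold, which is the shape of the problem, if not the actual numbers.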

Stephen Arnold, December 20, 2008
