Dead Tree Update: Chicago and Suburban Shoppers

December 29, 2008

Newsweek Magazine, a dead tree publication in some danger of marginalization, published “Chicago’s Newspapers Facing a Troubled Future” here. When I read this article, I had the impression that the author, F.N. D’Alessio, was writing about Newsweek and the Associated Press. Mr. D’Alessio refers to newspaper “addicts”. I don’t know too many. I receive four dead tree newspapers: the Courier Journal, USA Today (affectionately known as McPaper), the New York Times, and the Wall Street Journal. I used to get the Financial Times, but the delivery was so erratic I dropped the paper in January 2008. I received an offer of a year’s subscription for $99, and I threw it in the trash. Too much hassle trying to work through clumps of papers arriving twice a week. For me, the most significant comment in the Newsweek story was a comment about the Tribune’s rival, the Chicago Sun Times:

Hollinger’s biggest move was to create the Sun-Times Media Group by buying up 70 suburban and neighborhood newspapers, more than a dozen of which are dailies. Some of those are profitable, and some newspaper analysts envision the Sun-Times company shutting down the namesake paper and keeping the suburban ones.

I read this as a clear statement that big city papers are gone geese. Check out the Tribune’s online version of the newspaper. It is a disaster. My discussion of this wounded duck is here.

The future for dead tree outfits–if there is to be one–is to become ad supported, micro publications serving narrow markets. For years, I thought the Gaithersburg Gazette had potential. Now that type of publication along with penny shoppers may be the margin of the information world available to the dead tree crowd.

You can make money in niches, but the revenue will buy used Malibus, not the flashy Mercedes the princes of journalism see as suitable transportation.

Stephen Arnold, December 29, 2008

Duplicates and Deduplication

December 29, 2008

In 1962, I was in Dr. Daphne Swartz’s Biology 103 class. I still don’t recall how I ended up amidst the future doctors and pharmacists, but there I was sitting next to my nemesis Camille Berg. She and I competed to get the top grades in every class we shared. I recall that Miss Berg knew that there were five variations of twinning: three dizygotic and two monozygotic. I had just turned 17 and knew about the Doublemint Twins. I had some catching up to do.

Duplicates continue to appear in data just as the five types of twins did in Bio 103. I find it amusing to hear and read about software that performs deduplication; that is, the machine process of determining which item is identical to another. The simplest type of deduplication is to take a list of numbers and eliminate any that are identical. You probably encountered this type of task in your first programming class. Life gets a bit more tricky when the values are expressed in different ways; for example, a mixed list with binary, hexadecimal, and real numbers plus a few more interesting versions tossed in for good measure. Deduplication becomes a bit more complicated.
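Here is a minimal sketch of the mixed-notation case, assuming the values arrive as strings and that binary and hexadecimal entries carry the usual `0b` and `0x` prefixes. The function names `canonical` and `dedupe_numbers` are my own inventions for illustration. The idea is to convert every value to one exact canonical form before comparing, so that "10", "0b1010", and "10.0" collapse into a single entry.

```python
from fractions import Fraction


def canonical(value: str) -> Fraction:
    """Convert one textual number to an exact canonical value.

    Assumes 0b/0x prefixes for binary and hexadecimal; anything else is
    parsed as a decimal or fractional string. Negative prefixed values
    are not handled in this sketch.
    """
    text = value.strip().lower()
    if text.startswith("0b"):
        return Fraction(int(text, 2))
    if text.startswith("0x"):
        return Fraction(int(text, 16))
    return Fraction(text)


def dedupe_numbers(values):
    """Keep the first spelling of each distinct numeric value, in order."""
    seen = set()
    unique = []
    for raw in values:
        key = canonical(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique


if __name__ == "__main__":
    print(dedupe_numbers(["10", "0b1010", "0xA", "10.0", "255", "0xff", "2.5"]))
```

Using `Fraction` instead of `float` avoids rounding surprises when comparing values such as "0.1" and "1/10". The sketch keeps whichever spelling appears first, which is itself a policy decision about what “the” record is.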

At the other end of the scale, consider the challenge of examining two collections of electronic mail seized from a person of interest’s computers. There is the email from her laptop. And there is the email that resides on her desktop computer. Your job is to determine which emails are identical, prepare a single deduplicated list of those emails, generate a file of emails and attachments, and place the merged and deduplicated list on a system that will be used for eDiscovery.

Here are some of the challenges that you will face once you answer this question, “What’s a duplicate?” You have two allegedly identical emails and their attachments. One email is dated January 2, 2008; the other is dated January 3, 2008. You examine each email and find that the difference between the two emails is the inclusion of a single slide in one of the two PowerPoint decks. What do you conclude?

  1. The two emails are not identical, so include both emails and both attachments.
  2. The earlier email is the accurate one, so exclude the later email.
  3. The later email is the accurate one, so exclude the earlier email.

Now consider that you have 10 million emails to process. We have to go back to our definition of a duplicate and apply the rules for that duplicate to the collection of emails. If we get this wrong, there could be legal consequences. A system developer who generates a file of emails based only on a mathematical process that flags a record as different may be using a method too crude for the problem in the context of eDiscovery. Math helps, but it is not likely to be able to handle the onerous task of determining near matches and the reasoning required to determine which email is “the” email.
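To make the gap between “mathematically different” and “near match” concrete, here is a rough sketch in Python. It assumes each email is a dictionary with hypothetical `subject`, `body`, and `attachments` keys, where the attachments are raw bytes. The function names and the 0.9 threshold are assumptions of mine, not a description of any eDiscovery product.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(email: dict) -> str:
    """Hash the normalized subject, body, and attachments of one email."""
    digest = hashlib.sha256()
    for field in ("subject", "body"):
        # Collapse whitespace and case so trivial formatting changes do not matter.
        digest.update(" ".join(email[field].split()).lower().encode("utf-8"))
        digest.update(b"\x00")
    for blob in sorted(email["attachments"]):
        digest.update(hashlib.sha256(blob).digest())
    return digest.hexdigest()


def classify(a: dict, b: dict, threshold: float = 0.9) -> str:
    """Label a pair as 'exact', 'near' (needs human review), or 'distinct'."""
    if fingerprint(a) == fingerprint(b):
        return "exact"
    similarity = SequenceMatcher(None, a["body"], b["body"]).ratio()
    return "near" if similarity >= threshold else "distinct"


if __name__ == "__main__":
    jan2 = {"subject": "Q4 deck", "body": "Please see the attached slides.",
            "attachments": [b"slide1slide2"]}
    jan3 = {"subject": "Q4 deck", "body": "Please see the attached slides.",
            "attachments": [b"slide1slide2slide3"]}
    print(classify(jan2, jan3))
```

The hash alone reports the January 2 and January 3 emails as different, which is true but not useful. The similarity check flags the pair as a near match, and a person or a documented rule still has to decide which one is “the” email. Comparing every pair this way is quadratic, so for 10 million emails you would need to group candidates first, for example by sender and subject.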

image

Which is Jill? Which is Jane? Parents keep both. Does data work like this? Source: http://celebritybabies.typepad.com/photos/uncategorized/2008/04/02/natalie_grant_twins.jpg

Here’s another situation. You are merging two files of credit card transactions. You have data from an IBM DB2 system and you have data from an Oracle system. The company wants to transform these data, deduplicate them, normalize them, and merge them to produce one master “clean” data table. No, you can’t Google for an offshore service bureau; you have to perform this task yourself. In my experience, the job is going to be tricky. Let me give you one example. You identify two records which agree in field name and data for a single row in Table A and Table B. But you notice that the telephone number varies by a single digit. Which is the correct telephone number? You do a quick spot check and find that half of the entries from Table B have this variant, or you can flip the analysis around and say that half of the entries in Table A vary from Table B. How do you determine which records are duplicates?
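One way to at least surface the problem is to normalize the telephone numbers and count how many digits differ. The sketch below assumes each record is a dictionary with a `phone` field plus whatever key fields you choose to compare; the field names `card` and `amount` in the example are hypothetical. It does not answer which number is correct. It only separates clean duplicates from records that need a rule or a human.

```python
import re


def normalize_phone(raw: str) -> str:
    """Strip everything except digits so formatting differences disappear."""
    return re.sub(r"\D", "", raw)


def digit_differences(a: str, b: str):
    """Count differing positions; return None when the lengths do not match."""
    if len(a) != len(b):
        return None
    return sum(1 for x, y in zip(a, b) if x != y)


def compare_records(rec_a: dict, rec_b: dict, key_fields) -> str:
    """Return 'duplicate', 'review', or 'different' for one pair of records."""
    if any(rec_a[field] != rec_b[field] for field in key_fields):
        return "different"
    diff = digit_differences(normalize_phone(rec_a["phone"]),
                             normalize_phone(rec_b["phone"]))
    if diff == 0:
        return "duplicate"
    if diff == 1:
        return "review"  # single-digit variance: someone has to decide
    return "different"


if __name__ == "__main__":
    table_a = {"card": "4111", "amount": "19.99", "phone": "(502) 555-0100"}
    table_b = {"card": "4111", "amount": "19.99", "phone": "502-555-0101"}
    print(compare_records(table_a, table_b, ["card", "amount"]))
```

If half of Table B lands in the "review" pile, that pattern is a clue in itself. A systematic one-digit shift points to a load or transformation error in one system, not to thousands of individual typos.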


Google Translation Nudges Forward

December 27, 2008

I recall a chipper 20 something telling me what she learned in her first class in engineering; to wit, “Patent applications are not products.” As a trophy generation member, flush with entitlement, she is generally correct, but patent applications are not accidental. They are instrumental. If you are working on translation software, you may want to check out Google’s December 25, 2008, “Machine Translation for Query Expansion.” You can find this document by searching the wonderful USPTO system for US20080319962. Once you have that document in front of you, you will learn that Google asserts that it can snag a query, generate synonyms from its statistical machine translation system, and pull back a collection. There are some other methods in the patent application. When I read it, my thought was, “Run a query in English, get back documents in other languages that match the query, and punch the Google Translate button and see the source document in English.” Your interpretation may vary. I was amused that the document appeared on December 25, 2008, when most of the US government was on holiday. I guess the USPTO is working hard to win the favor of the incoming administration.

Stephen Arnold, December 27, 2008

Algorithms for All

December 27, 2008

A happy quack to the reader who sent me this link to the ACM’s collection of old, bad, not too old and not too bad algorithms. You can access the list and download the algorithms here. The collection task was a big one. Tim Hopkins, University of Kent, has his name on the referenced page. The geese at ArnoldIT.com want to thank him for his work. Keep in mind that algorithms’ beauty may be found in the eye of the beholder. Some of these are gems; others will choke even a modern hot rod computer. Test and retest, quacks the goose.

Stephen Arnold, December 27, 2008

Reading Google Paw Lines to Foretell Its Future

December 26, 2008

Alex Chitu must have been close enough to Googzilla to get it to show its paw for a fortune telling session. You can read his “Predictions for Google’s 2009” in Google Operating System here. His observations for the most part are interesting and I think, like Nostradamus, some of these predictions may be “true”. For example, Google Translate will become a more widely deployed function in Google products and services. You will find my discussion of Google’s December 25, 2008, patent application US20080319962 germane to this prediction. If you want to peer beyond Mr. Chitu’s flat statement, download the patent document and check out the claims. I also agree that Google Contacts will gain some beef in 2009. If you have been watching the weird ritual mating dance between Googlers and Salesforce.com, you may conclude that the GOOG wants more from customer relationship management than a quick buy out of Salesforce.com for its multi tenant inventions and the company’s potent marketing engine. The personalized search ads have been visible to me on a couple of my Google “ig” sessions, so that’s a slam dunk for 2009. You can read his other prognostications here. I would like to mention three predictions that I hoped he would mention but did not. These are quite addled, and so these are ideal for the Beyond Search addled goose crystal ball output; namely:

  1. Companies in sectors unrelated to Web search and online advertising realize that the GOOG is disrupting their businesses. The addled goose watched in 2008 as commercial database companies and telecommunications companies woke up to a strange, new, Googley world. Can you guess the business sectors? You can get a list of these plus a diagram in my 2007 Google Version 2.0 which is still available. Click here to order.
  2. Authors will turn to Google as a way to sell, not just market, their original work. With dead tree publishing companies racing toward Armageddon, the GOOG as a publishing medium will come into its own. Google has quite a few technical documents explaining in considerable detail how to make this happen.
  3. Regulators in various countries will realize that Google heralds a new spin on globalization. Local operations deliver quite specific products and services, yet the plumbing exists “out there” so it is tough to deal with the GOOG under existing regulatory umbrellas.

What do you think the GOOG will do in 2009? Oh, I know that Google is just a Web search company in the business of selling ads in a deteriorating economic climate. I am a silly goose for having articulated that Google is more, much more.

Stephen Arnold, December 26, 2008

European Digital Library Back Online

December 26, 2008

The Inquirer reported here that the digital library sponsored by the European Union is online again. You can read the announcement here in “Euro Library Re-Opens”. More servers and more optimism should help the service which crashed when it first opened. The addled goose asks, “When will the EU lose its appetite for pumping money into infrastructure?” I am now calculating the odds that the EU seeks help from a company able to scale. Google is a long shot, but the Exalead engineers could contribute.

Stephen Arnold, December 26, 2008

Yahoo’s Four Issues

December 26, 2008

TheStreet.com ran Eric Jackson’s “Reasons behind Yahoo’s Four Year Slump” here. Mr. Jackson does a good job of summarizing the received wisdom about the company’s challenges. Few can disagree that Yahoo’s leadership has been uninspired. Mr. Jackson moves quickly to identify product leadership and the company’s organizational structure challenges. I wanted to add several observations that, in my opinion, also contribute to the company’s singular lack of effectuality:

First, the Yahoo technology generates one offs. News releases accompany these initiatives. That’s great for the public relations company and for the developers who hop on the Yahoo bandwagon. Build Your Own Search is a good example. Yahoo makes it easy for search developers to piggyback on Yahoo’s Web index. The excitement is certainly due to Yahoo’s making this service available without charge. Google offers some free searching too, but from what I hear the GOOG is quick to contact those developers who come to Google’s attention. Fees are never far from Googzilla’s mind. My point is that monetization does not seem to be a top priority at Yahoo. In today’s business environment, I think that is an issue.

Second, over the years Yahoo has acquired a wide range of companies. Based on the information I have, Yahoo has been content to let these outfits chug along. Yahoo was on the portal path when the GOOG decided to focus on search and seek inspiration from the Overture paid search service. The GOOG, whether by luck or input from former Altavista.com engineers, created a relatively homogeneous computing infrastructure. Not the Yahooligans. After Yahoo collected companies, some of these outfits operated as services available within a portal, portlets, if you will. Instead of integrating acquisitions into a homogeneous platform, Yahoo has a more heterogeneous infrastructure. As a result, agility and cost control are difficult, if not impossible, for Yahoo to deliver on a daily basis.

Third, Yahoo has managed to create the internal environment that preceded the Pan Slavic initiatives of the last century. On the surface, Yahooligans get along and love one another. In the day to day dealings, I have heard that the sweetness and light dissipate. With cultural issues in information technology and the types of management and leadership problems at which Mr. Jackson hints, I think Yahoo is in a vulnerable position.

What will happen in 2009? The Yahoo of the 1996 to 1999 period will become a dim memory. The 2009 Yahoo is morphing into an America Online with a different logo. Now tell me why I am wrong. Just offer up some holiday facts to support your position.

Stephen Arnold, December 25, 2008

Pew Speeds Quantifies the Dead Tree Blight in the Information Forest

December 25, 2008

Pew Research Center for the People and the Press (a go-to info source for the Washington DC crowd) released a news story here with the snappy title “Internet Overtakes Newspapers as News Source.” The write up is in the typical charts and graphs style. You can wade through the data and the inevitable footnotes designed to make it easy for Statistics 101 teachers to create an assignment via cut and paste. For me, the key point was:

Currently, 40% say they get most of their news about national and international issues from the Internet, up from just 24% in September 2007. For the first time in a Pew survey, more people say they rely mostly on the Internet for news than cite newspapers (35%). Television continues to be cited most frequently as a main source for national and international news, at 70%.

The rest of the data are interesting but not the pivot point for me like this shift to the Internet and the crowning of TV as king for 2009.

image

The outlook for traditional publishing. Source: http://static.howstuffworks.com/gif/deforestation-2.jpg

My interpretation of these data, as you may expect, is slightly different from the “newspapers are dead” analysis I have offered in the past. Here’s the variant:

  1. More newspapers will chase the online world. I think this is akin to what I have seen described as death throes. These actions presage the passing, not the dawning of a rebirth. Citizen input forms for breaking news, anyone?
  2. The magazines and professional journals are next in line. These intellectual mavens will find tough rowing when budget caps crash on hapless accountants, crushing the publishers of serials under the weight of increasing costs.
  3. The anti Google crowd needs to speed up their efforts to crush Google Video Search, YouTube.com, and the Google Channel. The children of the dead tree crowd are already defecting, so the publishing and media elite are not able to generate folks who share their mums’ and daddies’ love of 16th century intellectual artifacts. I can see the scene now. Mum says, “Take out the garbage.” Media child pulls iPod ear buds out of her ears and says, “What?” Media child goes back to the Macbook video, stuffs the ear buds in her ears, and grabs her iPhone to send a text message that says, “Parentz R 2 lame.”

Now you may point out that I omitted book publishers. No, I didn’t omit them. When the New York publishing houses started to announce cut backs in new titles last month, I wrote that segment off as intellectual meat through the sausage machine.

Okay, dead tree lovers, tell me I am wrong. Just include facts. Pew data are okay. Examples like the tiny 10,000 circulation newspaper in New Jersey are fine. Fancy books used to decorate law offices and upscale dwellings qualify as well. Just include data with your addled goose guidebook.

Stephen Arnold, December 25, 2008

When Will the Dead Tree Times Come Crashing Down

December 24, 2008

Peter Kafka, writing in Media Memo, provided a useful summary of the New York Times’s bleak November 2008. You can read his article “New York times: November Was So Terrible, Even Our Internet Ads Were Down” in “D: All Things Digital” here. Mr. Kafka provides a link to more nitty gritty here. For my purposes, the key point was that Internet ad revenue and other Internet revenue declined as well. What is my agenda? Three points to brighten your December 24th:

  1. What’s the big surprise? The Wall Street Journal’s early push into proprietary desktop software more than a decade ago and then its flirtation with BRS search generated losses as well. As one traditional newspaper after another brought its Gutenberg business model to online, revenues were not just down, most online ventures were disasters. Anyone recall the Knight Ridder “play”? Or, what about the Times’s own smooth move to kill the “Times file save tool”? Nice, especially with no provision to save the content elsewhere. Info about this gaffe is here.
  2. For most traditional publishers, the online train has left the station. I used to work at a newspaper, and I have watched costs run out of control. The response has been to trim down staff and raise rates. Stepping back and thinking about alternative business models was a common practice. Whilst pondering, new media luminaries rose (Michael Arrington’s TechCrunch) and new distribution systems (Google to name one) emerged. Where were the traditional newspapers? Most were congratulating themselves on their ability to “manage” their problems. Wrong. The management was a hoax, and the problems are now really expensive and painful to solve.
  3. It is not a question of which newspaper is next. My analysis suggests that the old doctrine of the “Domino Theory” is right for our time and our place. Once the gray lady topples, the others will plummet with little or no warning. In death as in life, most of today’s dead tree publications are a bit like sheep. These four footed wizards are heading toward the cliff’s edge.

You can read Erick Schonfeld’s analysis here.

image

My thought picture of the New York Times and some other dead tree publishers. Source: http://parkerlab.bio.uci.edu/pictures/photography%20pictures/bigthumbs/screenDead%20tree%20on%20Inyo%20crest.jpg

I am delighted I have four or five readers. One or two may take issue with my opinions. If I reached more people, I would have to deal with assertions, cat calls, and so-so vilifications. My specialist study Publishing on the Internet: A New Medium for a New Millennium (Infonortics, 1996) included this statement:

… Technology’s impact cannot be predicted. Small innovations and incremental improvements interact almost organically and behave in complex ways. The use of those technologies and the impact of those instrumental applications of what appear to be harmless inventions create a new type of information environment. This environment is not routinely recognized as a distinct construct almost in the way that a fish does not recognize water. Experts on electronic information are only beginning to come to grips with the rhetorical and syntactical rules of the network publishing information types. The social impact is not fully understood. The implications for commerce, education, medicine, and politics are not understood. Indeed many think it is business as usual. We do not know what will come next. Many aspects of digital life seem unpredictable. They are. It is the datasphere showing its true colors.

I suggest beet red as the new color to signify failure. Okay, tell me why I am wide of the mark even for an addled goose in rural Kentucky.

Stephen Arnold, December 24, 2008

Internet Explorer and European Users at Odds

December 24, 2008

Update: December 25, 2008–You will want to read this useful analysis of Microsoft’s browser market share woes here. Wolfgang Gruener has done a good job in his article “How serious is the market share loss of Microsoft’s Internet Explorer?”

Original Post

PCWorld ran a story called “IE Loses European Market Share” here. Certain types of data interest me, and I fancy statistics that chart the ups and downs of certain software giants. Microsoft is fun to watch. It offers what I call the “ejecting executive game.” The idea is to count the number of executives who leave the company’s search and content processing units. Another interesting activity is watching the company’s share of the Web search market. The Microsoft Web search 8.3 percent market share is documented here. Ouch. I found the data about Internet Explorer’s market share even more interesting; to wit:

Microsoft’s browser dipped under 60% for the first time in August, rallied slightly in September, but then dipped below that bar again during October and November, said XiTi Monitor, a Web measurement site operated by Applied Technologies Internet of Merignac, France.

Why the slip? Internet Explorer lacks those nifty plug ins and the cachet which seems to surround Firefox. I use Firefox portable, and I am delighted with its speed. I find it intuitive. The weird focus changes in Internet Explorer drive me wild with the opportunity to retype certain strings.

Add up the declining share of the Web search market, toss in a few ejected execs, and the loss of Internet Explorer users in Europe. Ask yourself, “What’s this mean?” And let your mind conclude that Microsoft has trouble in the old country.

Stephen Arnold, December 24, 2008
