Big Data, Big Hassles
April 4, 2011
InfoWorld warns, “Big Data runs afoul of big lawyers.” The article emphasizes that increasingly popular “Big Data” can be inexpensive, until the attorneys get involved.
Big Data has come to refer to large datasets and the tools used to analyze them, a combination which can yield important information if used correctly. It can also be inexpensive.
However, you might want to bring in your counsel before going too far. The article tells the story of Pete Warden, who:
“. . .Described how he spent just $100 to scrape 500 million Web pages, including 220 million Facebook public profiles, using his own Web crawler and a 100-machine cluster running on Amazon EC2. He was able to analyze the information to match Twitter, LinkedIn, and Facebook accounts with the email accounts of users of his email tool.
“Then, just for fun, he created interactive maps showing how various countries, U.S. states, and cities connect with each other over social media and what types of fan pages they frequent.”
Neat, huh? Facebook didn’t think so. Their legal department cost him over 30 times the money he spent on the adventure.
So, venture forth, but be careful as you explore this new arena.
Cynthia Murrell, April 4, 2011
Freebie
People and Big Data: Analytics for Mr and Ms Couch Potato
March 24, 2011
I have to admit that the idea of big data and the “people” was a concatenation new to me. I just read “Data Science Toolkit Brings Big Data Analysis to the People.” Let’s look at this snippet:
Data Science Toolkit offers OCR functionality to convert PDFs or scanned image files to text files, filter geographic locations from news articles and other types of unstructured data or find political district and neighborhood information for any given location. Data Science Toolkit is available as a web service online, but it can also be downloaded and run on an Amazon EC2 or VM virtual machine.
I live in Harrod’s Creek, Kentucky. The “people” in this metropolis of a couple of thousand people consist of folks who use the Internet to look at pictures, send email, and maybe check out some online information about the local basketball scene. The sophisticated data consumers mostly work in my office. I know from my good morning chats at the local filling station cum junk food outlet that I am skewing the demographics with my generalization about Internet usage. Close enough for horseshoes as my grandfather used to say.
I think the idea of “big data” is interesting. We publish a curated blog called Inteltrax that covers some of the interesting companies in the data fusion market. But if you think interest in a $1.0 million enterprise search system appeals to a narrow readership, data fusion has the same magnetism. There are not any “people.” There are college graduates with mathematical expertise and a compelling need to process information. Here in Harrod’s Creek, the “people” are more likely to check email and then fire up the flat screen to watch hoops.
Maybe the observation about “people” is a variant of Potomac Fever; that is, those exposed to the craziness of power and money in Washington, DC, think that “everyone” has the same visceral reaction to political push ups. I once heard a person who worked in a think tank describe the firm’s discussions about client engagements as “drinking our own Kool-Aid.” Tastes great, but the Kool-Aid is not enjoyed with the same lip smacking elsewhere. When was the last time you guzzled pumpkin or red bean Kool-Aid?
My view:
- A useful service such as the one described in the write up looks a heck of a lot more magnetic than it may be. That’s the unsupported assertion about “people” when the reality is that a tiny percentage of savvy folks will get with the big data program as a Web service.
- The notion that “people” can manipulate big data and find a pot of gold at the end of the analytics rainbow is charming, but essentially incorrect. There are quite a few considerations to evaluate in the big data game. A shortcut can save time but also put the rental car in the ditch.
- Big data are the norm in many online operations. What is helpful to me is to explain that a tiny percentage of those with big data know what to do to squeeze nuggets from the log files.
Quite a story for me: I thought it was one of those PR, promo, search engine optimization type write ups. I then realized it was a Kool-Aid break after a lunch break in Silicon Valley where there is no Internet bubble. Absolutely not.
Stephen E Arnold, March 25, 2011
Freebie
Aster Data Snagged by Teradata
March 13, 2011
The food chain is a metaphor that can be applied to the business world without much imagination. The Register’s story, “Teradata Snaps Up Aster Data for $263m” demonstrates how the larger predator eats the smaller. Teradata learned that it needed a company that would rein in its clustered parallel database. Aster Data Systems was sniffed out and after a swift, painless hunt involving $263 million, Teradata swallowed its prey.
Aster Data’s nCluster software is a hybrid row and column database that runs on parallel clusters. The company also has a patent-pending SQL-MapReduce product that is a hybrid of normal data warehousing of structured data and organizer of unstructured data. (For some insight into Aster Data’s approach, you can read the September 22, 2010, interview with Quentin Gallivan in the ArnoldIT.com Search Wizards Speak subsite here.)
The Register story stated:
“Aster Data has been peddling its big data analytics software on Dell’s cloudy PowerEdge-C boxes as an appliance. Teradata also uses Dell iron to build its flagship Enterprise Data Warehouse (EDW) clustered appliances, so Aster and Teradata already have that in common.”
The tie up should prove interesting for two reasons. First, Teradata has some traditional data management technology and methods. Aster Data is one of the new breed of data management companies. Second, large firms are slow to change so working out the social aspects may require some cycles as well. But the deal looks like a good one and is another indication that data management is a hot sector.
Whitney Grace, March 13, 2011
Freebie
Suggest.io Database
February 7, 2011
Here’s an interesting new idea: self-learning databases. Suggest.io is designed to track information and feedback from visitors to your Web site.
When you create a free account, Suggest.io will build a database that tracks search content on your website. This database will then make suggestions, much like Google, when something is typed into the search box.
“With Suggest.io you can find out more about your visitors. For example, you may figure out what your visitors are more interested in by means of our powerful statistic tool, that allows you to spot the top rated search request…”
This sounds like a good add-in for any website and it allows you to be more like Google. Related information suggestions in search boxes are a handy tool to have. The graphic for the service may catch attention and give others a visual jolt.
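For readers who want a feel for how a suggestion box like this might work under the hood, here is a minimal sketch in Python. It is not Suggest.io's code or API. The class name `SearchLog` and its methods are hypothetical, and the sketch assumes the service simply counts what visitors type and offers the most frequent past searches that begin with the same letters.

```python
from collections import Counter


class SearchLog:
    """Hypothetical in-memory stand-in for a search-tracking database."""

    def __init__(self):
        # Maps each normalized search string to the number of times it was seen.
        self.counts = Counter()

    def record(self, query):
        """Store one visitor search, trimmed and lowercased."""
        cleaned = query.strip().lower()
        if cleaned:
            self.counts[cleaned] += 1

    def suggest(self, typed, limit=3):
        """Return the most frequent past searches that start with `typed`."""
        prefix = typed.strip().lower()
        matches = [(q, n) for q, n in self.counts.items() if q.startswith(prefix)]
        # Highest count first; alphabetical order breaks ties.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]

    def top_requests(self, limit=3):
        """Return the most frequent searches overall."""
        return [q for q, _ in self.counts.most_common(limit)]
```

Feeding it a few made-up searches (`record("big data")` three times, `record("big lawyers")` once) and then calling `suggest("big")` would put "big data" ahead of "big lawyers". A real service would also need persistence, spam filtering, and some way to age out stale queries, none of which is attempted here.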
Whitney Grace, February 7, 2011
Freebie
Linguamatics Says, Keep Experimenting
January 24, 2011
Linguamatics, which produces natural language processing technology, has posted a blog entry titled “Trend Analysis- Can a Prediction be Made?” The answer depends on the mathematics and the definition of a “prediction.”
For its example, the blog compares the popularity of a couple of politicians during their debates, as recorded through Twitter, to their election results. Using their I2E text mining software to analyze the Tweets, Linguamatics found a strong correlation.
However, the blog is missing details needed to definitively answer their own question. How did they use their data to calculate probability? Furthermore, what other types of predictions could this process make, and how?
The company claims that:
“This case study shows how the power of using NLP with the I2E software platform can be used to gain quite powerful insights on what is likely to happen based on opinions expressed by people using social media platforms.”
I’m afraid I’d have to see more results before I can agree with that opinion.
To read more about the company on their website, go to www.linguamatics.com.
Cynthia Murrell, January 22, 2011
Freebie
Nuggets: Real or Fake Gold?
January 10, 2011
Xoogler Daniel Tunkelang wrote a short item back linking to his earlier write up about information nuggets. You may want to take a look at “Exploring Nuggetize”. The illustration shows how the “nugget” method converts Noisy Channel articles into what are digital Post It notes with the key points extracted from the source. In the “Exploring Nuggetize” article there are references to facets, snippets, and search.
The key point in “Exploring Nuggetize” in my opinion was:
The nuggets are full sentences, and thus feel quite different from conventional search-engine snippets. Conventional snippets serve primarily to provide information scent, helping users quickly determine the utility of a search result without the cost of clicking through to it and reading it. In contrast the nuggets are document fragments that are sufficiently self-contained to communicate a coherent thought. The experience suggests passage retrieval rather than document retrieval.
Overall I am okay with the notion of nuggets and the highlighting of Dhiti and its Dive service. You can learn more about both at http://dhiti.com.
What caught my attention was the response by Dhiti in the comments section to the follow on write up “Enabling Exploratory Search with Dhiti”. The question Dhiti answered was related to the user’s behavior when the Dhiti “nuggetizing” widget is implemented on a blog.
Here’s the comment. Please, check the original here because I have trimmed the remarks for this post. Emphasis added by Beyond Search as well:
We [Dhiti] observe the following patterns…:
1) The widget does contribute to increased engagement. We see about 5-10% of readers “interact” with the widget, either to click through on an article… About 60% of the interactions are clicks on articles.
2) We notice that there’s a higher probability of readers reading the articles fully than normal…
3) We observe search referrals interact a lot more with the widget…. So there is more likelihood for exploration.
4) When a search query brings traffic to a page, Users … want to explore the site more for the same query!
5) Through the pivots, the publisher gets to know what their readers [are] … interested to explore around….
6) The pivots also provide cues to the publisher to create reference pages (like Wikipedia) …
Several observations:
First, “nuggets” is probably the wrong metaphor for this type of “informed extraction.”
Second, the approach offers some useful opportunities to gather metrics about a blog reader’s behavior. My reaction was, “Ah, something more useful than AdSense clicks or traditional log files.”
Third, the company has a good idea, is small with “three co-founders,” and based in Bangalore. Good idea and I have a hunch some of the big outfits in the world of search may be thinking about this function.
Stephen E Arnold, January 10, 2011
Freebie
Netvibes Dashboards and Search
December 19, 2010
The San Francisco Gate gives us another story about dashboards: “Introducing Netvibes Dashboard Intelligence Solutions: Business Intelligence Reinvented for the Real-Time Web.” Netvibes has invented the Dashboard Intelligence solution, a dashboard programmed with features, including SmartTagging, to collect, interpret, and organize real time information for businesses. Netvibes’s advertising declares that the dashboard will save time, generate usable, current data, and keep businesses abreast of all social media information. The SmartTagging feature is how most of these actions will be accomplished.
“SmartTagging can capture hidden value generated by an infinite number of everyday work activities. Users won’t need to learn any complex new tools–they will soon be able to simply click and tag anything they access online with their personal sentiment and share their expertise with the entire organization.”
SmartTagging will then distribute this information to other personnel, who then can comment, and their additions will be sent out. This creates a cyclical process, augmented by new, real time information that keeps being fed into the system. I wonder if there will be any repeated information or whether systems will get overloaded. Conclusion: do these dashboards actually make information access easier or harder in your opinion? Or, do dashboards provide a better user experience with the data pre-processed and ready to consume without critical thinking?
Whitney Grace, December 19, 2010
Freebie
Ant Tech: Not So New
December 16, 2010
Short honk: “Next Generation of Algorithms Inspired by Problem-Solving Ants” talked about swarming algorithms and hungry ants. If you are interested in swarming ants, you will find that ants are clever beasties. The write up reminded me of the Inferno search method, developed by NuTech Solutions. I did some work for the company eight years ago. The NuTech method involved swarming algorithms applied to search and retrieval. Just wanted to remind my two or three readers that information that seems so fresh and novel is often not that. NuTech was realigned and the Inferno search system shelved. The math wizard behind the product moved to Australia. Inferno did not need live ants, just algorithms that implemented certain numerical recipes based in part on what is called mereology.
Stephen E Arnold, December 16, 2010
Freebie
Google Explains Objectivity
December 15, 2010
Fast Company ran an interesting, but terse, article “Top Google Engineer.” The article addresses the issue of objectivity in Google search results. Some companies feel that Google is delivering less than objective search results. Here’s the passage that caught my attention:
“What we do at Google and what we’ve done for years is to not inject any subjectivity into these algorithms,” says Amit Singhal, Google Fellow and head of the company’s search quality, ranking, and algorithm team. “We didn’t want to introduce any bias into the mathematical modeling—our modeling is predicting, given a letter, what’s the probability of completion.”
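To make the phrase “given a letter, what’s the probability of completion” concrete, here is a rough sketch in Python. It is an illustration only, not Google’s method. The function name `completion_probabilities` is hypothetical, the query counts are invented for the example, and the sketch assumes the probability of a completion is nothing more than its share of past queries that begin with the typed letters.

```python
def completion_probabilities(prefix, query_counts):
    """Estimate P(query | prefix) as each matching query's share of the counts.

    `query_counts` maps lowercase query strings to how often each was seen.
    Returns an empty dict when no past query starts with `prefix`.
    """
    prefix = prefix.lower()
    matching = {q: n for q, n in query_counts.items() if q.startswith(prefix)}
    total = sum(matching.values())
    if total == 0:
        return {}
    return {q: n / total for q, n in matching.items()}


# Invented counts, purely for illustration.
toy_counts = {"search": 6, "sap": 3, "oracle": 1}
for query, p in completion_probabilities("s", toy_counts).items():
    print(f"{query}: {p:.2f}")
```

With these toy counts, typing “s” splits the probability two thirds to “search” and one third to “sap.” A pure frequency model like this has no place for hand tuning, which is the point of the quote; whatever happens outside such a model is what the questions that follow are about.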
Questions that crossed my mind upon reading the Fast Company article were:
- What is the role of hit boosting in the administrative components of the search algorithms?
- What administrative (human or algorithmic) interactions take place for results from large advertisers, partners, or internal Google units; for example, Google Apps or Google Local special advertising offers?
- What “supervisor” or “library look up” functions operate to place certain content in certain regions of a display page; for example, ads at the top and the side of a page or in other displays such as video?
- What is the numerical recipe for filling the various containers on a custom Google page; for example, Google.com/ig? Are humans involved in setting or tweaking threshold settings for page displays?
I understand the point about algorithms. I am curious about the supervisory functions performed by other algorithms and the Google engineers responsible for certain operations. I don’t have the answers to these questions, and I don’t think the recent articles about search result objectivity shine much light into the dark corners of search and page display administration, particularly the role of human engineers, code “janitors” (a term used in a Google patent document), and supervisory operations for results page assembly.
Stephen E Arnold, December 15, 2010
Freebie
Oracle Search Still Not Working
November 11, 2010
I know you think that SES11g is the best darned search system in the world. Like the Microsoft offering, SES11g has some interesting characteristics and a fascinating history. With sufficient resources, SES11g can search and retrieve.
However, this article addresses a different Oracle search. Navigate to “Desperately Seeking The CEO: Oracle Said To Hire Detectives To Find Apotheker.” Discover that Oracle cannot locate HP’s Leo Apotheker. Oracle wants Mr. Apotheker in order to get him to the court room. Excitement, Oracle hopes, will ensue. SAP already admitted that it made a misstep. Mr. Apotheker’s appearance will be like whipped cream on a hot fudge confection. Hey, SAP said it had stumbled, but it was just business.
According to the write up:
HP has said that Oracle’s efforts to get Apotheker to testify are interfering with his CEO duties and has called Oracle’s actions "harassment." The dispute is souring relations between one-time allies Oracle and HP.
My thought is to stuff known information about Mr. Leo into an Oracle database. Use Oracle’s business intelligence tools to crunch the data. Query the data sets with SES11g. Look at the outputs and go fetch Mr. Apotheker. Oh, I guess this did not work. Private contractors are looking for Mr. Apotheker the way I have had to hunt for certain data in Oracle tables. Manual stuff. Expensive. Doesn’t always work either. Rats. Might make a good movie, “Where in the World Is Leo Apotheker?”
Stephen E Arnold, November 11, 2010
Freebie

