Big Data and Its Fry Cooks Who Clean the Grill

April 1, 2016

I read “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” A survey?

According to the capitalist tool:

A new survey of data scientists found that they spend most of their time massaging rather than mining or modeling data.

The point is that few wizards want to come to grips with the problem of figuring out what’s wrong with data in a set or a stream and then getting the data into a form that can be used with reasonable confidence.

Those exception folders are annoying, aren’t they?

The write up points out that a data scientist spends 80 percent of his or her time doing housecleaning. Skip the job, and the house becomes unpleasant indeed.

The survey also reveals that data scientists have to organize the data to be analyzed. Imagine that. The baloney about automatically sucking in a wide range of data does not match the reality of the survey sample.
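The cleaning and organizing chore the survey describes is mundane but real. Here is a minimal pure-Python sketch, with invented field names and formats, of the sort of normalization that eats up those hours:

```python
def clean_record(raw):
    """Normalize one raw record; return None if it is unusable."""
    cleaned = {}
    # Trim whitespace and unify case on the name field.
    name = (raw.get("name") or "").strip()
    if not name:
        return None  # route to the "exception folder"
    cleaned["name"] = name.title()
    # Coerce revenue to a float, treating blanks and "N/A" as missing.
    revenue = (raw.get("revenue") or "").replace(",", "").strip()
    cleaned["revenue"] = float(revenue) if revenue not in ("", "N/A") else None
    return cleaned

rows = [
    {"name": "  acme corp ", "revenue": "1,200.50"},
    {"name": "", "revenue": "900"},          # no name: exception
    {"name": "Widget Co", "revenue": "N/A"}, # missing revenue
]
cleaned = [clean_record(r) for r in rows]
good = [r for r in cleaned if r is not None]
print(good)
```

Multiply this by dozens of fields and sources, and the 80 percent figure stops looking surprising.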

Another grim bit of drudgery emerges from the sample, which we assume was collected with the appropriate textbook procedures: the skill most in demand was SQL. Yep, old school.

Consider that most of the companies marketing next generation data mining and analytics systems never discuss grunt work and old fashioned data management.

Why the disconnect?

My hunch is that it is the sizzle, not the steak, which sells. Little wonder that some analytics outputs might be lab-made hamburger.

Stephen E Arnold, April 1, 2016

Search as a Framework

March 26, 2016

A number of search and content processing vendors suggest their information access system can function as a framework. The idea is that search is more than a utility function.

If the information in the article “Abusing Elasticsearch as a Framework” is spot on, a non-search vendor may have taken an important step toward turning that assertion into a reality.

The article states:

Crate is a distributed SQL database that leverages Elasticsearch and Lucene. In its infant days it parsed SQL statements and translated them into Elasticsearch queries. It was basically a layer on top of Elasticsearch.

The idea is that the framework uses discovery, master election, replication, and the like along with the Lucene search and indexing operations.

Crate, the framework, is a distributed SQL database “that leverages Elasticsearch and Lucene.”
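The translation idea is easy to illustrate. Below is a toy sketch of turning a simple SQL-style equality filter into an Elasticsearch query DSL document; a real translator like Crate’s handles vastly more than this:

```python
# Toy SQL-to-Elasticsearch translation: build the query DSL (a JSON
# document) from a list of parsed "field = value" conditions. This
# handles equality filters only; it is an illustration, not Crate.

import json

def sql_where_to_es(conditions):
    """conditions: list of (field, value) equality pairs."""
    return {
        "query": {
            "bool": {
                "must": [{"term": {field: value}} for field, value in conditions]
            }
        }
    }

# SELECT * FROM logs WHERE status = 'error' AND host = 'web-1'
es_query = sql_where_to_es([("status", "error"), ("host", "web-1")])
print(json.dumps(es_query, indent=2))
```

The layer-on-top-of-Elasticsearch description amounts to doing this kind of rewriting for the full SQL grammar.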

Stephen E Arnold, March 26, 2016

A Data Lake: Batch Job Dipping Only

February 11, 2016

I love the Hadoop data lake concept. I live in a mostly real time world. The “batch” approach reminds me of my first exposure to computing in 1962. Real time? Give me a break. Hadoop reminded me of those early days. Fun. Standing in line. Waiting and waiting.

I read “Data Lake: Save Me More Money vs. Make Me More Money.” The article strikes me as a conference presentation illustrated with a deck of PowerPoint goodies.

One of the visuals was a modern big data analytics environment. I have seen a number of representations of today’s big data yadda yadda set ups. Here’s the EMC take on the modernity:

[image: EMC’s diagram of a modern big data analytics environment]

Straight away, I note the “all” word. Yep, just put the categorical affirmative into a Hadoop data lake. Don’t forget the video, the wonky stuff in the graphics department, the engineering drawings, and the most recent version of the merger documents requested by a team of government investigators, attorneys, and a pesky solicitor from some small European Community committee. “All” means all, right?

Then there are two “environments.” Okay, a data lake can have ecosystems, so the word environment is okay for flora and fauna. I think the notion is to build two separate analytic subsystems. Interesting approach, but there are platforms which offer applications to handle most of the data slap about work. Why not license one of those, for example, Palantir or Recorded Future?

And that’s it?

Well, no. The write up states that the approach will “save me more money.” In fact, one does not need much more:

The savings from these “Save me more money” activities can be nice with a Return on Investment (ROI) typically in the 10% to 20% range. But if organizations stop there, then they are leaving the 5x to 10x ROI projects on the table. Do I have your attention now?

My answer, “No, no, you do not.”

Stephen E Arnold, February 11, 2016

Big Data Blending Solution

January 20, 2016

I would have used Palantir or maybe our own tools. But an outfit named National Instruments found a different way to perform data blending. “How This Instrument Firm Tackled Big Data Blending” provides a case study and a rah rah for Alteryx. Here’s the paragraph I highlighted:

The software it [National Instruments] selected, from Alteryx, takes a somewhat unique approach in that it provides a visual representation of the data transformation process. Users can acquire, transform, and blend multiple data sources essentially by dragging and dropping icons on a screen. This GUI approach is beneficial to NI employees who aren’t proficient at manipulating data using something like SQL.

The graphical approach has been part of a number of tools. There are also some systems which just figure out where to put what.
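Under the drag-and-drop icons, blending usually reduces to a keyed join plus some transformation. A pure-Python sketch with invented sources shows the work the GUI hides:

```python
# Blending two hypothetical sources (CRM records and order records)
# by a shared key. Unmatched rows are set aside, much like the
# exception streams blending tools surface to the user.

crm = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Widget Co"},
]
orders = [
    {"customer_id": 1, "total": 250.0},
    {"customer_id": 1, "total": 100.0},
    {"customer_id": 3, "total": 75.0},  # no matching CRM record
]

by_id = {row["customer_id"]: row for row in crm}
blended = []
for order in orders:
    customer = by_id.get(order["customer_id"])
    if customer is None:
        continue  # unmatched rows would land in an exception stream
    blended.append({"name": customer["name"], "total": order["total"]})

print(blended)  # two rows for Acme; the orphan order is dropped
```

The GUI is a convenience layer over exactly this kind of logic, which is why the rich media question below still matters.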

The issue for me is, “What happens to rich media like imagery and unstructured information like email?”

There are systems which handle these types of content.

Another challenge is the dependence on structured relational data tables. Certain types of operations are difficult in this environment.

The write up is interesting, but it reveals that a narrow view of available tools may produce a partial solution.

Stephen E Arnold, January 20, 2016

Machine Learning Hindsight

January 18, 2016

Have you ever found yourself saying, “If I only knew then what I know now”? It is a moment we all experience, but instead of stewing over our past mistakes, it is better to share the lessons we’ve learned with others. Data scientist Peadar Coyle learned some valuable lessons when he first started working with machine learning. He discusses three main things he learned in the article, “Three Things I Wish I Knew Earlier About Machine Learning.”

Here are the three items he wishes he had known then about machine learning:

  • “Getting models into production is a lot more than just micro services
  • Feature selection and feature extraction are really hard to learn from a book
  • The evaluation phase is really important”

Developing models is an easy step, but putting them into production is difficult. There are many major steps that need attention, and doing all of the little jobs is not feasible on huge projects. Peadar recommends outsourcing when you can. Books and online information are good reference tools, but when they cannot be applied to actual situations, the knowledge is useless. Peadar learned that real world experience has no substitute. Evaluation is just as important: life does not hand over perfect datasets, and testing a model against different situations gives a truer picture of its worth.
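The evaluation point can be made concrete with a hold-out split. The data and both “models” below are synthetic and deliberately simple; the point is that the held-out set, not the training set, does the judging:

```python
# Hold-out evaluation: labels follow a known rule (x > 5), one model
# is a majority-class baseline, the other "learns" a threshold.

data = [(x, int(x > 5)) for x in range(10)]  # features 0..9
train, test = data[::2], data[1::2]          # crude hold-out split

def majority_baseline(train):
    ones = sum(label for _, label in train)
    majority = int(ones > len(train) - ones)
    return lambda x: majority

def threshold_model(train):
    # "Learn" a cut point as the midpoint between the class means.
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > cut)

def accuracy(model, test):
    return sum(model(x) == y for x, y in test) / len(test)

print(accuracy(majority_baseline(train), test))  # 0.6
print(accuracy(threshold_model(train), test))    # 0.8
```

Without the held-out test set, both models could look equally plausible; the evaluation phase is what separates them.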

Peadar’s advice applies to machine learning, but it also applies to life in general.


Whitney Grace, January 18, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Open Source Data Management: It Is Now Easy to Understand

January 10, 2016

I read “16 for 16: What You Must Know about Hadoop and Spark Right Now.” I like the “right now.” Urgency. I am not sure I feel too much urgency at the moment. I will leave that wonderful feeling to the executives who have sucked in venture money and have to find a way to generate revenue in the next 11 months.

The article runs down the basic generalizations associated with each of these open source data management components:

  • Spark
  • Hive
  • Kerberos
  • Ranger/Sentry
  • HBase/Phoenix
  • Impala
  • Hadoop Distributed File System (HDFS)
  • Kafka
  • Storm/Apex
  • Ambari/Cloudera Manager
  • Pig
  • Yarn/Mesos
  • Nifi/Kettle
  • Knox
  • Scala/Python
  • Zeppelin/Databricks

What the list tells me is two things. First, open source data management tools are proliferating. Second, quite a few committed developers will be needed to keep these projects afloat.

The write up is not content with this shopping list. The intrepid reader will have an opportunity to learn a bit about:

  • Kylin
  • Atlas/Navigator

As the write up swoops to its end point, I learned about some open source projects which are a bit of a disappointment; for example, Oozie and Tez.

The key point of the article is that Google’s MapReduce, which is now pretty long in the tooth, is effectively marginalized.
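Long in the tooth or not, the MapReduce model itself is simple enough to sketch in a few lines of pure Python: map emits key-value pairs, a shuffle groups them by key, and reduce collapses each group. The word-count example below is the classic illustration, not production Hadoop:

```python
# Miniature MapReduce word count: the three phases as plain functions.

from collections import defaultdict

def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield (word.lower(), 1)  # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # group values by key
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["Hadoop and Spark", "Spark and more Spark"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'hadoop': 1, 'and': 2, 'spark': 3, 'more': 1}
```

The successors on the list above mostly replace the batch plumbing around this idea, not the idea itself.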

The Balkanization of data management is evident. The challenge will be to use one or more of these technologies to make some substantial revenue flow.

What happens if a company jumps on the wrong bandwagon as it leaves the parade ground? I would suggest that it may be more like a Pig than an Atlas. The investors will change from Rangers looking for profits to Pythons ready to strike. A Spark can set fire to some hopes and dreams in the Hive. Poorly constructed walls of Databricks can come falling down. That will be an Oozie.

Dear old Oracle, DB2, and SQLServer will just watch.

Stephen E Arnold, January 10, 2016

Short Honk: Hadoop Ecosystem Made Clear

January 3, 2016

Love Hadoop. Love all things Hadoopy? You will want to navigate to “The Hadoop Ecosystem Table.” You have categories of Hadoopiness with examples of the Hadoop amoebae. You are able to see where Spark “fits” or Kudu. Need some document data model options? The table will deliver: ArangoDB and more. Useful stuff.

Stephen E Arnold, December 30, 2015

Data Managers as Data Librarians

December 31, 2015

The tools of a librarian may be the key to better data governance, according to an article at InFocus titled, “What Librarians Can Teach Us About Managing Big Data.” Writer Joseph Dossantos begins by outlining the plight data managers often find themselves in: executives can talk a big game about big data, but want to foist all the responsibility onto their overworked and outdated IT departments. The article asserts, though, that today’s emphasis on data analysis will force a shift in perspective and approach—data organization will come to resemble the Dewey Decimal System. Dossantos writes:

“Traditional Data Warehouses do not work unless there [is] a common vocabulary and understanding of a problem, but consider how things work in academia. Every day, tenured professors and students pore over raw material looking for new insights into the past and new ways to explain culture, politics, and philosophy. Their sources of choice: archived photographs, primary documents found in a city hall, monastery or excavation site, scrolls from a long-abandoned cave, or voice recordings from the Oval office – in short, anything in any kind of format. And who can help them find what they are looking for? A skilled librarian who knows how to effectively search for not only books, but primary source material across the world, who can understand, create, and navigate a catalog to accelerate a researcher’s efforts.”

The article goes on to discuss the influence of the “Wikipedia mindset”; data accuracy and whether it matters; and devising structures to address different researchers’ needs. See the article for details on each of these (especially on meeting different needs). The write-up concludes with a call for data-governance professionals to think of themselves as “data librarians.” Is this approach the key to more effective data search and analysis?
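The catalog idea translates directly into a data structure. Here is a toy “card catalog” in pure Python that indexes mixed-format holdings by curated subject tags (all fields invented), which is roughly what a data librarian’s governance layer would maintain at scale:

```python
# Index heterogeneous holdings by subject so a researcher can find
# material regardless of format -- the librarian's catalog in miniature.

from collections import defaultdict

catalog = [
    {"id": "doc-1", "format": "photograph", "subjects": ["city hall", "1920s"]},
    {"id": "doc-2", "format": "scroll", "subjects": ["monastery", "liturgy"]},
    {"id": "doc-3", "format": "audio", "subjects": ["oval office", "1970s"]},
]

index = defaultdict(set)
for item in catalog:
    for subject in item["subjects"]:
        index[subject].add(item["id"])

# A researcher asks the "librarian" for everything on a subject,
# whatever the format:
print(sorted(index["city hall"]))  # ['doc-1']
```

The hard part, as the article implies, is not the index but the curation: someone has to assign the tags consistently.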

Cynthia Murrell, December 31, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Caution about NoSQL Databases

December 22, 2015

I read “Exasol and Birst Join In-Memory Database to ‘Networked’ BI to Aid Mutual Expansion.” Another day, another marketing tie up. But the article contained a very interesting statement, attributed to a Birst big dog:

“NoSQL databases are great for atomic storage and retrieval, and for elastic scaling over a distributed [server] environment, but when it comes to doing aggregations with joins – and that’s what analytics is about – it is just not what they are built for.”

I wonder if that shot is aimed at outfits like MarkLogic. Worth watching this partnership.
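For readers unsure what “aggregations with joins” involves, here is the pattern in miniature using Python’s standard library sqlite3 module; the tables and columns are invented for illustration:

```python
# A join feeding an aggregation in one SQL statement -- the workload
# the Birst executive says NoSQL stores are not built for.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west');
    INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 200.0);
""")

rows = db.execute("""
    SELECT c.region, SUM(o.total)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(rows)  # [('east', 150.0), ('west', 200.0)]
```

A key-value store answers “give me order 17” quickly; answering “revenue by region” requires either this kind of engine or a lot of application code.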

Stephen E Arnold, December 22, 2015

Two AI Paths Pondered by Teradata

December 20, 2015

I read the content marketing write up by Karthik Guruswamy. I like the “guru” part of the expert’s name. I am stuck with the “old” part of my name.

The write up is called “Data Science: Machine Learning Vs. Rules Based Systems.” I know a little bit about both of these methods, and I know a teeny tiny bit about Teradata, an outstanding data warehouse solution chugging along with its stock in the high $20s per share. The Google finance chart suggests that the company has some challenges with net income and profit margin to my unlearned eye:

[image: Google Finance chart of Teradata’s share price and financials]

Looks like some content marketing oomph is needed to move that top line number.

I learned in the write up:

Rules based systems will work effectively if all the situations, under which decisions can be made, are known ahead of time.

Okay. Insight. Know everything ahead of time and one can write rules to cover the situation. Is this expensive? Is this a never ending job? Consultants sure hope so.

There is an alternative:

Enter Machine Learning or ML! If we classify the data into good vs. bad data sets or categorize them into different labels like A, B, C, D etc., the Machine Learning algorithms can help build rules on the fly. This step is called training which results in a model. During operationalization, this model is used by the prediction algorithm to classify the incoming data in the right way which in turn leads to sound decision making.

I recall that Autonomy used this approach for its system. Those familiar with Autonomy have some experience with retraining, Bayesian drift, and other exciting facets of machine learning based systems. Consultants love to build new training sets.
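The contrast between the two approaches can be sketched in a few lines of pure Python: a hand-written rule that must anticipate every pattern versus a rule “built on the fly” from labeled examples. The data is invented, and a real system would need far more training material (and the periodic retraining noted above):

```python
def rules_based(message):
    # Every bad pattern must be anticipated in advance.
    return "bad" if "refund" in message or "winner" in message else "good"

def train(examples):
    # Count how often each word appears under each label, then score
    # new messages against those counts -- a crude learned classifier.
    counts = {"good": {}, "bad": {}}
    for text, label in examples:
        for word in text.split():
            counts[label][word] = counts[label].get(word, 0) + 1
    def classify(message):
        scores = {
            label: sum(words.get(w, 0) for w in message.split())
            for label, words in counts.items()
        }
        return max(scores, key=scores.get)
    return classify

examples = [
    ("claim your prize now", "bad"),
    ("free prize inside", "bad"),
    ("meeting moved to noon", "good"),
    ("notes from the meeting", "good"),
]
model = train(examples)

# The learned model handles a phrasing no rule anticipated:
print(rules_based("claim your free prize"))  # 'good' -- the rule misses it
print(model("claim your free prize"))        # 'bad'
```

The trade-off is visible even at toy scale: the rules are cheap but brittle; the model generalizes but lives or dies by its training data.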

The write up asserts:

With Machine Learning, one can iteratively achieve good results by cleansing & prepping the data, changing or combining algorithms or merely tweaking the algorithm parameters. This is becoming much easier thanks to the increased awareness and the availability of different types of data science tools in the market today.

High five.

My view is that the write up left out some information. But there is one omission which warrants a special comment.

Neither of these systems works without human intervention.

Bummer. Reality is sort of a drag, but maybe that’s why Teradata is wrestling with revenue and net profit alligators. Consultants, on the other hand, can bill to enhance either approach.

What about the customer? Well, some customers of brand name data warehouse systems struggle to get data into and out of these whiz bang systems in my experience. Regardless of the craziness involved with Hadoop and Spark, these open source approaches may make more sense than pumping six or seven figures into a proprietary system.

Consultants can still bill, of course. That’s one upside of any approach one wishes to embrace.

Stephen E Arnold, December 20, 2015
