Easy as 1, 2, 3: Common Mistakes Made with Data Lakes
December 15, 2015
The DataInformed article “Avoiding Three Common Pitfalls of Data Lakes” explores three pitfalls that could negate the advantages of data lakes. The article begins with the perks, such as easier data access and, of course, the cost-effectiveness of keeping data in a single hub. The first pitfall is sustainability (or the lack thereof): the article emphasizes that data lakes actually require much more planning and management of data than conventional databases. The second pitfall raised is resource allocation:
“Another common pitfall of implementing data lakes arises when organizations need data scientists, who are notoriously scarce, to generate value from these hubs. Because data lakes store data in their native format, it is common for data scientists to spend as much as 80 percent of their time on basic data preparation. Consequently, many of the enterprise’s most valued resources are dedicated to mundane, time-consuming processes that considerably lengthen time to action on potentially time-sensitive big data.”
The third pitfall is technology contradictions, or trying to use traditional approaches on a data lake that holds both big and unstructured data. Be not alarmed, however; the article goes into great detail about how to avoid these issues through data lake development with smart data technologies such as semantic tech.
Chelsea Kerwin, December 15, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Big Data and Math: Puzzlers Abound
December 14, 2015
Interested in math? Navigate to “Big Data’s Mathematical Mysteries: Machine Learning Works Spectacularly Well, but Mathematicians Aren’t Quite Sure Why.” Yep, I know the certainties of high school math are annoyances when dealing with more sophisticated procedures. Dabble in C* algebras, and you will realize why home economics and general business were appealing.
The point of the write up is that numerical recipes can do their thing on existing and incoming data. If the recipes were trained correctly by a human, some of the niftier systems can learn. Now this is not the same as housebreaking your new Great Dane, but the analogy is close enough for would-be mathematicians.
If a system does not require humans to supervise it, methods exist to explore hidden structures. Think patterns a human cannot perceive.
Here’s the passage I highlighted:
These methods are already leading to interesting and useful results, but many more techniques will be needed. Applied mathematicians have plenty of work to do. And in the face of such challenges, they trust that many of their “purer” colleagues will keep an open mind, follow what is going on, and help discover connections with other existing mathematical frameworks. Or perhaps even build new ones.
The idea is that good enough mathematicians can use numerical procedures and get pretty useful outputs. There you go. No need to fool around with Hilbert spaces.
Stephen E Arnold, December 11, 2015
Bill Legislation Is More Complicated than Sitting on Capitol Hill
December 14, 2015
When I was in civics class back in the day and learning about how a bill became an official law in the United States, my teacher played Schoolhouse Rock’s famous “I’m Just a Bill” song. While that annoying retro earworm still makes the education rounds, the lyrics need to be updated to record some of the new digital “paperwork” that goes into tracking a bill. Engaging Cities focuses on legislation data in “When Lobbyists Write Legislation, This Data Mining Tool Traces The Paper Trail.”
While the process of making a bill into law might seem simple according to Schoolhouse Rock, it is actually complicated, and it gets even crazier as technology pushes more bills through the legislative process. In 2014, there were 70,000 state bills introduced across the country, and no one has the time to read all of them. Technology can do a much better and faster job.
“A prototype tool, presented in September at Bloomberg’s Data for Good Exchange 2015 conference, mines the Sunlight Foundation’s database of more than 500,000 bills and 200,000 resolutions for the 50 states from 2007 to 2015. It also compares them to 1,500 pieces of “model legislation” written by a few lobbying groups that made their work available, such as the conservative group ALEC (American Legislative Exchange Council) and the liberal group the State Innovation Exchange (formerly called ALICE).”
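For readers who want a feel for how bills might be compared to model legislation, here is a minimal Python sketch. It is not the prototype's method, which the article does not spell out. It assumes a simple technique (Jaccard similarity over three-word shingles), and the bill texts, identifiers, and the 0.3 threshold are all made up for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(bills, models, threshold=0.3):
    """Return (bill_id, model_id, score) for each bill whose closest
    model text scores at or above the (hypothetical) threshold."""
    model_shingles = {mid: shingles(text) for mid, text in models.items()}
    results = []
    for bill_id, text in bills.items():
        bill_set = shingles(text)
        scored = [(jaccard(bill_set, ms), mid) for mid, ms in model_shingles.items()]
        if not scored:
            continue
        score, model_id = max(scored)
        if score >= threshold:
            results.append((bill_id, model_id, score))
    return results


# Hypothetical texts, invented for this example only.
MODELS = {
    "model-A": "An employer shall not be required to provide paid sick leave to any employee",
}
BILLS = {
    "bill-1": "An employer shall not be required to provide paid sick leave to any part time employee",
    "bill-2": "The department shall publish an annual report on highway maintenance spending",
}

matches = best_matches(BILLS, MODELS)
for bill_id, model_id, score in matches:
    print(f"{bill_id} resembles {model_id} (similarity {score:.2f})")
```

With hundreds of thousands of bills, a real system would need something faster than pairwise comparison, but the idea of flagging bills that share long runs of wording with a model text is the same.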
A data-mining tool for government legislation would increase government transparency. The software tracks earmarks in bills to show how members of Congress are benefiting their states with these projects. It analyzed earmarks as far back as 1995 and showed that there are more than anyone knew. The goal of the project is to scour the data that the US government makes available and help people interpret it, while also encouraging them to be active within the laws of the land.
The article uses the metaphor “needle in a haystack” to describe all of the government data. Government transparency is good, but when the government overloads people with information, it overwhelms them.
Whitney Grace, December 14, 2015
Short Honk: Space Time for the Couch Potato Physicist
December 11, 2015
Hey, the weekend is almost here with football and thoughts about the joy of the upcoming holidays with family. For a relaxing read, check out “What Is Spacetime, Really?” For fans of network and graph analysis, Stephen Wolfram shares his most recent big idea. In addition to references to his books, which I assume you, gentle reader, have read, Wolfram explains spacetime. If you are a bit rusty on the notion of spacetime, Einstein’s theories, and theoretical physics, don’t worry. The answer is nodes.
Stephen E Arnold, December 11, 2015
Harvard, Lingerie, and Instinctual Decisions
December 10, 2015
I read “Proof Data Can’t Always Help You Make Decisions.” (Note: you may have to jump through some hoops to read this article. If that’s the case, speak with the real journalists at Fortune, not me.)
The write up is by a person who attended Harvard and founded a lingerie company called Adore Me. This seems quite plausible.
The point of the write up is that a person with an advanced degree from the prestigious French university Les Mines, originally École Nationale Supérieure des Mines de Paris, trusted his gut instead of the data.
The write up featured this statement:
I [Morgan Hermand-Waiche] did some heavy researching and everything I found only emphasized what a terrible idea it was to start a lingerie company: A single player whose reign seemed never-ending fully dominated the lingerie market, barriers to entry were sky-high, and every single player that had tried to penetrate the lingerie market had failed—even huge brands with a lot of money and resources like Abercrombie & Fitch and the now-deceased Fredericks of Hollywood. The data clearly pointed out that this was a no-go. As a usually very rational guy, the story should have ended right then because data was telling me to give up. But for the first time in my life, I had a gnawing feeling that didn’t go away. And because that gut feeling went so against the very definition of my data-driven DNA, I knew I just couldn’t ignore it. And so I, a man with no knowledge of lingerie, started a lingerie startup.
Adore Me was a success. No math, no analytics, no Big Data. Just a hunch. I wonder what Louis Paul Cailletet (1832–1913), physicist and inventor, or Georges Charpak (1924–2010), winner of the 1992 Nobel Prize in Physics, would say about this approach.
Harvard, lingerie, hunch—makes sense because Adore Me is “ranked No. 2 on the Inc. 500 list of fastest-growing companies in NYC.”
Stephen E Arnold, December 10, 2015
Medical Search Solved Again
December 10, 2015
I have looked at a wide range of medical information search systems over the years. These range from Medline to the Grateful Med.
I read “A Cure for Medical Researchers’ Big Data Headache.” The Big Data in question is the Medline database. The new search tool is ORiGAMI (I love that wonky upper and lower case thing).
The basic approach involves:
Apollo, a Cray Urika graph computer, possesses massive multithreaded processors and 2 terabytes of shared memory, attributes that allow it to host the entire MEDLINE database and compute multiple pathways on multiple graphs simultaneously. Combined with Helios, CADES’ Cray Urika extreme analytics platform, Sukumar’s team had the cutting-edge hardware needed to process large datasets quickly—about 1,000 times faster than a workstation—and at scale.
And the payoff?
Once the MEDLINE database was brought into the CADES environment, Sreenivas Rangan Sukumar’s [a data scientist at the Department of Energy’s Oak Ridge National Laboratory] team applied advanced graph theory models that implement semantic, statistical, and logical reasoning algorithms to create ORiGAMI. The result is a free online application capable of delivering health insights in less than a second based on the combined knowledge of a worldwide medical community.
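To make “compute multiple pathways on multiple graphs” a bit more concrete, here is a toy Python sketch of one pathway computation: a breadth-first search for the shortest chain of links between two terms. This is an assumption about the general flavor of graph reasoning, not a description of ORiGAMI’s algorithms, and the terms and links below are invented placeholders.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency list from (term, term) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest list of terms linking
    start to goal, or None if no chain of links connects them."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}  # term -> term it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical co-occurrence links, invented for this example only.
EDGES = [
    ("drug X", "protein P"),
    ("protein P", "condition Y"),
    ("drug X", "gene G"),
    ("gene G", "pathway Q"),
    ("pathway Q", "condition Y"),
]

graph = build_graph(EDGES)
path = shortest_path(graph, "drug X", "condition Y")
print(" -> ".join(path))
```

A laptop handles a graph this size instantly. The Cray hardware in the quoted passage matters because the real graph covers the whole MEDLINE database and many such searches run at once.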
My view is that Medline is not particularly big. The analysis of the content pool can generate lots of outputs.
From my vantage point in rural Kentucky, this is another government effort to create a search system. Perhaps this is the breakthrough that will surpass IBM Watson’s medical content capabilities?
Does your local health care provider have access to a Cray computer and the other bits and pieces, including a local version of Dr. Sukumar?
Stephen E Arnold, December 10, 2015
Bing Wants Google Bridge to Fall down, My Dear Lady
December 10, 2015
Microsoft has not given up on Bing yet. While Microsoft’s brand-name search engine has not gained enough traction to take on Google in the United States, the United Kingdom might prove otherwise. The Independent reports on the company’s UK plans in “Rik Van Der Kooi: Microsoft Ups Its Challenge To Google With Big Plans For Bing.” Rik van der Kooi is Microsoft’s global head of search advertising, and he wants to give Bing users a more ambient experience. Microsoft is integrating Bing into more features and applications, such as Microsoft Office, Cortana, Gumtree, Windows 10, and Skype.
Kooi is very eager to introduce Bing into Skype, because it will only benefit users. He says that:
“In the future we are thinking about not artificially pushing it in but maybe putting it in where it’s of use to the user. I could imagine a scenario where if you were either talking with somebody via Skype or chatting via Skype, that providing a search experience inside of Skype is a very valuable experience. And if it’s valuable to the user then we would consider it.”
Google still controls 88 percent of the UK’s search market, but Kooi did not stoop to using insults when he was asked about Google. Instead, he said that Bing and Google have different business approaches. Google is more focused on advertising as a model, which is different from what Bing does. Microsoft has a clear plan for Bing, which includes tapping the strong advertiser demand it already sees and forming partnerships with more UK platforms for quality traffic. Kooi is confident that Bing will continue to gain traction in the UK and the US; its share is already in the double digits.
Whitney Grace, December 10, 2015
Big Data Myths Debunked
December 4, 2015
An abundance of data is not particularly valuable without the ability to draw conclusions from it. Forbes recognizes the value of data analysis in “Text Analytics Gurus Debunk Four Big Data Myths.” Contributor Barbara Thau observes:
“And while retailers have hailed big data as the key to everything from delivering shoppers personalized merchandise offers to real-time metrics on product performance, the industry is mostly scratching its head on how to monetize all the data that’s being generated in the digital era. One point of departure: Over 80% of all information comes in text format, Tom H.C. Anderson, CEO of, which markets its text analytics software to clients such as Coca-Cola told Forbes. So if retailers, for one, ‘aren’t using text analytics in their customer listening, whether they know it or not, they’re not doing too much listening at all,’ he said.”
Anderson and his CTO Chris Lehew went on to outline four data myths they’ve identified (mistakes, really): a misplaced trust in survey scores; putting more weight on social media data than direct contact from customers; valuing data from new sources over the customer-service department’s records; and refusing to keep an eye on what the competition is doing. See the article for the reasons these pros disagree with each of these myths.
Text analytics firm OdinText promises to draw a more accurate understanding from its clients’ data collections, whatever industry they are in. The company received its OdinText patent in 2013, and was incorporated earlier this year.
Cynthia Murrell, December 4, 2015
Big Data? Nope, Bad Data
December 3, 2015
I have been skeptical of surveys generated by mid tier consulting firms in search of customers. I shudder when I read about customer surveys conducted using online Web data collection forms tossed open to anyone who stumbles upon the survey URL. The results are usually a source of amusement. I am sorely tempted to highlight some of these Fred Allen radio show scripts, but I don’t need letters from legal eagles threatening my continued existence here in rural Kentucky.
I did read “This Isn’t ‘Big Data.’ It’s Just Bad Data,” and it seems a handful of other folks share my concern about bogus studies. The write up appeared in a Bloomberg service. I am starting to think that the Bloomberg outfit is as skeptical as I am about the Big Data revolution. In today’s economic environment, a friendly convenience store selling essentials like baloney is a welcome sight to storm-tossed managers.
The write up says:
With response rates that have declined to under 10 percent, public opinion polls are increasingly unreliable. Perhaps even more concerning, though, is that the same phenomenon is hindering surveys used for official government statistics, including the Current Population Survey, the Survey of Income and Program Participation and the American Community Survey. Those data are used for a wide array of economic statistics — for example, the numbers you read in newspapers on unemployment, health insurance coverage, inflation and poverty.
The key point in my opinion is unreliable. Academics are concerned as well. Hey, these folks have their own challenge with the reproducibility of results issue. Oh, well.
The article points out that some Federal survey funds may be allocated elsewhere. Yikes.
My view is that the bad data thing is a growing problem. As self service systems like using Cortana to get business intelligence become more widely available, fewer and fewer folks will worry about the validity of the data upon which the “intelligence” is based.
Is this a problem? Yep. Will the feisty Big Data cheerleaders take action? Nah. Revenue, baby.
Stephen E Arnold, December 3, 2015
Vic Gundotra Restarts His Career
December 2, 2015
Google+ is a social media failure, and its creator Vic Gundotra doesn’t like talking about it. No one can blame him after he created the social media equivalent of the ET Atari videogame, often dubbed the worst videogame in history. According to Mashable in the article “Here’s What You Do After Google+: Start Fresh,” Gundotra left Google and was gun-shy about accepting another job in the technology field. He continued to get daily job offers as he spent over a year traveling and spending time with his family, but he finally decided to focus on his career again by accepting a job with AliveCor.
AliveCor is a health startup that has received FDA approval to use mobile devices to detect heart problems. Gundotra was interested in taking a job with a health technology startup after his father suffered two heart attacks.
“AliveCor, while a big step removed from working on building a social network, nonetheless got him excited because of his interests in machine learning and wearable health. It also appealed to him on a more personal level.”
The health tech startup is proud to announce its new employee, but it does not include Google+ in the list of accomplishments in the press release. Gundotra recognizes he did good work at Google, but that his vision for a social network to compete with Twitter and Facebook was a washout. He’s eager to move on to more fruitful endeavors, especially technology that will make people’s lives better.
Whitney Grace, December 2, 2015

