Easy as 1, 2, 3: Common Mistakes Made with Data Lakes
December 15, 2015
The article titled Avoiding Three Common Pitfalls of Data Lakes on DataInformed explores several pitfalls that could negate the advantages of data lakes. The article begins with the perks, such as easier data access and, of course, the cost-effectiveness of keeping data in a single hub. The first pitfall is sustainability (or the lack thereof), since the article emphasizes that data lakes actually require much more planning and management of data than conventional databases. The second pitfall raised is resource allocation:
“Another common pitfall of implementing data lakes arises when organizations need data scientists, who are notoriously scarce, to generate value from these hubs. Because data lakes store data in their native format, it is common for data scientists to spend as much as 80 percent of their time on basic data preparation. Consequently, many of the enterprise’s most valued resources are dedicated to mundane, time-consuming processes that considerably lengthen time to action on potentially time-sensitive big data.”
The third pitfall is technology contradictions, that is, trying to use traditional approaches on a data lake that holds both big and unstructured data. Be not alarmed, however: the article goes into great detail about how to avoid these issues through data lake development with smart data technologies such as semantic tech.
Chelsea Kerwin, December 15, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Bill Legislation Is More Complicated than Sitting on Capitol Hill
December 14, 2015
When I was in civics class back in the day and learning about how a bill became an official law in the United States, my teacher played Schoolhouse Rock’s famous “I’m Just a Bill” song. While that annoying retro earworm still makes the education rounds, the lyrics need to be updated to record some of the new digital “paperwork” that goes into tracking a bill. Engaging Cities focuses on legislation data in “When Lobbyists Write Legislation, This Data Mining Tool Traces The Paper Trail.”
While the process of making a bill might seem simple according to Schoolhouse Rock, it is actually complicated and becomes even crazier as technology pushes more bills through the legislative process. In 2014, 70,000 state bills were introduced across the country, and no one has the time to read all of them. Technology can do a much better and faster job.
“A prototype tool, presented in September at Bloomberg’s Data for Good Exchange 2015 conference, mines the Sunlight Foundation’s database of more than 500,000 bills and 200,000 resolutions for the 50 states from 2007 to 2015. It also compares them to 1,500 pieces of “model legislation” written by a few lobbying groups that made their work available, such as the conservative group ALEC (American Legislative Exchange Council) and the liberal group the State Innovation Exchange (formerly called ALICE).”
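The quoted passage does not say how the prototype compares bills with model legislation. As a rough illustration only, one common way to spot reused text is to break each document into overlapping word windows ("shingles") and measure the overlap with Jaccard similarity. The function names, the window size `k`, and the sample texts used with it are all assumptions made for this sketch, not details of the actual tool.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_models(bill_text, model_texts, k=5):
    """Rank hypothetical model-legislation texts by shingle overlap with a bill.

    model_texts maps a label to the model text; returns (label, score) pairs,
    highest score first.
    """
    bill = shingles(bill_text, k)
    scores = [(name, jaccard(bill, shingles(text, k)))
              for name, text in model_texts.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_models(bill, {"model_a": ..., "model_b": ...})` returns the labels ordered by how much wording they share with the bill. A real system working across hundreds of thousands of bills would likely need something faster than pairwise comparison.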
A data-mining tool for government legislation would increase government transparency. The software tracks earmarks in the bills to show how the Congressmen are benefiting their states with these projects. It analyzed earmarks as far back as 1995 and showed that there are more than anyone knew. The goal of the project is to scour the data that the US government makes available and help people interpret it, while also encouraging them to be active within the laws of the land.
The article uses the metaphor “needle in a haystack” to describe all of the government data. Government transparency is good, but overloading people with information leaves them overwhelmed.
Whitney Grace, December 14, 2015
Censys Search Engine Used to Blow the Lid off Security Screw-Ups at Dell, Cisco
December 14, 2015
The article on Technology Review intriguingly titled A Search Engine for the Internet’s Dirty Secrets discusses the search engine Censys, which targets security flaws in devices hooked up to the Internet. Censys has already caused some major waves: SEC Consult used it to uncover lazy device encryption methods among high-profile manufacturers such as Cisco and General Electric. The article also provides this revealing anecdote about Censys being used by Duo Security to investigate Dell:
“Dell had to apologize and rush out remediation tools after Duo showed that the company was putting rogue security certificates on its computers that could be used to remotely eavesdrop on a person’s encrypted Web traffic, for example to intercept passwords. Duo used Censys to find that a Kentucky water plant’s control system was affected, and the Department of Homeland Security stepped in.”
Censys harvests data for search using software called ZMap, which was developed by Zakir Durumeric, who also directs the open-source project at the University of Michigan. The article also goes into detail on Censys’s main rival, Shodan. The two use different software, and Shodan is a commercial search engine while Censys is free to use. Additionally, the almighty Google has thrown its weight behind Censys by providing infrastructure.
Chelsea Kerwin, December 14, 2015
Search Data from Bing for 2015 Yields Few Surprises
December 11, 2015
The article on Search Engine Watch titled Bing Reveals the Top US and UK Searches of 2015 covers the extremely intellectual categories of Celebs, News, Sport(s), Music, and Film. Starting with the last category, guess what franchise involving Wookiees and Carrie Fisher took the top place? For Celebrity searches, Taylor Swift took first in the UK, and Caitlyn Jenner in the US, followed closely by Miley Cyrus (and let’s all take a moment to savor the seething rage this data must have caused in Kim Kardashian’s heart). Why does this trivia matter? Ravleen Beeston, UK Sales Director of Bing, is quoted in the article with her two cents:
“Understanding the interests and motivations driving search behaviour online provides invaluable insight for marketers into the audiences they care about. This intelligence allows us to empower marketers to create meaningful connections that deliver more value for both consumers and brands alike. By reflecting back on the key searches over the past 12 months, we can begin to anticipate what will inspire and how to create the right experience in the right context during the year to come.”
Some of the more heartening statistics were related to searches for women’s sports news, which increased from last year. Serena Williams was searched more often than the top five male tennis players combined. And saving the best for last, in spite of the dehumanizing and often racially biased rhetoric we’ve all heard involving Syrian refugees, there was a high volume of searches in the US asking how to provide support and aid for refugees, especially children.
Chelsea Kerwin, December 11, 2015
Google Executives Have a Look but No Touch Rule
December 11, 2015
Have you ever been to a museum where the curator told you to “look, but don’t touch the exhibits”? The phrase comes into play because museums want to protect the integrity of the exhibits and keep them preserved for the ages. One of the draws of these new, modern companies is that employees in different departments are allowed to engage with each other and the higher-ups are available without a hassle. Or at least that is the image they want to project to the public, especially Google. In “Google’s Top Execs Are Always Visible But Almost Never Approachable,” Business Pundit exposes how Google’s top executives interact with their employees: like a museum exhibit.
Larry Page, Sergey Brin, and Sundar Pichai make themselves seen at their Mountain View headquarters, but do not even think about going near them. They are walled off to small talk and random interactions because all of their time is booked.
Company developer advocate Don Dodge wrote in a Quora Q&A that Larry Page, Sergey Brin, and Sundar Pichai are in the no-approach zone. Dodge explains:
“However, that doesn’t mean they are easy to approach and engage in discussion. They are very private and don’t engage in small talk. They are usually very focused on their priorities, and their schedule is always fully booked. Larry is a notoriously fast walker and avoids eye contact with anyone so he can get to his destination without disruption.”
Get Larry a Segway or one of those new “hoverboard” toys, then he will be able to zoom right past everyone or run them over. Add a little horn to warn people to get out of the way.
Whitney Grace, December 11, 2015
Medical Search Solved Again
December 10, 2015
I have looked at a wide range of medical information search systems over the years. These range from Medline to the Grateful Med.
I read “A Cure for Medical Researchers’ Big Data Headache.” The Big Data in question is the Medline database. The new search tool is ORiGAMI (I love that wonky upper and lower case thing).
The basic approach involves:
Apollo, a Cray Urika graph computer, possesses massive multithreaded processors and 2 terabytes of shared memory, attributes that allow it to host the entire MEDLINE database and compute multiple pathways on multiple graphs simultaneously. Combined with Helios, CADES’ Cray Urika extreme analytics platform, Sukumar’s team had the cutting-edge hardware needed to process large datasets quickly—about 1,000 times faster than a workstation—and at scale.
And the payoff?
Once the MEDLINE database was brought into the CADES environment, Sreenivas Rangan Sukumar’s [a data scientist at the Department of Energy’s Oak Ridge National Laboratory] team applied advanced graph theory models that implement semantic, statistical, and logical reasoning algorithms to create ORiGAMI. The result is a free online application capable of delivering health insights in less than a second based on the combined knowledge of a worldwide medical community.
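The write up mentions computing "pathways" on graphs without saying which algorithms are involved. For readers who want a feel for the general idea, here is a minimal sketch, assuming a tiny graph of concepts linked by made-up associations, that finds the shortest chain of links between two terms with a breadth-first search. The node names are placeholders invented for illustration, and this is not a description of how ORiGAMI itself works.

```python
from collections import deque

# Hypothetical toy graph: each key lists the concepts it is linked to.
EXAMPLE_GRAPH = {
    "drug_x": ["protein_y", "symptom_q"],
    "protein_y": ["drug_x", "disease_z"],
    "symptom_q": ["drug_x"],
    "disease_z": ["protein_y"],
}


def shortest_path(graph, start, goal):
    """Return the shortest list of nodes from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}  # maps each visited node to the node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue  # already visited
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the back-pointers from goal to start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None
```

With the toy graph, `shortest_path(EXAMPLE_GRAPH, "drug_x", "disease_z")` walks from the drug through the shared protein to the disease. Doing this over the whole of Medline, with weighted and typed links, is where hardware such as the Cray machines would presumably come in.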
My view is that Medline is not particularly big. The analysis of the content pool can generate lots of outputs.
From my vantage point in rural Kentucky, this is another government effort to create a search system. Perhaps this is the breakthrough that will surpass IBM Watson’s medical content capabilities?
Does your local health care provider have access to a Cray computer and the other bits and pieces, including a local version of Dr. Sukumar?
Stephen E Arnold, December 10, 2015
Metanautix: Big Data Search
December 9, 2015
I read “Ex-Google, Facebook Duo Aim to Simplify Big Data Search.” The idea is that people with Big Data cannot find what is needed to answer a question. The fix may be developed by Metanautix.
Sound familiar?
I have heard this user requirement for, what is it now, 25, 30 years, or more?
According to the write up:
When a company wants to analyze data, typically it first has to input all of that information all into some type of database. Then an engine can be built to bring about answers to any inquires. What Metanautix does, however, is build-in search capabilities for an existing database.
I thought that a number of other firms had developed solutions for Big Data search; for example, Lucidworks. If the article is correct, the fine folks at Lucidworks will have to contend with a competitor that does more than put out marketing assertions.
Stephen E Arnold, December 9, 2015
Understanding Trolls, Spam, and Nasty Content
December 9, 2015
The Internet is full of junk. It is a cold hard fact and one that will never die as long as the Internet exists. The amount of trash content only intensified with the introduction of Facebook, Twitter, Instagram, Pinterest, and other social media platforms, and it keeps pouring onto RSS feeds. The academic community is always up for new studies and capturing new data, so a researcher from the University of Arkansas decided to study mean content. “How ‘Deviant’ Messages Flood Social Media” from Science Daily describes this interesting new idea and carries the following abstract:
“From terrorist propaganda distributed by organizations such as ISIS, to political activism, diverse voices now use social media as their major public platform. Organizations deploy bots — virtual, automated posters — as well as enormous paid “armies” of human posters or trolls, and hacking schemes to overwhelmingly infiltrate the public platform with their message. A professor of information science has been awarded a grant to continue his research that will provide an in-depth understanding of the major propagators of viral, insidious content and the methods that make them successful.”
Dr. Nitin Agarwal will study what behavioral, social, and computational factors cause Internet content to go viral, especially if it has a deviant theme. Deviant means something along the lines of what a troll would post. Agarwal’s research is part of a bigger investigation funded by the Office of Naval Research, Air Force Research, National Science Foundation, and Army Research Office. Agarwal will focus in particular on how terrorist groups and extremist governments use social media platforms to spread their propaganda. He will also study bots that post online content.
Many top brass organizations do not have the faintest idea what some of the top social media platforms even are, much less what their purpose is. A study like this will remove the blinders and teach researchers how social media actually works. I wonder if they will venture into 4chan.
Whitney Grace, December 9, 2015
The Google Cultural Institute Is a Digital Museum
December 8, 2015
Museums are the cultural epicenters of the human race, because they house the highest achievements of art, science, history, and more. The best museums in the world are located in populous cities, and they house priceless works of art that represent the best of what humanity has to offer. The only problem with these museums is that they are in a stationary location, and unless you have the luck to travel, you can’t see these fabulous works in person.
While books have often served as the gateway to museums’ collections, it is not the same as seeing an object or exhibit in real life. The Internet, with continuously evolving photographic and video technology, has replicated museums’ collections in as lifelike a way as possible without your having to leave home. The only problem is that these digital collections are limited to what is within a single museum’s archives, but what would happen if an organization collected all these artifacts in one place, like a social networking Web site?
Google has done something extraordinary by creating the Google Cultural Institute. The Google Cultural Institute is part digital archive, part museum, part Pinterest, and part encyclopedia. It is described as:
“Discover exhibits and collections from museums and archives all around the world. Explore cultural treasures in extraordinary detail, from hidden gems to masterpieces.”
Users can browse collections of art, history, and science ranging from classical works to street art to the Holocaust and World War I. The Google Cultural Institute presents information via slideshows with captions. Collections are divided by subject and content as well as by the museum where the collections originate. Using Google Street View users can also view the very place where the collections are stored. Users can also make their own collections and share them like on Pinterest.
This is an amazing step towards bringing museums into the next stage of their own evolution, as well as allowing people who might not otherwise have the chance to see the collections. The only recommendation is that it would be nice if Google put more advertising behind the Google Cultural Institute so that people actually know it exists.
Whitney Grace, December 8, 2015
Bing Uses Image Search for Recipes
December 8, 2015
Recipe websites have become the modern alternative to traditional cookbooks, but finding the perfect recipe through an Internet search engine can be tedious. LifeHacker informs us that Bing is now using image search technology to help users whittle down the results in “Find Recipes by Image in Bing’s Image Search.” Writer Melanie Pinola describes how it works:
“When you look up ‘baked ziti’ or ‘roast turkey’ or any other food-related term and then go to Bing’s images tab, photos that you can access recipes for will have a chef’s hat icon, along with a count of how many sites use that image. Click on the image to see the recipe(s) related to the image and load them in your browser. You’ll save some time versus click through to every recipe in a long list of search results, especially if you’re thinking of making something that looks a particular way, such as bacon egg cups.”
So remember to use Bing next time you’re hunting for a recipe online. Image search tech continues to improve, and there are many potential worthwhile uses. We wonder what it will be applied to next.
Cynthia Murrell, December 8, 2015