Useful Probability Lesson in Monte Carlo Simulations
April 6, 2015
It is no surprise that probability blogger Count Bayesie, also known as data scientist Will Kurt, likes to play with random data samples like those generated in Monte Carlo simulations. He lets us in on the fun in this useful summary, “6 Neat Tricks with Monte Carlo Simulations.” He begins:
“If there is one trick you should know about probability, it’s how to write a Monte Carlo simulation. If you can program, even just a little, you can write a Monte Carlo simulation. Most of my work is in either R or Python, these examples will all be in R since out-of-the-box R has more tools to run simulations. The basics of a Monte Carlo simulation are simply to model your problem, and then randomly simulate it until you get an answer. The best way to explain is to just run through a bunch of examples, so let’s go!”
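His examples are written in R; as a rough Python sketch of that same "model, then randomly simulate" recipe (approximating Pi is one of the tricks he covers, though the sample count and structure here are ours, not his code):

import random

def estimate_pi(n_samples=1_000_000):
    # Sample points in the unit square; the fraction landing inside
    # the quarter circle of radius 1 approximates Pi / 4.
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / n_samples

print(estimate_pi())  # prints a value close to 3.14159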
And run through his six examples he does, starting with the ever-popular basic integration. Other tricks include approximating the binomial distribution, approximating Pi, finding p-values, creating games of chance, and, of course, predicting the stock market. The examples include code snippets and graphs. Kurt encourages readers to go further:
“By now it should be clear that a few lines of R can create extremely good estimates to a whole host of problems in probability and statistics. There comes a point in problems involving probability where we are often left no other choice than to use a Monte Carlo simulation. This is just the beginning of the incredible things that can be done with some extraordinarily simple tools. It also turns out that Monte Carlo simulations are at the heart of many forms of Bayesian inference.”
See the write-up for the juicy details of the six examples. This fun and informative lesson is worth checking out.
Cynthia Murrell, April 6, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
Apache Sparking Big Data
April 3, 2015
Apache Spark is an open source cluster computing framework that rivals MapReduce. Venture Beat says that people did not pay much attention to Apache Spark when it was first developed at UC Berkeley's AMPLab in 2011. The article, "How An Early Bet On Apache Spark Paid Off Big," reports that big data open source supporters are adopting Apache Spark because of its superior capabilities.
People with big data plans want systems that process real-time information at a fast pace, and they want a whole lot of it done at once. MapReduce can do this, but it was not designed for it. It is all right for batch processing, but it is slow and much too complex to be a viable solution.
“When we saw Spark in action at the AMPLab, it was architecturally everything we hoped it would be: distributed, in-memory data processing speed at scale. We recognized we’d have to fill in holes and make it commercially viable for mainstream analytics use cases that demand fast time-to-insight on hordes of data. By partnering with AMPLab, we dug in, prototyped the solution, and added the second pillar needed for next-generation data analytics, a simple to use front-end application.”
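For readers who have not seen Spark's in-memory model in code, a minimal PySpark sketch may help; it assumes a local Spark installation, and the data and operations are illustrative rather than taken from the article:

from pyspark import SparkContext

sc = SparkContext("local[*]", "in-memory-demo")

# Load once and cache in memory, then reuse the data across several
# computations without re-reading it between steps.
numbers = sc.parallelize(range(1_000_000)).cache()

total = numbers.reduce(lambda a, b: a + b)
evens = numbers.filter(lambda x: x % 2 == 0).count()

print(total, evens)
sc.stop()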
ClearStory Data was built on Apache Spark to access data quickly, deliver key insights, and make the UI user friendly. People who use Apache Spark want immediate information, drawn from a variety of sources, that can be put to profitable use. Apache Spark might ignite the fire for the next wave of big data analytics.
Whitney Grace, April 3, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
AI Technology Poised to Spread Far and Wide
April 3, 2015
Artificial intelligence is having a moment; the second half of last year saw about half a billion dollars invested in the AI industry. Wired asks and answers, “The AI Resurgence: Why Now?” Writer Babak Hodjat observes that advances in hardware and cloud services have allowed more contenders to afford to enter the arena. Open source tools like Hadoop also help. Then there’s public perception; with the proliferation of Siri and her ilk, people are more comfortable with the whole concept of AI (Steve Wozniak aside, apparently). It seems to help that these natural-language personal assistants have a sense of humor. Hodjat continues:
“But there’s more substance to this resurgence than the impression of intelligence that Siri’s jocularity gives its users. The recent advances in Machine Learning are truly groundbreaking. Artificial Neural Networks (deep learning computer systems that mimic the human brain) are now scaled to several tens of hidden layer nodes, increasing their abstraction power. They can be trained on tens of thousands of cores, speeding up the process of developing generalizing learning models. Other mainstream classification approaches, such as Random Forest classification, have been scaled to run on very large numbers of compute nodes, enabling the tackling of ever more ambitious problems on larger and larger data-sets (e.g., Wise.io).”
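As a small, generic illustration of the Random Forest point (not tied to Wise.io or any particular vendor), scikit-learn's RandomForestClassifier can already spread tree training across local cores; the synthetic data below is purely hypothetical:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for a real training set.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# n_jobs=-1 trains the trees in parallel on all available cores,
# a small-scale analogue of scaling a forest across compute nodes.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))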
The investment boom has produced a surge of start-ups offering AI solutions to companies in a wide range of industries. Organizations in fields as diverse as medicine and oil production seem eager to incorporate these tools; it remains to be seen whether the tech is a good investment for every type of enterprise. For his part, Hodjat has high hopes for its use in fraud detection, medical diagnostics, and online commerce. And for ever-improving personal assistants, of course.
Cynthia Murrell, April 3, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
EBay Develops Open Source Pulsar for Real Time Data Analysis
April 2, 2015
A new large-scale, real-time analytics platform has been launched in response to one huge company’s huge data needs. VentureBeat reports, “EBay Launches Pulsar, an Open-Source Tool for Quickly Taming Big Data.” EBay has made the code available under an open-source license. It seems traditional batch processing systems, like that found in the widely used open-source Hadoop, just won’t cut it for eBay. That puts them in good company; Google, Microsoft, Twitter, and LinkedIn have each also created their own stream-processing systems.
Shortly before the launch, eBay released a whitepaper on the project, “Pulsar—Real-time Analytics at Scale.” It describes the what and why behind Pulsar’s design; check it out for the technical details. The whitepaper summarizes itself:
“In this paper we have described the data and processing model for a class of problems related to user behavior analytics in real time. We describe some of the design considerations for Pulsar. Pulsar has been in production in the eBay cloud for over a year. We process hundreds of thousands of events/sec with a steady state loss of less than 0.01%. Our pipeline end to end latency is less than a hundred milliseconds measured at the 95th percentile. We have successfully operated the pipeline over this time at 99.99% availability. Several teams within eBay have successfully built solutions leveraging our platform, solving problems like in-session personalization, advertising, internet marketing, billing, business monitoring and many more.”
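Pulsar's actual APIs are covered in the whitepaper; purely as a toy illustration of the kind of windowed, real-time aggregation such a pipeline performs (this is not Pulsar code), consider a sliding-window event counter:

from collections import deque
import time

class SlidingWindowCounter:
    # Count the events seen in the last window_seconds. This only
    # illustrates windowed aggregation; Pulsar's real pipeline is
    # distributed and far more capable.
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()

    def record(self, timestamp=None):
        now = time.time() if timestamp is None else timestamp
        self.events.append(now)
        self._expire(now)

    def count(self):
        self._expire(time.time())
        return len(self.events)

    def _expire(self, now):
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()

counter = SlidingWindowCounter(window_seconds=60)
counter.record()
print(counter.count())  # number of events recorded in the last minute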
For updated information on Pulsar, monitor the project's official website at gopulsar.io.
Cynthia Murrell, April 2, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
A Little Lucene History
March 26, 2015
Instead of venturing to Wikipedia to learn about Lucene’s history, visit the Parse.ly blog and read the post, “Lucene: The Good Parts.” After detailing how Doug Cutting created Lucene in 1999, the post describes how searching through SQL in the early 2000s was a huge task. SQL databases are not the best when it comes to unstructured search, so developers installed Lucene to make SQL document search more reliable. What is interesting is how much it has been adopted:
“At the time, Solr and Elasticsearch didn’t yet exist. Solr would be released in one year by the team at CNET. With that release would come a very important application of Lucene: faceted search. Elasticsearch would take another 5 years to be released. With its recent releases, it has brought another important application of Lucene to the world: aggregations. Over the last decade, the Solr and Elasticsearch packages have brought Lucene to a much wider community. Solr and Elasticsearch are now being considered alongside data stores like MongoDB and Cassandra, and people are genuinely confused by the differences.”
If you need a refresher, the post offers a brief overview of how Lucene works, related jargon, tips for using it in big data projects, and a few more tricks. Lucene might just be a Java library, but it makes using databases much easier. We have said for a while that information is only useful if you can find it easily. Lucene made information search and retrieval simpler and more accurate, and it laid the groundwork for the current big data boom.
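For a sense of what sits at Lucene's core, here is a toy Python sketch of an inverted index; it is not Lucene's API, only an illustration of the data structure that makes fast term lookups possible:

from collections import defaultdict

# A toy inverted index, the core data structure behind Lucene. This is
# not Lucene's API, only a sketch of why term lookups beat scanning
# rows with SQL LIKE queries.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = {
    1: "Lucene is a Java search library",
    2: "SQL handles structured data well",
    3: "full text search needs an inverted index",
}
index = build_index(docs)
print(index["search"])  # {1, 3}: the documents containing the term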
Whitney Grace, March 26, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
SAS Text Miner Provides Valuable Predictive Analytics
March 25, 2015
If you are searching for predictive analytics software that provides in-depth text analysis with advanced linguistic capabilities, you may want to check out "SAS Text Miner." Predictive Analytics Today runs down the features of SAS Text Miner and details how it works.
It is user-friendly software with data visualization, flexible entity options, document theme discovery, and more.
“The text analytics software provides supervised, unsupervised, and semi-supervised methods to discover previously unknown patterns in document collections. It structures data in a numeric representation so that it can be included in advanced analytics, such as predictive analysis, data mining, and forecasting. This version also includes insightful reports describing the results from the rule generator node, providing clarity to model training and validation results.”
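The "numeric representation" step the quote mentions is the standard text mining move of vectorizing documents; as a generic illustration (scikit-learn TF-IDF with made-up documents, not SAS Text Miner's internals):

from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorizing free text into a numeric matrix is the step the quote
# describes; TF-IDF is one common way to do it. The documents here
# are invented, and this is not SAS Text Miner's internal method.
docs = [
    "predictive analytics on customer feedback",
    "text mining finds themes in document collections",
    "forecasting demand from historical sales data",
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(X.shape)  # (3 documents, number of distinct terms)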
SAS Text Miner also includes automatic Boolean rule generation to categorize documents, and the generated rules can be exported as Boolean rules. Data sets can be built from a directory or crawled from the Web. The visual analysis feature highlights the relationships between discovered patterns and displays them in a concept link diagram. SAS Text Miner has received high praise as predictive analytics software, and it might be the solution your company is looking for.
Whitney Grace, March 25, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
Modus Operandi Gets a Big Data Storage Contract
March 24, 2015
The US Missile Defense Agency awarded Modus Operandi a huge government contract to develop an advanced data storage and retrieval system for the Ballistic Missile Defense System. Modus Operandi specializes in big data analytic solutions for national security and commercial organizations. Modus Operandi posted a press release on their Web site to share the news, “Modus Operandi Awarded Contract To Develop Advanced Data Storage And Retrieval System For The US Missile Defense Agency.”
The contract is a Phase I Small Business Innovation Research (SBIR) award, under which Modus Operandi will work on the BMDS Analytic Semantic System (BASS). BASS will replace the old legacy system and bring it up to date with data from social media communities, the Internet, and intelligence sources.
“ ‘There has been a lot of work in the areas of big data and analytics across many domains, and we can now apply some of those newer technologies and techniques to traditional legacy systems such as what the MDA is using,’ said Dr. Eric Little, vice president and chief scientist, Modus Operandi. ‘This approach will provide an unprecedented set of capabilities for the MDA’s data analysts to explore enormous simulation datasets and gain a dramatically better understanding of what the data actually means.’ ”
It is worrisome that the missile defense system is relying on an old legacy system, but at least it is being upgraded now. Modus Operandi also sells cyber OSINT technology, and it is applying this technology in an interesting way for the government.
Whitney Grace, March 24, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
SharePoint’s Evolution of Ease
March 24, 2015
At SharePoint’s beginning, users and managers viewed it as a framework. It is often still referred to as an installation, and many third party vendors do quite well offering add-on options to flesh out the solution. However, due to users’ expectations, SharePoint is shifting its focus to accommodate quick and full implementation without a lengthy build-out. Read more in the CMS Wire article, “From Build It and Go, to Ready to Go with SharePoint.”
The article sums up the transformation:
“We hunger for solutions that can be quickly acquired and implemented, not ones that require building out complex and robust solutions. The world around us is changing fast and it’s exciting to see how productivity tools are beginning to encompass almost every area of our lives. The evolution not only impacts new tools and products, but also the tools we have been using all along. In SharePoint, we can see this in the addition of Experiences and NextGen Portals.”
SharePoint 2016 is on its way, and additional information will leak out over the coming months. Keep an eye on ArnoldIT.com for breaking news and the latest releases. Stephen E. Arnold has made a career out of all things search, including enterprise and SharePoint, and his dedicated SharePoint feed is a great resource for professionals who need to keep up without a huge investment in research time.
Emily Rae Aldridge, March 24, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
Data and Marketing Come Together for a Story
March 23, 2015
An article on the Marketing Experiments Blog titled "Digital Analytics: How To Use Data To Tell Your Marketing Story" explains the primacy of the story in the world of data. The conveyance of the story, the article claims, should be a collaboration between the marketer and the analyst, with both players working together to create an engaging and data-supported story. The article suggests breaking this story into several parts, similar to the plot points you might study in a creative writing class: Exposition, Rising Action, Climax, Denouement, and Resolution. The article states,
“Nate [Silver] maintained throughout his speech that marketers need to be able to tell a story with data or it is useless. In order to use your data properly, you must know what the narrative should be…I see data reporting and interpretation as an art, very similar to storytelling. However, data analysts are too often siloed. We have to understand that no one writes in a bubble, and marketing teams should understand the value and perspective data can bring to a story.”
Silver, Founder and Editor in Chief of FiveThirtyEight.com, is also quoted in the article from his talk at the Adobe Summit Digital Marketing Conference. He said, “Just because you can’t measure it, doesn’t mean it’s not important.” This is the back-to-basics approach that companies need to consider.
Chelsea Kerwin, March 23, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
Apache Samza Revamps Databases
March 19, 2015
Databases have advanced far beyond the basic relational model. They need to be consistently managed and updated in real time to remain useful. The Apache Software Foundation developed Apache Samza to help maintain asynchronous stream processing networks. Samza was made in conjunction with Apache Kafka.
If you are interested in learning how to use Apache Samza, the Confluent blog posted "Turning The Database Inside-Out With Apache Samza" by Martin Kleppmann. Kleppmann recorded a seminar he gave at Strange Loop 2014 that explains how the approach can improve many features of a database:
“This talk introduces Apache Samza, a distributed stream processing framework developed at LinkedIn. At first it looks like yet another tool for computing real-time analytics, but it’s more than that. Really it’s a surreptitious attempt to take the database architecture we know, and turn it inside out. At its core is a distributed, durable commit log, implemented by Apache Kafka. Layered on top are simple but powerful tools for joining streams and managing large amounts of data reliably.”
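Samza itself is a Java framework layered on Kafka; as a toy illustration of Kleppmann's "inside out" idea, deriving a queryable view from an append-only log of change events, here is a minimal Python sketch with an invented event format:

# A toy sketch of Kleppmann's "inside out" idea: treat an append-only
# log of change events as the source of truth and derive a queryable
# view from it. Samza itself is a Java framework on Kafka; this
# pure-Python snippet and its event format are invented to show the
# concept only.
def apply_event(view, event):
    # Fold one change event from the log into the materialized view.
    op, key, value = event
    if op == "put":
        view[key] = value
    elif op == "delete":
        view.pop(key, None)
    return view

commit_log = [
    ("put", "user:1", {"name": "Ada"}),
    ("put", "user:2", {"name": "Grace"}),
    ("delete", "user:1", None),
]

view = {}
for event in commit_log:
    apply_event(view, event)

print(view)  # {'user:2': {'name': 'Grace'}}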
Learning new ways to improve database features and functionality always improves your skill set. Apache software also forms the basis for many open source projects and startups. Martin Kleppmann's talk might give you a brand new idea, or at least improve your database.
Whitney Grace, March 19, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com

