Topology Is Finally on Top
December 21, 2015
Topology’s time has finally come, according to “The Unreasonable Usefulness of Imagining You Live in a Rubbery World,” shared by 3 Quarks Daily. The engaging article reminds us that the field of topology emphasizes connections over geometric factors like distance and direction. Think of a subway map as compared to a street map; or, as writer Jonathan Kujawa describes:
“Topologists ask a question which at first sounds ridiculous: ‘What can you say about the shape of an object if you have no concern for lengths, angles, areas, or volumes?’ They imagine a world where everything is made of silly putty. You can bend, stretch, and distort objects as much as you like. What is forbidden is cutting and gluing. Otherwise pretty much anything goes.”
Since the beginning, this perspective has been dismissed by many as purely academic. However, today’s era of networks and big data has boosted the field’s usefulness. The article observes:
“A remarkable new application of topology has emerged in the last few years. Gunnar Carlsson is a mathematician at Stanford who uses topology to extract meaningful information from large data sets. He and others invented a new field of mathematics called Topological data analysis. They use the tools of topology to wrangle huge data sets. In addition to the networks mentioned above, Big Data has given us Brobdinagian sized data sets in which, for example, we would like to be able to identify clusters. We might be able to visually identify clusters if the data points depend on only one or two variables so that they can be drawn in two or three dimensions.”
Kujawa goes on to note that one century-old tool of topology, homology, is being used to analyze real-world data, like the ways diabetes patients have responded to a specific medication. See the well-illustrated article for further discussion.
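To make the cluster idea concrete, here is a minimal sketch of the simplest homological count: the number of connected components that remain when points closer than a chosen threshold are joined. This is not Carlsson's method or the API of any topological data analysis library; the function name `count_components`, the sample points, and the thresholds are all illustrative assumptions.

```python
from itertools import combinations
from math import dist


def count_components(points, threshold):
    """Count connected components when points within `threshold` are linked.

    This is the zeroth Betti number of the neighborhood graph at one scale.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= threshold:
            # Merge the two components that contain points i and j.
            parent[find(i)] = find(j)

    return len({find(i) for i in range(len(points))})


# Hypothetical data: two tight groups that sit far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]

for threshold in (0.05, 0.5, 10.0):
    print(threshold, count_components(points, threshold))
```

On this toy data the count falls from five components to two and finally to one as the threshold grows. The count that survives across a wide range of thresholds (two, here) is the kind of feature a topological approach would treat as signal rather than noise.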
Cynthia Murrell, December 21, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
The Modern Law Firm and Data
December 16, 2015
We thought it was a problem when law enforcement officials did not understand how the Internet and the Dark Web work, or what eDiscovery tools can do. A law firm that does not know how to work with data-mining tools, much less grasp the importance of technology, is losing credibility, profit, and evidence for its cases. According to Information Week in “Data, Lawyers, And IT: How They’re Connected,” the modern law firm needs to be aware of how eDiscovery tools, predictive coding, and data science work and how they can benefit its cases.
It can be daunting trying to understand how new technology works, especially in a law firm. The article explains how the above tools and more work in four key segments: what role data plays before trial, how it is changing the courtroom, how new tools pave the way for unprecedented approaches to law practice, and how data is improving the way law firms operate.
Data in pretrial amounts to one word: evidence. People live their lives via their computers and create a digital trail without realizing it. With a few eDiscovery tools, lawyers can assemble all necessary information within hours. Data tools in the courtroom make practicing law seem like a scenario out of a fantasy or science fiction novel. Lawyers are able to immediately pull up information to use as evidence for cross-examination or to validate facts. New eDiscovery tools are also good to use because they allow lawyers to prepare their arguments based on the judge and jury pool. More data is available on individual cases rather than just big name ones.
“The legal industry has historically been a technology laggard, but it is evolving rapidly to meet the requirements of a data-intensive world.
‘Years ago, document review was done by hand. Metadata didn’t exist. You didn’t know when a document was created, who authored it, or who changed it. eDiscovery and computers have made dealing with massive amounts of data easier,’ said Robb Helt, director of trial technology at Suann Ingle Associates.”
Legal eDiscovery is one of the main branches of big data that has skyrocketed in the past decade. While the examples discussed here are employed by respected law firms, keep in mind that eDiscovery technology is still new. Ambulance chasers and other law firms probably do not have a full IT squad on staff, so when vetting lawyers, ask about their eDiscovery capabilities.
Whitney Grace, December 16, 2015
Content Matching Helps Police Bust Dark Web Sex Trafficking Ring
September 4, 2015
The Dark Web is used not only to buy and sell illegal drugs but also to perpetuate sex trafficking, especially of children. The work of law enforcement agencies to prevent the abuse of sex trafficking victims is detailed in a report by the Australian Broadcasting Corporation called “Secret ‘Dark Net’ Operation Saves Scores Of Children From Abuse; Ringleader Shannon McCoole Behind Bars After Police Take Over Child Porn Site.” For ten months, Argos, the Queensland police anti-pedophile taskforce, tracked usage on an Internet bulletin board with 45,000 members who viewed and uploaded child pornography.
The Dark Web is notorious for encrypting user information, and that is one of its main draws, because users can conduct business or illegal activities, such as viewing child pornography, without fear of retribution. Even the Dark Web, however, leaves a digital trail, and Argos was able to track down the Web site’s administrator. It turned out the administrator was an Australian childcare worker, who has been sentenced to 35 years in jail for sexually abusing seven children in his care and sharing child pornography.
Argos was able to catch the perpetrator by noticing patterns in his language usage in posts he made to the bulletin board (he used the greeting “hiya”). Using advanced search techniques, the police sifted through results and narrowed them down to a Facebook page and a photograph. From the Facebook page, they got the administrator’s name and made an arrest.
After arresting the ringleader, Argos took over the community and started to track down the rest of the users.
“‘Phase two was to take over the network, assume control of the network, try to identify as many of the key administrators as we could and remove them,’ Detective Inspector Jon Rouse said. ‘Ultimately, you had a child sex offender network that was being administered by police.’”
When they took over the network, the police were required to work in real-time to interact with the users and gather information to make arrests.
Even though the Queensland police were able to end one Dark Web child pornography ring and save many children from abuse, there are still many Dark Web sites centered on child sex trafficking.
Whitney Grace, September 4, 2015
Chinese Opinion Monitoring Software by Knowlesys
August 18, 2015
Ever wonder what tools the Chinese government uses to keep track of those pesky opinions voiced by its citizens? If so, take a look at “IPOMS : Chinese Internet Public Opinion Monitoring System” at Revolution News. The brief write-up tells us about a software company, Knowlesys, reportedly supplying such software to China (among other clients). Reporter and Revolution News founder Jennifer Baker tells us:
“Knowlesys’ system can collect web pages with some certain key words from Internet news, topics on forum and BBS, and then cluster these web pages according to different ‘event’ groups. Furthermore, this system provides the function of automatically tracking the progress of one event. With this system, supervisors can know what is exactly happening and what has happened from different views, which can improve their work efficiency a lot. Most of time, the supervisor is the government, the evil government. sometimes a company uses the system to collect information for its products. IPOMS is composed of web crawler, html parser and topic detection and tracking tool.”
The piece includes a diagram that lays out the software’s process, from extraction to analysis to presentation (though the specifics are pretty standard to anyone familiar with data analysis in general). Data monitoring and mining firm Knowlesys was founded in 2003. The company has offices in Hong Kong and a development center in Shenzhen, China.
Cynthia Murrell, August 18, 2015
Plethora of Image Information
July 24, 2015
Humans are visual creatures and they learn and absorb information better when pictures accompany it. In recent years, the graphic novel medium has gained popularity amongst all demographics. The amount of information a picture can communicate is astounding, but unless it is looked for it can be hard to find. It also cannot be searched by a search engine…or can it? Synaptica is in the process of developing the “OASIS Deep Image Indexing Using Linked Data.”
OASIS is an acronym for Open Annotation Semantic Imaging System, an application that unlocks image content by letting users examine an image more closely than before and by highlighting data points. OASIS is a linked data application that enables parts of an image to be identified as linked data URIs, which can then be semantically indexed to controlled vocabulary lists. It builds an interactive map of an image with its features and conceptual ideas.
“With OASIS you will be able to pan-and-zoom effortlessly through high definition images and see points of interest highlight dynamically in response to your interaction. Points of interest will be presented along with contextual links to associated images, concepts, documents and external Linked Data resources. Faceted discovery tools allow users to search and browse annotations and concepts and click through to view related images or specific features within an image. OASIS enhances the ability to communicate information with impactful visual + audio + textual complements.”
OASIS is advertised as a discovery and interactive tool that gives users the chance to fully engage with an image. It can be applied to any field or industry, which might mean the difference between success and failure. People want to fully immerse themselves in their data or images these days. Being able to do so on a much richer scale is the future.
Whitney Grace, July 24, 2015
Hadoop Rounds Up Open Source Goodies
July 17, 2015
Summer time is here and what better way to celebrate the warm weather and fun in the sun than with some fantastic open source tools. Okay, so you probably will not take your computer to the beach, but if you have a vacation planned one of these tools might help you complete your work faster so you can get closer to that umbrella and cocktail. Datamation has a great listicle focused on “Hadoop And Big Data: 60 Top Open Source Tools.”
Hadoop is one of the most widely adopted open source tools for big data solutions. The Hadoop market is expected to be worth $1 billion by 2020, and IBM has dedicated 3,500 employees to develop Apache Spark, part of the Hadoop ecosystem.
As open source is a huge part of the Hadoop landscape, Datamation’s list provides invaluable information on tools that could mean the difference between a successful project and failed one. Also they could save some extra cash on the IT budget.
“This area has a seen a lot of activity recently, with the launch of many new projects. Many of the most noteworthy projects are managed by the Apache Foundation and are closely related to Hadoop.”
Datamation has maintained this list for a while and updates it from time to time as the industry changes. The list is not ranked from best to worst; rather, the tools are grouped into categories, and a short description explains what each tool does. The categories include: Hadoop-related tools, big data analysis platforms and tools, databases and data warehouses, business intelligence, data mining, big data search, programming languages, query engines, and in-memory technology. There is a tool for nearly every sort of problem that could come up in a Hadoop environment, so the listicle is definitely worth a glance.
Whitney Grace, July 17, 2015
Search Companies: Innovative or Not?
June 11, 2015
Forbes’ article “The 50 Most Innovative Companies Of 2014: Strong Innovators Are Three Times More Likely To Rely on Big Data Analytics” points out how innovation is strongly tied to big data analytics and data mining these days. The Boston Consulting Group (BCG) studies the methodology of innovation. The numbers are astounding when companies that use big data are placed against those who still have not figured out how to use their data: 57% vs. 19%.
Innovation, however, is not entirely defined by big data. Most of the companies that rely on big data as key to their innovation are software companies. Forbes’ study found that 53% see big data as having a huge impact in the future, while BCG found that only 41% saw big data as vital to their innovation.
Big data cannot be and should not be ignored. Forbes and BCG found that big data analytics are useful and can have huge turnouts:
“BCG also found that big-data leaders generate 12% higher revenues than those who do not experiment and attempt to gain value from big data analytics. Companies adopting big data analytics are twice as likely as their peers (81% versus 41%) to credit big data for making them more innovative.”
Measuring innovation proves to be subjective, but one cannot deny the positive effect big data analytics and data mining can have on a company. You have to realize, though, that big data results are useless without a plan to implement and use the data. Also take note that none of the major search vendors are considered “innovative,” when a huge part of big data involves searching for results.
Whitney Grace, June 11, 2015
Prepare To Update Your Cassandra
June 2, 2015
It is time for an update to Apache’s headlining, open source, enterprise search software! SD Times lets us know that “DataStax Enterprise 4.7 Released” and it has a slew of updates set to make open source search enthusiasts drool. DataStax is a company that built itself around the open source Apache Cassandra software. The company specializes in enterprise applications for search and analytics.
The newest release of DataStax Enterprise 4.7 includes several updates to improve a user’s enterprise experience:
“…includes a production-certified version of Cassandra 2.1, and it adds enhanced enterprise search, analytics, security, in-memory, and database monitoring capabilities. These include a new certified version of Apache Solr and Live Indexing, a new DSE feature that makes data immediately available for search by leveraging Cassandra’s native ability to run across multiple data centers.”
The update also includes DataStax’s OpsCenter 5.2 for enhanced security and encryption. It can be used to store encryption keys on servers and to manage admin security.
The enhanced search capabilities are the real bragging points: fault-tolerant search operations (used to customize failed search responses), intelligent search query routing (queries are routed to the fastest machines in a cluster for the quickest response times), and extended search analytics (using Solr search syntax and Apache Spark, research and analytics tasks can run simultaneously).
DataStax Enterprise 4.7 improves enterprise search applications. It will probably pull in users trying to improve their big data plans. Has DataStax considered how its enterprise platform could be used for the cloud or on mobile computing?
Whitney Grace, June 2, 2015
Is Collaboration the Key to Big Data Progress?
May 22, 2015
The article titled “Big Data Must Haves: Capacity, Compute, Collaboration” on GCN offers insights into the best areas of focus for big data researchers. The Internet2 Global Summit is in D.C. this year with many exciting panelists who support the emphasis on collaboration in particular. The article mentions the work being presented by several people, including Clemson professor Alex Feltus:
“…his research team is leveraging the Internet2 infrastructure, including its Advanced Layer 2 Service high-speed connections and perfSONAR network monitoring, to substantially accelerate genomic big data transfers and transform researcher collaboration…Arizona State University, which recently got 100 gigabit/sec connections to Internet2, has developed the Next Generation Cyber Capability, or NGCC, to respond to big data challenges. The NGCC integrates big data platforms and traditional supercomputing technologies with software-defined networking, high-speed interconnects and visualization for medical research.”
Arizona’s NGCC provides the essence of the article’s claims, stressing capacity with Internet2, several types of computing, and of course collaboration between everyone at work on the system. Feltus commented on the importance of cooperation in Arizona State’s work, suggesting that personal relationships outweigh individual successes. He claims his own teamwork with network and storage researchers helped him find new potential avenues of innovation that might not have occurred to him without thoughtful collaboration.
Chelsea Kerwin, May 22, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
Data Mining Algorithms Explained
May 18, 2015
In plain English too. Navigate to “Top 10 Data Mining Algorithms in Plain English.” When you fire up an enterprise content processing system, the algorithms beneath the user experience layer are chestnuts. Universities do a good job of teaching students about some reliable methods to perform data operations. In fact, the universities do such a good job that most content processing systems include almost the same old chestnuts in their solutions. The decision to use some or all of the top 10 data mining algorithms has some interesting consequences, but you will have to attend one of my lectures about the weaknesses of these numerical recipes to get some details.
The write up is worth a read. The article includes a link to information which underscores the ubiquitous nature of these methods. This is the Xindong Wu et al. write up “Top 10 Algorithms in Data Mining.” Our research reveals that dependence on these methods is more widespread now than it was seven years ago when the paper first appeared.
The implication then and now is that content processing systems are more alike than different. The use of similar methods means that the differences among some systems are essentially cosmetic. There is a flub in the paper. I am confident that you, gentle reader, will spot it easily.
Now to the “made simple” write up. The article explains quite clearly the what and why of 10 widely used methods. The article also identifies some of the weaknesses of each method. If there is a weakness, do you think it can be exploited? This is a question worth considering I suggest.
Example: What is a weakness of k-means?
Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to the initial choice of centroids. One final thing to keep in mind is k-means is designed to operate on continuous data — you’ll need to do some tricks to get it to work on discrete data.
Note the key word “tricks.” When one deals with math, the way to solve problems is to be clever. It follows that some of the differences among content processing systems boil down to the cleverness of the folks working on a particular implementation. Think back to your high school math class. Was there a student who just spit out an answer and then said, “It’s obvious”? Well, that’s the type of cleverness I am referencing.
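The outlier sensitivity named in the quoted passage is easy to see on a toy example. The sketch below is a bare-bones, one-dimensional version of the standard k-means loop (assign each value to the nearest centroid, then move each centroid to the mean of its values). It is not taken from the article; the function name `kmeans_1d`, the data, and the starting centroids are illustrative assumptions.

```python
def kmeans_1d(values, centroids, iterations=20):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda k: abs(v - centroids[k]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster,
        # leaving a centroid in place if its cluster came up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[k]
            for k, c in enumerate(clusters)
        ]
    return sorted(centroids)


# Hypothetical data: two obvious groups, around 2 and around 12.
clean = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
with_outlier = clean + [100.0]

print(kmeans_1d(clean, [1.0, 13.0]))
print(kmeans_1d(with_outlier, [1.0, 13.0]))
```

With the clean data the centroids settle at 2 and 12. Add the single value 100 and, from the same starting point, the centroids end up at 7 and 100: the outlier captures one centroid for itself and the two genuine groups are merged into one.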
The author does not dig too deeply into PageRank, but it too has some flaws. An easy way to identify one is to attend a search engine optimization conference. One flaw turbocharges these events.
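One commonly cited weakness, which may or may not be the one hinted at here, is that a page's score can be inflated simply by creating more pages that link to it. The sketch below is a textbook-style power iteration, not the implementation any search engine uses; the function name `pagerank`, the damping value of 0.85, and the tiny link graphs are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict mapping page -> list of outlinks."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over every page.
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical web: "target" links out but nothing links to it.
base_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "target": ["a"]}

# Same web plus five made-up pages whose only purpose is to link to "target".
farm_links = dict(base_links)
for n in range(5):
    farm_links[f"farm{n}"] = ["target"]

base = pagerank(base_links)
boosted = pagerank(farm_links)
print(round(base["target"], 4), round(boosted["target"], 4))
```

In this toy graph the target's score more than doubles once the five extra pages exist, even though nothing about the target itself has changed.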
My relative Vladimir Arnold, whom some of the Arnolds called Vlad the Annoyer, would have liked the paper. So do I. The write up is a keeper. Plus there is a video, perfect for the folks whose attention span is better than a goldfish’s.
Stephen E Arnold, May 18, 2015