Statistics, Statistics. Disappointing Indeed

February 16, 2015

At dinner on Saturday evening, a medical research professional mentioned that reproducing results from tests conducted in the researcher’s lab was tough. I think the buzzword for this is “non-reproducibility.” Someone asked, “Perhaps the research is essentially random?” There were some furrowed brows. My reaction was, “How does one know what’s what with experiments, data, or reproducibility tests?” The table talk shifted to a discussion of Saturday Night Live’s 40th anniversary. Safer ground.

Navigate to “Science’s Significant Stat Problem.” The article makes clear that 2013 thinking may have some relevance today. Here’s a passage I highlighted in pale blue:

Scientists use elaborate statistical significance tests to distinguish a fluke from real evidence. But the sad truth is that the standard methods for significance testing are often inadequate to the task.

There you go. And the supporting information for this statement?

One recent paper found an appallingly low chance that certain neuroscience studies could correctly identify an effect from statistical data. Reviews of genetics research show that the statistics linking diseases to genes are wrong far more often than they’re right. Pharmaceutical companies find that test results favoring new drugs typically disappear when the tests are repeated.

For the math inclined, the write up offers:

It’s like flipping coins. Sometimes you’ll flip a penny and get several heads in a row, but that doesn’t mean the penny is rigged. Suppose, for instance, that you toss a penny 10 times. A perfectly fair coin (heads or tails equally likely) will often produce more or fewer than five heads. In fact, you’ll get exactly five heads only about a fourth of the time. Sometimes you’ll get six heads, or four. Or seven, or eight. In fact, even with a fair coin, you might get 10 heads out of 10 flips (but only about once for every thousand 10-flip trials). So how many heads should make you suspicious? Suppose you get eight heads out of 10 tosses. For a fair coin, the chances of eight or more heads are only about 5.5 percent. That’s a P value of 0.055, close to the standard statistical significance threshold. Perhaps suspicion is warranted.
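The arithmetic checks out. Here is a minimal Python sketch (not from the article) that reproduces the quoted figures with the exact binomial formula:

```python
from math import comb

def prob_heads(k: int, n: int = 10, p: float = 0.5) -> float:
    """Probability of exactly k heads in n flips of a fair coin."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(f"Exactly 5 heads: {prob_heads(5):.3f}")   # ~0.246, about a fourth of the time
print(f"All 10 heads:    {prob_heads(10):.4f}")  # ~0.0010, about once per thousand 10-flip trials
print(f"8 or more heads: {sum(prob_heads(k) for k in (8, 9, 10)):.3f}")  # ~0.055, the quoted P value
```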

Now the kicker:

And there’s one other type of paper that attracts journalists while illustrating the wider point: research about smart animals. One such study involved a fish—an Atlantic salmon—placed in a brain scanner and shown various pictures of human activity. One particular spot in the fish’s brain showed a statistically significant increase in activity when the pictures depicted emotional scenes, like the exasperation on the face of a waiter who had just dropped his dishes. The scientists didn’t rush to publish their finding about how empathetic salmon are, though. They were just doing the test to reveal the quirks of statistical significance. The fish in the scanner was dead.

How are those Big Data analyses working out, folks?

Stephen E Arnold, February 16, 2015

Sci Tech Ripple: Lousy Data, Alleged Cover Ups

February 14, 2015

Short honk: I don’t want to get stuck in this tar pit. Read “Are Your Medications Safe?” A professor and some students dug up information that is somewhat interesting. If you happen to be taking medicine, you may not have the full dosage of facts. What’s up? Would the word “malfeasance” be suitable? It is Friday the 13th too.

Stephen E Arnold, February 14, 2015

Lexalytics Now Offers Intention Analysis

February 12, 2015

Lexalytics is going beyond predicting consumers’ feelings, or sentiment analysis, to anticipating their actions with what it calls “intention analysis.” Information Week takes a look at the feature, soon to be a premium service for the company’s Semantria platform, in “Big Data Tool Analyzes Intentions: Cool or Creepy?” Writer Jeff Bertolucci consulted Lexalytics founder and CEO Jeff Catlin, and writes:

Catlin explained via email how intention analysis software would deconstruct the following tweet: “I’ve been saving like crazy for Black Friday. iPhone 6 here I come!”

“There are no words like ‘buy’ or ‘purchase’ in this tweet, even though their intention is to purchase an iPhone,” wrote Catlin. Here’s how an intention analysis tool would tag the tweet:

– intention = “buy”

– intended object = “iPhone”

– intendee = “I”

Grammar-parsing technology is the engine that makes intention analysis work.

“Intention is kind of the sexy feature, but the grammar parser is the key that makes it go, the ability to understand what people are talking about, regardless of content type,” said Catlin. “We’ve built a grammar-parser for Twitter, which deals with the fact that there’s bad punctuation, weird capitalization, and things like that.”
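Lexalytics has not published its parser, so the following is only a hypothetical Python sketch of the kind of tagged output the article describes. The IntentionTag structure and extract_intention helper are invented for illustration; a real system would rely on grammar parsing rather than keyword matching.

```python
from dataclasses import dataclass

@dataclass
class IntentionTag:
    intention: str        # e.g. "buy"
    intended_object: str  # e.g. "iPhone"
    intendee: str         # e.g. "I"

def extract_intention(tweet: str) -> IntentionTag:
    # Hypothetical stand-in for a grammar-parsing pipeline: a real system
    # would analyze the sentence structure rather than match keywords.
    if "saving" in tweet.lower() and "iphone" in tweet.lower():
        return IntentionTag(intention="buy", intended_object="iPhone", intendee="I")
    raise ValueError("no intention detected")

print(extract_intention("I've been saving like crazy for Black Friday. iPhone 6 here I come!"))
```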

Companies can use the technology to determine buying patterns, of course, and may use it to ramp up personalized advertising. Another potential market is that of law enforcement, where agents can use the tool to monitor social media for potential threats.

Lexalytics has been a leader in the sentiment analysis field for years and counts big tech names like Oracle and Microsoft among its clients. Designed to integrate with third-party applications, its analysis software chugs along in the background at many data-related organizations. Founded in 2003, Lexalytics is headquartered in Amherst, Massachusetts.

Cynthia Murrell, February 12, 2015

Sponsored by ArnoldIT.com, developer of Augmentext

Cyber Threats Boost Demand for Next Generation System

February 10, 2015

President Obama has announced a new entity to combat cyber attacks, adding an important resource for countering a deepening threat.

The decision reflects the need for additional counter-terrorism resources in the wake of the Sony and Anthem security breaches. The new initiative addresses both Federal and commercial sectors’ concerns about escalating cyber threats.

The Department of Homeland Security said in a public release: “National Cybersecurity and Communications Integration Center mission is to reduce the likelihood and severity of incidents that may significantly compromise the security and resilience of the Nation’s critical information technology and communications networks.”

For the first time, a clear explanation of the software and systems that perform automated collection and analysis of digital information is available. Stephen E. Arnold’s new book, “CyberOSINT: Next Generation Information Access,” was written to provide information about advanced information access technology. The study was published by Beyond Search on January 21, 2015.

The author is Stephen E Arnold, a former executive at Halliburton Nuclear Services and Booz, Allen & Hamilton. He said: “The increase in cyber threats means that next generation systems will play a rapidly increasing part in law enforcement and intelligence activities.”

The monograph explains why next generation information access systems are the logical step beyond keyword search. Also, the book provides the first overview of the architecture of cyber OSINT systems. The monograph provides profiles of more than 20 systems now available to government entities and commercial organizations. The study includes a summary of the year’s research behind the monograph and a glossary of the terms used in cyber OSINT.

Proliferating digital attacks make next generation information access systems necessary. According to Chuck Cohen, lieutenant with a major Midwestern law enforcement agency and adjunct instructor at Indiana University, “This book is an important introduction to cyber tools for open source information. Investigators and practitioners needing an overview of the companies defining this new enterprise software sector will want this monograph.”

In February 2015, Arnold will keynote a conference on CyberOSINT held in the Washington, DC area. Attendance at the conference is by invitation only. Those interested in the day-long discussion of cyber OSINT can write benkent2020 at yahoo dot com to express their interest in the limited access program.

Arnold added: “Using highly-automated systems, governmental entities and corporations can detect and divert cyber attacks and take steps to prevent assaults and apprehend the people that are planning them. Manual methods such as key word searches are inadequate due to the volume of information to be analyzed and the rapid speed with which threats arise.”

Robert David Steele, a former CIA professional and the co-creator of the Marine Corps intelligence activity, said about the new study: “NGIA systems are integrated solutions that blend software and hardware to address very specific needs. Our intelligence, law enforcement, and security professionals need more than brute force keyword search. This report will help clients save hundreds of thousands of dollars.”

Information about the new monograph is available at www.xenky.com/cyberosint.

Ken Toth, February 10, 2015

Advanced Analytics Are More Important Than We Think

February 3, 2015

Alexander Linden, one of Gartner’s research directors, made some astute observations about advanced analytics and data science technologies. Linden shared his insights with First Post in the article, “Why Should CIOs Consider Advanced Analytics?”

Chief information officers are handling more data and relying on advanced analytics to manage it. The data is critical for gaining market insights, generating more sales, and retaining customers. The old business software cannot handle the overload anymore.

What is astounding is that many companies believe they are already using advanced analytics, when in fact they can improve upon their current methods. Advanced analytics are not an upgraded version of normal, descriptive analytics. They use more powerful problem-solving tools, such as predictive and prescriptive analytics.
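A minimal sketch of the difference, not tied to any particular vendor: descriptive analytics summarizes what already happened, while predictive analytics fits a model to project what comes next. The sales figures below are invented for illustration.

```python
from statistics import mean
import numpy as np
from sklearn.linear_model import LinearRegression

monthly_sales = [120, 135, 150, 160, 172, 185]  # invented sample data

# Descriptive analytics: summarize what has already happened.
print("average monthly sales:", mean(monthly_sales))

# Predictive analytics: fit a simple model and project the next month.
X = np.arange(len(monthly_sales)).reshape(-1, 1)
model = LinearRegression().fit(X, monthly_sales)
print("projected next month:", model.predict([[len(monthly_sales)]])[0])
```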

Gartner also flings out some really big numbers:

“One of Gartner’s new predictions says that through 2017, the number of citizen data scientists will grow five times faster than the number of highly skilled data scientists.”

This is akin to there being more people able to code and create applications than skilled engineers with college degrees. A do-it-yourself mentality will spread through the data analytics community, but Gartner stresses that backyard advanced analytics will not cut it. Companies need to continue to rely on skilled data scientists to interpret the data and network it across the business units.

Whitney Grace, February 03, 2015
Sponsored by ArnoldIT.com, developer of Augmentext

Recorded Future: Google and Cyber OSINT

February 2, 2015

I find the complaints about Google’s inability to handle time amusing. On the surface, Google seems to demote, ignore, or just not understand the concept of time. For the vast majority of Google service users, Google is no substitute for their own investment of time and effort in dating items. But for the wide, wide Google audience, ads, not time, are what matter.

Does Google really get an F in time? The answer is, “Nope.”

In CyberOSINT: Next Generation Information Access I explain that Google’s time sense is well developed and of considerable importance to the next generation solutions the company hopes to offer. Why the crawfishing? Well, Apple could just buy Google and make the bitter taste of the Apple Board of Directors’ experience a thing of the past.

Now to temporal matters in the here and now.

CyberOSINT relies on automated collection, analysis, and report generation. In order to make sense of the data and information crunched by an NGIA system, time is a key metatag item. To figure out time, a system has to understand:

  • The date and time stamp
  • Versioning (previous, current, and future document, data items, and fact iterations)
  • Times and dates contained in a structured data table
  • Times and dates embedded in content objects themselves; for example, a reference to “last week” or, in some cases, optical character recognition of the date on a surveillance tape image.

For the average query, this type of time detail is overkill. For an NGIA system, however, the “time and date” of an event requires disambiguation, determination and tagging of specific time types, and then capture of the date and time data with markers for document or data versions.
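Recorded Future’s actual pipeline is proprietary. As a rough illustration of what capturing date and time data involves, here is a minimal standard-library Python sketch; the field names and the normalize_time helper are assumptions made for this example.

```python
import re
from datetime import datetime, timedelta, timezone

def normalize_time(text: str, observed_at: datetime) -> dict:
    """Attach explicit and relative time markers to a content object."""
    record = {
        "observed_at": observed_at.isoformat(),  # the date and time stamp
        "version": 1,                            # placeholder for versioning
        "embedded_times": [],                    # dates mentioned in the text itself
    }
    # Explicit dates such as 2015-01-21
    for match in re.findall(r"\d{4}-\d{2}-\d{2}", text):
        record["embedded_times"].append(match)
    # A relative expression such as "last week", resolved against the capture time
    if "last week" in text.lower():
        record["embedded_times"].append((observed_at - timedelta(weeks=1)).date().isoformat())
    return record

print(normalize_time("The breach reported last week affected systems patched on 2015-01-21.",
                     datetime(2015, 2, 2, tzinfo=timezone.utc)))
```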


A simplification of Recorded Future’s handling of unstructured data. The system can also handle structured data and a range of other data management content types. Image copyright Recorded Future 2014.

Sounds like a lot of computational and technical work.

In CyberOSINT, I describe Google’s and In-Q-Tel’s investments in Recorded Future, one of the data forward NGIA companies. Recorded Future has wizards who developed the Spotfire system, which is now part of the Tibco service. There are Xooglers like Jason Hines. There are assorted wizards from Sweden, countries that most US high school students cannot locate on a map, and assorted veterans of high technology start ups.

An NGIA system delivers actionable information to a human or to another system. Conversely, a licensee can build and integrate new solutions on top of the Recorded Future technology. One of the company’s key inventions is a set of numerical recipes that deal effectively with the notion of “time.” Recorded Future uses the name “Tempora” as shorthand for the advanced technology that makes time, along with predictive algorithms, part of the Recorded Future solution.


Linguistic Analysis and Data Extraction with IBM Watson Content Analytics

January 30, 2015

The article on IBM titled “Discover and Use Real-World Terminology with IBM Watson Content Analytics” provides an overview of domain-specific terminology through the linguistic facets of Watson Content Analytics. The article begins with a brief reminder that most data, whether in the form of images or texts, is unstructured. IBM’s linguistic analysis focuses on extracting relevant unstructured data from texts in order to make it more useful and usable in analysis. The article details the processes of IBM Watson Content Analytics:

“WCA processes raw text from the content sources through a pipeline of operations that is conformant with the UIMA standard. UIMA (Unstructured Information Management Architecture) is a software architecture that is aimed at the development and deployment of resources for the analysis of unstructured information. WCA pipelines include stages such as detection of source language, lexical analysis, entity extraction… Custom concept extraction is performed by annotators, which identify pieces of information that are expressed as segments of text.”
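The staged pipeline the quote describes is easier to picture in code. Below is a generic, hypothetical Python sketch of chained annotators; it is not the WCA product or the actual UIMA Java API, just an illustration of the design where each stage enriches the document and passes it along.

```python
from typing import Callable, Dict, List

# A generic annotator takes a document dict and enriches it with new annotations.
Annotator = Callable[[Dict], Dict]

def detect_language(doc: Dict) -> Dict:
    doc["language"] = "en"  # stand-in for a real language detector
    return doc

def lexical_analysis(doc: Dict) -> Dict:
    doc["tokens"] = doc["text"].split()
    return doc

def entity_extraction(doc: Dict) -> Dict:
    doc["entities"] = [t for t in doc["tokens"] if t.istitle()]  # toy heuristic
    return doc

def run_pipeline(doc: Dict, stages: List[Annotator]) -> Dict:
    for stage in stages:
        doc = stage(doc)
    return doc

print(run_pipeline({"text": "Watson analyzed the lab report from Memorial Sloan Kettering"},
                   [detect_language, lexical_analysis, entity_extraction]))
```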

The main uses of WCA are exploring insights through facets as well as extracting concepts in order to apply WCA analytics. The latter might include excavating lab analysis reports to populate patient records, for example. If any of these functionalities sound familiar, it might not surprise you that IBM bought iPhrase, and much of this article is reminiscent of iPhrase functionality from about 15 years ago.

Chelsea Kerwin, January 30, 2015

Sponsored by ArnoldIT.com, developer of Augmentext

Enterprise Search Pressured by Cyber Methods

January 29, 2015

I read “Automated Systems Replacing Traditional Search.” The write up asserts:

Stephen E. Arnold, search industry expert and author of the “Enterprise Search Report” and “The New Landscape of Search,” has announced the publication of “CyberOSINT: Next-Generation Information Access.” The 178-page report explores the tools and methods used to collect and analyze content posted in public channels such as social media sites. The new technology can identify signals that provide intelligence and law enforcement analysts early warning of threats, cyber attacks or illegal activities.

According to Robert Steele, co-founder of USMC Intelligence Activity:

NGIA systems are integrated solutions that blend software and hardware to address very specific needs. Our intelligence, law enforcement, and security professionals need more than brute force keyword search.

According to Dr. Jerry Lucas, president of Telestrategies, which operates law enforcement and training conferences in the US and elsewhere:

This is the first discussion of the innovative software that makes sense of the flood of open source digital information. Law enforcement, security, and intelligence professionals will find this an invaluable resource to identify ways to deal with Big Data.

The report complements the Telestrategies ISS seminar on CyberOSINT. Orders for the monograph, which costs $499, may be placed at www.xenky.com/cyberosint. Information about the February 19, 2015, seminar held in the DC area is at this link.

The software and methods described in the study have immediate and direct applications to commercial entities. Direct orders may be placed at http://gum.co/cyberosint.

Don Anderson, January 29, 2015

ZyLAB Publishes File Analysis Market Guide

January 27, 2015

If you need help finding file analysis solutions, Nieuwsbank published a press release that might help with your research: “Gartner ZyLAB in ‘The File Analysis Market Guide, 2014.’” File analysis refers to analyzing stored files and the users who have access to them. It is used to manage intellectual property, keep personal information private, and protect data. File analysis solutions have many benefits for companies, including reducing business risks, reducing costs, and increasing operational efficiency.

The guide provides valuable insights:

“In the Market Guide, Gartner writes: ZyLAB enters this market from the perspective of, and with a legacy in, eDiscovery. The company has a strong presence in the legal community and is widely used by governments and organizations in the field of law enforcement. ZyLAB emphasizes activities such as IP protection and detection, fraud investigations, eDiscovery, and the responsible removal of redundant data. ZyLAB supports 200 storage types and 700 file formats in 450 languages. This makes it a good choice for international companies.”

ZyLAB is a respected eDiscovery and information risk management solutions company, and this guide is a compilation of Gartner’s insights. The article points out that companies might have their own file analysis manuals, but few actually enforce their policies or monitor violations. Gartner is a leading market research firm, and its endorsement should be all you need to use this guide.

Whitney Grace, January 27, 2015
Sponsored by ArnoldIT.com, developer of Augmentext

IBM on Skin Care

January 19, 2015

Watson has been going to town in different industries, putting to use its massive artificial brain. It has been working in the medical field interpreting electronic medical record data. According to Open Health News, IBM has used its technology in other medical ways: “IBM Research Scientists Investigate Use Of Cognitive Computing-Based Visual Analytics For Skin Cancer Image Analysis.”

IBM partnered with Memorial Sloan Kettering to use cognitive computing to analyze dermatological images and help doctors identify cancerous states. The goal is to help doctors detect cancer earlier. Skin cancer is the most common type of cancer in the United States, but diagnostic expertise varies. It takes experience to be able to detect cancer, and cognitive computing might take out some of the guesswork.

“Using cognitive visual capabilities being developed at IBM, computers can be trained to identify specific patterns in images by gaining experience and knowledge through analysis of large collections of educational research data, and performing finely detailed measurements that would otherwise be too large and laborious for a doctor to perform. Such examples of finely detailed measurements include the objective quantification of visual features, such as color distributions, texture patterns, shape, and edge information.”

IBM is already a leader in visual analytics, and the new skin cancer project showed 97% sensitivity and 95% specificity in preliminary tests. That translates to accurate cognitive computing.
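For readers who want to see what those two percentages measure, here is a minimal Python sketch using an invented confusion matrix; the counts are illustrative and are not IBM’s data.

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple:
    """Sensitivity: share of actual cancers flagged. Specificity: share of healthy cases cleared."""
    return tp / (tp + fn), tn / (tn + fp)

# Invented counts chosen to land near the reported 97% / 95% figures.
sens, spec = sensitivity_specificity(tp=97, fn=3, tn=95, fp=5)
print(f"sensitivity: {sens:.0%}, specificity: {spec:.0%}")
```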

Could cognitive computing be applied to identifying other cancer types?

Whitney Grace, January 19, 2015
Sponsored by ArnoldIT.com, developer of Augmentext
