Smarter Content for Contentier Intelligence
December 28, 2016
I spotted a tweet about making smart content smarter. It seems that if content is smarter, then intelligence becomes contentier. I loved my logic class in 1962.
Here’s the diagram from this tweet. Hey, if the link is wonky, just attend the conference and imbibe the intelligence directly, gentle reader.
The diagram carries the identifier Data Ninja, which echoes Palantir’s use of the word ninja for some of its Hobbits. Data Ninja’s diagram has three parts. I want to focus on the middle part:
What I found interesting is that instead of a single block labeled “content processing,” the content processing function is broken into several parts. These are:
A Data Ninja API
A Data Ninja “knowledgebase,” which I think is an iPhrase-type or TeraText-type of method. If you are not familiar with iPhrase and TeraText, feel free to browse the descriptions at the links.
A third component in the top box is the statement “analyze unstructured text.” This may refer to indexing and such goodies as entity extraction.
The second box performs “text analysis.” Obviously this process is different from the “analyze unstructured text” step; otherwise, why run the same analyses again? The second box performs what may be clustering of content into specific domains. This is important because a “terminal” in transportation may be different from a “terminal” in a cloud hosting facility. Disambiguation is important because the terminal may be part of a diversified transportation company’s computing infrastructure. I assume Data Ninja’s methods handle this parsing of “concepts” without many errors. (A toy sketch of this sort of disambiguation appears after the list below.)
Once the selection of a domain area has been performed, the system appears to perform four specific types of operations as the Data Ninja practice their katas. These are the smart components:
- Smart sentiment; that is, is the content object weighted “positive” or “negative”, “happy” or “sad”, or green light or red light, etc.
- Smart data; that is, I am not sure what this means
- Smart content; that is, maybe a misclassification because the end result should be smart content, but the diagram shows smart content as a subcomponent within the collection of procedures/assertions in the middle part of the diagram
- Smart learning; that is, the Data Ninja system is infused with artificial intelligence, smart software, or machine learning (perhaps the three buzzwords are combined in practice, not just in diagram labeling?)
- The end result is an iPhrase-type representation of data. (Note that this approach infuses TeraText, MarkLogic, and other systems which transform unstructured data into metadata-tagged structured information.)
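To make the middle box a bit less abstract, here is a minimal sketch of what a “text analysis” plus “smart sentiment” step might look like. This is my illustration, not Data Ninja’s method; the domain cue lists and the polarity lexicon are invented for the example.

```python
# Toy sketch of domain disambiguation plus sentiment tagging.
# Not Data Ninja's method; keyword lists and lexicon are invented.

DOMAIN_CUES = {
    "transportation": {"cargo", "freight", "airport", "shipping", "gate"},
    "cloud_hosting": {"server", "rack", "login", "ssh", "datacenter"},
}

POLARITY = {"delay": -1, "outage": -1, "fast": 1, "reliable": 1, "happy": 1}

def analyze(text: str) -> dict:
    tokens = text.lower().split()
    # Pick the domain whose cue words overlap the document the most.
    domain = max(DOMAIN_CUES, key=lambda d: len(DOMAIN_CUES[d] & set(tokens)))
    # Crude "smart sentiment": sum lexicon weights, then bucket the score.
    score = sum(POLARITY.get(t, 0) for t in tokens)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"domain": domain, "sentiment": sentiment}

print(analyze("The terminal gate handled cargo without delay"))
# {'domain': 'transportation', 'sentiment': 'negative'}  <- "delay" trips the toy lexicon
```

Even this toy shows why the real thing is hard: “without delay” fools a word-counting lexicon, which is exactly the sort of error a production system has to engineer around.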
The diagram then shows a range of services “plugging” into the box performing the functions referenced in my description of the middle box.
If the system works as depicted, Data Ninjas may have the solution to the federation challenge which many organizations face. Smarter content should deliver contentier intelligence or something along that line.
Stephen E Arnold, December 28, 2016
On the Hunt for Thesauri
December 15, 2016
How do you create a taxonomy? These curated lists do not write themselves, although these days they sometimes seem to. Companies that specialize in file management and organization develop taxonomies. Usually they offer customers an out-of-the-box option that can be individualized with additional words, categories, etc. Taxonomies can be generalized lists; think of a one-size-fits-all deal. Certain industries, however, need specialized taxonomies that include words, phrases, and other jargon particular to that field. As with the generalized taxonomies, there are canned industry-specific taxonomies, except the more specialized the industry, the less likely a canned list exists.
This is where taxonomy lists need to be created from scratch. Where do the taxonomy writers get the content for their lists? They turn to the tried-and-true resources that have aided researchers for generations: dictionaries, encyclopedias, and technical manuals. Thesauri are perhaps the most important tools for taxonomy writers, because they include not only words and their meanings but also synonyms and antonyms for words within a field.
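To make that concrete, here is a minimal sketch of the kind of entry a thesaurus contributes to a taxonomy: a preferred term, its synonyms, and broader/narrower links. The terms are invented for illustration and are not taken from MultiTes or any published vocabulary.

```python
# Minimal thesaurus-style entries: preferred term, synonyms, broader/narrower links.
# Invented example terms; not from MultiTes or any published vocabulary.

THESAURUS = {
    "haulage": {
        "synonyms": {"trucking", "carriage"},
        "broader": "transportation",
        "narrower": {"long-haul trucking"},
    },
}

def preferred_term(word: str) -> str:
    """Map a synonym back to the preferred term used in the taxonomy."""
    for term, entry in THESAURUS.items():
        if word == term or word in entry["synonyms"]:
            return term
    return word  # unknown words pass through unchanged

print(preferred_term("trucking"))  # -> "haulage"
```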
If you need to write a taxonomy and are at a loss, check out MultiTes. It is a Web site that includes tools and other resources to get your taxonomy job done. Multisystems built MultiTes and they:
…developed our first computer program for Thesaurus Management on PC’s in 1983, using dBase II under CPM, predecessor of the DOS operating system. Today, more than three decades later, our products are as easy to install and use. In addition, with MultiTes Online all that is needed is a web connected device with a modern web browser.
In other words, they have experience and know their taxonomies.
Whitney Grace, December 15, 2016
The Robots Are Not Taking over Libraries
December 14, 2016
I once watched a Japanese anime that featured a robot working in a library. The robot shelved, straightened, and maintained order of the books by running on a track that circumnavigated all the shelves in the building. The anime took place in a near-future Japan, when all paper documents were rendered obsolete. While we are a long way off from having robots in public libraries (budget constraints and cuts), there is a common belief that libraries are obsolete as well.
Libraries are the furthest thing from being obsolete, but robots have apparently gained enough artificial intelligence to find lost books. Popsci shares the story in “Robo Librarian Tracks Down Misplaced Book.” It explains a situation that librarians hate to deal with: people misplacing books on shelves instead of letting the experts put them back. Libraries rely on books being in precise order, and if a book is in the wrong place, it is as good as lost. Fancy libraries, like a research library at the University of Chicago, have automated the process, but that approach is too expensive and unrealistic for most libraries to deploy. There is another option:
A*STAR roboticists have created an autonomous shelf-scanning robot called AuRoSS that can tell which books are missing or out of place. Many libraries have already begun putting RFID tags on books, but these typically must be scanned with hand-held devices. AuRoSS uses a robotic arm and RFID scanner to catalogue book locations, and uses laser-guided navigation to wheel around unfamiliar bookshelves. AuRoSS can be programmed to scan the library shelves at night and instruct librarians how to get the books back in order when they arrive in the morning.
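The “instruct librarians how to get the books back in order” step boils down to comparing the scanned tag sequence against shelf order. A minimal sketch, assuming call numbers that sort as plain strings (real call-number collation in LC or Dewey is messier), and not based on AuRoSS code:

```python
# Toy shelf-order check: flag books whose scanned position does not match
# sorted call-number order. Assumes call numbers compare as plain strings,
# which real classification schemes do not quite honor.

scanned = ["QA76.9", "QA76.2", "QA77.1", "QA75.5"]  # order read off the shelf

def misplaced(scanned_order):
    expected = sorted(scanned_order)
    return [book for book, should_be in zip(scanned_order, expected) if book != should_be]

print(misplaced(scanned))  # books sitting in the wrong slot
```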
Manual labor is still needed to put the books in order after the robot does its work at night. But what happens when someone needs help with research, finding an obscure citation, evaluating information, or even using the Internet correctly? Yes, librarians are still needed. Who else is going to interpret data, guide research, and guard humanity’s knowledge?
Whitney Grace, December 14, 2016
Need Data Integration? Think of Cisco. Well, Okay
November 25, 2016
Data integration is more difficult than some of the text analytics wizards state. Software sucks in disparate data, and “real time” analytics systems present actionable results to marketers, sales professionals, and chief strategy officers. Well, that’s not exactly accurate.
Industrial strength data integration demands a company which has bought a company which acquired a technology which performs data integration. Cisco offers a system that appears to combine the functions of Kapow with the capabilities of Palantir Technologies’ Gotham and tosses in the self service business information which Microsoft touts.
Cisco acquired Composite Software in 2013. Cisco now offers the Composite system as the Cisco Information Server. Here’s what the block diagram of the federating behemoth looks like. You can get a PDF version at this link.
The system is easy to use. “The graphical development and management environments are easy to learn and intuitive to use,” says the Cisco Teradata information sheet. For some tips about the easy-to-use system, check out the Data Virtualization Cisco Information Server blog. A tutorial, although dated, is at this link. Note that the block diagram has not changed significantly between the 2011 version and the one presented above. I assume there is not much work required to ingest and make sense of the Twitter stream or other social media content.
The blog has one post and was last updated in 2011. But there is a YouTube video at this link.
The system includes a remarkable range of features; for example:
- Modeling; that is, import and transform data (what Cisco calls “introspect”), create a model, figure out how to make it run at an acceptable level of performance, and expose the data to other services. (Does this sound like iPhrase’s and TeraText’s method? It does to me.)
- Search
- Transformation
- Version control and governance
- Data quality control and assurance
- Outputs
- Security
- Administrative controls.
The time required to create this system is, according to Cisco Teradata, “over 300 man years.”
The licensee can plug the system into an IBM DB2 running on a z/OS8 “handheld”. You will need a large hand by the way. No small hands need apply.
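Strip away the branding and data virtualization boils down to federating a query across sources at request time instead of copying everything into a warehouse first. A minimal sketch of that general idea, with two in-memory lists standing in for, say, a DB2 table and a cloud API; this is an illustration, not Cisco Information Server code:

```python
# Toy federation: join records from two "sources" at query time without
# moving either into a warehouse. Plain lists stand in for a DB2 table and
# a REST feed; not Cisco Information Server internals.

db2_customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
crm_orders = [{"customer_id": 1, "total": 250.0}, {"customer_id": 1, "total": 99.0}]

def federated_orders_by_customer(name: str):
    ids = {c["id"] for c in db2_customers if c["name"] == name}
    return [o for o in crm_orders if o["customer_id"] in ids]

print(federated_orders_by_customer("Acme"))  # joined view built on the fly
```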
Stephen E Arnold, November 25, 2016
Genetics Are Biased
November 4, 2016
DNA does not lie. DNA does not lie if the testing is conducted accurately and by an experienced geneticist. Right now it is popular for people to get their DNA tested to discover where their ancestors came from. Many testers are surprised when they receive their results, because they learn their ancestors came from unexpected places. Black Americans are eager to learn about their genetics, due to their slave ancestry and lack of familial records. For many Black Americans, DNA is the only way they can learn where their roots originated, but Africa is not entirely cataloged.
According to Science Daily’s article “Major Racial Bias Found In Leading Genomics Database,” if you have African ancestry and get a DNA test, it will be difficult to pinpoint your results. The two largest genomics databases that geneticists refer to contain a measurable bias toward European genes. From a logical standpoint, this is understandable: Africa has the largest genetic diversity and remains a developing continent without the best access to scientific advances. These factors provide challenges for geneticists as they try to solve the African genetic puzzle.
It also weighs heavily on Black Americans, because they are missing a significant component of their genetic make-up, one that can reveal vital health information. Most Black Americans today carry a percentage of European ancestry. While the European side of their DNA can be traced, their African heritage is more likely to yield clouded results. On a financial scale, it is more expensive to test Black Americans’ genetics due to the lack of information, and the results are still not going to be as accurate as those for a European genome.
‘This groundbreaking research by Dr. O’Connor and his team clearly underscores the need for greater diversity in today’s genomic databases,’ says UM SOM Dean E. Albert Reece, MD, PhD, MBA, who is also Vice President of Medical Affairs at the University of Maryland and the John Z. and Akiko Bowers Distinguished Professor at UM SOM. ‘By applying the genetic ancestry data of all major racial backgrounds, we can perform more precise and cost-effective clinical diagnoses that benefit patients and physicians alike.’
While Africa is a large continent, the Human Genome Project and other genetic organizations should apply for grants that would fund a trip to Africa. Geneticists and biologists would then canvass Africa, collect cheek swabs from willing populations, return with the DNA to sequence it, and add the results to the database. Would it be expensive? Yes, but it would advance medical knowledge and reveal more information about human history. After all, we all originate from Mother Africa.
Whitney Grace, November 4, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
NTechLab as David to the Google Goliath of Facial Recognition
October 27, 2016
The article titled A Russian Startup is Beating Google with Eerily Accurate Facial Recognition Technology on Business Insider positions NTechLab as the company leading the industry in facial recognition technology. In 2015, the startup beat Google to win the “MegaFace” competition. The article explains,
NTechLab sets itself apart from its competitors with its high level of accuracy and its ability to search an extensive database of photographs. At the MegaFace Championship, NTechLab achieved a 73 percent accuracy with a database of 1 million pictures. When the number dropped to 10,000 images, the system achieved a jaw-dropping accuracy of 95 percent. “We are the first to learn how to efficiently handle large picture databases,” said NTechLab founder Artem Kukharenko to Intel iQ.
The startup based its technology on deep learning and a neural network. The company has held several public demonstrations at festivals and amusement parks. Attendees share selfies with the system, then receive pictures of themselves when the system “found” them in the crowd. Kukharenko touts the “real-world” problem-solving capabilities of his system. While there isn’t a great deal of substantive backup for his claims, the company is certainly worth keeping an eye on.
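For readers wondering what “searching an extensive database of photographs” looks like under the hood, the general pattern is nearest-neighbor lookup over face embeddings produced by the neural network. A minimal sketch with random vectors standing in for real embeddings; nothing here is NTechLab’s code:

```python
# Toy 1:N face search: cosine similarity between a query embedding and a
# database of embeddings. Random vectors stand in for the output of a real
# face-embedding network; this is not NTechLab's implementation.
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(100_000, 128))          # 100k stand-in "face" vectors
database /= np.linalg.norm(database, axis=1, keepdims=True)

query = database[42] + 0.05 * rng.normal(size=128)  # noisy re-capture of face #42
query /= np.linalg.norm(query)

scores = database @ query        # cosine similarity, since rows are unit length
print(int(np.argmax(scores)))    # expect 42; real systems use ANN indexes at 1M+ scale
```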
Chelsea Kerwin, October 27, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Data Silos: Here to Stay
October 20, 2016
Data silos have become a permanent part of the landscape. Even if data reside in a cloud, some data are okay for certain people to access. Other data are off limits. Whether the data silo is a result of access controls or because an enthusiastic marketer has a one-off storage device in his or her cubicle’s desk drawer, we have silos.
I read “Battling Data Silos: 3 Tips to Finance and Operations Integration.” This is a very good example of providing advice which is impossible to implement. If I were to use the three precepts in an engagement, I have a hunch that a barrel of tar and some goose feathers will be next to my horse and buggy.
What are the “tips”? Here you go.
- Conduct a data discovery audit.
- Develop a plan.
- And my fave “Realize the value of the cloud for high performance and scalability.”
Here we go, gentle reader.
The cost of a data discovery audit can be high. The costs in time, effort, and lost productivity mean that most data audits are limp wash rags. Few folks in an organization know what data are where, who manages those data, and the limits placed on the data. Figuring out the answers to these questions in a company with 25 people is tough. Try to do it for a government agency with dozens of locations and hundreds of staff and contractors. Automated audits can be a help, but there may be unforeseen consequences of sniffing out who has what. The likelihood of a high value data discovery audit without considerable preparation, budgeting, and planning is zero. Most data audits, like software audits, never reach the finish line without a trip to the emergency room.
The notion of a plan for consolidating data is okay. Folks love meetings with coffee and food. A plan allows a professional to demonstrate that work has been accomplished. The challenge, of course, is to implement the plan. That’s another kettle of fish entirely. MBA think does not deliver much progress toward eliminating silos which proliferate like tweets about zombies.
The third point is value. Yep, value. What is value? I don’t know. Cloud value can be demonstrated for specific situations. But migrating data to a cloud and then making sure that no regulatory, legal, or common sense problems arise is a work in progress. Data management, content controls, and security tasks nudge cloud functions toward one approach: yet another data silo.
Yep, YADS. Three breezy notions crater due to the gravitational pull of segmented content repositories under the control of folks who absolutely love silos.
Stephen E Arnold, October 20, 2016
Recent Developments in Deep Learning Architecture from AlexNet to ResNet
September 27, 2016
The article on GitHub titled The 9 Deep Learning Papers You Need To Know About (Understanding CNNs Part 3) is not an article about the global media giant but rather about advancements in computer vision and convolutional neural networks (CNNs). The article frames its discussion around the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which it terms the “annual Olympics of computer vision…where teams compete to see who has the best computer vision model for tasks such as classification, localization, detection and more.” The article explains that the 2012 winners and their network (AlexNet) revolutionized the field.
This was the first time a model performed so well on a historically difficult ImageNet dataset. Utilizing techniques that are still used today, such as data augmentation and dropout, this paper really illustrated the benefits of CNNs and backed them up with record breaking performance in the competition.
In 2013, CNNs flooded in, and ZF Net was the winner with an error rate of 11.2% (down from AlexNet’s 15.4%). Prior to AlexNet, the lowest error rate was 26.2%. The article also discusses other progress in general network architecture, including VGG Net, which emphasized the depth and simplicity needed for hierarchical data representation in CNNs, and GoogLeNet, which tossed the deep-and-simple rule out of the window and paved the way for future creative structuring using the Inception model.
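For readers who want to see what “depth and simplicity” means in practice, here is a minimal VGG-flavored stack of small convolutions. It assumes PyTorch and toy input sizes; it is a pattern sketch, not any of the prize-winning networks discussed above.

```python
# A VGG-flavored sketch: stacks of 3x3 convolutions with ReLU, then pooling.
# Pattern illustration only; not AlexNet, ZF Net, VGG, GoogLeNet, or ResNet.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # halve spatial resolution
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(128 * 16 * 16, 10),         # assumes toy 64x64 inputs and 10 classes
)

x = torch.randn(1, 3, 64, 64)             # one fake RGB image
print(model(x).shape)                     # torch.Size([1, 10])
```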
Chelsea Kerwin, September 27, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
There is a Louisville, Kentucky Hidden Web/Dark Web meet up on September 27, 2016.
Information is at this link: https://www.meetup.com/Louisville-Hidden-Dark-Web-Meetup/events/233599645/
In-Q-Tel Wants Less Latency, Fewer Humans, and Smarter Dashboards
September 15, 2016
I read “The CIA Just Invested in a Hot Startup That Makes Sense of Big Data.” I love the “just.” In-Q-Tel investments are not like bumping into a friend in Penn Station. Zoomdata, founded in 2012, has been making calls, raising venture funding (more than $45 million in four rounds from 21 investors), and staffing up to about 100 full time equivalents. With its headquarters in Reston, Virginia, the company is not exactly operating from a log cabin west of Paducah, Kentucky.
The write up explains:
Zoom Data uses something called Data Sharpening technology to deliver visual analytics from real-time or historical data. Instead of a user searching through an Excel file or creating a pivot table, Zoom Data puts what’s important into a custom dashboard so users can see what they need to know immediately.
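“Data Sharpening” is Zoomdata’s own label, and its internals are not described in the write up. The generic idea behind low-latency dashboards of this sort is progressive aggregation: show a rough answer from the first chunk of data immediately and refine it as more data streams in. A toy sketch of that general idea, not Zoomdata’s implementation:

```python
# Toy progressive aggregation: emit a running average after each chunk so a
# dashboard can paint an early estimate and sharpen it over time.
# Generic illustration; not Zoomdata's Data Sharpening technology.
import random

random.seed(1)
stream = [random.gauss(100, 15) for _ in range(10_000)]  # stand-in metric values

total, count = 0.0, 0
for i in range(0, len(stream), 2_000):          # process the data in chunks
    chunk = stream[i:i + 2_000]
    total += sum(chunk)
    count += len(chunk)
    print(f"after {count:>6} rows: running mean = {total / count:.2f}")
```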
What Zoomdata does is offer hope to its customers for less human fiddling with data and faster outputs of actionable intelligence. If you recall how IBM i2 and Palantir Gotham work, humans are needed. IBM even snagged Palantir’s jargon of AI for “augmented intelligence.”
In-Q-Tel wants more smart software with less dependence on expensive, hard to train, and often careless humans. When incoming rounds hit near a mobile operations center, it is possible to lose one’s train of thought.
Zoomdata has some Booz, Allen DNA, some MIT RNA, and protein from other essential chemicals.
The write up mentions Palantir, but does not make explicit the need to reduce to some degree the human-centric approaches which are part of the major systems’ core architecture. You have nifty cloud stuff, but you have less nifty humans in most mission critical work processes.
To speed up the outputs, software should be the answer. An investment in Zoomdata delivers three messages to me here in rural Kentucky:
- In-Q-Tel continues to look for ways to move along the “less wait and less weight” requirement of those involved in operations. “Weight” refers to heavy, old-fashioned systems. “Wait” refers to the latency imposed by manual processes.
- Zoomdata and other investments are whips to the flanks of the BAE Systems, IBMs, and Palantirs chasing government contracts. The investment focuses attention not on scope changes but on figuring out how to deal with the unacceptable complexity and latency of many existing systems.
- In-Q-Tel has upped the value of Zoomdata. With consolidation in the commercial intelligence business rolling along at NASCAR speeds, it won’t take long before Zoomdata finds itself going to big company meetings to learn what the true costs of being acquired are.
For more information about Zoomdata, check out the paid-for reports at this link.
Stephen E Arnold, September 15, 2016
SAP In Memory: Conflicts of Opinion
September 13, 2016
I was surprised by the information presented in “SAP Hana Implementation Pattern Research Yields Contradictory Results.” My goodness, I thought, an online publication actually presents some ideas that a high profile system may not be a cat fully dressed in pajamas.
The SAP Hana system is a database. The difference between Hana and the dozens of other allegedly next generation data management solutions is its “in memory, columnar database platform.” If you are not hip to the lingo of the database administrators who clutch many organizations by the throat, an in memory approach is faster than trucking back to a storage device. Think back to the 1990s and Eric Brewer or the teens who rolled out Pinpoint.
The columnar angle is that data are presented in stacks, with each item written on a note card. The mapping of the data is different from a row-type system. The primary key in a columnar structure is the data itself, which maps back to the row identification.
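The row-versus-column difference is easier to see in code than in note-card metaphors. A minimal sketch, and emphatically not Hana internals: the same three records stored both ways, with a single-column aggregate reading only one array in the columnar layout.

```python
# Same data, two layouts. A column store keeps each field in its own array,
# so an aggregate over one field touches only that array. Illustration only;
# this is not how SAP Hana is implemented internally.

rows = [                                   # row-oriented: one record per entry
    {"id": 1, "region": "EU", "revenue": 120.0},
    {"id": 2, "region": "US", "revenue": 90.0},
    {"id": 3, "region": "EU", "revenue": 200.0},
]

columns = {                                # column-oriented: one array per field
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "revenue": [120.0, 90.0, 200.0],
}

print(sum(r["revenue"] for r in rows))     # row store: walk every whole record
print(sum(columns["revenue"]))             # column store: scan one array only
```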
The aforementioned article points to a mid tier consulting firm report. That report comes from an outfit called Nucleus Research. Nucleus, according to the article, “revealed that 60 percent of SAP reference customers – mostly in the US – would not buy SAP technology again.” I understand that SAP engenders some excitement among its customers, but a mid tier consulting firm seems to be demonstrating considerable bravery if the data are accurate. Many mid tier consulting firms sand the rough edges off their reports.
The article then jumps to a report paid for by an SAP reseller, which obviously has a dog in the Nucleus fight. Another mid tier research outfit, Coleman Parks, was hired to do a second study. The research focused on 250 Hana license holders.
The results are interesting. I learned from the write up:
When asked what claims for Hana were credible, 92% of respondents said it reduced IT infrastructure costs, a further 87% stated it saved business costs. Some 98% of Hana projects came in on-budget, and 65% yet to roll out were confident of hitting budget.
Yep, happy campers who are using the system for online transactional processing and online analytical processing. No at home chefs tucking away their favorite recipes in Hana I surmise.
However, the report allegedly determined what I have known for more than a decade:
SAP technology is often deemed too complex, and its CEO Bill McDermott has been waging a public war against this complexity for the past few years, using the mantra Run Simple.
The rebuttal study identified another plus for Hana:
“We were surprised how satisfied the Hana license holders were. SAP has done a good job in making sure these projects work, and rate at which has got Hana out is amazing for such a large organization,” said Centiq director of technology and services Robin Webster. “We had heard a lot about Hana as shelfware, so we were surprised at the number saying they were live.”
From our Hana free environment in rural Kentucky, we think:
- Mid tier consulting firms often output contradictory findings when reviewing products or conducting research. If there is bias in algorithms, imagine what might lurk in the research team members’ approaches
- High profile outfits like SAP can convince some of the folks with dogs in the fight to get involved in proving that good things come to those who have more research conducted
- Open source data management systems are plentiful. Big outfits like Hewlett Packard, IBM, and Oracle find themselves trying to generate the type of revenue associated with proprietary, closed data management products at a time when fresh faced computer science graduates just love free in memory solutions like MemSQL and similar products.
SAP mounted an open source initiative which I learned about in “SAP Embraces Open Source Sort Of.” But the real message for me is that one can get mid tier research firms to do reports. Then one can pick the one that best presents a happy face to potential licensees.
Here in Harrod’s Creek, the high tech crowd tests software before writing checks. No consultants required.
Stephen E Arnold, September 13, 2016

