An Interview with Mike Sorah
The need for more sophisticated text processing is increasing. In part, key word indexing has shortcomings which often make it difficult to find information via a search box. Another contributing factor is that it is almost impossible for individuals and organizations to keep pace with the flow of digital information. The “beyond search” movement has shifted from traditional Boolean to discovery. Finding information that a person already knows about is giving way to having a system alert a person to information which is important.
At a recent conference, I learned about interesting technology developed by IMT Holdings. The firm’s product is well known within specific market niches. Now the company is finding itself the focus of interest from organizations struggling with information related to health, financial services, and marketing.
I spoke with the chief technical officer of IMT on November 1, 2012. The full text of the interview appears below. Note: In our conversation, Mr. Sorah demonstrated the Rosoka system, and I have included a screenshot in the interview text.
What's the history of IMT and Rosoka?
IMT Holdings, Corp. was founded in 2007. Our background is in US government contracting. In the course of our work, we continually saw that the existing Natural Language Processing (NLP) tools were not able to handle the volumes and complexities of the data they needed to process. In December of 2011 we started actively marketing our first product, Rosoka.
We designed Rosoka from the ground up to address the shortcomings in NLP.
What were some of the principles you and your team focused on?
That’s a good question. Let me boil down our approach this way.
First, the Rosoka tool suite needed to be extremely portable to work on as many platforms as possible with a single codebase. Rosoka is 100% pure Java. As long as the client’s platform has a JVM, Rosoka can run on it with no need to recompile the engine or linguistic knowledge bases.
Second, Rosoka needed to be truly multilingual. Many of the existing NLP tools claim to be multilingual, but what they mean is that they have linguistic knowledge bases usually acquired from vendors who provide dictionaries and libraries that make NLP an issue for many licensees. The idea is that these off-the-shelf components are available to process documents in a particular language other than English. Each language module is typically sold as separate, add-on functionality. Many of these third-party products may have a knowledge base that can process Chinese or Spanish documents, but they have one huge functional gap.
Okay, but what’s the gap?
Most NLP systems don't process documents that contain English and Chinese or English and Spanish. In the world of our clients, mixed language documents are important. These have to be processed as part of the normal stream, not put in an exception folder and maybe never processed or processed after a delay of hours or days.
That’s a key point. Are there other principles?
Yes, in addition, in most multilingual NLP systems, the customer needs to know before they process the document what language the document is so they can load the appropriate language-specific knowledge base.
What we did via our proprietary Rosoka algorithms was to take a multilingual look at the world. Our system automatically understands that a document may be in English or Chinese, or even English and Spanish mixed. The language angle is huge. We randomly sample the Twitter stream and have been tweeting what the top 10 languages of the week are. English varies between 35 and 45% of the tweets.
Every language that Rosoka can process is included. Our multilingual support is not sold as separate, add-on functionality.
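Editor's note: to make the mixed language point concrete, here is a minimal sketch of splitting a document into runs by writing system. It is not Rosoka's method, which is proprietary; the function names `script_of` and `script_runs` are hypothetical, and the script label is simply read off the Unicode character name. Note the limit of this shortcut: script is not language, so it separates English from Chinese but cannot separate English from Spanish, which share the Latin script.

```python
import unicodedata

def script_of(ch):
    """Coarse script label taken from the Unicode character name (an assumption,
    not a real language identifier)."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "Han"
    if name.startswith("HANGUL"):
        return "Hangul"
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "Kana"
    return name.split(" ")[0].title()  # e.g. "Latin", "Cyrillic", "Arabic"

def script_runs(text):
    """Split text into (script, substring) runs.
    Spaces and punctuation are attached to the run that precedes them."""
    runs = []
    for ch in text:
        script = script_of(ch)
        if script is None:
            if runs:
                runs[-1][1] += ch
            continue
        if runs and runs[-1][0] == script:
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, segment.strip()) for script, segment in runs]

for script, segment in script_runs("The meeting is in 北京 tomorrow"):
    print(script, "->", segment)
```

Each run could then be routed to the appropriate language resources instead of sending the whole document to an exception folder.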
That’s disruptive, right?
Yes. But we also had to make Rosoka scalable. Rosoka was designed to work on a platform as small as a smartphone, yet readily capable of processing within a cloud. We strive to provide tools that enable our clients to do something meaningful with their unstructured, natural language Big Data.
When did you personally become interested in text and content processing?
By training I am an electronics engineer. I originally worked on satellite remote sensing and image processing, lots of high end computers, mathematics and information theory.
Then my father had a stroke with dyspraxia aphasia: he could understand everyone but could not say what he wanted, though he could write things out. He was fluent in seven different languages, and each of his language capabilities was affected in the same way.
So how did you jump to content processing?
The engineer in me became interested in how the human brain worked and how people understand one another. The dyspraxia seemed to be a contradiction to every textbook theory on language as well as information theory. The parts of your brain that are used to understand language, and to speak a language are separate. That fact triggered my thinking which has directly influenced the Rosoka algorithms.
Where does IMT fit in?
When I met Greg Roberts, our CEO, whose background is in historical, socio-, and computational linguistics, we started comparing and contrasting the differences between information theory and linguistics theory.
After being the chief engineer for several very large scale programs that used NLP products, including a system used by a major US government agency, and a large law enforcement project known as the Law Enforcement National Data Exchange or N-Dex, it became evident to me and Greg that what NLP vendors said and what the NLP systems delivered were two different experiences.
We developed a Reese’s Peanut Butter Cup approach. We combined competing technologies into a single, unified and unique solution: my peanut butter (information and statistically based systems and methods) with my partner’s chocolate (linguistics-based technologies).
Can you give me a generalized example?
I will try. Most of the systems which we reviewed were not able to tell that Tom Ridge is a person, not a place. This is similar to the problem of a system figuring out whether Paris Hilton is a hotel or a person.
Greg and I decided to start IMT Holdings and offer the Rosoka system to clients who needed NLP that worked in the real world flow of high-volume information and data.
IMT has been investing in search, content processing, and text analysis for several years. What are the general areas of your research activities?
I want to answer your question, but I will have to generalize in order to avoid getting tangled in some very fascinating and sophisticated methods.
Let me say that we lump things into three time-centric categories: Current, short term, and long term.
As I said, the world is multilingual. Why should our customers constrain themselves to the languages that they know? Our current product handles over 200 languages, and we continually strive to improve and expand this capability.
Second, we developed Rosoka to respond to the client who says, “Tell me what is new, tell me what I don’t already know, and please stop telling me what I already know; it’s in the way.”
Most search engines today return documents. The systems don’t really return information that was asked for.
Big Data just makes this problem evident to anyone looking at a results list or a dashboard display of stuff. For the user, the person who needs accurate answers now, the search task switches from finding an answer to sorting through volumes of answers, which becomes overwhelming.
Can you give me an example?
You run a query on Bing or Google. You enter the query, “Who wrote Shakespeare plays.” You get back a list of over two million documents. The list of documents comes back in less than a second, but the results are all documents. You still have to read through them to get your answer.
So where’s the bacon?
Francis Bacon is one of the contenders. Google doesn’t tell you that; you have to read through the responses.
How much of that document list is duplicate documents? How much of that is duplicate information? How many of those unique documents contain information that I don’t already know? Our system answers the question based on the content processed by the system.
Wouldn’t it be nicer for the system to tell me the information and point me to the evidence?
So you are focusing the output?
Yes, and we continue to push our research into content understanding. At IMT, content understanding is a function of shared context. “Jesus loves you, man” means one thing in the Judeo-Christian context, and possibly quite another in prison.
In the penitentiary context, the phrase is disambiguated by the sentence that follows: “Jesus loves you, man. He wants to see you in his cell.”
We are funding research which expands the existing capabilities of our algorithms in order to draw out implied contextual knowledge to refine the interpretation and real meaning.
Many vendors argue that analytics, mash ups, and data fusion are “the way information retrieval will work going forward.” I am referring to structured and unstructured information. Is its consumerization opening new opportunities or distracting people from more challenging applications of IMT’s technology?
Most people don’t know what to do with unstructured text other than to do a key word search on it. Most people just do a Bing or Google search to find something. In contrast, by reading the text you can extract the information necessary to do that mash up. This is what Rosoka does.
In terms of consumerization, I think we are barely scratching the surface. If I have my smart phone and I want to know where the best place to eat is, then by looking through all of the unstructured text to see what they are saying about all of the nearby restaurants I can make a much better choice. I don’t want to look at a list of restaurants and do a click to see a menu.
I don’t actually want to read the reviews; I just want to know the information in general terms. You see this sort of thing reported on the news with the Twitter results for the debates, but why can’t a consumer get that same type of thing for the local restaurants?
Without divulging your firm’s methods or clients, will you characterize a typical use case for your firm’s search and retrieval capabilities?
Good question. We provide a very straightforward integration with our clients’ existing search and retrieval engines, rather than running on a proprietary search and retrieval engine. This allows our clients to use what they already have in place.
Can you give me an example?
Yes, one of our clients uses Cassandra Hadoop as the organization’s search engine. It took them less than 30 minutes to integrate Rosoka and make all of the names, organizations, and their relationships directly searchable. Before that they would have had to read through all of the documents and make that relationship map by hand. If there is an add-on federating system already in place, that client would have had to figure out how to integrate multilingual capability into the system and then develop a way to handle mixed language documents. The cost and complexity quickly put up a major hurdle.
Also, I want to reiterate that Rosoka is a “multilingual product for a multilingual world.” It is a multilingual world with near instantaneous communications with anyone anywhere. Since Rosoka is truly multilingual, the system deals seamlessly with mixed languages.
When someone is talking in a foreign language and they don’t know a word, they will typically say the word in their native language. Rosoka provides transliteration and entity translation capabilities, called Rosoka Entity Translation, for languages that are not written in Roman script, such as Arabic, Chinese, Farsi, Japanese, Korean and Russian. So, if a Korean document contains a name written in Korean characters, Rosoka will provide a transliteration such as Vladimir Putin. That document can now be found by entering the query Vladimir Putin instead of the Korean query.
Furthermore, Rosoka gives a user salience.
What do you mean with the term salience?
Salience in Rosoka means how important a term is to the document based on the linguistic usage in the document.
Is this term count?
No, I don’t mean term count. If I wanted to learn about a candidate, say the mayor, I want the articles about the mayor. But I don’t want the documents which say, “The mayor had no comment.”
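Editor's note: the difference between salience and a raw term count can be illustrated with a toy score. The heuristic below is my own stand-in, not the linguistic measure Mr. Sorah describes: the function `toy_salience` and its 0.7/0.3 weights are hypothetical. It scores a term by the share of sentences that mention it, with extra weight when the opening sentence mentions it.

```python
import re

def sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def toy_salience(term, text):
    """Hypothetical score between 0 and 1: share of sentences mentioning the
    term, plus a bonus when the first sentence mentions it."""
    sents = sentences(text)
    if not sents:
        return 0.0
    needle = term.lower()
    hits = [needle in s.lower() for s in sents]
    share = sum(hits) / len(sents)
    lead = 1.0 if hits[0] else 0.0
    return 0.7 * share + 0.3 * lead

about_mayor = ("The mayor proposed a new budget. The mayor said schools come first. "
               "Council members will vote on the mayor's plan next week.")
passing_mention = ("The stadium deal collapsed on Friday. Investors blamed rising costs. "
                   "The team may relocate. The mayor had no comment.")

print(toy_salience("mayor", about_mayor) > toy_salience("mayor", passing_mention))
```

Both documents match a key word search for “mayor,” but only the first is about the mayor, and even this crude score separates them.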
What are the benefits to an organization working with your firm?
Greg Roberts, IMT’s CEO, tells the story about when he first started out as a linguistic intern for a small start-up company. His manager posted a sign outside his door that said “Infinity minus 1,000.” Greg finally worked up the courage to ask his boss what “Infinity minus 1,000” meant.
The manager told him, “Language is infinite with infinite possibilities. With infinite possibilities there are infinite possibilities. If you write 1000 linguistic rules, you have written infinity minus a thousand...now get back to work.”
What he meant was make linguistic work count. That has been a guiding principle for our company.
And the company name IMT?
You have got it. We named our company IMT (Infinity Minus a Thousand). When you buy our products you are also buying a support subscription. Language isn’t static; it is constantly changing.
How old is Twitter? How old is l33t speech? What does “pwn3d” or “Pr0n” mean, and can you find those in a dictionary? Our support mechanism and customer service are structured to supply our customers with continual updating. These are continuous updates that we provide, as well as incorporation of feedback from our end users.
Because of our underlying technology, we are able to provide true multilingual capability for what our competitors charge for a single language. Our customers don’t need to provide separate resources for each separate language.
How does an information retrieval engagement with your firm move through its life cycle?
Our approach is “mash palm here” for everything out of the box, but you have full access to all of the knobs, dials, switches.
You can run the installer and be up and running in minutes. There are some very common tasks that different customers have different preferences for, and we make these straightforward and easy to use.
Then there is tailoring to a specific problem domain. We provide without extra charge tools to import lists and other information that a customer typically already has that they want to take advantage of, their own value add.
Adding words to the dictionaries can be as simple as highlighting the word in the GUI and getting the popup for inputting the definition. So far these are things a subject matter expert would do without really bothering to read the manual. But if you want to write your own linguistic rules, pattern matching and so forth, we provide training on how to do that, or we can provide that as a service.
One challenge to those involved with squeezing useful elements from large volumes of content is the volume of content and the rate of change in existing content objects. What does your firm provide to customers to help them deal with the volume problem?
Our Rosoka products were designed and architected from day one to deal with “Big Data.” We run multi-threaded instances within a Java virtual machine instance. We support coordinated multi-instances within processing clusters or clouds. In short, we can run as many nodes as needed to get the throughput and latency needed and we provide that sizing information for our clients.
Comparing our performance to competing products, we provide the language ID, entity extraction, language gloss, and relationship mapping all in about one fourth the time as our nearest competitor does just one of those steps.
On my laptop the bottleneck is spindle speed and the operating system’s file system. Processing takes the same amount of time it takes to read a document off the disk and write it back to disk, with a 0.1 second latency. Rosoka goes as fast as the operating system’s copy command.
But that is content throughput. What a difference if you can get the facts that you are after and read through the supporting evidence at your leisure? That changes the equations for dealing with information.
Another challenge, particularly in commercial competitive intelligence operations, is moving data from point A to point B; that is, information enters a system but it must be made available to an individual who needs that information or at least must know about the information. What does your firm offer licensees to address this issue of content "push", report generation, and personalization within a work flow?
We even provide direct access through the API to the data objects so your applications don’t need to waste time marshaling and unmarshaling the data objects. We want you to be able to use the products that you have already paid for, or whatever your favorite application is. Push the data into your spreadsheet or into your MarkLogic Server, Cassandra database, or Hadoop repository.
There has been a surge in interest in putting “everything” in a repository and then manipulating the indexes to the information in the repository. On the surface, this seems to be gaining traction because network resident information can “disappear” or become unavailable. What’s your view of the repository versus non repository approach to content processing?
That question is all marketing. For years the software development industry has been using local sandboxes and distributed, synced source code repositories, giving local copies, repository copies, and disaster copies. It is all originator and modifier tagged. On top of that there are privacy, proprietary, and other reasons not to want to place data in one place or another. Those model choices are also very document/content thinking rather than information thinking.
Entity extraction and other advanced text operations have been a great addition to briefings. On the other hand, some of the outputs can be a problem to those in stressful operational situations due to information overload. What’s your firm’s approach to presenting “outputs” for end user reuse or for mobile access?
Entity relationships are the important factor. But entity extraction by itself is just glorified highlighting, and doesn’t do much more than let a reader scan through a document faster. A name by itself is useless. A phone number by itself is useless. A phone number associated to a person starts to be useful.
An entity relationship map allows a user to visualize how things are related. That is a huge step forward. Now only show me what changed in this graph since the last time I looked at it. With this function, the overload just decreased a lot.
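Editor's note: the “only show me what changed” idea reduces to a set difference between two snapshots of the relationship map. The sketch below is illustrative only; the edge representation, the relation labels, and the names `edge` and `diff_graph` are assumptions of mine, not Rosoka's data model.

```python
def edge(a, b, relation):
    """Undirected relationship, stored with its endpoints in a stable order
    so that (a, b) and (b, a) compare equal."""
    return (*sorted((a, b)), relation)

def diff_graph(previous, current):
    """Return (added, removed) edges between two snapshots, each a set of edges."""
    return current - previous, previous - current

last_look = {
    edge("Alice Smith", "Acme Corp", "works_for"),
    edge("Alice Smith", "555-0100", "phone"),
}
today = {
    edge("Acme Corp", "Alice Smith", "works_for"),   # unchanged, different order
    edge("Alice Smith", "Bob Jones", "met_with"),    # new since last look
}

added, removed = diff_graph(last_look, today)
print("added:", added)
print("removed:", removed)
```

The unchanged relationship never reaches the user, which is the reduction in overload the answer describes.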
Now I can ask where that information came from. Since Rosoka provides a measure of importance (salience) of each entity, the documents where each entity is more important can be prioritized to be looked at first. The document that a user would want to look at first would be the one where both related entities are important to the document. Round trip, Rosoka provides a drastic reduction in the amount of noise that a user out in the field would have to wade through. Our approach allows the user to focus on the important information.
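Editor's note: one simple way to express “look first at the document where both related entities are important” is to rank documents by the lower of the two salience scores. This is a sketch under my own assumptions: the 0 to 1 score range, the input layout, and the function `rank_for_pair` are hypothetical and do not describe how Rosoka computes or exposes salience.

```python
def rank_for_pair(doc_salience, entity_a, entity_b):
    """Order document ids so documents where BOTH entities are salient come first.
    doc_salience maps doc id -> {entity: salience}; a missing entity counts as 0."""
    def weakest_link(doc_id):
        scores = doc_salience[doc_id]
        return min(scores.get(entity_a, 0.0), scores.get(entity_b, 0.0))
    return sorted(doc_salience, key=weakest_link, reverse=True)

docs = {
    "report-1": {"Alice Smith": 0.9, "Acme Corp": 0.1},  # Acme only in passing
    "report-2": {"Alice Smith": 0.6, "Acme Corp": 0.7},  # both matter here
    "report-3": {"Alice Smith": 0.8},                    # Acme never mentioned
}

print(rank_for_pair(docs, "Alice Smith", "Acme Corp"))
```

Taking the minimum rather than the sum keeps a document that is all about one entity, with the other mentioned in passing, from outranking one where both matter.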
I am on the fence about the merging of retrieval within other applications. What’s your take on the “new” method which some people describe as “search enabled applications?”
Search enabled applications are a great thing. It allows tailoring of results to individual users. Unfortunately, when most product vendors talk about that, they want to sell an entire system. That can be a lot of replication and expense for things you already have.
Our approach is to be a component in what you already have, or be one of the applications available in your search application. On the other hand there are companies that sell Software as a Service. Rosoka provides data enrichment for these search services.
There seems to be a popular perception that the world will be doing computing via iPad devices and mobile phones. My concern is that serious computing infrastructures are needed and that users are “cut off” from access to more robust systems. How does your firm see the computing world over the next 12 to 18 months?
I think looking forward is best done by looking at history, for instance, Moore’s Law. From the beginning of time until 2003 the world generated an estimated five billion gigabytes of data. By next year the world will generate that much data every 10 minutes. My cell phone has more memory on it than existed in the world when I wrote my first program.
Our product development strategy has been to engineer for ubiquitous data and compute, any time and everywhere. That portion of our strategy will remain the same. As we expand, you will see more end user interfaces to take advantage of that strategy.
Put on your wizard hat. What are the three most significant trends that you see affecting your search business?
I am monitoring continued growth in globalized communications. Watching the software search industry today is like watching the traditional manufacturing companies in the late 70’s. The search industry must get multilingual. Some are, some aren’t. Some are just doing it very slowly and inadequately.
Where does a reader get more information about your firm?
What strikes us is Rosoka’s ability to handle multiple languages and documents which contain a mixture of languages. The company embraces an open approach so that a customer can integrate or add additional functionality to a Rosoka-enhanced system. The system’s ability to generate related entity information is an important functionality. If your organization is struggling with Big Data and making sense of a wide range of data and information, you will want to learn more about Rosoka.
Stephen E. Arnold, November 21, 2012