An Interview with Andrew Smith
Australia has been an innovation leader in search and content processing. In addition to the large Google presence in Australia, the country has nurtured the search technology used by Lexmark. Australian universities have been leaders in allied fields as well. I recall seeing a demonstration of three-dimensional visualization at Australian National University several years ago. For many years, I have monitored Leximancer.
I interviewed the founder on July 5, 2013. The full text of the interview appears below:
What's the history of Leximancer?
Leximancer is a content analysis and exploration tool that began life as my research at The University of Queensland, in Australia.
My intention was to address the problem of information awareness. My sense was that the gulf between the quantity of available information and the actual human awareness, integration, and understanding of this information is a serious and insidious threat. Certainly we address the problem of not knowing the best search terms to use in a given context, but we also address the problem of not even knowing what questions can or should be asked.
What is Leximancer’s focus?
Leximancer is designed to show users the information landscape to raise awareness of the space of available knowledge, and enable them to generate and explore hypotheses.
The system is almost entirely data driven, so that the 'ontology' emerges from the data and so is faithful to the data.
Is Leximancer a large firm?
We are a small, science-based company, so our value is represented by our IP to a large extent. We could not protect investor value with an open source model, particularly given the hostile attitude of some patent offices toward software patents these days.
When did you become interested in text and content processing?
I was working in a university-based ISP, and I had to mine and integrate multiple semi-structured log files. It was really impossible to do this in a structured data way, so I thought that there must be a deeper order which could be found. My physics background probably led to this viewpoint.
When was this?
This was 1999, and the rise of the internet led me to the conclusion that the amount of information would grow exponentially, and patterns of meaning would be latent in that data. Some of these patterns would be crucial if human civilization is to navigate the changes to come. But humans have limited memory, time, and cognition, so these critical patterns of meaning might be missed by the people who needed to know.
So I started my post-grad research in information science, with a great advisor - Professor Robert Colomb. My goal from the start was to create a practical system for doing a kind of spectrum analysis on large collections of unstructured data, in a language independent and emergent manner.
What are the general areas of your research activities?
Our research premise is that language adapts to the environment and needs of the people using it. This is informed by modern language research in the fields of cognitive linguistics, corpus linguistics, and cultural linguistics.
Would you elaborate?
In simple terms, this means that word meanings, word combinations, and even syntax, evolve and adapt to the situation and social group in which they are used. This has become very apparent with the huge diversity and rapid evolution in social media channels.
So our technology is designed to rapidly discover concept thesauri and hierarchical concept networks from the data to be analyzed. This provides accurate information mapping and thesaurus driven retrieval which can usually achieve simultaneous precision and recall of over 80% with no supervised training.
Conceptual models can be discovered entirely automatically to characterize the supplied corpus, or can be 'seeded' so the user can profile desired topics.
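For readers unfamiliar with the precision and recall figures mentioned above, here is a minimal sketch of how the two measures are computed for thesaurus-driven retrieval. It is not Leximancer's algorithm: the documents, the thesaurus terms, the relevance judgments, and the function names are all hypothetical and chosen only for illustration.

```python
def retrieve(docs, thesaurus_terms):
    """Return the ids of documents that contain at least one thesaurus term."""
    terms = {t.lower() for t in thesaurus_terms}
    return {doc_id for doc_id, text in docs.items()
            if terms & set(text.lower().split())}


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy corpus and human relevance judgments for a "power" concept.
docs = {
    1: "the battery drains fast",
    2: "charger stopped working after a week",
    3: "great screen and fast delivery",
    4: "power cable frayed quickly",
}
relevant = {1, 2, 4}

# Hypothetical thesaurus for the concept; "fast" is deliberately ambiguous.
thesaurus = ["battery", "charger", "power", "fast"]

retrieved = retrieve(docs, thesaurus)
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the ambiguous term "fast" pulls in document 3, so recall is perfect while precision drops. Achieving both above 80% at once means the thesaurus has to be broad enough to catch every relevant record without admitting many such false matches.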
What’s the payoff for a user?
Leximancer is a very good tool for text analysis by analysts, consultants, and researchers. It is not primarily a search engine, and is certainly not an enterprise search solution, though it is used as a component of such. Our system adds value to content.
Many vendors argue that mash-ups and data fusion are "the way information retrieval will work going forward." How does Leximancer perceive this blend of search and user-accessible outputs?
We are neutral about mash-ups. My feeling is that modern interfaces must be aesthetic and must be flexible so that they can be changed to suit the situation, and multi-media content requires some sort of fusion.
What we have seen is that many users are not prepared to think hard enough to understand complex or statistical truths - they are looking for a plausible story or anecdote from the data, even if it is not representative (and just reading Kahneman's Thinking, Fast and Slow explains why). I think this is a danger in some interfaces for any user who is doing serious search/research/analysis. With our new product under development, we are designing to achieve both. We take care of statistical validity, and present the user with an attractive mash-up which is nevertheless a valid representation of the data.
Without divulging your firm's methods or clients, will you characterize a typical use case for your firm's system capabilities?
Leximancer is a tool for text analysis by analysts, consultants, and researchers. It is used most of the time for analyzing surveys, interviews, stakeholder submissions, social media extracts, research article and patent corpora, engineering documentation, policy documents, inquiry transcripts, etc. It is not primarily a search engine, and is certainly not an enterprise search solution, though it is used as a component of such.
In the latter case, the Leximancer engine is used for content search by connecting Leximancer to a third-party document retrieval platform. The documents retrieved by the user are mapped in near-real time by Leximancer and the user gets a satisfying and very productive surfing and exploration experience of the content. This really helps to rapidly get answers to the questions you should have asked.
What are the benefits to a commercial organization or a government agency when working with your firm?
We supply a very rapid, robust, and objective tool for visualizing, exploring, and quantifying bodies of text data. The desktop version is very cost effective and easy to install. Our text analysis science is still one of the few concept-based approaches in the industry which does not rely on supervised training or predefined rules of syntax, morphology, lexicon, or ontology, and the advantages of using Leximancer become very apparent when analyzing informal, diverse, and dynamic social media data.
I cannot currently think of any other commercial automatic text analysis system whose output model has been cross-validated in the scientific literature.
How does an engagement with your firm move through its life cycle?
Generally we sell text analysis software rather than retrieval. We spend time with the prospective customer taking them through analyses of their data, from scratch. We can do this in minutes using no expensive consulting effort to construct rules or supervised training, so we do. After purchase, we will then provide several days of training in the use of Leximancer, so the client can fly solo.
One challenge to those involved with squeezing useful elements from large volumes of content is the volume of content AND the rate of change in existing content objects. What does your firm provide to customers to help them deal with the volume problem?
Leximancer as offered now is a forensic tool, so it is up to the user to load a data corpus. As a search plug-in, obviously whatever third-party document base is employed handles updates.
Another challenge, particularly in professional intelligence operations, is moving data from point A to point B; that is, information enters a system but it must be made available to an individual who needs that information?
Leximancer does have quantitative and visual exports that form whole reports or reporting components. The SAAS version also offers sharable links to maps, so users can share and publish their analyses and indexes. But out of the box, Leximancer is an analyst support tool, not a workflow system. Leximancer can be integrated into or used with other systems; for example, open source data management or search tools.
There has been a surge in interest in putting "everything" in a repository and then manipulating the indexes to the information in the repository. On the surface, this seems to be gaining traction because network resident information can "disappear" or become unavailable. What's your view of the repository versus non repository approach to content processing?
Though Leximancer is generally supplied as a desktop or SAAS tool requiring data sets to be uploaded for analysis, the engine has a data abstraction layer which can be mapped to a document base, such as MongoDB or ElasticSearch. This makes it easy to integrate Leximancer with a repository to enable concept mapping and sub-query of retrieved data sets. In this deployment, the concept model found by Leximancer can also be used to re-query the repository so the user can refine their starting query and 'pan' across the semantic space.
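The idea of a data abstraction layer plus re-querying can be sketched roughly as follows. This is an illustration under stated assumptions, not Leximancer's interface: `DocumentSource`, `InMemorySource`, and `related_terms` are hypothetical names, the in-memory list stands in for a real document base, and simple co-occurrence counting stands in for a discovered concept model.

```python
from collections import Counter
from typing import Iterable, Protocol


class DocumentSource(Protocol):
    """Hypothetical abstraction: any backend that can answer a term query."""

    def search(self, terms: Iterable[str]) -> list[str]: ...


class InMemorySource:
    """Stand-in for a real repository; matches documents on any query term."""

    def __init__(self, docs: list[str]):
        self.docs = docs

    def search(self, terms: Iterable[str]) -> list[str]:
        wanted = {t.lower() for t in terms}
        return [d for d in self.docs if wanted & set(d.lower().split())]


STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}


def related_terms(docs: list[str], query_terms: Iterable[str], k: int = 2) -> list[str]:
    """Rank non-query terms by how many retrieved documents contain them."""
    query = {t.lower() for t in query_terms}
    counts = Counter(
        w for d in docs for w in set(d.lower().split())
        if w not in query and w not in STOPWORDS
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ranked[:k]]


source: DocumentSource = InMemorySource([
    "pump failure in the cooling loop",
    "cooling loop pressure alarm",
    "pump failure report filed",
    "budget meeting notes",
])

query = ["pump"]
first_pass = source.search(query)            # initial retrieval
expansion = related_terms(first_pass, query)  # terms found in the results
second_pass = source.search(query + expansion)  # re-query to 'pan' outward
print(expansion, len(first_pass), len(second_pass))
```

Because the analysis code depends only on the `search` method, the in-memory stand-in could in principle be swapped for a wrapper around a real repository without changing the rest. The second pass reaches the pressure alarm document, which never mentions the original query term.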
Visualization has been a great addition to briefings. On the other hand, visualization and other graphic eye candy can be a problem to those in stressful operational situations. What's your firm's approach to presenting "outputs" for end user reuse or for mobile access?
Leximancer offers a concept map output for visualization, a dashboard output for a more quantitative view (but still with drilldown to the source content, what I call “verbatims”), and a large set of data exports. The dashboard and the map can both be viewed on a tablet.
I am on the fence about the merging of retrieval within other applications like legacy enterprise applications or new methods such as those used for text mining of social media. What's your take on the "new" method which some people describe as "search enabled applications"?
As described above, we have enabled Leximancer to be bolted on to existing retrieval systems, but our focus is not on enterprise applications.
I do believe that most if not all current search technologies are not suitable for social media, or most fire hoses of weakly structured complex data such as system or transaction logs. My reasons are these:
- Each data record is only a fragment of some unfolding story, and cannot be understood in isolation, and contains few if any of the obvious keywords for its thread.
- Multiple stories are being played out simultaneously in different time scales, and the fragments of each story are intermixed in the fire hose.
- Terms that make up the data items can mean different things in different contexts, or different terms can mean the same things in some contexts.
- New data terms can appear at any time.
I don’t want to go into detail, but my team and I are developing a search platform specifically designed for this sort of data.
There seems to be a popular perception that the world will be doing computing via iPad devices and mobile phones. My concern is that serious computing infrastructures are needed and that users are "cut off" from access to more robust systems? How does your firm see the computing world over the next 12 to 18 months?
Well, I guess the consumer market will go mobile. Maybe professionals will use lightweight apps out of the office.
Of course, SaaS and cloud infrastructure can disappear anyway if the network or a server farm goes black, as has happened.
Put on your wizard hat. What are the three most significant technologies that you see affecting your search business?
That’s a difficult question. I would highlight mobile messaging, data storage aggregation, and content marketing as areas warranting some attention.
Where does a reader get more information about your firm?
I suggest visiting our Web site at www.leximancer.com. There is a contact form available, and we respond promptly to inquiries.
Leximancer is a content analysis system which produces useful outputs in text and graphic form. These outputs allow an analyst or a researcher to identify particular areas which warrant further analysis and investigation. The system is worth a close look. A trial is available.
Stephen E. Arnold, July 23, 2013