An Interview with Dr. Russ Couturier
As traditional vendors of information retrieval systems scramble to reinvent themselves, specialist firms have an advantage. The multi-purpose Swiss Army knife approach is often expensive to deploy, difficult to customize, and time consuming to keep in tip top shape.
A few days ago, I had an opportunity to learn about a number of next generation systems. After the speed dating presentations, I noted that one firm distinguished itself, Cybertap. I circled back with that company and arranged a more thorough conversation. Cybertap not only snagged my interest, but it had caught the attention of a number of people whom I consider my peers.
Cybertap is a privately held company that has products deployed in the public and private sectors worldwide. Cybertap’s Recon product applies big data analytics to captured network traffic to give organizations unparalleled visibility into what is transpiring both on and to their networks. This visibility means a Cybertap licensee can see the hot topics being discussed and the general mood of individuals. The system provides a way to learn what plans people are making and how sensitive information is being handled. For organizations sensitive to their intellectual property, the Cybertap system provides information about IP and possible improper actions by employees or contractors.
In my experience, few products offer this type of pragmatic insight without the costs and complexities of traditional systems built by cobbling together different vendors’ products. In today’s cost sensitive environment, the overhead required to baby-sit a Rube Goldberg contraption is out of reach of most commercial and governmental organizations.
In the briefing, it struck me that Cybertap can support a wide range of applications and case situations. I noted that Cybertap’s system can support tasks ranging from lawful intercept to military and intelligence work.
My take away from the briefing is that Cybertap’s Recon product starts with raw network traffic, reconstitutes it back to its original human-facing form, and indexes the searchable content. The content processing tags attributes, metadata, and protocol data. An analyst can then surgically extract just the content that is relevant to their purposes. The system can handle high volume network traffic.
The system users can search hundreds of terabytes of network traffic to find items entering, traversing, or leaving a monitored network. Recon also identifies and correlates the cyber-personas – e-mail addresses, MAC addresses, chat ids, VoIP phone numbers, etc. – of the actors involved in the captured network transactions. The relationships among those actors and personas are made visible and clear.
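To make the persona correlation idea concrete, here is a minimal Python sketch. It is an illustration only, not Cybertap's method, which the briefing did not detail. It assumes each captured transaction has already been reduced to a list of identifier strings, and it groups identifiers that co-occur using a union-find structure. The function name and the sample identifiers are invented for the example.

```python
from collections import defaultdict


def correlate_personas(transactions):
    """Group identifiers that co-occur in any transaction (hypothetical sketch)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for identifiers in transactions:
        if not identifiers:
            continue
        first = identifiers[0]
        find(first)  # register single-identifier transactions too
        for other in identifiers[1:]:
            union(first, other)

    groups = defaultdict(set)
    for ident in parent:
        groups[find(ident)].add(ident)
    # Sort for a stable, readable result
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Invented sample data: two transactions share a MAC address
sample = [
    ["alice@example.com", "mac:00:11"],
    ["mac:00:11", "chat:alice99"],
    ["bob@example.com", "voip:5551234"],
]
personas = correlate_personas(sample)
```

In this sample the e-mail address, MAC address, and chat id end up in one group because the MAC address links the first two transactions.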
Cybertap arranged for me to speak with Cybertap’s Chief Technology Officer and product founder, Dr. Russ Couturier. The full text of my interview with him appears below.
What’s the background for Cybertap?
I was the founder of Synthetic Networks and Imperfect Networks. These companies dealt with multi-protocol traffic generation at the application level and zero day threat testing. The knowledge we gained in protocol analysis and manipulation helped lay the foundation for Cybertap.
What problem did you set out to solve with Cybertap Recon?
Every corporation and government agency has networks that transport massive amounts of data. Most of this traffic is good, mission-oriented data, but some is not. Hidden within the network traffic are malicious attacks, personal and medical information leaks, and insider theft of intellectual property and financial information. Our clients use Recon to keep tabs on the good and the bad being done on their networks and who’s doing it, so that they can take the proper actions to mitigate any damage and bring the individuals to account.
What’s your background?
I got interested in content challenges when I was studying for my PhD at the University of Massachusetts, Amherst.
That’s one of the leading content and analytics centers, isn’t it?
Yes. I learned a great deal. I realized that the ability to apply search engine technology and medical research methods to network forensics and big data analysis was very exciting to me. Looking back now, Cybertap was a natural evolution of this and the experience I gained throughout my career.
What’s the big idea behind Recon?
That’s a tough question. Let me come at it this way: For Recon, we developed inspectors for conventional network protocols as well as for emerging and proprietary Web 2.0 applications.
Once we had that problem in hand, we then developed techniques for indexing massive amounts of information within a search engine construct.
Perhaps one useful idea was that we recognized some natural parallels with medical research. Exploiting that field’s lessons learned and solutions sped up Recon’s ability to search massive amounts of data instantly and to display this data powerfully for rapid comprehension.
What’s the distinguishing feature of Recon?
Let me come at that question by highlighting what I call a “fundamental idea.” Recon gets exceptional comprehensiveness and power from its indexing method. As a result, the ease and speed with which it can extract exactly what the analyst is looking for are excellent.
Other products may be able to find strings of characters embedded in small amounts of captured network files. Recon indexes everything contained within the captured network traffic.
What do you mean by “everything”?
Recon processes content, embedded files, attachments, attributes, network protocol data, metadata, and entities. And our system indexes every bit. Using a standard search engine interface, analysts can quickly search all of the captured network traffic to find precisely what they are looking for. Once found, Recon presents the information as it was originally seen so analysts can follow conversations and threads in context.
Does “context” mean the conversation and supporting items of information?
Exactly. So now our engineers are refining our methods for helping the analyst ignore irrelevant data.
Our clients are typically looking for a few needles in haystacks that are extremely large in size. We are incorporating semantic analysis tools to “roll-up” large volumes of data into what we call “themes” and “topics.”
This aggregation enables researchers to more quickly decide whether information is relevant or not.
Are there other research areas at Cybertap?
Yes, we have some interesting projects involving what we describe as “pre-categorizing traffic”. The idea is to take a Big Data flow and create “collections” of data and information that can be explicitly filtered.
Can you give me an example?
Yes, an analyst may not be interested in advertising, shopping, and religious data sets, but is interested in social networking, banking, and electronic trading. Our filters reduce the traffic, and thus the workload for the analysts considerably.
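A filter of this kind can be pictured with a few lines of Python. This is a sketch under stated assumptions, not Recon's filtering code: it assumes each record already carries a category label, and the category names and record layout are hypothetical.

```python
# Hypothetical category labels an analyst has chosen to ignore
EXCLUDED_CATEGORIES = {"advertising", "shopping", "religion"}


def filter_collections(records, excluded=EXCLUDED_CATEGORIES):
    """Keep only records whose pre-assigned category is not excluded."""
    return [r for r in records if r["category"] not in excluded]


# Invented sample traffic records
traffic = [
    {"id": 1, "category": "advertising"},
    {"id": 2, "category": "banking"},
    {"id": 3, "category": "social networking"},
    {"id": 4, "category": "shopping"},
]
kept = filter_collections(traffic)
```

Here half of the sample records drop out before an analyst ever sees them, which is the workload reduction described above.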
How does Cybertap perceive mash ups and fusion instead of a Google style results list?
In terms of search, our clients’ data sets are moving towards unstructured formats. Any unstructured data can be easily searched using a search engine, but the same is not true when using a database to search unstructured data. Therefore, the search of unstructured data is key to our technology.
How does Cybertap bridge this gap between unstructured and structured information?
Structured data by its definition is categorized; that is, column definitions have meaning for the data. Additionally, structured data allows relationship analysis between the structures, a very difficult task for search engines.
Cybertap derives relationship (structured) data from unstructured data to add meaningful collections to the content.
With that said, I believe that technologies that derive structure from unstructured content for access by both a search engine and SQL are going to be the norm in the foreseeable future.
And mash ups and fusion?
Mash ups and data fusion are crucial when dealing with big data. The key enablers are non-proprietary data standards and open XML interfaces. And let me give an example. Recon reconstitutes voice over IP telephone conversations and saves the result as a wav file that an analyst can listen to. Since Recon’s APIs are open, one of our clients mashed up Recon with third party speech-to-text and voice recognition technologies.
The mash up first passes the Recon wav files to the speech-to-text software and then to the voice recognition software. The system then passes the text file derived from the speech back to Recon for indexing and the identities of any voice-recognized participants back to Recon for correlation with Recon’s component called ePersona.
Our approach is powerful in my opinion. An analyst can use Recon’s search engine to find information from any telephone call, and possibly tie it to everyone involved. We have had similar success in fusing information provided by Recon with datasets of the geographic coordinates of IP addresses. Non-proprietary standards and open APIs provide incredible options to integrators seeking to create custom, high-functionality solutions.
Without divulging your firm’s methods or clients, will you characterize a typical use case for your firm’s search and retrieval capabilities?
Okay, let’s start with an alarm from a client’s data loss prevention tool that indicates a series of credit card numbers appear to be leaving the client’s enterprise. The analyst immediately takes steps to stop the flow of financial information and then employs Recon to find out what happened.
By querying on one of the credit card numbers, Recon displays the e-mail, chat or other type of network transaction that carried the credit card number. The metadata associated with the e-mail will identify the computer where the information came from and its IP and Media Access Control addresses.
At this point the investigation might jump to the individual who was logged into the computer or whose e-mail account was used to send the credit card number. The analyst then queries Recon to identify every communication to and from that computer or user for the last few days.
Via Recon, the analyst can step back in time. He can see the Web sites the computer user visited and individuals he communicated with during that period. With this information, the organization can put a stop to the behavior and also stem the spread of that same behavior from individual to individual.
What are some other advantages of Cybertap’s system?
There are three advantages based on feedback I have from our customers. Our licensees say that Recon can comprehensively reconstitute network traffic, extract and index its content and deliver powerful and easy-to-use tools to analysts to find what they are looking for.
Another is Recon’s ability to perform at sub-second response times while scaling to petabyte repositories.
Finally, we use an open architecture that readily enables the composition of custom solutions that include Recon in conjunction with other best-of-breed third party technologies and home-grown tools.
What are the benefits of working with Cybertap?
That’s my favorite question. We are an agile startup with outstanding technology. This means we can respond very quickly to those organizations with opportunities, resources, and solutions. Let me give an example.
Our search engine technology finds strings of characters in the information we have extracted from the network. The characters can represent the English alphabet, Cyrillic, Arabic, Farsi, and others. Based on client requests, we have tuned Recon to not only support other languages, but also understand the proprietary protocols of Web 2.0 applications written in those languages.
How does an information retrieval engagement with your firm move through its life cycle?
Cybertap is strictly a software vendor and we work with clients or their integrator partners to customize end-to-end solutions. We make sure the touch points between Recon and other systems work. Our clients can choose how best to employ Recon within their architecture.
No rip and replace?
Right. With Recon the licensee can use what is already in operation and add other systems without worrying about integration with Recon.
We have strategic relationships with major network packet capture and enterprise storage providers. We ensure the touch points between Recon and external technologies, such as packet capture, network traffic inspection and alerting, and other content analysis tools, are as open, functional, and scalable as possible. As a result, we work in partnership with industry leaders to develop solutions that best meet our clients’ needs.
What does your firm provide to clients to help them deal with the Big Data problem?
Search engine technology indexes unstructured data and allows researchers to make meaningful decisions. It also allows adding new information on top of previously archived information. We are incorporating big data analysis techniques as I described earlier to reduce the meaningless data and quantify the meaningful information using categorization, semantic, and sentiment tools.
What is the method?
I can answer that at a high level if that is okay?
Recon runs through two major steps prior to making information available to the user. The first is to identify and extract the payloads of every network transaction.
The second step is to index the textual characters and the metadata of every file that Recon extracted.
How quickly information becomes available to the analyst depends on the data rate of the network connection, the density of the information flowing through the connection, and the processing power of the servers that are running Recon. We design client systems to provide about one minute of latency between the time when Recon receives the network flow and the time the information is available to the analyst.
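The indexing step described above can be pictured as building an inverted index over both text and metadata. The Python below is a minimal sketch, not Recon's indexing code: it assumes payloads have already been extracted into documents with a `text` field and a `meta` dictionary, and the field names, tokenizer, and sample messages are all hypothetical.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[0-9A-Za-z]+")


def build_index(documents):
    """Map each text token and each 'key:value' metadata pair to document ids."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for token in TOKEN.findall(doc["text"].lower()):
            index[token].add(doc_id)
        for key, value in doc["meta"].items():
            index[f"{key}:{value}".lower()].add(doc_id)
    return index


def search(index, *terms):
    """Return ids of documents that match every term (AND query)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Invented sample of extracted payloads
extracted = {
    "msg1": {
        "text": "Card number 4111111111111111 attached",
        "meta": {"src_ip": "10.0.0.5", "protocol": "smtp"},
    },
    "msg2": {
        "text": "Lunch at noon?",
        "meta": {"src_ip": "10.0.0.5", "protocol": "xmpp"},
    },
}
idx = build_index(extracted)
card_hits = search(idx, "4111111111111111")
host_hits = search(idx, "src_ip:10.0.0.5")
```

A query on a card number finds the message that carried it, and a follow-up query on the source address pulls in everything else from the same machine.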
What about the latency within Recon?
The beauty of Recon is that it extracts the information from the underlying data and indexes the information. Although Recon maintains a pointer to the file where the information was extracted from, it does not need the original data to operate.
The benefit of our approach is that it significantly reduces the amount of storage required for Recon’s forensic repository to maintain months or years of records. It also allows us to maintain near-linear performance as the number of documents scales.
A second point is that Recon does not move data or information from one point to another.
What do you mean?
Recon creates a forensic repository and index at the point where network traffic is captured. In cases where a client has several network gateways, Recon can create a forensic repository at each location, and then federate the distributed indexes together so they appear and operate as a single virtual instance. (This technique is how Internet search engines can scale to index the entire World Wide Web.)
Also Recon has an embedded Web host that analysts access via a browser. Since our interface is identical to market leading Internet search engines, analysts require little training to find what they are looking for. Once an analyst finds what he is seeking, only the specific information selected by the analyst is sent from the forensic repositories to his browser.
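Federating distributed indexes can be sketched as querying each site's index and merging the ranked hits into one list. The Python below is an illustration under stated assumptions, not Recon's federation logic: the per-site index layout, the relevance scores, and the gateway names are hypothetical.

```python
import heapq


def federated_search(site_indexes, term, limit=10):
    """Query every site's index and merge hits into one ranked list.

    site_indexes maps a site name to {term: [(score, doc_id), ...]}.
    Only the merged top hits are returned, so little data has to move.
    """
    hits = []
    for site, index in site_indexes.items():
        for score, doc_id in index.get(term, []):
            hits.append((score, site, doc_id))
    return heapq.nlargest(limit, hits)


# Invented per-gateway indexes with made-up relevance scores
sites = {
    "gateway-a": {"invoice": [(0.9, "a1"), (0.4, "a2")]},
    "gateway-b": {"invoice": [(0.7, "b1")]},
}
top_two = federated_search(sites, "invoice", limit=2)
```

The caller sees a single result list even though the underlying repositories stay where the traffic was captured.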
What’s your view of the repository versus non-repository approach to content processing?
To me, the repository approach is not binary. I see a continuum of indexing ranging from data that is not important and does not need to be indexed to information that must be indexed in real time. That is, store everything and index the relevant. Then create larger scale collections by categorizing or taking vertical slices through the repository. For example, this data came from e-mail in the customer service department or HR department. Other data came from e-commerce web sites or political campaigns. Depending upon the analytics, these collections may be indexed or not. Some collections (insider threats, for example) may be indexed in real time. I believe there is a continuum spanning from unimportant (no indexing), to categorization (important for a few weeks, so index when you have time), to real time (important now, index now).
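The continuum described above can be expressed as a per-collection indexing policy. The Python below is a sketch only: the three tiers follow the answer, but the collection names and the default are assumptions made for illustration.

```python
from enum import Enum


class IndexPolicy(Enum):
    NONE = "store only, do not index"
    DEFERRED = "index when capacity allows"
    REAL_TIME = "index on arrival"


# Hypothetical collection names; a real deployment would define its own.
POLICY_BY_COLLECTION = {
    "insider_threat": IndexPolicy.REAL_TIME,
    "customer_service_email": IndexPolicy.DEFERRED,
    "advertising": IndexPolicy.NONE,
}


def policy_for(collection):
    """Everything is stored; unknown collections default to no indexing."""
    return POLICY_BY_COLLECTION.get(collection, IndexPolicy.NONE)
```

Defaulting unknown collections to `IndexPolicy.NONE` mirrors the "store everything and index the relevant" stance. A stricter site might prefer a deferred default.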
Visualization has been a great addition to briefings. On the other hand, visualization and other graphic eye candy can be a problem for those in stressful operational situations. What's your firm’s approach to presenting “outputs” for end user reuse?
Visualization is a huge area for us. Our clients are dealing with massive amounts of data and, in many cases, they need clues as to where to search. Fortunately there has been a lot of work done in human genome mapping and sentiment analysis that we have been able to leverage. The genome tools were specifically designed for pattern identification and mapping of huge data sets. We have been able to leverage the technology to graphically present relationships and social constructs. Another visualization area involves quantification of the themes of conversations and the subordinate topics associated with the themes. Our upcoming release shows an analyst what the most discussed themes and topics are and the sentiment associated with those items. By integrating this visualization capability with our search engine technology, an analyst can determine what topics are “hot” right now and how those compare to what was hot an hour or day ago. The analyst can watch in real-time as conversation topics blossom and die out. You can imagine how powerful this capability would be for a company planning an IPO and wondering if information about the IPO was leaking out. Another company could use it for identifying influencers, trends, consumer habits, and other behaviors very relevant to the marketing of their new product.
What's your take on the “new” method which some people describe as “search-enabled applications”?
We have clients who have multiple network connection points running at 10 gigabits per second and above. They don’t want to have 100% of the information carried in those high-bandwidth pipes extracted and indexed by Recon; however, they do want everything relating to a particular network incident identified and processed by Recon immediately.
Recon includes a workflow tool that enables it to receive an alert from a Security Information and Event Manager (SIEM), or any other alerting tool for that matter, and initiate a process to automatically query the network packet store, extract and index the content, and notify the analyst of the incident requiring investigation.
Can you add some color for me?
Sure. Let me use the example I provided earlier involving credit card numbers being stolen from a company. A data loss prevention tool detects the theft and throws an event to the organization’s security information and event manager (SIEM). The SIEM sends an alert to Recon’s workflow tool, which subsequently makes a call to the network capture and storage devices for every captured network file that contains information related to the event for as far back in time as the storage contains. Before the analyst has completed measures to stop the flow of credit cards, Recon will have finished its processing of the event traffic and sent a notification to the analyst that the forensic repository is ready for investigation of the incident.
The same workflow will occur if an intrusion detection system detects a malware signature. Recon will call for and process all of the data related to the event in near-real-time, and have the forensic repository ready for the analyst immediately.
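The alert-driven workflow can be outlined in a few lines of Python. This is a hedged sketch, not Recon's workflow tool: the alert fields, the packet store layout, and the `extract_and_index` and `notify` callables are hypothetical stand-ins for whatever the SIEM, the capture devices, and the indexing step actually expose.

```python
def handle_alert(alert, packet_store, extract_and_index, notify):
    """Process every stored capture related to an alert, then notify the analyst.

    alert: dict with hypothetical 'id' and 'indicator' keys.
    packet_store: iterable of captures, each with an 'indicators' set.
    extract_and_index: callable that processes one capture and returns its id.
    notify: callable that receives a message for the analyst.
    """
    related = [c for c in packet_store if alert["indicator"] in c["indicators"]]
    indexed_ids = [extract_and_index(capture) for capture in related]
    notify(f"Alert {alert['id']}: {len(indexed_ids)} capture(s) ready for investigation")
    return indexed_ids


# Invented sample data standing in for a SIEM alert and a capture store
store = [
    {"id": "cap1", "indicators": {"4111111111111111"}},
    {"id": "cap2", "indicators": {"unrelated"}},
    {"id": "cap3", "indicators": {"4111111111111111", "10.0.0.5"}},
]
messages = []
ready = handle_alert(
    {"id": "dlp-7", "indicator": "4111111111111111"},
    store,
    extract_and_index=lambda capture: capture["id"],
    notify=messages.append,
)
```

Only the captures tied to the incident are processed, which matches the point about not indexing 100% of a high-bandwidth pipe.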
There seems to be a popular perception that the world will be doing computing via iPad devices and mobile phones. My concern is that serious computing infrastructures are needed and that users are “cut off” from access to more robust systems. How does your firm see the computing world over the next 12 to 18 months?
I agree with you that the endpoint devices will not carry petabytes of storage or petaflops of processing power in the near future. That said, iPads and mobile phones will continue to grow as important user endpoints. As I mentioned earlier, Recon is designed to operate near the data it works with. It works very well in a corporate or public cloud environment where storage and computing power are essentially unlimited and virtualized behind a robust security architecture. Recon contains an embedded web host, and its user interface is a simple web browser. Recon can perform the information extraction, indexing, query, and analytics in the cloud and the user can access the power of Recon through a PC, iPad or other mobile device for client rendering.
Put on your wizard hat. What are the three most significant trends that you see affecting your findability and discovery business?
Unlike some popular political statisticians, I am not sure I want to get into the “wizard hat” category. I am paying attention to three areas of activity.
First, I find it interesting that storage companies are making it cheaper to archive large volumes of data. Captured network traffic contains a record of every transaction the company made: every attack on the enterprise and everything that leaked out. In the past, organizations did not keep packet capture files because of the high cost of storage and lack of a way to extract information from the data. With the drop in the cost of storage and the ability of Recon to identify and index all of the information embedded in the captured network files, the value of keeping captured network packet data has increased dramatically.
Second, I find the activity related to semantic analysis engines significant. Some of these innovations may make a semantic approach one of the best technologies to aggregate or summarize unstructured data, and semantic methods continue to improve. We believe there is a continuum from data to information, and then with analytics, to intelligence. By integrating semantic analysis into Recon’s analytics suite, we have provided an additional dimension to the capabilities Recon provides to an analyst.
Finally, the Big Data bandwagon seems to have increasing momentum. An analyst, whether in law enforcement or financial services, must have tools to deal with the onslaught and to extract the latent value that lies within these treasure troves. The ability to provide structure (topics, themes, and categorization) to repository collections is key to getting all of the value you can out of the data you have. That is a key focus of Cybertap’s.
Where does a reader get more information about your firm?
I would suggest that our Web site be the first place to go. We respond to inquiries promptly. The URL is www.cybertapllc.com.
As traditional search vendors use marketing lingo to shift from a commodity business to a higher value sector, many organizations realize that next generation technology is more than a catchphrase. Cybertap is an interesting company because it combines search with a range of functions which allow a combination of alerting, discovering, and finding. Definitely worth a close look for organizations mindful of the need to move “beyond search.”
Stephen E. Arnold, December 4, 2012