ZyLAB

An Interview with Dr. Johannes Scholtes

Johannes Scholtes of ZyLAB.

In 2002, I was investigating optical character recognition and document scanning systems. I came across ZyLAB and ZyINDEX. The company's technology was interesting, but the company was smaller than the better-known players at that time. Then in 2006 the president and CEO of ZyLAB contacted me, and I enjoyed an interesting conversation and a compelling demonstration of the product. We connected after I had completed work on the third edition of Enterprise Search Report.

I knew that since Dr. Johannes C. Scholtes took over leadership of ZyLAB in 2002, the company had undergone rapid growth with double-digit expansion and healthy profits. He had been an officer in the intelligence department of the Royal Dutch Navy. He has an M.S. degree in Computer Science from Delft University of Technology and a Ph.D. in Computational Linguistics from the University of Amsterdam.

When I began work on my Beyond Search study, I spoke with him again. ZyLAB is included in my new Beyond Search study, and I have included additional information about the company in particular and enterprise search in general in the interview below. Our original conversation took place in the Grand Amsterdam Hotel. We concluded by exchanging information via email.

The full text of the interview appears below:

Give me the elevator pitch about ZyLAB. What's the basic positioning of your company?

We like to point out that ZyLAB is a combination of cutting-edge search and text mining technology for paper, email and electronic files. We support content management technology such as eDiscovery and eDisclosure management, redaction, workflow, federation and compliant records management. We do search, of course, but we make search part of other information solutions. We now have about 8,000 installations worldwide.

Autonomy, Endeca, and Fast Search & Transfer, among others, assert that their approach is a platform, not a single point solution. What makes ZyLAB different?

This is my favorite subject. Just cut me off if I give you too much detail.

First, you do not need to select a database in advance. All data is stored in a straightforward file system as open XML and various native file formats. You can use a database, but it is not mandatory. This architecture makes our system very scalable.

Do you mean a collection like selecting Google News or Web search?

Exactly. We have been doing this since our company was founded in 1983. Users don't know what's in a collection, so it's easier to make one search that goes across all of the content. If you want to create a collection in a separate part of the file system, that's okay too.

Another point is that we can search electronic documents in 400 file formats, almost all variations of email, and multimedia, in more than 200 languages. And we can process paper documents and make those searchable as well.

Isn't the need to convert paper to digital form becoming less relevant in today's world?

No, I don't agree. Most organizations, government agencies, and law firms still have to deal with hard copy documents. We also see that many of our customers who have scanned documents into PDF prefer to convert these documents into our XML-TIFF format to make searching and viewing faster, especially for large documents. In addition to this, we have invested significantly over the last six years to create an advanced, integrated solution for document and records management. On top of this capability, we provide search, discovery, text-mining, workflow and collaboration tools, from capture to disclosure.

Are you straddling the hard copy and digital world?

Yes, and we understand the problems of hard copy. We have also pushed hard to provide the advanced content processing that is now a requirement for search in most organizations.

Our approach has been to say to our customer, "Here's our list of components. Just select the ones you need. You pay only for these." We don't ask our customers to pay huge fees for functions that will never be used.

Our modular approach is now mature, and I see more vendors in Europe and the US emulating what we've been doing for a long time. Our customers tell us our "couple-of-day" deployments are very unusual. For us, fast deployment is business as usual. Three- and six-month installation efforts are problems for many organizations, and these become great sales leads for us.

Do you offer special versions of ZyLAB for different business sectors?

Yes, we started doing this a number of years ago. We have a vertical market focus. Unlike some of the companies in the search and retrieval business, we deliver professional services and support from our staff. We think we are a leading niche player. But we are finding that our niches are exploding, especially in the e-discovery and compliance space.

I know that you have a content acquisition component that handles scanning. How does your system integrate the scanned content with the digital content that your system also processes?

We store everything in an open format based upon XML + TIFF, which we describe on our Web site. We can full text index all data. We can OCR in 200 languages. We automatically recognize the language from the bitmap. You can store all information types in one collection. There is no difference in ZyIMAGE between paper, email, electronic files or multimedia. To the user it is all transparent.

Because search is the pivot point for our applications, ZyLAB is almost the exact opposite of the IBM FileNet or Brainware approach. Search is first for us; workflow, document management, records management, and so on come after that.

I know you support key word search. What other access methods can a ZyLAB user tap into?

This is a great question for me. In the basic search, a user can see the number of hits for a query, hit-density ranking, file date and time for creation, modification, and access. There are many other features in basic mode.

For advanced search, you can rank on automatically extracted entities, including names, companies, countries, measurements, dates, monetary amounts, and named-phrases.

You can rank by semantic relevance using an automatically derived taxonomy or your own taxonomy. Results can be personalized. You can organize result lists in a variety of ways. You can run a query on a linguistic pattern like "a person got a job" and then rank results in these patterns higher than hits in the full text.

Through all this additional meta information, we can support clustering and full-text similarity search inside documents, where precision and recall can be set.

One useful feature is that the advanced relevance ranking algorithms can be tweaked in order to guide relevancy for specific projects. This is quite different from the "black box" algorithms that are locked up and not tunable. You can set many different parameters for key words in context, fuzzy search, entity extraction, and so on.

With our search technology, we try to find more than our competitors do, especially when someone tries to hide what they are doing. In other words, we are looking for very high recall. In addition, we have various tools to help our users keep search precision under control.
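For readers less familiar with the two terms, recall is the share of all relevant documents that a search returns, and precision is the share of returned documents that are actually relevant. The following is a minimal Python sketch of how the two measures are computed. The document IDs are made up for illustration, and nothing here represents ZyLAB data or a ZyLAB API.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the search engine returned
    relevant:  IDs a reviewer judged relevant (the ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: a broad query returns 8 documents, 4 of them relevant,
# out of 5 relevant documents in the whole collection.
broad = precision_recall(
    retrieved=["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"],
    relevant=["d1", "d2", "d3", "d4", "d9"],
)
print(broad)  # precision 0.5, recall 0.8
```

A discovery-style search of the kind described above typically accepts lower precision in exchange for higher recall, then uses filtering tools to trim the result list.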

What's the scalability of the ZyLAB system?

We think it's very good.

You can use a 32-bit or 64-bit engine. One 32-bit search index can contain up to four billion unique key words or up to 2 million files. In general, this amount is the equivalent of 20 gigabytes of pure text per index or repository. Translated to email or electronic data on a hard disk, this is around 100 GB per search index. Of course, you can have unlimited parallel repositories and search them all at once.

Theoretically, the 64-bit engine can contain 16 quintillion unique words and 40 trillion unique files or records per index. ZyIMAGE has not yet reached a collection size limit, and it has easily managed several multi-terabyte collections that could very well grow into petabytes. This engine uses an internal 16-bit Unicode encoding to represent international character sets, and users can have unlimited parallel repositories. We think that the capabilities of our 64-bit engine are virtually unlimited.
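The key word limits quoted above are consistent with plain integer-width arithmetic. The short Python sketch below works through the numbers. It assumes the per-index limits follow from the width of the engine's internal identifiers, which the interview does not state explicitly, and it reads "16 quintillion" in binary units (16 x 2**60), which is one plausible interpretation.

```python
# Assumption: per-index limits derive from the identifier width.
# The interview does not confirm this; it is an illustrative reading.
ids_32bit = 2 ** 32   # 4,294,967,296, i.e. "up to four billion unique key words"
ids_64bit = 2 ** 64   # 18,446,744,073,709,551,616

# "16 quintillion" matches 2**64 when counted in binary units:
# 2**64 == 16 * 2**60.
binary_exa = 2 ** 60
ratio = ids_64bit // binary_exa

# The 64-bit identifier space is 2**32 times larger than the 32-bit one.
growth = ids_64bit // ids_32bit

print(f"32-bit: {ids_32bit:,}")
print(f"64-bit: {ids_64bit:,} = {ratio} x 2**60")
print(f"growth factor: {growth:,}")
```

The file and record limits quoted (2 million and 40 trillion) do not map onto a simple power of two, so they presumably reflect other design constraints that the interview does not describe.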

In fact, in recent months we picked up a lot of new business from customers that were using traditional database technology to store large paper or email collections. These systems could no longer scale to today's volumes unless astronomical investments in hardware and database licenses were made. We have been able to make such collections fully searchable with standard hardware.

Do you have customers with large collections under your index?

Yes, but I am very conservative when it comes to naming our licensees. Many of our customers require that we sign NDAs these days. I can tell you this, however.

Hundreds of ZyLAB customers now have collections holding more than 20 million pages, all of which can still be searched in under a second. ZyLAB also has several demonstrations we show prospects that hold millions of pages. On a million-page document collection with an average of 1,000 results per query, the typical hardware needed for a half-second query is a single 1.0 GHz Pentium with 512 MB of RAM.

The search speed consists of two parts. First, there is searching the index. This almost always takes less than 0.5 seconds, unless the search is really broad (such as an open-ended wild card search). Second, there is collecting and sorting the results, which depends on the number of results: 1,000 results can be collected and sorted in 0.5 seconds. So almost all of our users get results in less than one second.

A growing number of ZyIMAGE collections host more than 200 million pages (tens of terabytes) of data, a figure that is expanding very quickly. In general, e-mail archives are becoming larger than paper archives. One of ZyLAB's customers, in fact, now has an archive with hundreds of millions of e-mail messages, which comes to several terabytes. And search performance is still very fast when discovery searches are done.

The size of collections is growing. For e-mail, the best way to index content from PST, Exchange connectors, Lotus Notes, GroupWise and other e-mail formats is to convert it into XML and native file formats with our conversion tools.

Does your system support federated search?

That too is a good question. The answer is, "Yes".

Our XML-based architecture makes it very easy to integrate with other systems such as SAP, Oracle, PeopleSoft, Microsoft (SQL, Navision and SharePoint), and IBM. We have several APIs and sample integrations to support this.

Often, ZyIMAGE is used to manage the unstructured data component (paper, e-mail, electronic files, and multimedia) for financial, ERP, human resources or other back-office applications. Typical integration projects with these large enterprise systems take about one or two weeks.

Not long ago we started to extend our ZyIMAGE Federator with connectors supporting the new ATOM standard to other unstructured search engines such as Factiva, Google, MSN, Yahoo and Microsoft Search 2008. This allows our users to leverage other search engines and subscription content. Data can be de-duplicated and enriched with entity and fact extraction before external data is entered into a central internal repository.
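To make the de-duplication step concrete, here is a minimal Python sketch of merging result lists from several sources and dropping copies of the same text. It is an illustration only, not the ZyIMAGE Federator's method or API: the `fingerprint` and `federate` helpers, the source names, and the result shape are all hypothetical, and exact-match hashing of normalized text is just one simple way to detect duplicates.

```python
import hashlib
import re


def fingerprint(text):
    """Hash of the text with case and whitespace differences removed."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def federate(results_by_source):
    """Merge per-source result lists, keeping the first copy of each document.

    results_by_source: dict mapping a source name to a list of
    {"title": ..., "text": ...} dicts (a hypothetical result shape).
    """
    seen = {}
    for source, results in results_by_source.items():
        for doc in results:
            key = fingerprint(doc["text"])
            if key in seen:
                # Same content already merged: just record the extra source.
                seen[key]["sources"].append(source)
            else:
                seen[key] = {**doc, "sources": [source]}
    return list(seen.values())


merged = federate({
    "internal": [{"title": "Q1 memo", "text": "Revenue grew  10%."}],
    "news_feed": [
        {"title": "Q1 Memo (copy)", "text": "revenue grew 10%."},
        {"title": "Press release", "text": "New product launched."},
    ],
})
for doc in merged:
    print(doc["title"], doc["sources"])
```

In this sketch the two copies of the memo collapse into one entry that lists both sources, while the press release stays separate. Near-duplicates with small wording changes would need a fuzzier comparison than an exact hash.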

Despite years of effort by ourselves and others, most organizations' data is still stored in silos. With the ZyIMAGE Federator technology you can now have access to the information in these silos, so users can cross these silos and combine the results in a single search. We are great believers in federation technology.

Oh, one other important new component of the federated search module is ZyLAB's ability to monitor the content of an external Web site and alert users on changes.

How have you extended the functionality of the core ZyLAB search in the last year?

With extended text analytics and multimedia search.

Can you describe a representative customer's use of ZyLAB for eDiscovery or patent analysis?

If you search the US PTO patent database with Google, you will actually find a patent from someone using ZyIMAGE to search and visualize patents. Kind of interesting that you can patent that.

We have several commercial legal customers that use ZyIMAGE for patent or other legal searches. One of them is Sara Lee. These customers put trade name, patent and competitive information on their behind-the-firewall system as one ZyIMAGE enterprise system (tens of millions of documents), and they search and share this information with their employees worldwide.

In the context of eDiscovery, we have many corporate customers using our technology. Unfortunately, I cannot name any of these given the sensitivity of their projects. But all of them use our software on large email and electronic file collections and have selected ZyLAB because we showed them that we were able to find more relevant documents, especially in very large and complex multi-lingual collections that consisted of various different formats.

When you look at the consolidation in search--Microsoft buying Fast, for example--is this a healthy trend?

For ZyLAB this consolidation is actually great news: one fewer specialist leader. Large companies such as Microsoft always go after the billion-dollar markets, and that leaves enough room in certain sufficiently large search niches for companies such as ZyLAB.

In the last few months I've noticed quite a few tie-ups; for example, Groxis with Intellisearch, Content Analyst with dtSearch, and some others. What are the implications of this?

At the end of the day, many software companies will use open source products such as Lucene or other open source systems. I think people will pay only for very specialized, high-quality search that is part of an integrated solution.

From data I've seen, mobile search is the next big thing. If that's true, what's the impact of mobile search on ZyLAB?

Our users always want to search everything. This is why we were already searching paper in 1990, email in 1995 and instant messaging in 2000, and why we search multimedia now. We do not care whether the data is local or remote. Format, origin, language and source are completely transparent to our users.

ArnoldIT Comment

ZyLAB offers a comprehensive content system. Search, although a key component, is one piece of the information access puzzle. A former intelligence officer, Dr. Scholtes has done a very good job of making content findable. The ZyLAB approach has won the company a strong following in the legal, business intelligence, and financial services markets. The company's support for records management gives it a leg up on competitors struggling to make headway in certain regulated industries. Give ZyLAB's system a test drive.

Stephen E. Arnold, May 5, 2008
