Deep Web Technologies
An Interview with Abe Lederman
I am sitting in what may be the world's spiciest Mexican restaurant, Tia Sophia’s in Santa Fe, New Mexico. Abe Lederman suggested this restaurant. I am gulping iced tea in a losing battle to put out the fire.
His company, Deep Web Technologies, was founded in Los Alamos, best known for developing the atomic bomb during World War II. Deep Web developed and operates the search capabilities of Science.gov, a site that searches, in real time, US government servers with unclassified scientific and technical information. His firm serves clients around the world with "deep web" indexing. Mr. Lederman, one of the wizards who helped build Verity, started Deep Web Technologies after working on similar technologies at LANL, as Los Alamos National Laboratory is known to locals in this part of New Mexico. The scenery is lovely, but after one leaves the mountains, some of the landscape looks hostile and barren.
The full text of my interview with Mr. Lederman appears below.
Do you eat this nuclear food every day?
No, I like to bring folks from Kentucky here just to observe their reactions. I like less fiery food myself.
What's your background?
My college career started at MIT 30 years ago where I earned BS and MS Degrees in Computer Science.
From my days at MIT I am most proud of putting together an ad hoc group of classmates from my dorm that went on to win the ACM National Student Programming competition in 1978. Following that win, Steve Kirsch, a reporter for The Tech, MIT's school newspaper, interviewed our team. You may recall that Steve Kirsch became an extremely successful serial entrepreneur, including becoming the founder of InfoSeek in 1993.
In 1987, I became the first person hired by Advanced Decision Systems (ADS), a Silicon Valley consulting firm focused on Artificial Intelligence research for the Department of Defense. ADS had developed a sophisticated concept-based retrieval system, named RUBRIC, written in LISP. I was hired to help rewrite and productize this text retrieval system. This effort led to the formation of Verity in 1988.
Have you been involved with Web indexing at some of the past or present Web search giants?
I never worked for one of the big dinosaur web companies. In 1993, following the failed acquisition attempt of Verity by Frame Technologies, I left Verity and moved to Los Alamos, New Mexico to consult at the National Laboratory (LANL), where 50 years earlier the atomic bomb had been developed. At the time LANL was one of Verity's largest customers. Within a few months of my arrival at LANL I had developed one of the very first Web-based applications, The Electronic Café. It searched regulations for running a DOE lab (there are more than 20,000 such pages), it searched a proposal preparation handbook, and it searched the Laboratory Newsletter. My Explorer search engine was built on top of NCSA's httpd server and the buggy Mosaic browser. If I had only known then where the Web was heading, I might be extremely wealthy now.
I did dabble with the Dot Com craze. Verity, which went public in 1995 or 1996, soon after Netscape, never quite got the valuations of some Dot Com Web companies. Perhaps, by 1997 or so, Verity had become a bit of a dinosaur. The senior management team was slow in jumping on the Web bandwagon.
At that time, Verity's technology was showing its age and was not able to scale to crawl, index and search the entire rapidly expanding Internet. Having an existing customer base also made it difficult for Verity to adapt.
Where did you get the idea for Deep Web Tech? Was it at LANL? Was it a bright idea that took place when aliens landed in Roswell?
No aliens. Just me.
Sometime in 1998, my brother Sol and I were having one of our "wouldn't it be neat" conversations. We wanted a search system that would search multiple book-selling web sites in real time and find the cheapest online site to buy a book. As this idea was bouncing around my head, Amazon became the place for buying books, and I dropped plans to implement a book-finding site.
A few months later one of my customers, OSTI, the Department of Energy Office of Scientific and Technical Information, asked me whether I could implement a capability to provide environmental researchers (that is, scientists researching how best to clean up the environmental messes created by DOE) one-stop access to multiple research resources.
In April 1999 we launched ESN, the Environmental Science Network, and called the technology Distributed Explorit. ESN is still in operation today, eight years after it was first developed.
When Gary Price and Chris Sherman started talking about the "invisible Web" I thought the idea was another way to say "dynamic Web site." What's your view of the "deep Web"?
Deep Web Technologies is a company focused on providing products and services that perform federated search. We allow users to search multiple information sources in parallel. Results are retrieved, aggregated, ranked, and deduplicated. This doesn't seem too difficult, but trust me, it's much harder than one might think.
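The pipeline Mr. Lederman describes (query every source in parallel, then aggregate, dedupe, and rank) can be sketched in a few lines of Python. This is an illustrative toy, not Deep Web's implementation; the two stub sources and their record format are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real source connectors; each returns a
# list of {"title", "url", "score"} records for a query.
def search_source_a(query):
    return [{"title": "Solvent cleanup study", "url": "http://a.example/1", "score": 0.9}]

def search_source_b(query):
    return [{"title": "Solvent cleanup study", "url": "http://a.example/1", "score": 0.7},
            {"title": "Groundwater survey", "url": "http://b.example/2", "score": 0.8}]

def federated_search(query, sources):
    # 1. Query every source in parallel.
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        result_lists = list(pool.map(lambda s: s(query), sources))
    # 2. Aggregate and dedupe on URL, keeping the best-scored copy.
    best = {}
    for rec in (r for lst in result_lists for r in lst):
        if rec["url"] not in best or rec["score"] > best[rec["url"]]["score"]:
            best[rec["url"]] = rec
    # 3. Rank the merged list by score.
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)
```

A real system replaces the stubs with connectors that fill out each site's search form and parse the response, which is where most of the hidden difficulty lives.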
Deep Web started out building federated search solutions for the Federal government. We run some highly visible public sites such as Science.gov, WorldWideScience.org and Scitopia.org. We have expanded our market in the last few years and sell to corporate libraries as well as academic libraries.
In 2001, Michael Bergman wrote a much-referenced white paper entitled The Deep Web: Surfacing Hidden Value, in which he claimed that the "deep web", the part of the web where information resides in databases inaccessible to Google and other "surface web" search engines, is 400 to 550 times larger than the "surface web". Over the years, particularly when preparing to give a talk on the "deep web", I have looked a number of times for a more current estimate of the size of the "deep web", but I have not found a good one. Let's just say that at least 90% of the information on the Web lives in databases that you just won't see on Google.
Also in 2001, Gary Price and Chris Sherman wrote The Invisible Web: Uncovering Information Sources Search Engines Can't See, in which they claimed that the "invisible web" was 86 percent of the web. Price and Sherman coined the term "invisible web" before Bergman coined the term "deep web". Personally, I prefer the term "deep web", as the information inside databases is not really invisible.
That sounds a great deal like what Google says it will be doing with its Google Forms initiative. Is it? What's the difference between you and Google Forms?
What Deep Web does is nothing like what Google says it's doing with its Google Forms effort. Deep Web sends out search requests to information sources in real time. Each such request is equivalent to a user going to the search form of an information source and filling the form out.
Google is attempting something different. Using automated tools, Google fills out forms whose submission retrieves search results that can then be downloaded and indexed by Google. This effort has a number of flaws: automated form-filling will work on only a small subset of forms, and Google will not be able to download every document in a database because it will only be issuing random or semi-random queries. A blog post by Sol on the Federated Search blog expands on the flaws and limitations of what Google is trying to do.
What are you doing different from what USA.gov does? Why should a user looking for US Government information care about structured versus unstructured data?
Science.gov is a good example of what we do. Launched in 2002, Science.gov is a collaborative effort of 12 Federal government agencies that have gotten together through the Science.gov Alliance to create one-stop access to most of the output of the $130B or more a year that the Federal government spends on R&D.
USA.gov is different from Science.gov in that most of the content available through USA.gov is crawled and indexed by MSN. Results from a federated search of a few "deep web" databases are included in the results displayed by USA.gov. USA.gov provides broad Federal and state government information aimed at the general citizen market. Science.gov, which is referred to as USA.gov for Science, is focused on accessing high-quality scientific information that is not available on Google or USA.gov.
You are absolutely right. Users don’t care whether the information that they are looking for comes from structured or unstructured information. Neither does our federated search engine. We can search for content in structured Oracle databases and in unstructured Autonomy databases and merge the results of the two.
Okay, but isn't most of the Web audio and video?
Yes, there is a lot of audio and video on the Internet. Our federated search engine is flexible enough to aggregate the results of searching multiple video sites or multiple audio sites. I wish I had created a site that aggregated all the major video sites right before YouTube became all the rage.
In discussions with some of our customers we have been talking about using tabs (similar to the approach we've taken in Scitopia.org) to organize results by type. We might put all journal articles into one tab, all news articles into another, all video lectures into a third, and all podcasts into a fourth.
I track stealth search companies and slow-to-market VC-backed firms. Frankly, I think most of these outfits are taking one piece of the info problem and mesmerizing investors. What's your take on the more than 300 companies in the search and semantic text analysis space?
Ever since I have been involved in search, going on 20 years now, I have found it extremely difficult to look at search products and compare them. It was hard to do when I was at Verity and it is something that I struggle with here at Deep Web. How do I get potential customers to differentiate our product from that of our competitors?
I feel sorry for VCs and believe that almost all of the 300 companies that you mention will fail. Of course, there might be a grand slam in there somewhere, a company that comes up with a better approach to search and might one day displace Google, or, more likely, get bought by Google.
When you hit a Web site to pull content to process, how do you deal with the latency issue?
Yes, lots of government web sites, as well as many commercial web sites, are under-powered and suffer from really poor, slow search engines.
Latency, or how quickly search results are returned by the information sources searched, is certainly a major concern for federated search vendors such as Deep Web. Our users would like sites such as Science.gov to be as fast as Google, which is not possible.
For a number of years Deep Web has taken a distinctive approach to this problem. Our Explorit Research Accelerator product implements incremental display of search results. Within 3-4 seconds of the user hitting the Search button, we display whatever results have already been returned, in ranked order, and let the user look at them. In the background, Explorit continues to process additional results as they become available, and once we get the OK from the user we merge in these new results.
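A minimal sketch of this incremental-display idea, assuming one worker thread per source and a fixed window for the first batch of results (the function name, record format, and timings are invented; this does not reflect Explorit's actual internals):

```python
import queue
import threading
import time

def incremental_search(query, sources, first_window=0.5):
    """Return (initial_results, fetch_rest): initial_results holds whatever
    arrived within first_window seconds, ranked; fetch_rest() blocks for the
    stragglers, so slow sources never delay the first screen of results."""
    q = queue.Queue()
    # Fire off every source in a background thread.
    for src in sources:
        threading.Thread(target=lambda s=src: q.put(s(query)), daemon=True).start()
    deadline = time.monotonic() + first_window
    initial, pending = [], len(sources)
    while pending and time.monotonic() < deadline:
        try:
            initial.extend(q.get(timeout=max(0, deadline - time.monotonic())))
            pending -= 1
        except queue.Empty:
            break  # window expired; remaining sources are still working
    def fetch_rest():
        late = []
        for _ in range(pending):
            late.extend(q.get())  # block until each slow source finishes
        return late
    return sorted(initial, key=lambda r: r["score"], reverse=True), fetch_rest
```

A production version would also merge the late results back into the ranked list only when the user signals they are ready, as described above.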
Under-resourced websites are the biggest limitation to scaling federated search to serve a large number of users. We are currently researching a number of approaches to reduce the load on information sources, from caching search results to intelligently searching only the information sources most likely to return good results for a given query.
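The caching half of that idea could be as simple as a per-source, per-query cache with a time-to-live, so repeated queries never hit an under-resourced site twice within the window. A sketch with an invented interface, not Deep Web's design:

```python
import time

class QueryCache:
    """Minimal TTL cache keyed by (source, query) for federated search
    results. Entries older than ttl_seconds are treated as misses."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, source_name, query):
        entry = self._store.get((source_name, query))
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]  # fresh hit: skip the live search entirely
        return None

    def put(self, source_name, query, results):
        self._store[(source_name, query)] = (time.monotonic(), results)
```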
What are you doing to make it easier for the user to find information?
Although not terribly sexy, we've put a lot of effort into developing sophisticated relevance ranking algorithms that work well where little information is available (e.g., just a title and a snippet). We also bring back many more results from each information source than other federated search products do, which, combined with our relevance ranking, makes it easier for our users to find just the information they are looking for.
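Ranking records that carry only a title and a snippet can be approached with simple term-overlap heuristics. The sketch below weights title matches above snippet matches and damps long records; the weights and field names are invented for illustration and are not Deep Web's algorithm:

```python
import math
import re

def score_record(query, record, title_weight=3.0):
    """Toy relevance score for a sparse record: count query-term hits,
    weighting title hits more heavily and damping longer records."""
    terms = set(re.findall(r"\w+", query.lower()))
    title = re.findall(r"\w+", record.get("title", "").lower())
    snippet = re.findall(r"\w+", record.get("snippet", "").lower())
    hits = (title_weight * sum(w in terms for w in title)
            + sum(w in terms for w in snippet))
    # Length damping so long snippets don't win on volume alone.
    return hits / math.log(len(title) + len(snippet) + 2)

def rank(query, records):
    return sorted(records, key=lambda r: score_record(query, r), reverse=True)
```

Even a crude scorer like this illustrates why field weighting matters when full-document statistics are unavailable.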
We are also excited to be introducing "smart clustering" in the next few weeks; search results within each cluster will be displayed in rank order. Clustering will provide our users with a complementary way of finding the most useful information in a large set of results.
We are, of course, looking at a number of visualization technologies that are out there, such as Groxis's, and expect that by the end of this year we'll integrate one or more visualization technologies as optional capabilities of the Explorit product.
When you look forward, do you see mobile search as a big deal? If not, what's the next big thing in search?
Yes, I think that mobile search is a big deal. There are many more cell phones out there than PCs connected to the Internet, their owners are always attached to them, and cell phones are becoming much more powerful.
I believe that the killer mobile search application is one that can easily deliver accurate local information to the user. For example it’s a Sunday afternoon and I just arrived at my hotel in the D.C. area. I would like my cell phone to provide me with the name and coordinates of the closest one or two barbershops to my hotel that are open now.
About six years ago we were going to provide federated search to a company that was going to set up a reference center in India. We talked about creating a service where, for example, a business person, before getting on a plane for an important business meeting, would submit a research question to this service, e.g., how many lawn mowers costing over $100 were sold in Arizona in the last 12 months. ChaCha, a startup company, has switched focus and is now answering questions via mobile phones.
Over the next 12 to 18 months, what are the major trends that you see building like waves at sea, preparing to rush to the shore?
I can't speak across all areas of search, but although many might say that social networking has peaked, I believe that in the next year or so we will see some really interesting developments at the convergence of search and social networks, applied to vertical search and to communities with special interests.
I know you can't reveal any secrets, but can you give me a hint about what's next for Deep Web Tech?
We are very interested in something I call Collaborative Discovery, which has elements of social networking: individuals, whether they know each other or not, working together, implicitly or explicitly, to find answers to their questions.
We are also very interested in taking all the federated search capabilities we've developed, particularly our connectors, and building the world's most comprehensive site for scientific and technical information.
Deep Web Technologies is one of those companies that has a solid product and a strong customer base. With roots in the early days of search, Abe Lederman has continued to innovate. Take a close look at the Deep Web services. This will be time well spent. If you visit the company, pick the restaurant.
Stephen E. Arnold
June 10, 2008