An Interview with Mike Horowitz
In the last year, interest in information analysis has skyrocketed. For many years, sophisticated software systems have been at work in government agencies, research institutes, and computer science laboratories. Fetch Technologies, based in El Segundo, California, has been one of the leaders in next-generation content analysis. Unlike the companies focusing on narrow functions for eDiscovery or consumer-oriented sentiment analysis, Fetch has built out its platform, programming tools, and analytic methods.
Founded in 1999, Fetch Technologies enables organizations to extract, aggregate and use real-time information from Web sites. Fetch's artificial intelligence-based technology allows precise data extraction from any Web site, including the so-called Deep Web, and transforms that data into a uniform format that can be integrated into any analytics or business intelligence software.
The company's technology originated at the University of Southern California's Information Sciences Institute. Fetch’s founders developed the core artificial intelligence algorithms behind the Fetch Agent Platform while they were faculty members in Computer Science at USC. Fetch’s artificial intelligence solutions were further refined through years of research funded by the Defense Advanced Research Projects Agency (DARPA), the National Science Foundation (NSF), the U.S. Air Force, and other U.S. Government agencies.
Corporations, business intelligence companies and news organizations use Fetch to connect with millions of Web sites for a myriad of applications including competitive intelligence, news aggregation, data analysis and background screening.
After quite a bit of poking and prodding, I was able to track down Michael Horowitz, Fetch's chief product officer. We sat in the Blue Butterfly Coffee Co. in lovely downtown El Segundo. Mr. Horowitz, a former Google wizard, patiently and graciously answered my questions about the Fetch platform. Although well-known in military and law enforcement circles, Fetch Technologies has a more modest profile in the commercial sector. The full text of my one-hour discussion on July 8, 2010, with Mr. Horowitz appears below:
Where did the idea for Fetch originate?
As professors at the University of Southern California’s Information Sciences Institute (ISI), Steve Minton and Craig Knoblock saw the tremendous growth of information on the Internet and realized that a scalable way to accurately extract data from Web pages was needed to allow organizations to harness the power of this public data.
The Fetch solution uses artificial intelligence and machine learning to navigate and extract specific data from user-specified Web sites. Users create “Web agents” that accurately and precisely extract specific data from Web pages. Fetch agents are unique in that they can navigate through form fields on Web sites, allowing access to data in the Deep Web, which search engines generally miss.
What's the "Deep Web"?
Good question. That's the content that often requires the user to formulate a specific query. For example, when you visit a travel site, you have to provide the date you want to travel, point of departure, etc. "Deep Web" also embraces sites that may require a user to log in before accessing the information on the site. The idea is that a general purpose Web spider may not index the content on these types of Web sites.
Thank you. Now what about Fetch agents?
As I was saying, once the agents are created, they can be scheduled to retrieve data on an automated basis. Data from different Web sites can vary significantly in format, providing users with difficult challenges in using or analyzing it. Fetch solves this problem by normalizing the extracted data into a uniform format, defined by the user.
Am I correct in concluding that Fetch processes content in many different formats, file systems, and file types and "standardizes" or "normalizes" it?
Yes. Some people call this process ETL or extract, transform, and load. Others talk about file conversion. The point is that differences are handled by Fetch. Our customers can work with data without having to fiddle manually, which is impractical with large data flows.
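To make the normalization step concrete, here is a minimal sketch of mapping records from two sources into one user-defined format. It is not Fetch's implementation: the site names, field names, sample values, and the `normalize` helper are all assumptions made up for illustration.

```python
import re

# Hypothetical raw records, as two different sites might present the same product.
RAW_RECORDS = [
    {"source": "site_a", "Product Name": "  Widget Pro ", "Price": "$1,299.00"},
    {"source": "site_b", "title": "widget pro", "cost_usd": "1299"},
]

# User-defined mapping from each site's field names to one uniform schema.
FIELD_MAP = {
    "site_a": {"Product Name": "name", "Price": "price"},
    "site_b": {"title": "name", "cost_usd": "price"},
}


def parse_price(text):
    """Strip currency symbols and thousands separators, then return a float."""
    return float(re.sub(r"[^0-9.]", "", text))


def normalize(record):
    """Rename fields to the uniform schema and clean up their values."""
    mapping = FIELD_MAP[record["source"]]
    out = {"source": record["source"]}
    for raw_key, uniform_key in mapping.items():
        out[uniform_key] = record[raw_key]
    # Collapse stray whitespace and use one capitalization style for names.
    out["name"] = " ".join(out["name"].split()).title()
    out["price"] = parse_price(out["price"])
    return out


normalized = [normalize(r) for r in RAW_RECORDS]
for row in normalized:
    print(row)
```

After this step both records carry the same `name` and `price` fields with the same types, so a downstream tool can compare them directly.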
What's your company's product lineup today?
Fetch currently offers Fetch Live Access as an enterprise software solution or as a fully hosted SaaS option. All of our clients have one thing in common, and that is their awareness of data opportunities on the Web. The Internet is a growing source of business-critical information, with data embedded in millions of different Web sites – product information and prices, people data, news, blogs, events, and more – being published each minute. Fetch technology allows organizations to access this dynamic data source by connecting directly to Web sites and extracting the precise data they need, turning Web sites into data sources.
How does Fetch Live Access work?
Our hosted solution, Fetch Live Access, connects organizations directly to Web site data without any software to install. Customers choose from a set of existing “agent” libraries that can connect directly to Web sites, or Fetch can custom build a library for them. Fetch extracts data directly from Web sites and delivers it in a standardized format.
Our enterprise software solution, the Fetch Live Access Platform, allows large organizations to bring Fetch technology in-house. The software includes AgentBuilder for the creation of powerful Web agents without programming, and AgentRunner, which can run and manage agents using a sophisticated parallel processing architecture to access and extract data from multiple sources at high speed. It also includes XML Transformer to standardize data collected by Fetch agents into a customized format, and AgentMonitor, for robust agent monitoring.
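The general idea of running several agents in parallel can be sketched with Python's standard thread pool. This is only an illustration of the pattern, not how AgentRunner works internally; `run_agent`, the source names, and the placeholder record count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sources; a real deployment would list actual Web sites.
SOURCES = ["site-a.example", "site-b.example", "site-c.example"]


def run_agent(source):
    """Stand-in for one agent run; a real agent would fetch and parse pages."""
    # Placeholder result: the record count is just the length of the name.
    return {"source": source, "records": len(source)}


with ThreadPoolExecutor(max_workers=3) as pool:
    # map() runs the agents concurrently and returns results in input order.
    results = list(pool.map(run_agent, SOURCES))

for result in results:
    print(result)
```

Because agent runs spend most of their time waiting on the network, a thread pool is a reasonable first choice for this kind of workload.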
Your platform lets a licensee program the Fetch system to deliver reports or mash ups that answer a specific information need?
Close – Fetch allows users to access the data they need for reports, mashups, competitive insight, whatever. The exponential growth of the Internet has produced a near-limitless set of raw and constantly changing data, on almost any subject, but the lack of consistent markup and data access has limited its availability and effectiveness. The rise of data APIs and the success of Google Maps have shown that there is an insatiable appetite for the recombination and usage of this data, but we are only at the early stages of this trend.
Your system performs some of the work that human analysts and programmers once had to do manually. Correct?
Yes. Estimates show that 75% of Web sites do not actively syndicate data, and even syndicated data can be “messy”, limiting its ability to be seamlessly integrated in business intelligence or analytics packages. We know that demand for data far outweighs syndication, given the rampant use of Web crawlers, inefficient Web scraping software, manual scripts, and millions of hours of manual Web site data cut-and-pasting performed by people. Automated systems and methods are required for speed, efficiency, and cost control.
Will most data be syndicated and normalized in the future?
Many experts point to a future in which most data is syndicated in some way, with every Web site offering perfectly structured and neatly marked-up data using uniform standards, and with real-time data distribution. While this is a compelling vision, Fetch makes this work in the world of today, providing a mechanism for users to define the data they want, mark it up in a semantic context that works for them, and normalize disparate data from different Web sites into a uniform, usable format.
These technologies will lead to a universe of unlimited possibility, in which users can dynamically answer almost any question through the intelligent monitoring, parsing and aggregation of data from the Web.
But the reality today is that data are disparate, messy, fast changing, and vitally important. Fetch solves these problems today.
Without divulging your firm's methods, will you characterize a typical use case for your firm's capabilities?
I can't mention a client's name. Is that okay?
Here's a typical use case. Consider a large retail operation. This is a large business operating in a market sector that has become increasingly competitive. Margins are decreasing. The company has a few analysts who spend about 50% of their time going to competitors’ Web sites. Not only is this time consuming and expensive; the approach is also inefficient. With our system, the company can monitor pricing in real time on the competitors' Web sites. In addition, the system acquires and processes news about the market, key raw materials, and other data points essential to our client's business.
How does your technology handle a more subtle task? Background screening is important in government agencies, financial institutions, and government contractors' operations.
The Fetch platform can handle this type of work easily. Background checks typically depend on manual collection and analysis of information about a person. The problem with manual checks, however, is that these are expensive and time consuming. Most organizations have one or two people who are really good at this type of work, but the demand for more checks and more timely information continues to go up.
We have a number of clients who have some tough budget targets and an increasing workload. The Fetch system automates the criminal record acquisition process and our system handles more than 200 online court sites. In addition, Fetch can sweep other sources of content. With the data in the Fetch system, background checks are quicker and more efficient, and the system never tires.
The clients tell us that with the Fetch platform, the checks are more accurate as well. Humans still play an important role, but each analyst is able to perform key tasks without the distraction of the manual processes.
I first learned about your system when I gave a talk for a US government agency. I think your company had an exhibit at the conference. What are the applications of Fetch in the intelligence community?
This is a part of our work that is vital to national security. Let me give you a very broad example. Is that okay?
Let's visualize a large government organization that is concerned about terrorism. The bad guys often share information on public Web sites. Some organizations operate full-scale information services which contain a broad range of information. The idea is that somewhere in the vast flows of information are items of information that are really significant.
Manual inspection of these Web sites is difficult. There are issues with latency. There can be some language barriers. There is the problem of the amount of information that flows 24x7. On the surface, the problem seems too big to resolve.
Our system acquires, normalizes, processes, and presents information from public Web sites. In addition, Fetch "reads" message boards, social networking content, blogs, and other sources.
This frees up analysts’ time to process the data, not collect it.
What are the benefits a company or non-profit organization can derive from Fetch?
First, Fetch allows users to gain access to new sources of important, real-time data to run their business or organization effectively. The blind spots are eliminated with our system.
Second, Fetch significantly slashes the cost of that data collection, by replacing costly manual data collection or unreliable custom scripts.
And, finally, Fetch increases the reliability of their data collection efforts, since most technology-based approaches to Internet data collection cannot handle the changing nature of Web sites.
One challenge to those involved with squeezing useful elements from large volumes of content is the volume of content AND the rate of change in existing content objects. What does your firm provide to customers to help them deal with the volume problem?
Yes, I agree. It is absolutely true that organizations face a challenge in managing the overwhelming amount of content, particularly with the digitization of all documentation and the incorporation of new Internet data. The Internet, in particular, brings a new set of challenges due to the dynamic nature of the content as well as the changing nature of the Web sites themselves.
What does Fetch do to cope with this seemingly intractable problem?
You are skeptical. I understand that. First, let me point out that our method focuses on targeted data, not Web pages.
Don't Bing and Google focus on Web content in this way?
No. We help organizations gather only the data they need, without extracting extraneous content, ads or HTML. This provides for a small data footprint, and ensures that the data has semantic relevance (i.e. “price”, “name”) as opposed to a blob of unstructured text.
Bing and Google are good at broad indexing. We focus like a laser. We deliver actionable information. That's the big difference.
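The difference between targeted fields and a blob of page text can be shown with a small sketch built on Python's standard HTML parser. The page fragment, the class names, and the `FieldExtractor` helper are assumptions for illustration; Fetch's agents use their own machine learning methods, which are not shown here.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the useful data is mixed with ads and layout markup.
PAGE = """
<div class="ad">Buy now! Limited offer!</div>
<div class="item">
  <span class="name">Widget Pro</span>
  <span class="price">$19.99</span>
</div>
<div class="footer">Copyright notice</div>
"""


class FieldExtractor(HTMLParser):
    """Collect text only from elements whose class names a wanted field."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # field currently being read, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.wanted:
            self.current = css_class

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] = data.strip()

    def handle_endtag(self, tag):
        # Any closing tag ends the field being read.
        self.current = None


extractor = FieldExtractor(["name", "price"])
extractor.feed(PAGE)
record = extractor.fields
print(record)
```

The result is a small labeled record with a `name` and a `price`; the ad and footer text never enter the data set.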
Aren't most Web data malformed and in many different formats?
Yes, and most people assume that a general Web query is comprehensive. Most of those looking for information via Bing or Google assume that the index is comprehensive and complete. Our approach is to focus and normalize data from multiple Web sites, making the data immediately actionable for downstream analysis, whether that is in a sophisticated business intelligence system, a Web analytics package, or a standard application like Microsoft Excel. The point is that Fetch creates a consistent, easily analyzed representation of information and data.
Most of the ArnoldIT Web sites are now dynamic. In fact, one of my sites displays data only when a user clicks.
Right. Your approach is very widespread.
The Internet is not a standard database; it is changing constantly. Fetch technology assumes these changes. Its use of machine learning helps it to adapt as Web sites change, allowing organizations to stay connected to data instead of experiencing downtime.
Another challenge, particularly in professional intelligence operations, is moving data from point A to point B; that is, information enters a system but it must be made available to an individual who needs that information or at least must know about the information. What does your firm offer licensees to address this issue?
When organizations consume data using Fetch services or software, they receive normalized, tagged data that works with their existing data infrastructure. Whether it is business intelligence software or specialized analytics software, it provides easy integration and the same data portability offered by those systems.
We have designed Fetch to "play well with others"; that is, Fetch complements existing enterprise systems, including work flow tools or visualization systems.
There has been a surge in interest in putting "everything" in a repository and then manipulating the indexes to the information in the repository. On the surface, this seems to be gaining traction because network resident information can "disappear" or become unavailable. What's your view of the repository versus non repository approach to content processing?
Generally, our customers appear to take a hybrid model for content processing. Given the chain of events in acquiring Internet data (Web site availability, network issues, normalization, etc.), the majority of our clients choose to store data locally, which speeds downstream analysis and processing.
However, since these data are constantly changing, they maintain active “connections” to the content source, both to ensure that they have the most up-to-date content as well as for auditing reasons.
Is this a real time operation?
Yes, but licensees can configure the system to work on a schedule they determine. Fetch supports real-time content acquisition and processing or a "heartbeat" set up with processes occurring on a time mark or some event function.
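One way to picture the two modes is a small scheduling check: an agent runs when its heartbeat interval has elapsed or when an event fires. This is a sketch under assumed names (`AgentSchedule`, `is_due`, the 15-minute interval), not Fetch's actual configuration interface.

```python
from datetime import datetime, timedelta


class AgentSchedule:
    """Decide when a hypothetical agent should run: on a heartbeat or on an event."""

    def __init__(self, interval, last_run):
        self.interval = interval
        self.last_run = last_run

    def is_due(self, now, event_fired=False):
        # An event (for example, a detected page change) triggers a run at once;
        # otherwise the agent waits until the heartbeat interval has elapsed.
        return event_fired or now - self.last_run >= self.interval

    def mark_run(self, now):
        self.last_run = now


start = datetime(2010, 7, 8, 9, 0)
schedule = AgentSchedule(interval=timedelta(minutes=15), last_run=start)

checks = [
    schedule.is_due(start + timedelta(minutes=5)),                    # too early
    schedule.is_due(start + timedelta(minutes=5), event_fired=True),  # event overrides
    schedule.is_due(start + timedelta(minutes=15)),                   # heartbeat reached
]
print(checks)
```

Shrinking the interval toward zero approximates continuous acquisition, while a longer interval gives the periodic heartbeat behavior.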
Visualization has been a great addition to briefings. On the other hand, visualization and other graphic eye candy can be a problem to those in stressful operational situations. What's your firm's approach to presenting "outputs"?
Our expertise is in the reliable and cost-effective access to the universe of Internet-based data, so we focus on data retrieval, aggregation, normalization, and basic analytics, and leave more advanced visualization to others. Our goal is to provide well-structured data sets that can be used by any of the myriad of sophisticated analytics and data mining solutions. I know that you have been writing in your blog about some of these systems. As I said, we mesh and integrate seamlessly.
I am on the fence about the merging of retrieval within other applications. What's your take on the "new" method which some people describe as "search enabled applications"?
I think this depends on what tasks these applications take on. Clearly search is an important component to any content-based application. The good news is that well designed, extensible applications can take advantage of APIs to incorporate data retrieval “into” their overall application. We’re seeing more of this given Google/Yahoo/Bing’s focus on data access. Fetch complements these “broad content search” systems nicely by focusing on directed data retrieval.
How does Fetch see the computing world over the next 12 to 18 months?
We think we understand the cloud.
It has been well documented that mobile devices will soon surpass more traditional computing systems in number. I think cloud computing will lead to an explosion of new consumer and business applications.
And with the increased availability of robust cloud-based systems to handle storage or computational-intensive tasks, users will get the best of both worlds – portability and power.
Fetch has already begun moving our operations into the cloud (in our case, via a private cloud), which will make us indifferent to whether the requesting device is a traditional Web server or an iPad.
If you look out across the regulatory landscape, do you see more or less government intervention in information processing? My hunch is that most people are unaware of how disparate items can be "assembled" into useful patterns.
This is a terrific question. I think we will see a few patterns.
First, we will continue to see acceleration in availability of content online. This is for several practical reasons, including the need for organizations to provide better access to data to significantly lower their data distribution costs. The U.S. Federal Government is trying to provide universal access to their data via the Web, and the growth in social networks has created huge new sources of data.
Second, while individuals will become more comfortable with having more content about them online, they will learn to be more guarded about exposing some information in a completely public forum. Some individuals are already feeling the backlash of having embarrassing photos of themselves available while they interview for jobs. Some specific use cases will continue to be regulated, such as the use of some public data for employment, but for most use cases the genie is already out of the bottle.
My view is that in this "brave new world" of broadly public information, a premium will be placed on monitoring, aggregating and managing this content proactively. Progressive companies are already beginning to do this by actively monitoring Twitter streams; individuals will shortly do the same.
In this context, Fetch is an enabler to help collect data on the Internet.
Put on your wizard hat. What are the three most significant technologies that you see affecting your business? How will your company respond?
That’s a tough question. Let me identify three technologies I think are important. Is it okay if I change my mind in three months?
Absolutely. I change my mind every couple of days.
First, I think new web standards and protocols such as HTML 5 are a big deal. The Internet is always evolving and new standards such as HTML 5 and PubSubHubbub will change the way that information is structured and disseminated. HTML 5 should enable more interactive and interesting Web sites as well as more powerful and efficient applications. Given that Fetch helps connect organizations with Internet-based data, we will be incorporating methods to take advantage of the benefits that HTML 5 will offer as Web sites begin incorporating it.
Second, the era of semantics is here. I would include machine-based translation and related "understanding" enablers. The web is highly unstructured, making it extremely difficult to understand the vast amounts of data being published. Semantic technologies, including fact and topic extraction from unstructured text, will help to drive information comprehension (identifying topics, data types, etc.). In addition, the rise of machine translation will help to bridge the language gap, and truly allow global data flow, abstracting location and language, and increasing the richness of available data. As these technologies increase the amount of structured data, Fetch will help connect it to other data sets, and to users that can benefit from this global data set.
And, my third technology, is what I call record linkage and entity resolution.
What does "record linkage and entity resolution" mean? Entity extraction? Hyperlinks?
What I mean is the ability to connect semantically identical (but possibly not cosmetically identical) data is becoming more important as the corpus of web data grows. Robust record linkage and entity resolution technologies are critical elements to create these connections, helping to organize and clarify the vast amount of web data around people, places, products, etc. Fetch Labs, the research arm of Fetch Technologies, has been working on the entity resolution problem for many years, understanding the importance of this task and the benefits of being able to resolve disparate data into a canonical set of entities.
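A toy version of the idea, linking names that are semantically identical but not cosmetically identical, can be sketched with Python's standard `difflib`. The sample names, the suffix list, and the 0.9 threshold are assumptions chosen for illustration; this simple string-similarity approach stands in for, and is much cruder than, the research methods Fetch Labs would use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical company names as they might appear on different sites.
NAMES = [
    "Fetch Technologies, Inc.",
    "fetch technologies",
    "Acme Corp.",
    "ACME Corporation",
    "Globex",
]

# Legal suffixes that carry no identifying information (assumed list).
SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd"}


def canonical_key(name):
    """Lowercase, drop punctuation, and remove legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def same_entity(a, b, threshold=0.9):
    """Treat two names as one entity when their cleaned forms are very similar."""
    ratio = SequenceMatcher(None, canonical_key(a), canonical_key(b)).ratio()
    return ratio >= threshold


def resolve(names):
    """Greedily group names, comparing each to the first member of every cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if same_entity(name, cluster[0]):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


clusters = resolve(NAMES)
for cluster in clusters:
    print(cluster)
```

Each cluster corresponds to one canonical entity; in practice, additional attributes such as addresses or identifiers would be compared as well as names.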
Michael, thanks for your time. I see your mobile device jittering. Let's wrap up. One final question. Where can I get more information about Fetch Technologies?
Anyone interested in learning more about Fetch should go to http://www.fetch.com
Fetch Technologies is one of the leaders in web content acquisition and normalization. The firm's clients range from government agencies to financial institutions to commercial enterprises. The cost of normalizing data and information from Web sites is not well understood. Transformation can chew up as much as one third of an information technology unit's budget. A larger problem is keeping pace with the information that flows through social sites, Web logs, and other online sources. The cost of missing key information is high. In fact, an information error can have a significant impact on an organization's ability to complete a mission or make a profit. Fetch Technologies is a leader in web content processing. Worth a close look in my opinion.
Stephen E. Arnold, July 14, 2010