
April 08, 2004

Retrospective vs Prospective Search (Past vs Future)

We've been finding that a significant percentage of the people with whom we discuss PubSub.com most easily understand the idea of "Publish and Subscribe" when it is described as a form of "search". Thus, we find ourselves using the "Searching the Future" line quite a bit. This is a bit surprising, since the technical community has traditionally seen publish/subscribe primarily as a messaging technology. It has become clear to us that the pervasiveness of search tools such as Google is affecting not only the way we use the Internet, but also the way that many people think about it.

Publish/subscribe systems are, in fact, very similar to search systems, so I wouldn't object to the comparison. As it turns out, though, there are a number of important differences between what we do with the kind of publish/subscribe systems we build at PubSub.com and what is done by the search engines of folks like Google, AltaVista, Yahoo! and soon Microsoft. This note begins to describe some of the differences between "traditional" search and the kind of "search" that is done by a PubSub system.

First, let me coin some terms so that it is easier to talk about these two styles of search:

  • Retrospective Search: That which is done by traditional search engines -- "Searching the past". This kind of search typically relies on net crawlers, spiders or other data entry methods to gather files into a historical, searchable collection. While the collection changes over time, at the time of any specific "search", the collection can be considered to be static. Any single search query will only be evaluated against those documents that were collected prior to the moment the query was submitted to the search engine. On the other hand, the queries in such a system are constantly changing and are thus dynamic.
  • Prospective Search: What PubSub does -- "Searching the future". This kind of search can depend on a wide variety of means to gather newly updated documents but does not rely on a static, historical collection. In a prospective search, a query is registered in a collection of queries and then evaluated against every new document as it is discovered. Thus, while the query is static, the collection of documents against which it is evaluated is dynamic. Also, the entire collection of currently active queries can be considered "static" at the moment that any new document arrives in the system.

So, a retrospective system is characterized by a static document collection and dynamic, single-use queries while a prospective search system will have, at any moment, a static set of persistent queries and a dynamic document under evaluation. Retrospective systems typically need to store large numbers of documents but have no need to store queries. Prospective systems need to store queries but can discard documents as soon as the collection of persistent queries has been evaluated against them.
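To make the contrast concrete, here is a minimal Python sketch of the two styles. It is an illustration only, not a description of how PubSub.com or any search engine is actually built: the class names, the subscriber names and the toy "all terms must appear" matcher are assumptions made for the example.

```python
from dataclasses import dataclass, field


def matches(query: str, text: str) -> bool:
    """Toy matcher (an assumption): every query term must appear in the text."""
    words = set(text.lower().split())
    return all(term in words for term in query.lower().split())


@dataclass
class RetrospectiveIndex:
    """Stores documents; queries are single-use and never stored."""
    documents: list[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.documents.append(text)

    def search(self, query: str) -> list[str]:
        # Evaluated only against documents collected before this call.
        return [doc for doc in self.documents if matches(query, doc)]


@dataclass
class ProspectiveMatcher:
    """Stores queries; documents are not retained after evaluation."""
    subscriptions: dict[str, str] = field(default_factory=dict)

    def subscribe(self, subscriber: str, query: str) -> None:
        self.subscriptions[subscriber] = query

    def publish(self, text: str) -> list[str]:
        # Evaluate every persistent query against the new document and
        # return the subscribers to notify; the document is not stored.
        return [who for who, query in self.subscriptions.items()
                if matches(query, text)]
```

Note where the state lives in each class: the index holds documents and forgets each query as soon as it is answered, while the matcher holds queries and forgets each document as soon as it has been evaluated.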

These two styles of "search" address different but complementary needs. Both are often required as part of a comprehensive program for discovering knowledge and keeping it up-to-date. For instance, it is often useful to first do a retrospective search to discover "what is known" about an area of interest and then to establish a persistent, prospective search in order to be notified "whenever something new appears." (You might, for instance, search the archives of the New York Times for stories about "copyright law" and then create a "news alert" to be notified whenever they write a new story on the subject.)

While the high-level differences between retrospective and prospective search systems are fairly obvious, there are also many subtler ones. I'll briefly identify some of them here and then expand on them in this and later posts.

  • Relevance: Retrospective search engines use relevance metrics primarily for ordering of results. Prospective systems use relevance metrics primarily for filtering.
  • Search Query Precision: Retrospective systems often support functions like stemming, fuzzy and phonetic matches that tend to broaden the number of results returned. Prospective systems tend to focus more on functions which limit the number of results that are returned.
  • Event Order: Prospective systems are often used in environments where the value of information is time-sensitive and where there are often important temporal relationships between the creation of information objects. Thus, prospective systems will often implement complex languages for specifying significant temporal relationships between items (e.g. "search for press releases which follow an S-1 SEC filing.") This kind of "pattern matching" is less frequently deployed in retrospective systems.
  • Timeliness: Prospective systems are typically focused on minimizing the latency between the receipt of an information object and its processing in the system. Retrospective systems, however, often choose to optimize other qualities at the expense of insertion speed. For instance, most Internet search engines can take 4 to 8 weeks to index the average webpage but are not perceived as "slow" since the vast majority of their content is much older.
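The "Event Order" point is perhaps the easiest of these to picture in code. The sketch below shows one way a temporal pattern such as "a press release which follows an S-1 filing" might be evaluated against a stream of events. Everything here is hypothetical: the `FollowsPattern` class, the event kinds, the idea of keying on a company name and the time window are assumptions chosen for illustration, not features of any particular system.

```python
from datetime import datetime, timedelta


class FollowsPattern:
    """Fires when a `then_kind` event follows a `first_kind` event
    for the same company within `window` (all names are illustrative)."""

    def __init__(self, first_kind: str, then_kind: str, window: timedelta):
        self.first_kind = first_kind
        self.then_kind = then_kind
        self.window = window
        # company -> time of the most recent first_kind event
        self._last_seen: dict[str, datetime] = {}

    def observe(self, kind: str, company: str, when: datetime) -> bool:
        """Feed one event, in arrival order; return True if the pattern matched."""
        if kind == self.first_kind:
            self._last_seen[company] = when
            return False
        if kind == self.then_kind:
            seen = self._last_seen.get(company)
            if seen is None:
                return False
            return timedelta(0) <= when - seen <= self.window
        return False
```

The only state kept is the time of the last qualifying event per company, which fits the prospective model: the events themselves can be discarded once they have been observed.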

I'll expand on these points in later postings. But for now, let me say a bit more about the differences in the handling of Relevance by these two kinds of system.

More about Relevance:

As the quantity of information that can be searched has increased, the ability to deliver "Relevant" results and to define new and effective measures of "relevance" has become a significant search engine differentiator. For instance, many claim that the key to Google's success in the retrospective search business has been that their innovative "PageRank" metric allows them to deliver more relevant information than the search engines of their competitors. Of course, relevance is also important in prospective search systems. But relevance is typically used very differently in retrospective systems than in prospective systems.

In retrospective systems, relevance is commonly used as a tool for ordering search results. For example, even though Google might tell you that there are 31,500 historical pages that match the query "pubsub", you typically won't scan all 31,500 pages since PageRank will usually place the most relevant references on the first few pages of the result set. On the other hand, in prospective systems, where new results are matched and delivered as they are discovered, you really can't "order" the results. If results are returned, they are returned in the order and at the time that they are discovered. Thus, relevance in a prospective search system is not used to determine the order in which results are returned. Rather, relevance is used to determine whether or not a result is returned at all! Relevance in a prospective system is a filtering function, not an ordering function. In such a system, items that fall below a certain relevance threshold are simply not delivered. This difference in the use of relevance measures naturally leads to differences in the way that relevance metrics are defined within the two kinds of system.
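The ordering-versus-filtering distinction can be sketched in a few lines of Python. This assumes each result already carries a numeric relevance score; how that score is computed is out of scope here, and the function names and the threshold parameter are hypothetical.

```python
from typing import Iterable, Iterator


def order_by_relevance(scored: Iterable[tuple[str, float]]) -> list[str]:
    """Retrospective use: the whole result set is known, so sort it."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked]


def filter_by_relevance(
    stream: Iterable[tuple[str, float]], threshold: float
) -> Iterator[str]:
    """Prospective use: decide on arrival whether to deliver each item."""
    for doc, score in stream:
        if score >= threshold:
            yield doc  # delivered in arrival order, never re-sorted
```

The first function needs the complete result set before it can return anything. The second can act on each item the moment it arrives, which is why it can only accept or reject, not rank.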

Of course, search "professionals" will object to the simple distinction that I've made above between the handling of relevance in the two kinds of search systems. In "professional" retrospective search engines, relevance has been used for *both* ordering and filtering ever since the earliest computerized retrospective search systems were deployed in the 1960s. This is just one of the many ways in which the broad-market Internet-based search systems tend to differ from the state-of-the-art. It should also be noted that in some prospective systems, matches are not made or reported at precisely the same time that new documents are discovered. In many prospective search systems, results are either gathered by regularly polling a retrospective system for "new" entries that match (i.e. entries since the last poll) or accumulated and delivered in batches, either according to a schedule or whenever a certain number of results have been accumulated. In such systems, it does become possible to do some limited ordering of the results within a single result batch.
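As a rough illustration of that last case, the sketch below accumulates matched results and releases them in relevance order once a batch is full. The `BatchingDelivery` class and its count-based trigger are assumptions made for the example; a schedule-based trigger would work the same way, with a timer in place of the size check.

```python
from typing import Optional


class BatchingDelivery:
    """Accumulates matched results and releases them in ordered batches."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self._pending: list[tuple[float, str]] = []

    def add(self, doc: str, score: float) -> Optional[list[str]]:
        """Queue one match; return an ordered batch when full, else None."""
        self._pending.append((score, doc))
        if len(self._pending) < self.batch_size:
            return None
        # Ordering is only possible within this batch, not across batches.
        batch = sorted(self._pending, key=lambda pair: pair[0], reverse=True)
        self._pending = []
        return [doc for _, doc in batch]
```

The trade-off is visible in the code: the first result in a batch waits for the rest to arrive, so any ordering gained is paid for in latency.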

I'll say more about this in future postings. I hope these quick comments have been useful.

April 8, 2004 in PubSub.com
