Search Engine Components

A Search Engine consists of three core components: the Crawl, the Index and the Query Processor (the runtime system). Each component poses significant challenges of scale and function.

The Index is an internal system for each search engine provider and is generally bespoke technology that is not subject to wider discussion.

The Query Processor has been the largest area of interest for public research and discussion, as it is the consumer-facing component of a web search. The search form (typically an edit box to accept keywords and a search button) provides consumer access to the search index, and the search results are presented back to the user. Many large search engines that crawl and index the web, such as Google, Microsoft Live Search and Yahoo!, provide third-party access to their search indexes via APIs. This has yielded a proliferation of so-called Meta Search Engines, which implement the Query Processor but use the Index from other vendors. Meta Search Engines such as Dogpile, Excite and HotBot aggregate search results from multiple indexes and present them in a single display.

Crawler Research

Publicly accessible research on the Crawl is generally lacking. The large search engines are clearly working on advancing the methods of crawling the web, and some research is published [ref], but the volume is minimal and does not reflect the size of the problem. The last major development in web site collaboration to make crawling the web more efficient was Sitemaps, in June 2005 [ref].

There are three problems with the Crawl: Scalability, Frequency and Accessibility. Crawling the web is an enormous task that requires a great deal of network bandwidth and processing power. As the number of sites grows, it becomes increasingly difficult to crawl them often enough to keep the indexed content current. The Deep Web demonstrates the accessibility problem: HTTP-client crawlers are unable to reach the content contained within it.

Subscription Service

Search Engine Companies and others that currently crawl the world-wide-web can access the change information of websites directly through the Site Update Subscription Service (SUSS). The service provides on-demand delivery of change information relating to one or more specific sites, or to all sites, within a given time period. The information is returned to an authenticated subscriber in an XML document. Typical subscribers to the SUSS are those that currently crawl internet web sites for content, or that benefit from knowing that the content has changed. The largest of these are web Search Engines; others include Page Monitors, Bookmark Managers, Offline Browsers and Website Mirroring tools.
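As an illustration of how a subscriber might consume such a document, the sketch below parses a change report into a list of changed URLs. The XML layout shown (a changes element containing site and page elements with url, path and changed attributes) is an assumption made for this example only, and the authenticated request that would fetch the document is omitted.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical response document: the element and attribute names are
# assumptions for illustration, not the service's actual schema.
SAMPLE_RESPONSE = """\
<changes from="2008-01-01T00:00:00" to="2008-01-02T00:00:00">
  <site url="http://example.com/">
    <page path="/news.html" changed="2008-01-01T09:30:00"/>
    <page path="/about.html" changed="2008-01-01T17:05:00"/>
  </site>
  <site url="http://example.org/">
    <page path="/index.html" changed="2008-01-01T12:00:00"/>
  </site>
</changes>
"""


def parse_changes(xml_text):
    """Return (absolute_url, changed_at) tuples, oldest change first."""
    root = ET.fromstring(xml_text)
    changes = []
    for site in root.findall("site"):
        base = site.get("url").rstrip("/")
        for page in site.findall("page"):
            changed_at = datetime.fromisoformat(page.get("changed"))
            changes.append((base + page.get("path"), changed_at))
    return sorted(changes, key=lambda item: item[1])


if __name__ == "__main__":
    for url, changed_at in parse_changes(SAMPLE_RESPONSE):
        print(changed_at.isoformat(), url)
```

Sorting by change time is only one possible choice; a subscriber could equally group the results by site before handing them to its crawler.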

An alternative subscription service is available to consumers as an RSS feed. This XML document contains a higher-level change notification alert, providing an end-user page monitoring service using a standard technology.
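Because RSS is a standard format, a page monitor could read such a feed with ordinary tools. The following sketch lists the pages reported as changed after a given time. The feed content is invented for the example, and the way each alert is mapped onto the standard RSS 2.0 title, link and pubDate elements is an assumption.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Invented sample feed; only the standard RSS 2.0 elements are used.
SAMPLE_FEED = """\
<rss version="2.0">
  <channel>
    <title>Changes for example.com</title>
    <link>http://example.com/</link>
    <description>Page change alerts</description>
    <item>
      <title>/news.html changed</title>
      <link>http://example.com/news.html</link>
      <pubDate>Tue, 01 Jan 2008 09:30:00 GMT</pubDate>
    </item>
    <item>
      <title>/about.html changed</title>
      <link>http://example.com/about.html</link>
      <pubDate>Mon, 31 Dec 2007 17:05:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def changed_since(feed_text, since):
    """Return the links of feed items published after `since` (timezone-aware)."""
    root = ET.fromstring(feed_text)
    links = []
    for item in root.iter("item"):
        # RSS dates use the RFC 822 format, which email.utils can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        if published > since:
            links.append(item.findtext("link"))
    return links


if __name__ == "__main__":
    cutoff = datetime(2008, 1, 1, tzinfo=timezone.utc)
    print(changed_since(SAMPLE_FEED, cutoff))
```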

Crawlers' Behaviour

The adoption of the Site Update Notification System by a search engine company, or other organisation that regularly crawls the web, will inevitably require changes to their Crawl implementations. The modifications are, however, relatively minor in concept as the system can be treated as an enhancement to the existing Crawl method.

The system's purpose is to act as a prompt to subscribers, indicating that the content or presentation of specific pages on a site has changed. This prompt indicates that the spider should be sent to crawl those pages and access the new or changed information. The current methodology employs proprietary algorithms to schedule the spider to re-crawl sites based on factors such as the demonstrated frequency of change (perhaps based on some historical trend analysis), information supplied by the webmaster via sitemaps [ref] about the intended frequency of change, the popularity of the site as determined by further proprietary algorithms, or other statistically based guesswork as to the likelihood that a site has changed since the last crawl.

The modification needed to support the system, then, is to move the mechanics of scheduling the crawl away from polling and towards notification. This removes the unnecessary effort of re-crawling unchanged pages, which in turn frees bandwidth for crawling new content.

An important design goal of the system has been to minimise the impact on spider implementations. The spider retains control of its own schedule, based upon its own rules and policies. The system described empowers the spider to be more efficient by providing detailed information about real changes made to a web page. Rather than blindly crawling a site and downloading comparatively large volumes of data only to discover that a page is unchanged, the spider is now notified that the page has changed, and can make its own decisions on when to act upon that knowledge and re-crawl the page.
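One way to picture this change is as a polling schedule that notifications are allowed to pull forward. The sketch below is a minimal illustration under stated assumptions, not a description of any real spider: the class name, the fixed polling interval and the minimum gap between crawls of the same page are all hypothetical stand-ins for a crawler's own proprietary policies.

```python
import heapq


class CrawlScheduler:
    """Polling schedule that change notifications can pull forward."""

    def __init__(self, poll_interval, min_gap):
        self.poll_interval = poll_interval  # fallback re-crawl period (seconds)
        self.min_gap = min_gap              # spider's own minimum gap between crawls
        self.due = {}                       # url -> time of the next planned crawl
        self.last_crawl = {}                # url -> time of the last crawl
        self.heap = []                      # (due_time, url); may hold stale entries

    def _plan(self, url, when):
        self.due[url] = when
        heapq.heappush(self.heap, (when, url))

    def add(self, url, now):
        """Register a page and plan its first crawl immediately."""
        self._plan(url, now)

    def notify_changed(self, url, now):
        """Bring the next crawl forward, but never inside the spider's min_gap."""
        earliest = self.last_crawl.get(url, now - self.min_gap) + self.min_gap
        when = max(now, earliest)
        if when < self.due.get(url, float("inf")):
            self._plan(url, when)

    def pop_due(self, now):
        """Return the pages due for crawling and re-plan them by polling."""
        ready = []
        while self.heap and self.heap[0][0] <= now:
            when, url = heapq.heappop(self.heap)
            if self.due.get(url) != when:
                continue  # stale entry superseded by a newer plan
            ready.append(url)
            self.last_crawl[url] = now
            self._plan(url, now + self.poll_interval)
        return ready


if __name__ == "__main__":
    scheduler = CrawlScheduler(poll_interval=86400, min_gap=600)
    scheduler.add("http://example.com/news.html", now=0)
    print(scheduler.pop_due(0))      # first crawl
    scheduler.notify_changed("http://example.com/news.html", now=100)
    print(scheduler.pop_due(600))    # re-crawl prompted by the notification
```

The notification only changes when a page becomes due; whether and when the spider actually acts on it remains governed by its own rules, represented here by min_gap.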

Benefits

Search Engine Companies and others who crawl the web for content will reduce the workload and bandwidth needs of the crawler, and the workload of the indexer, by receiving accurate information about web site changes made at source rather than working with changes identified in the documents delivered over HTTP. Reducing the workload will enable access to more sites in a shorter timeframe with the same IT resources. Finally, the syndicated search index updates provide a mechanism to keep an index current with fast-changing sites such as news and current affairs.
