Objectives

The current architecture for crawling the web, centralised content access over HTTP, has serious shortcomings. These must be overcome if search engine indexes are to be sustained as the web grows and as search becomes more important in consumers' lives.

The Site Update Notifications System is designed to overcome the current issues with crawler scalability, search engine index currency and restricted access to the Deep Web. The paradigm shift that enables the system to achieve these goals is to distribute the effort of the crawler among the web servers that host the content. In doing so, the function changes from a crawler that accesses a web site via HTTP requests to a web server agent with enough domain knowledge to create a more accurate record of changes to page content and presentation.
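To illustrate the idea of a server-side agent recording changes, here is a minimal sketch in Python. It is not the system's actual implementation: the function names `snapshot` and `diff_snapshots`, the use of SHA-256 content hashes, and the shape of the change record are all assumptions made for illustration. It shows only that an agent running on the web server can compare two views of the document root locally, without issuing any HTTP requests.

```python
import hashlib
from pathlib import Path


def snapshot(doc_root):
    """Map each file under doc_root (relative POSIX path) to a SHA-256 digest.

    Hypothetical helper: a real agent might read from a CMS or database instead.
    """
    root = Path(doc_root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_snapshots(previous, current):
    """Compare two snapshots and return an (assumed) change record."""
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "modified": sorted(
            name
            for name in set(previous) & set(current)
            if previous[name] != current[name]
        ),
    }
```

An agent built along these lines would take a snapshot after each publishing event, diff it against the previous one, and pass only the resulting record on to interested indexers. How that record is delivered is outside the scope of this sketch.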

There is evidence that this distribution of effort will be accepted by the web community if the objectives are achieved. Search engines want their indexes to be as complete and as current as possible so that they can provide accurate and comprehensive search results. This is shown by the massive resources and infrastructure that the big players have invested in repeatedly crawling websites, and by current research into crawling the Deep Web.

The creation and success of the Search Engine Optimisation (SEO) market demonstrates that individual web masters are also willing to invest to increase their site's position in search engine results for relevant queries. The Site Update Notification System can be used as a valuable SEO tool to feed updates to indexers much more quickly than is currently possible.

Challenges facing the World Wide Web

In January 2008, there were 541,677,360 hosts advertised in the internet DNS. Such a large number of sites on the internet poses significant problems for everyone who uses the internet to publish or discover information:

For Search Engines

How to crawl, index and rank the contents of every page on every site often enough that regularly updated content is contained within the index and is accessible to users.

For Web Masters

How to make newly published information findable by interested readers – typically through popular search engines.

For End Users

How to find information which is relevant and up-to-date.

The method of maintaining a Search Engine's index of the web has hardly changed since the earliest search engines; a central repository collects information from as many sources as it can and provides a web-based user interface to interrogate the repository using keywords. This needs a lot of investment in bandwidth and processing power to visit each of the 140,000 new sites per day and re-visit existing sites looking for pages that have been updated. Some techniques have been developed to help to reduce the workload of web crawlers, such as the introduction of the Robots Exclusion Protocol and Sitemaps, but these are crude and largely ineffective in solving the larger problem of indexing the Web.
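As a small illustration of what the Robots Exclusion Protocol can express, the sketch below uses Python's standard `urllib.robotparser` module to evaluate a sample `robots.txt`. The sample rules, the `ExampleBot` user agent, the `example.com` URLs and the helper name `allowed` are all made up for illustration. The rules only tell a crawler where it may or may not go; they say nothing about which pages have changed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/private/report.html",
    ):
        print(url, allowed(ROBOTS_TXT, "ExampleBot", url))
```

Even with these rules in place, a crawler must still re-fetch every permitted page to find out whether its content has been updated.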

The challenge for Web Masters is large enough that a new market for Search Engine Optimisation (SEO) has evolved to provide documentation, tools and expertise aimed at achieving a high ranking for web content in the major search engines' results.

As the web continues to grow and its content becomes increasingly dynamic, maintaining an up-to-date index of all sites is less achievable than it has ever been, and even maintaining an index of a useful size for an increasing customer base is becoming unmanageable. The current solution of adding more bandwidth and capacity for crawling the web is unsustainable in the medium and long term.

