In 2001, Michael K. Bergman wrote the white paper The Deep Web: Surfacing Hidden Value [1], in which he draws an analogy of a Search Engine dragging a net across the surface of the web, indexing the information found there but missing the denser and potentially more valuable information buried deep below. This analogy works as well today as it did seven years ago: the Deep Web (also called the Deepnet, the Invisible Web, or the Hidden Web) is inaccessible to the mass population using mainstream search engines such as Google, Microsoft Live or Yahoo! The reasons for this are many. A Search Engine has an enormous job to crawl and index all the sites that it knows about because of the volume of information they contain. Google recently posted a blog entry entitled “We knew the web was big...” [2] claiming that it had hit a milestone of one trillion (that's 1,000,000,000,000) active and unique URLs in its database, although it concedes that it does not index them all. Cuil (www.cuil.com), a new player in the search engine space, claims to be the world's biggest search engine with 121 billion pages indexed. This sounds like a lot of pages, but given that there are 541 million hosts, it averages only about 224 pages per host. The numbers are big, and in isolation sound impressive, but the fact is that this is still only the tip of a very large iceberg.
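The pages-per-host figure quoted above is simple division, and it can be checked with a short sketch. The constant and function names below are illustrative only; the two numbers are the 2008 claims cited in the text, not independently verified values.

```python
# Figures as quoted in the text (2008 claims, not independently verified).
CUIL_INDEXED_PAGES = 121_000_000_000  # pages Cuil claims to have indexed
KNOWN_HOSTS = 541_000_000             # number of hosts cited in the text


def average_pages_per_host(pages: int, hosts: int) -> float:
    """Return the mean number of indexed pages per host."""
    return pages / hosts


pages_per_host = average_pages_per_host(CUIL_INDEXED_PAGES, KNOWN_HOSTS)
print(f"About {pages_per_host:.0f} indexed pages per host")
```

Rounded to the nearest whole page, this gives the 224 pages per host mentioned above, a small number next to the size of a typical database-backed site.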
Many web pages are inaccessible to search engines because their sites restrict access to users with accounts. If the spider cannot access the pages, it cannot index them, and the content will not be included in the Search Engine's results for relevant queries.
While it is important to restrict some content to subscribers, it is also important to recognise that Search Engines provide a means of exposure, too. If protected content were available to search engines in such a way that they could provide summary information to a public user, this could lead to new subscribers for paid-for content. As it is today, the entire paid-for repository is often inaccessible to any user who is not currently subscribed. Account-protected content is an increasing trend on the web as businesses recognise the value in maintaining a list of users' email addresses and seek to exploit the large customer base provided by this new medium. As the trend grows, the Deep Web will become deeper, and proportionally less content will be accessible via generic Search Engines.
Online product sales continue to grow: in the UK they reached £26.5bn during the first six months of 2008, up 38% on 2007 [3]. Dedicated online stores such as Amazon.com, as well as online representations of high street stores such as Mothercare (www.mothercare.co.uk), are therefore keen for their pages to be at the top of Search Engine results for the products that they sell. These online retailers have a lot of work to do to structure their sites effectively so that the deeper content, the items for sale, is accessible to the Search Engine spider. Consumer sales sites such as eBay.com and Autotrader.co.uk have a bigger challenge still, as their sale items are transient: these sites sell one-off items for a limited period, so they need to rely on a Search Engine to crawl and index their site very regularly.
Web sites of agency services, such as house sales sites representing estate agents (e.g. realtor.com, rightmove.co.uk) and job vacancy sites representing recruitment agencies (e.g. monster.com), provide sophisticated search user interfaces to enable users to find property, jobs, etc. to suit their needs. These sites typically do not prevent bots from large search engines from accessing them, but the content is typically inaccessible to bots that simply navigate the site by following href links, because the listings are reached through the search interface rather than through static links. Like those of consumer sales sites, the products (properties, job vacancies, etc.) are transient, specific one-off entries available for a limited period, so these sites need a Search Engine to crawl and index their site regularly.
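To illustrate why a bot that only follows href links misses content behind a search interface, here is a minimal sketch using Python's standard html.parser module. The sample page, its paths and its field names are hypothetical, and a real spider is considerably more sophisticated; the point is only that anchors yield URLs to queue while a search form does not.

```python
from html.parser import HTMLParser


class LinkAndFormCollector(HTMLParser):
    """Record what a naive link-following bot would see on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []  # href targets the bot can queue
        self.forms: list[str] = []  # form actions the bot will not submit

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "form":
            self.forms.append(attributes.get("action") or "")


# Hypothetical property-search page: two static links and one search form.
SAMPLE_PAGE = """
<html><body>
  <a href="/about">About us</a>
  <a href="/contact">Contact</a>
  <form action="/search" method="get">
    <input name="town"> <input name="max_price">
    <input type="submit" value="Find property">
  </form>
</body></html>
"""

collector = LinkAndFormCollector()
collector.feed(SAMPLE_PAGE)
print("Links the bot can follow:", collector.links)
print("Search forms it does not submit:", collector.forms)
```

In this sketch the bot queues only /about and /contact. Every listing that would be returned by submitting the form at /search stays out of its reach, which is the situation described above for property and job vacancy sites.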
1. Bergman, Michael K., The Deep Web: Surfacing Hidden Value (white paper), Journal of Electronic Publishing 7(1), University of Michigan, August 2001
2. Official Google Blog, We knew the web was big..., 25th July 2008
3. Finch, J., Online sales boom as shoppers desert high street, The Guardian online, 18th July 2008
Read more about the Site Update Notification System.