Google recently announced on one of its product blogs that it has begun an effort to index the “invisible” web. The details point to what may be a big step in the technology for indexing online content. According to the announcement, Google's crawlers detect online forms and fill them in with suitable data to generate pages that can then be indexed.

How deep is the web?

An excerpt from the Google Webmaster Central Blog.

In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google. Specifically, when we encounter a <form> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made.
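
To make the excerpt concrete, here is a minimal Python sketch of the idea it describes: read a form, pick values for each input, and build the URLs a user's query could have produced. This is not Google's implementation. The names `FormFieldParser` and `candidate_urls`, the `keywords` argument (standing in for "words from the site"), and the `limit` cap (standing in for "a small number of queries") are all assumptions made for illustration, and only text boxes and select menus are handled.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormFieldParser(HTMLParser):
    """Collects the action, method, and fields of the form(s) on a page.

    Hypothetical helper for illustration; if a page has several forms,
    their fields are merged and the last action/method wins.
    """

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"        # HTML forms default to GET
        self.text_fields = []      # names of <input type="text"> boxes
        self.select_values = {}    # select name -> list of option values
        self._open_select = None   # name of the <select> being parsed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.action = a.get("action") or ""
            self.method = (a.get("method") or "get").lower()
        elif tag == "input" and a.get("type") == "text" and a.get("name"):
            self.text_fields.append(a["name"])
        elif tag == "select" and a.get("name"):
            self._open_select = a["name"]
            self.select_values[self._open_select] = []
        elif tag == "option" and self._open_select and a.get("value"):
            self.select_values[self._open_select].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._open_select = None


def candidate_urls(page_url, html, keywords, limit=10):
    """Return up to `limit` URLs that submitting the form could produce."""
    parser = FormFieldParser()
    parser.feed(html)
    if parser.method != "get":
        return []  # only GET forms yield a crawlable URL

    # Text boxes take words from the site; selects take their own options.
    fields = {name: list(keywords) for name in parser.text_fields}
    fields.update(parser.select_values)
    if not fields:
        return []

    names = list(fields)
    base = urljoin(page_url, parser.action)
    urls = []
    for combo in product(*(fields[n] for n in names)):
        urls.append(base + "?" + urlencode(dict(zip(names, combo))))
        if len(urls) >= limit:
            break
    return urls
```

Given a search form with one text box named `q` and one select menu named `cat`, calling `candidate_urls` with two keywords yields one URL per combination of keyword and menu option, each of which a crawler could then fetch like any other link.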

The invisible web is the part of the Internet that search bots and crawlers cannot reach in the normal course of indexing.

Wikipedia has a synopsis of what qualifies as the invisible web.

Deep Web resources may be classified into one or more of the following categories:

  • Dynamic content – dynamic pages which are returned in response to a submitted query or accessed only through a form (especially if open-domain input elements e.g. text fields are used; such fields are hard to navigate without domain knowledge).
  • Unlinked content – pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).
  • Private Web – sites that require registration and login (password-protected resources).
  • Contextual Web – pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence).
  • Limited access content – sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs or pragma:no-cache/cache-control:no-cache HTTP headers), prohibiting search engines from browsing them and creating cached copies.
  • Scripted content – pages that are only accessible through links produced by JavaScript as well as content dynamically downloaded from Web servers via Flash or AJAX solutions.
  • Non-HTML/text content – textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.

Until recently, the invisible web was indexed only when sites made their content available through submission. Google’s approach hints at the application of several technologies that it has been researching for years but has seldom mentioned in its products.

The new ability of Google’s crawlers to simulate form submissions implies the application of artificial intelligence, language processing and contextual analysis, technologies that have come to Google by way of acquisition and in-house talent. For now, the process is limited to forms that use GET for data submission.
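
The GET restriction makes sense once you look at where the form data travels. With GET, the field values are encoded into the URL itself, so the result page has an address a crawler can fetch and index. With POST, the values go in the request body and the URL stays the same. The following Python sketch shows the difference using the standard library; `build_submission` and the example.com URL are hypothetical names chosen for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_submission(action_url, fields, method="get"):
    """Build (but do not send) the request a form submission would make."""
    encoded = urlencode(fields)
    if method.lower() == "get":
        # GET: the data becomes part of the URL's query string.
        return Request(action_url + "?" + encoded)
    # POST: the URL is unchanged and the data travels in the body.
    return Request(action_url, data=encoded.encode("ascii"))


get_req = build_submission("http://example.com/search", {"q": "deep web"})
post_req = build_submission("http://example.com/search", {"q": "deep web"}, "post")

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Only the GET request produces a distinct URL for the query, which is what an index of pages needs.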

This means Google will be able to answer users' questions more accurately. It is another effort from Google to be the one-stop shop for all queries related to the web: in essence, Google automates the search process at other websites so that users land directly on a final result page.
