Trawling the Deep Web
The majority of web pages one can access through search engines were collected by crawling the so-called Static or Surface Web. It is the smaller portion of the Internet, reportedly containing between 8 and 20 billion pages (Google vs. Yahoo index sizes). Though this number is already very large, the total number of pages available on the Web is estimated at 500 billion. The remainder, the part that search engines do not reach, is often referred to as the Deep Web, Dynamic Web, or Invisible Web. All these names reflect some of the features of this gigantic source of information - stored deep down in databases, rendered through DHTML, not accessible to standard crawlers. Pages in the Deep Web often do not have a standard URL and cannot be addressed in a standard fashion. In many cases, they do not even exist until a user asks a question by filling in the fields of a form, and a response page is generated. Typical examples of Deep Web applications are airline reservation systems, online dictionaries, etc.
It is supposedly quite easy for a human to navigate through the Deep Web. One just needs to fill in a form by choosing one of several options, like destinations and dates on a travel site, or by entering a word to look up its meaning or translation. It is much more difficult for a machine to do so automatically and generically. Because the Deep Web contains a lot of factual information, it can be seen metaphorically as an ocean with a lot of fish. That is why we call the system that navigates the Deep Web a trawler.
There are two major problems with navigating the Deep Web automatically. First, the trawler needs to understand what questions to ask through the aforementioned forms, and ask them exhaustively. Second, the trawler cannot easily navigate from one page to another, since pages do not have set URLs or might not even exist. That’s why the trawler needs to remember where it came from and return to the surface (like a whale) before “diving” again to ask the next question.
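The two problems above can be illustrated with a minimal Python sketch. It is not the actual trawler: the field names, the example URL, and the `fake_submit` helper are hypothetical stand-ins for a real form submission. It only shows the idea of enumerating every combination of form values and re-entering through the same form page (the "surface") for each question, because the response pages have no stable URL to revisit.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List


def enumerate_queries(fields: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield every combination of form-field values (exhaustive questioning)."""
    names = list(fields)
    for values in product(*(fields[name] for name in names)):
        yield dict(zip(names, values))


def trawl(
    door: str,
    fields: Dict[str, List[str]],
    submit: Callable[[str, Dict[str, str]], str],
) -> List[str]:
    """Ask every question, always starting again from the door page."""
    responses = []
    for query in enumerate_queries(fields):
        # The response page has no reusable URL, so each dive starts
        # from the remembered door rather than from the previous response.
        responses.append(submit(door, query))
    return responses


def fake_submit(door: str, query: Dict[str, str]) -> str:
    """Hypothetical stand-in: a real trawler would submit the form over HTTP."""
    return f"{door}?from={query['from']}&to={query['to']}"


pages = trawl(
    "http://example.com/search",  # hypothetical door URL
    {"from": ["SFO", "JFK"], "to": ["LHR", "CDG", "NRT"]},
    fake_submit,
)
print(len(pages))  # 6 (2 origins x 3 destinations)
```

Even this toy form shows why exhaustive questioning gets expensive: the number of queries is the product of the option counts of all fields.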
If the number of sites is relatively small, say a few thousand, each set of forms could be described manually through a templating system. The major limitations of this approach are poor scalability and a lack of resilience to changes in page formats.
There is a third problem, related to the size of the Deep Web. It is so big that one needs to focus on a particular subset (a vertical) to have a chance of trawling it with some level of success, especially if high precision is an important factor. Since the task of determining what questions to ask requires an understanding of semantics and context, the focus on a vertical comes in handy.
Glenbrook’s approach to building a trawler is based on mimicking the behavior of a (human) user. It is a useful approach since the “doors” opening the Deep Web were built with a human in mind and reflect the standards (no matter how loose) that humans use to navigate the Web.
The Trawler consists of five layers:
- Discoverer - locates prospective target home pages in the Surface Web
- Scout - navigates the Surface Web part of a web site and finds the “doors” - DHTML pages that contain forms leading to the Deep Web part of the site
- Locksmith - fills in the forms with various requests and collects responses
- Assessor - analyzes the responses and decides whether to use this door as a candidate for querying the Deep Web part of the site or to move elsewhere
- Harvester - collects all relevant pages from the Surface and Deep Web parts of the web site
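As a rough illustration of how the five layers could hand work to one another, here is a minimal Python sketch. The `Door` class, the `run_trawler` function, and all the toy stand-in functions are assumptions made for this example; they are not Glenbrook's implementation, and the Harvester here is simplified to return only a few made-up result URLs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Door:
    """A page holding a form that may lead into the Deep Web (hypothetical type)."""
    url: str
    fields: Dict[str, List[str]]  # candidate values per form field


def run_trawler(
    discover: Callable[[], Iterable[str]],
    scout: Callable[[str], Iterable[Door]],
    locksmith: Callable[[Door], List[str]],
    assess: Callable[[List[str]], bool],
    harvest: Callable[[Door], List[str]],
) -> List[str]:
    """Wire the five layers together in the order listed above."""
    pages: List[str] = []
    for home in discover():              # Discoverer: candidate home pages
        for door in scout(home):         # Scout: forms found on the site
            sample = locksmith(door)     # Locksmith: probe the form
            if assess(sample):           # Assessor: keep or abandon the door
                pages.extend(harvest(door))  # Harvester: collect pages
    return pages


# Toy stand-ins so the sketch runs without network access.
def discover() -> List[str]:
    return ["http://example.com"]


def scout(home: str) -> List[Door]:
    return [
        Door(home + "/dictionary", {"word": ["trawl", "net"]}),
        Door(home + "/login", {"user": ["x"]}),
    ]


def locksmith(door: Door) -> List[str]:
    # Pretend only the dictionary form answers with useful content.
    if "dictionary" in door.url:
        return [f"definition of {w}" for w in door.fields["word"]]
    return ["error: access denied"]


def assess(responses: List[str]) -> bool:
    return all(not r.startswith("error") for r in responses)


def harvest(door: Door) -> List[str]:
    return [f"{door.url}?word={w}" for w in door.fields.get("word", [])]


result = run_trawler(discover, scout, locksmith, assess, harvest)
print(result)
```

Passing each layer in as a function keeps the stages independent, so any one of them could be replaced (for example, a smarter Assessor) without touching the others.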
After all potentially relevant pages are harvested, the Extractor takes over. The Extractor is a hybrid system that applies Pattern Recognition, Natural Language Processing, and other AI techniques to extract facts, combine them, and populate a database that is used to provide factual answers to search queries.
The Extractor will be the subject of another post.
Tag: Glendor
Cross-posted from Software Only