Web Pages and their Complexity

Blogged under General by Administrator on Wednesday 30 July 2008 at 12:27 pm

HTML’s forgiving nature is responsible for dramatic increases in the expressive power of web pages and in the overall usability of the Internet at large. As the saying goes, though, there’s no such thing as a free lunch. The flexibility of the HTML standard encourages a loose relationship between the tag structure (DOM) of a web page and its visualization (rendering) on the screen. This makes it harder for search engines to find relevant matches between user queries and web pages.

Today’s web pages can be described as a hodgepodge of many (and quite often unrelated) elements: articles, comments, posts, ads, banners, tables of contents, hyperlinks, etc. Printed newspapers (which still constitute a serious challenge for OCR systems) are far less complex than web pages. Of course, the HTML DOM is of some help, but its hierarchical structure is no match for the lateral links between web page elements and their dynamic rendering.

Let’s consider the following page: http://richlabonte.net/exonews/xtra/whales_dying.htm

web page example

This page is one of the top-10 Google Search results for the query "whales warning of tsunami". The page consists of 9 separate articles discussing different subjects, including an article about tsunami warning systems (#3) and an article on preservation of whales (#1), but none of them covers what was actually asked: the rather unusual topic of whales warning of tsunamis. As this example shows, the relevancy of a page to a query is not necessarily well defined.

Various means have been employed to deal with web page complexity. The most widely accepted one is to use the distance between keyword occurrences as a measure of relevancy; this is often referred to as a proximity measure. This approach works extremely well for the classic Information Retrieval problem of matching keywords to texts. However, on a web page, close proximity of two words does not necessarily mean that they are related. They can be very close on a page while belonging to completely different contexts. For instance, think of a newspaper layout where two unrelated topics appear on either side of the line separating two columns. The problem is exacerbated by the dynamic rendering of web pages, where ads and links to outside pages are inserted inside an article.
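
To make the proximity measure and its failure mode concrete, here is a minimal sketch in Python. The function names, the simple tokenizer, and the sample text are all assumptions made for illustration; they are not taken from any particular search engine. The sample imitates two unrelated snippets that end up side by side once a page is flattened to text.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def min_distance(tokens, word_a, word_b):
    """Smallest gap, in tokens, between word_a and word_b (None if either is absent)."""
    best = None
    last_a = last_b = None
    for i, tok in enumerate(tokens):
        if tok == word_a:
            last_a = i
        elif tok == word_b:
            last_b = i
        else:
            continue
        # Both words have been seen at least once: measure the current gap.
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best

# Hypothetical flattened page text: the end of one article followed
# directly by the start of an unrelated one.
flattened = ("Volunteers rescued the stranded whales. "
             "Tsunami warning buoys were installed offshore.")

distance = min_distance(tokenize(flattened), "whales", "tsunami")
print(distance)
```

The two keywords are adjacent in the flattened text, so a pure proximity score would rate them as closely related even though they belong to different articles.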

All this leads to a reinterpretation of the concept of a relevant page. It is no longer a relevant page but rather a relevant context that we are after. Once a page has been split into contexts, the methods that are applicable to pure texts, such as NLP, semantic analysis, and proximity measures, become much more reliable.
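
The difference between matching a page and matching a context can be sketched as follows. This is only an illustration under stated assumptions: the contexts are supplied already split (how to split a page is not described here), the helper names are hypothetical, and the two sample texts are invented stand-ins for articles like those on the example page.

```python
import re

def word_set(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matching_contexts(contexts, query):
    """Indices of the contexts that contain every word of the query."""
    wanted = word_set(query)
    return [i for i, ctx in enumerate(contexts) if wanted <= word_set(ctx)]

# Hypothetical contexts, assumed to have been separated beforehand.
contexts = [
    "Groups call for better protection of whales along the coast.",
    "Officials plan a new tsunami warning system for the region.",
]
query = "whales warning of tsunami"

# Treating the whole page as one text: every query word is present.
page_hits = matching_contexts([" ".join(contexts)], query)

# Treating each context separately: no single context has all the words.
context_hits = matching_contexts(contexts, query)

print(page_hits, context_hits)
```

Here the page as a whole matches the query while no individual context does, which mirrors the whales and tsunami example above.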

Glendor Search is built on this principle. We developed a patent-pending technology to analyze arbitrary web pages and divide them into independent contexts. On average, the analysis and indexing of a page requires less than one second. This translates into the ability to crawl and index the entire surface web (~20 billion pages) once a quarter using 1,000 standard PCs.
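
As a rough check of these figures, the per-page time budget can be worked out in a few lines. The 90-day quarter, round-the-clock operation, and one page at a time per PC are assumptions made for this back-of-the-envelope sketch; the post itself only states "less than one second" per page.

```python
PAGES = 20_000_000_000      # approximate size of the surface web, from the post
MACHINES = 1_000            # standard PCs, from the post
DAYS_PER_QUARTER = 90       # assumption: a quarter is about 90 days
SECONDS_PER_DAY = 86_400

def seconds_per_page_budget(pages, machines, days):
    """Time each machine may spend per page to finish the crawl in `days` days."""
    pages_per_machine = pages / machines
    return days * SECONDS_PER_DAY / pages_per_machine

budget = seconds_per_page_budget(PAGES, MACHINES, DAYS_PER_QUARTER)
print(f"{budget:.2f} seconds per page per machine")
```

Under these assumptions each PC has roughly 0.39 seconds per page, so the quarterly target requires either an average well below one second or several pages processed in parallel on each machine.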

