HTML’s forgiving nature is responsible for a dramatic increase in the expressive power of web pages and in the overall usability of the Internet at large. As the saying goes, though, there’s no such thing as a free lunch. The flexibility of the HTML standard encourages a loose relationship between the tag structure (the DOM) of a web page and its visualization (the rendering) on the screen. This makes it hard for search engines to find relevant matches between user queries and web pages.
Today’s web pages can be described as a hodgepodge of many, quite often unrelated, elements – articles, comments, posts, ads, banners, tables of contents, hyperlinks, etc. Printed newspapers (which still pose a serious challenge for OCR systems) are dwarfed in complexity by web pages. The HTML DOM is of some help, of course, but its hierarchical structure is no match for the lateral links between web page elements and their dynamic rendering.
Let’s consider the following page: http://richlabonte.net/exonews/xtra/whales_dying.htm
This page is among the top 10 Google Search results for the query whales warning of tsunami. The page consists of nine separate articles on different subjects, including one about tsunami warning systems (#3) and one on the preservation of whales (#1), but none on what was actually asked: the rather unusual topic of whales warning of tsunamis. As this example shows, the relevance of a page to a query is not necessarily well defined.
Various means have been employed to deal with web page complexity. The most widely accepted one is to use the distance between keyword occurrences as a measure of relevance – often referred to as a proximity measure. This approach works extremely well for the classic Information Retrieval problem of matching keywords to plain text. On a web page, however, close proximity of two words does not necessarily mean that they are related: they can sit very close together while belonging to completely different contexts. Think of a newspaper layout in which two unrelated stories appear on opposite sides of the line separating two columns. The problem is exacerbated by the dynamic rendering of web pages, where ads and links to outside pages are inserted into the middle of an article.
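To make the proximity idea concrete, here is a minimal Python sketch. The function name and the whitespace tokenization are our own illustration, not any particular engine’s implementation:

def min_token_distance(tokens, term_a, term_b):
    """Smallest number of tokens separating any occurrence of
    term_a from any occurrence of term_b; None if either is absent."""
    best = None
    last_a = last_b = None
    for i, tok in enumerate(tokens):
        if tok == term_a:
            last_a = i
        elif tok == term_b:
            last_b = i
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best

# Two adjacent, unrelated newspaper-style columns flattened into one
# text stream; a proximity-based ranker sees the terms as neighbors.
tokens = ("campaign to save the whales gains support "
          "new tsunami warning buoys deployed").split()
print(min_token_distance(tokens, "whales", "tsunami"))  # -> 4

The low score suggests a strong match, even though “whales” and “tsunami” come from two different stories that merely happen to be rendered side by side.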
All this leads to a reinterpretation of the concept of a relevant page: it is no longer a relevant page but a relevant context that we are after. As soon as a page can be split into contexts, the methods that work on pure text, such as NLP, semantic analysis, and proximity measures, become much more reliable.
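The sketch below is only a naive illustration of this principle, under the assumption that leaf-level block elements of the DOM approximate contexts; it is emphatically not Glendor’s segmentation method. Each context is then scored on its own, here by simply requiring all query terms to co-occur inside one context:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

BLOCK_TAGS = ["article", "section", "p", "li", "td"]

def split_into_contexts(html):
    """Naively treat each leaf-level block element as one context."""
    soup = BeautifulSoup(html, "html.parser")
    contexts = []
    for el in soup.find_all(BLOCK_TAGS):
        if el.find(BLOCK_TAGS):  # skip containers that nest other blocks
            continue
        text = el.get_text(" ", strip=True)
        if text:
            contexts.append(text)
    return contexts

def context_matches(context, terms):
    """A context is a candidate only if ALL query terms occur in it."""
    return terms <= set(context.lower().split())

html = """
<article><p>Campaign to preserve humpback whales gains support.</p></article>
<article><p>New tsunami warning buoys deployed across the Pacific.</p></article>
"""
terms = {"whales", "tsunami"}
print([context_matches(c, terms) for c in split_into_contexts(html)])
# -> [False, False]: scored per context, the page no longer looks
# relevant to the query, even though both words appear on it.

Even this crude split keeps the two unrelated articles from being merged into one apparent match. Real pages, with injected ads, invisible elements, and dynamic rendering, require far more than tag boundaries, which is exactly where the hard part lies.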
Glendor Search is built on this principle. We have developed a patent-pending technology that analyzes arbitrary web pages and divides them into independent contexts. On average, analyzing and indexing a page takes less than one second. This translates into the ability to crawl and index the entire surface web (~20 billion pages) once a quarter using 1,000 standard PCs.
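A back-of-the-envelope check of that claim (our arithmetic, not a figure from Glendor): a quarter is roughly 90 days, or about 7.8 million seconds, and 20 billion pages across 1,000 machines is 20 million pages per machine. Each machine must therefore average roughly 0.4 seconds per page, or pipeline a few pages concurrently at closer to one second each.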