Web Pages and their Complexity

Blogged under General by Administrator on Wednesday 30 July 2008 at 12:27 pm

HTML’s forgiving nature is responsible for dramatic increases in the expressive power of web pages and in the overall usability of the Internet at large. As the saying goes, though, there’s no such thing as a free lunch. The flexibility of the HTML standard encourages a loose relationship between the HTML tag structure (DOM) of a web page and its visualization (rendering) on the screen. This makes it harder for search engines to find relevant matches between user queries and web pages.

Today’s web pages can be described as a hodgepodge of many (and quite often unrelated) elements: articles, comments, posts, ads, banners, tables of contents, hyperlinks, etc. Printed newspapers (which still constitute a serious challenge for OCR systems) are dwarfed in complexity by web pages. Of course, the HTML DOM is of some help, but its hierarchical structure is no match for the lateral links between web page elements and their dynamic rendering.

Let’s consider the following page: http://richlabonte.net/exonews/xtra/whales_dying.htm

web page example

This page is one of the top-10 Google Search results for the query whales warning of tsunami. The page consists of 9 separate articles discussing different subjects, including an article about tsunami warning systems (#3) and an article on preservation of whales (#1), but not what was asked – the rather unusual topic of whales warning of tsunamis. As can be seen from this example, the relevancy of a page to a query is not necessarily well defined.

Various means have been employed to deal with web page complexity. The most widely accepted one is to use the distance between keyword occurrences as a measure of relevancy; this is often referred to as a proximity measure. This approach works extremely well in the classic Information Retrieval problem of matching keywords to texts. However, on a web page, close proximity of two words does not necessarily mean that they are related. They can be very close on a page while belonging to completely different contexts. For instance, think of a newspaper layout where two unrelated topics appear on different sides of the line separating two columns. The problem is exacerbated by dynamic rendering of web pages, where ads and links to outside pages are inserted inside an article.

All this leads to a reinterpretation of the concept of a relevant page. It is no longer a relevant page but rather a relevant context that we are after. As soon as we can split a page into contexts, the methods that are applicable to pure texts, such as NLP, semantic analysis, proximity measures, etc., become much more reliable.
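To make the point concrete, here is a minimal sketch of a proximity measure applied first to a whole page and then to each context separately. The `tokenize` and `min_distance` helpers and the sample text are our own illustration, and the split into contexts is supplied by hand; this is not how any particular search engine computes proximity.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def min_distance(text, first, second):
    """Smallest gap, in tokens, between any occurrence of the two keywords.

    Returns None when either keyword is missing from the text.
    """
    tokens = tokenize(text)
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    if not first_positions or not second_positions:
        return None
    return min(abs(i - j) for i in first_positions for j in second_positions)

# Two unrelated blocks that happen to sit next to each other on the page.
# The split into contexts is supplied by hand for this illustration.
contexts = [
    "Whales were spotted near the coast",
    "Advertisement: tsunami insurance on sale",
]
whole_page = " ".join(contexts)

# On the flattened page the two keywords look close together.
page_score = min_distance(whole_page, "whales", "tsunami")

# Within each context, the pair never co-occurs.
context_scores = [min_distance(c, "whales", "tsunami") for c in contexts]

print(page_score)       # 7
print(context_scores)   # [None, None]
```

On the flattened page the keywords are only seven tokens apart, which a proximity measure would read as a match. Scored per context, neither block contains both words.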

Glendor Search is built on this principle. We developed a patent-pending technology to analyze arbitrary web pages and divide them into independent contexts. On average, the analysis and indexing of a page requires less than one second. This translates into the ability to crawl and index the entire surface web (~20 billion pages) once a quarter using 1,000 standard PCs.
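As a rough sanity check on these figures, the arithmetic can be sketched as below. The 91-day quarter, and the assumption that every machine runs around the clock handling one page at a time, are ours for illustration and are not stated above.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def seconds_per_page_budget(pages, machines, days):
    """Average time each machine may spend per page to finish within `days`."""
    return machines * days * SECONDS_PER_DAY / pages

# Figures from the post, plus an assumed 91-day quarter.
budget = seconds_per_page_budget(pages=20_000_000_000, machines=1_000, days=91)
print(f"{budget:.2f} s per page per machine")  # 0.39 s per page per machine
```

Under these assumptions, the quarterly schedule works out if the average stays near 0.4 seconds per page, or if each PC processes several pages in parallel.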

Welcome to Glendor Search

Blogged under General by Administrator on Wednesday 11 June 2008 at 4:29 pm

We are throwing our hat into the ring and launching our own web search service. What makes us unique? We cut to the chase and let the users instantly see query-relevant information and quickly decide which of the top ten search results will be most interesting.

Curious? Register for a Private Beta at www.glendor.com. We are excited to know what you think.

3, 2, 1 LAUNCH

Recent Coverage

Blogged under General by Jeff Clavier on Wednesday 7 September 2005 at 4:48 am

Glenbrook Networks got some interesting coverage recently:

I look forward to when Glenbrook or Google will help us find information from these previously unavailable sources. It will mean billions more pages of relevant information available to the world.

 

Trawling the Deep Web

Blogged under General by Jeff Clavier on Sunday 21 August 2005 at 2:04 pm

The majority of web pages one can access through search engines were collected by crawling the so-called Static or Surface Web. It is the smaller portion of the Internet, reportedly containing between 8 and 20 billion pages (Google vs. Yahoo index sizes). Though this number is already very large, the total number of pages available on the Web is estimated at 500 billion. The much larger remainder is often referred to as the Deep Web, Dynamic Web, or Invisible Web. All these names reflect features of this gigantic source of information: stored deep down in databases, rendered through DHTML, not accessible to standard crawlers. Pages in the Deep Web often do not have a standard URL and cannot be addressed in a standard fashion. In many cases, they do not even exist until a user asks a question by filling in fields in a form and a response (page) is generated. Typical examples of Deep Web applications are airline reservation systems, online dictionaries, etc.

It is supposedly quite easy for a human to navigate through the Deep Web. One just needs to fill in a form by choosing one of several options, like destinations and dates on a travel site, or by entering a word to look up a meaning or a translation. It is much more difficult for a machine to do so automatically and generically. Because the Deep Web contains a lot of factual information, it can be seen metaphorically as an ocean with a lot of fish. That is why we call the system that navigates the Deep Web a trawler.

There are two major problems with navigating the Deep Web automatically. First, the trawler needs to understand what questions to ask through the aforementioned forms, and ask them exhaustively. Second, the trawler cannot easily navigate from one page to another, since pages do not have set URLs or might not even exist. That’s why the trawler needs to remember where it came from and return to the surface (like a whale) before “diving” again to ask the next question.
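The "return to the surface" behavior can be sketched with a toy, in-memory site. Everything here (the `FakeSite` class, its methods, and the sample records) is hypothetical and only illustrates the idea that a result page is reachable solely through a freshly loaded form.

```python
class FakeSite:
    """In-memory stand-in for a site whose result pages have no stable URL."""

    def __init__(self, records):
        self._records = records
        self._form_loaded = False  # True only right after the form page is loaded

    def load_form(self):
        """Visit the surface page that holds the form."""
        self._form_loaded = True
        return {"fields": ["city"]}

    def submit(self, city):
        """Ask one question; the answer page exists only for this request."""
        if not self._form_loaded:
            raise RuntimeError("result pages cannot be reached directly")
        self._form_loaded = False  # the next question needs a fresh form
        return [r for r in self._records if r["city"] == city]

def trawl(site, cities):
    """Ask every question, resurfacing to the form page before each dive."""
    harvested = []
    for city in cities:
        site.load_form()                      # return to the surface
        harvested.extend(site.submit(city))   # dive with the next question
    return harvested

site = FakeSite([
    {"city": "Palo Alto", "title": "Engineer"},
    {"city": "San Jose", "title": "Analyst"},
    {"city": "Palo Alto", "title": "Designer"},
])
jobs = trawl(site, ["Palo Alto", "San Jose"])
```

Calling `submit` twice in a row without reloading the form raises an error, which mirrors a crawler trying to step from one generated page straight to the next.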

If the number of sites is relatively small, say a few thousand, each set of forms could be described manually through a templating system. The major limitations of this approach are scalability and a lack of resilience to changes in page formats.
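A hand-written template of this kind might look like the following minimal sketch. The `FormTemplate` class, the field names, and the example.com URL are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class FormTemplate:
    """Manual description of one site's search form (hypothetical format)."""
    form_url: str
    fields: dict  # field name -> list of values to try

    def requests(self):
        """Yield every combination of field values as one form submission."""
        names = list(self.fields)
        for combo in product(*(self.fields[name] for name in names)):
            yield dict(zip(names, combo))

template = FormTemplate(
    form_url="https://example.com/jobs/search",
    fields={
        "category": ["software", "admin"],
        "location": ["Palo Alto", "San Francisco", "San Jose"],
    },
)
all_requests = list(template.requests())
print(len(all_requests))  # 6
```

Each site needs its own template, and a renamed field or a redesigned form silently invalidates it, which is where the scalability and resilience limits come from.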

There is a third problem that is related to the size of the Deep Web. It is so big that one needs to focus on a particular subset (vertical) to have a chance to trawl it with some level of success, especially if high precision is an important factor. Since the task of determining what questions to ask includes understanding of semantics and context, the focus on a vertical comes in handy.

Glenbrook’s approach to building a trawler is based on mimicking the behavior of a (human) user. It is a useful approach since the “doors” opening the Deep Web were built with a human in mind and reflect the standards (no matter how loose) that humans use to navigate the Web.

The Trawler consists of five layers:

  1. Discoverer - locates prospective target home pages in the Surface Web
  2. Scout - navigates Surface Web part of a web site and finds the “doors” - DHTML pages that contain forms leading to the Deep Web part of a web site
  3. Locksmith - fills in the forms with various requests and collects responses
  4. Assessor - analyses responses and decides whether to use this door as a candidate to query the Deep Web part of the site or to move elsewhere
  5. Harvester - collects all relevant pages from Surface and Deep Web parts of the web site
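The way the five layers hand off to one another can be sketched with toy functions over an in-memory "web". This is only an illustration of the control flow listed above: the data layout, function signatures, and site names are hypothetical, the toy sites have a single form each, and the harvester here collects only Deep Web records.

```python
# Toy stand-in for the web: site name -> surface pages and hidden records.
WEB = {
    "jobs.example": {
        "surface": {"/": "welcome", "/search": "<form>", "/about": "about us"},
        "deep": {"engineer": ["Engineer I", "Engineer II"], "chef": []},
    },
    "static.example": {
        "surface": {"/": "welcome"},
        "deep": {},
    },
}

def discoverer(web):
    """Locate candidate target sites (here: every known site)."""
    return sorted(web)

def scout(web, site):
    """Find the "doors": surface pages that contain a form."""
    return [path for path, html in web[site]["surface"].items() if "<form>" in html]

def locksmith(web, site, door, requests):
    """Submit each request through the door and collect the responses.

    `door` is unused in this toy because each site has a single form.
    """
    return {req: web[site]["deep"].get(req, []) for req in requests}

def assessor(responses):
    """Keep a door only if at least one probe returned something."""
    return any(responses.values())

def harvester(web, site, door, requests):
    """Collect every record reachable through an accepted door."""
    responses = locksmith(web, site, door, requests)
    return [record for req in requests for record in responses[req]]

def trawl(web, probes, requests):
    """Run the layers in order and return harvested records per site."""
    harvested = {}
    for site in discoverer(web):
        for door in scout(web, site):
            if assessor(locksmith(web, site, door, probes)):
                harvested[site] = harvester(web, site, door, requests)
    return harvested

result = trawl(WEB, probes=["engineer"], requests=["engineer", "chef"])
```

In this run only `jobs.example` has a door, the probe returns records, and the harvester collects both engineer listings; `static.example` is skipped because the scout finds no form.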

After all potentially relevant pages are harvested, the Extractor takes over. The Extractor is a hybrid system that applies Pattern Recognition, Natural Language Processing and other AI techniques to extract facts, combine them and populate a database that is used to provide factual answers to search queries.

The Extractor will be the subject of another post.

Cross-posted from Software Only

Glenbrook Networks in the San Jose Mercury News

Blogged under General by Jeff Clavier on Tuesday 16 August 2005 at 8:07 am

SiliconBeat’s Michael Bazeley featured Glenbrook Networks co-founders Julia and Edward Komissarchik, and the Glendor showcase, in a great piece about “Deep Web” search and information extraction. Michael summarized it quite well:

Komissarchik and her father, Edward Komissarchik, say they have figured out how to analyze the forms on Web pages and understand the type of information the sites are looking for. Then, Glenbrook’s Web crawlers use artificial intelligence to walk themselves through sometimes complex Web forms, answering questions, such as the location of their desired job, in the same way a human would.

Julia Komissarchik likens the process to cracking a safe.

“The way to think of it is, you case the joint,” she said. “The scout goes through the form and tries a few options to see what the results will be. Then you have a mastermind or safecracker who gets all this information from the scout and devises a method to open the forms.”

Finally, she said, the “harvesters” spring into action to gather up all the information.

Just to clarify: the “safe” analogy does not imply that the company is breaking passwords and accessing private information. It refers to getting a machine to access, in a generic way, information stored behind interactive forms.

We announced the launch of the Glendor showcase a couple of months ago. It features the first (and still, I guess, only) mashup involving job listings positioned on Google Maps.

A longer post about the concept of “web trawling” implemented by the company is on its way.

Thanks to all of you who emailed us since this morning: we are grateful for reports of issues with different browser/OS combinations; sorry, we are not hiring at this time; and yes, we can build large-scale custom search and data aggregation solutions. And we are delighted that you like this showcase.

Update: Gary Price, who was also quoted by Michael, posted an analysis on Search Engine Watch that I wanted to briefly comment on. First, Glenbrook’s technology does not (and cannot) extract information directly from corporate databases; it goes through the public, manual interface that companies have set up to access that data. The innovation lies in a suite of algorithms that automatically figure out the parameters to be used to extract that data, without requiring any templating of the sites to be targeted.

On server load, queries are paced in a sensible way, based on response times and similar signals, to avoid overloading servers. Data can be refreshed daily, and maybe multiple times a day if the dataset is small enough. But extracting and caching data that change too frequently would not be appropriate.
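The post does not spell out the pacing policy. One simple possibility, sketched here purely as an assumption, is to wait a multiple of the server's last response time between requests, within fixed bounds. The `next_delay` function and its default values are hypothetical.

```python
def next_delay(response_time, min_delay=1.0, factor=2.0, max_delay=60.0):
    """Seconds to wait before the next request to the same server.

    A slow response suggests a loaded server, so the wait grows with it,
    but it is always kept between `min_delay` and `max_delay`.
    """
    return min(max_delay, max(min_delay, factor * response_time))

# Fast, slow, and very slow responses (in seconds).
delays = [next_delay(t) for t in (0.2, 5.0, 100.0)]
print(delays)  # [1.0, 10.0, 60.0]
```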

On usability and searchability of the data, this is actually where the aggregation of structured data delivers its value: being able to search on a position or a location across a wide range of sources (in this case, job listings across companies).

Delighted to show you the technology at your convenience, Gary…

Cross-posted from Software Only.

Investments flowing into job search engines

Blogged under General by Jeff Clavier on Tuesday 9 August 2005 at 8:18 pm

Congratulations to SimplyHired for raising a $3M Series B from a great group of angel investors last Thursday, and to Indeed for following suit on Monday, scoring $5M from Union Square Ventures, the NY Times Company and Allen & Company. Fred Wilson shared interesting insights about the deal on his blog.

The consolidation in the jobs vertical search begins: Jobster acquires Workzoo

Blogged under General by Jeff Clavier on Tuesday 12 July 2005 at 4:45 am

I was reading Charlene Li’s excellent account of the launch of HotJobs crawling capability when I spotted that Jobster is buying WorkZoo. According to Charlene:

I spoke with Jobster CEO Jason Goldberg on Monday, and he described their vision of how WorkZoo will allow users to expand their search beyond their network of jobs on Jobster proper and see “every” job. WorkZoo has its cut out for them – in previous testing, they lagged significantly in their parsing ability compared to Indeed.com and Simply Hired. But this combination of Jobster and WorkZoo makes sense as a combined service – it’s also is similar to the partnership that currently exists between professional social networking service LinkedIn and SimplyHired.

The consolidation has already begun. Interesting.

Yahoo HotJobs is also a jobs search engine

Blogged under General by Jeff Clavier on Tuesday 12 July 2005 at 3:27 am

John Battelle said it best in “A Good Idea, Indeed. You’re Simply Hired”: Yahoo HotJobs is entering the jobs search arena.

“Yahoo seems to be taking a cue from Indeed and Simply Hired. Ouch. (Thanks, Richard)”

Joel Cheesman actually posted on the topic before John, and there is an interesting discussion in the comments of his post.

Let’s see what Monster.com and CareerBuilder’s next moves are in this new segment.

Update: SiliconBeat added their take on the news

Bay Area zip codes

Blogged under General by Jeff Clavier on Wednesday 6 July 2005 at 1:42 am

Francois Gossieaux over at Emergence Marketing very rightly pointed out that our readers, and showcase testers, might not be familiar with our zip codes. Apologies for that.

I should mention that leaving the “Location” field empty uses San Carlos as the reference point for searches (it is sort of in the center of Silicon Valley). And here are a few Bay Area zip codes: 94301 for Palo Alto, 94111 for San Francisco and 95113 for San Jose.

Glendor.com is a mashup

Blogged under General by Jeff Clavier on Tuesday 5 July 2005 at 7:04 pm

Om Malik pointed this morning to a few applications using Google Maps to geolocate “stuff”, stuff being wireless-enabled cafes, wireless hot-spots in cities, and the now-famous Craigslist meets Google Maps, which started the whole movement.

Michael Bazeley then pointed to Redfin, which combines satellite maps and MLS homes data for the Seattle area.

The O’Reilly Radar also referred to the Google Maps + Yahoo Traffic mashup that was taken down, and then brought back up.

So Glendor.com is a mashup as well then!

Finally, I found Google Maps Mania in our referrer logs:  An unofficial Google Maps blog tracking the websites, ideas and tools being influenced by Google Maps.

Mapping job listings

Blogged under General by Jeff Clavier on Tuesday 5 July 2005 at 2:57 am

Glendor Showcase

And this was developed before the Google Maps API was released! Which means that we might not have used all the capabilities now available.
Also, make sure to zoom in on the map to display the different companies with less overlap.

A few search examples

Blogged under General by Jeff Clavier on Tuesday 5 July 2005 at 2:37 am

The following searches will give you an idea of what can be accessed on Glendor.com:

  • Development jobs available 25 miles around Palo Alto, CA:  search map rss
  • Software jobs listed on company websites that include the keywords (kernel, networking, file system): search map rss
  • Contract or temporary admin jobs published in the last 7 days, within 10 miles of San Francisco, CA: search map rss

Don’t be surprised if some jobs are outside of the Bay Area: we are restricting the sources to companies having operations, or their headquarters, in the Bay Area, but the jobs themselves might be anywhere in the US, or actually abroad.

Also, the precision of the mapping is at the level of the city since only rarely is the actual address of the company mentioned in the job listing. That’s why multiple jobs may overlap on one city, and clicking on one character does not display all jobs available for that city in the “bubble”.

A word about this blog

Blogged under General by Jeff Clavier on Monday 4 July 2005 at 1:48 am

Besides keeping you up to date on the developments of Glenbrook Networks, and the Glendor showcase, this blog will also talk about vertical search in general, and some of the technology issues that we had to solve when building our vertical search and information extraction platform.
Please tune in to the RSS feed.

Welcome to the Glendor Showcase

Blogged under General by Jeff Clavier on Monday 4 July 2005 at 1:10 am

Glendor.com is the showcase of Glenbrook Networks, the search and information extraction platform provider.

We have chosen jobs as a vertical for this showcase because extracting listings from company web sites exercises all aspects of our technology to produce quality, structured results: surface and dynamic web crawling, layout recognition, natural language processing,…
We have also integrated a few additional features, like the mapping of job listings onto Google Maps, the ability to subscribe to search results via RSS feeds, and the ability to syndicate searches on blogs or other web sites.

The showcase provides job listings extracted from a few hundred Bay Area company web sites, and one large job board. Using it is pretty straightforward, but check out the Help section for typical queries.
