An Experiment in HTML Data Extraction

by joel 4. March 2010 01:44

I’ve been lucky enough to have spent the past two months gainfully unemployed, and inspired enough to explore the wonderful world of Google-esque search and data extraction problems. The optimistic software engineer inside of me estimated two months to grok and build the software and algorithms that PhD students at Cambridge spend years contemplating. Needless to say, after two months I’ve concluded that PhDs are probably worth every cent, and that it’s no coincidence they’re being hired by the fistful at Google and Microsoft.

I chose to spend my two month ‘sabbatical’ on three major themes:

  1. Explore a search and data extraction problem that employs Google-esque software algorithms:
    • Crawling
    • Parsing
    • Machine learning/data extraction
    • Filtering
    • Ranking and Search
  2. Implement these algorithms in F# with the intent of studying:
    • Any speedup in development time
    • F# language features that offer better mappings or abstractions to the algorithms at hand
  3. Get a feel for the complexity of the problem space

The problem domain I chose to experiment with is simple: crawl commerce websites and extract products and prices in an unsupervised manner. Or put another way, screen scrape sites like eBay, Harvey Norman (an Australian retailer) etc. with zero hard-coded knowledge of the HTML structure, and extract out their product and price catalogue with very minimal handholding. Yes, I want to build Google Product search, without relying on merchant sites providing a products XML feed. And once I have the products and prices, I also want to build the search part too.

For this post I’ll start by focusing on the parsing, data extraction and filtering. I’ll hopefully find the time to write up the other parts soon, so subscribe and sit tight for the remainder.

The goal of this post is simple: document my experiments, approaches and conclusions to the HTML data extraction problem space. But, before we kick off, my disclaimer: I’m not a PhD in this field; I’m simply a hacker hacking on stuff that lights me up. For claims of scientific merit, please stand on the shoulders of true giants.

The Problem Space

Data extraction (or web scraping) from HTML pages is typically done through a manual, supervised process. These manual workhorses extract the data elements we care about, but they are tightly coupled to the structure of the HTML through mechanisms like string comparison and regular expressions. Let’s look at an example to make this point clear. The following is a webpage screenshot from a local Australian e-commerce seller:

I’ve simplified the underlying HTML structure of the above example to focus on the rendering of the things that we care about later on: products and prices:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>...</head>
<body id="pageID_category">
  <div id="container">
    <div id="header">
    <div id="nav_contain">
    <div id="pageTools">
    <div id="mainContentBox">
    <div class="floatRight">
    <div class="mainContent">
      <h1>Desktop Computers</h1>
      <p>...</p>
        <!-- First Computer -->
        <div class="fP_img1">
        <div class="fP_imageHolder">
          <img border="0" alt="Acer eMachine EZ1600 All-in-One 18.5"
          Desktop (XC6429)" src="/isroot/dse/images/products/xc6429~med.jpg"/> 

          <h3>
            <a id="bcXC6429" href=”/Product/View/XC6429">
            Acer eMachine EZ1600 All-in-One 18.5" Desktop</a>
          </h3>

          <div class="pricePoint">
            <h4> 
              A$449
              </h4>

          </div>

          </div>
          <!-- Second Computer -->
          <div class="fP_img2">
          ...

The structure of this HTML is typical: a <div> container (<div class="fP_img1"> ... <div class="fP_img2"> and so on) that holds each product and price grouping; an <a href=> HTML tag that carries the product description (<a id="bcXC6429"...>Acer eMachine EZ1600 All-in-One...</a>); and an HTML tag to display the price (<h4>A$449</h4>). This div container format is repeated for the columns (<div class="fP_img1"... "fP_img2" ... "fP_img3"), and then again for more rows as needed.

Programmatic extraction of product and pricing text from this example would be typically achieved by either:

  • Slurping the HTML into a string and indexing into it at various points using string manipulation APIs like String.IndexOf("<h3>") etc.
  • Identifying the common denominator HTML node and using regular expressions to extract those nodes, i.e. product text is marked up using the <h3> tag, and price text is marked up using <h4>. A simple regular expression like "<h3>.*?</h3>" will identify all product <h3> containers, from which you can extract the product text in the <a href> manually (or with a more complex regular expression). A small sketch of this approach follows the list below.
  • Use a library to abstract over HTML and then leverage the abstraction in automation (e.g. BeautifulSoup).
  • Use a tool (a visualisation tool is common) to allow a user to point and click to the data they wish to extract.
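
To make the fragility concrete, here’s a minimal F# sketch of the regular expression approach from the second bullet. It’s illustrative only (the pattern and function name are mine, not code from this experiment): it slurps every <h3>...</h3> block out of the raw HTML and strips the inner tags to leave the visible product text.

open System.Text.RegularExpressions

// Pull the text of every <h3>...</h3> block out of a raw HTML string.
// Brittle by design: the tag names and the pattern are hard-coded to this page's template.
let extractProducts (html : string) =
    Regex.Matches(html, "<h3>(.*?)</h3>", RegexOptions.Singleline)
    |> Seq.cast<Match>
    |> Seq.map (fun m ->
        // Strip any inner markup (e.g. the <a href> wrapper) to leave the visible text.
        Regex.Replace(m.Groups.[1].Value, "<[^>]+>", "").Trim())
    |> Seq.toList

// e.g. extractProducts pageHtml  -->  ["Acer eMachine EZ1600 All-in-One ..."; ...]

Change the template, or hit a page where products aren’t wrapped in <h3>, and this silently returns the wrong thing, which is exactly the problem described below.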

These approaches (and similar variants) have two core problems: 1) the code is tightly coupled to the HTML structure and/or markup, and is therefore fragile when the content is dynamic or the website structure/layout is changed or versioned in the future, and 2) it doesn’t easily account for the subtle structure/markup variants that are found throughout most e-commerce sites.

We need a better way to abstract and reflect on HTML, all while dealing with its malformations and variations. We need a better way to programmatically extract data, one that can deal with versioning and variants. And we need these approaches to scale well: scale in terms of the number of sites we can parse (and their various idiosyncrasies), and scale in terms of computation (scale out). Product and price data extraction from e-commerce sites seems like a meaningful way to explore this problem space.

The Idea

Programmatic string manipulation, regular expressions and so on are commonly used tools because one thing is typical about websites: they’re born from templates. Meaning someone has created a template page, and then most pages on that respective website are derived from the template. And taking this one step further, even the pages themselves usually have repeating sections that are based on templates – anyone familiar with ASP.NET repeaters will testify to that. Using regular expressions makes some sense in these template scenarios: they’re simply mechanical harvesters of template derivatives, based on template deconstruction by a programmer (or put in language-geek speak: they’re early bound, statically linked to the HTML). If this is generally true, why not build a harvester Reflection library instead? This way, you build a tool once, and it can generate data extraction logic for any site, and any page. And while we’re here, we should push one step further and dynamically build the extraction logic every time a crawl/parse occurs, so that site dynamism over time can be ignored. Sort of like Reflection for HTML, where everything is discovered and extracted late-bound, with no static dependencies.

This approach hopefully alleviates the versioning problem, deals with the scale problem (we can point the generator at any domain, and away it goes), and lets us add smarts to overcome the template variant problem (slight variations from page to page within the same domain).

So, what’s needed for a data extraction generator to work? It needs to identify the things it cares about, find out where those things exist in the HTML structure, generate the code or algorithm to extract them out, test that it meets the bar, then rinse and repeat for all pages in all sites.

To explore this generator concept further, we’ll switch now to focusing on our particular domain problem: extracting out products and prices from HTML. We’ll start with identification of the product and price features. If we knew the text for every single product on the planet, we could achieve product extraction from a webpage simply by string comparing all of its text against all of the world’s products. This isn’t computationally possible of course, but the process of identifying “some” of the products in a webpage against a smaller product corpus definitely is.

And the link between product text in HTML and websites being derivatives of templates gets us to the crux of figuring out how to automate extraction: if we can find some products in the HTML, and test to see if they all live in the same structural area of the HTML, we can assume the rest of the unmatched strings in that area contain products too. But, before we get to an example of this (it’ll make sense soon, I promise), we first need to define what “structural area of HTML” really means.

Dealing with HTML programmatically is terrible, so it’s best abstracted to something easier to deal with. We typically abstract it away to a graph-like structure (the DOM [Document Object Model] is a common choice), and then we can address it structurally using an API like XPath. This raises the level of abstraction from string manipulation and regular expressions to a structure that makes sense (XML) and an expressive language that navigates, manipulates and reflects over a directed graph (XPath). XPath is well understood, there are plenty of APIs for it, and it has a feature which we’ll need to rely on heavily: its graph addressing system has a spectrum ranging from very specific to very general. A quick example of addressing to bootstrap our understanding: to find the first <a> node in two nested divs we can call XPath with a query that looks something like “/html[1]/body[1]/div[1]/div[1]/a[1]”. Notice the [1]s? That’s saying “I specifically want the first such node in the graph” and so on. We can also be more general: to extract all <a> nodes that live in the same nested <div> structure, we can say this: “/html/body/div/div/a”. Dropping the [n] indexing says “address all nodes with that name in the graph”. Nodes can even be addressed via their attributes: “//div[@class='foo']/a”, which gets all <a> nodes that live within a <div> container that has a class attribute “foo”. And so on - very specific, to very general.
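
To make the addressing concrete, here’s a small F# sketch of those three flavours of XPath query. I’m assuming the HTML Agility Pack purely for illustration (any DOM library with XPath support will do), and the sample page below is a made-up stand-in for the real markup.

open HtmlAgilityPack

// A tiny stand-in for a crawled page.
let html = """
<html><body>
  <div id="container"><div class="mainContent">
    <a href="/Product/View/XC6429">Acer eMachine EZ1600 All-in-One Desktop</a>
    <a href="/Product/View/XC8288">Inspiron One Desktop S260905AU</a>
  </div></div>
</body></html>"""

let doc = HtmlDocument()
doc.LoadHtml(html)

// Very specific: the first <a> under the first two nested <div>s.
let firstLink = doc.DocumentNode.SelectSingleNode("/html[1]/body[1]/div[1]/div[1]/a[1]")

// More general: every <a> that lives at that same structural depth.
// (Note: SelectNodes returns null when nothing matches.)
let allLinks = doc.DocumentNode.SelectNodes("/html/body/div/div/a")

// Attribute addressing: all <a> nodes inside a <div class="mainContent">, anywhere in the page.
let contentLinks = doc.DocumentNode.SelectNodes("//div[@class='mainContent']/a")

allLinks |> Seq.iter (fun node -> printfn "%s" (node.InnerText.Trim()))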

Okay, now that we know how to address areas of the HTML graph, let’s get back to figuring out how to find products in that graph. Remember that we want to find a string match from a subset of products inside the HTML. Let’s walk through it: we start with a corpus of products:

Acer eMachine EZ1600 All-in-One 18.5” Desktop
Inspiron One Desktop S260905AU
DELL Inspiron 545MT S560603AU
DELL Inspiron 545MT S560603AU with Monitor
...

And a website page that has products that may match some of the products in the corpus:

(The HTML for this example):

<html>
<head>...</head>
<body id="pageID_category">
  <div id="container">
    <div id="header">
    <div id="nav_contain">
    <div id="pageTools">
    <div id="mainContentBox">
    <div class="floatRight">
    <div class="mainContent">
      <div class="fP_img1">
        <div class="fP_imageHolder">
            <a id="bcXC6429"
            title="Acer eMachine EZ1600 All-in-One 18.5" Desktop (XC6429)"
            href="/Product/View/XC6429">
            Acer eMachine EZ1600 All-in-One 18.5" Desktop</a>

By doing a simple string comparison of the HTML text nodes of the page against our product corpus, we find a few nodes with corpus matches:

Acer eMachine EZ1600 All-in-One 18.5” Desktop
Inspiron One Desktop S260905AU
DELL Inspiron 545MT S560603AU

We can then address those nodes using an XPath expression:

/html[1]/body[1]/div[1]/div[6]/div[1]/div[1]/a[1]
/html[1]/body[1]/div[1]/div[6]/div[2]/div[1]/a[1]
/html[1]/body[1]/div[1]/div[6]/div[3]/div[1]/a[1]
/html[1]/body[1]/div[1]/div[6]/div[4]/div[1]/a[1]

And if we look closely, we find them to be structurally pretty similar, bar the indexed addressing of the third <div> node. By removing that indexed addressing, we’re able to obtain all products that we know about with one generalised XPath:

/html[1]/body[1]/div[1]/div[6]/div/div[1]/a[1]

The side effect of generalising this XPath: we get all the other products in the same structural area of the graph too. Products we don’t know about from our corpus. Running that generalised XPath expression on our HTML example will give us this:

Acer eMachine EZ1600 All-in-One 18.5” Desktop
Inspiron One Desktop S260905AU
DELL Inspiron 545MT S560603AU
DELL Inspiron 545MT S560603AU
HP Pavilion P6120a Desktop
Compaq Presario CQ1200AN CPU

A list of all products from that page, including ones not included in the corpus. And that forms the basic idea of automated product text extraction: identify product node candidates, find their common generalised XPath address, apply that to all nodes in the graph, then extract. Of course we’re missing a bunch of detail here like “what’s the likelihood of our product corpus matching something on a website” (the answer is not that great, but there’s a cute solution), and what about price extraction? Oh, and what if I don’t have a product corpus? Yep, we’re going to have to go deep on this, so let’s keep going.
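
As a taste of what that generalisation step might look like in code, here’s a rough F# sketch (the function names and behaviour are mine, not the experiment’s actual implementation): given the fully-indexed XPaths of the matched product nodes, keep the segments they agree on, and drop the [n] index wherever they disagree.

open System.Text.RegularExpressions

// Drop the [n] index from an XPath segment, e.g. "div[3]" -> "div".
let stripIndex (segment : string) = Regex.Replace(segment, @"\[\d+\]", "")

// Generalise a set of fully-indexed XPaths of equal depth.
// e.g. generalise [ "/html[1]/body[1]/div[1]/div[6]/div[1]/div[1]/a[1]"
//                   "/html[1]/body[1]/div[1]/div[6]/div[2]/div[1]/a[1]" ]
//      --> Some "/html[1]/body[1]/div[1]/div[6]/div/div[1]/a[1]"
let generalise (paths : string list) =
    match paths with
    | [] -> None
    | first :: _ ->
        let split (p : string) = p.TrimStart('/').Split('/')
        let segments = paths |> List.map split
        let depth = (split first).Length
        // Paths of different depth can't share the same template; bail out.
        if segments |> List.exists (fun s -> s.Length <> depth) then None
        else
            [ for i in 0 .. depth - 1 do
                let column = segments |> List.map (fun s -> s.[i])
                // Differing segments are assumed to share a tag name (div[1], div[2], ...),
                // so dropping the index is enough to generalise them.
                if column |> List.forall ((=) column.Head)
                then yield column.Head
                else yield stripIndex column.Head ]
            |> String.concat "/"
            |> sprintf "/%s"
            |> Some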

The Details

This section will go a little deeper on the details. We’ll take a look at the following:

  • Matching the corpus against website text in a fuzzy way
  • Applying generalisation principles across the entire website
  • Finding prices for products
  • Filtering outliers

Matching and Generalising

The first obvious problem we need to solve is the text node to product corpus matching. Thousands of variations on how products can be listed (grammar combinations, extra description text etc.) kill any hope of getting away with basic string matching. Thankfully we have two workhorses of language and string comparison processing: 1) tokenisation and stemming (e.g. suffix stripping etc.), and 2) the tireless compare-and-rank algorithm tf-idf (term frequency, inverse document frequency). For a given corpus, we can stem, tokenise and calculate the tf-idf scores for all words, place them in individual vectors, and then compare those against each text node in the graph. Once we’ve done that, we’ll have a ranked list of similarity scores for text nodes compared to the corpus. Now, we’re going to get a lot of noise here – a page may contain all sorts of text nodes that relate in some way to the corpus, but aren’t relevant product matches. A quick look at our “Desktop Computers” example above shows us a case in point:

“Dick Smith is where you'll find the latest and greatest in Desktop Computers, Acer, HP, Compaq...”

If our corpus had a wide range of computers and electronics from Acer, HP and Compaq, this text node will get ranked -- low, sure, but ranked nevertheless. So how do we disambiguate between true products and text nodes like the one above (and other common nodes like categories, drop-down lists etc.)? We apply the text node corpus ranking system to more pages on the website (making sure we have the biggest list of ranked potentials possible), then start performing XPath generalisation on the ranked text nodes. We do this so that we can look for generalised XPaths that produce a high sum of well-ranked text nodes. Let’s take a quick look at an example of this process:

Example Product Corpus:

Computer
Memory
Microsoft
Acer
Compaq
...
Acer eMachine EZ1600
Inspiron Desktop S260905AU
Inspiron 545MT S560603AU
RAM
DDR2 800Mhz DIMM
DDR2 800Mhz SODIMM
DDR3 1333Mhz DIMM
...

Page 1:

The red boxes represent graph nodes that had a corpus match hit (remember, this means they’ve undergone string tokenisation and then a subsequent tf-idf vector comparison against the corpus to give us a score). You’ll notice the numbers (1 through 4): each represents a generalised XPath expression that we’ve found, so the three red-boxed computer products in grouping 4 all belong to the one generalised XPath, and so on.

And now Page 2:

Page 2 ends up being very similar. So after generalisation, grouping, and figuring out the tf-idf scores of each, we end up with a data structure that looks something like this:

Grouping 1: /html[1]/body[1]/div[1]/div[5]/div[1]/div[1]/p[1]/a[1]

Desktop Computers, 0.19390
Computer Memory, 0.23320

Grouping 2: /html[1]/body[1]/div[1]/div[5]/div[1]/div[1]/h1[1]

Desktop Computers, 0.19390
Computer Memory, 0.23320

Grouping 3: /html[1]/body[1]/div[1]/div[5]/div[1]/div[1]/p[2]

Dick Smith is where you'll find the latest ..., 0.09334
Increase your Computer Memory with extra ..., 0.03203

Grouping 4: /html[1]/body[1]/div[1]/div[6]/div/div[1]/a[1]

Acer eMachine EZ1600 All-in-One 18.5" Desktop, 0.672399
Inspiron One Desktop S260905AU, 0.890233
Dell Inspiron 545MT S560603AU, 0.840503
DSE 2GB DDR2 800Mhz DIMM RAM, 0.650430
DSE 2GB DDR2 800Mhz SODIMM RAM, 0.64050
DSE 1GB DDR2 800Mhz DIMM RAM, 0.62283

From the data-structure we can see grouping 4 has the most node candidates, and the highest ranked tf-idf scores. We can rank the groupings by doing a simple sum, or an average, or push forward on even cuter statistical analysis. For now, a sum and sort will do:

Grouping 4: 4.31690
Grouping 1: 0.42710
Grouping 2: 0.42710
Grouping 3: 0.12537

From here, we can be certain that the generalised XPath grouping number 4 is the most likely XPath expression that will yield us a list of products given the corpus and dataset. And the last part of that statement is pretty key – given a corpus and dataset – meaning the richer the corpus and the richer the dataset, the more accuracy we’re going to get.
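
For the curious, here’s roughly how that scoring and grouping might look in F#, under some simplifying assumptions of mine: whitespace tokenisation with no stemming, tf-idf weights computed over the product corpus, plain cosine similarity for the comparison, and candidate nodes that already carry the generalised XPath of their grouping.

open System

// Whitespace-ish tokenisation; a real run would add stemming.
let tokenise (text : string) =
    text.ToLowerInvariant().Split([| ' '; ','; '.'; '"'; '('; ')' |],
                                   StringSplitOptions.RemoveEmptyEntries)
    |> Array.toList

// Inverse document frequency of every token across the product corpus.
let idfWeights (corpus : string list) =
    let docs = corpus |> List.map (tokenise >> Set.ofList)
    let n = float docs.Length
    docs
    |> Set.unionMany
    |> Set.toList
    |> List.map (fun term ->
        let df = docs |> List.filter (Set.contains term) |> List.length |> float
        term, log (n / df))
    |> Map.ofList

// tf-idf vector for a piece of text (terms outside the corpus vocabulary are ignored).
let vectorise (idf : Map<string, float>) (text : string) =
    tokenise text
    |> List.countBy id
    |> List.choose (fun (term, count) ->
        idf |> Map.tryFind term |> Option.map (fun w -> term, float count * w))
    |> Map.ofList

let cosine (a : Map<string, float>) (b : Map<string, float>) =
    let dot = a |> Seq.sumBy (fun kv -> kv.Value * (defaultArg (Map.tryFind kv.Key b) 0.0))
    let norm (v : Map<string, float>) = v |> Seq.sumBy (fun kv -> kv.Value * kv.Value) |> sqrt
    if dot = 0.0 then 0.0 else dot / (norm a * norm b)

// Score a text node by its best similarity against any corpus entry.
let score idf (corpusVectors : Map<string, float> list) (nodeText : string) =
    let v = vectorise idf nodeText
    corpusVectors |> List.map (cosine v) |> List.max

// Each candidate is (generalised XPath grouping, node text); sum the scores per grouping.
let rankGroupings idf corpusVectors (candidates : (string * string) list) =
    candidates
    |> List.map (fun (grouping, text) -> grouping, score idf corpusVectors text)
    |> List.groupBy fst
    |> List.map (fun (grouping, scored) -> grouping, scored |> List.sumBy snd)
    |> List.sortByDescending snd

// Usage sketch:
//   let idf = idfWeights corpus
//   let corpusVectors = corpus |> List.map (vectorise idf)
//   rankGroupings idf corpusVectors candidates   // groupings sorted by summed score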

Now we’re not yet accounting for template derivation across the site. E-commerce site owners tend to like to strike prices and products out randomly, add new prices, and add notations and all sorts of crazy stuff. Inevitably this leads to subtle HTML mark-up variations that we need to account for. These types of mark-up changes either get nicely collected in the XPath generalisation process, or yield new XPath groupings. We’d need to account for that variation, but in the interest of saving time and words, we’ll leave that process for another day.

Finding Prices

Because there’s no corpus for price information, we can’t apply the same corpus matching theory to price data extraction. But we do have a list of products we’re looking to find prices for, which means we have our first algorithmic input: the number of prices we can expect to find on a page.

The basic pseudocode for finding prices is fairly simple: for a given product, find its graph node, and then start walking “around” the graph looking for anything that resembles a number. Take that number and its markup (any periods ‘.’, or dollar signs $, or currency markup “AUD, USD” etc) and add it to the “potential prices” list until some local maximum is found.

The fundamental premises of this algorithm are that prices sit near their products in the graph, and that the text markup surrounding the number offers a signal of price strength.
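
Before we apply it, here’s a sketch of that walk, once again leaning on the HTML Agility Pack node model purely for illustration: a breadth-first search outwards from the product’s node (one hop is a parent or a child), recording the hop count and keeping any text node that contains a digit as a potential price.

open System
open HtmlAgilityPack

// Breadth-first search outwards from a product node, collecting text nodes that
// contain a digit, paired with how many hops away from the product they sit.
let priceCandidates (productNode : HtmlNode) maxDistance =
    let neighbours (n : HtmlNode) =
        [ if n.ParentNode <> null then yield n.ParentNode
          yield! n.ChildNodes |> Seq.cast<HtmlNode> ]
    let rec search frontier visited distance acc =
        if distance > maxDistance || List.isEmpty frontier then List.rev acc
        else
            let next =
                frontier
                |> List.collect neighbours
                |> List.filter (fun n -> not (Set.contains n.XPath visited))
            let visited' = next |> List.fold (fun v (n : HtmlNode) -> Set.add n.XPath v) visited
            let hits =
                next
                |> List.filter (fun n ->
                    n.NodeType = HtmlNodeType.Text
                    && n.InnerText |> Seq.exists (fun c -> Char.IsDigit c))
                |> List.map (fun n -> n.InnerText.Trim(), distance)
            search next visited' (distance + 1) (hits @ acc)
    search [ productNode ] (Set.singleton productNode.XPath) 1 []

// e.g. priceCandidates inspironNode 5 would surface text like "A$899" and "SAVE $100.00"
// paired with their hop counts (the exact counts depend on how a "hop" is defined;
// the prose below counts element tags rather than raw graph hops).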

So let’s get to it and apply our pseudocode. We use two measures to derive a list of prices for a given product node: 1) the probability that a text-node is a price based on its markup, and 2) the graph distance between the price node and the product node. Let’s look at an example:

<div>
<h3><a href="Product/View/XC8288">Inspiron One Desktop S260905AU</a></h3>
<div class="bvg"><a href="Review/XC8288">Review this product</a></div>

<h4>A$899</h4>
<p class="PrcTxtSav">SAVE $100.00</p>
</div>

The “Inspiron One Desktop” product node has a number of nodes around it, two of which will show up in our graph walk when looking for numbers:

<h4>A$899</h4>
<p class="PrcTxtSav">SAVE $100.00</p>

Now we need to determine the probability of each price being the correct price. We can do something simple: parse out the number into a float, determine the float’s string length, subtract that length from the total string length of the node, and we have the “potential price” string distance relative to the total node text. This will downgrade nodes we may find with lots of textual markup (like product numbers, serial numbers, “SAVE”, or “RRP” and so on) relative to other nodes without such markup. We can even add influencing signals like “does the price text have a period ‘.’ or a dollar sign ‘$’? If so, increase the probability”, which further boosts nodes with pricing markup.

Once we’ve dealt with probability, we add another hint: the graph distance from the product node. In this case, the <h4> tag has a distance of 2 and the <p> tag a distance of 3; a simple count seems to be all that’s needed. At the end, we’ll have a ranked list of price nodes attached to each product node, based on a combination of probability and distance.
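
Here’s a tiny sketch of the two signals combined. The weightings and the exact formula are made up for illustration, but the shape is the same: reward number-heavy text with pricing markup, then discount it by graph distance.

open System
open System.Text.RegularExpressions

// Score a candidate price node given its text and its hop distance from the product.
let priceScore (nodeText : string) (distance : int) =
    let text = nodeText.Trim()
    let m = Regex.Match(text, @"\d+(\.\d+)?")
    if not m.Success then 0.0
    else
        // Share of the node's text taken up by the number itself.
        let ratio = float m.Value.Length / float text.Length
        // Nudge up candidates that carry price-like markup.
        let markupBoost =
            (if text.Contains("$") then 0.2 else 0.0) +
            (if m.Value.Contains(".") then 0.1 else 0.0)
        // Closer nodes are more likely to be the product's own price.
        (ratio + markupBoost) / float distance

// priceScore "A$899" 2          --> 0.40
// priceScore "SAVE $100.00" 3   --> 0.27 (roughly), so the <h4> wins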

Once we’ve done all that, we can do all sorts of analysis with those attached price lists, like making sure we have enough prices for products (an indicator that we may have the wrong product XPath query), or generalising the price nodes to see if they’re all situated in the same graph structure (an indicator that we’ve got the correct price). From here we can even deal with another common problem: two prices for each product, an “RRP” price, and a “Sale” price. Simply walk the list of products, look for this pattern of two prices with similar probabilities and distances, and sort by the lowest price based on some maximum price difference bar.

This kind of algorithm, where we’re ranking, sorting and feeding results back to influence subsequent algorithmic runs, is sometimes called “hill climbing”. Hill climbing is a search optimisation technique: we have criteria we wish to satisfy based on local input, and we repeat the algorithm again and again, refreshing inputs, until the criteria are met. In this case we have a pipeline of stuff we’re trying to do: find the right product node XPath generalisation, find the right price nodes, and account for variations in those two. And that pipeline has self-feedback mechanisms: can’t find enough prices for a given list of potential product nodes? Then maybe we have the wrong product XPath generalisation, and so on. This pipeline and its feedback loop is perfect for hill climbing. Simply keep climbing until we find a local maximum of the best products and prices for the pipeline. (Great F# and C# example of hill climbing here)
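
For a feel of the shape of it, a bare-bones, generic hill climber looks something like the following sketch. In our pipeline the candidate state would be something like a product XPath generalisation plus its price assignments, and the score something like the fraction of products that end up with a plausible price (both of those are placeholders of mine, not the experiment’s actual code).

// Keep moving to the best-scoring neighbour until no neighbour improves on the
// current candidate, i.e. until we sit on a local maximum.
let rec hillClimb (score : 'a -> float) (neighbours : 'a -> 'a list) (current : 'a) =
    let candidates = neighbours current
    if List.isEmpty candidates then current
    else
        let best = candidates |> List.maxBy score
        if score best > score current then hillClimb score neighbours best
        else current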

And of course, we often get a bit of incorrect noise, so let’s move on to dealing with that by filtering outliers.

Filtering outliers

We’re not going to be perfect when trying to yield products and prices out of a pipeline which relies on sort and rank. So we have to expect some amount of failure, and in my experience with this experiment, it typically happens at the price node level: we find a dodgy price somewhere from some weird markup or template variation. Now, we can approach stripping out these dodgy prices (and thus their associated products) in a few ways. One obvious approach is to find the price nodes whose probabilities and distances differ from the common case, and cull them. But unfortunately this doesn’t account for the template variation problem very well (and in some cases, price node distances can vary wildly across a list of products on the same page). So, a better technique that’s used widely in the machine learning space is clustering: cluster results around some common parameters, then cull anything that looks out of place.

I used the K-means clustering algorithm for this, as it’s simple and parallelisable. We can take each price node’s probability and distance, make that combination the observation, and hit run. Price nodes with probabilities and distances that are “not like the others” are simply clustered away into a “bad cluster” and dropped. I used a K value of 3, and then culled clusters that were small (relative to the other two) and/or had a centroid that seemed far from the others. This worked pretty well; I’ve only seen a couple of cases where I culled and shouldn’t have. Again, this clustering and culling process could be refined, and plugged into a pipeline where another hill-climbing reinforcement algorithm could be added, but for now it works well enough. (F# example found here)
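
For reference, here’s a small K-means sketch over those (probability, distance) observations. The iteration count and the naive seeding below are simplifications of mine; as above, the real runs used K=3 and would benefit from smarter seeding (e.g. k-means++).

// Squared Euclidean distance between two observations.
let distanceSq ((x1, y1) : float * float) ((x2, y2) : float * float) =
    (x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2)

// Centroid of a cluster.
let mean (points : (float * float) list) =
    let n = float points.Length
    (points |> List.sumBy fst) / n, (points |> List.sumBy snd) / n

let kmeans k iterations (points : (float * float) list) =
    // Assign every observation to its nearest centroid.
    let assign centroids =
        points |> List.groupBy (fun p -> centroids |> List.minBy (distanceSq p))
    // Iterate: assign, recompute centroids as cluster means, repeat.
    let rec loop centroids i =
        let clusters = assign centroids
        if i = 0 then clusters
        else loop (clusters |> List.map (snd >> mean)) (i - 1)
    // Naive seeding: just take the first k observations as starting centroids.
    loop (points |> List.truncate k) iterations

// Usage sketch: observations are the (probability, distance) pairs of the price nodes.
//   let clusters = kmeans 3 20 observations
//   let keep = clusters
//              |> List.sortByDescending (snd >> List.length)
//              |> List.truncate 2          // drop the small, outlying cluster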

Wrapping Up

We’ve spent nearly 4000 words taking a 10,000 foot view of automated data extraction from HTML. Hopefully the key ideas of “abstraction is good”, “search and rank hill-climbing” and “filtering” were described well enough to spark further interest in this problem domain. After a few months of day-and-night hacking I feel like I have a lot to say about this area - particularly for those that live on the .NET platform, eat functional programming techniques for breakfast, and are interested in the cloud computing space (I tried scaling out all the software I built using Microsoft’s Azure, another very interesting experiment).

Soon I hope to go even deeper and show the link between these kinds of algorithms and their efficient implementation in the F# language. I’d hardly call myself an aficionado of the machine learning data extraction space, but I strongly believe that with the rise of the cloud and the explosion of data, all programmers are going to need at least some understanding of this stuff in their toolbox.

Stay tuned.


About the author

Joel Pobar works on languages and runtimes at Facebook