More fun with HTML Data Extraction

by joel 6. April 2010 12:14

I got a lot of good feedback on my previous post on semi-automated data extraction from HTML, including a private query on how one would go about “feature” extraction (“feature” as in a product feature, like processor speed). It’s an incredibly hard problem to solve in an automated way, particularly if the data you’re classifying doesn’t have a consistent set of features. Electronics are a good example: processors, hard drives, and video cards each need different features extracted (“size”, “speed” etc), even though some units (“gigabytes”) are shared between them.

We hope to have something more interesting to say about this particular problem soon – in fact, we’re already off hacking away at it. But in the meantime we’ve started by looking at the easy end of the feature extraction spectrum: hand-held feature extraction. Given a scenario where a set of data has a consistent set of features, you simply roll your own “feature extraction” helper algorithm (basically just an F# function which takes a DOM tree as input and produces a list of possible features along with their respective probabilities) and plug it into the existing crawling and parsing infrastructure. Perfumes offer a simple example to stretch our legs on – they have consistent features like size/amount (50ml, 100ml, 7oz etc) and type (EDP or Eau De Parfum, EDT or Eau De Toilette etc), and there’s enough variance in how they’re presented on individual consumer perfume websites to give us some extra algorithmic flavour.
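To make the idea of a hand-rolled helper concrete, here is a minimal sketch in Python rather than F#. It is not our actual code: the regular expressions, the alias table, the `TextCollector` class and the `extract_features` function are all assumptions made up for illustration, and the "probability" is nothing smarter than how often each candidate shows up on the page.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical patterns; real retailer pages would need a longer list.
SIZE_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(ml|oz)\b", re.IGNORECASE)
TYPE_ALIASES = {
    "edp": "EDP", "eau de parfum": "EDP",
    "edt": "EDT", "eau de toilette": "EDT",
}
TYPE_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, TYPE_ALIASES)) + r")\b", re.IGNORECASE
)


class TextCollector(HTMLParser):
    """Collects the text nodes of a page, standing in for a DOM walk."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_features(html):
    """Return {feature: [(value, probability), ...]}, most likely first."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks)

    # Count every candidate; normalise "100 ml" and "100ML" to "100ml".
    sizes = Counter(f"{num}{unit.lower()}" for num, unit in SIZE_RE.findall(text))
    types = Counter(TYPE_ALIASES[m.lower()] for m in TYPE_RE.findall(text))

    def ranked(counts):
        # Crude assumption: probability = share of mentions on the page.
        total = sum(counts.values())
        return [(value, n / total) for value, n in counts.most_common()]

    return {"size": ranked(sizes), "type": ranked(types)}
```

Feeding it a product page that mentions "100ml" twice and "50ml" once would rank 100ml first with a score of two thirds. A real helper would probably weight candidates by where they sit in the tree (title versus footer, say), which is exactly the sort of thing that varies from site to site.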

While we hack on all this classification goodness, we figured we’d drop a demo to show where we’re at, and what you can do with the “output” of all this crawling/data extraction/feature extraction stuff. Our demo plays in the price comparison space – price comparison of perfumes! We’ve crawled, parsed and extracted data from 25-odd Australian online perfume retailers using the generic techniques mentioned previously. If you head on over to the demo and click on some of the examples, you can start to get a sense of where our heads are at with this: deep vertical search. By no means are we user experience experts (in fact the site looks pretty bad ;)), but it hopefully demonstrates the idea. You might also notice some other cute little search experience features we’re playing with, like automated tag generation, clustering, and filtering. Beware, bugs abound I’m sure.
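As a rough illustration of why the extracted features matter for price comparison, here is a small Python sketch. The function name `cheapest_by_variant`, the tuple layout, and the sample retailers and prices are all made up for the example and are not taken from the demo. The point is only that prices are comparable once offers are grouped by product, size and type.

```python
from collections import defaultdict


def cheapest_by_variant(offers):
    """Group offers by (product, size, type) and keep the cheapest of each.

    Each offer is a hypothetical tuple:
    (retailer, product, size, type, price).
    """
    groups = defaultdict(list)
    for retailer, product, size, kind, price in offers:
        groups[(product, size, kind)].append((price, retailer))
    # min() on (price, retailer) tuples picks the lowest price first.
    return {variant: min(listed) for variant, listed in groups.items()}


# Made-up sample data, for illustration only.
sample_offers = [
    ("shop-a", "Example Scent", "100ml", "EDP", 120.00),
    ("shop-b", "Example Scent", "100ml", "EDP", 99.50),
    ("shop-a", "Example Scent", "50ml", "EDP", 70.00),
]

for variant, (price, retailer) in cheapest_by_variant(sample_offers).items():
    print(variant, price, retailer)
```

Without the size and type features, the 50ml bottle would look like the best deal for the 100ml one, which is why the extraction step has to come before any comparison.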

Perfumes aren’t the only place you could go with this (they’re an easy example as they have a set of easily detectable features to extract). In fact, if you think about electronics price comparison, it’s clearly a nice place to push and flex those algorithmic muscles: go from hand-held feature extraction to fully automated feature extraction.

Maybe we’ll get there. More technical content soon. Keep you posted.



About the author

Joel Pobar works on languages and runtimes at Facebook