A day with disruptor-net

by joel 15. August 2011 09:22
My idle hands jumped on the chance to play around with disruptor-net, a .NET port of disruptor, from the guys over at LMAX. Having just come off an interesting contract with an investment bank to build a high frequency market making system in .NET, I was curious to see if the disruptor framework had the potential to further push the boundaries of scale and latency. 

We had leveraged CCR (Concurrency and Coordination Runtime) from the Microsoft Robotics guys as the concurrency programming model. If you peek under the hood, it contains a lot of the data structures that the disruptor guys claim have little mechanical sympathy: locks and queues (and these mechanics did show up during a performance milestone analysis). 

I spent a day toying around, building a very simple stock price to option price simulator, where a price comes in off the wire, gets run through an option pricing model (Black-Scholes), and the option price is handed back to the wire. It's embarrassingly parallel if all you're doing is single instrument pricing. For five million price messages, the results are in: 

Number of Processors: 1
- Name: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
- Description: Intel64 Family 6 Model 42 Stepping 7
- ClockSpeed: 3401 Mhz
- Number of cores: 4
- Number of logical processors: 4
- Hyperthreading: OFF

Memory: 16364 MBytes
L1Cache: 256 KBytes
L2Cache: 1024 KBytes
L3Cache: 8192 KBytes

Detailed test results

Scenario   | Implementation | Run | Ops/sec | Duration (ms) | GC (gen 0-1-2)
Market1P1C | Disruptor      | 0   | 787,277 | 6,371         | 691-1-0
Market1P1C | CCR            | 0   | 537,634 | 9,304         | 757-69-6


Now, there are probably hundreds of different ways we can optimize both scenarios, so take the results with a grain of salt. But poking around using the Visual Studio Concurrency perf tool and a trial of Intel's VTune Amplifier XE, I did see a lot of synchronisation overhead with CCR vs. disruptor-net, so the mechanical sympathy features on the box seem to work (and the L2 cache aligned RingBuffer approach should definitely yield us more execution time in theory). I suspect batching the messages in the CCR case would immediately reduce that overhead though.

However (and with the caveat that my entire disruptor experience is a day plus change), while coding this up, I couldn't help but want the CCR programming model and API back. You get elegant composability, join calculus, and error handling, which means you spend less time thinking about concurrency and synchronisation constructs, and more time on the problem domain. The disruptor-net API, while small in surface area, seems less subtle: pack the data tightly, grok the semantics of how things get executed and when, consume, rinse, repeat. And really, that's no great surprise given what it's trying to achieve.

It's a classic case of "paying for abstractions". I just wonder if we can't have both? Regardless, disruptor-net shows promise; I just hope the contributors keep it up.
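For the curious, the pricing step in the simulator above is just the closed-form Black-Scholes call price. Here's a minimal F# sketch of that formula (the parameter names and the normal CDF approximation are my own, not the exact code from the test):

// Abramowitz & Stegun style approximation of the standard normal CDF
let normCdf x =
    let t = 1.0 / (1.0 + 0.2316419 * abs x)
    let d = 0.3989423 * exp (-x * x / 2.0)
    let p = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))))
    if x > 0.0 then 1.0 - p else p

// s = spot, k = strike, r = risk-free rate, v = volatility, t = years to expiry
let blackScholesCall s k r v t =
    let d1 = (log (s / k) + (r + v * v / 2.0) * t) / (v * sqrt t)
    let d2 = d1 - v * sqrt t
    s * normCdf d1 - k * exp (-r * t) * normCdf d2

// Example: blackScholesCall 42.0 40.0 0.05 0.2 0.5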


SSCLI 2.0 Patch for VS 2010

by joel 27. April 2010 05:03

I've been hacking away on some type system stuff for fun, checking out Linear Types and Effects and how all that relates to the concurrency problems of tomorrow. I find the best way to think about that concretely is to take a look at the runtime/csharp sources, and where best they might be modified to support such a system.

That means spinning up my old friend, the Shared Source CLI 2.0. First step: wrangle the build system/source into Visual Studio 2010 land. Bootstrapped by Jeremy Kuhne's VS2008 patch, I've pushed that patch forward to work with a Visual Studio 2010 install (requires C++, and I believe something bigger than Express).

You can download the new patch here. Install by just overlaying the patch files on your unzipped SSCLI 2.0 distribution.

After you crank up a Visual Studio 2010 Command Prompt and run “sscli20\env”, you’ll need to make sure LIBPATH and INCLUDE environment variables are set properly. Details of those steps are found in the vs2010README.txt file in the patch zip. It comes with a “Works on my machine” software certification.

And while you’re at it, you may as well download our Shared Source CLI 2.0 Internals book. Fostered by Microsoft Research, shelved by the GFC. Free to all. Enjoy.


More fun with HTML Data Extraction

by joel 6. April 2010 12:14

I got a lot of good feedback from my previous post on semi-automated data extraction from HTML, including a private query on how one would go about "feature" extraction ("feature", as in a product feature like processor speed, for example). It's an incredibly hard problem to solve in an automated way, particularly if the set of data you're classifying doesn't have a consistent set of features (as would be the case with electronics, where you have processors, hard drives, and video cards, all requiring different features to extract ["size", "speed" etc.], and some similar ["gigabytes"]).

We hope to have something more interesting to say about this particular problem soon – in fact, we're already off hacking away at it. But in the meantime we've started by looking at the easy end of the feature extraction spectrum: hand-held feature extraction. Given a scenario where a set of data has a consistent set of features, simply roll your own "feature extraction" helper algorithm (basically just an F# function which takes a DOM tree as input and produces a list of possible features along with their respective probabilities), and plug it in to the existing crawling and parsing infrastructure. Perfumes offer a simple example to stretch our legs on – they have consistent features like size/amount (50ml, 100ml, 7oz etc.) and type (EDP or Eau De Parfum, EDT or Eau De Toilette etc.), and offer enough variance in how they're presented on individual consumer perfume websites to give us some extra algorithmic flavour.
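To make that concrete, here's a rough F# sketch of what a hand-rolled perfume feature extractor might look like. The type names, regexes and scores are hypothetical, and it works on a node's visible text rather than a full DOM tree just to keep the sketch short:

open System.Text.RegularExpressions

// Hypothetical feature type for the perfume domain
type Feature =
    | Size of string   // e.g. "100ml", "7oz"
    | Kind of string   // e.g. "EDP", "EDT"

// Given the visible text of a DOM node, return candidate features
// along with a rough confidence score.
let extractPerfumeFeatures (text: string) : (Feature * float) list =
    [ (@"\b\d+(\.\d+)?\s?(ml|oz)\b",       (fun (m: Match) -> Size m.Value, 0.9))
      (@"\b(EDP|Eau\s+De\s+Parfum)\b",     (fun _ -> Kind "EDP", 0.8))
      (@"\b(EDT|Eau\s+De\s+Toilette)\b",   (fun _ -> Kind "EDT", 0.8)) ]
    |> List.collect (fun (pattern, toFeature) ->
        Regex.Matches(text, pattern, RegexOptions.IgnoreCase)
        |> Seq.cast<Match>
        |> Seq.map toFeature
        |> Seq.toList)

// extractPerfumeFeatures "Chanel No. 5 Eau De Parfum 100ml"
//   -> [(Size "100ml", 0.9); (Kind "EDP", 0.8)]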

While we hack on all this classification goodness, we figured we'd drop a demo to show where we're at, and what you can do with the "output" of all this crawling/data extraction/feature extraction stuff. Our demo plays in the price comparison space – price comparison of perfumes! We've crawled, parsed and data extracted 25-odd Australian perfume retailers online using the generic techniques mentioned previously. If you head on over to http://www.baseprice.com.au/perfume and click on some of the examples, you can start to get a sense of where our heads are at with this: deep vertical search. By no means are we user experience experts (in fact the site looks pretty bad ;)), but it hopefully demonstrates the idea. You might also notice some other cute little search experience features we're playing with like automated tag generation, clustering, and filtering. Beware, bugs abound I'm sure.

Perfumes aren’t the only place you could go with this (they’re an easy example as they have a set of easily detectable features to extract). In fact, if you think about electronics price comparison, it’s clearly a nice place to push and flex those algorithmic muscles: go from hand-held feature extraction to fully automated feature extraction.

Maybe we’ll get there. More technical content soon. Keep you posted.


An Experiment in HTML Data Extraction

by joel 4. March 2010 01:44

I've been lucky enough to have spent the past two months gainfully unemployed, and inspired enough to explore the wonderful world of Google-esque search and data extraction problems. The optimistic software engineer inside of me estimated two months to grok and build the software and algorithms those PhD students at Cambridge spend years contemplating. Needless to say, after two months I've concluded that PhDs are probably worth every cent, and that it's no coincidence they're being hired by the fistful at Google and Microsoft.

I chose to spend my two month ‘sabbatical’ on three major themes:

  1. Explore a search and data extraction problem that employs Google-esque software algorithms:
    • Crawling
    • Parsing
    • Machine learning/data extraction
    • Filtering
    • Ranking and Search
  2. Implement these algorithms in F# with the intent of studying:
    • Any speedup in development time
    • F# language features that offer better mappings or abstractions to the algorithms at hand
  3. Get a feel for the complexity of the problem space

The problem domain I chose to experiment with is simple: crawl commerce websites and extract products and prices in an unsupervised manner. Or put another way, screen scrape sites like eBay, Harvey Norman (an Australian retailer) etc. with zero hard-coded knowledge of the HTML structure, and extract out their product and price catalogue with very minimal handholding. Yes, I want to build Google Product search, without relying on merchant sites providing a products XML feed. And once I have the products and prices, I also want to build the search part too.

For this post I'll start by focusing on the parsing, data extraction and filtering. I'll hopefully find the time to write up the other parts soon, so subscribe and sit tight for the remainder.

The goal of this post is simple: document my experiments, approaches and conclusions to the HTML data extraction problem space. But, before we kick off, my disclaimer: I’m not a PhD in this field; I’m simply a hacker hacking on stuff that lights me up. For claims of scientific merit, please stand on the shoulders of true giants.

The Problem Space

Data extraction (or web scraping) from HTML pages is typically done through a manual, supervised process. These hand-written workhorses extract the data elements we care about and are tightly coupled to the structure of the HTML through mechanisms like string comparison and regular expressions. Let's look at an example from a local Australian e-commerce seller to make this point clear.

I've simplified the underlying HTML structure of the page to focus on the rendering of the things we care about later on, products and prices:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>...</head>
<body id="pageID_category">
  <div id="container">
    <div id="header">
    <div id="nav_contain">
    <div id="pageTools">
    <div id="mainContentBox">
    <div class="floatRight">
    <div class="mainContent">
      <h1>Desktop Computers</h1>
      <p>...</p>
        <!-- First Computer -->
        <div class="fP_img1">
        <div class="fP_imageHolder">
          <img border="0" alt="Acer eMachine EZ1600 All-in-One 18.5"
          Desktop (XC6429)" src="/isroot/dse/images/products/xc6429~med.jpg"/> 

          <h3>
            <a id="bcXC6429" href="/Product/View/XC6429">
            Acer eMachine EZ1600 All-in-One 18.5" Desktop</a>
          </h3>

          <div class="pricePoint">
            <h4> 
              A$449
              </h4>

          </div>

          </div>
          <!-- Second Computer -->
          <div class="fP_img2">
          ...

The structure of this HTML is typical: a <div> container (<div class="fP_img1"> ... <div class="fP_img2"> and so on) holds each product and price grouping; an <a href> tag carries the product description (<a id="bcXC6429" ...>Acer eMachine EZ1600 All-in-One...</a>); and an HTML tag displays the price (<h4>A$449</h4>). This div container format is repeated for the columns (<div class="fP_img1" ... "fP_img2" ... "fP_img3">), and then again for more rows as needed.

Programmatic extraction of product and pricing text from this example would be typically achieved by either:

  • Slurping the HTML into a string and indexing into the string at various points using string manipulation APIs like String.IndexOf("<h3>") etc.
  • Identifying the common denominator HTML node and using a regular expression to extract those nodes, i.e. product text is marked up using the "<h3>" tag, and price text is marked up using "<h4>". A simple regular expression like "<h3>.*?</h3>" will identify all product <h3> containers, from which you can extract the product text out of the <a href> manually (or write a more complex regular expression). A quick sketch of this approach follows the list.
  • Use a library to abstract over HTML and then leverage the abstraction in automation (e.g. BeautifulSoup).
  • Use a tool (a visualisation tool is common) to allow a user to point and click to the data they wish to extract.
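As a point of reference, the regular expression approach from the list above looks something like this in F# (a throwaway sketch against the simplified HTML, just to show how tightly the code binds to the markup):

open System.Text.RegularExpressions

let html = """... <h3><a href="/Product/View/XC6429">Acer eMachine EZ1600 Desktop</a></h3> ..."""

// Grab everything between <h3> tags, then strip the inner markup
let products =
    Regex.Matches(html, "<h3>(.*?)</h3>", RegexOptions.Singleline)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> Regex.Replace(m.Groups.[1].Value, "<.*?>", "").Trim())
    |> Seq.toList
// -> ["Acer eMachine EZ1600 Desktop"]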

These approaches (and similar variants) have two core problems: 1) the code is tightly coupled to the HTML structure and/or markup, and is thus fragile when the content is dynamic or the website structure/layout changes or is versioned in the future; and 2) they don't easily account for the subtle structure/markup variants found throughout most e-commerce sites.

We need a better way to abstract and reflect on HTML, all while dealing with its malformations and variations. We need a better way to programmatically extract data, one that can deal with versioning and variants. And we need the approaches to scale well: scale in terms of the number of sites we can parse (and their various idiosyncrasies), and scale in terms of computation (scale out). Product and price data extraction from e-commerce sites seems like a meaningful way to explore this problem space.

The Idea

Programmatic string manipulation, regular expressions and so on are commonly used tools because one thing is typical about websites: they're born from templates. Meaning someone has created a template page, and then most pages on that respective website are derived from the template. Taking this one step further, even the pages themselves usually have repeating sections that are based on templates – anyone familiar with ASP.NET repeaters will testify to that. Using regular expressions makes some sense in these template scenarios: they're simply mechanical harvesters of template derivatives, based on template deconstruction by a programmer (or, in language geek speak: they're early bound, statically linked to the HTML). If this is generally true, why not build a harvester Reflection library instead? That way, you build a tool once, and it can generate data extraction logic for any site and any page. And while we're here, we should push one step further and dynamically build the extraction logic every time a crawl/parse occurs, so that site dynamism over time can be ignored. Sort of like Reflection for HTML, where everything is discovered and extracted late-bound with no static dependencies.

This approach hopefully alleviates the versioning problem, deals with the scale problem (we can point the generator at any domain, and away it goes), and lets us add smarts to the generator to overcome the template variant problem (slight variations from page to page within the same domain).

So, what’s needed for a data extraction generator to work? It needs to identify the things it cares about, find out where those things exist in the HTML structure, generate the code or algorithm to extract them out, test that it meets the bar, then rinse and repeat for all pages in all sites.

To explore this generator concept further, we’ll switch now to focusing on our particular domain problem: extracting out products and prices from HTML. We’ll start with identification of the product and price features. If we knew the text for every single product on the planet, we could achieve product extraction from a webpage simply by string comparing all of its text against all of the world’s products. This isn’t computationally possible of course, but the process of identifying “some” of the products in a webpage against a smaller product corpus definitely is.

And the link between product text in HTML and websites being derivatives of templates gets us to the crux of figuring out how to automate extraction: if we can find some products in the HTML, and test to see if they all live in the same structural area of the HTML, we can assume the rest of the unmatched strings in that area contain products too. But, before we get to an example of this (it'll make sense soon, I promise), we first need to define what "structural area of HTML" really means.

Dealing with HTML programmatically is terrible, so it's best abstracted to something easier to deal with. We typically abstract it away to a graph-like structure (the DOM [Document Object Model] is a common choice), and then we can address it structurally using an API like XPath. This raises the level of abstraction from string manipulation and regular expressions to a structure that makes sense (the DOM) plus an expressive language that navigates, manipulates and reflects over a directed graph (XPath). XPath is well understood, has plenty of APIs, and has a unique feature we'll need to rely on heavily: its graph addressing system ranges from very specific to very general. A quick example of addressing to bootstrap our understanding: to find the first <a> node in two nested divs we can call XPath with a query that looks something like "/html[1]/body[1]/div[1]/div[1]/a[1]". Notice the [1]'s? That's saying "I specifically want the first node in the graph" and so on. We can also be more general: to extract all <a> nodes that live in the same nested <div> structure, we can say this: "/html/body/div/div/a". Dropping the [n] indexing says "address all nodes with that name in the graph". Nodes can even be addressed via their attributes: "/div[@class='foo']/a", which gets all <a> nodes that live within a <div> container that has a class attribute of "foo". And so on – very specific, to very general.
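A tiny F# example of the two addressing styles, using System.Xml on a well-formed fragment (real-world HTML is rarely well-formed, so in practice you'd put an HTML-tolerant parser in front of this):

open System.Xml

let doc = XmlDocument()
doc.LoadXml("""<html><body>
  <div>
    <div><a href="/p/1">Product one</a></div>
    <div><a href="/p/2">Product two</a></div>
  </div>
</body></html>""")

// Very specific: exactly the first <a> node
let first = doc.SelectSingleNode("/html[1]/body[1]/div[1]/div[1]/a[1]")
printfn "%s" first.InnerText                       // Product one

// More general: every <a> living in that nested <div> structure
doc.SelectNodes("/html/body/div/div/a")
|> Seq.cast<XmlNode>
|> Seq.iter (fun n -> printfn "%s" n.InnerText)    // Product one, Product two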

Okay, now that we know how to address areas of the HTML graph, let's get back to figuring out how to find products in that graph. Remember that we want to find a string match from a subset of products inside the HTML. Let's walk through it: we start with a corpus of products:

Asus eMachine EZ1600 All-in-one 18.5” Desktop
Inspiron One Desktop S260905AU
DELL Inspiron 545MT S560603AU
DELL Inspiron 545MT S560603AU with Monitor
...

And a website page that has products that may match some of the products in the corpus (the HTML for this example):

<html>
<head>...</head>
<body id="pageID_category">
  <div id="container">
    <div id="header">
    <div id="nav_contain">
    <div id="pageTools">
    <div id="mainContentBox">
    <div class="floatRight">
    <div class="mainContent">
      <div class="fP_img1">
        <div class="fP_imageHolder">
            <a id="bcXC6429"
            title="Acer eMachine EZ1600 All-in-One 18.5" Desktop (XC6429)"
            href="/Product/View/XC6429">
            Acer eMachine EZ1600 All-in-One 18.5" Desktop</a>

By doing a simple string comparison of the HTML text nodes of the page against our product corpus, we find a few nodes with corpus matches:

Asus eMachine EZ1600 All-in-one 18.5” Desktop
Inspiron One Desktop S260905AU
DELL Inspiron 545MT S560603AU

We can then address those nodes using an XPath expression:

/html[1]/body[1]/div[1]/div[6]/div[1]/div[1]/a[1]
/html[1]/body[1]/div[1]/div[6]/div[2]/div[1]/a[1]
/html[1]/body[1]/div[1]/div[6]/div[3]/div[1]/a[1]
/html[1]/body[1]/div[1]/div[6]/div[4]/div[1]/a[1]

And if we look closely, we find them to be structurally pretty similar, bar the indexed addressing of the third <div> node. By removing that indexed addressing, we’re able to obtain all products that we know about with one generalised XPath:

/html[1]/body[1]/div[1]/div[6]/div/div[1]/a[1]

The side effect of generalising this XPath: we get all the other products in the same structural area of the graph too. Products we don’t know about from our corpus. Running that generalised XPath expression on our HTML example will give us this:

Asus eMachine EZ1600 All-in-one 18.5” Desktop
Inspiron One Desktop S260905AU
DELL Inspiron 545MT S560603AU
DELL Inspiron 545MT S560603AU
HP Pavillion P6120a Desktop
Compaq Prasario CQ1200AN CPU

A list of all products from that page, including ones not included in the corpus. And that forms the basic idea of automated product text extraction: identify product node candidates, find their common generalised XPath address, apply that to all nodes in the graph, then extract. Of course we're missing a bunch of detail here, like "what's the likelihood of our product corpus matching something on a website?" (the answer: not that great, but there's a cute solution), and what about price extraction? Oh, and what if I don't have a product corpus? Yep, we're going to have to go deep on this, so let's keep going.
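Before going deeper, here's a minimal F# sketch of that generalisation step, assuming the candidate XPaths have the same depth and element names and differ only in their [n] indexing (the real implementation would need to cope with messier cases):

open System.Text.RegularExpressions

// Strip the [n] index from a single XPath step, e.g. "div[3]" -> "div"
let stripIndex (step: string) = Regex.Replace(step, @"\[\d+\]$", "")

// Keep a step verbatim where every path agrees; drop its index where they differ
let generalise (paths: string list) =
    match paths with
    | [] -> ""
    | first :: _ ->
        let split (p: string) = p.TrimStart('/').Split('/')
        let allSteps = paths |> List.map split
        split first
        |> Array.mapi (fun i step ->
            if allSteps |> List.forall (fun steps -> steps.[i] = step)
            then step
            else stripIndex step)
        |> String.concat "/"
        |> sprintf "/%s"

// generalise [ "/html[1]/body[1]/div[1]/div[6]/div[1]/div[1]/a[1]"
//              "/html[1]/body[1]/div[1]/div[6]/div[2]/div[1]/a[1]"
//              "/html[1]/body[1]/div[1]/div[6]/div[3]/div[1]/a[1]" ]
//   -> "/html[1]/body[1]/div[1]/div[6]/div/div[1]/a[1]"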

The Details

This section will go a little deeper on the details. We’ll take a look at the following:

  • Matching the corpus against website text in a fuzzy way
  • Applying generalisation principles across the entire website
  • Finding prices for products
  • Filtering outliers

Matching and Generalising

The first obvious problem we need to solve is the text node to product corpus matching. Thousands of variations in how products can be listed (grammar combinations, extra description text etc.) kill any hope of getting away with basic string matching. Thankfully we have two workhorses of language and string comparison processing: 1) tokenisation and stemming (e.g. suffix stripping etc.), and 2) that tireless "compare and rank" algorithm, tf-idf (term frequency inverse document frequency). For a given corpus, we can stem, tokenise and calculate the tf-idf scores for all words, place them in individual vectors, and then compare those against each text node in the graph. Once we've done that, we'll have a ranked list of similarity scores for text nodes compared to the corpus. Now, we're going to get a lot of noise here – a page may contain all sorts of text nodes that relate in some way to the corpus, but aren't relevant product matches. A quick look at our "Desktop Computers" example above shows us a case in point:

“Dick Smith is where you'll find the latest and greatest in Desktop Computers, Acer, HP, Compaq...”

If our corpus had a wide range of computers and electronics from Acer, HP and Compaq, this text node will get ranked -- low, sure, but ranked nevertheless. So how do we disambiguate between true products and text nodes like the one above (and other common nodes like categories, drop-down lists etc.)? We apply the text node corpus ranking system to more pages on the website (making sure we have the biggest list of ranked potentials possible), then start performing XPath generalisation on the ranked text nodes. We do this so that we can look for generalised XPaths that produce a high sum of well-ranked text nodes. Let's take a quick look at an example of this process:

Example Product Corpus:

Computer
Memory
Microsoft
Acer
Compaq
...
Acer eMachine EZ1600
Inspiron Desktop S260905AU
Inspiron 545MT S560603AU
RAM
DDR2 800Mhz DIMM
DDR2 800Mhz SODIMM
DDR3 1333Mhz DIMM
...

Page 1:

The red boxes represent graph nodes that had a corpus match hit (remember, this means they’ve undergone a string tokenisation and then a subsequent tf-idf vector comparison against the corpus to give us a score). You’ll notice the numbers (1 through 4): this represents a generalised XPath expression that we’ve found. So the three red-boxed computer products found in grouping 4 represent a generalised XPath and so on.

And now Page 2:

Page 2 ends up being very similar. So after generalisation, grouping, and figuring out the tf-idf scores of each, we end up with a data structure that looks something like this:

Grouping 1: /html[1]/body[1]/div[1]/div[5]/div[1]/div[1]/p[1]/a[1]

Desktop Computers, 0.19390
Computer Memory, 0.23320

Grouping 2: /html[1]/body[1]/div[1]/div[5]/div[1]/div[1]/h1[1]

Desktop Computers, 0.19390
Computer Memory, 0.23320

Grouping 3: /html[1]/body[1]/div[1]/div[5]/div[1]/div[1]/p[2]

Dick Smith is where you'll find the latest ..., 0.09334
Increase your Computer Memory with extra ..., 0.03203

Grouping 4: /html[1]/body[1]/div[1]/div[6]/div/div[1]/a[1]

Acer eMachine EZ1600 All-in-One 18.5" Desktop, 0.672399
Inspiron One Desktop S260905AU, 0.890233
Dell Inspiron 545MT S560603AU, 0.840503
DSE 2GB DDR2 800Mhz DIMM RAM, 0.650430
DSE 2GB DDR2 800Mhz SODIMM RAM, 0.64050
DSE 1GB DDR2 800Mhz DIMM RAM, 0.62283

From the data structure we can see grouping 4 has the most node candidates, and the highest-ranked tf-idf scores. We can rank the groupings by doing a simple sum, or an average, or push forward on even cuter statistical analysis. For now, a sum and sort will do:

Grouping 4: 3.47639
Grouping 1: 0.4271
Grouping 2: 0.4271
Grouping 3: 0.12537

From here, we can be certain that the generalised XPath grouping number 4 is the most likely XPath expression that will yield us a list of products given the corpus and dataset. And the last part of that statement is pretty key – given a corpus and dataset – meaning the richer the corpus and the richer the dataset, the more accuracy we’re going to get.
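The sum-and-sort step itself is tiny. A sketch, assuming we already have (generalised XPath, tf-idf score) pairs for every matched text node:

// Rank candidate groupings by the sum of their members' tf-idf scores
let rankGroupings (scoredNodes: (string * float) list) =
    scoredNodes
    |> List.groupBy fst
    |> List.map (fun (xpath, hits) -> xpath, hits |> List.sumBy snd)
    |> List.sortByDescending snd

// rankGroupings [ ("/html[1]/.../div/div[1]/a[1]", 0.672399)
//                 ("/html[1]/.../div/div[1]/a[1]", 0.890233)
//                 ("/html[1]/.../p[2]", 0.09334) ]
//   -> [ ("/html[1]/.../div/div[1]/a[1]", 1.562632); ("/html[1]/.../p[2]", 0.09334) ]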

Now we're not yet accounting for template derivation across the site. E-commerce site owners tend to strike out prices and products at random, add new prices, and add notations and all sorts of crazy stuff. Inevitably this leads to subtle HTML markup variations that we need to account for. These types of markup changes either get nicely collected in the XPath generalisation process, or yield new XPath groupings. We'd need to account for that variation, but in the interest of saving time and words, we'll leave that process for another day.

Finding Prices

Because there’s no corpus for price information, we can’t apply the same corpus matching theory to price data extraction. But we do have a list of products we’re looking to find prices for, which means we have our first algorithmic input: the number of prices we can expect to find on a page.

The basic pseudocode for finding prices is fairly simple: for a given product, find its graph node, and then start walking "around" the graph looking for anything that resembles a number. Take that number and its markup (any periods '.', dollar signs '$', or currency markup like "AUD" or "USD") and add it to the "potential prices" list until some local maximum is found.

The fundamental premise of this algorithm is that prices sit near products in the graph, and that the text markup surrounding the number offers a signal of price strength.

So let’s get to it and apply our pseudocode. We use two measures to derive a list of prices for a given product node: 1) the probability that a text-node is a price based on its markup, and 2) the graph distance between the price node and the product node. Let’s look at an example:

<div>
<h3><a href="Product/View/XC8288">Inspiron One Desktop S260905AU</a></h3>
<div class="bvg"><a href="Review/XC8288">Review this product</a></div>

<h4>A$899</h4>
<p class="PrcTxtSav">SAVE $100.00</p>
</div>

The “Inspiron One Desktop” product node has a number of nodes around it, two of which will show up in our graph walk when looking for numbers:

<h4>A$899</h4>
<p class="PrcTxtSav">SAVE $100.00</p>

Now we need to determine the probability of each price being the correct price. We can do something simple: parse the number out into a float, determine the float's string length, subtract that length from the total string length of the node, and we have the "potential price" string distance relative to the total node text. This will downgrade nodes with lots of textual markup (like product numbers, serial numbers, "SAVE", or "RRP" and so on) relative to other nodes without such markup. We can even add influencing signals like "does the price text have a period '.' or a dollar sign '$'? If so, increase the probability", which further boosts nodes with pricing markup.

Once we've dealt with probability, we add another hint: the graph distance from the product node. In this case the <h4> tag has a distance of 2 and the <p> tag a distance of 3; a simple count seems to be all that's needed. At the end, we'll have a ranked list of price nodes attached to each product node, based on a combination of probability and distance.
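A rough F# sketch of that scoring, with made-up weights (the real numbers would come out of tuning against the crawled data):

// Score a candidate price node: the less non-numeric clutter around the
// number, and the closer the node sits to the product node, the better.
let priceSignal (nodeText: string) (distanceFromProduct: int) =
    let digits = nodeText |> Seq.filter System.Char.IsDigit |> Seq.length
    if digits = 0 then 0.0
    else
        // fraction of the node's text that is actually the number
        let density = float digits / float nodeText.Length
        // markup hints: currency symbols and decimal points boost confidence
        let hint = if nodeText.Contains("$") || nodeText.Contains(".") then 0.2 else 0.0
        (density + hint) / float (1 + distanceFromProduct)

// priceSignal "A$899" 2          -> ~0.27
// priceSignal "SAVE $100.00" 3   -> ~0.15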

Once we’ve done all that, we can do all sorts of analysis with those attached price lists, like making sure we have enough prices for products (an indicator that we may have the wrong product XPath query), or generalising the price nodes to see if they’re all situated in the same graph structure (an indicator that we’ve got the correct price). From here we can even deal with another common problem: two prices for each product, an “RRP” price, and a “Sale” price. Simply walk the list of products, look for this pattern of two prices with similar probabilities and distances, and sort by the lowest price based on some maximum price difference bar.

This kind of algorithm, where we're ranking, sorting and using results to influence subsequent algorithmic runs, is sometimes called "hill climbing". Hill climbing is a search optimization technique: we have criteria we wish to satisfy based on local input, and we repeat the algorithm again and again, refreshing inputs, until we've satisfied the criteria. In this case we have a pipeline of stuff we're trying to do: find the right product node XPath generalisation, find the right price node, and account for variations in those two. And that pipeline has self-feedback mechanisms: can't find enough prices for a given list of potential product nodes? Then maybe we have the wrong product XPath generalisation, and so on. This pipeline and its feedback loop is perfect for hill climbing: simply keep climbing until we find a local maximum of the best products and prices for the pipeline. (Great F# and C# example of hill climbing here.)
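For reference, the bare hill-climbing loop is only a few lines. A generic sketch (not the pipeline's actual code):

// Propose neighbouring candidates and accept them while they improve the score
let hillClimb (score: 'a -> float) (neighbour: 'a -> 'a) maxSteps start =
    let rec climb current steps =
        if steps = 0 then current
        else
            let candidate = neighbour current
            if score candidate > score current
            then climb candidate (steps - 1)
            else current            // no improvement: we're at a local maximum
    climb start maxSteps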

And of course, we often get a bit of incorrect noise, so let’s move on to dealing with that by filtering outliers.

Filtering outliers

We're not going to be perfect when trying to yield products and prices out of a pipeline which relies on sort and rank. So we have to expect some amount of failure, and in my experience with this experiment, it typically happens at the price node level: we find a dodgy price somewhere, from some weird markup or template variation. Now, we can approach stripping out these dodgy prices (and thus their associated products) in a few ways. One obvious approach is finding the price nodes whose probabilities and distances differ from the common case, and culling them. But unfortunately this doesn't account for the template variation problem very well (and in some cases, price node distances can vary wildly across a list of products on the same page). So a better technique, used widely in the machine learning space, is clustering: cluster results around some common parameters, then cull anything that looks out of place.

I used the K-means clustering algorithm for this, as it's simple and parallelisable. We can take each price node's probability and distance, make that combination the observation, and hit run. Price nodes with probabilities and distances that are "not like the others" are simply clustered away into a "bad cluster" and dropped. I used a K value of 3, and then culled clusters that were small (relative to the other two), and/or had a centroid distance that seemed far from the others. This worked pretty well – I've only seen a couple of cases where I culled and shouldn't have. Again, this clustering and culling process could be refined and plugged into a pipeline where another hill-climbing reinforcement algorithm could be added, but for now it works well enough. (F# example found here.)
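For a flavour of it, here's a small, naive K-means sketch over the (probability, distance) observations, with naive seeding and an arbitrary culling threshold (both are assumptions of mine; the linked example is more careful):

let distSq (x1, y1) (x2, y2) = (x1 - x2) ** 2.0 + (y1 - y2) ** 2.0

let centroid pts =
    let n = float (List.length pts)
    (List.sumBy fst pts / n, List.sumBy snd pts / n)

// Returns a list of (centroid, members) clusters
let kmeans k iterations (points: (float * float) list) =
    let assign centroids =
        points |> List.groupBy (fun p -> centroids |> List.minBy (distSq p))
    let rec loop centroids n =
        if n = 0 then assign centroids
        else loop (assign centroids |> List.map (snd >> centroid)) (n - 1)
    loop (List.truncate k points) iterations    // naive seeding: the first k points

// Keep only observations that land in reasonably sized clusters
let cullOutliers observations =
    let clusters = kmeans 3 20 observations
    let biggest = clusters |> List.map (snd >> List.length) |> List.max
    clusters
    |> List.filter (fun (_, members) -> List.length members * 5 >= biggest)
    |> List.collect snd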

Wrapping Up

We've spent nearly 4000 words taking a 10,000 foot view of automated data extraction from HTML. Hopefully the key ideas of "abstraction is good", "search and rank hill-climbing" and "filtering" were described well enough to spark further interest in this problem domain. After a few months of day and night hacking I feel like I have a lot to say about this area – particularly for those that live on the .NET platform, eat functional programming techniques for breakfast, and are interested in the cloud computing space (I tried scaling out all the software I built using Microsoft's Azure – another very interesting experiment).

Soon I hope to go even deeper and show the link between these kinds of algorithms and their efficient implementation in the F# language. I'd hardly call myself an aficionado of the machine learning and data extraction space, but I strongly believe that with the rise of the cloud and the explosion of data, all programmers are going to need at least some understanding of this stuff in their toolbox.

Stay tuned.


Relating Disparate Datasets

by Joel 1. February 2010 00:27

I've spent the last month in F# paradise -- hacking code all day, exploring the strengths and weaknesses of the language. I've found it a pleasure to code in, particularly in the 'big data' space: parsing and transformations, machine learning, matching and ranking and so on. One space I'm particularly interested in lately is web scraping and data-interchange formats. How can we as programmers tap data from areas of the web that don't expose it? And once we've tapped it, how can we easily relate it (and by relate it, I mean with the uniqueness we'd expect from a relational database through keys)?

As an experiment in this area, I tried using F# to scrape and parse the data from the new myschool.edu.au website (an initiative of the Australian Government to provide ‘transparent’ test scoring across all public/private schools here in Australia) and overlaying the score on a map. Sort of like a geo-heat-map, an indicator of good or bad geographical areas for schooling. 

This posed a few technical challenges: 

  • Scrape the myschool.edu.au site, grabbing the required data
  • Find the lat/lng geolocation of a given school (there’s a catch: no address information or geolocation is provided on the myschool.edu.au website)
  • Normalise the schools score, and overlay that on a map

The first and third are easy, but the second is a great example of where we need to relate disparate datasets. Typically we're used to having a foreign key of some sort to query over, but instead we're stuck with trying to find a relationship ourselves. This means we can't rely on uniqueness; instead we'll have to find confidence – as in "how confident are we that 'a' and 'b' are related". We have a school name, and a geolocation search service (there are many around) which, given a string, gives you back a list of ranked results. To bridge this data, I used a simple but effective approach to find confidence: tf-idf (term frequency inverse document frequency) to weight and rank the search result set against the school name string, grabbing the head of the list, and filtering on a high match score. An exact match of "Brisbane State High" against a row in the search results will give me a tf-idf normalised score (and thus confidence) of 1. I set my confidence filter at 0.95, to catch all close, but not exact, results like "Brisbane State High School". This means that my source dataset (myschool.edu.au) can differ slightly from my geolocation search results, yet I can still relate them with a high degree of confidence. 
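A compact F# sketch of that matching step, assuming the "documents" are just the school name plus the geolocation search results (the tokenisation, weighting and 0.95 cut-off are as described above, but the code itself is illustrative rather than the original):

open System

let tokenise (s: string) =
    s.ToLowerInvariant().Split([| ' '; ','; '.'; '-' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.toList

let idf (docs: string list list) term =
    let containing = docs |> List.filter (List.contains term) |> List.length
    log (float (List.length docs) / float (1 + containing))

let tfidfVector docs doc =
    doc
    |> List.countBy id
    |> List.map (fun (term, count) -> term, float count * idf docs term)
    |> Map.ofList

let cosine (a: Map<string, float>) (b: Map<string, float>) =
    let dot = a |> Seq.sumBy (fun kv -> kv.Value * defaultArg (b.TryFind kv.Key) 0.0)
    let norm (m: Map<string, float>) = sqrt (m |> Seq.sumBy (fun kv -> kv.Value * kv.Value))
    if norm a = 0.0 || norm b = 0.0 then 0.0 else dot / (norm a * norm b)

// Pick the best search result for a school name, but only if we're confident
let relate schoolName searchResults =
    let docs = List.map tokenise (schoolName :: searchResults)
    let query = tfidfVector docs (tokenise schoolName)
    searchResults
    |> List.map (fun result -> result, cosine query (tfidfVector docs (tokenise result)))
    |> List.sortByDescending snd
    |> List.tryHead
    |> Option.filter (fun (_, score) -> score >= 0.95)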

There are plenty of these kinds of ranking and matching techniques, ranging from the simple (word counts, tf-idf, string distance etc), through to complex (machine learning, Bayesian filtering, hill-climbing etc). And I’m a firm believer that programmer confidence with these tools will allow us to move forward on the “programmable web” without relying on the sites themselves to provide a structured data format for us to consume. 

Anyway, the actual result of this experiment was somewhat interesting. Taking the lat/lng geodata from the geolocation service and plotting a coloured overlay of average school performance using the fantastic Google Maps API (really, I'm no web guy, but this was dead easy), I was able to cook up a heat-map in under 20 minutes. I was going to provide the map + overlays as an interactive website, but after looking deeper at the data, I'm pretty sure it suffers from the "flaw of averages" (meaning the data source, and my interpretation of it through a normalisation process, is fundamentally flawed -- statisticians should really look at myschool and confirm). Instead, I'll offer up some pretty (but very very unscientific) screenshots:

Brisbane City: unscientific heat map visualisation

Sydney: unscientific heat map visualisation

Melbourne: unscientific heat map visualisation

Adelaide: unscientific heat map visualisation

Colours indicate the performance of the school (using the same colours and ranges found on the myschool website), derived by taking the average of all years and disciplines, taking the worst case, and scoring against the average described on the myschool website. Many schools are missing (due to lack of confidence in the search result matching algorithm), and the data hasn't been post-processed, so take it with a grain of salt... But at first glance it does seem like regional areas are under-serviced.  

This all begs the question though: why didn't the government just offer up the raw data and let the programmers of Australia mash it up? (Or at least give me a feed of the raw data, to save me some time.) It took me about 4 hours to do all this; a real web design outfit or the web community could have done something brilliant with a little more effort.

 



Why F#? (TechEd '09 DEV450)

by joel 6. September 2009 13:14

I'm presenting at TechEd Australia 2009 on Microsoft's new functional language F#. The session is called "DEV450 Big Algorithms Made Easy with Microsoft's F#". It's jam-packed with content – too much to easily absorb in an hour, so I figured I'd start dumping stuff we may miss, for pre/post-talk review and follow-up – homework if you like. The talk is scheduled for Friday the 9th in Central A, and you can find the session abstract here: http://www.msteched.com/australia/Public/SessionList.aspx. For those not attending, I think it's being filmed, so I'll see about grabbing a link and posting it here after the show.

What I’m Talking About

The title suggests "big algorithms", but I'd have preferred to say "interesting algorithms you can start learning and applying today in your apps" – the former just seems more marketable. I deliver a survey of algorithms that get classified these days as "Collective Intelligence Algorithms" – classic Web 2.0 hyper-super-nomenclature. What this really means is looking at some search and rank, classification and recommendation algorithms in the context of big data – typically crowd-generated data. We'll then marry them to the powerful language and platform feature set of the F# functional language. We're trying to achieve two things: explore some of the algorithms that the big guns like Google and Microsoft are using in a fun, absorbable way; and show off how easy it is to implement them in a functional language like F#. At the end of the hour, I hope to convince you that it's worth spending the time learning more about these algorithms, and more importantly, adopting the functional programming religion.

In my talk, I don't spend too much time on what F# actually represents as a language or ideology, so I'll start by surveying a lot of that background here. We'll start with the ultimate question: why bother with all this? Then we'll take a quick look at the F# elevator pitch, and finally, I've got all the code I use in my presentation available for download. If I have time before the talk this week, I'll try to also dump some pre-reading content on the CI algorithms themselves.

Let’s start with the ultimate question:

Why Should I Learn a Functional Language?

I think the “why” is a judgement call for the individual. Learning a functional language (particularly if you’re engrained in the imperative) requires brain re-wiring. It’s a frustrating path, especially when the payoff isn’t immediate - based on a very small sample of people I know who made the leap, I’d say it’s a 40-60 hour ramp up to get comfortable and productive. And the motivations and benefits are going to be largely intrinsic. There’s no widespread market reward for knowing a functional programming language, and there’s certainly no forcing functions for adoption (at least not right now).

So where’s the beef? I figured I’d stand on the shoulders of giants and give a quick link dump/summary for your leisurely review:

John Hughes, Why Functional Programming Matters (1984). An absolute classic.

“Conventional languages place conceptual limits on the way problems can be modularised. Functional languages push those limits back.”

He talks about functional languages providing efficient modularity, composability, safety through immutability and provides numerous program examples juxtaposed against imperative equivalents. His core theme is modularity – arguing that clean modularity is the key to successful programming, and languages that aim to enable modularity are inherently more productive.

Reg Braithwaite has a great review of Hughes' paper here. I particularly like his extension of Hughes' discussion through his analysis of Separation of Concerns (SoC). He gives an example of Object Orientation's poor SoC in the game design world:

“Let’s ask a question about Monopoly (and Enterprise Software). Where do the rules live? In a noun-oriented design, the rules are smooshed and smeared across the design, because every single object is responsible for knowing everything about everything it can ‘do’. All the verbs are glued to the nouns as methods.”

He goes on to note:

“The great insight is that better programs separate concerns. They are factored more purely, and the factors are naturally along the lines of responsibility (rather than in Jenga piles of abstract virtual base mixin module class proto_ extends private implements). Languages that facilitate better separation of concerns are more powerful in practice than those that don’t.”

His core point is that functional languages offer users the ability to get to the verb right away, without being beholden to objects and interfaces. The comments on Reg's post are also worth the read, particularly the one chap who points out the obvious: there's a problem with the semantic distance between the language of the problem domain and the language used to implement the solution. I call this developer cognitive dissonance – how hard is it to map what you want to "do" to the language and tools you're using to implement it?

Paul Graham extends the theme of developer cognitive dissonance and introduces differentiation as a key motivation for the "why functional" question in his essay Beating the Averages. Referring to the initial stages of a startup he was involved in, he talks about the company's language choice:

“If you can use any language, which do you use? We chose Lisp. For one thing, it was obvious that rapid development would be important in this market. We were all starting from scratch, so a company that could get new features done before its competitors would have a big advantage. We knew Lisp was a really good language for writing software quickly, and server-based applications magnify the effect of rapid development, because you can release software the minute it's done.”

And then goes on to say:

“What were the results of this experiment? Somewhat surprisingly, it worked. We eventually had many competitors, on the order of twenty to thirty of them, but none of their software could compete with ours.”

Graham’s essay shines a light on a software business truism: the barriers to entry are really low, so expect lots of competition. Those with more business acumen than me often point out that first mover advantage makes a competitive difference, so if you can shorten your development cycle, get to market first, all with high quality, who wouldn’t? And with today’s audience almost expecting features like recommendations, awesome search, social networking, classification, high availability, speed, scale and more, perhaps it’s time we mapped that sort of feature set to the best programming language and toolset for the job.

The book "Real World Haskell" (Haskell being the infamous pure functional language used by academics, geeks and language nuts) also promises a great deal for those who choose the functional programming path: Novelty, Power and Enjoyment. The book is a great read, and worth the time if you're looking for a deep dive into one of the more 'purist' functional languages around. 

Still not convinced? Erik Meijer (among others) has made the case for purist functional programming ideology being the saviour of the concurrency problem we're facing today. He makes the case that state and side-effects (which include IO, exceptions, threads, resources, reflection etc.) are the bane of concurrency, and that imperative/OO languages today simply aren't equipped to declare and manage those side-effects in a safe way. I like his unexpected exception example on page 15:

var xs = new[] { 5, 4, 3, 2, 1, 0 };
IEnumerable<int> q = null;

try
{
    q = from x in xs select 1 / x;
}
catch
{
    q = new int[] {};
}

foreach (var z in q)
    // DivideByZero gets thrown here and not caught!
    Console.WriteLine(z);

The link between unexpected exception behaviour and concurrency is vague at best, but the basic concept is this: if you can't describe side-effects using the available language syntax, the compiler can't reason about them, and inevitably can't tell you if it's safe to parallelise or not. He later goes on to talk about how a functional feature called monads is able to describe side-effects right there in the type system, thus allowing the compiler, the programmer, and/or concurrency libraries to take advantage of that when figuring out how to parallelise safely.

Tim Sweeney from Epic Games (who brought me Gears of War, and I love them for it) also shares Erik's view – his POPL (Principles of Programming Languages) talk titled "The Next Mainstream Programming Language: A Game Developer's Perspective" walks through the life of a game programmer in an imperative OO world. The reliability and concurrency/data parallelism slides make strong arguments for language features found in traditional functional programming languages. Slides 28-39 start with an incredible statistic: 50% of Unreal engine bugs can be traced to common imperative/OO language problems (out-of-bounds array access, dereferencing null pointers, integer overflow and uninitialised variable access).

Under the QCon track "Historically Bad Ideas", Tony Hoare, designer of the ALGOL type system, calls the invention of the null reference a "billion dollar mistake". In a similar vein to Erik and Tim, he calls for the eradication of null from the type systems of current languages – something functional programmers generally don't need to worry about.

And if better design traits, easier composability, reduced developer cognitive dissonance, competitive differentiation, fun, power, enjoyment, tamed side-effects, safety and being smarter and better looking aren’t enough, you could always start looking to the functional language feature set to see what you’re missing out on.

Functional Language Feature Set

Wikipedia gives a decent rundown of some of the notable functional programming language features: higher-order functions, recursion, lazy evaluation, pattern matching, type inference, immutability and meta-programming. I suggest reading about and understanding each of those features (or you could come to my talk and learn about them there ;)).

Matt Podwysocki and Erik Meijer have a great discussion about functional programming and its feature set (in fact Matt's blog is just fantastic for all things functional, and more importantly all things F#).

Erik also goes solo with a whiteboard session on some core functional programming features and concepts.

The comp.lang.functional FAQ has a great overview of what functional programming is, and some of the features that make functional languages special.

And if you want to build a multi-billion dollar search company, you'd better ramp up on MapReduce (two simple functions that are available in most functional programming languages). These two simple functions gave Google the programming model it needed to scale up and out and crawl billions of web pages.

“MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.”

While you may not get the distributed computing part for free when adopting a functional programming language, you get all the language enablers to go do it yourself (and my personal belief is that your favourite cloud computing provider will provide you with a similar model anyway).
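To make the shape of the model concrete, here's a toy, single-machine rendering of map/reduce in F# (nothing like Google's distributed implementation, just the two functions plus the canonical word count):

// Run the mapper over every input, group the intermediate pairs by key,
// then hand each key and its values to the reducer
let mapReduce (mapper: 'a -> ('k * 'v) list) (reducer: 'k -> 'v list -> 'r) inputs =
    inputs
    |> List.collect mapper
    |> List.groupBy fst
    |> List.map (fun (key, pairs) -> reducer key (List.map snd pairs))

// Word count, the canonical example
let wordCount (pages: string list) =
    pages
    |> mapReduce
        (fun page -> [ for word in page.Split(' ') -> word, 1 ])
        (fun word counts -> word, List.sum counts)

// wordCount ["the quick brown fox"; "the lazy dog"]
//   -> [("the", 2); ("quick", 1); ("brown", 1); ("fox", 1); ("lazy", 1); ("dog", 1)]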

And with that, we begin looking at the functional feature set with an F#/.NET slant: "The Don" (as I like to call him) talks on video about F# and its feature set. His "comparisons with C#" slides are good – one core point being that in C# there's more noise than signal (as in, there are a lot more lines of code in C# than there are in the same F# example). The "Pleasure and Pain" slides (about 8 mins in) you may instantly relate to if you've ever tried to build a compiler in an OO language.

Amanda Laucher and Ted Neward have a chat with the .NET Rocks team about functional programming, its feature set, and F#.

Chris Smith (the F# team’s guru tester) and Dustin Campbell chat on the Deep Fried Bytes podcast about F# and functional programming.

And StackOverflow has a discussion of functional features and how they relate to the C# language.

The F# Elevator Pitch

The modern software application has evolved into a heterogeneous mesh of platforms, architecture, data, and design, spanning a variety of topologies like client/server, n-tier, peer-to-peer, cloud and more. The modern software consumer has also evolved. They want highly available, reliable apps that light up on new hardware with many cores. They want their apps to scale and accommodate the massive amount of data that they and their friends generate.

And you just want to get through the day building what they want, cheaply and quickly without complexity.

F# has superb fundamentals to help you:

  • A full-featured functional language with all the guts and glory to help solve the problems we’ve talked about above.
  • Works great with large datasets in a beautifully succinct and scalable way. (Check out how the Applied Games group parsed terabytes of data easily with F#)
  • And works great in the small and adds value to your app dev cycle straight away.
  • Helps you program in a naturally concurrent way, and gives you concurrency features out-of-the box to scale across many-cores.

And its core value proposition:

  • It’s a .NET language, which comes with all the cross-language interop goodness we’ve come to expect from the CLR.
  • Because it speaks .NET, you can use all the .NET libraries you know and love (BCL, WPF, WCF, ASP.NET etc).
  • And that also means you get to use ALL the .NET tools you know and love (profilers, code analysis and so on).
  • It speaks OO and mutability too: while immutable and functional by default, it gives you plenty of escape hatches to walk the mutable/OO plank.

But most of all, hacking F# is just plain fun. And if you get nothing out of this 2000+ word "why functional, why F#" pitch, take this away: learning F# and/or the functional paradigm will make you a much better C#/VB.NET programmer. And that to me is worth the 40-60 hours of brain-hurt. So get started now: download the free VS 2008 add-on, or grab the VS 2010 beta bits (it comes installed by default!).

Hopefully see you at the talk!

Download the DEV450 demo code: Search, Netflix Recommendation Engine, SVM Classifier, Fuzzy Matching.


Blog Reboot

by joel 31. May 2009 09:56

My virtual host provider (vpsland.com) was a fail -- it's a long story, which includes sub-plots like "no backups" and my favourite, "your instance isn't responding, and we can't reboot it right now". So off to discountasp.net with a migration to BlogEngine.NET. I love the feature set. And download -> compile -> modify -> deploy to DiscountASP took maybe an hour. It also worked on my Linux box via Mono straight up -- great stuff.

A lot going on lately, including a fantastic showing of F# at JAOO downunder: I built a simple search engine + fuzzy keyword search and a Netflix recommendation engine in under 45 minutes (okay, 70 minutes). You can grab the slides + code here. I'm also planning a "Build an OS in under an hour" talk in the spirit of my PDC and TechEd "Build a compiler" equivalents. I'll likely test the waters on that at the next local user group. 

Anyway -- I'm motivated to keep the content rolling, and not let the infrastructure man beat me down. More to come soon...

 



About the author

Joel Pobar works on languages and runtimes at Facebook