Relating Disparate Datasets

by Joel, 1 February 2010 00:27

I've spent the last month in F# paradise: hacking code all day, exploring the strengths and weaknesses of the language. I've found it a pleasure to code in, particularly in the 'big data' space: parsing and transformations, machine learning, matching and ranking, and so on. One area I'm particularly interested in lately is web scraping and data-interchange formats. How can we as programmers tap data from areas of the web that don't expose it? And once we've tapped it, how can we easily relate it? (By relate it, I mean with the uniqueness we'd expect from keys in a relational database.)

As an experiment in this area, I tried using F# to scrape and parse the data from the new myschool website (an initiative of the Australian Government to provide 'transparent' test scoring across all public and private schools here in Australia) and overlaying the scores on a map. Sort of like a geo-heat-map: an indicator of good or bad geographical areas for schooling.

This posed a few technical challenges: 

  • Scrape the site, grabbing the required data
  • Find the lat/lng geolocation of a given school (there’s a catch: no address information or geolocation is provided on the website)
  • Normalise the schools' scores and overlay them on a map
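The scraping step boils down to pulling rows of school names and scores out of HTML. The original work was done in F#, but the idea translates directly; here's a minimal Python sketch using only the standard library. Note the table markup and the scores in `sample` are invented for illustration, not the real myschool page structure.

```python
from html.parser import HTMLParser

class SchoolRowParser(HTMLParser):
    """Collect the text of <td> cells, grouped by table row (hypothetical markup)."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

# Invented sample data standing in for the scraped page.
sample = """
<table>
  <tr><td>Brisbane State High School</td><td>612</td></tr>
  <tr><td>Sydney Grammar School</td><td>598</td></tr>
</table>
"""

parser = SchoolRowParser()
parser.feed(sample)
schools = {name: int(score) for name, score in parser.rows}
```

A real scrape would fetch each page over HTTP and cope with messier markup, but the shape of the problem is the same: turn semi-structured HTML into a name-to-score mapping you can join against other data.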

The first and third are easy, but the second is a great example of where we need to relate disparate datasets. Typically we're used to having a foreign key of some sort to query over, but here we're stuck trying to find a relationship ourselves. This means we can't rely on uniqueness; instead we have to find confidence, as in "how confident are we that 'a' and 'b' are related?". We have a school name, and a geolocation search service (there are many around) which, given a string, gives back a list of ranked results. To bridge this data, I used a simple but effective approach to finding confidence: tf-idf (term frequency inverse document frequency) to weight and rank the search result set against the school name string, grabbing the head of the list and filtering on a high match score. An exact match of "Brisbane State High" against a row in the search results will give me a tf-idf normalised score (and thus confidence) of 1. I set my confidence filter at 0.95 to catch close, but not exact, results like "Brisbane State High School". This means my source dataset can differ slightly from my geolocation search results, yet I can still relate them with a high degree of confidence.
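The matching described above can be sketched in a few lines. Again the original was F#; this is a Python illustration. It builds tf-idf vectors over the candidate result names plus the query (with a smoothed idf so terms shared by every document still carry weight), ranks candidates by cosine similarity, and applies the confidence threshold. The exact weighting scheme and the 0.95 cut-off are assumptions in the spirit of the post, not the author's actual code.

```python
import math
from collections import Counter

def tokenize(s):
    """Lowercase whitespace tokenization; a real matcher might also strip punctuation."""
    return s.lower().split()

def tfidf(doc, df, n_docs):
    """tf-idf weights for one document, with smoothed idf so an exact match scores 1."""
    tf = Counter(tokenize(doc))
    return {t: f * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, f in tf.items()}

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(query, candidates, threshold=0.95):
    """Rank candidate strings against the query; return (candidate, score) or None."""
    docs = candidates + [query]
    df = Counter()
    for d in docs:
        df.update(set(tokenize(d)))  # document frequency across the whole set
    n = len(docs)
    qv = tfidf(query, df, n)
    scored = sorted(((cosine(qv, tfidf(c, df, n)), c) for c in candidates), reverse=True)
    score, best = scored[0]
    return (best, score) if score >= threshold else None
```

An identical string scores 1.0, a near-miss like an appended "School" scores a little lower, and an unrelated result falls well under the threshold, which is exactly the behaviour the confidence filter relies on.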

There are plenty of these kinds of ranking and matching techniques, ranging from the simple (word counts, tf-idf, string distance, etc.) through to the complex (machine learning, Bayesian filtering, hill-climbing, etc.). And I'm a firm believer that programmer confidence with these tools will let us move forward on the "programmable web" without relying on the sites themselves to provide a structured data format for us to consume.
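To make the "simple" end of that spectrum concrete, string distance is about as basic as it gets. Here is a sketch of Levenshtein edit distance, normalised to a 0..1 similarity so it can feed the same kind of confidence filter as the tf-idf approach; the normalisation choice is mine, not anything from the post.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping only one previous row."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # minimum of deletion, insertion, and substitution (free if chars match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Normalise edit distance by the longer string's length to get a 0..1 score."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest
```

Edit distance handles typos well but, unlike tf-idf, it penalises a long-but-harmless suffix like "School" fairly heavily, which is one reason token-based weighting suits this particular matching job better.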

Anyway, the actual result of this experiment was somewhat interesting. Taking the lat/lng geodata from the geolocation service and plotting a coloured overlay of average school performance using the fantastic Google Maps API (really, I'm no web guy, but this was dead easy), I was able to cook up a heat-map in under 20 minutes. I was going to provide the map and overlays as an interactive website, but after looking deeper at the data, I'm pretty sure it suffers from the "flaw of averages" (meaning the data source, and my interpretation of it, is fundamentally flawed by the normalisation process; statisticians should really look at myschool and confirm). Instead, I'll offer up some pretty (but very, very unscientific) screenshots:

Brisbane City (unscientific heat-map visualisation)

Sydney (unscientific heat-map visualisation)

Melbourne (unscientific heat-map visualisation)

Adelaide (unscientific heat-map visualisation)

Colours indicate the performance of the school (using the same colours and ranges found on the myschool website), computed by averaging across all years and disciplines, taking the worst case, and scoring against the average described on the myschool website. Many schools are missing (wherever the matching algorithm lacked confidence in the search results), and the data hasn't been post-processed, so take it with a grain of salt... But at first glance it does seem like regional areas are under-serviced.

This all raises the question, though: why didn't the government just offer up the raw data and let the programmers of Australia mash it up? (Or at least give me a feed of the raw data, to save me some time.) It took me about 4 hours to do all this; a real web design outfit or the web community could have done something brilliant with a little more effort.





About the author

Joel Pobar works on languages and runtimes at Facebook