Relating Disparate Datasets

by joelpob 1. February 2010 00:27

I've spent the last month in F# paradise -- hacking code all day, exploring the strengths and weaknesses of the language. I've found it a pleasure to code in, particularly in the 'big data' space: parsing and transformations, machine learning, matching and ranking and so on. One space I'm particularly interested in lately is web scraping and data-interchange formats. How can we as programmers tap data from areas of the web that don’t expose it? And once we’ve tapped it, how can we easily relate it? (By relate, I mean with the uniqueness we’d expect from keys in a relational database.)

As an experiment in this area, I tried using F# to scrape and parse the data from the new myschool.edu.au website (an initiative of the Australian Government to provide ‘transparent’ test scoring across all public/private schools here in Australia) and overlay the scores on a map. Sort of like a geo-heat-map: an indicator of good or bad geographical areas for schooling.

This posed a few technical challenges: 

  • Scrape the myschool.edu.au site, grabbing the required data
  • Find the lat/lng geolocation of a given school (there’s a catch: no address information or geolocation is provided on the myschool.edu.au website)
  • Normalise each school’s score, and overlay that on a map

The first and third are easy, but the second is a great example of where we need to relate disparate datasets. Typically, we’re used to having a foreign key of some sort to query over; here we’re stuck trying to find the relationship ourselves. This means we can’t rely on uniqueness, and instead we’ll have to find confidence – as in “how confident are we that ‘a’ and ‘b’ are related?” We have a school name, and a geolocation search service (there are many around) which, given a string, gives you back a list of ranked results. To bridge this data, I used a simple but effective approach to find confidence: tf-idf (term frequency-inverse document frequency) to weight and rank the search result set against the school name string, grabbing the head of the list, and filtering on a high match score. An exact match of “Brisbane State High” against a row in the search results will give me a tf-idf normalised score (and thus confidence) of 1. I set my confidence filter at 0.95, to catch close but not exact results like “Brisbane State High School”. This means that my source dataset (myschool.edu.au) can differ slightly from my geolocation search results, yet I can still relate them with a high degree of confidence.
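To make the matching step concrete, here is a minimal sketch in Python (not the F# code used for this experiment, which isn't shown in the post). It assumes one common tf-idf variant: smoothed idf computed over the candidate result names, compared with cosine similarity. The function names (tokenize, rank, best_match) and the sample school names are made up for illustration, and a real geolocation service would supply the candidate list.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens, idf):
    """Map each term to (term count * idf weight)."""
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, candidates):
    """Score every candidate against the query, best match first."""
    docs = [tokenize(c) for c in candidates]
    n = len(docs)
    # Document frequency: in how many candidates does each term appear?
    df = Counter(term for doc in docs for term in set(doc))

    def idf(term):
        # Smoothed idf (an assumption): never zero, even for common terms.
        return math.log((1 + n) / (1 + df[term])) + 1.0

    query_vec = tfidf_vector(tokenize(query), idf)
    scored = [(c, cosine(query_vec, tfidf_vector(d, idf)))
              for c, d in zip(candidates, docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def best_match(query, candidates, threshold=0.95):
    """Return (candidate, score) for the top result if it clears the threshold."""
    ranked = rank(query, candidates)
    if ranked and ranked[0][1] >= threshold:
        return ranked[0]
    return None
```

Calling best_match("Brisbane State High", results) returns the top-ranked result only when its score clears the threshold, and None otherwise, which is how schools end up dropped from the map. The absolute scores depend on the weighting variant: an exact match scores 1 (up to floating-point rounding) under this one, but a near match may score differently than it did in my run, so the 0.95 cut-off would need re-tuning for whichever variant you pick.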

There are plenty of these kinds of ranking and matching techniques, ranging from the simple (word counts, tf-idf, string distance etc.) through to the complex (machine learning, Bayesian filtering, hill-climbing etc.). And I’m a firm believer that programmer confidence with these tools will allow us to move forward on the “programmable web” without relying on the sites themselves to provide a structured data format for us to consume.

Anyway, the actual result of this experiment was somewhat interesting. Taking the lat/lng geodata from the geolocation service and plotting a coloured overlay of average school performance using the fantastic Google Maps API (really, I'm no web guy, but this was dead easy), I was able to cook up a heat-map in under 20 minutes. I was going to provide the map + overlays as an interactive website, but after looking deeper at the data, I’m pretty sure it suffers from the “flaw of averages” (meaning the data source and my interpretation are fundamentally flawed through a normalisation process -- statisticians should really look at myschool and confirm). Instead, I’ll offer up some pretty (but very, very unscientific) screenshots:

Brisbane City

Brisbane Unscientific heat map visualisation

Sydney

Sydney Unscientific heat map visualisation

Melbourne

Melbourne Unscientific heat map visualisation

Adelaide

Adelaide Unscientific heat map visualisation

Colours indicate the performance of the school, using the same colours and ranges found on the myschool website. Each score is derived by taking the average of all years' disciplines, taking the worst case, and scoring against the average described on the myschool website. Many schools are missing (due to lack of confidence in the search results matching algorithm), and the data hasn't been post-processed, so take it with a grain of salt... But at first glance it does seem like regional areas are under-serviced.

This all begs the question though: why didn’t the government just offer up the raw data and let the programmers of Australia mash it up? (Or at least give me a feed of the raw data, to save me some time.) It took me about 4 hours to do all this; a real web design outfit or the web community could have done something brilliant with a little more effort.

 

Tags:

Blog | dotnet

Comments

2/1/2010 3:39:22 AM #

Leon

Wow. Beautiful work.
And I fully agree that "I’m a firm believer that programmer confidence with these tools will allow us to move forward on the “programmable web” without relying on the sites themselves to provide a structured data format for us to consume."

2/1/2010 3:55:17 AM #

Jayne

Great work. I'd love to see such a map of Adelaide.

2/1/2010 4:13:36 AM #

Matthew Wills

How long till we see this graphic on ACA or TT?

For the benefit of non-Australian residents, ACA and TT are Australia's 'Current Affairs' programs. And by Current Affairs I mean the latest cancer and weight loss 'treatments'.

2/1/2010 4:22:12 AM #

Peter

Would scaling the size of each school's circle based on number of students or student/teacher ratio reveal any interesting comparisons?

2/1/2010 8:01:43 AM #

Midge

Lol at Logan in Brisbane. We somehow made it through that Joel?

2/1/2010 10:35:27 AM #

Fabian

You might want to look at this site, Government 2.0 http://gov2.net.au/about/ and these guys also discuss mashups of public data balneus.wordpress.com/.../ and www.sauer-thompson.com/.../open-australia.php

2/1/2010 1:03:58 PM #

trackback

Social comments and analytics for this post

This post was mentioned on Twitter by joelpob: Playing with myschool.edu.au data and geolocation search: http://bit.ly/buZo3a. #myschool #fsharp

uberVU - social comments

2/5/2010 2:41:06 AM #

chapo

It would be great if you could also do a mashup for Perth.

2/5/2010 11:27:30 PM #

Andrew Harvey

Nice Work!

Any chance you could release the data that you scraped or the code you used to scrape it? (Yes, I know that may be an infringement of copyright...)

I wrote a scraper of my own, which I'll clean up and publish along with the myschool data it scraped as soon as I can.

2/7/2010 8:07:43 AM #

Andrew Harvey

...my scraper (also see my blog post andrewharvey4.wordpress.com/.../) is coming along nicely. In case anyone is interested, the source code is available at http://github.com/andrewharvey/myschool. At one stage I parsed into XML, but currently it parses into a Postgres database.

2/8/2010 1:45:04 AM #

James Conner

You have to increase the opacity of the circles and blur the edges; the topmost circle is having too much influence on the color and skewing the results.

2/8/2010 5:25:04 PM #

Phillip Good

To take this modest proposal a step further, shouldn't the authors of all published articles provide a link to their raw data?

2/9/2010 12:03:38 AM #

arathi

Hello!
This is fascinating! Can you do one for Darwin / Northern Territory too?

Great work... just hope all this prompts positive action and not further stereotyping about 'failing' teachers.

2/9/2010 5:22:26 AM #

Rose N

Wow, very interesting. I would be interested in a comparative overlay of socio-economic data. I can see in Brisbane how it would come out, but I don't know the other cities as well. There was some socio-economic data supplied on the myschool site, so you wouldn't even have to go to the ABS.

Data scraping provides a lot of food for thought. I'll be passing this on for some thought-provoking comment in my Sociology class.

About the author

.NET nerd and F# evangelical. Random updates at Twitter @joelpob, but real content goes here.
