Relating Disparate Datasets

by joelpob 1. February 2010 00:27

I've spent the last month in F# paradise -- hacking code all day, exploring the strengths and weaknesses of the language. I've found it a pleasure to code in, particularly in the 'big data' space: parsing and transformations, machine learning, matching and ranking and so on. One space I'm particularly interested in lately is web scraping and data-interchange formats. How can we as programmers tap data from areas of the web that don’t expose it? And once we’ve tapped it, how can we easily relate it (and by relate it, I mean with the uniqueness we’d expect from a relational database through keys).

As an experiment in this area, I tried using F# to scrape and parse the data from the new myschool.edu.au website (an initiative of the Australian Government to provide ‘transparent’ test scoring across all public/private schools here in Australia) and overlaying the score on a map. Sort of like a geo-heat-map, an indicator of good or bad geographical areas for schooling. 

This posed a few technical challenges: 

  • Scrape the myschool.edu.au site, grabbing the required data
  • Find the lat/lng geolocation of a given school (there’s a catch: no address information or geolocation is provided on the myschool.edu.au website)
  • Normalise the schools score, and overlay that on a map

The first and third are easy, but the second is a great example of where we have a need to relate disparate datasets. Typically, we’re used to having a foreign key of some sort to query over, but instead we’re stuck with trying to find a relationship ourselves. This means we can’t rely on uniqueness, and instead we’ll have to find confidence – as in “how confident are we that ‘a’ and ‘b’ are related”. We have a school name, and a geolocation search service (there are many around) which given a string gives you back a list of ranked results. To bridge this data, I used a simple, but effective approach to find confidence: tf-idf (term frequency inverse document frequency) to weight and rank the search result set against the school name string, grabbing the head of the list, and filtering on a high match score. An exact match of “Brisbane State High” against a row in the search results will give me a tf-idf normalised score (and thus confidence) of 1. I set my confidence filter at 0.95, to catch all close, but not exact results like “Brisbane State High School”. This means that my source dataset (myschools.edu.au) can differ slightly from my geolocation search results, yet I can still relate them with a high degree of confidence. 

There are plenty of these kinds of ranking and matching techniques, ranging from the simple (word counts, tf-idf, string distance etc), through to complex (machine learning, Bayesian filtering, hill-climbing etc). And I’m a firm believer that programmer confidence with these tools will allow us to move forward on the “programmable web” without relying on the sites themselves to provide a structured data format for us to consume. 

Anyway, the actually result of this experiment was somewhat interesting. Taking the lat/lng geodata from the geolocation service and plotting a coloured overlay of average school performance using the fantastic Google Maps API (really, I'm no web guy, but this was dead easy), I was able to cook up a heat-map in under 20 minutes. I was going to provide the map + overlays as an interactive website, but after looking deeper at the data, I’m pretty sure it suffers from the “flaw of averages” (meaning the data source, and my interpretation is fundamentally flawed through a normalisation process -- statistician's should really look at myschool and confirm). Instead, I’ll offer up some pretty (but very very unscientific) screenshots instead:

Brisbane City (click on image for expanded size)

Brisbane Unscientific heat map visualisation

Sydney (click on image for expanded size)

Sydney Unscientific heat map visualisation

Melbourne (click on image for expanded size)

Melbourne Unscientific heat map visualisation

Adelaide (click on image for expanded size)

Adelaide Unscientific heat map visualisation

Colours indicate the performance of the school (using the same colours and ranges found on the myschool website) by taking the average of all years disciplines, taking the worst case, and scoring against the average described on the myschool website). Many schools are missing (due to lack of confidence in the search results matching algorithm), and the data hasn't been post processed, so, take it with a grain of salt... But at first glance it does seem like regional areas are under serviced.  

This all begs the question though, why didn’t the government just offer up the raw data and let the programmers of Australia mash it up? (Or at least give me a feed of the raw data, to save me some time) It took me about 4 hours to do all this, a real web design outfit or the web community could have done something brilliant with a little more effort.

 

Tags:

Blog | dotnet

Why F#? (TechEd '09 DEV450)

by joelpob 6. September 2009 13:14

I’m presenting at TechEd Australia 2009 on Microsoft’s new functional language F#. The session is called “DEV450 Big Algorithms Made Easy with Microsoft’s F#”. It’s jammed packed with content – too much to easily absorb in an hour, so I figured I’d start dumping stuff we may miss for pre/post talk review and follow up – homework if you like. The talk is scheduled for Friday the 9th in Central A, and you can find the session abstract here: http://www.msteched.com/australia/Public/SessionList.aspx. For those not attending, I think it’s being filmed, so I’ll see about grabbing a link and posting here after the show.

What I’m Talking About

The title suggests “big algorithms”, but I’d have preferred to say “interesting algorithms you can start learning and applying today in your apps” – the former seems more marketable. I deliver a survey of algorithms that they’re classifying these days as “Collective Intelligence Algorithms” – classic Web 2.0 hyper-super-nomenclature. What this really means is looking at some search and rank, classification and recommendation algorithms in the context of big data – typically crowd generated data. We’ll then marry them to the powerful language and platform feature set of the F# functional language. We’re trying to achieve two things: explore some of the algorithms that the big guns like Google and Microsoft are using in a fun, absorbable way; and show off how easy it is to implement them in a functional language like F#. At the end of the hour, I hope to convince you that it’s worth spending the time learning more about these algorithms, and more importantly, adopting the functional programming religion.

In my talk, I don’t spend too much time on what F# actually represents as a language or ideology, so I’ll start by surveying a lot of that background here. We’ll start with the ultimate question: why bother with all this? Then we’ll take a quick look at the F# elevator pitch, and finally, I’ve got all the code I use in my presentation available for download. If I have time before the talk this week, I try and also dump some pre-reading content on the CI algorithms themselves.

Let’s start with the ultimate question:

Why Should I Learn a Functional Language?

I think the “why” is a judgement call for the individual. Learning a functional language (particularly if you’re engrained in the imperative) requires brain re-wiring. It’s a frustrating path, especially when the payoff isn’t immediate - based on a very small sample of people I know who made the leap, I’d say it’s a 40-60 hour ramp up to get comfortable and productive. And the motivations and benefits are going to be largely intrinsic. There’s no widespread market reward for knowing a functional programming language, and there’s certainly no forcing functions for adoption (at least not right now).

So where’s the beef? I figured I’d stand on the shoulders of giants and give a quick link dump/summary for your leisurely review:

John Hughes, Why Functional Programming Matters (1984). An absolute classic.

“Conventional languages place conceptual limits on the way problems can be modularised. Functional languages push those limits back.”

He talks about functional languages providing efficient modularity, composability, safety through immutability and provides numerous program examples juxtaposed against imperative equivalents. His core theme is modularity – arguing that clean modularity is the key to successful programming, and languages that aim to enable modularity are inherently more productive.

Reg Braithwaite has great review of Hughes’ paper here. I particularly like his extension to Hughes discussion through his analysis of Seperation of Concerns (SoC). He gives an example of Object Orientations poor SoC in the game design world:

“Let’s ask a question about Monopoly (and Enterprise Software). Where do the rules live? In a noun-oriented design, the rules are smooshed and smeared across the design, because every single object is responsible for knowing everything about everything it can ‘do’. All the verbs are glued to the nouns as methods.”

He goes on to note:

“The great insight is that better programs separate concerns. They are factored more purely, and the factors are naturally along the lines of responsibility (rather than in Jenga piles of abstract virtual base mixin module class proto_ extends private implements). Languages that facilitate better separation of concerns are more powerful in practice than those that don’t.”

His core point is that functional languages offers users the ability to get to the verb right away, without being admonished to objects and interfaces. The comments of Reg’s post are also worth the read. Particularly the one chap that points out the obvious: there’s a problem with the semantic distance between the language of the problem domain, and the language used to implement the solution. I call this developer cognitive dissonance – how hard is it to map what you want to “do” to the language and tools you’re using to implement?

Paul Graham extends the theme of developer cognitive dissonance and introduces differentiation as a key motivation for the “why functional” through his essay Beating the Averages. Referring to the initial stages of a startup he was involved in, he talks about his startups language choice:

“If you can use any language, which do you use? We chose Lisp. For one thing, it was obvious that rapid development would be important in this market. We were all starting from scratch, so a company that could get new features done before its competitors would have a big advantage. We knew Lisp was a really good language for writing software quickly, and server-based applications magnify the effect of rapid development, because you can release software the minute it's done.”

And then goes on to say:

“What were the results of this experiment? Somewhat surprisingly, it worked. We eventually had many competitors, on the order of twenty to thirty of them, but none of their software could compete with ours.”

Graham’s essay shines a light on a software business truism: the barriers to entry are really low, so expect lots of competition. Those with more business acumen than me often point out that first mover advantage makes a competitive difference, so if you can shorten your development cycle, get to market first, all with high quality, who wouldn’t? And with today’s audience almost expecting features like recommendations, awesome search, social networking, classification, high availability, speed, scale and more, perhaps it’s time we mapped that sort of feature set to the best programming language and toolset for the job.

The book “Real World Haskell” (Haskell being the infamous pure functional language used by academics, geeks and language nuts) also has a great deal for those that choose the functional programming path: Novelty, Power and Enjoyment. The book is a great read, and worth the time if you’re looking for a deep dive in to one of the more ‘purist’ functional languages around. 

Sill not convinced? Erik Meijer (among others) have made the case for purist functional programming ideology being the saviour to the concurrency problem we’re facing today. He makes the case that state and side-effects (which include IO, exceptions, threads, resources, reflection etc etc) are the bane of concurrency, and that imperative/OO languages today simply aren’t equipped to declare and manage those side-effects in a safe way. I like his unexpected exception example on page 15:

var xs = new[] { 5, 4, 3, 2, 1, 0 };
IEnumerable<int> q = null;

try
{
    q = from x in xs select 1 / x;
}
catch
{
    q = new int[] {};
}

foreach (var z in q)
    // DivideByZero gets thrown here and not caught!
    Console.WriteLine(z);

The link between unexpected exception behaviour and concurrency is vague at best, but the basic concept is this: if you can’t describe side-effects using the available language syntax, the compiler can’t reason about it, and inevitably can’t tell you if it’s safe to parallelise or not. He later goes on to talk about how a functional feature called Monads are able to describe side-effects right there in the type system, thus allowing the compiler, the programmer, and/or concurrency libraries to take advantage of that when figuring out how to parallelise safely.

Tim Sweeney from Epic Games (who brought me Gears of War, and I love them for it) also shares Erik’s view – his POPL (Principles of Programming Languages) talk titled “The Next Mainstream Programming Language: A Game Developers Prospective” walks through the life of a game programmer in an imperative OO world. The reliability and concurrency/data parallelism slides have strong arguments for language features found in traditional functional programming languages. Slides 28-39 start with an incredible statistic: 50% of Unreal engine bugs can be traced to common imperative/OO compiler problems (out-of-bounds array access, dereferencing null pointers, integer overflow and uninitialized variable access).

Under the QCon track “Historically Bad Idea’s”, Tony Hoare, type system designer for ALGOL, call’s the invention of the null reference a “billion dollar mistake”. In similar vein to Erik and Tim, he call’s for the eradication of null from the type systems of current languages – something functional programmers generally don’t need to worry about.

And if better design traits, easier composability, reduced developer cognitive dissonance, competitive differentiation, fun, power, enjoyment, tamed side-effects, safety and being smarter and better looking aren’t enough, you could always start looking to the functional language feature set to see what you’re missing out on.

Functional Language Feature Set

Wikipedia gives a decent run down on some of the notable functional programming language features: higher-order functions, recursion, laziness, evaluation, pattern matching, type inference, immutability and meta-programming. I suggest reading and understanding each of those features listed (or you could come to my talk and learn about them there ;)).

Matt Podwysocki and Erik Meijer have a great discussion about functional programming and it’s feature set (in fact Matt’s blog is just fantastic for all things functional, and more importantly all things F#).

Erik also goes solo with a whiteboard session on some core functional programming features and concepts.

The comp.lang.functional FAQ has a great overview of what functional programming is, and some of the features that make them special.

And if you want to build a multi-billion dollar search company, you better ramp up on Map Reduce (two simple functions that are available in most functional programming languages). These two simple functions gave Google the programming model it needed to scale up and out and crawl billions of web pages.

“MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.”

While you may not get the distributed computing part for free when adopting a functional programming language, you get all the language enablers to go do it yourself (and my personal belief is that your favourite cloud computing provider will provide you with a similar model anyway).

And with that, we begin looking at the functional feature set with a F#/.NET slant: “The Don” (as I like to call him), talks on video about F# and its feature set. His “comparison’s with C#” slides are good – one core point being that in C#, there’s more noise than signal (as in, there’s a lot more lines of code in C#, than there is in the same F# example). The “Pleasure and Pain” slides (about 8 mins in) you may instantly relate to if you’ve ever tried to build a compiler in an OO language.

Amanda Launcher and Ted Neward have a chat with the .NET Rocks team about Functional Programming, it’s feature set, and F#.

Chris Smith (the F# team’s guru tester) and Dustin Campbell chat on the Deep Fried Bytes podcast about F# and functional programming.

And StackOverflow has a discussion of functional features and how they relate to the C# language.

The F# Elevator Pitch

The modern software application has evolved in to a heterogeneous mesh of platforms, architecture, data, and design, which spans a variety of topologies like client/server, n-tier, Peer-to-Peer, cloud and more. The modern software consumer has also evolved. They want  high availability, reliable apps that light up on new hardware with many-cores. They want their apps to scale and accommodate the massive amount of data that they and their friends generate.

And you just want to get through the day building what they want, cheaply and quickly without complexity.

F# has superb fundamentals to help you:

  • A full-featured functional language with all the guts and glory to help solve the problems we’ve talked about above.
  • Works great with large datasets in a beautifully succinct and scalable way. (Check out how the Applied Games group parsed terabytes of data easily with F#)
  • And works great in the small and adds value to your app dev cycle straight away.
  • Helps you program in a naturally concurrent way, and gives you concurrency features out-of-the box to scale across many-cores.

And its core value proposition:

  • It’s a .NET language, which comes with all the cross-language interop goodness we’ve come to expect from the CLR.
  • Because it speaks .NET, you can use all the .NET libraries you know and love (BCL, WPF, WCF, ASP.NET etc).
  • And that also means you get to use ALL the .NET tools you know and love (profilers, code analysis,
  • It speaks OO and mutability too: while immutable and functional by default, gives you plenty of escape hatches to walk the mutable/OO plank.

But most of all, it’s hacking F# is just plain fun. And if you get nothing out of the 2000+ word “why functional, why F#” pitch, take this away: learning F# and/or the functional paradigm will make you a much better C#/VB.NET programmer. And that to me is worth the 40-60 hours of brain-hurt. So get started now. Download the free VS 2008 add-on, or grab the VS 2010 beta bits (it comes default installed!)

Hopefully see you at the talk!

Download DEV450 Demo Code Search, Netflix Recommendation Engine, SVM Classifier, Fuzzy matching.

Tags:

Blog Reboot

by joelpob 31. May 2009 09:56

My virtual host provider (vpsland.com) was a fail -- it's a long story, which includes sub-plots like "no backups", and my favourite "your instance isn't responding, and we can't reboot it right now". So off to discountasp.net with a migration to BlogEngine.NET. I love the feature set. And download->compile->modify->deploy to DiscountASP was maybe an hour. It also worked on my linux box via Mono straight up -- great stuff.

A lot going on lately, including a fantastic showing of F# at JAOO downunder: I built a simple search engine + fuzzy keyword search and a Netflix recommendation engine in under 45 mins 70 minutes. You can grab the slides + code here. I'm also planning a "Build a OS in under an hour" talk in the spirit of my PDC and TechEd "Build a compiler" equivalents. I'll likely test the waters on that at the next local user group. 

Anyway -- I'm motivated to keep the content rolling, and not let the infrastructure man beat me down. More to come soon...

 

Tags:

Powered by BlogEngine.NET 1.5.0.7
A modified theme by Mads Kristensen

About the author

.NET nerd and F# evangelical. Random updates at Twitter @joelpob, but real content goes here.

Articles, Presentations and more here.