Ted Neward and I have been hacking on a second edition to the Shared Source CLI Essentials book for over a year. The goal was simple: ramp the book content to be “correct” for version 2.0 of the SSCLI, and add a few chapters on the interesting new 2.0 features. It turned out to be a huge job (probably because of our insatiable appetite for deep technical detail) but we’re almost done, and we figured we’d place a rough cuts drop up for the community to take a look at. You can find the draft version of the book here: Shared Source CLI 2.0 Internals DRAFT.pdf.

Some of the changes and additions to the book include:

  • Updated to reflect the 2.0 source code. A *lot* of stuff changed internally, specifically for deeply ingrained type system features like Generics.
  • Added a chapter on Reflection and Code Generation features. We talk about new 2.0 features like Lightweight Code Generation (LCG) and the new Reflection caching mechanisms.
  • Added a chapter on Generics to explore how it works under the hood, from metadata changes, right through to runtime data structure changes.
  • Added an Appendix tutorial that walks through step-by-step the changes required to add a new opcode to the runtime (everything from verification, to JIT compilation).
  • Explores a new interface call dispatch architecture called “Virtual Stub Dispatch”.
  • Looks at the new native x86 calling convention.

The book will be available for *free* in e-book PDF/Word format from the SSCLI 2.0 download site in the coming weeks, once it's polished up for print. It’ll also be available in dead-tree format from Amazon soon after that.

Along with the draft download, be sure to check out our chat with Richard and Carl over at .NET Rocks about the book, Rotor and its future.

Comments, nitpicks, and praise are most welcome.

Permalink | Trackback (0)
posted in General


My good friend (and former officemate) Joe Duffy is a guest on .NET Rocks! It's a great podcast, and is pertinent to some of the stuff we've been talking about recently. Joe's book is almost out too - I've read the first half of the rough draft, and it's just awesome (the first few pages are littered with assembly language!).

More posts for the software adventure series soon. I've been out with a really bad chest infection for the past three weeks which has delayed the spit and polish.

Hope you had a good New Year.

Permalink | Trackback (1)
posted in General


I’m skipping (Functional Languages Going Mainstream) for now as I’m not entirely happy with it yet. I’ll polish it off at a later date – for now, I’ll just unblock and get on with the show.


3. Client Side Parallel Programming Models

 

I love my dual core Intel Centrino Pro based Thinkpad T61p, except it doesn't make Outlook run any faster than my old T42 which had exactly half the processors and a similar clock speed. Welcome to the reality of multi-core - old single threaded apps humming along at the same speed on dual, quad, octa (n-tuple) core machines.

The reason is now becoming fairly well known: we're starting to hit physical walls when trying to bump clock speeds higher. I've talked about this problem in depth before, so I won't rehash it, but essentially higher frequencies == more power == more cooling & more leakage, and we've hit walls on how much we can cool, and how much leakage we can cope with. There have been a few interesting CPU design announcements which have helped reduce the problem somewhat (see: High-k and Metal Gate Make-Over at 45nm, and Intel's Fundamental Advance in Transistor Design Extends Moore's Law, Computing Performance), but for the foreseeable future, we're stuck with ever expanding L1/L2/L3 caches, higher throughput through architecture innovation, more cores, and slight movements forward in terms of clock frequency.

Now don't get me wrong - more cache, and better per-clock-cycle execution bang for buck is great news, and hopefully it will make Outlook run that little bit faster, but I've got this whole extra processor sitting around waiting to slurp up instructions. Why can't it help render the Outlook UI, or re-index my Mail store?

And let's be clear here - Moore's law is still going strong - we're just getting different processor scaling to what we've enjoyed before. It's a classic software/hardware impedance mismatch that is at the root of this problem.


What are the Challenges?

If we believe that dual/quad/octa/n-tuple cores + cache scaling + internals advancements are going to be the default way that processors are expected to scale, we must adjust the software to scale with it. Just as DOOM ran faster when I xcopied it from my old 486 to my Pentium 166 MMX, we need the same experience for our LOB apps, enterprise apps and so on.

When you start to think about how to solve the problem, an interesting meta-question arises: should we really be focusing on making sure client-side platforms scale? With the popular movement of apps to the web, perhaps you could equate the experience of upgrading from a 486 to a Pentium to that of going from 24 Mbit ADSL 2+ to 100 Mbit VHDSL? Users see their web apps load quicker with reduced latency - all without changes to the client side machine.

Let's delay tackling that meta-question for now (in fact, I'll be talking about it in Part 5: OS Hardware Virtualization & Cloud Computing), and just assume that at least for the next 5 years, we can expect that the market will keep adopting laptops, desktops and hardware coupled with a fat client operating system. To tackle the software / hardware scaling impedance mismatch, we'll need to adjust the software side.

Do away with Shared Memory, Threads and Locks

There's nothing stopping us from writing software that scales to ever-expanding cores - we've had basic concurrency building blocks since before I was born. Threads, Shared Memory and Locks are the typical tools used today, but I'd argue it's through a lack of choice, rather than being the best tool for the job. Clearly they suck (in fact, without them I'd be out of a job as fixing race conditions is actually a really sweet rent-paying niche). Your average ISV programmer is going to struggle with them. What's worse, the bugs are almost impossible to diagnose with today's toolset - I've seen a torn read/write in the wild (e.g. a read of a 64 bit value from memory on a 32 bit wide machine requires two MOV instructions: thread 1 pauses after slurping in the 32 higher order bits; thread 2 kicks in and performs a torn write over the lower order bits and pauses; thread 1 starts again, reading in the lower order bits overwritten by thread 2; bam - broken invariant and a weird number to boot), and that was after days of sitting behind WinDbg.
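To make the torn read concrete, here is a minimal Python sketch that replays that interleaving deterministically. It is a simulation, not real threading: the `Shared64` class and the hand-written "thread 1 / thread 2" steps are illustrative assumptions standing in for two 32-bit MOVs on a real machine.

```python
MASK32 = 0xFFFFFFFF


class Shared64:
    """A 64-bit value stored as two 32-bit words, as on a 32-bit machine."""

    def __init__(self, value):
        self.hi = (value >> 32) & MASK32
        self.lo = value & MASK32


def torn_read(old, new):
    """Replay the interleaving: reader gets old's high half, new's low half."""
    mem = Shared64(old)

    # Thread 1: first MOV reads the high 32 bits, then gets pre-empted.
    t1_hi = mem.hi

    # Thread 2: writes only the low 32 bits of the new value, then pauses
    # before it gets to the high half (a torn write).
    mem.lo = new & MASK32

    # Thread 1: resumes, second MOV reads the (now overwritten) low 32 bits.
    t1_lo = mem.lo

    return (t1_hi << 32) | t1_lo


old = 0x00000001_FFFFFFFF
new = 0x00000002_00000000
seen = torn_read(old, new)
print(hex(old), hex(new), hex(seen))
```

The value thread 1 ends up with is neither the old number nor the new one, which is exactly why this class of bug looks like memory corruption from inside a debugger.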


Solutions and Opportunities

The good news is, there are plenty of places to start looking for solutions, and the opportunity in this space is what makes it so exciting. There's OS level plumbing, virtualization (hardware and software), virtual machine plumbing (ala software transactional memory), compilers, languages, libraries, GPU programming and more. Let's look at the 10,000 foot view in some of these areas - hopefully it'll get you as excited as I am about this space.

Languages

Have you seen C# 3.0 lately? Introducing language plumbing that enables data querying libraries like LINQ was genius. Furthermore, libraries like Parallel Extensions Framework for .NET can leverage the LINQ architecture to perform LINQ queries in parallel across multiple cores. It leverages a parallel programming style called declarative data parallelism (ala functional programming etc). Declarative data parallelism has few or no side-effects. No side effects means high throughput operations over data without the need for "protection". Consider the case of multiplying all elements in an array by 2: for two processors, you could split the array in to two parts, hand one part to one processor and the other part to the second processor and ask them to perform the same operation in parallel. Both procs walk their array parts multiplying elements and at the end of it all, we simply join the two parts from each processor for our result. This kind of parallelism is highly scalable (Google's MapReduce API does a very similar thing, and scales across thousands of machines).
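The multiply-by-2 example can be sketched in a few lines. This is Python rather than PLINQ, and the helper names (`split`, `double_chunk`, `parallel_double`) are made up for illustration. It uses a thread pool to keep the sketch self-contained; in CPython the GIL means threads won't actually speed up arithmetic, so treat this as showing the shape (split, work independently, join) rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def double_chunk(chunk):
    """Side-effect-free work on one partition: builds and returns a new list."""
    return [x * 2 for x in chunk]


def split(values, parts):
    """Split values into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def parallel_double(values, workers=2):
    """Hand one chunk to each worker, then join the parts back in order."""
    chunks = split(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(double_chunk, chunks)  # map preserves chunk order
        return [x for part in parts for x in part]


print(parallel_double([1, 2, 3, 4, 5]))
```

Because `double_chunk` never touches shared state, no locks are needed anywhere, which is the whole appeal of the style.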

Erlang is another parallel enabled programming language that's close to my heart. It's a "fit for purpose" highly parallel programming language that leverages a parallel programming model called the Actor model. From Wikipedia: "The Actor model is a mathematical model of concurrent computation that treats "actors" as the universal primitives of concurrent digital computation: in response to a message that it receives, an actor can make local decisions, create more actors, send more messages, and determine how to respond to the next message received". Erlang kicks ass. The messages that are sent and received are truly first class citizens, and they're light as a feather, weighing in at only a few hundred bytes per message. Those messages are sent in a highly concurrent fashion via an Erlang abstraction called 'processes', which are neither OS threads, nor fibres, but rather a lightweight isolated process abstraction that is scheduled and managed by Erlang itself. Really cool stuff.
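For a feel of the Actor model without installing Erlang, here is a rough Python imitation. Big caveat: it uses one OS thread per actor, which is far heavier than an Erlang process, and the `CounterActor` class and its message tuples are invented for this sketch. What it does show is the key idea: state is private to the actor, and the only way to touch it is to send a message.

```python
import queue
import threading


class CounterActor:
    """An actor: private state, a mailbox, and one thread that drains it."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # only ever touched by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Asynchronous send: enqueue a (kind, payload) tuple and return."""
        self._mailbox.put(message)

    def _run(self):
        while True:
            kind, payload = self._mailbox.get()
            if kind == "add":
                self._count += payload
            elif kind == "get":
                payload.put(self._count)  # payload is the caller's reply queue
            elif kind == "stop":
                return


def run_demo(senders=4, messages_each=100):
    """Several threads message one actor; no locks appear in this code."""
    actor = CounterActor()

    def sender():
        for _ in range(messages_each):
            actor.send(("add", 1))

    threads = [threading.Thread(target=sender) for _ in range(senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    reply = queue.Queue()
    actor.send(("get", reply))  # queued behind every "add" sent above
    total = reply.get(timeout=5)
    actor.send(("stop", None))
    return total


print(run_demo())
```

The count comes out right without a lock in sight, because messages are processed one at a time by the actor that owns the state.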

There's a bunch of other interesting languages that support concurrency features as first class or close to: Concurrent ML, C Omega, F# etc (http://en.wikipedia.org/wiki/Concurrent_computing), and they're worth checking out. F# especially if you live on the .NET platform like I do. My guess is we'll be seeing more and more concurrency features built in to languages as time rolls on.

Libraries

Clearly the most exciting announcement for the parallel programming .NET world comes from the Microsoft Parallel Extensions for .NET library that was dropped in a CTP a few days ago. Joe Duffy and Stephen Toub have been super busy lately getting this stuff out the door, and after toying with it in the past few days, I must say they've done a great job.

The PFX framework can be leveraged through a few different API surface areas: PLINQ (or Parallel LINQ), which is essentially LINQ on concurrency crack; a small set of imperative programming constructs that express parallelism over data sets (parallel for loops, parallel foreach loops etc); and a parallel abstraction over task based programming (a task abstraction is simply a series of actions) which is lighter weight than its ThreadPool cousin.

This stuff is goodness. I know a lot of work has gone on behind the scenes to make this API surface area cheap and efficient, however, at the end of the day, we're only a little further ahead of where we were before. A wise man once told me that in order to win hearts and minds, we need to make people fall in to the pit of success, and I'm not totally convinced that your average programmer can go from a single threaded client side forms app to a fully multithreaded concurrent app via API's like PLINQ and PFX, no matter how simple they are.

The issue here is the fundamental mismatch between a forms style programming model (imperative OO + Forms API surface area + events + UI thread) and a declarative functional style of data parallelism. And while the PFX task library is more suited to the forms style common denominator, it's still not a "pit of success" answer - a lot of work and care needs to be done on behalf of the programmer to get it 'right'. Not only that, there's a bunch of moving parts in your typical forms app that just aren't architecturally suited to being split up in to abstractions.

The good news here is that PFX is not a play for a new world order in .NET concurrency, it's about the building blocks. Once you have that foundation coupled with a small discrete surface area (even if it does have a small amount of learning curve), you enable a virtuous cycle which helps drive programmers to a better place. Building blocks enable libraries (MapReduce.NET anyone?), community involvement, partnerships (WPF team talks to PFX team who talk to Dynamic Languages team) and so on. We've done it before with OO, so it's definitely possible to move the masses to a more data driven task/actor oriented approach so long as we take baby steps. So while you could say that these libraries are evolutionary, I'm convinced it's going to help enable the 'revolutionary'.

Platforms

ASP.NET/IIS has done a great job at abstracting away concurrency issues for years. Of course, the mental model for programmers is fairly simple: autonomous users piped through an HTTP request with all "shared" data accesses done through a backend database somewhere. Perhaps we can plug that mental model in to client side application programming?

Virtual Machine Plumbing

There's a bunch of cool research going on to make parallelism implicit through under the hood plumbing. This means the programmer can merrily go on her way not caring about deadlocks, race conditions and other nasty stuff, but instead concentrate on how her program should compose together with a known expectation of concurrency. Software Transactional Memory is one such implementation, and there's a fair amount of love and hate around for the idea.

And while STM does provide an elegant way to enforce composability and modularity (which is a nice way of saying that the programmer must declare their program invariants up front), it's not clear if you can broaden the concept to encapsulate and protect most of the OO imperative world - the broader you go, the more rules you need.

I'll leave it as an exercise to the reader to go off and understand some of these issues - the journey is definitely rewarding.

Better OS Level Parallelism

The elephant in the room is that for the most part, apps like Outlook don't really do all that much hard data processing and number crunching. Sure, search, indexing, spam filtering etc are computationally intensive, but for the most part, users aren't waiting for that; they're sitting around waiting for Outlook to slurp up data from the disk. IO (especially blocking IO) is concurrency's best and worst friend. You can spend IO time doing something worthwhile with the processor, or you can spend it waiting for the data it's off collecting.
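As a small illustration of spending IO time on something worthwhile, here is a Python `asyncio` sketch. The `read_mail_store` coroutine is a made-up stand-in for a slow disk or network read (it just sleeps), so this shows only the overlap pattern, not real Outlook or Win32 IO.

```python
import asyncio


async def read_mail_store(name, delay):
    """Hypothetical slow read; the sleep stands in for waiting on the disk."""
    await asyncio.sleep(delay)  # the event loop runs other tasks meanwhile
    return f"{name}: loaded"


async def load_all():
    """Start all three reads at once instead of waiting for each in turn."""
    reads = [
        read_mail_store("inbox", 0.05),
        read_mail_store("sent", 0.05),
        read_mail_store("index", 0.05),
    ]
    return await asyncio.gather(*reads)  # results keep submission order


results = asyncio.run(load_all())
print(results)
```

The three waits overlap, so the total wall-clock time is roughly that of the slowest read rather than the sum of all three.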

It'd be great if we had better tools and constructs to make designing for IO bottlenecks easier. I'm not exactly sure what we need, but we could start with some fundamentals like hardcore documentation.

I'd also like lighter code encapsulation constructs (something lighter than your traditional Win32 thread) that are fully supported. Fibers are okay, but I don't have a brilliantly simple managed API, and I don't really want to write a scheduler either.

GPU's, Custom Sockets and More Fun Stuff

When I saw this I got a bit googly-eyed. Custom silicon to do instruction crunching on the same bus as your general purpose x86/64 is powerful and an interesting enabler. Not so long ago, if you wanted to do serious domain specific computation (financial modelling, simulations etc) you'd crack open your favourite Verilog compiler and start building a custom processor. Soon, if you need that extra FPU power you can go and buy the latest NVidia card and start coding away.

Of course, you'll need great libraries (stuff that looks like what we consider common today), or have the compiler/libraries figure it out for you. There's no reason why a PLINQ operation over an array of doubles can't be offloaded on to the GPU.


My Guess?

It's too close to call. There's a bunch of moving parts here: web enabled push to computing in the cloud; reduced need for Win32; consolidation of various platforms (Win32, POSIX etc) in to the browser + Javascript; mobile computing (more need for increased battery life) and more. If we ignore all those issues for a minute, and concentrate on what the programmer might be looking at in the coming years for client side programming, I think that the library story + some amount of language integration is probably a safe bet.

The driver won't be technology, it'll be the market. If the market decides it's worth shelling out for 80 core Intel procs, then we'll have no choice but to push our apps in to a parallel world. If we end up in a web enabled world, then it's likely we'll care more about technologies that enable wider distribution of computation: making sure our server farms stay cool, and that browser based clients can help out with the computational load.

Brilliantly interesting. And that's why it's in my top 8 most excellent software adventures.

cheers.

 

 

Permalink | Trackback (0)
posted in General


On to part two in the series of Most Excellent Software Adventures. In this episode, we talk about scalability in the massive sense – ala Google style. Thousands of commodity machines, connected and waiting for your algorithm and data inputs, and the API’s that drive them.

NOTE: This is a really long post – if you just want the meat, jump to the “My Thoughts” section.

2.  Massive Scalability

Like any red blooded male, I love fast stuff. While most of my XY chromosomal counterparts are cheering for a roaring V8, I’m more in to seeing how fast I can flip bits and multiply binary numbers. The gaming generation grew up on overclocking Celerons and unsoldering resistors from underclocked “Slot A” Athlons to drive more speed in to their already overworked CPU’s. All in the name of benchmarks: seeing numbers go down, and throughput go up. All great fun.

While the motivation for overclocking etc. was generally hobbyist, I think we’ll see the same kind of interest brew for massively parallel systems: crunching huge datasets at high speeds in the name of brand new algorithms, pattern matching, business intelligence, and just plain geekish fun. And while we won’t be able to set up these systems at home, there’s a good chance it won’t matter: scale on tap will be at your local hosting pub in the cloud.

So what does that mean? When I talk about massive scale, I’m really talking about thousands of machines all connected to a super high speed backbone, all controlled programmatically by a simple API.

What do I want?

I want it cheap. I want thousands of machines on tap to be reachable. I want to cook up new algorithms and ideas and test them over these clusters for less than the cost of lunch. And I want the API to be so simple that the average Elvis programmer can grok it and get going in less time than a battle with a COM API. I don’t want to deal with bandwidth issues or latency, meaning my datasets should be local to my cluster, and can be moved from machine to machine without me caring about it. I want tools that show me inefficiencies in my algorithms, and diagnostics that make sense. Most of all, I want libraries that know how to play nicely in the sandpit of scale.

Give me a scenario

Take this for a typical day at the pub: I want to understand what my competitors have been doing for the past week. Let’s make the scenario basic: I want to know the relative airtime of each competitor in the press and the blogosphere, and I’ll visualize what’s new and important using a simple tag cloud mechanism.

So, I need a list of RSS feeds, news sites, and forums related to my business area. Let’s ignore where I get this list from, and assume that the list has tens of thousands of URL’s, which means potentially tens of thousands of documents that I need to take a look at. Let’s take that list, and run it through a filter, looking for keywords of my competitors and their products. The list becomes manageable: say ten thousand documents.

After that, I need to do the following:

  • Tokenize each document
  • Create word vectors for each document (lazily, as we don't know how big the total word vector would end up being)
  • Calculate the relative term frequency against my loose representation of the total word vector
  • Mash the term frequency vector of each document together, producing a tag cloud showing the most interesting words (based on relative frequency)

The algorithm is pretty high level, but basically it’s figuring out what the interesting and new words are based on word counts relative to one another. It’s a common technique (which is also used in search engine cataloging).
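The steps above can be sketched on a single machine in Python. The post doesn't pin down the exact "relative term frequency" formula, so this sketch assumes plain TF-IDF (term frequency in the document, scaled by how few documents contain the word); the function names and the three toy documents are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and pull out simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_vector(text):
    """Per-document word counts."""
    return Counter(tokenize(text))


def document_frequency(vectors):
    """Global vector: in how many documents does each word appear?"""
    df = Counter()
    for vector in vectors:
        df.update(vector.keys())
    return df


def tf_idf(vector, df, n_docs):
    """Assumed scoring: (count / doc length) * log(n_docs / doc frequency)."""
    total = sum(vector.values())
    return {w: (c / total) * math.log(n_docs / df[w]) for w, c in vector.items()}


def tag_cloud(docs, top=3):
    """Mash the per-document scores together and keep the top words."""
    vectors = [word_vector(d) for d in docs]
    df = document_frequency(vectors)
    merged = Counter()
    for vector in vectors:
        merged.update(tf_idf(vector, df, len(docs)))
    return [word for word, _ in merged.most_common(top)]


docs = [
    "acme ships the widget",
    "acme ships the gadget",
    "the zeta zeta zeta launch",
]
print(tag_cloud(docs, top=1))
```

Words that show up everywhere (like "the") score zero, while a word that is hammered in one document and absent from the rest floats to the top, which is the "new product name" effect the tag cloud is after.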

Let’s take a look at how we could scale this out.

The slurping down of documents can be parallelized across machines. Divide up the document list among the machines you have running and go. The filtering of each document can then be done locally too. Vector creation is done on that machine, and the result handed back to a central master machine, to create a global word vector. Once we’ve created that, we can then pass the global word vector back to our workers for relative term frequency calculation (a highly mathematical calculation, which could be locally parallelized on multiple procs). After that, we centralize the mashup of the term frequency vectors, and produce our Tag Cloud of interesting words associated with our competitors (for example, a competitor launched a new product, the name of that product would be an interesting word, and highlighted in the tag cloud).

(This kind of scenario - send out work, calculate, retrieve that work to a central place, send it out again for more calculation - is the central theme of Google’s MapReduce. More on that later.)

The big question really is: could you do this on a local machine? Probably. Would it take a long freakin time? Absolutely. Scale this to the millions of documents, and you'd probably have no choice.

Issues

With scenarios like that, it should become clear that we’re missing a bunch of the building blocks to get something like this up and running.

Here’s my brain dump on the problems, and missing pieces of the puzzle:

We don’t have the tools: It’s simple for me to get a Windows Forms app up and running, or an enterprise level n-tier ASP.NET app going, but where in the world do I find tools that help me code up a massively scalable algorithm like my fantasy one above? At the moment, I can’t fire up Visual Studio or Eclipse, start coding my algorithm, and then deploy in a few steps.

We don’t have the API’s: We have no good massively parallel scale based API’s. We’ve seen a bunch of papers, and a few CTP’s in the pipeline, but they’re not tied to a platform or a tool chain, something we need in order to get this stuff off the ground and in to the mainstream.

It’s not cheap enough: I want testing to be nearly free, or at least billed per minute, not per hour.

It takes too long to spin up instances: At least with Amazon EC2 it does. I haven’t tested out the other services, but I did find that EC2 takes an amazingly long time to spin up instances for me to play on. I need these things on demand, and quickly. (More on EC2 later)

Diagnostic tools are required: I want a dashboard which shows me all the nodes running my algorithm, viewable partial results, and hotspots for algorithm problems.

Fast access to Data: Probably the biggest problem: how can I move my data around from node to node like it was on the same machine? I want network links as fast as memory buses. Also, how do I move external data (say, from the internet) to my local cluster as quickly as possible? After all, I don’t want my machines sitting idle while I wait for the network to respond.

Current solutions?

There’s a ton of stuff going on in this space, including a few commercial offerings. I haven’t played on them all, so if you have the inside word, jump on the comments.

Google’s MapReduce:

This is the paper that piqued my interest years ago. It’s Google’s crown jewel, its competitive advantage. If you want to know how to calculate TF/IDF frequencies for all documents on the Internet as fast and efficiently as possible, this is the kind of infrastructure you’d need.

Since most of my readers are .NET folk, I thought I’d give you an insight in to what I think a nice .NET MapReduce API would look like:

MapReduce .NET code

Assume for a second that bigArray is just huge – millions and millions of numbers. The array would get chopped up, distributed across my cluster, coupled with the anonymous method, and crunch: the result returned as an IList<int>.

Of course, Google’s MapReduce isn’t publicly available, but by all accounts, it’s just brilliant. Lots of tools, lots of resources, great API’s.
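To give a rough feel for that shape of API, here is a single-machine Python sketch. It is not the .NET API pictured in the original post and not Google's MapReduce (which works on key/value pairs and shuffles between machines): `cluster_map` and `cluster_map_reduce` are hypothetical names, and a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def _chunks(data, parts):
    """Chop data into at most `parts` contiguous slices."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def cluster_map(data, mapper, workers=4):
    """Ship the anonymous function to each chunk; return one flat list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(lambda chunk: [mapper(x) for x in chunk],
                          _chunks(data, workers))
        return [y for part in mapped for y in part]


def cluster_map_reduce(data, mapper, reducer, workers=4):
    """Map and fold each chunk locally, then fold the partial results.

    Assumes non-empty data and an associative reducer (e.g. addition).
    """
    def run_chunk(chunk):
        return reduce(reducer, (mapper(x) for x in chunk))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(run_chunk, _chunks(data, workers)))
    return reduce(reducer, partials)


big_array = list(range(1, 1001))
doubled = cluster_map(big_array, lambda x: x * 2)
total = cluster_map_reduce(big_array, lambda x: x * 2, lambda a, b: a + b)
print(doubled[:3], total)
```

The caller only supplies the data and a couple of lambdas; how the array is chopped up and where the pieces run is the library's problem, which is the point of the API style.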

However, there’s good news: an open source “roll your own” clone called Hadoop is in development, and is available for download.

Hadoop:

It essentially implements the MapReduce paper, plus an equivalent of the GFS file system called HDFS. It’s Java (which is a sore point for me, but nevertheless), it’s painful to set up, and the tooling for it is rather lax, but it’s a start.

Amazon EC2:

I was on the early beta program, and I loved it. EC2 is essentially virtual machines on tap. Spin up Linux instances in minutes (either their prebuilt ones, or your own), receive back an externally visible IP and you’re away.

Having said that though, EC2 falls short of what I really want: distributed computing API’s that are tied to a platform. EC2 doesn’t have a distributed programming API like MapReduce, it only has infrastructure API’s to spin up and spin down instances. Of course, that doesn’t mean you can’t roll your own.

Sun’s Grid:

I haven’t used this, but I’ve heard good things. It’s funny, I really think Sun ‘get it’, meaning they’re already ahead of the game (the “network is the computer” thing that they touted a few years back is probably going to be a reality in the next 10 years or so). And with this, as usual, Sun is early to the party, but it’s the unpopular kid in the corner drinking orange juice and trying not to be noticed.

But, like everything else we’ve talked about, it still falls short. “Results are delivered by e-mail” – huh? I want the stuff I build on your Grid to be exposed to the outside world. I don’t want results sent by e-mail. Blah.

My Thoughts?

My guess is that we’ll see a bunch of changes in the ecosystem in the next 5-10 years. The driver will likely be organizations that care about software as a service: building and exposing those services using pay-per-play economics. We’re seeing it now with Amazon, Microsoft and Google offering all sorts of “pay per 1000 transactions” web services.

With that being the driver for demand, there are opportunities for hosting services to expose scalable clusters, using friendly API’s that can be integrated in to developer tool suites. It could be a Microsoft offering (given they’re great at platforms), but it’s likely going to be an agile startup partnering with an “Amazon like” cluster hosting company doing all the driving.

API’s will consolidate, languages will come to the party too: MapReduce like API’s are sensible, but languages like Erlang and F# + PFX are really nice, and aren’t too far removed for the programmer who typically speaks OO. In order to raise the level of abstraction for programming on a clustered, massively scalable platform, we need to start with API's and then the languages.

Data will travel at blitzing speeds: the ‘net should get faster and cheaper too (however, I’ve not seen too much evidence of that happening, in fact, there’s evidence to the contrary, but we’ll see).

Cluster hosting services will differentiate using exposed web services local to the cluster: assuming all cluster hosts have the stuff I want (API’s + nodes on demand etc.) then they’ll differentiate through web service offerings. “Hi, we’re Amazon EC2, and we have a copy of the Internet you can use”, or “Hi, we’re Company X, we have all the NYSE second-by-second transaction history – terabytes of data – all for free!”. That way, we don’t have to worry about how fast data is coming down the network pipe, meaning more CPU’s are doing more number crunching.

Von Neumann might not be a problem anymore: We’re seeing 80 cores on Intel research procs, perhaps along with scaling out, we’ll be scaling up too? If memory were shared across machines, and network pipes were like the memory buses of today, we’d lift the level of abstraction for algorithm design, and not have to worry about things like network latency, bus latency, and CPU stalling.

Virtualization will be key: hardware virtualization is necessary to make this secure and efficient. Intel are already working on this stuff – perhaps they see the vision too?

OS Virtualization (or Virtual Machines as the new Win32/POSIX) will also be key: if your scalable algorithm is tied to the environment, then virtualization (and the movement of virtual instances from node to node) is necessary. For most of the scenarios I can think of, all I really need is .NET 2.0 – who’s to say that needs to sit on top of Win32? The programming platform needs to be abstracted from the hardware platform - it needs to be fluid.

Wrapping up

After a 2000+ word post, I’m sure you’ve had enough. But clearly dead simple distributed programming API’s which are tightly coupled to massively scalable infrastructure, and the developer tools to go with them, are an “excellent adventure” in software engineering. The opportunities with this kind of scale are endless, and building the libraries and the platform is a worthy effort. Partying on Mini-Googles like I've described for dirt cheap would just be SO much fun!

Thanks for reading. Part 3 will be up next week, with a look at Functional Programming languages. Comments always welcome.

Permalink | Trackback (82)
posted in General


Last week I was reminiscing about the good ‘ol days tinkering with computers: Commodore 64’s, GW-BASIC, Turbo Pascal 5.0, DOOM and the AUTOEXEC.BAT/CONFIG.SYS hacking required to get it running on underprivileged 486’s, Amiga 500’s, broken Linux 1.0 kernel compiles, EGA video cards and more Sierra games than I can remember. Getting stuff running was hard. Understanding how stuff worked was heaps of fun. Connectivity to other likeminded communities was basically non-existent, so a great book on the topic of interest was like striking gold in Ballarat.

It got me thinking though – if I were to start again in 2007, what would be the equivalent to learning about the flat memory address space of a Commodore 64, or breaking open a copy of Borland’s new Turbo Pascal IDE? I had to ignore my first thought of being mindlessly hooked on Facecrack getting nothing done, and push through to what I believe to be the 8 most interesting software engineering pursuits of the next 5 years – things that really light me up, something worthy of dedicating years of sleepless nights to.

I’m going to make this an 8 part series. Before I started this, I imagined it to be a few pages of lightweight material to get my point across and clarify my thinking – now that I’m finished, it’s a fairly dense 8000 word essay. ;) We'll start with the list, and then I'll talk about my thoughts on each of the technologies one by one over the series.

Joel's 8 most excellent software engineering adventures (in no particular order):

  1. Comprehending the Cloud (taking HTML and making programmatic sense of it)
  2. Infrastructure Scalability (scale in the massive sense: Amazon EC2, Grid Computing, GFS, MapReduce, HAMMER, S3 etc.)
  3. Functional languages (going mainstream baby!)
  4. Client side parallel programming models (PLINQ, PFX, GPU Programming)
  5. OS Hardware Virtualization (Cloud, Virtual Machines as OS's)
  6. Machine learning and Data Mining
  7. Search (Algorithms)
  8. Compilers, Languages, and DSL's (Compiler implementation, Phoenix, the spectrum of languages)

Okay, let’s start with the first of eight - comprehending the Cloud:


1. Comprehending the Cloud

True programmatic comprehension of the ‘cloud’ (that thing we call the Internet) is only just starting to get underway. We’ve got movement in microformats, RSS, well formed XHTML, web services and javascript, but there’s still a long way to go. One goal of comprehension is extensibility: the ability to programmatically extend a website or URI endpoint to create value for both the source and the extender. We’ve known about this value add system for years, it’s why Windows is so dominant, and apps like Photoshop keep their lead through the bazillion extensions you can buy.

Another goal is simplicity. I want to be able to hit a website, pass in my identity, retrieve the data I care about, and have that data loosely bind to other data I care about. Consider the following:



Here in my fake scenario, I’m slurping down the business news for the day, converting it to a list of company names and stock codes, and then sucking down the latest prices of that list from my broker – all in less than 10 lines.

I equate this experience of slurping data from websites to that of hitting a database and retrieving rows of data I care about. Let’s ignore extensibility for now, and focus on getting at that data.
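To make the "rows from a database" comparison concrete, here is a rough Python sketch of the shape such a scenario might take. Everything in it is an assumption for illustration: the headlines, the name-to-code table and the prices are made-up in-memory stand-ins for the news site and the broker, and the function names are hypothetical.

```python
# Hypothetical stand-ins for the two web endpoints; a real version would
# fetch and parse live pages instead of reading these in-memory values.
NEWS_HEADLINES = [
    "Microsoft ships new developer tools",
    "Qantas adds trans-Pacific routes",
    "Local bakery wins award",
]
STOCK_CODES = {"Microsoft": "MSFT", "Qantas": "QAN"}  # company name -> stock code
BROKER_PRICES = {"MSFT": 29.50, "QAN": 5.10}          # made-up prices


def companies_in(headlines):
    """Return (name, code) pairs for known companies mentioned in the headlines."""
    return [(name, code)
            for name, code in STOCK_CODES.items()
            if any(name in headline for headline in headlines)]


def latest_prices(companies):
    """Look up each company's price, much like selecting rows from a table."""
    return {name: BROKER_PRICES[code] for name, code in companies}


prices = latest_prices(companies_in(NEWS_HEADLINES))
```

The hard part, of course, is everything the stand-ins hide: getting from raw HTML to something as tidy as `NEWS_HEADLINES` in the first place.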

What are the challenges?

Descriptive formats for software and components have been around since the dawn of operating systems. On the DOS/Windows platforms, we had the .EXE/.COM/.DLL packaging formats which allowed a very limited amount of extensibility and interaction, then we moved to software-to-software messaging systems and shared memory (DDE, Dynamic Data Exchange, was the first attempt at this on Windows). Through the years we’ve evolved these packaging and messaging formats to be descriptive, and very extensible (VBX/ActiveX/COM/DCOM and finally .NET/Java etc.).

Formats and languages for data arguably have been around for longer, as Databases have traditionally enforced constraints through schema adherence, and query languages.

Noting this, the challenge should be clear by now: how do we make cloud comprehension as easy as loading a URI endpoint, reflecting over it, and then slurping down the data that we care about in a structured way? How do we then apply all we know about the evolution of software components to the web? Versioning? Bindings? Reliability?

Then, how do we get there today, using as the base the current minimum standard – unstructured HTML? Jason Kottke recently wrote that “open and messy trumps closed and controlled in the long run”. I tend to think that this may be true for HTML vs. structured markup (at least in the short term). Sure, we’ll have a bunch of the latter, but the former is always going to be there.

Solutions?

Dapper (http://www.dapper.net) takes a social approach: create a community where people tell the Dapper screen scraper where to find the data in the rendered version of the web pages, and convert that back to descriptive XML. There are issues with accuracy, and when the website layout changes, Dapper breaks, so it’s not terribly reliable either. A novel approach nevertheless.

Another approach is to embed semantic “helpers” into the rendering engines themselves (bulletin boards, blog engines, mailing lists etc.), so that when scraper APIs walk the site, they find navigating to the data easier.

Markup formats like RDF are also gaining traction, but it’s unrealistic to assume that we’ll retrospectively add RDF against all HTML based URI endpoints.

My guess?

My best guess at the short term solution for the worst case in cloud comprehension (just having bare minimum HTML, no RSS or anything)? Marrying late-bound data binding mechanisms with pattern matching/machine learning. You’d have the pattern matching software build up a loose idea of what it believes to be the interesting data content in the HTML (just like you can train software to understand the parts of speech in a corpus, you can presumably train it to look for content vs. navigation vs. ads etc.). Then pass that loose representation of the data to a language/platform which late-binds to the various metadata elements, and allows for meaningful introspection. To illustrate what I’m on about, consider the following imaginary HTML slurped down from a business website:



It’s ugly and unstructured. Clearly, I want something clean, something I can walk over and look at. Let’s pass it to our imaginary pattern matching/machine learning platform that dissects the rendered structure, and pulls out what’s interesting:



Much better. And likely something I can code against. This imaginary service could render RSS, RDF, or a popular webservice format, I don’t care, just give me something with structure + metadata.
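As a very rough illustration of the dissection step, the sketch below uses a simple link-density heuristic in place of real machine learning: blocks whose text is mostly links are guessed to be navigation, everything else content. This is my stand-in technique, not a description of any existing service, and the sample HTML, the company name and the 0.5 threshold are all invented for the example.

```python
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collect the text of each block element and how much of it is link text."""

    BLOCK_TAGS = {"p", "div", "td", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks: {"text": str, "link_chars": int}
        self._current = None  # block being built, or None when outside a block
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()
            self._current = {"text": "", "link_chars": 0}
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        if self._current is None:
            return
        self._current["text"] += data
        if self._in_link:
            self._current["link_chars"] += len(data.strip())

    def _flush(self):
        # Keep only blocks that contain some visible text, whitespace-normalized.
        if self._current and self._current["text"].strip():
            self._current["text"] = " ".join(self._current["text"].split())
            self.blocks.append(self._current)
        self._current = None


def classify(block, max_link_density=0.5):
    """Label a block 'navigation' when most of its text sits inside links."""
    density = block["link_chars"] / len(block["text"])
    return "navigation" if density > max_link_density else "content"


def extract(html):
    """Turn raw HTML into a list of {'kind', 'text'} records."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return [{"kind": classify(b), "text": b["text"]} for b in parser.blocks]


SAMPLE = """
<div><a href="/">Home</a> <a href="/news">News</a></div>
<p>Acme Corp shares rose 4% after <a href="/acme">earnings</a>.</p>
"""
structured = extract(SAMPLE)
```

A hand-written rule like this is brittle, which is exactly why a trained model that improves with feedback is the more interesting route.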

So, clearly this would scale better if the pattern matching & machine learning platform was shared. Anyone that’s tried training a neural net/NLP platform knows that the more accurate training data you have, the more accurate the result. Easily solved. Imagine an HTML->XML web service that allows for incremental training? Developers slurp down the URI endpoint via the webservice, and can let it know where it got it wrong (e.g. you thought this block of text was an ad, but it was actually a comment on a blog post). Over time, a URI’s metadata just gets better and better.
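The "let it know where it got it wrong" loop can be sketched with a tiny online perceptron over words. This is only a toy under my own assumptions (the class name, the two labels and the sample phrases are made up), but it shows how each correction nudges the shared model without retraining from scratch.

```python
class BlockLabeler:
    """Tiny online perceptron: guesses 'ad' vs 'content' and learns from corrections."""

    def __init__(self):
        self.weights = {}  # word -> weight; positive values lean towards 'ad'

    def _score(self, text):
        return sum(self.weights.get(word, 0.0) for word in text.lower().split())

    def predict(self, text):
        return "ad" if self._score(text) > 0 else "content"

    def correct(self, text, true_label):
        """Developer feedback: adjust weights only when the current guess is wrong."""
        if self.predict(text) == true_label:
            return
        step = 1.0 if true_label == "ad" else -1.0
        for word in text.lower().split():
            self.weights[word] = self.weights.get(word, 0.0) + step


labeler = BlockLabeler()
first_guess = labeler.predict("buy cheap flights now")         # untrained guess
labeler.correct("buy cheap flights now", "ad")                 # "no, that was an ad"
second_guess = labeler.predict("cheap flights to Hawaii")      # has learned two ad words
labeler.correct("great post about flights", "content")         # "that was a comment"
third_guess = labeler.predict("great post about flights")
```

With many developers feeding corrections into one shared instance, the weights would converge far faster than any single user could manage alone.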

Further extending this theme, consider the cases where we need to know about named entities: imagine another shared machine learned webservice, where you hand it semi-structured XML, and it hands you back the same XML but with more tags describing all the companies it found in the data.

You could pass it the following:



And it hands you back the following:



With two imaginary calls, we’ve gone from an unstructured HTML endpoint, to a semi-structured representation of what a machine believes to be the data, then we’ve added richer metadata using a specialized named-entity web service. And so starts the virtuous cycle…
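The second call could be approximated like this. A real named-entity service would presumably use a trained model; this sketch swaps in a plain lookup table (a gazetteer) so it stays short, and the company names, stock codes and element names are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical gazetteer standing in for a trained named-entity model.
KNOWN_COMPANIES = {"Acme Corp": "ACME", "Globex": "GLBX"}


def tag_companies(xml_text):
    """Return the same XML with a <company> child added for each company mentioned."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first so the tags we add are not themselves scanned.
    for element in list(root.iter()):
        text = element.text or ""
        for name, code in KNOWN_COMPANIES.items():
            if name in text:
                ET.SubElement(element, "company", name=name, code=code)
    return ET.tostring(root, encoding="unicode")


doc = (
    "<story>"
    "<headline>Acme Corp beats estimates</headline>"
    "<body>Globex and Acme Corp both rose.</body>"
    "</story>"
)
enriched = tag_companies(doc)
```

The original text is left untouched; the service only adds metadata, so several specialized services could be chained over the same document.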

To summarize, we’re using a machine to render metadata about URIs for us. It’s not going to be brilliantly accurate, and the structure has to be loose and generic by definition, but we can make up for these deficiencies through machine learning, adding metadata incrementally using specialized services, and adding a social aspect to make training more efficient. As for the generic structure: use your favourite late-binding language or query language to grok/filter/sort that structure to make use of it in a reliable way.

More food for thought

We’ve barely touched the surface here – we missed code invocation (i.e. if a URI endpoint has JavaScript, what are the semantics for invoking code on that endpoint), handling forms and other “shared memory” like web mechanisms, dealing with embedded non-text content like video players, and how you would go about programmatically exposing that stuff. There’s also the question of consolidation: we already have a bunch of these microformats that are helping us expose URI metadata (RSS is one of them); should we consolidate that stuff? And if so, how would you go about mashing those formats together?

There are a slew of legal issues too: copyright, fair use, adherence to international legislation etc.

Nevertheless, cloud comprehension makes my top 8 because it’s an interesting problem that could blend a bunch of fascinating software engineering technologies: machine learning, pattern matching, social software, scale, and the language late-binding mechanisms to tie it all together. Plenty of curious meat.

And finally, a few links to chew on below. Click away to learn more.

Next in the top 8? Infrastructure Scalability. I’ll be talking about Amazon EC2, Grid Computing, Hadoop, GFS, S3 and more.
Stay tuned.

Links

Semantic Web (Google TechTalk)

Semistructured and Structured Data in the Web: Going Back and Forth


Constructing Hierarchical Information Structures of Sub-page Level HTML Documents

Extracting Structures of HTML Documents

Semantic Web Podcast

RDF

What is RDF

SPARQL

SPARQL and the Semantic Web (Podcast)

Late-binding over XML: Visual Basic 9

Volta and Dynamic Languages

Permalink | Trackback (42)
posted in General


That’s $250 USD per active user – wow. In the interest of trying to broaden my understanding on how the economics of pre-IPO privately owned - yet VC funded - valuations work, I’ll join the conversation and hopefully get schooled along the way. ;)

I’m an engineer at heart, so I think about this situation in the contrast of what’s efficient, to what’s traditionally available, and put that in the context of requirements.

For years marketers have touted the virtues of word-of-mouth advertising and the near exponential effect of Reed’s Law in action. This virtuous value is often spoken of in the context of the seller, but I believe both the seller and the consumer benefit. It’s synonymous with network effects: the value to the consumer rises as their social network grows bigger, and the value to the seller rises as the message spread potential grows larger. From the consumer point of view, tighter consumer-to-consumer feedback loops on products and services mean more efficient decision making for all members. At best, a consumer derives the most value from immediate network connections (and possibly 2nd degree connections). I personally believe these “immediate friends” subset networks are better than the “wisdom of the crowds”, because members have already made implicit assumptions about trust, quality, and expertise in their immediate social circle. At a high level, you could just say that “you are who your friends are” and move on fairly quickly.

Contrast this with antiquated methods for marketing: TV/Radio. Your prime-time 30 second advert costs anywhere from $100k skywards to more than $500k for something more like a Friends episode. Getting accurate data on advertisement absorption rates is also difficult, but there seems to be general consensus that you need multiple runs of the ad for anyone to actually remember it. That's a lot of money for 30 seconds worth of airtime, distributing a message that may not be heard, which people may or may not care about, and from which you can’t retrieve granular feedback. Lots of time and money is spent on trying to patch up these inefficiencies – even subliminal messages got a run – but the reality is, it’s typically a one way communication event into a crowded room full of people you’ve made generalizations about.

So we have a basic contrast of what’s typically believed to be good and efficient, and what’s available (sure, mass generalizations at best, so take care when you plug it in to your mental model). Okay, so what kind of requirements are sellers and consumers looking for?

Let’s start with the seller.

(Disclaimer: I’m no advertiser, so broad assumptions are made here, please step carefully). Sellers presumably start with a model: two lines that show advertising cost, and expected return. You plug the budget in, and it spits out an expected profit. You then make it more complicated in the hope that you can maximize the profit line – change your communication medium, differentiation, glitzy prize campaigns, partnering, synergy and all the rest. Your goal is twofold: get the product into the hands of the people who need it, or get people to believe that they want it. The crown jewel of a successful campaign is to find a multiplier effect on the profit line without the cost line rising. This generally means getting the network to sell to itself.

What are the requirements? You’d probably want the following:

•    The ROI math model to be as accurate as possible.
•    The model to be dynamic: e.g. factoring in to the equation things like historical information, current sales, or geographic sales etc.
•    Guidelines for enhancing multiplier effects. E.g. you need to make “connectors” believe in your product to kickstart a multiplier.
•    Two-way communication/feedback.
•    Malleable communication, changing according to the needs of any subset networks.

And the consumer?

I’ll start with the obvious: their wants and needs matched with the “best” product to fill that gap. The “best” part has a lot of non-obvious facets, some I’ll drop in here for clarity:

•    Socially acceptable: Some of the products I buy aren’t really a statement about me, but rather a statement about what my social circle will accept. Sometimes they’ll allow me to go a notch or two outside of what’s considered the social norm, but often they’ll bring me back down to earth as quickly as I shot out of it.
•    Satisfaction: Knowing I bought the “most efficient / best featured / fastest / quietest / insert whatever is contextually important to me here” product is important. The cognitive dissonance is high here: how can I possibly make a product satisfaction judgment call if all I’ve got to work with is marketing literature, a sales pitch and maybe some “star rating” off a random opinion website?
•    A belief system: loyalty is inbuilt, probably genetic. It’s amazing that we’re able to build belief systems around brands – so I need to factor that in when making a choice.
•    Certainty: I want to know that my actions are within the scope of what I believe to be safe.
•    There are more, but hopefully you get the picture.

Plugging the needs of the seller and the needs of the consumer together in a compatible way would add a huge amount of value over what we’ve traditionally seen, and this is where I think Facebook starts to earn its $250/head value.

It’s a social network – people talk about themselves, their friends, the products they use, and the stuff they care about all the time. But here we have a very different twist: it’s highly traceable, data minable, and dead-pan accurate.

Imagine a few PhD's in Machine Learning and Natural Language Processing going to town on Facebook data: you’ve got the social network connections, textual content about what users like, dislike and whatever else generated almost daily, you’ve got backend logs about user actions (after all, what you click on implicitly shows your desire), patterns and correlations between users, their friends, and what they talk about, a regular heartbeat of conversation, and even personal information about the users themselves! With a few more research years under the belts of ML and NLP, you could even imagine machines shedding light on patterns of social behavior which we never knew existed.

So, in this crazy new world, we can start to see compatibility between sellers and consumers. Facebook, with the right tools in place, could satisfy the sellers’ requirements of a ROI model that’s dynamic and accurate, hints and guidelines on what subset social networks would be good candidates for multiplier effect, and they can also get involved in the conversation – a much touted benefit of the blogging revolution. Consumers can connect and make the right decision about product purchases through a trusted word-of-mouth mechanism, while feeling more integrated with the seller, and arguably putting transparent democratic pressure on the seller by virtue of their product feedback.

So, if you’ve believed my handwavy opinions so far, you could imagine Facebook selling more than just an advertising channel – they’re selling an integrated consumer advertising channel, a two-way feedback platform, the tools to deeply understand consumers on a whole new level, a tight product feedback loop, and the algorithms to sell to that network as efficiently as possible.

Of course, it starts to fall down with the lack of hard math to back all this up (please, someone jump on the comments and help me out). But I’ll have a go at some back of the napkin figures: $10 billion over an estimated 35 million in booked profit is a 285 P/E multiple (the standard in the market is like 15-20 when interest rates are low), so, to make this look good to your average investor, you’d need to derive at least another 15x profit to hit a number that the street would consider normal. At 40 million active users, they’re hitting a per user income bracket at just under a dollar. If the user growth of Facebook went flat tomorrow, you need to get that number to more like $10-15 per user/year, in order to justify that kind of price. But, the fact is, 100,000 people are joining per day, and it doesn’t seem like it’s going to slow down anytime soon.
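For anyone who wants to poke at the napkin, here is the same arithmetic as a few lines of Python. The inputs are the estimates quoted above; the 15 and 20 target multiples are the "normal" range mentioned, and the flat user count is the stated assumption.

```python
valuation = 10_000_000_000   # $10 billion implied valuation
profit = 35_000_000          # estimated booked profit per year
active_users = 40_000_000    # active users

pe_multiple = valuation / profit            # roughly 285
value_per_user = valuation / active_users   # the $250 per head figure
profit_per_user = profit / active_users     # just under a dollar


def required_profit_per_user(target_pe):
    """Yearly profit per user needed to justify the valuation at a given P/E,
    assuming the number of users stays flat."""
    return valuation / target_pe / active_users


low = required_profit_per_user(20)   # at the generous end of a normal P/E
high = required_profit_per_user(15)  # at the conservative end
```

At a P/E of 20 that works out to $12.50 per user per year, and a little under $17 at a P/E of 15, which lands in the same ballpark as the rough $10-15 figure.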

If you’re just selling ads? A $10-15 profit per user/year is really tough work. But if you couple predicted growth rates with the fact that you’re selling a consumer network, a feedback platform and the tools to make the economics of selling to that network extremely efficient, you might be able to sell it. After all, if you own the data and the infrastructure needed to build a dynamic advertising/marketing model that everyone wants, and it makes the world more efficient in the process, you could argue that you’ve just bought yourself a piece of a new economy.

Feedback appreciated.

Permalink | Trackback (47)
posted in General


Jetlag - the tyranny of a cross-pacific flight includes a wonderful post trip hangover. They forgot to mention that on the Qantas website as a flight “feature”. I tried a stopover in Hawaii on my way to Redmond this time, to see if I could reduce the pain and break up the boredom with beaches and babes. I don’t recommend it – if anything, it seems to have made it worse.
 
But, for those penny pinchers, I found a flight arbitrage opportunity in all the new low cost carrier competition: International long haul flights from Australia to the US are traditionally overpriced – you’re looking at paying between 2k and 3k to get from Brisbane to Seattle. Domestic flights within the US are very cheap – lots of competition, and carriers battle it out for your loyalty and make it up through Visa card add-ons and frequent flier miles. I bought a JetStar flight from BNE to HON for $1200AUD return including taxes, then I purchased a Hawaiian airlines flight from HON to SEA direct for under $400 USD. Total cost: about $1650 AUD. Equivalent Qantas flight: $2400. Equivalent Hawaiian Airlines flight from Sydney: $2600. Looks like that new China airline might deliver some other “bounce from Asia” to Europe arbitrage opportunities too.

I stayed at the Outrigger in Hawaii. It’s a nice hotel, but the walk to Waikiki was a bit too long. Here's a picture of the view:



And a picture of Waikiki beach:



It was great to catch up with old friends in Redmond. Since I’ve been gone (it’s been nearly a year and a half since I left, wow) the commute from my old roommate’s place is longer, there’s a lot of new building construction, and lots of folk have moved office (some 3-4 times in a year!). My favourite coffee shop, Victors Coffee Company, was in the middle of expansion, but the coffee was still as great as ever:



All in all, it’s good to be home. I had a lot of fun back at MS, and accumulated a whole swag of new toys to play with (the pictures above were taken with my new Fujifilm FinePix f40 – brilliant little camera). It was a good time to reflect on what the plan looks like moving forward. I’m hoping to blog a little about that soon.

Permalink | Trackback (4)
posted in General


I'm in Seattle for a little while, hanging out and doing some work with the CLR team (I'll announce the project I'm working on v.soon). I dragged along my faithful single proc Thinkpad T42, which has lasted a solid two years, but lately the "a" key is all screwed up and the screen is a little dull. It was time for an upgrade: I purchased a Thinkpad T61p - 2.2GHz dual core with all the trimmings and a boatload of RAM.

Only a few hours 'till it arrives at my house:

Package Delivery Description

God bless America, Lenovo, and overnight shipping. ;)

Permalink | Trackback (2)
posted in General


Last weekend was spent prepping for TechEd Australia, where I’m delivering two talks: C# 3.0 Under the Hood (DEV320, 5pm 09/08), and .NET Programming Language Pragmatics (DEV321, 1:45pm 10/08).
 
The latter session is my favourite. My goal is to explore the colours and contrasts of .NET enabled programming languages in an accessible way – everything from static, to dynamic, to functional. It’s timely, given existing languages are undergoing a lot of change to adapt to the latest developer paradigm shifts, from programming the cloud, to creating highly concurrent desktop apps. Two things particularly excite me: showing how Visual Basic 9 is becoming a highly adaptive language, pushing the “static typing where possible, dynamic typing when necessary” mantra to the very edge, and being able to show off a sneak peek at the VBx prototypes (VB 10). If you’re a languages geek at heart, you should drop by.

Abstract: "The .NET language ecosystem is vast and wide, offering lots of choice for a willing developer. Take a tour of the ecosystem, and explore the rich feature contrasts of dynamic languages like IronPython and IronRuby, functional languages like F#, and classic statically typed languages like C# and VB.NET. After the tour, we see the best of these features in the context of Visual Basic 9, where we've added relaxed delegates, XML literals, type inference and more. We also take a peek at the future of Visual Basic (VBx) and C#."

Permalink | Trackback (3)
posted in General


If you’re a language nerd, you’ve been wiping the drool off your face since the Dynamic Language Runtime (DLR) announcement at Mix. The DLR, coupled with LINQ, offers a language playground we haven’t seen since the initial CLR betas. The CLR team has been busy, and the result is impressive.
 
The best way to deeply understand a technology is to start hacking on it. So I’ve started a Lua compiler called Nua, that leverages and targets the DLR. Lua is a powerful, flexible dynamic scripting language that’s heavily used in the gaming industry to script games. It was written by the folk out at PUC-Rio over 10 years ago. It’s a brilliant language. I love the way it pieces together. I really love metatables – language extensions on tap. Just brilliant.

Dominic and I started a Lua compiler for the .NET framework a few years ago, just as the DLR was being conceived. I’ve started with that as the base, and intend to chop off limbs and replace them with DLR goodness. I’ve thrown the original up there for archaic contrast.

I've thrown the start of Nua up on Codeplex: http://www.codeplex.com/Nua and I'll be updating the source tree as I go. Who knows if I’ll finish it (I make no promises). I have no real plan, so don't ask for one. If anything, I’m hoping it’ll be a catalyst for others to embrace the Lua language and create an interest for a “real” Lua on DLR compiler. DLR gives you Silverlight for free, and given the growing interest in online games using Silverlight, it’s a total no brainer to go there.

The journey is the reward here; it'll hopefully be a lot of fun.

Permalink | Trackback (2)
posted in General