SSCLI 2.0 Patch for VS 2010

by joelpob 27. April 2010 05:03

I’ve been hacking away on some type system stuff for fun, checking out Linear Types and Effects and how all that relates to concurrency problems of tomorrow. I find the best way to think about that concretely is to take a look at the runtime/csharp sources, and where best they might be modified to support such a system.

That means spinning up my old friend the Shared Source CLI 2.0. First step: wrangle the build system/source into Visual Studio 2010 land. Bootstrapped by Jeremy Kuhne’s VS2008 patch, I’ve pushed that patch forward to work with a Visual Studio 2010 install (requires C++, and I believe something bigger than Express).

You can download the new patch here. Install by just overlaying the patch files on your unzipped SSCLI 2.0 distribution.

After you crank up a Visual Studio 2010 Command Prompt and run “sscli20\env”, you’ll need to make sure LIBPATH and INCLUDE environment variables are set properly. Details of those steps are found in the vs2010README.txt file in the patch zip. It comes with a “Works on my machine” software certification.

And while you’re at it, you may as well download our Shared Source CLI 2.0 Internals book. Fostered by Microsoft Research, shelved by the GFC. Free to all. Enjoy.

More fun with HTML Data Extraction

by joelpob 6. April 2010 12:14

I got a lot of good feedback from my previous post on semi-automated data extraction from HTML, including a private query on how one would go about “feature” extraction (“feature”, as in a product feature like processor speed for example). It’s an incredibly hard problem to solve in an automated way, particularly if the set of data you’re classifying doesn’t have a consistent set of features (as would be the case with electronics, where you have processors, hard drives, and video cards, all requiring different features to extract [“size”, “speed” etc], and some similar [“gigabytes”]).

We hope to have something more interesting to say about this particular problem soon – in fact, we’re already off hacking away at it. But in the meantime we’ve started by looking at the easy end of the feature extraction spectrum: hand-held feature extraction. Given a scenario where a set of data has a consistent set of features, simply roll your own “feature extraction” helper algorithm (basically just an F# function which takes a DOM tree as input, and produces a list of possible features along with their respective probabilities), and plug it into the existing crawling and parsing infrastructure. Perfumes offer a simple example to stretch our legs on – they have consistent features like size/amount (50ml, 100ml, 7oz etc) and type (EDP or Eau De Parfum, EDT or Eau De Toilette etc), and offer enough variance in how they’re presented on individual consumer perfume websites to give us some extra algorithmic flavour.
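To make the shape of such a helper concrete, here is a minimal sketch of a hand-rolled perfume feature extractor. It is written in Python rather than F#, and it is not the code behind the crawler: the names `extract_features` and `TextCollector`, the regular expressions, and the use of relative match frequency as a stand-in for "probability" are all assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical patterns for the two perfume features named in the post.
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(ml|oz)\b", re.IGNORECASE)
TYPE_PATTERNS = {
    "EDP": re.compile(r"\b(?:EDP|eau\s+de\s+parfum)\b", re.IGNORECASE),
    "EDT": re.compile(r"\b(?:EDT|eau\s+de\s+toilette)\b", re.IGNORECASE),
}


class TextCollector(HTMLParser):
    """Walks the HTML and keeps every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def _score(counter):
    """Turn raw match counts into (value, share-of-matches) pairs, best first."""
    total = sum(counter.values())
    pairs = [(value, count / total) for value, count in counter.items()]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


def extract_features(html):
    """Return candidate features for one product page.

    The "probability" is simply how often a candidate appears relative to
    all candidates of the same kind (an assumption, not a trained model).
    """
    collector = TextCollector()
    collector.feed(html)
    collector.close()

    sizes, types = Counter(), Counter()
    for chunk in collector.chunks:
        for amount, unit in SIZE_RE.findall(chunk):
            sizes[f"{amount}{unit.lower()}"] += 1
        for label, pattern in TYPE_PATTERNS.items():
            types[label] += len(pattern.findall(chunk))

    types = +types  # unary plus drops labels that never matched
    return {"size": _score(sizes), "type": _score(types)}


page = "<div><h1>No 5 EDP 100ml</h1><p>Eau de Parfum, 100 ml. Also in 50ml.</p></div>"
features = extract_features(page)
```

Each retailer presents these details a little differently, so in practice a helper like this would probably need extra patterns per site and a better scoring rule than raw frequency.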

While we hack on all this classification goodness, we figured we’d drop a demo to show where we’re at, and what you can do with the “output” of all this crawling/data extraction/feature extraction stuff. Our demo plays in the price comparison space – price comparison of Perfumes! We’ve crawled, parsed and data extracted 25-odd Australian perfume retailers online using the generic techniques mentioned previously. If you head on over to http://www.baseprice.com.au/perfume and click on some of the examples, you can start to get a sense of where our heads are at with this: deep vertical search. By no means are we user experience experts (in fact the site looks pretty bad ;)), but it hopefully demonstrates the idea. You might also notice some other cute little search experience features we’re playing with like automated tag generation, clustering, and filtering. Beware, bugs abound I’m sure.

Perfumes aren’t the only place you could go with this (they’re an easy example as they have a set of easily detectable features to extract). In fact, if you think about electronics price comparison, it’s clearly a nice place to push and flex those algorithmic muscles: go from hand-held feature extraction to fully automated feature extraction.

Maybe we’ll get there. More technical content soon. Keep you posted.

Tags:

dotnet


About the author

.NET nerd and F# evangelical. Random updates at Twitter @joelpob, but real content goes here.
