An Easier Way To Get Started With Scalding

In the previous post, I walked through how I first started using Scalding and some of the issues I had when trying to run it on a cluster in HDFS mode. After my initial work with Scalding, I wanted to learn more, so I purchased the Kindle version of 'Programming MapReduce with Scalding' by Antonios Chalkiopoulos.

It was $10 well spent. Why? First, the book steps through lots of examples of using Scalding to process data pipelines. It also dives into testing Scalding code, connecting Scalding to HBase, and connecting Scalding to JDBC databases.

Even more important was the environment setup used in the book. Had I started with this book for my initial Scalding work, I would have saved myself hours of frustration. Sometimes it's good to push through and figure out a problem. Other times (and this is one of those other times), it's nice to have a book or reference material to walk you through the steps to get started so you can work on more important problems.

The book uses the Kiji Bento Box project to set up a local cluster in just a few minutes. It runs a cluster on your local machine and uses the local disk for the HDFS environment. While this isn't a full-fledged cluster, it's perfect for getting started with Scalding. Obviously, a real cluster will be required for complete testing.

Below is an example of running a WordCount job on Kiji. Other than the "bento start" and "bento stop" commands, the steps are identical to running a job locally.

$ bento start
$ mvn clean package
$ hadoop fs -mkdir hdfs:///data/input 
$ hadoop fs -mkdir hdfs:///data/output 
$ hadoop fs -put input.txt hdfs:///data/input/
$ hadoop jar target/chapter2-0-jar-with-dependencies.jar \
    com.twitter.scalding.Tool WordCountJob --hdfs \
    --input hdfs:///data/input --output hdfs:///data/output
$ hadoop fs -ls /data/output
$ bento stop
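
For context, here is roughly what a job like WordCountJob can look like. This is a minimal sketch modeled on the standard Scalding word count example, not the book's actual source. It assumes the fields-based API, with the --input and --output arguments from the command above read through args. The tokenize helper is my own illustration.

```scala
import com.twitter.scalding._

// Sketch of a word count job; the class name matches the one passed to
// com.twitter.scalding.Tool in the hadoop jar command above.
class WordCountJob(args: Args) extends Job(args) {

  // TextLine reads each line of the input into the 'line field.
  TextLine(args("input"))
    // Split every line into words, one output tuple per word.
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    // Group by word and count the occurrences (adds a 'size field).
    .groupBy('word) { _.size }
    // Write (word, count) pairs as tab-separated values.
    .write(Tsv(args("output")))

  // Hypothetical helper: lowercase, strip punctuation, split on whitespace,
  // and drop any empty tokens.
  def tokenize(text: String): Array[String] =
    text.toLowerCase
      .replaceAll("[^a-z0-9\\s]", "")
      .split("\\s+")
      .filter(_.nonEmpty)
}
```

With a job like this packaged into the jar, the --hdfs flag tells Scalding to run it as a Hadoop job against the paths given in --input and --output.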

This is definitely easier than my first attempts at running Scalding on a cluster. Using Kiji and the steps from the book allows you to focus on learning and writing Scalding code rather than spending hours on a frustrating environment setup.

Plus, there are still 7 more chapters of Scalding details to walk through after running the first example. Definitely a good $10 investment for learning.