We can easily run a Cascalog Hadoop job using the Amazon Elastic MapReduce (EMR) service. All we need to do is build a jar file and upload the jar and our data to S3. That is,

  • Build a hello world app
  • Package up the code in a JAR file
  • Upload the code (and data) to S3
  • Submit the job using the Amazon UI

We’re assuming that you have lein and Java installed, and that you’re familiar with the AWS stack in general.

Cascalog Hello World

The canonical example in the MapReduce world is a word count app: a job that counts the number of times each word shows up in the input. Our example code is at https://github.com/ctdean/cascalog-hello

Sam Ritchie has built a nice word count app at https://github.com/sritchie/cascalog-class and we’re inspired by that example.
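
The code below lives in the ctdean.cascalog.hello namespace (the same namespace we AOT-compile later). A minimal ns declaration along these lines should work; the exact requires are our guess based on the aliases used in the snippets:

(ns ctdean.cascalog.hello
  (:use cascalog.api)            ; defmapcatop, <-, ?-, defmain, hfs-textline
  (:require [clojure.string :as str]
            [cascalog.ops :as ops]))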

First, we need to split the text into words. We’ll use a simple definition of a word and the Cascalog defmapcatop macro to take an input record and return many output word records:

(defmapcatop split-words
  "Split a string into a seq of words.  A word is defined as a
   letter followed by one or more letters or digits."
  [line]
  (map str/lower-case
       (re-seq #"[a-zA-Z]\w+" line)))

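As a quick sanity check of the splitting logic itself (outside any Cascalog query), the re-seq plus lower-case combination behaves like this at the REPL (the sample sentence is made up):

(map str/lower-case
     (re-seq #"[a-zA-Z]\w+" "The Quick brown fox"))
;; => ("the" "quick" "brown" "fox")
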
Then we’ll just create the simple query:

(defn freq-count-query
  "A query that returns the word frequency count"
  [tap]
  (<- [?word ?count]
      (tap ?line)
      (split-words ?line :> ?word)
      (ops/count ?count)))

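Because Cascalog accepts plain Clojure sequences of tuples as generators, we can smoke test the query in memory with the ??- executor before going anywhere near Hadoop. A rough REPL session (the sample lines are made up):

(??- (freq-count-query [["the quick brown fox"]
                        ["the quick red fox"]]))
;; => one seq of [?word ?count] tuples: the, quick, and fox counted
;;    twice, brown and red once (tuple order is not guaranteed)
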
To execute the query we’ll read the input by lines and send the output to text files. Hadoop will also need a Java main class to run, so we’ll use Cascalog’s defmain macro.

(defmain WritePoemFreq [in-path out-path]
  (let [in (hfs-textline in-path)
        out (hfs-textline out-path)]
    (?- out (freq-count-query in))))

Build the Hadoop jar file

The easiest way to package up this code is to build one large jar file with all the dependencies inside the jar.

Create the simple project.clj file:

(defproject cascalog-hello "0.1.0"
  :dependencies [[cascalog/cascalog "1.9.0"]
                 [org.apache.hadoop/hadoop-core "0.20.205.0"]]
  :aot [ctdean.cascalog.hello])

making sure that we compile our main class ahead of time. Then we just build the uberjar:

lein uberjar

This will produce target/cascalog-hello-0.1.0-standalone.jar. If you have a local Hadoop cluster, you can run this directly using the hadoop command line tool

hadoop jar                                     \
    target/cascalog-hello-0.1.0-standalone.jar \
    ctdean.cascalog.hello.WritePoemFreq        \
    <INPUT> <OUTPUT>

but we’re going to skip that step and just run on EMR.

Prepare the inputs

The way EMR works is that it reads the jar and input data from S3 and writes the final output to S3. The intermediate work is done on a Hadoop cluster that is created (and later killed) for this one job.

The jar file is the one we built above; we’ll put it at s3://files.ctdean.com/cascalog-hello-0.1.0-standalone.jar (but you should place your jar in your own S3 bucket). For the input we’ll use a Shel Silverstein poem, which we’ll place at s3://www.ctdean.com/etc/silverstein-boa.txt

There are many ways to put files on S3 (you can always use the web UI at https://console.aws.amazon.com/s3/home), but for this example we’ll just assume that you have some way of copying the files to S3.
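
For example, with the s3cmd tool (one option among many, and assuming the poem is saved locally as silverstein-boa.txt and that the buckets already exist), the copies would look something like this:

s3cmd put target/cascalog-hello-0.1.0-standalone.jar \
    s3://files.ctdean.com/cascalog-hello-0.1.0-standalone.jar
s3cmd put silverstein-boa.txt \
    s3://www.ctdean.com/etc/silverstein-boa.txt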

Running on Amazon EMR

Make sure that your AWS account exists and you can access the AWS console at https://console.aws.amazon.com/elasticmapreduce/

Go to the EMR console and create a New Job Flow:

  • Choose Custom Jar
  • On the next page enter the jar information (use your own path names here)
    • Our jar location is files.ctdean.com/cascalog-hello-0.1.0-standalone.jar
    • And the jar arguments are ctdean.cascalog.hello.WritePoemFreq s3n://www.ctdean.com/etc/silverstein-boa.txt s3n://files.ctdean.com/results
  • Accept the defaults on the rest of the pages. There is a field to set the S3 Log Path if you’d like to look at the Hadoop log files when you’re done.
  • Submit the job.

We’re running on our own S3 buckets, so our S3 access will just work.

Several minutes later we have our frequency count in s3://files.ctdean.com/results/part-*.
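
To look at the output, pull the part files back down with whatever S3 client you prefer; with s3cmd that would be roughly:

s3cmd get --recursive s3://files.ctdean.com/results/ results/
cat results/part-*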

If you use EMR a lot, the command line tools from Amazon or the Lemur tool are much more convenient.
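
For instance, with Amazon’s elastic-mapreduce Ruby client, submitting the same job looks roughly like this (flag names may vary with the client version):

elastic-mapreduce --create --name "cascalog-hello"                   \
    --jar s3n://files.ctdean.com/cascalog-hello-0.1.0-standalone.jar \
    --arg ctdean.cascalog.hello.WritePoemFreq                        \
    --arg s3n://www.ctdean.com/etc/silverstein-boa.txt               \
    --arg s3n://files.ctdean.com/results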

Conclusion

Running on EMR can be simple and straightforward. This example isn’t very involved, but the same approach is capable of processing vast amounts of data.

That said, running on EMR can also be very frustrating: errors can be difficult to debug, and seemingly small changes can break the job.

Thanks to everyone on the Cascalog mailing list for all their help.

Published

06 Jul 2012

