We’re assuming that you have lein and Java installed. We’ll also assume that you are familiar with the Amazon AWS stack in general.
The canonical example in the MapReduce world is a word count app: a job that counts the number of times each word shows up in the input. Our example code is at https://github.com/ctdean/cascalog-hello
Sam Ritchie has built a nice word count app at https://github.com/sritchie/cascalog-class and we’re inspired by that example.
First, we need to split the text into words. We’ll use a simple definition of a word and the Cascalog defmapcatop macro to take an input record and return many output word records:
```clojure
(defmapcatop split-words
  "Split a string into a seq of words. A word is defined as a
  letter followed by one or more letters or digits."
  [line]
  (map str/lower-case (re-seq #"[a-zA-Z]\w+" line)))
```
Then we’ll just create the simple query:
```clojure
(defn freq-count-query
  "A query that returns the word frequency count"
  [tap]
  (<- [?word ?count]
      (tap ?line)
      (split-words ?line :> ?word)
      (ops/count ?count)))
```
To execute the query we’ll read the input by lines and send the output
to text files. Hadoop will also need a Java
main class to run so
we’ll use the
defmain macro in Cascalog.
```clojure
(defmain WritePoemFreq [in-path out-path]
  (let [in  (hfs-textline in-path)
        out (hfs-textline out-path)]
    (?- out (freq-count-query in))))
```
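For reference, these snippets assume a namespace declaration along these lines; this is a sketch, and the exact forms may differ in your project:

```clojure
;; Assumed namespace declaration for the snippets in this post.
(ns ctdean.cascalog.hello
  (:use [cascalog.api])          ; defmapcatop, defmain, <-, ?-, hfs-textline
  (:require [clojure.string :as str]
            [cascalog.ops :as ops]))  ; ops/count
```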
The easiest way to package up this code is to build one large jar file with all the dependencies inside the jar.
Create the simple project.clj:

```clojure
(defproject cascalog-hello "0.1.0"
  :dependencies [[cascalog/cascalog "1.9.0"]
                 [org.apache.hadoop/hadoop-core "0.20.205.0"]]
  :aot [ctdean.cascalog.hello])
```
making sure that we compile our main class ahead of time. Then we just build the uberjar with lein uberjar, which makes target/cascalog-hello-0.1.0-standalone.jar.

If you have a local Hadoop cluster, you can run this jar directly using the hadoop command line tool:

```
hadoop jar \
    target/cascalog-hello-0.1.0-standalone.jar \
    ctdean.cascalog.hello.WritePoemFreq \
    <INPUT> <OUTPUT>
```
but we’re going to skip that step and just run on EMR.
The way EMR works is that it reads the jar and input data from S3 and writes the final output to S3. The intermediate work is done on a Hadoop cluster that is created (and later killed) for this one job.
The jar file is the one we built above and we’ll put that at s3://files.ctdean.com/cascalog-hello-0.1.0-standalone.jar (but you should place your jar file in your own S3 bucket). And we’ll use a Shel Silverstein poem and place that at s3://www.ctdean.com/etc/silverstein-boa.txt.
There are many ways to put files on S3 (you can always use the web UI at https://console.aws.amazon.com/s3/home), but for this example we’ll just assume that you have some way of copying the files to S3.
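For example, if you happen to have the s3cmd tool installed and configured (an assumption; any S3 client works just as well), the copies might look like this:

```shell
# Copy the uberjar and the input poem to S3.
# The bucket names are the ones used in this post; substitute your own.
# The local poem filename is an assumption for illustration.
s3cmd put target/cascalog-hello-0.1.0-standalone.jar \
    s3://files.ctdean.com/cascalog-hello-0.1.0-standalone.jar
s3cmd put silverstein-boa.txt \
    s3://www.ctdean.com/etc/silverstein-boa.txt
```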
Make sure that your AWS account exists and you can access the AWS console at https://console.aws.amazon.com/elasticmapreduce/
Go to the EMR console and create a New Job Flow that runs a Custom JAR. Point it at the jar we uploaded and give it these arguments:

```
ctdean.cascalog.hello.WritePoemFreq
s3n://www.ctdean.com/etc/silverstein-boa.txt
s3n://files.ctdean.com/results
```
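If you prefer to avoid the web UI, Amazon also ships a Ruby command line client, elastic-mapreduce, that can create the same job flow. Roughly, and hedged against that client’s own documentation (the exact flags are an assumption here):

```shell
# Sketch of creating the job flow with Amazon's elastic-mapreduce
# ruby client; check the client's docs before relying on these flags.
elastic-mapreduce --create --name "cascalog-hello" \
    --jar s3n://files.ctdean.com/cascalog-hello-0.1.0-standalone.jar \
    --arg ctdean.cascalog.hello.WritePoemFreq \
    --arg s3n://www.ctdean.com/etc/silverstein-boa.txt \
    --arg s3n://files.ctdean.com/results
```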
We’re running on our own S3 buckets, so our S3 access will just work.
Several minutes later we have our frequency count in s3://files.ctdean.com/results.
Running on EMR can be simple and straightforward. This example isn’t very involved, but the same approach is capable of processing vast amounts of data.

That said, running on EMR can also be very frustrating: errors can be difficult to debug, and seemingly small changes can break the job.
Thanks to everyone on the Cascalog mailing list for all their help.
06 Jul 2012