Posted by Steve Karam
on Oct 24, 2013 in Big Data
This entry is part 4 of 4 in the series Hadoop Streaming.
MapReduce with Hadoop Streaming in bash – Bonus!
To conclude my three-part series on writing MapReduce jobs with shell script for use with Hadoop Streaming, I’ve decided to throw together a video tutorial on running the jobs we’ve created in Oozie, a workflow editor for Hadoop that allows jobs to be chained, forked, scheduled, etc. To do this I explain more about the MapReduce jobs and their setup, then use Hue (a GUI for Hadoop management) to create an Oozie workflow from the three tasks and run them. After that, I use Hive to create a table on top of the HDFS-stored data and run a SQL query against it.
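Because Hadoop Streaming simply pipes each task’s input through your script on stdin and collects its stdout, you can sanity-check a mapper/reducer pair entirely on your local machine before wiring it into an Oozie workflow. Here’s a minimal sketch of that plumbing using a throwaway word-count pipeline (an illustration of the pattern, not the series’ actual maptf.sh/redtf.sh), with `sort` standing in for Hadoop’s shuffle phase:

```shell
# Local stand-in for a streaming job: map, shuffle (sort), reduce.
# Mapper: emit "word<TAB>1" for every word on stdin.
map() { tr ' ' '\n' | awk '{print $0 "\t1"}'; }
# Reducer: sum the counts for each word.
reduce() { awk -F'\t' '{c[$1] += $2} END {for (w in c) print w "\t" c[w]}'; }

printf 'the quick fox\nthe lazy dog\n' | map | sort | reduce | sort
```

The same scripts run unchanged under `hadoop jar hadoop-streaming.jar -mapper ... -reducer ...`, which is what makes this local round trip a useful smoke test.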
- As I mention in the video, in order to use Hadoop Streaming with Oozie you should configure the ShareLib. You can find details on the reasoning here and the process for setting this up in the CDH4 Install Guide under “Installing the Oozie ShareLib in Hadoop HDFS”.
- If you want to follow along, you’ll need the Cloudera QuickStart VM and my source data and scripts.
- In the section on Hive I erroneously say that the file on HDFS is moved into the Hive Metastore on table creation. The Metastore holds only table metadata; for a managed table the data is actually moved into Hive’s warehouse directory in HDFS. You can find more on Hive table types here.
- The workflow in Oozie worked; however, it would have been much better to add my shell scripts (maptf.sh, redtf.sh, mapdf.sh, reddf.sh, maptfidf.sh, and IdentityReducer.sh) to the job and run them from HDFS instead of the local Linux filesystem.
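Staging the scripts in HDFS, as suggested in the last note above, would look roughly like this (the `wordcount/scripts` path is an assumption for illustration, not a path from the video):

```shell
# Copy the mapper/reducer scripts from the local filesystem into HDFS
# so Oozie can ship them to whichever node runs each streaming task.
hadoop fs -mkdir -p wordcount/scripts
hadoop fs -put maptf.sh redtf.sh mapdf.sh reddf.sh \
    maptfidf.sh IdentityReducer.sh wordcount/scripts/
```

Each streaming action in the workflow can then declare its script with a `<file>` element (e.g. `<file>wordcount/scripts/maptf.sh#maptf.sh</file>`), which tells Oozie to place a copy in the task’s working directory at run time.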
Featured image for this article by heatheronhertravels.com.