Hadoop Streaming, Hue, Oozie Workflows, and Hive


MapReduce with Hadoop Streaming in bash – Bonus!

To conclude my three-part series on writing MapReduce jobs in shell script for Hadoop Streaming, I’ve put together a video tutorial on running the jobs we’ve created in Oozie, a workflow scheduler for Hadoop that lets jobs be chained, forked, scheduled, and more. In the video I explain more about the MapReduce jobs and their setup, then use Hue (a GUI for managing Hadoop) to build an Oozie workflow from the three tasks and run it. After that, I use Hive to create a table on top of the data stored in HDFS and run a SQL query against it.
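
For anyone who hasn’t followed the earlier posts, here is a rough sketch of how one of these jobs runs as a plain Hadoop Streaming job from the command line before it goes into an Oozie workflow. The streaming jar location and the HDFS paths are assumptions for a CDH4-style install; only the script names (maptf.sh, redtf.sh) come from the series.

```bash
#!/usr/bin/env bash
# Sketch only: run the term-frequency step as a standalone Hadoop Streaming job.
# The streaming jar path and HDFS directories are assumptions; adjust for your cluster.

STREAMING_JAR=/usr/lib/hadoop-mapreduce/hadoop-streaming.jar

hadoop jar "$STREAMING_JAR" \
    -input /user/cloudera/input \
    -output /user/cloudera/tf-output \
    -mapper maptf.sh \
    -reducer redtf.sh \
    -file maptf.sh \
    -file redtf.sh
```

The streaming action in Oozie wraps essentially the same mapper/reducer configuration; the workflow built through Hue in the video is just this, chained across the three steps.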

Enjoy!

[youtube=http://www.youtube.com/watch?v=qlMATo095_s&w=583]

Special Notes

  • As I mention in the video, in order to use Hadoop Streaming with Oozie you need to configure the Oozie ShareLib. You can find details on the reasoning here, and the setup process in the CDH4 Install Guide under “Installing the Oozie ShareLib in Hadoop HDFS” (see the sketch after this list).
  • If you want to follow along, you’ll need the Cloudera QuickStart VM and my source data and scripts.
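
For reference, on a package-based CDH4 install the ShareLib setup boils down to something like the following. Treat it as a sketch: the tarball name and location vary by version, so follow the install guide for the authoritative steps.

```bash
#!/usr/bin/env bash
# Sketch only - the tarball name varies by CDH4 version (e.g. oozie-sharelib.tar.gz
# vs. oozie-sharelib-yarn.tar.gz); check the CDH4 Install Guide for exact steps.

mkdir /tmp/ooziesharelib
cd /tmp/ooziesharelib
tar xzf /usr/lib/oozie/oozie-sharelib.tar.gz

# Copy the extracted share/ directory into the oozie user's home in HDFS,
# which is where Oozie looks for the streaming (and other) action libraries.
sudo -u oozie hadoop fs -put share /user/oozie/share
```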

Errata

  • In the section on Hive I erroneously say that the file on HDFS is moved into the Hive Metastore on table creation. It’s not the Metastore, but Hive’s warehouse directory in HDFS. You can find more on Hive table types here, and a short illustration in the first sketch after this list.
  • The workflow in Oozie worked; however, it would have been much better for me to add my shell scripts (maptf.sh, redtf.sh, mapdf.sh, reddf.sh, maptfidf.sh, and IdentityReducer.sh) to the job and run them from HDFS instead of the local Linux filesystem (see the second sketch after this list).
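
To make the managed-versus-external distinction from the first erratum concrete, here is an illustrative pair of Hive statements run through the hive CLI from bash. The table names, columns, and paths are made up for the example and are not the ones from the video.

```bash
#!/usr/bin/env bash
# Illustrative only: managed vs. external Hive tables. Names and paths are assumptions.

# Managed table: LOAD DATA INPATH moves the HDFS file into the table's directory
# under Hive's warehouse (typically /user/hive/warehouse/<table>), not into the Metastore.
hive -e "
  CREATE TABLE tfidf_managed (term STRING, doc STRING, score DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  LOAD DATA INPATH '/user/cloudera/tfidf-output/part-00000' INTO TABLE tfidf_managed;
"

# External table: the data stays where it already is on HDFS; only the schema is registered.
hive -e "
  CREATE EXTERNAL TABLE tfidf_external (term STRING, doc STRING, score DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/user/cloudera/tfidf-output';
"
```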
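
And for the second erratum, the fix amounts to uploading the scripts to HDFS and pointing the workflow’s file entries at those paths. The HDFS directory below is just an assumed location.

```bash
#!/usr/bin/env bash
# Sketch only: keep the mapper/reducer scripts on HDFS so Oozie can ship them to the
# task nodes, instead of relying on paths on the local Linux filesystem.

hadoop fs -mkdir /user/cloudera/scripts
hadoop fs -put maptf.sh redtf.sh mapdf.sh reddf.sh maptfidf.sh IdentityReducer.sh \
    /user/cloudera/scripts/

# In Hue, each streaming action would then add these HDFS paths as "Files"
# (e.g. /user/cloudera/scripts/maptf.sh) so Oozie distributes them with the job.
```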

Featured image for this article by heatheronhertravels.com.


10 comments

  1. Great tutorial with very good attention to the little details! Good news too: the video uses Hue 2.1, and nowadays Oozie, Beeswax, etc. are totally revamped in Hue 3.0 🙂

  2. Hi Steve,
    You mention an erratum:
    “The workflow in Oozie worked; however, it would have been much better for me to add my shell scripts (maptf.sh, redtf.sh, mapdf.sh, reddf.sh, maptfidf.sh, and IdentityReducer.sh) to the job and run them from HDFS instead of the local Linux filesystem.”

    Does this mean that if we add the files in the “Add File” section (-file in the hadoop command) for maptf.sh, redtf.sh, etc., it won’t work? And is that an Oozie limitation?
    (I have a serious problem submitting a job via “Add File” to HDFS and running it. Let me know if you need the logs by email.)

  3. Hi Steve,
    I am new to the Cloudera environment and am using the Cloudera virtual machine (CDH 4.4.0).
    When I save a workflow I’ve created, it gives the message “could not save workflow”.
    Do you know anything about this problem? Can you help me solve it?

  4. Very good work and nicely explained, much appreciated 🙂 Keep the new stuff coming!

  5. Hi,
    Thank you for your great tutorial.
    I’m using the Cloudera training virtual machine. When I tried to apply the settings mentioned for streaming in HDFS, it asked me for a sudo password, but I don’t have the sudo password for the Cloudera training VM!
    Do you know any solution for that?
    Thanks.
