Hadoop Streaming, Hue, Oozie Workflows, and Hive


MapReduce with Hadoop Streaming in bash – Bonus!

To conclude my three-part series on writing MapReduce jobs in shell script for use with Hadoop Streaming, I’ve decided to throw together a video tutorial on running the jobs we’ve created in Oozie, a workflow scheduler for Hadoop that allows jobs to be chained, forked, scheduled, and more. In it I explain more about the MapReduce jobs and setup, then use Hue (a GUI for Hadoop management) to create an Oozie workflow from the three tasks and run it. After that, I use Hive to create a table on top of the data stored in HDFS and run a SQL query against it.
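For reference, each step in the series runs as a Hadoop Streaming job along these lines. This is a minimal sketch: the streaming jar path is where CDH4 installs it on the QuickStart VM, and the input/output paths are placeholders, so adjust both for your setup.

    # Term-frequency step: maptf.sh as mapper, redtf.sh as reducer.
    # The -file flags ship the local scripts to the cluster with the job.
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
      -input /user/cloudera/input \
      -output /user/cloudera/tf-out \
      -mapper maptf.sh \
      -reducer redtf.sh \
      -file maptf.sh \
      -file redtf.sh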

Enjoy!

Special Notes

  • As I mention in the video, in order to use Hadoop Streaming with Oozie you need to configure the ShareLib; a sketch of the setup follows this list. You can find details on the reasoning here and the process in the CDH4 Install Guide under “Installing the Oozie ShareLib in Hadoop HDFS”.
  • If you want to follow along, you’ll need the Cloudera QuickStart VM and my source data and scripts.
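On the ShareLib note above: the CDH4 guide’s process boils down to unpacking the tarball bundled with the Oozie package and uploading it to the oozie user’s home on HDFS. A rough sketch, assuming the standard CDH4 package paths (verify them against the guide):

    # Unpack the ShareLib tarball that ships with the Oozie package.
    mkdir /tmp/ooziesharelib
    cd /tmp/ooziesharelib
    tar xzf /usr/lib/oozie/oozie-sharelib.tar.gz

    # Upload it to the oozie user's home on HDFS, where Oozie looks for it.
    sudo -u oozie hadoop fs -put share /user/oozie/share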

Errata

  • In the section on Hive I erroneously say that the file on HDFS is moved into the Hive Metastore on table creation. It’s not the Metastore, but the Hive warehouse directory in HDFS. You can find more on Hive table types here; the sketch after this list illustrates the difference.
  • The workflow in Oozie worked; however, it would have been much better for me to add my shell scripts (maptf.sh, redtf.sh, mapdf.sh, reddf.sh, maptfidf.sh, and IdentityReducer.sh) to the job and run them from HDFS instead of the local Linux filesystem.
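To make the Hive erratum above concrete: a managed table moves loaded data under Hive’s warehouse directory in HDFS, while an external table leaves the data where it is. A minimal sketch, with made-up table names, columns, and paths:

    # Managed table: LOAD DATA INPATH *moves* the HDFS file under the
    # Hive warehouse directory (e.g. /user/hive/warehouse/tfidf/).
    hive -e "CREATE TABLE tfidf (term STRING, score DOUBLE)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
             LOAD DATA INPATH '/user/cloudera/tfidf-out/part-00000' INTO TABLE tfidf;"

    # External table: the data stays where it is on HDFS; dropping the
    # table removes only the metadata.
    hive -e "CREATE EXTERNAL TABLE tfidf_ext (term STRING, score DOUBLE)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
             LOCATION '/user/cloudera/tfidf-out';"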

Featured image for this article by heatheronhertravels.com.


8 Responses to “Hadoop Streaming, Hue, Oozie Workflows, and Hive”

  1. Great tutorial with very good attention to the little details! Good news too: the video uses Hue 2.1, and nowadays Oozie, Beeswax, etc. are totally revamped in Hue 3.0 :)

  2. Oh man I never knew! That might be a good topic for the next video. ;)

  3. Hi Steve, nice video; we are expecting more and more. For example, MongoDB-to-Hadoop migration and HBase integrations.

  4. Sandeep Kumar says:

    Hi Steve,
    You mention an erratum:
    “The workflow in Oozie worked; however, it would have been much better for me to add my shell scripts (maptf.sh, redtf.sh, mapdf.sh, reddf.sh, maptfidf.sh, and IdentityReducer.sh) to the job and run them from HDFS instead of the local Linux filesystem.”

    Does this mean that if we add the files via “Add File” (-file in the hadoop command) for maptf.sh, redtf.sh, etc., it won’t work? And is that an Oozie limitation?
    (I have a serious problem submitting a job via “Add File” to HDFS and running it. Let me know if you need logs by email.)

  5. vyomesh says:

    Hi Steve
    I am new to the Cloudera environment and am using the Cloudera virtual machine (CDH 4.4.0).
    When I save a created workflow, it gives the message “could not save workflow”.
    Do you have any idea about this problem? Can you help me solve it?

  6. Romain says:

    The real cause of the problem should be displayed in the Chrome inspector (right-click, Inspect, then try to save again) or on the /logs page. This is a known bug that will be fixed in the next release: https://issues.cloudera.org/browse/HUE-1858

    Feel free to follow up on https://groups.google.com/a/cloudera.org/forum/#!forum/hue-user if it feels too spammy here!

  7. vyomesh says:

    The logs give me this in the console:
    Failed to load resource: the server responded with a status of 400 (BAD REQUEST) http://localhost.localdomain:8888/oozie/workflows/46/save
    POST http://localhost.localdomain:8888/oozie/workflows/46/save 400 (BAD REQUEST) jquery-2.0.2.min.js:6

    I don’t know how to solve this problem.
    Please, can anyone help me solve it?

  8. suman kumar says:

    Very good work and nicely explained, much appreciated :) Keep the new stuff coming!
