Hadoop Streaming, Hue, Oozie Workflows, and Hive


MapReduce with Hadoop Streaming in bash – Bonus!

To conclude my three-part series on writing MapReduce jobs in shell script for Hadoop Streaming, I’ve put together a video tutorial on running the jobs we’ve created in Oozie, a workflow scheduler for Hadoop that lets jobs be chained, forked, scheduled, and more. In the video I explain more about the MapReduce jobs and their setup, then use Hue (a GUI for managing Hadoop) to build an Oozie workflow from the three tasks and run it. After that, I use Hive to create a table on top of the data stored in HDFS and run a SQL query against it.
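
For anyone who hasn’t followed the earlier posts, here is a rough sketch of how one of these jobs runs as a plain Hadoop Streaming job from the command line before it goes into an Oozie workflow. The streaming jar location and the HDFS paths are assumptions for a CDH4-style install; only the script names (maptf.sh, redtf.sh) come from the series.

```bash
#!/usr/bin/env bash
# Sketch only: run the term-frequency step as a standalone Hadoop Streaming job.
# The streaming jar path and HDFS directories are assumptions; adjust for your cluster.

STREAMING_JAR=/usr/lib/hadoop-mapreduce/hadoop-streaming.jar

hadoop jar "$STREAMING_JAR" \
    -input /user/cloudera/input \
    -output /user/cloudera/tf-output \
    -mapper maptf.sh \
    -reducer redtf.sh \
    -file maptf.sh \
    -file redtf.sh
```

The streaming action in Oozie wraps essentially the same mapper/reducer configuration; the workflow built through Hue in the video is just this, chained across the three steps.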

Enjoy!

[youtube=http://www.youtube.com/watch?v=qlMATo095_s&w=583]

Special Notes

  • As I mention in the video, in order to use Hadoop Streaming with Oozie you need to configure the Oozie ShareLib. You can find details on the reasoning here, and the setup process in the CDH4 Install Guide under “Installing the Oozie ShareLib in Hadoop HDFS” (see the sketch after this list).
  • If you want to follow along, you’ll need the Cloudera QuickStart VM and my source data and scripts.
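
For reference, on a package-based CDH4 install the ShareLib setup boils down to something like the following. Treat it as a sketch: the tarball name and location vary by version, so follow the install guide for the authoritative steps.

```bash
#!/usr/bin/env bash
# Sketch only - the tarball name varies by CDH4 version (e.g. oozie-sharelib.tar.gz
# vs. oozie-sharelib-yarn.tar.gz); check the CDH4 Install Guide for exact steps.

mkdir /tmp/ooziesharelib
cd /tmp/ooziesharelib
tar xzf /usr/lib/oozie/oozie-sharelib.tar.gz

# Copy the extracted share/ directory into the oozie user's home in HDFS,
# which is where Oozie looks for the streaming (and other) action libraries.
sudo -u oozie hadoop fs -put share /user/oozie/share
```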

Errata

  • In the section on Hive I erroneously say that the file on HDFS is moved into the Hive Metastore on table creation. It’s not the Metastore, but Hive’s warehouse directory in HDFS. You can find more on Hive table types here, and a short illustration in the first sketch after this list.
  • The workflow in Oozie worked; however, it would have been much better for me to add my shell scripts (maptf.sh, redtf.sh, mapdf.sh, reddf.sh, maptfidf.sh, and IdentityReducer.sh) to the job and run them from HDFS instead of the local Linux filesystem (see the second sketch after this list).
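
To make the managed-versus-external distinction from the first erratum concrete, here is an illustrative pair of Hive statements run through the hive CLI from bash. The table names, columns, and paths are made up for the example and are not the ones from the video.

```bash
#!/usr/bin/env bash
# Illustrative only: managed vs. external Hive tables. Names and paths are assumptions.

# Managed table: LOAD DATA INPATH moves the HDFS file into the table's directory
# under Hive's warehouse (typically /user/hive/warehouse/<table>), not into the Metastore.
hive -e "
  CREATE TABLE tfidf_managed (term STRING, doc STRING, score DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  LOAD DATA INPATH '/user/cloudera/tfidf-output/part-00000' INTO TABLE tfidf_managed;
"

# External table: the data stays where it already is on HDFS; only the schema is registered.
hive -e "
  CREATE EXTERNAL TABLE tfidf_external (term STRING, doc STRING, score DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/user/cloudera/tfidf-output';
"
```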
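
And for the second erratum, the fix amounts to uploading the scripts to HDFS and pointing the workflow’s file entries at those paths. The HDFS directory below is just an assumed location.

```bash
#!/usr/bin/env bash
# Sketch only: keep the mapper/reducer scripts on HDFS so Oozie can ship them to the
# task nodes, instead of relying on paths on the local Linux filesystem.

hadoop fs -mkdir /user/cloudera/scripts
hadoop fs -put maptf.sh redtf.sh mapdf.sh reddf.sh maptfidf.sh IdentityReducer.sh \
    /user/cloudera/scripts/

# In Hue, each streaming action would then add these HDFS paths as "Files"
# (e.g. /user/cloudera/scripts/maptf.sh) so Oozie distributes them with the job.
```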

Featured image for this article by heatheronhertravels.com.


10 comments

  1. Great tutorial with very good attention to the little details! Good news too: the video uses Hue 2.1, and nowadays Oozie, Beeswax, etc. are totally revamped in Hue 3.0 🙂

  2. Hi Steve,
    You mention an erratum:
    “The workflow in Oozie worked; however, it would have been much better for me to add my shell scripts (maptf.sh, redtf.sh, mapdf.sh, reddf.sh, maptfidf.sh, and IdentityReducer.sh) to the job and run them from HDFS instead of the local Linux filesystem.”

    Does this mean that if we add the files in the “Add File” section (-file in the hadoop command) for maptf.sh, redtf.sh, etc., it won’t work? And is that an Oozie limitation?
    (I have a serious problem submitting a job via “Add File” to HDFS and running it. Let me know if you need the logs by email.)

  3. Hi Steve,
    I am new to the Cloudera environment and am using the Cloudera virtual machine (CDH 4.4.0).
    When I save a workflow I’ve created, it gives the message “could not save workflow”.
    Do you know anything about this problem? Can you help me solve it?

  4. Very good work and nicely explained, much appreciated 🙂 Keep the new stuff coming!

  5. Hi,
    Thank you for your great tutorial.
    I’m using the Cloudera training virtual machine. When I tried to apply the settings mentioned for streaming in HDFS, it asked me for a sudo password, but I don’t have the sudo password for the Cloudera training VM!
    Do you know any solution for that?
    Thanks.
