[This entry is part 4 of 4 in the series Hadoop Streaming]
MapReduce with Hadoop Streaming in bash – Bonus!
To conclude my three-part series on writing MapReduce jobs with shell script for use with Hadoop Streaming, I’ve decided to throw together a video tutorial on running the jobs we’ve created in Oozie, a workflow editor for Hadoop that allows jobs to be chained, forked, scheduled, and more. To do this I explain more about the MapReduce jobs and setup, then use Hue (a GUI for Hadoop management) to create an Oozie workflow from the three tasks and run it. After that, I use Hive to create a table on top of the HDFS-stored data and run a SQL query against it.
- MapReduce with Hadoop Streaming in bash – Part 1
- MapReduce with Hadoop Streaming in bash – Part 2
- MapReduce with Hadoop Streaming in bash – Part 3
- Hadoop Streaming, Hue, Oozie Workflows, and Hive
- As I mention in the video, in order to use Hadoop Streaming with Oozie you should configure the ShareLib. You can find details on the reasoning here and the process for setting this up in the CDH4 Install Guide under “Installing the Oozie ShareLib in Hadoop HDFS”.
- If you want to follow along, you’ll need the Cloudera QuickStart VM and my source data and scripts.
- In the section on Hive I erroneously say that the file on HDFS is moved into the Hive Metastore on table creation. It’s not the Metastore, but the Hive namespace in HDFS. You can find more on Hive table types here.
- The workflow in Oozie worked; however, it would have been much better for me to add my shell scripts (maptf.sh, redtf.sh, mapdf.sh, reddf.sh, maptfidf.sh, and IdentityReducer.sh) to the job and run them from HDFS instead of the local Linux filesystem.
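To act on that last erratum, the scripts can be copied into HDFS and referenced from there in the workflow. This is a hedged sketch, assuming the QuickStart VM's default user and a `/user/cloudera/scripts` target directory (both assumptions, not paths from the video):

```shell
# Assumed target directory in HDFS -- adjust to your cluster layout
hadoop fs -mkdir -p /user/cloudera/scripts

# Upload the mapper/reducer scripts from the local Linux filesystem
hadoop fs -put maptf.sh redtf.sh mapdf.sh reddf.sh maptfidf.sh IdentityReducer.sh \
    /user/cloudera/scripts/

# Verify the upload
hadoop fs -ls /user/cloudera/scripts
```

With the scripts in HDFS, the workflow no longer depends on them existing on whichever node the Oozie launcher happens to run on.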
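On the Hive correction above: a managed table moves the HDFS files into Hive's warehouse directory in HDFS, while an `EXTERNAL` table leaves them where they are. A hypothetical sketch (the table name, columns, and path are assumptions, not the video's exact DDL):

```shell
# EXTERNAL keeps the data files at LOCATION; dropping the table leaves them intact.
# A managed table (no EXTERNAL keyword) would instead move the files into
# Hive's warehouse directory in HDFS.
hive -e "
CREATE EXTERNAL TABLE tfidf (term STRING, doc STRING, score DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/cloudera/tfidf-output';
SELECT * FROM tfidf ORDER BY score DESC LIMIT 10;"
```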
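More generally, before wiring scripts like these into Oozie, it helps to sanity-check the mapper/reducer contract locally: Hadoop Streaming just pipes lines through stdin/stdout, so `sort(1)` can stand in for the shuffle. Here is a minimal hypothetical word-count pair (a stand-in, not the series' actual maptf.sh/redtf.sh):

```shell
# Hypothetical mapper: tokenize stdin and emit "word<TAB>1" per word
map() {
  tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' | awk 'NF { print $1 "\t1" }'
}

# Hypothetical reducer: sum the counts per word. Hadoop delivers the
# mapper output sorted by key; sort(1) emulates that below.
reduce() {
  awk -F'\t' '{ sum[$1] += $2 } END { for (w in sum) print w "\t" sum[w] }'
}

# Local dry run emulating the shuffle with sort(1)
echo "the quick the" | map | sort | reduce
```

If the pipeline behaves locally, the same scripts should behave under `hadoop jar hadoop-streaming.jar -mapper ... -reducer ...` or an Oozie streaming action.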
Featured image for this article by heatheronhertravels.com.
Great tutorial with very good attention to the little details! Good news too: the video is using Hue 2.1, and nowadays Oozie, Beeswax… are totally revamped in Hue 3.0 🙂
Oh man I never knew! That might be a good topic for the next video. 😉
Hi Steve, nice video, and we are expecting more and more. For example, MongoDB to Hadoop migration, or HBase integrations.
You mention an erratum:
“The workflow in Oozie worked; however, it would have been much better for me to add my shell scripts (maptf.sh, redtf.sh, mapdf.sh, reddf.sh, maptfidf.sh, and IdentityReducer.sh) to the job and run them from HDFS instead of the local Linux filesystem.”
This means that if we add files in the “Add File” section (-file in the hadoop command) for maptf.sh, redtf.sh, etc., it won’t work, right? And is that an Oozie limitation?
(I have a serious problem submitting a job via “Add File” to HDFS and running it. Let me know if you need logs by email.)
I am new to the Cloudera environment, using the Cloudera virtual machine (CDH 4.4.0).
While saving a created workflow it gives the message: could not save workflow.
Do you know anything about this problem? Can you help me solve it?
The real cause of the problem will be displayed in the Chrome inspector (right click, Inspect, then try to save again) or on the /logs page. This is a known bug that will be fixed in the next release: https://issues.cloudera.org/browse/HUE-1858
Feel free to follow up on https://groups.google.com/a/cloudera.org/forum/#!forum/hue-user if it feels too spammy here!
The log file gives me this in the console:
Failed to load resource: the server responded with a status of 400 (BAD REQUEST) http://localhost.localdomain:8888/oozie/workflows/46/save
POST http://localhost.localdomain:8888/oozie/workflows/46/save 400 (BAD REQUEST) jquery-2.0.2.min.js:6
I don’t know how to solve this problem. Please, can anyone help me solve it?
Very good work, nicely explained, and much appreciated 🙂 Keep the new stuff coming!
Thank you for your great tutorial.
I’m using the Cloudera training Virtual Machine. When I wanted to do the setup mentioned for streaming in HDFS, it asked me for the sudo password, but I don’t have the sudo password for the Cloudera training VM!
Do you know any solution for that?