Yesterday I completed the second day of Cloudera Developer Training for Apache Hadoop. While the first day focused on Hadoop core technology like HDFS, the second day was all about MapReduce. That means it was the day that whole ‘developer’ thing was thrown into sharp relief.
I’ve been a DBA for most of my career, but I believe I still think like a developer in a lot of ways. In fact, when it came to Hadoop I specifically decided to take the Developer course because I figured it would make me more uncomfortable, provide a greater challenge, and teach me more things that I would have a harder time learning on my own.
Hey awesome, I was right! Definite discomfort. But therein lies salvation.
Since I’m a fan of summaries and tidbits, I’m going to stick with the format of yesterday’s Day 1 summary and detail three things I learned during Day 2.
I Don’t Like Java
Okay, this isn’t about the class as much as it is me. But I am not a Java fan. Part of the problem (and this is me being honest here) is that I don’t know it very well. That always leads to a rougher experience. But even the things I knew pretty well were fairly annoying just by virtue of it being Java. Unit testing with MRUnit was a real brain-fryer (I’m a DBA, we do all our testing in production anyways, right guys?). But…
That testing in production thing was a joke by the way. Don’t do that.
So Java it is for now. However, I plan to make a good effort after class to learn Hive, Pig, and try using Hadoop Streaming to write MapReduce code in Python or other languages. From what I understand, most MapReduce developers use one of these options. If you have any insight on that I’d love to hear it in the comments. Hive provides options to run HiveQL (a lot like SQL) queries straight against parsed views of your files stored on HDFS. Pig provides a language (Pig Latin) to write your own data flows. Both of these options are transformed into MapReduce code on the Hadoop side of things. Python is just so I can sit at the cool kids’ table. I’d consider Ruby but I’m just not that cool.
MapReduce sounds better than MapCombinePartitionSortShuffleReduce.
Thanks to my pitiful experience with MapReduce on MongoDB, I was under the impression that Hadoop querying had two parts: Map and Reduce. However, it turns out there’s a lot more:
- Mappers, which take in the original data from HDFS and perform your parsing and calculations
- Combiners, which run in-mapper as a mini-reduce to pre-aggregate data (though Hadoop may or may not actually run them)
- Partitioners, which decide how data will be distributed among the reducers
- Sort and Shuffle, the phase where MapReduce sorts all the data emitted by the Mappers by key, then shuffles it into merged lists (by key) for the Reducers
- Reducers, which perform final aggregation if necessary. Reducers are entirely optional, just as a GROUP BY clause is optional in a SQL query.
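To make those phases concrete, here’s a toy single-process sketch of the whole flow for a word count. This is not Hadoop’s API — all the names (map_fn, combine, partition, and so on) are my own, and real Hadoop runs these phases distributed across many machines — but the data movement is the same.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    """Mapper: emit a (word, 1) pair for every word in a line of input."""
    for word in line.lower().split():
        yield (word, 1)

def combine(pairs):
    """Combiner: pre-aggregate counts in-mapper (Hadoop may skip this)."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts.items()

def partition(key, num_reducers):
    """Partitioner: decide which reducer receives this key."""
    return hash(key) % num_reducers

def reduce_fn(key, values):
    """Reducer: final aggregation for one key."""
    return (key, sum(values))

def run_job(lines, num_reducers=2):
    # Map + combine: route each pre-aggregated pair to a partition.
    partitions = [[] for _ in range(num_reducers)]
    for line in lines:
        for key, value in combine(map_fn(line)):
            partitions[partition(key, num_reducers)].append((key, value))
    # Sort & shuffle: sort each partition by key, merge values per key,
    # then hand each merged list to the reducer.
    results = {}
    for part in partitions:
        part.sort(key=itemgetter(0))
        for key, group in groupby(part, key=itemgetter(0)):
            k, v = reduce_fn(key, [value for _, value in group])
            results[k] = v
    return results

print(run_job(["the cat sat", "the cat ran"]))
```

Running it on those two lines yields a count of 2 for “the” and “cat” and 1 for “sat” and “ran”.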
So it turns out that the name MapReduce implies a lot more than two phases. It also helps to compare these phases to piped Linux commands: a grep is analogous to a Mapper, and ‘wc -l’ to a Reducer. The biggest difference, of course, is the distributed nature of Hadoop versus standard shell access. However, with Hadoop Streaming you can easily write MapReduce jobs with anything that can use STDIN and STDOUT, including shell commands. Time to brush off those old awk scripts, folks!
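Here’s a hedged sketch of what that grep/wc analogy looks like as a pair of Hadoop Streaming-style Python functions. The tab-separated "key<TAB>value" line format is Hadoop Streaming’s default; the PATTERN constant and the function names are my own illustrative choices, not anything from the Hadoop APIs.

```python
# Mapper plays the role of grep: filter lines, emit key/value pairs.
# Reducer plays the role of `wc -l`: count what arrives on the sorted stream.
# PATTERN is an assumption for illustration only.
PATTERN = "ERROR"

def mapper(stdin, stdout):
    """Like grep: emit ("ERROR", 1) for each input line that matches."""
    for line in stdin:
        if PATTERN in line:
            stdout.write(f"{PATTERN}\t1\n")

def reducer(stdin, stdout):
    """Like wc -l: sum the counts for the (already sorted) key stream."""
    total = 0
    for line in stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        total += int(value)
    if total:
        stdout.write(f"{PATTERN}\t{total}\n")
```

In a real Streaming job these would be two separate scripts handed to Hadoop as the -mapper and -reducer arguments; locally you can approximate the whole pipeline with a plain shell pipe through sort, which stands in for the sort-and-shuffle phase.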
A Book Literally Is a Thousand Words (and then some)
On the recommendation of the instructor I purchased Tom White’s Hadoop: The Definitive Guide. It really is an outstanding book that reinforced what I’m learning in class and will hopefully help me prepare for the CCDH exam. If you’re interested in learning more about Hadoop, on both the HDFS and MapReduce sides, I would highly recommend it. The flow is good and the information is great.
Make sure you get the 3rd Edition! Hadoop is young(ish) and moving quickly. The previous two editions focused on the old MapReduce API (<0.20), whereas the 3rd edition, published in May 2012, covers the new MapReduce API. It also includes information on MapReduce 2 (MRv2) and Yet Another Resource Negotiator (YARN), which is on its way to being ready for production use.
You have to ask questions to get answers. The hype that Big Data just “finds” all this amazing stuff is ridiculous. Certainly there are tools to help you build dashboards and discover what your data contains (Hadoop/Solr/Mahout, Tableau, etc.), but Hadoop is just a filesystem and code framework. If you ask a developer to “find something neat” with MapReduce they might shoot you.