This will be a short(ish) post, as my brain is relatively fried from the obscene amount of knowledge imparted by the Hadoop class (and it’s Friday so I’m allowed to be lazy nanny nanny boo boo).
To be honest, this was probably the most fun day of class. While the entire course was filled with a plethora of tools, commands, etc. for coding and understanding how MapReduce jobs work, the final day was all about tools. For the education of folks who plan on taking this class in the future, here’s the rough layout of the four days:
- Introduction, concepts for Hadoop, HDFS, and MapReduce, getting connected and exploring HDFS.
- Full-bore MapReduce, lots of information on Mappers, Reducers, Combiners, Partitioners, and more.
- Advanced topics such as Avro, map-only jobs, jobs that chain multiple MapReduce passes, common algorithms, and others.
- Tools such as Hive, Pig, Mahout, Oozie, Sqoop, and Flume.
So let’s look at some detail on the three highlights of the final day of Cloudera Developer Training for Apache Hadoop.
Hive Made Me Feel Uneasy
Hive is a data abstraction layer built on top of Hadoop that allows you to create tables based on data stored in HDFS. You can make external tables (a view into the data based on parsing mechanisms) or Hive-managed tables (data is moved to a special location and managed by Hive). The best part about Hive is that you can then run SQL queries (with limitations) against your HDFS-stored data. On the plus side, this makes things like joining, sorting, and grouping extremely easy. Hive takes all the SQL you write and transforms it into MapReduce jobs. Sounds great, right? So why did it make me uneasy?
One reason is somewhat embarrassing. Though I’ve been developing and DBA’ing against Oracle databases for over 15 years, I still don’t know how to write ANSI join syntax, at least not without looking at documentation. The Oracle (+) method is etched into my brain. Since Hive only accepts ANSI-standard syntax for joins (both inner and outer), this bugs me. Not Hive’s fault though.
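For anyone else whose fingers still type (+), here is a minimal sketch of the ANSI outer join shape. It runs against SQLite from Python purely as a stand-in (it is not Hive), and the users and ratings tables are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Hive here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE ratings (user_id INTEGER, movie TEXT, stars INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO ratings VALUES (1, 'Alien', 5);
""")

# ANSI syntax: the join condition lives in ON, not in the WHERE clause.
# Oracle's old style would be: FROM users u, ratings r WHERE u.id = r.user_id (+)
ansi_outer_join = """
    SELECT u.name, r.movie
    FROM users u
    LEFT OUTER JOIN ratings r ON u.id = r.user_id
    ORDER BY u.id
"""
rows = conn.execute(ansi_outer_join).fetchall()
print(rows)  # bob has no ratings, so his movie column comes back as None
```

The outer join keeps every user, including the ones with no matching rating, which is the same job the (+) marker does on the optional side of an Oracle join.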
Lastly, some things just don’t work in Hive. Commands like UNION and MINUS, which would be very beneficial, are not supported. Nor are subqueries, at least inside the WHERE clause; they can only be used in the FROM clause as inline views.
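The usual workaround for a missing WHERE-clause subquery is to push it into the FROM clause as an inline view and join to it. Here is a rough sketch, again using SQLite from Python as a stand-in rather than Hive itself, with hypothetical tables:

```python
import sqlite3

# SQLite is only a stand-in; the movies and ratings tables are invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER, title TEXT);
    CREATE TABLE ratings (movie_id INTEGER, stars INTEGER);
    INSERT INTO movies VALUES (1, 'Alien'), (2, 'Gigli'), (3, 'Heat');
    INSERT INTO ratings VALUES (1, 5), (1, 5), (2, 1), (3, 5);
""")

# What I'd naturally write:
#   WHERE id IN (SELECT movie_id FROM ratings WHERE stars = 5)
# Rewritten shape: the subquery becomes an inline view in FROM, joined like a table.
inline_view_query = """
    SELECT m.title
    FROM movies m
    JOIN (SELECT DISTINCT movie_id FROM ratings WHERE stars = 5) five_star
      ON m.id = five_star.movie_id
    ORDER BY m.title
"""
titles = [row[0] for row in conn.execute(inline_view_query)]
print(titles)
```

The DISTINCT inside the inline view matters: without it, a movie with several 5-star ratings would show up once per rating after the join.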
All in all it’s a neat tool but not one that I see myself developing in frequently. Perhaps only for quick gathering of information.
Pig is the Bee’s Knees
Pig, on the other hand, I really liked. One of the other students in class said that Hive is to SQL as Pig is to PL/SQL, and I tend to agree. You’re still writing a malformed version of SQL statements in Pig Latin (the scripting language developed for Pig), yet you can chain things together and do a lot of powerful tasks in a single script. Overall it was a more comfortable tool for me, as it’s a completely different language, which makes more sense to my (addled) brain. Maybe I’m just anal retentive.
Moving forward I’m going to keep brushed up on my Java MapReduce jobs, as there are some things you just can’t do with Hive or Pig from what I understand. But the myriad functions and capabilities of Pig seem really great.
Pig also generates MapReduce jobs to fulfill your requirements, which is really cool. It has verbose output so you can see how hard your script was on Hadoop as a whole and lets you see the kind of multi-part requirements your code created. For instance, joining two objects might require two MapReduce jobs, and a sort may require a third.
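To make the "one script, several jobs" idea concrete, here is a toy in-memory sketch in Python of a join followed by a sort, each written as its own map, shuffle, and reduce pass. It only illustrates why the work splits into stages; it is not how Pig actually plans its jobs, and the run_job helper and sample data are invented.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Toy MapReduce pass: map every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # the shuffle hands keys to reducers in sorted order
        output.extend(reducer(key, groups[key]))
    return output

# Hypothetical inputs, tagged by source so the reducer can tell them apart.
movies = [("movie", 1, "Alien"), ("movie", 2, "Heat")]
ratings = [("rating", 1, 5), ("rating", 2, 4), ("rating", 1, 3)]

def join_mapper(record):
    source, movie_id, payload = record
    yield movie_id, (source, payload)

def join_reducer(movie_id, values):
    titles = [p for s, p in values if s == "movie"]
    stars = [p for s, p in values if s == "rating"]
    for title in titles:
        for star in stars:
            yield (title, star)

# Job 1: reduce-side join on movie id.
joined = run_job(movies + ratings, join_mapper, join_reducer)

# Job 2: sort by stars, descending, by making the negated rating the shuffle key.
def sort_mapper(record):
    title, star = record
    yield -star, record

def sort_reducer(_key, values):
    yield from values

ranked = run_job(joined, sort_mapper, sort_reducer)
print(ranked)
```

The sort cannot happen inside the join job because the join is keyed on movie id, so ordering by rating needs a second pass with a different key.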
Mahout Mesmerizes My Mind
One of the coolest labs (done in three parts really) we got to do involved Mahout. In the labs, we:
- Used Sqoop to load MovieLens data from MySQL into HDFS
- Used Mahout to run a basic item-based recommender task (collaborative filtering) to determine what movies you might like based on prior 5-star reviews
- Used Pig to join the output to the movie list to get a nice list of movies that each user should check out based on the recommender
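The recommender step in that list can be sketched in a few lines of Python. This is a bare-bones co-occurrence take on item-based collaborative filtering, not Mahout’s implementation, and the users, movies, and recommend helper are all made up for illustration.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical data: the movies each user gave 5 stars.
five_star = {
    "ann": {"Alien", "Heat", "Ronin"},
    "bob": {"Alien", "Heat"},
    "cat": {"Alien", "Brazil"},
}

# Count how often each ordered pair of movies is liked by the same user.
co_occurrence = defaultdict(lambda: defaultdict(int))
for liked in five_star.values():
    for a, b in permutations(liked, 2):
        co_occurrence[a][b] += 1

def recommend(user, top_n=2):
    """Score unseen movies by how often they co-occur with the user's favorites."""
    seen = five_star[user]
    scores = defaultdict(int)
    for movie in seen:
        for other, count in co_occurrence[movie].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically so the output is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [movie for movie, _ in ranked[:top_n]]

print(recommend("bob"))
```

The idea is simply that movies liked together by other people are good candidates for someone who liked one of them. Joining the resulting titles or ids back to a movie list is the part we handed to Pig in the lab.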
Sounds awesome, right? It was the one part of class that made me feel like I data scienced a little. And it definitely opened up my mind to possibilities both at work and in data I’ve played with over the years. If nothing else, I really want to focus my attention on Sqoop, Mahout, and Pig, as they can help produce immediate eye-popping results. Even better, since MapReduce jobs output in TSV format, I can easily take the data I get out of these components and use it for D3.js visualizations.
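Getting that tab-separated output into a shape D3.js likes is mostly a parsing exercise. Here is a small Python sketch, assuming a hypothetical two-column user/movie output (the column names and sample rows are mine, not from the class labs):

```python
import csv
import io
import json

# Stand-in for a part-r-00000 style file: one tab-separated record per line.
tsv_output = "ann\tRonin\nbob\tBrazil\nbob\tRonin\n"

def tsv_to_json(text, fields=("user", "movie")):
    """Turn tab-separated lines into a JSON array of objects."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    records = [dict(zip(fields, row)) for row in reader if row]
    return json.dumps(records)

as_json = tsv_to_json(tsv_output)
print(as_json)
```

An array of plain objects like this is easy to bind to a D3.js selection once it is loaded in the browser.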
I really don’t know where my Hadoop future will go. At my company (as at many others) we probably won’t have a Hadoop cluster in house for quite a long time. I think that it would be best right now to play around with Amazon’s Elastic MapReduce against S3, as it incorporates the same tools in a pay-as-you-go fashion. For simple testing, there’s always the Cloudera QuickStart VM, which I saved from class to continue my testing and meddling.