Cloudera Developer Training for Apache Hadoop is almost over, and I’m somewhat sad that my Hadoopin’ days are nearly done–in the classroom at least. However, the breadth of this training has been great and I can definitely say I’ve gotten my (company’s) money’s worth.
Being that I’m three days in, I figure it’s time to mention a little more about my purpose for taking this class. After all, what does a freewheelin’ Oracle slingin’ admin doin’ rabble rouser like myself have to do with distributed filesystems and Java coding?
It’s actually pretty simple: Hadoop and Big Data might be the most overused words since paradigm shift and synergy, but there’s still a huge stack of useful technology and new ways of thinking that can be harnessed–excuse me, leveraged–if you know how to use it. As a contentaholic (one who frequently partakes in the consumption of content) I get a little giddy when I think of a technology that was created to parse, categorize, rank, and form relationships between many different streams of content. Since I currently work for a digital media company that lives on the collection of content and metadata, a brand of analytics that goes beyond traditional relational capabilities is also appealing both to them and to me.
I’ve been working with Oracle for all of my adult life and like to think I’ve got it pretty well down (‘down’ is of course relative). While I had a pretty good understanding prior to the class of what ‘Big Data’ truly meant (mostly thanks to work with distributed NoSQL databases), a data technologist like myself should definitely know some of the nitty gritty details. And this class definitely delivered.
Anyways, enough about me. Let’s recap some lessons from Day 3.
I’ve been noticing more and more of the little details in the Hadoop architecture that just make a lot of sense, from block replication to the ways you can partition Mapper output to the Reducers.
The one that really stands out (and it sounds quite simple) is that the file structure generated by a Mapper job and that generated by a Reducer job look almost identical, AND are in a proper format to be used as input to a Mapper. Let me explain….no, there is too much. Let me sum up.
One of the joys of MapReduce is that it can accept pretty much anything as input: plain text, PDFs, videos, pictures, audio, XML, you name it, it handles it. It can of course do very well with comma separated values (CSV) and tab separated values (TSV) as well. Every input and output for Mappers and Reducers is a key:value pair, with the key being whatever kind of WritableComparable you want and the value being some form of Writable.
When you use a path as your source data for MapReduce (located in HDFS), MapReduce will consider every file in the directory as input as long as it doesn’t start with a period or underscore. So if I have a big directory full of sensor data from the FAA, it could look like this in HDFS:
If I then run a MapReduce job against that–say one that returns all aircraft that have traveled through each 500 square mile chunk of airspace in the USA–like this: “hadoop jar CalcAirspace.jar CalcAirspace /user/oraclealchemist/faa /user/oraclealchemist/faaout”, it would read in the three .lst files and ignore the _comments file which could contain a bunch of personal notes about the data set. The data would run through my job and the output would get dumped in the /user/oraclealchemist/faaout location on HDFS.
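To make the "ignore anything starting with a period or underscore" rule concrete, here is a minimal sketch in plain Java. It is not Hadoop's own code, just the same rule applied to a list of names, and the file names (sensors-a.lst and friends) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InputFileFilterSketch {
    // The rule described above: names starting with '_' or '.' are skipped.
    static boolean isVisibleInput(String fileName) {
        return !(fileName.startsWith("_") || fileName.startsWith("."));
    }

    // Keep only the names that a job would treat as input under that rule.
    static List<String> selectInputs(List<String> fileNames) {
        List<String> inputs = new ArrayList<>();
        for (String name : fileNames) {
            if (isVisibleInput(name)) {
                inputs.add(name);
            }
        }
        return inputs;
    }

    public static void main(String[] args) {
        // Hypothetical directory listing: three data files plus a notes file.
        List<String> dir = Arrays.asList(
                "sensors-a.lst", "sensors-b.lst", "sensors-c.lst", "_comments");
        // prints [sensors-a.lst, sensors-b.lst, sensors-c.lst]
        System.out.println(selectInputs(dir));
    }
}
```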
Now here’s the cool part. The data that comes out will be in the form of ‘part’ files, one for each reducer. It comes out by default in TSV format, with the key and value separated by a tab. And all informational files (the _SUCCESS marker, for example) have names that are preceded by an underscore. That means that the output data is already perfectly formatted to be used as input to another job.
This may not seem like a big deal, and it’s really not. But it is common sense. Here we have this extraordinarily flexible system, and it takes care to ensure portability and easy workflows. As someone who deals with tools like SQL*Plus, SQL*Loader, and others this is a breath of fresh air.
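Reading one of those TSV lines back in is about as simple as it sounds. The sketch below is plain Java rather than the Hadoop API: it splits a line on the first tab into a key and a value, which is my understanding of what Hadoop's KeyValueTextInputFormat does by default. The sample line (an aircraft ID and a count) is invented for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class PartFileLineSketch {
    // Split one line of a 'part' file into key and value on the FIRST tab.
    // Assumption: a line with no tab is treated as a key with an empty value.
    static Map.Entry<String, String> parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        // Hypothetical reducer output line: key, tab, value.
        Map.Entry<String, String> kv = parse("N12345\t42");
        System.out.println("key=" + kv.getKey() + " value=" + kv.getValue());
    }
}
```

Splitting on the first tab only means a value can itself contain tabs and still survive the round trip into the next job.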
Back in the day, Hadoop used SequenceFile as a means of passing binary data into and out of Hadoop. You can still do this, but the problem is that SequenceFile is only known to Hadoop and not to other programming languages.
Apache Avro, on the other hand, is a data serialization system that has APIs for Java and many other languages. If you want to make a map job that accepts images, videos, or other binary formats, you can use Avro to bundle them up together and pass them in in a way that your Mapper code can understand. An Avro file describes its records with a schema written in JSON, which is stored in the file’s header.
You may ask, why not just pass in a ton of binary data as is? Why bundle it up with Avro? The answer lies in the way HDFS stores data and the optimal way to retrieve it. The default HDFS block size is 64MB (with many opting for 128MB). Hadoop performs best when your file sizes closely match the block size. A bunch of 2MB files will occupy smaller chunks of space in HDFS and result in more random reading. Serialized 128MB chunks with a 128MB block size will result in better reads. Also, Map tasks are spun up one per input split, and a file smaller than a block gets a split all to itself. If your input is a pile of small files, then you’ll need more Mappers and therefore more waves during your Map phase.
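A little arithmetic shows how fast this adds up. The sketch below is a simplification in plain Java, assuming one Map task per block-sized split and ignoring the finer points of how Hadoop actually computes splits. The numbers (640 files of 2MB versus the same 1280MB bundled into one file) are made up for illustration.

```java
class SplitCountSketch {
    static final long MB = 1024L * 1024L;

    // Rough estimate: each file needs ceil(size / blockSize) Map tasks.
    // Assumption: zero-length files are ignored here for simplicity.
    static long mapTasks(long[] fileSizes, long blockSize) {
        long tasks = 0;
        for (long size : fileSizes) {
            if (size > 0) {
                tasks += (size + blockSize - 1) / blockSize;
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;

        // 640 small files of 2MB each: every file becomes its own split.
        long[] smallFiles = new long[640];
        java.util.Arrays.fill(smallFiles, 2 * MB);
        System.out.println(mapTasks(smallFiles, blockSize)); // prints 640

        // The same 1280MB bundled into a single file.
        long[] bundled = { 1280 * MB };
        System.out.println(mapTasks(bundled, blockSize)); // prints 10
    }
}
```

Same data, 64 times as many Map tasks, which is the whole argument for bundling small binaries together.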
If you’re getting into Hadoop and MapReduce I’d highly recommend you research common algorithms for use with MapReduce. As I mentioned in the Day 1 post, the standard Word Count example of MapReduce is a little lacking and really doesn’t capture what Hadoop can do. In fact, most of the key:value concepts where ‘value’ is a plain old integer come off pretty boring.
It’s a lot cooler once you start getting into things like sorting, indexing, term frequency-inverse document frequency (TF-IDF), etc. You can find examples of this on the web and on GitHub if you look around. From what I understand, Hadoop in Practice and Hadoop Real World Solutions Cookbook are great resources for real-world examples.
Why bother with Hadoop if you’re just going to do the equivalent of grep, awk, and wc? Push the limits!