MapReduce with Hadoop Streaming in bash – Part 3


In our first MapReduce with Hadoop Streaming in bash article, we took a collection of Stephen Crane poems and used a MapReduce job to calculate ‘term frequency’, meaning we counted the number of times each word appeared in each document of the collection. In the second part, we calculated ‘document frequency’ by counting the number of documents each word appears in, using the results of the first job.

For this final part, we will combine the term frequency and document frequency to produce the final Term Frequency/Inverse Document Frequency (TF-IDF) score. To do that, we just need to plug our results into the TF-IDF formula.

TF-IDF = TF × ln(N / DF)

This formula says that TF-IDF equals the term frequency (TF) times the natural logarithm of the total number of documents (N) divided by the document frequency (DF). So for each term/file combination, we need to calculate the TF-IDF based on the values produced by our last MapReduce job. Some people prefer a base-10 logarithm to dampen the results; I'll cover that in the Mapper section.

Setup

So the first thing I’m going to do is get the output from the last job. You should be used to this by now.
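In my case that is the document-frequency output from part 2. Something along these lines works (the HDFS path and local file names are assumptions; use whatever output folder the last job wrote to):

    # pull the part 2 (document frequency) output down from HDFS for local testing
    hadoop fs -get df/part-00000 df_results.txt
    head df_results.txt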

The next thing I'm going to do is cheat. See, the algorithm requires the total number of documents we've been analyzing. Sure, I could write a MapReduce job that looks through our latest output and emits a list of unique files; however, that would be inefficient and a waste of resources. A simple ‘ls’ command with a glob is much cheaper and leaves our (pseudo)cluster free for real work. To figure out our total document count, we'll do just that:
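Here is a sketch of that count; the path is an assumption, so point it at wherever the original poems live in HDFS:

    # count the source documents with a simple ls and a glob
    hadoop fs -ls '/user/hduser/crane/*.txt' | grep -c 'txt$'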

For our testing we’ll just explicitly set this as a variable. In the Hadoop job we’ll pass it in as a parameter.

On to the Mapper

So let’s go ahead and do our final calculation using the Mapper.
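Here is a minimal sketch of that mapper, assuming the input is the tab-separated term/file/tf/df output of the last job and that the total document count is available in an environment variable called N (the script name is my own):

    #!/bin/bash
    # mapper.sh -- emit the TF-IDF for each term/file pair (illustrative sketch)
    # expected input (tab separated): term  file  tf  df
    # N = total number of documents, provided via the environment
    while read -r term file tf df; do
        # awk's log() is the natural logarithm
        tfidf=$(awk -v tf="$tf" -v df="$df" -v n="$N" \
            'BEGIN { printf "%.6f", tf * log(n / df) }')
        printf "%s\t%s\t%s\n" "$term" "$file" "$tfidf"
    done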

Simpler than you thought? Let’s look at what we did.

  1. Read each line into the variables term, file, tf, and df. These represent (from our last job) a unique term/document combination, the number of times the term appeared in that document (tf), and the number of documents the term appears in (df).
  2. Calculate the TF-IDF using awk. We do this by passing in the total document count ($N, the variable I mentioned in the Setup section), the document frequency, and the term frequency. TF times log(N/DF) is the final answer.
  3. Print the final output as Term, File, and TF-IDF. Term and File make up the unique key for each line of output.

This is actually our final result: exactly what we've been trying to calculate, and the product of our three jobs. As I mentioned in the intro to this article, some people prefer to use log10() instead of the natural logarithm (which uses the constant e) to dampen the results, like this:

TF-IDF = TF × log10(N / DF)

If that’s the case, you can replace the awk line in the Mapper with this one:
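awk only has a natural log built in, so the base-10 version just divides by log(10). Assuming the same variable names as the sketch above:

    # base-10 variant: log10(x) = log(x) / log(10)
    tfidf=$(awk -v tf="$tf" -v df="$df" -v n="$N" \
        'BEGIN { printf "%.6f", tf * log(n / df) / log(10) }')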

Let's test it out in the shell, first setting the ‘N’ variable required by the algorithm:
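Something like this (the local file names are assumptions carried over from the sketches above):

    # N = total number of documents in the collection
    export N=8
    chmod +x mapper.sh
    cat df_results.txt | ./mapper.sh | sort | head -20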

Not too descriptive with all those zeros, but if you know how TF-IDF works you can see it is doing its job. The lower the value, the less relevant the word. The letter ‘a’ appears in all 8 documents, making it a very irrelevant word. Normally, words like ‘a’, ‘and’, or ‘the’ would have been filtered out at the start with a stoplist.

What Reducer?

Since the Mapper produced our final output, we don't actually need a reducer at all. We could specify no reducer (-reducer NONE in the options), but instead we'll use something called the IdentityReducer. The IdentityReducer simply takes its input and writes it back out unchanged. This accomplishes two things: 1) the data is still sorted/shuffled on its way to the reducer, so with a single reducer the output comes out sorted, and 2) we get a single output file instead of one per mapper, which is easier to work with later.

So you get a break this time. No reducer. Sweet!

Our Hadoop Command

Just as in the previous articles, the backslashes are only there to show this is a multiline command; if you put it all on one line you don't need them.
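Here is a sketch of the full command; the streaming jar location and the HDFS paths are assumptions and will differ between installs:

    # paths below (jar, input 'df' folder, source document folder) are assumptions
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D stream.num.map.output.key.fields=2 \
        -D N=$(hadoop fs -ls '/user/hduser/crane/*.txt' | grep -c 'txt$') \
        -input df \
        -output tfidf \
        -mapper mapper.sh \
        -file mapper.sh \
        -reducer org.apache.hadoop.mapred.lib.IdentityReducer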

So a few things to note here. First, we set the stream.num.map.output.key.fields property to 2. Even though we don't have a formal reducer, we still want to tell the job the key field count so it sorts properly. Second, we set a new variable (-D is required for each one) called ‘N’ to the result of an ‘ls’ command in Hadoop against our original document folder. This value shows up as a bash variable inside our mapper script and holds the total document count. The third thing is the -reducer setting: to use the identity reducer, set it to org.apache.hadoop.mapred.lib.IdentityReducer.

Running this command gives us the final job output, saved to the ‘tfidf’ folder in HDFS.

The Final Results

After three MapReduce jobs, we're finally ready to see each word/document pair and its associated TF-IDF. Score! (ba dum tss)
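Thanks to the single reducer everything lands in one part file, so we can cat it straight out of HDFS:

    # dump the final index from the 'tfidf' output folder
    hadoop fs -cat tfidf/part-00000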

Just as in our test, "a" is not important at all: it appears in all 8 documents, so ln(8/8) = 0 and its score is 0. "Where" seems very important for such a common word, but it turns out that is because it shows up 3 times in only 1 file. Remember, TF-IDF is "a numerical statistic which reflects how important a word is to a document in a collection or corpus". Take a look at words like "you" at the end of the file: it shows up in two different files, but has a different TF-IDF weight in each one. That's because "you" appears once in the first file but twice in the second, making it more important to that document relative to the whole corpus. You can verify this with ‘grep’ commands against the original content.

And with that, we've just built an index. If we were to build a search engine against those 8 Stephen Crane poems, a search for a word would return its files ordered by TF-IDF, descending. That way the most pertinent (keyword-rich) files come first in the results.
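As a toy example, a "search" for the word "you" is just a filter on the index followed by a descending numeric sort on the TF-IDF column:

    # look up a single term and rank its files, most relevant first
    hadoop fs -cat tfidf/part-00000 | awk -F'\t' '$1 == "you"' | sort -t$'\t' -k3,3 -rn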

Conclusion

Of course, we could have done this project much more easily with tools like Lucene and Mahout. They are made for exactly this sort of thing and have a ton of extra features, including automatic stoplisting, weight tuning, and so on. But it wouldn't be nearly as fun, right?

This concludes our TF-IDF with Hadoop Streaming in bash exercise. If you have any feedback on better ways to do these tasks (or errata) please let me know in the comments!
