MapReduce with Hadoop Streaming in bash – Part 1


So to commemorate my recent certification and because my Java absolutely sucks, I decided to do a common algorithm using Hadoop Streaming.

Hadoop Streaming

Hadoop Streaming allows you to write MapReduce code in any language that can process stdin and stdout. This includes Python, PHP, Ruby, Perl, bash, node.js, and tons of others. I’m a huge fan of node and PHP but not everyone knows those. Python is desirable and I’m working to learn it but nowhere near ready yet. So I went for bash, since most Oracle-heads and other Linux lovers know it.

The algorithm I’m using is TF-IDF, which stands for Term Frequency – Inverse Document Frequency. According to Wikipedia, TF-IDF is “a numerical statistic which reflects how important a word is to a document in a collection or corpus”. It’s useful for search ranking, collaborative filtering, and other tasks. In this article (Part 1), we’re going to calculate term frequency by grabbing the lines of each file, parsing out all the words (map), then summing them up to show the frequency of each word per document (reduce).
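For reference, one common formulation of the statistic is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the number of times term t appears in document d, df(t) is the number of documents containing t, and N is the total number of documents in the corpus.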

The Setup

To set this up I’m using the Cloudera QuickStart VM. This is a phenomenal resource that is preconfigured with CDH4.3 and tons of extra tools. The data I’m working with is small and simple since I’m running in pseudo-distributed mode on a VM, consisting of 8 Stephen Crane poems (my favorite) in text format.

I had to load this data into Hadoop, so I made a ‘crane’ directory and put the files in there.
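If you're following along on the VM, the commands look something like this (assuming the poems are .txt files sitting in your current local directory):

$ hadoop fs -mkdir crane
$ hadoop fs -put *.txt crane
$ hadoop fs -ls crane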

And we’re set!

The Mapper

So here’s the mapper (maptf.sh), which reads each line of whatever file is sent to it, tokenizes it, and emits tab-separated keys and values.
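Here’s a minimal sketch of the script (the exact contents of the ‘exclude’ character class are an assumption on my part):

#!/bin/bash
# maptf.sh - term frequency mapper (sketch; the 'exclude' set is assumed)

# 1. Regex character class of punctuation to strip from each word
exclude='[.,:;?!"()-]'

# 2. Read stdin one line at a time into 'split'
while read split; do
    # 3. Let bash tokenize the line on whitespace
    for word in $split; do
        # 4. Strip the excluded characters and lowercase the word
        term=$(echo "$word" | sed "s/$exclude//g" | tr '[:upper:]' '[:lower:]')
        # 5./6. Skip empty terms, then emit: term <tab> input file <tab> 1
        [ -n "$term" ] && printf "%s\t%s\t%s\n" "$term" "$map_input_file" 1
    done
done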

Let’s go through the code:

  1. Define the exclude variable. This variable holds the regex characters that will be stripped out during the map.
  2. Main loop. This reads stdin (while read) into a variable called ‘split’, one line at a time.
  3. Inner loop. For each word in the ‘split’ variable (bash’s native whitespace tokenizing).
  4. Set the ‘term’ variable equal to the current word, excluding characters from the ‘exclude’ variable, and converted to lowercase.
  5. Make sure ‘term’ isn’t empty.
  6. Print the output in the form of: term-inputfile-1 (with tabs instead of dashes). Inputfile in this case is represented by the environment variable ‘map_input_file’. This is a standard Map variable normally denoted as map.input.file; however, Hadoop Streaming turns the periods into underscores for compatibility.

The cool part is that since this is a shell script, we can test it at the command prompt to see how it works by reading a file and piping it through the script. Note that I’m setting the ‘map_input_file’ variable manually for the test so I get the proper output.
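For example (the filename here is just a stand-in for one of the poem files):

$ cat war_is_kind.txt | map_input_file=war_is_kind.txt ./maptf.sh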

At this point it’s not much different from a simple word count Mapper, which is essentially what the term frequency portion of this algorithm is, except that it keys on both the term and the file.

The Reducer

The Reducer is where we’ll aggregate the data that was emitted. The 8 files that serve as input to this MapReduce job will be handled by 8 Mappers, each running maptf.sh against its own input split. The results are then put through the ‘shuffle and sort’ phase, where the keys are sorted (the first two output columns are the key in this case, more on this later) and sent to the reducer(s). The reducer then takes all the data and aggregates it into the final format: it takes the Map output in the form (term-file-1) and sums it up to (term-file-termfrequency).
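Here’s a minimal sketch of redtf.sh, reconstructed from the steps that follow (details of my original may differ):

#!/bin/bash
# redtf.sh - term frequency reducer (sketch)

# 1. Read the first record into the "current" variables
IFS=$'\t' read currterm currfile currnum

# 2. Loop through the remaining (already sorted) records
while IFS=$'\t' read term file num; do
    if [ "$term" = "$currterm" ] && [ "$file" = "$currfile" ]; then
        # 3. Same term/file as before: add the latest count
        currnum=$((currnum + num))
    else
        # 4. New term/file combo: emit the finished key, then reset
        printf "%s\t%s\t%s\n" "$currterm" "$currfile" "$currnum"
        currterm=$term
        currfile=$file
        currnum=$num
    fi
done

# 5. Emit the final key
printf "%s\t%s\t%s\n" "$currterm" "$currfile" "$currnum"

Walking through it: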

  1. Read the first line, putting the fields into the variables ‘currterm’, ‘currfile’, and ‘currnum’
  2. Loop through the rest of the file, putting new terms into the variables ‘term’, ‘file’, and ‘num’
  3. Check to see if the latest term matches the previous term and the latest file matches the previous file. Remember, this works because the input to a reducer is ALWAYS sorted by key! The magic of shuffle and sort.
    1. Set ‘currnum’ equal to ‘currnum’ plus the latest value of ‘num’ (always 1 in this case).
  4. Else… (no match, it’s a new term/file combo)
    1. Print the current term, current file, and current sum in tab delimited format.
    2. Set ‘currterm’ equal to the latest ‘term’
    3. Set ‘currfile’ equal to the latest ‘file’
    4. Set ‘currnum’ equal to the latest ‘num’
  5. Keep doing that until the loop’s exhausted, then print the final value.

Fun, right? What’s cool is that we can test this the same way we tested the mapper, as long as we sort first. Remember, sorting has to be done on the first two columns, which make up the key. So:
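The full local test, with the mapper output sorted on its first two fields before it hits the reducer (same stand-in filename as before):

$ cat war_is_kind.txt | map_input_file=war_is_kind.txt ./maptf.sh | sort -k1,2 | ./redtf.sh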

The reducer prints each term with its file and summed count, and that’s our expected result.

Hadoop It Up

The time has finally come to run our MapReduce script. To do this we’re going to use the ‘hadoop’ command with the Hadoop Streaming JAR file included with the distro. Here’s the command we’ll use:
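Something along these lines (the streaming JAR path follows the CDH layout on the QuickStart VM, and the script paths are stand-ins for wherever you saved maptf.sh and redtf.sh locally):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 \
    -input crane \
    -output crane_out \
    -mapper /home/cloudera/maptf.sh \
    -reducer /home/cloudera/redtf.sh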

NOTE: The backslashes (\) just indicate that the command is split across multiple lines.

This command is doing a few critical things. First, it says we want to run the hadoop-streaming.jar file. The -D flag then lets us set generic configuration properties for the job.

The first one is absolutely critical: stream.num.map.output.key.fields=2. This tells the MapReduce job that the first two fields output by the Mapper make up the key. It matters because the shuffle and sort phase must sort on the full key for the reducer to work properly. Keys are sorted in every MapReduce job in any language, but only Hadoop Streaming needs this parameter to know where the key ends and the value begins.

The next parameter is the ‘-input’ option which is the HDFS location of the input files. It can be either a directory or any POSIX compliant glob match. The next parameter is ‘-output’ which is the location on HDFS where the output should be dumped. This directory MUST NOT exist. Then we define the ‘-mapper’ and ‘-reducer’ parameters, pointing them to my shell scripts. Simple.

Running the command kicks off the streaming job, and Hadoop prints the job’s progress to the console as the map and reduce tasks run.

Now we can go look at our results to see how the job did. The results will be in the ‘crane_out’ directory as specified by the hadoop command. So let’s take a look:
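Listing the output directory with the HDFS shell:

$ hadoop fs -ls crane_out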

The ‘part-00000’ file is our output. By default, MapReduce ignores files that begin with an underscore (_) or a period (.). The output of the MapReduce job produced two ignorable files and one ‘part’ file which was the output of the single reducer used to aggregate our numbers. If this were a bigger dataset with more reducers, we’d have more part files.

So let’s take a look at our final output:
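For example, piping the HDFS cat output through head to keep things short:

$ hadoop fs -cat crane_out/part-00000 | head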

Each output record consists of a term, a filename, and a count (the term frequency).

Special Note

Daniel Templeton left a very important note in the comments. In these examples I am running my scripts from the local filesystem; however, it’s a much better practice to load them into HDFS. Running on a VM is great but can make you lazy…once you move on to running on a cluster it will make a huge difference! He offered up this example:
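$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 -input crane -output crane_out \
    -file ./maptf.sh -mapper maptf.sh -file ./redtf.sh -reducer redtf.sh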

Conclusion

Now normally you’d want to check your words against a stoplist and rule out all the common ones like ‘a’ and ‘and’ and such. However, since this is a small dataset and Stephen Crane is a man of few words, we’ll leave them in to see how our final algorithm holds up.

What we just calculated is the crucial ‘term frequency’ part of the TF-IDF algorithm. In the next part, we’ll be calculating the number of documents each term appears in (document frequency), an important part of the IDF portion of the algorithm. We’ll do this with another MapReduce job using different code, using the output from today’s job as input. See you then!

Oh yeah, one more thing. I’m not the best bash coder out there so if I could have coded the two functions better let me know! I tried using arrays first but that was slooooooow.

Added Note: I uploaded the source data and scripts to GitHub and will add new scripts as the three-part blog tutorial moves forward. The Cloudera VM comes preconfigured with git.


7 Responses to “MapReduce with Hadoop Streaming in bash – Part 1”

  1. Timothy Potter says:

    This is a quality blog post for sure, but I think it is worth mentioning that this approach looks a little cumbersome for computing TF-IDF. I get that you chose TF-IDF because it’s a simple and popular computation but a better solution is only a few lines of trivial Pig code (e.g. http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/). Pig will create the optimized M/R jobs to do the actual work. Not to mention it would be hard (and/or slow) to integrate a more sophisticated analysis/tokenization strategy into a streaming job, i.e. using Lucene’s StandardAnalyzer to tokenize and split text. With Pig, you could write a simple UDF to invoke a Lucene Analyzer in a few lines of code.

  2. Steve Karam says:

    Timothy, thanks for the input into better tech to use for this purpose! I’m hoping to tackle some Mahout tasks later for blogging and will probably use Mahout/Lucene for the same type of things.

    I figured TF-IDF would be a good algorithm to use to demo the fact that shell scripts can be used for MapReduce…as cumbersome as they may be. ;) Your acronym on LinkedIn was spot on: I’m trying not to create YAWCE (Yet Another Word Count Example)! Love it.

  3. Daniel Templeton says:

    I absolutely LOVE that you did this with Crane’s poetry!

    One thing to watch, though, is that you skipped the step where you uploaded your scripts into HDFS. A better approach would be to let streaming handle it for you:

    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 -input crane -output crane_out \
    -file ./maptf.sh -mapper maptf.sh -file ./redtf.sh -reducer redtf.sh

    The -file args tell streaming to upload your scripts into the working directory of the job, so the commands passed to -mapper and -reducer are running with your scripts in the cwd.

    A man cried out to the JobTracker, “Sir, my job exists!”

    “However,” replied the JobTracker, “that fact has not created in me a guarantee of data locality.”

  4. Steve Karam says:

    Daniel, you’re awesome. Thanks for the -file note, that’s definitely something that was missing! I’ll add it into the post.

    And that parody is spot on as well. Absolutely hilarious.

  5. Ben Okopnik says:

    I’m afraid that ‘split’ trick doesn’t work too well; it simply eliminates the specified characters, joining the (possibly unrelated) words:

    $ map_input_file=test ./maptf.sh
    not-connected
    notconnected test 1
    joe,frank,lisa
    joefranklisa test 1

    Try this instead:

    -----------------------------------------------------------------------
    #!/bin/bash

    old_ifs="$IFS"
    while read split; do
        IFS=' .,?!-_:;][#|$()"'
        for word in $split; do
            term=`echo $word | tr '[:upper:]' '[:lower:]'`
            [ -n "$term" ] && printf "%s\t%s\t%s\n" "$term" "$map_input_file" 1
        done
        IFS="$old_ifs"
    done
    -----------------------------------------------------------------------

    $ map_input_file=test ./maptf.sh
    not-connected
    not test 1
    connected test 1
    joe,frank,lisa
    joe test 1
    frank test 1
    lisa test 1

    Good informative post overall, though - thank you!

    Ben Okopnik

  6. rICh says:

    Steve, this. is. awesome! I’m sharing it with every developer class I have going forward. Hugely useful stuff and I love how you did it in bash. Ben, thank you for the clarifications & code rewrite.

  7. Pswain says:

    Can’t get past the error java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not found while running a MapReduce streaming job:
    > hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.4.0.jar -D stream.num.map.output.key.fields=2 -input crane -output crane_out -file ./maptf.sh -mapper maptf.sh -file ./redtf.sh -reducer redtf.sh -verbose

    I have replaced the shared lib under oozie. But I’m not quite sure what could be causing this. please ignore if this is not the right forum. I’m new to this area.

    thanks.

