MapReduce with Hadoop Streaming in bash – Part 2


In MapReduce with Hadoop Streaming in bash – Part 1 we found the ‘term frequency’ of words within a collection of documents. For the documents I chose 8 Stephen Crane poems, and our bash Map and Reduce jobs tokenized the words and found their frequency among the entire set. The final output was “term-file-tf”, where tf is term frequency and the dash delimiters were actually tabs. A sample of the output looked like this:
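For example, the row for the word ‘from’ in a_spirit_sped.txt (a pairing we’ll verify by hand at the end of this post) looked roughly like this, with the columns separated by tabs:

    from    a_spirit_sped.txt    2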

Today we will be calculating the ‘document frequency’, or the number of documents each word appears in. This will help us calculate the ‘inverse document frequency’ (IDF) portion of our TF-IDF algorithm. To do this, we’ll use our term frequency output as the input to our document frequency MapReduce job, with term and filename as our input key. The actual key/value transformation will look like this (the key is the first variable or parenthetical):

  • {(term, file),tf} -> Map -> {term,(file, tf, 1)}
  • {term,(file, tf, 1)} -> Reduce -> {(term, file), (tf, df)}
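To make that concrete, trace the word ‘from’ through the job (its document frequency will turn out to be 4):

  • {(from, a_spirit_sped.txt), 2} -> Map -> {from, (a_spirit_sped.txt, 2, 1)}
  • {from, (a_spirit_sped.txt, 2, 1)} -> Reduce -> {(from, a_spirit_sped.txt), (2, 4)}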

This is the trickiest part of the TF-IDF calculation, because the reduce job has to span multiple documents in a single read loop and therefore buffer in-progress rows. But more on that later. For now let’s get started!

The Mapper

For the purposes of testing, I’m first going to pull the results of yesterday’s term frequency job to the local filesystem.
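Something along these lines will do it (the part file name and the local name crane_tf.txt are my guesses; use whatever your Part 1 job actually wrote):

    $ hadoop fs -get crane_out/part-00000 crane_tf.txt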

Now let’s take a look at our document frequency mapper code.
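A minimal sketch of it looks like this (mapdf.sh is my name for the script; the exact formatting is a judgment call):

    #!/bin/bash
    # mapdf.sh - document frequency mapper sketch.
    # Input : term <TAB> file <TAB> tf   (the Part 1 output)
    # Output: term <TAB> file <TAB> tf <TAB> 1
    while read -r term file num; do
        printf '%s\t%s\t%s\t1\n' "$term" "$file" "$num"
    done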

This script is exceedingly simple this time because we’re working with more structured input, as opposed to yesterday, when we had to tokenize unstructured data (plain text). The code above does the following:

  1. Read the input in a loop as three variables: term, file, and num (the tf from our last job’s output)
  2. Print the variables back out, appending a new column with a value of “1”. All we’re showing here is that yes, this term made an appearance in this file. Since this calculation is for document frequency, each word-per-doc result is just 1.

That was easy, right? Let’s test it using the file we grabbed from the last job.
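A quick pipe through the mapper, using the assumed names from above:

    $ cat crane_tf.txt | ./mapdf.sh | head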

Looks good! Moving on.

The Reducer

This is where our feeling of “oh, that was easy” is bashed (pardon the pun) beyond recognition.
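Here is a sketch of the reducer (reddf.sh), reconstructed from the walkthrough that follows; the buffering mechanics in particular are my reading of it:

    #!/bin/bash
    # reddf.sh - document frequency reducer sketch.
    # Input : term <TAB> file <TAB> tf <TAB> 1   (sorted by term)
    # Output: term <TAB> file <TAB> tf <TAB> df

    flush() {
        # Print every buffered row for the finished term with its final df
        # appended, then print the most recent row for that term.
        printf '%b' "$buffer" | while IFS=$'\t' read -r bterm bfile btf; do
            printf '%s\t%s\t%s\t%s\n' "$bterm" "$bfile" "$btf" "$currdf"
        done
        printf '%s\t%s\t%s\t%s\n' "$currterm" "$currfile" "$currtf" "$currdf"
    }

    # 1. Prime the loop with the first record.
    read -r currterm currfile currtf currdf
    buffer=""

    # 2. Loop through the rest of the input.
    while read -r term file tf df; do
        if [[ "$term" == "$currterm" ]]; then
            # 3a. Same term: buffer the previous row and bump the doc frequency.
            buffer+="${currterm}\t${currfile}\t${currtf}\n"
            currfile="$file"; currtf="$tf"
            currdf=$((currdf + df))
        else
            # 3b. New term: flush what we have and start over.
            flush
            buffer=""
            currterm="$term"; currfile="$file"; currtf="$tf"; currdf="$df"
        fi
    done

    # 4. Print the final buffer and the final line of the file.
    flush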

Alright, let’s slog through it.

  1. Just like our term frequency example, we’re going to read the first line of the file in as the variables “currterm”, “currfile”, “currtf”, and “currdf”.
  2. Loop through the rest of the file with the variables “term”, “file”, “tf”, and “df”.
  3. Remember that the “term” is the only key for the input–the rest count as values. As such, we check to see if the newest value of “term” equals the last one stored in “currterm”.
    • If matched
      1. Increment our document frequency (df) by the loop value (always 1 in this case)
      2. Add the term, file, and tf to a buffer so we can print it out later (very important)
    • If not matched (new term)
      1. Print the buffer, appending the term’s total document frequency (accumulated while incrementing above) to the end of each line.
      2. Print the most recent term, file, term frequency, and document frequency (the curr* values).
      3. Reset the buffer and set all the curr* variables to the latest variable value to begin again.
  4. Print out the final buffer and final line of the file.

Trust me, it was more painful to write than to read. One of the tougher parts about Hadoop Streaming is that you are responsible for maintaining the state and scope of the keys, as opposed to Java, where it’s done for you. Beyond my bash shortcomings, I had problems early on with this because I was using the wrong key in my conditions; it is absolutely vital that you keep track of the key in your reducer calculations. Let’s see how it looks with a bash test:
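Locally the whole pipeline can be simulated by wedging a sort between the two scripts (same assumed file names as before):

    $ cat crane_tf.txt | ./mapdf.sh | sort | ./reddf.sh | head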

Unfortunately not much to see here (and I don’t want to paste the whole thing in the interest of space), but at least it is correct. Remember, the fields here are term, file, tf (frequency of the term within the file), and df (frequency of the term across all files). Those terms only appear once in their associated file and in the document set overall. Thank you Stephen Crane for your uniqueness.

Time to Hadoop

Now that the mapper and reducer are done, here’s the command we will use to process it through MapReduce:
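Roughly, it looks like this (the streaming jar path and the mapper name mapdf.sh are assumptions; adjust both for your setup):

    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=1 \
    -input crane_out -output crane_out2 \
    -file mapdf.sh -mapper mapdf.sh \
    -file reddf.sh -reducer reddf.sh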

Remember, the backslashes are only there to indicate that this is a multi-line command; if you type it all on one line you don’t need them. Also note that stream.num.map.output.key.fields is set to 1 here, as the output from the Mapper has only one column for the key: term. This is important because the shuffle and sort phase needs to sort on that key. The input location is the output of the last job (crane_out/) and the output is a new directory (which must not exist) called crane_out2/.

So let’s run it and see what happens!

Looks good! Well…completed at least. Let’s take a look at the output.
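Something like this shows the first few rows (the part file name is a guess; hadoop fs -ls crane_out2 will show what the job actually wrote):

    $ hadoop fs -cat crane_out2/part-00000 | head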

Beautiful! Each word/file combination now has an associated term frequency and document frequency. Simple checks against your source data with ‘grep’ can determine if it’s correct or not. For example, take a look at the word ‘from’:
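A grep against the job output pulls out every row for it:

    $ hadoop fs -cat crane_out2/part-* | grep -w "^from"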

According to my MapReduce job, the word ‘from’ has a document frequency of 4 because it appears in 4 different files. In ‘a_spirit_sped.txt’ it has a term frequency of 2 (appears twice) and in the others it has a term frequency of 1 (appears once). Let’s see if that’s right.
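A rough hand-check, assuming local copies of the eight poems in the current directory (the -i is just to be safe about case):

    $ grep -owi "from" *.txt | sort | uniq -c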

Looks good to me! I think we’re set for the day.

Conclusion

In Part 1, we completed a MapReduce job to calculate the term frequency of words within documents. In this part, we completed a MapReduce job that works through that output and appends the document frequency for each term, i.e. the number of documents the term appears in. Both of these numbers are critical for the final calculation in the next article, which will cover Term Frequency/Inverse Document Frequency (TF-IDF). Stay tuned!


3 Responses to “MapReduce with Hadoop Streaming in bash – Part 2”


  1. Daniel says:

    Since your map script really doesn’t do anything, you’d probably be better off using the identity mapper and just always assuming df=1 in the reducer (which is true). It saves you a script, and your command becomes:

    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=1 -input crane_out -output crane_out2 \
    -file reddf.sh -reducer reddf.sh \
    -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat

    The -inputformat is because the identity mapper spits out whatever it’s handed, and the default input format hands it keys that are type Long. The streaming reducer expects Text keys, and unhappiness ensues. The KeyValueTextInputFormat spits out text keys, as the reducer expects.

  2. Steve Karam says:

    Thank you for pointing that out, Daniel. That makes a lot of sense; not sure why I didn’t think to use the IdentityMapper. Since you brought it up, though, I have a question for you…I tried to use “org.apache.hadoop.mapred.lib.IdentityReducer” in Oozie (through Hue) and it didn’t like that, so I wrote my own IdentityReducer. Do you know if I missed something there, or is it not supported?

  3. Off the top of my head, no idea. My Oozie experience is limited.
