Wednesday, January 6, 2010

TF-IDF in Hadoop, Part 2: Word Counts for Docs

The TF-IDF algorithm can be implemented in different ways. The Cloudera Hadoop training breaks the implementation into a series of separate Jobs, and I decided to persist the intermediate data of each step before running the next one. This part documents the implementation of Job 2, the second part of my experiments with Hadoop.

Part 1 produced the word frequency for each document in the given input path, persisted in the "1-word-freq" output directory, as shown below:

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat 1-word-freq/part-r-00000 | less
...
therefore@all-shakespeare 652
therefore@leornardo-davinci-all.txt 124
therefore@the-outline-of-science-vol1.txt 36

The definition of Job 2 takes the structure of this data into account in the Mapper and Reducer classes.

Job 2: Word Counts for Docs

The goal of this job is to count the total number of words in each document, so that each word's count can later be compared with that total. I tried to use a non-default InputFormat so the values could be read back with the same types they were saved with (Text, IntWritable), but I couldn't find examples of it, so I will keep it simple and use the same default InputFormat as before. Following the same conventions as in Part 1, the specification of the Map and Reduce phases is as follows:
  • Map:
    • Input: ((word@document), n)
    • Re-arrange the pair so that the document becomes the key
    • Output: (document, word=n)
  • Reducer:
    • N = totalWordsInDoc = sum of n over all [word=n] values of a document
    • Output: ((word@document), (n/N))
Note that the input format of the mapper is the output format of the previous job. The delimiters "@" and "/" were picked arbitrarily to represent the intent of the data, so feel free to use anything you prefer. The reducer just needs to sum the word counts of each document and pass that total on to the next step, together with each word's individual count, since both are needed later.
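To make the re-keying concrete, here is the path of one record through Job 2, using values that appear in the listings of this post (the document total of 738781 comes from the final output shown at the end):

Job 1 output / mapper input:    therefore@all-shakespeare    652
Mapper output:                  (all-shakespeare, therefore=652)
Reducer, key all-shakespeare:   N = sum of all n values = 738781
Reducer output:                 (therefore@all-shakespeare, 652/738781)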

I also learned that the Iterable passed to the Reducer's reduce() method can't be iterated more than once: the second for-each loop simply never entered, so I cached the values in a temporary map during the single pass.
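The snippet below is a minimal standalone illustration of that behavior (plain Java, not Hadoop code): an Iterable backed by a single, one-shot iterator can only be walked once, which is how the values handed to reduce() behave.

import java.util.Arrays;
import java.util.Iterator;

public class OneShotIterableDemo {
    public static void main(String[] args) {
        final Iterator<String> backing = Arrays.asList("car=50", "hadoop=15").iterator();
        Iterable<String> values = new Iterable<String>() {
            public Iterator<String> iterator() {
                return backing; // always the same, already-consumed iterator
            }
        };
        for (String v : values) {
            System.out.println("first pass: " + v);  // prints both values
        }
        for (String v : values) {
            System.out.println("second pass: " + v); // never printed: the iterator is exhausted
        }
    }
}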

Job 2, Mapper

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * WordCountsForDocsMapper re-keys each "word@document<TAB>n" line of Job 1's
 * output so that the document becomes the key and "word=n" the value.
 */
public class WordCountsForDocsMapper extends Mapper<LongWritable, Text, Text, Text> {

    public WordCountsForDocsMapper() {
    }

    /**
     * @param key is the byte offset of the current line in the file;
     * @param value is the line from the file
     * @param context
     *
     * PRE-CONDITION: aa@leornardo-davinci-all.txt 1
     *                aaron@all-shakespeare 98
     *                ab@leornardo-davinci-all.txt 3
     *
     * POST-CONDITION: Output <"all-shakespeare", "aaron=98"> pairs
     */
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] wordAndDocCounter = value.toString().split("\t");
        String[] wordAndDoc = wordAndDocCounter[0].split("@");
        context.write(new Text(wordAndDoc[1]), new Text(wordAndDoc[0] + "=" + wordAndDocCounter[1]));
    }
}

Job 2, Mapper Unit Test

I simplified the unit test to verify that the Mapper generates the format needed by the Reducer.

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
* Test cases for the word count mapper.
*/
public class WordCountsForDocsMapperTest extends TestCase {

    private Mapper<LongWritable, Text, Text, Text> mapper;
    private MapDriver<LongWritable, Text, Text, Text> driver;

    @Before
    public void setUp() {
        mapper = new WordCountsForDocsMapper();
        driver = new MapDriver<LongWritable, Text, Text, Text>(mapper);
    }

    @Test
    public void testMultiWords() {
        List<Pair<Text, Text>> out = null;

        try {
            out = driver.withInput(new LongWritable(0), new Text("crazy@all-shakespeare\t25")).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("all-shakespeare"), new Text("crazy=25")));
        assertListEquals(expected, out);
    }
}

Job 2, Reducer

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * WordCountsForDocsReducer sums the word counts of each document and emits,
 * for every word, the value n/N (the word's count over the document total).
 */
public class WordCountsForDocsReducer extends Reducer<Text, Text, Text, Text> {

    public WordCountsForDocsReducer() {
    }

    /**
     * @param key is the key of the mapper (the document name)
     * @param values are all the values aggregated during the mapping phase
     * @param context contains the context of the job run
     *
     * PRE-CONDITION: receive a list of
     * pairs <"a.txt", ["word1=3", "word2=5", "word3=5"]>
     *
     * POST-CONDITION: <"word1@a.txt", "3/13">,
     * <"word2@a.txt", "5/13">
     */
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        int sumOfWordsInDocument = 0;
        Map<String, Integer> tempCounter = new HashMap<String, Integer>();
        // single pass over the values: cache each word count and accumulate the document total
        for (Text val : values) {
            String[] wordCounter = val.toString().split("=");
            tempCounter.put(wordCounter[0], Integer.valueOf(wordCounter[1]));
            sumOfWordsInDocument += Integer.parseInt(wordCounter[1]);
        }
        for (String wordKey : tempCounter.keySet()) {
            context.write(new Text(wordKey + "@" + key.toString()),
                    new Text(tempCounter.get(wordKey) + "/" + sumOfWordsInDocument));
        }
    }
}

Job 2, Reducer Unit Test

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
* Test cases for the reducer of the word counts.
*/
public class WordCountsForDocsReducerTest extends TestCase {

    private Reducer<Text, Text, Text, Text> reducer;
    private ReduceDriver<Text, Text, Text, Text> driver;

    @Before
    public void setUp() {
        reducer = new WordCountsForDocsReducer();
        driver = new ReduceDriver<Text, Text, Text, Text>(reducer);
    }

    @Test
    public void testMultiWords() {
        List<Pair<Text, Text>> out = null;

        try {
            List<Text> values = new ArrayList<Text>();
            values.add(new Text("car=50"));
            values.add(new Text("hadoop=15"));
            values.add(new Text("algorithms=25"));
            out = driver.withInput(new Text("document"), values).run();
        } catch (IOException ioe) {
            fail();
        }

        List<Pair<Text, Text>> expected = new ArrayList<Pair<Text, Text>>();
        expected.add(new Pair<Text, Text>(new Text("car@document"), new Text("50/90")));
        expected.add(new Pair<Text, Text>(new Text("hadoop@document"), new Text("15/90")));
        expected.add(new Pair<Text, Text>(new Text("algorithms@document"), new Text("25/90")));
        assertListEquals(expected, out);
    }
}

Once again, following a test-driven approach, let's run the Mapper and Reducer tests to verify the "correctness" of the generated data. The test suite is updated as follows:

// (c) Copyright 2009 Cloudera, Inc.
// Updated by Marcello de Sales (marcello.dsales@gmail.com)
package index;

import junit.framework.Test;
import junit.framework.TestSuite;

/**
 * All tests for the TF-IDF code.
 *
 * @author aaron
 */
public final class AllTests {

    private AllTests() { }

    public static Test suite() {
        TestSuite suite = new TestSuite("Tests for the TF-IDF algorithm");

        suite.addTestSuite(WordFreqMapperTest.class);
        suite.addTestSuite(WordFreqReducerTest.class);
        suite.addTestSuite(WordCountsForDocsMapperTest.class);
        suite.addTestSuite(WordCountsForDocsReducerTest.class);

        return suite;
    }

}

Run it with the Ant "test" target defined in build.xml:

training@training-vm:~/git/exercises/shakespeare$ ant test
Buildfile: build.xml

compile:
[javac] Compiling 12 source files to /home/training/git/exercises/shakespeare/bin

test:
[junit] Running index.AllTests
[junit] Testsuite: index.AllTests
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.424 sec
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.424 sec
[junit]

BUILD SUCCESSFUL

As in Part 1, it is safer to run the Driver only after the classes have been tested. The driver wires up the mapper and reducer classes (no combiner is set for this job). Also, note that outputKeyClass and outputValueClass must match the types emitted by the Reducer; once again, Hadoop complains when they are different :)

Job 2, Driver

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
* WordCountsInDocuments counts the total number of words in each document and
* produces data with the relative and total number of words for each document.
*/
public class WordCountsInDocuments extends Configured implements Tool {

    // where to put the data in hdfs when we're done
    private static final String OUTPUT_PATH = "2-word-counts";

    // where to read the data from.
    private static final String INPUT_PATH = "1-word-freq";

    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        Job job = new Job(conf, "Words Counts");

        job.setJarByClass(WordCountsInDocuments.class);
        job.setMapperClass(WordCountsForDocsMapper.class);
        job.setReducerClass(WordCountsForDocsReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCountsInDocuments(), args);
        System.exit(res);
    }
}
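One practical note while experimenting: FileOutputFormat refuses to start a job if the output directory already exists, so re-running this driver fails until "2-word-counts" is removed (hadoop fs -rmr 2-word-counts). The lines below are an optional convenience sketch for the top of run(), not part of the original driver; they require an extra import of org.apache.hadoop.fs.FileSystem.

// Optional: wipe a stale output directory so the job can be re-run.
FileSystem fs = FileSystem.get(conf);
if (fs.exists(new Path(OUTPUT_PATH))) {
    fs.delete(new Path(OUTPUT_PATH), true);
}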

The input data is read from the first step's directory, "1-word-freq", and the output is persisted in the directory "2-word-counts", both under the main training directory in HDFS. If you want to take a look at the Ant build and other classes, they are available in my Google Code project. Recompile the project and generate the updated jar with the driver.

training@training-vm:~/git/exercises/shakespeare$ ant
Buildfile: build.xml

compile:
[javac] Compiling 5 source files to /home/training/git/exercises/shakespeare/bin
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.


jar:
[jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar


BUILD SUCCESSFUL
Total time: 1 second

Now, executing the driver...

training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.WordCountsInDocuments
10/01/06 16:28:04 INFO input.FileInputFormat: Total input paths to process : 1
10/01/06 16:28:04 INFO mapred.JobClient: Running job: job_200912301017_0048
10/01/06 16:28:05 INFO mapred.JobClient: map 0% reduce 0%
10/01/06 16:28:12 INFO mapred.JobClient: map 100% reduce 0%
10/01/06 16:28:18 INFO mapred.JobClient: map 100% reduce 100%
10/01/06 16:28:20 INFO mapred.JobClient: Job complete: job_200912301017_0048
10/01/06 16:28:20 INFO mapred.JobClient: Counters: 17
10/01/06 16:28:20 INFO mapred.JobClient: Job Counters
10/01/06 16:28:20 INFO mapred.JobClient: Launched reduce tasks=1
10/01/06 16:28:20 INFO mapred.JobClient: Launched map tasks=1
10/01/06 16:28:20 INFO mapred.JobClient: Data-local map tasks=1
10/01/06 16:28:20 INFO mapred.JobClient: FileSystemCounters
10/01/06 16:28:20 INFO mapred.JobClient: FILE_BYTES_READ=1685803
10/01/06 16:28:20 INFO mapred.JobClient: HDFS_BYTES_READ=1588239
10/01/06 16:28:20 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3371638
10/01/06 16:28:20 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1920431
10/01/06 16:28:20 INFO mapred.JobClient: Map-Reduce Framework
10/01/06 16:28:20 INFO mapred.JobClient: Reduce input groups=0
10/01/06 16:28:20 INFO mapred.JobClient: Combine output records=0
10/01/06 16:28:20 INFO mapred.JobClient: Map input records=48779
10/01/06 16:28:20 INFO mapred.JobClient: Reduce shuffle bytes=1685803
10/01/06 16:28:20 INFO mapred.JobClient: Reduce output records=0
10/01/06 16:28:20 INFO mapred.JobClient: Spilled Records=97558
10/01/06 16:28:20 INFO mapred.JobClient: Map output bytes=1588239
10/01/06 16:28:20 INFO mapred.JobClient: Combine input records=0
10/01/06 16:28:20 INFO mapred.JobClient: Map output records=48779
10/01/06 16:28:20 INFO mapred.JobClient: Reduce input records=48779

Note from the counters that the job shuffles 48,779 map output records (about 1.6 MB) to the reducer. Let's check the result with the hadoop fs -cat command once again and navigate through it. The most important thing to note is that the n/N relation is preserved throughout the results: each word keeps its own count n paired with the total word count N of its document (for example, "therefore" in all-shakespeare appears 652 times out of 738,781 words, a term frequency of roughly 0.00088).

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat 2-word-counts/part-r-00000 | less
....
relished@all-shakespeare 1/738781
therefore@all-shakespeare 652/738781
eastward@all-shakespeare 1/738781
....
irrespective@leornardo-davinci-all.txt 1/149612
ignorance@leornardo-davinci-all.txt 12/149612
drawing@leornardo-davinci-all.txt 174/149612
relief@leornardo-davinci-all.txt 36/149612
...
answer@the-outline-of-science-vol1.txt 25/70650
sleeve@the-outline-of-science-vol1.txt 1/70650
regard@the-outline-of-science-vol1.txt 22/70650

Part 3 will conclude this exercise by combining two different steps. I'm still following Cloudera's original basic tutorial, but using the Hadoop 0.20.1 API. Any suggestions for improvements are welcome:

- How to pipe data between two different jobs? (one possible direction is sketched below)
- How to write a custom InputFormat?
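
For the first question, one direction I might try (a sketch only, nothing wired into this project yet) is to chain the jobs through SequenceFiles instead of plain text, so that this job's mapper would receive the (Text, IntWritable) pairs exactly as Job 1 wrote them, with no string re-parsing. The job1 and job2 variables below are placeholders for the respective drivers:

// Sketch under the assumption that both drivers are adjusted; not part of the current code.
// In Job 1's driver (the word-frequency job of Part 1), write a SequenceFile:
job1.setOutputFormatClass(SequenceFileOutputFormat.class); // org.apache.hadoop.mapreduce.lib.output
job1.setOutputKeyClass(Text.class);           // word@document
job1.setOutputValueClass(IntWritable.class);  // n

// In this job's driver, read the pairs back with their original types:
job2.setInputFormatClass(SequenceFileInputFormat.class);   // org.apache.hadoop.mapreduce.lib.input
// The mapper could then be declared as Mapper<Text, IntWritable, Text, Text>
// and skip the split("\t") step entirely.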

These questions might be answered at the Hadoop training in Sunnyvale on January 19-21, which I'm excited to attend.
