Marcello de Sales

I've been learning Hadoop for the past 15 days and I have found lots of examples of source-code. The basic training offered by Cloudera uses the 0.18 API, as well as the Yahoo developer's tutorial that describe the example of a the Inverted Line Index example. The input of this example is a list of one or more text files containing books, and the output is the index of words appearing on each of the files in the format "", where word is found on a given line of the given fileName at the byte offset given. Although the example works without a problem, I've read documentations about the Pig application where the majority of the warnings are caused by the API change. I'm particularly in favour of clean code without warnings, whenever possible. So, I started dissecting the API and could re-implement the examples using the Hadoop 0.20.1. Furthermore, the MRUnit must also be refactored in order to make use of the new API.

Both the Yahoo Hadoop Tutorial and the Cloudera Basic Training documentation "Writing MapReduce Programs" give the example of the InvertedIndex application. I used the Cloudera VMWare implementation and source-code as a starting point.

The first major change was the inclusion of the mapreduce package, containing the new implementation of the Mapper and Reducer classes, which were Interfaces in the previous APIs in the package "mapred".

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

public class MyMapper extends Mapper {}
public class MyReducer extends Reducer{}

Also, note that these classes use the Java generics capabilities and therefore, the methods "map()" and "reduce()" must follow the convention given in your implementation. Both methods removed the use of the reporter and collector by the use of a Context class, that is a static member class of each of the Mapper and Reducer classes.

protected void map(K key, V value, Mapper.Context context)

protected void reduce(K key, Iterable values, Context context)

Whatever the type of K and V are, they must be used in the implementation. For instance, I used to have an Iterator implementation for the key in the reducer, and the reduce method was never called with the wrong method signature. So, it is important to verify that you're using the Iterable class for the values instead.

Mapper Class

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
* LineIndexMapper Maps each observed word in a line to a (filename@offset) string.
*/
public class LineIndexMapper extends Mapper {

public LineIndexMapper() {
}

/**
* Google's search Stopwords
*/
private static Set googleStopwords;

static {
googleStopwords = new HashSet();
googleStopwords.add("I"); googleStopwords.add("a"); googleStopwords.add("about");
googleStopwords.add("an"); googleStopwords.add("are"); googleStopwords.add("as");
googleStopwords.add("at"); googleStopwords.add("be"); googleStopwords.add("by");
googleStopwords.add("com"); googleStopwords.add("de"); googleStopwords.add("en");
googleStopwords.add("for"); googleStopwords.add("from"); googleStopwords.add("how");
googleStopwords.add("in"); googleStopwords.add("is"); googleStopwords.add("it");
googleStopwords.add("la"); googleStopwords.add("of"); googleStopwords.add("on");
googleStopwords.add("or"); googleStopwords.add("that"); googleStopwords.add("the");
googleStopwords.add("this"); googleStopwords.add("to"); googleStopwords.add("was");
googleStopwords.add("what"); googleStopwords.add("when"); googleStopwords.add("where");
googleStopwords.add("who"); googleStopwords.add("will"); googleStopwords.add("with");
googleStopwords.add("and"); googleStopwords.add("the"); googleStopwords.add("www");
}

/**
* @param key is the byte offset of the current line in the file;
* @param value is the line from the file
* @param output has the method "collect()" to output the key,value pair
* @param reporter allows us to retrieve some information about the job (like the current filename)
*
* POST-CONDITION: Output <"word", "filename@offset"> pairs
*/
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Compile all the words using regex
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(value.toString());

// Get the name of the file from the inputsplit in the context
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

// build the values and write pairs through the context
StringBuilder valueBuilder = new StringBuilder();
while (m.find()) {
String matchedKey = m.group().toLowerCase();
// remove names starting with non letters, digits, considered stopwords or containing other chars
if (!Character.isLetter(matchedKey.charAt(0)) || Character.isDigit(matchedKey.charAt(0))
|| googleStopwords.contains(matchedKey) || matchedKey.contains("_")) {
continue;
}
valueBuilder.append(fileName);
valueBuilder.append("@");
valueBuilder.append(key.get());
// emit the partial
context.write(new Text(matchedKey), new Text(valueBuilder.toString()));
valueBuilder.setLength(0);
}
}
}

Reducer Class

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
* LineIndexReducer Takes a list of filename@offset entries for a single word and concatenates them into a list.
*/
public class LineIndexReducer extends Reducer {

public LineIndexReducer() {
}

/**
* @param key is the key of the mapper
* @param values are all the values aggregated during the mapping phase
* @param context contains the context of the job run
*
* PRE-CONDITION: receive a list of <"word", "filename@offset"> pairs
* <"marcello", ["a.txt@3345", "b.txt@344", "c.txt@785"]>
*
* POST-CONDITION: emit the output a single key-value where all the file names
* are separated by a comma ",".
* <"marcello", "a.txt@3345,b.txt@344,c.txt@785">
*/
protected void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException {
StringBuilder valueBuilder = new StringBuilder();

for (Text val : values) {
valueBuilder.append(val);
valueBuilder.append(",");
}
//write the key and the adjusted value (removing the last comma)
context.write(key, new Text(valueBuilder.substring(0, valueBuilder.length() - 1)));
valueBuilder.setLength(0);
}
}

These are the changes necessary for the Mapper and Reducer classes, without the need to extend the base classes. In order to unit test these classes, changes on the MRUnit are also necessary. The drivers were also added a new "mapreduce" package with the same counterparts.

Instead of the mrunit.MapDriver, use the mapreduce.MapDriver. The same for the Reducer class. The rest of the code is just the same.

~~import org.apache.hadoop.mrunit.MapDriver;~~

import org.apache.hadoop.mrunit.mapreduce.MapDriver;

JUnit's MapperTest

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mock.MockInputSplit;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
* Test cases for the inverted index mapper.
*/
public class MapperTest extends TestCase {

private Mapper mapper;
private MapDriver driver;

/** We expect pathname@offset for the key from each of these */
private final Text EXPECTED_OFFSET = new Text(MockInputSplit.getMockPath().toString() + "@0");

@Before
public void setUp() {
mapper = new LineIndexMapper();
driver = new MapDriver(mapper);
}

@Test
public void testEmpty() {
List> out = null;

try {
out = driver.withInput(new LongWritable(0), new Text("")).run();
} catch (IOException ioe) {
fail();
}

List> expected = new ArrayList>();

assertListEquals(expected, out);
}

@Test
public void testOneWord() {
List> out = null;

try {
out = driver.withInput(new LongWritable(0), new Text("foo")).run();
} catch (IOException ioe) {
fail();
}

List> expected = new ArrayList>();
expected.add(new Pair(new Text("foo"), EXPECTED_OFFSET));

assertListEquals(expected, out);
}

@Test
public void testMultiWords() {
List> out = null;

try {
out = driver.withInput(new LongWritable(0), new Text("foo bar baz!!!! ????")).run();
} catch (IOException ioe) {
fail();
}

List> expected = new ArrayList>();
expected.add(new Pair(new Text("foo"), EXPECTED_OFFSET));
expected.add(new Pair(new Text("bar"), EXPECTED_OFFSET));
expected.add(new Pair(new Text("baz"), EXPECTED_OFFSET));

assertListEquals(expected, out);
}
}

JUnit's ReducerTest

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)

package index;

import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

/**
* Test cases for the inverted index reducer.
*/
public class ReducerTest extends TestCase {

private Reducer reducer;
private ReduceDriver driver;

@Before
public void setUp() {
reducer = new LineIndexReducer();
driver = new ReduceDriver(reducer);
}

@Test
public void testOneOffset() {
List> out = null;

try {
out = driver.withInputKey(new Text("word")).withInputValue(new Text("offset")).run();
} catch (IOException ioe) {
fail();
}

List> expected = new ArrayList>();
expected.add(new Pair(new Text("word"), new Text("offset")));

assertListEquals(expected, out);
}

@Test
public void testMultiOffset() {
List> out = null;

try {
out = driver.withInputKey(new Text("word")).withInputValue(new Text("offset1")).withInputValue(
new Text("offset2")).run();
} catch (IOException ioe) {
fail();
}

List> expected = new ArrayList>();
expected.add(new Pair(new Text("word"), new Text("offset1,offset2")));

assertListEquals(expected, out);
}

}

You can test them using the command "ant test" on the source-code directory as usual to confirm that the implementation is correct:

training@training-vm:~/git/exercises/shakespeare$ ant test
Buildfile: build.xml

compile:
[javac] Compiling 4 source files to /home/training/git/exercises/shakespeare/bin

test:
[junit] Running index.AllTests
[junit] Testsuite: index.AllTests
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.418 sec
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.418 sec
[junit]

BUILD SUCCESSFUL
Total time: 2 seconds

Replacing JobConf and other deprecated classes

Other changes related to the API is on the configuration of the execution of the jobs. The class "JobConf" was deprecated, but most of the tutorials have not been updated. So, here's the updated version of the main example driver using the Configuration and Context classes. Note that the job is configured and executed with the default version of the configuration. It is the class responsible for configuring the execution of the tasks. Once again, the replacement of the classes located at the package "mapred" is important, since the new classes are located at the package "mapreduce".

InvertedIndex driver

// (c) Copyright 2009 Cloudera, Inc.
// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)
package index;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
* LineIndexer Creates an inverted index over all the words in a document corpus, mapping each observed word to a list
* of filename@offset locations where it occurs.
*/
public class LineIndexer extends Configured implements Tool {

// where to put the data in hdfs when we're done
private static final String OUTPUT_PATH = "output";

// where to read the data from.
private static final String INPUT_PATH = "input";

public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new LineIndexer(), args);
System.exit(res);
}

public int run(String[] args) throws Exception {

Configuration conf = getConf();
Job job = new Job(conf, "Line Indexer 1");

job.setJarByClass(LineIndexer.class);
job.setMapperClass(LineIndexMapper.class);
job.setReducerClass(LineIndexReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

return job.waitForCompletion(true) ? 0 : 1;
}
}

After updating, make sure to run generate a new jar, remove anything under the directory "output" (since the program does not clean that up), and execute the new version.

training@training-vm:~/git/exercises/shakespeare$ ant jar
Buildfile: build.xml

compile:
[javac] Compiling 4 source files to /home/training/git/exercises/shakespeare/bin

jar:
[jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar

BUILD SUCCESSFUL
Total time: 1 second

I have added 2 ASCII books in the input directory: the works from Leonardo Da Vinci and the first volume of the book "The outline of science".

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -ls input
Found 3 items
-rw-r--r-- 1 training supergroup 5342761 2009-12-30 11:57 /user/training/input/all-shakespeare
-rw-r--r-- 1 training supergroup 1427769 2010-01-04 17:42 /user/training/input/leornardo-davinci-all.txt
-rw-r--r-- 1 training supergroup 674762 2010-01-04 17:42 /user/training/input/the-outline-of-science-vol1.txt

The execution and output of running this example is shown as follows.

training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.LineIndexer
10/01/04 21:11:55 INFO input.FileInputFormat: Total input paths to process : 3
10/01/04 21:11:56 INFO mapred.JobClient: Running job: job_200912301017_0017
10/01/04 21:11:57 INFO mapred.JobClient: map 0% reduce 0%
10/01/04 21:12:07 INFO mapred.JobClient: map 33% reduce 0%
10/01/04 21:12:10 INFO mapred.JobClient: map 58% reduce 0%
10/01/04 21:12:13 INFO mapred.JobClient: map 63% reduce 0%
10/01/04 21:12:16 INFO mapred.JobClient: map 100% reduce 11%
10/01/04 21:12:28 INFO mapred.JobClient: map 100% reduce 77%
10/01/04 21:12:34 INFO mapred.JobClient: map 100% reduce 100%
10/01/04 21:12:36 INFO mapred.JobClient: Job complete: job_200912301017_0017
10/01/04 21:12:36 INFO mapred.JobClient: Counters: 17
10/01/04 21:12:36 INFO mapred.JobClient: Job Counters
10/01/04 21:12:36 INFO mapred.JobClient: Launched reduce tasks=1
10/01/04 21:12:36 INFO mapred.JobClient: Launched map tasks=3
10/01/04 21:12:36 INFO mapred.JobClient: Data-local map tasks=3
10/01/04 21:12:36 INFO mapred.JobClient: FileSystemCounters
10/01/04 21:12:36 INFO mapred.JobClient: FILE_BYTES_READ=58068623
10/01/04 21:12:36 INFO mapred.JobClient: HDFS_BYTES_READ=7445292
10/01/04 21:12:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=92132872
10/01/04 21:12:36 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=26638259
10/01/04 21:12:36 INFO mapred.JobClient: Map-Reduce Framework
10/01/04 21:12:36 INFO mapred.JobClient: Reduce input groups=0
10/01/04 21:12:36 INFO mapred.JobClient: Combine output records=0
10/01/04 21:12:36 INFO mapred.JobClient: Map input records=220255
10/01/04 21:12:36 INFO mapred.JobClient: Reduce shuffle bytes=34064153
10/01/04 21:12:36 INFO mapred.JobClient: Reduce output records=0
10/01/04 21:12:36 INFO mapred.JobClient: Spilled Records=2762272
10/01/04 21:12:36 INFO mapred.JobClient: Map output bytes=32068217
10/01/04 21:12:36 INFO mapred.JobClient: Combine input records=0
10/01/04 21:12:36 INFO mapred.JobClient: Map output records=997959
10/01/04 21:12:36 INFO mapred.JobClient: Reduce input records=997959

The index entry for the word "abandoned" is an example of one present in all of the books:

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat output/part-r-00000 | less

...

abandoned leornardo-davinci-all.txt@1257995,leornardo-davinci-all.txt@652992,all-shakespeare@4657862,all-shakespeare@738818,the-outline-of-science-vol1.txt@642211,the-outline-of-science-vol1.txt@606442,the-outline-of-science-vol1.txt@641585
...
...

VirtualBox 3.0.2 (Ubuntu 9.04 Host + Ubuntu & Windows Guests)

Problem: to have a cluster of machines to work for development using VirtualBox guests. That is, I want to be able to have different operating systems running on my own development box, having each of the machines sharing the same Internet connection (my main one) and acquiring IP address from the same DHCP server (aka wireless router, or whatever...)

Description:

Finally, something to make all of us happy! VirtualBox 3.0.2 delivers smooth support to shared Nat and Subnetwork outside-the-box!!! This has always been a dream of mine to use VirtualBox for such agility of having as much development machines as I can (your host memory).

What I could accomplished tonight was the setup of a shared subnetwork between my host Ubuntu 9.04 and my guests on Ubuntu 9.04 and Windows XP SP2. Everything I did was to setup network cards from each of my guests to use the network interface as a Bridged Adapter, and choose the one I have Internet connectivity. By the way, the connectivity I am using is my wireless card.

This can help anyone who needs different boxes and installations of machines to develop and test applications that depend on the network, shared resources, etc. For example, I need to build a Sensor Network that sends collected data to the main data center as my thesis simplest workspace. On a single box, my development environment is contains source-code for both the Sensors and the Main node. However, with the use of Subversion and SCP I can transfer artifacts over between the boxes and still test the transport layer of my sensor network.

Setup:

To better describe what I have, here's the description of each member of my network made with VirtualBox 3.0.2:

* Install VirtualBox 3.0.2 from http://www.virtualbox.org/wiki/Downloads

* Host: Ubuntu 9.04 on an IBM ThinkPad T42, with 2GB (yes, my personal HP with 4GB recently burned out... NEVER BUY HP LAPTOPS).

mdesales@mdesales-laptop:~$ ifconfig

eth0 Link encap:Ethernet HWaddr 00:11:25:81:48:6d

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

RX packets:0 errors:0 dropped:0 overruns:0 frame:0

TX packets:0 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:1000

RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

eth1 Link encap:Ethernet HWaddr 00:0e:35:c6:5a:0f

inet addr:192.168.0.102 Bcast:192.168.0.255 Mask:255.255.255.0

inet6 addr: fe80::20e:35ff:fec6:5a0f/64 Scope:Link

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

RX packets:583718 errors:235 dropped:235 overruns:0 frame:0

TX packets:316625 errors:0 dropped:5 overruns:0 carrier:0

collisions:0 txqueuelen:1000

RX bytes:763972253 (763.9 MB) TX bytes:30157656 (30.1 MB)

Interrupt:11 Base address:0x4000 Memory:c0214000-c0214fff

Note that you will need to know which device you currently have Internet connectivity. In my case, as you can see, I have it through the eth1. I acquire IPs through my private Wireless Router DHCP, which is connected to the Cable modem, in the range of 192.168.0.100 - 192.168.0.199.

* Guest 1: Ubuntu 9.04 Desktop

* Guest 2: Ubuntu 9.04 Mini

* Guest 3: Windows XP SP3

For each of them, I just configured the network device to use the Bridged Interface, and choosing a correct device. To be clear, you want to choose the one you currently have Internet connection. In my case, I have on the eth1, which happens to be my wireless card. However yours can differ!

* Environment Evaluation: I'd like to ping, ssh, share resources, build a mysql cluster, hadoop grid, test my grails app, write Android App... The dream I have always had is to have my own "cloud" of pre-defined VirtualBox with the profiles of what I usually need. I must say that I can finally rest my mind! I could finally make it work...

I have mixed up 2 different projects: my thesis boxes and a Software Engineering research developed in 2 different environments (that's why I love keeping my VirtualBox hard-disks in my backup storage for later reuse). Most of people who have tried integrating guests could only PING a machine, but not SSH, samba, etc. I got frustrated 3 or 4 months ago about the same situation and I had to settle. Anyway, the truth is that tonight I got it working without any voodoo command to turn off the firewall, to configure different tunnel... I just wanted it working... See how many people were having related problems here (http://www.savvyadmin.com/virtualbox-host-interface-networking-with-nat/)

Let's see each machine IP address

* Guest 2: Ubuntu 9.04 Mini: 192.168.0.100

* Guest 3: Windows XP SP3: 192.168.0.103. Note that the ping is from the Guest 2 to the Guest 3 machine!!! I had never accomplished this with previous versions of VirtualBox without reading the documentation for hours...

* Pinging the host from the Guest 2: this step I could reproduce using the voodoo steps described on other pages. Here's an example of the guest 2 pinging the host.

* SSH from among the machines: one HUGE step was that, at this time, we can definitely SSH from each of the machines. I just installed Ubuntu with openSSH server in each of them. That's ALL!!! I'm making an SSH connection from my machine to the Guest 2.

This is the SSH session between the Guest 2 and the Host.

The SSL connection between the hosts is done. In this screen, the user ssh from the Guest 3 (Windows) to the Guest 2 (Ubuntu).

Note that the window at far-most of the screenshot is the Host.

After testing all these features, I wanted to verify if the communication is done without no changes.

I was lucky to find out that the everything works outside-the-box, including Internet connection!!!!

Marcello de Sales

Tuesday, January 5, 2010

Hadoop 0.20.1 API: refactoring the InvertedLine example, removing deprecated classes (JobConf, others)

Thursday, December 31, 2009

Hadoop, mongoDB: MapReduce tools

Thursday, December 17, 2009

Finished... 23 Months Gone!!! Got my M.S. degree

Monday, November 16, 2009

Presentation of a Peer-Reviewed Paper...

Thursday, July 30, 2009

Building a Clustered Shared SubNetwork with VirtualBox 3.0.2 (Ubuntu 9.04 Host + Ubuntu 9.04 and Windows XP Guests)

Monday, July 28, 2008

Linux Magazine: "Discovering DCCP" article I contributed finally published

Friday, April 25, 2008

No Eclipse crash after FRESH INSTALLATION OF UBUNTU 8.04 64bits (sharing with Ubuntu 7.10 and Vista)

StartupCTO - Helping Small Teams Develop Great Software

About Me

Facebook Badge

FEEDJIT Live Traffic Map

Blog Archive