<span class="Apple-style-span" style="font-size:x-large;">TF-IDF in Hadoop Part 2: Word Counts For Docs</span>
<br />Marcello de Sales, 2010-01-06
<br />
<br />The TF-IDF algorithm can be implemented in different ways. The Cloudera Hadoop training splits the implementation into a series of steps, each carried out by a different Job. I decided to take the approach of persisting the intermediate data of each step before executing the subsequent one. This part documents the implementation of Job 2, the second part of my experiments with Hadoop.
<br />
<br />Part 1 produced the word frequency for each document in the given input path, persisted in the "1-word-freq" output directory, as shown below:
<br />
<br /><span style=";font-family:'Courier New',Courier,monospace;font-size:small;">training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat 1-word-freq/part-r-00000 | less</span>
<br /><span style="font-family:'Courier New',Courier,monospace;">...</span>
<br /><span style="font-family:'Courier New',Courier,monospace;">therefore@all-shakespeare 652</span>
<br /><span style="font-family:'Courier New',Courier,monospace;">therefore@leornardo-davinci-all.txt 124</span>
<br /><span style="font-family:'Courier New',Courier,monospace;">therefore@the-outline-of-science-vol1.txt 36</span>
<br />
<br />The definition of Job 2 will take into account the structure of this data in the creation of the Mapper and Reducer classes.
<br />
<br /><span class="Apple-style-span" style="font-size:x-large;">Job 2: Word Counts for Docs</span>
<br />
<br />The goal of this job is to count the total number of words in each document, so that each word's count can later be compared with the document's total. I tried to implement a custom InputFormat so the values could be read in the same format they were saved, (Text, IntWritable), but I couldn't find any examples of that, so I will keep it simple and use the same default InputFormat as before. Following the same definition as in Part 1, the specifications of the Map and Reduce are as follows:
<br /><ul><li>Map:</li><ul><li>Input: ((word@document), n)</li><li>Re-arrange the key so that it is based on each document</li><li>Output: (document, word=n)</li></ul><li>Reducer:</li><ul><li>N = totalWordsInDoc = sum of n over all (word=n) pairs of the document</li><li>Output: ((word@document), (n/N))</li></ul></ul>Note that the input format of this mapper is the output format of the previous job. The delimiters "@" and "/" were picked arbitrarily to better represent the intent of the data, so feel free to use anything you prefer. The reducer just needs to sum the counts of all the words in a document and emit that total for the next step, along with each word's individual count.<div>
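To make the specification above concrete, here is a plain-Java simulation of Job 2 outside Hadoop. The class and method names are mine, for illustration only:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of Job 2 (no Hadoop required). Input lines come in
// Job 1's "word@document TAB n" format; output lines are "word@document TAB n/N".
public class Job2Simulation {
    static List<String> run(List<String> lines) {
        // Map phase: re-key by document, value becomes "word=n"
        Map<String, List<String>> byDoc = new HashMap<String, List<String>>();
        for (String line : lines) {
            String[] wordAndDocCounter = line.split("\t");
            String[] wordAndDoc = wordAndDocCounter[0].split("@");
            String doc = wordAndDoc[1];
            if (!byDoc.containsKey(doc)) byDoc.put(doc, new ArrayList<String>());
            byDoc.get(doc).add(wordAndDoc[0] + "=" + wordAndDocCounter[1]);
        }
        // Reduce phase: N = sum of n for the document, emit word@doc -> n/N
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, List<String>> e : byDoc.entrySet()) {
            int total = 0;
            for (String wc : e.getValue()) total += Integer.parseInt(wc.split("=")[1]);
            for (String wc : e.getValue()) {
                String[] parts = wc.split("=");
                out.add(parts[0] + "@" + e.getKey() + "\t" + parts[1] + "/" + total);
            }
        }
        return out;
    }
}
```

Running it on three counts for a single document "a.txt" (3, 5 and 5 words) yields the relative counts 3/13, 5/13 and 5/13, which is exactly the shape the real Mapper and Reducer below produce.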
<br /></div><div>I have learned that the Iterable of values passed to the Reducer class can't be iterated more than once: with two consecutive foreach loops, the second loop was never entered, so I implemented the logic using a temporary map.</div><div>
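To see why the temporary map is needed, here is a minimal plain-Java illustration (the names are mine, not Hadoop API) of an Iterable backed by a single iterator, which is how the reducer's values behave:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of the single-pass problem: an Iterable backed by one Iterator
// can be consumed only once; a second foreach sees nothing.
public class OneShotIterableDemo {
    static Iterable<String> oneShot(String... items) {
        final Iterator<String> it = Arrays.asList(items).iterator();
        return new Iterable<String>() {
            public Iterator<String> iterator() { return it; } // always the same iterator
        };
    }

    public static void main(String[] args) {
        Iterable<String> values = oneShot("word1=3", "word2=5", "word3=5");

        // Workaround: cache the values in a map during the first (and only)
        // pass, then iterate the map as many times as needed.
        Map<String, Integer> tempCounter = new HashMap<String, Integer>();
        int sum = 0;
        for (String val : values) {
            String[] wordCounter = val.split("=");
            tempCounter.put(wordCounter[0], Integer.valueOf(wordCounter[1]));
            sum += Integer.parseInt(wordCounter[1]);
        }
        System.out.println(sum); // 13

        int secondPass = 0;
        for (String val : values) { secondPass++; } // iterator already exhausted
        System.out.println(secondPass); // 0
    }
}
```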
<br /></div><div><span class="Apple-style-span" style="font-size:x-large;">Job2, Mapper</span></div><div><span class="Apple-style-span" style="font-size:medium;">
<br /></span></div><div><span class="Apple-style-span" style="font-size:medium;"><div><span class="Apple-style-span" style="font-family:'courier new';">// (c) Copyright 2009 Cloudera, Inc.</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></div><div><span class="Apple-style-span" style="font-family:'courier new';">package index;</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></div><div><span class="Apple-style-span" style="font-family:'courier new';">import java.io.IOException;</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></div><div><span class="Apple-style-span" style="font-family:'courier new';">import org.apache.hadoop.io.LongWritable;</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">import org.apache.hadoop.io.Text;</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">import org.apache.hadoop.mapreduce.Mapper;</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></div><div><span class="Apple-style-span" style="font-family:'courier new';">/**</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"> * WordCountsForDocsMapper maps each (word@document, n) input pair to a (document, word=n) pair.</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"> */</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">public class WordCountsForDocsMapper extends Mapper&lt;LongWritable, Text, Text, Text&gt; {</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"> </span></div><div><span class="Apple-style-span" style="font-family:'courier new';">    public WordCountsForDocsMapper() {</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">    }</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"> </span></div><div><span class="Apple-style-span" style="font-family:'courier new';">    /**</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">     * @param key is the byte offset of the current line in the file;</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">     * @param value is the line from the file</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">     * @param context</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">     *</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">     * PRE-CONDITION: aa@leornardo-davinci-all.txt 1</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">     *                aaron@all-shakespeare 98</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">     *                ab@leornardo-davinci-all.txt 3</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">     *</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">     * POST-CONDITION: Output &lt;"all-shakespeare", "aaron=98"&gt; pairs</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">     */</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">        String[] wordAndDocCounter = value.toString().split("\t");</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">        String[] wordAndDoc = wordAndDocCounter[0].split("@");</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">        context.write(new Text(wordAndDoc[1]), new Text(wordAndDoc[0] + "=" + wordAndDocCounter[1]));</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">    }</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">}</span></div><div>
<br /></div><div><span class="Apple-style-span" style="font-size:x-large;">Job2, Mapper Unit Test</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></div><div><span class="Apple-style-span" style="font-family:'times new roman';">I have simplified the unit test to verify that the Mapper generates the format needed by the Reducer.</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></div><div><span class="Apple-style-span" style="font-size:x-large;"><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">// (c) Copyright 2009 Cloudera, Inc.</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">package index;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import java.io.IOException;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import java.util.ArrayList;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import java.util.List;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import junit.framework.TestCase;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import org.apache.hadoop.io.LongWritable;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import org.apache.hadoop.io.Text;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import org.apache.hadoop.mapreduce.Mapper;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import org.apache.hadoop.mrunit.mapreduce.MapDriver;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import org.apache.hadoop.mrunit.types.Pair;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import org.junit.Before;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import org.junit.Test;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">/**</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> * Test cases for the word count mapper.</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> */</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">public class WordCountsForDocsMapperTest extends TestCase {</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">    private Mapper&lt;LongWritable, Text, Text, Text&gt; mapper;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">    private MapDriver&lt;LongWritable, Text, Text, Text&gt; driver;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">    @Before</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">    public void setUp() {</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">        mapper = new WordCountsForDocsMapper();</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">        driver = new MapDriver&lt;LongWritable, Text, Text, Text&gt;(mapper);</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">    }</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">    @Test</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">    public void testMultiWords() {</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">        List&lt;Pair&lt;Text, Text&gt;&gt; out = null;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> try {</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> out = driver.withInput(new LongWritable(0), new Text("crazy@all-shakespeare\t25")).run();</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> } catch (IOException ioe) {</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> fail();</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> }</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">        List&lt;Pair&lt;Text, Text&gt;&gt; expected = new ArrayList&lt;Pair&lt;Text, Text&gt;&gt;();</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">        expected.add(new Pair&lt;Text, Text&gt;(new Text("all-shakespeare"), new Text("crazy=25")));</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">        assertListEquals(expected, out);</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">    }</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">}</span></span></div><div>
<br /></div><div><span class="Apple-style-span" style="font-size:x-large;">Job 2, Reducer</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-size:medium;">
<br /></span></span></div><div><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-size:medium;"><div>// (c) Copyright 2009 Cloudera, Inc.</div><div>// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)</div><div>
<br /></div><div>package index;</div><div>
<br /></div><div>import java.io.IOException;</div><div>import java.util.HashMap;</div><div>import java.util.Map;</div><div>
<br /></div><div>import org.apache.hadoop.io.Text;</div><div>import org.apache.hadoop.mapreduce.Reducer;</div><div>
<br /></div><div>/**</div><div> * WordCountsForDocsReducer sums the word counts of each document and emits,</div><div> * for every word, its count along with the document's total.</div><div> */</div><div>public class WordCountsForDocsReducer extends Reducer&lt;Text, Text, Text, Text&gt; {</div><div>
<br /></div><div> public WordCountsForDocsReducer() {</div><div> }</div><div>
<br /></div><div>    /**</div><div>     * @param key is the key of the mapper</div><div>     * @param values are all the values aggregated during the mapping phase</div><div>     * @param context contains the context of the job run</div><div>     *</div><div>     * PRE-CONDITION: receive a list of &lt;document, [word=n, word=n, ...]&gt;</div><div>     *    pairs &lt;"a.txt", ["word1=3", "word2=5", "word3=5"]&gt;</div><div>     *</div><div>     * POST-CONDITION: &lt;"word1@a.txt", "3/13"&gt;,</div><div>     *    &lt;"word2@a.txt", "5/13"&gt;</div><div>     */</div><div>    protected void reduce(Text key, Iterable&lt;Text&gt; values, Context context) throws IOException, InterruptedException {</div><div>        int sumOfWordsInDocument = 0;</div><div>        Map&lt;String, Integer&gt; tempCounter = new HashMap&lt;String, Integer&gt;();</div><div>        for (Text val : values) {</div><div>            String[] wordCounter = val.toString().split("=");</div><div>            tempCounter.put(wordCounter[0], Integer.valueOf(wordCounter[1]));</div><div>            sumOfWordsInDocument += Integer.parseInt(wordCounter[1]);</div><div>        }</div><div>        for (String wordKey : tempCounter.keySet()) {</div><div>            context.write(new Text(wordKey + "@" + key.toString()), new Text(tempCounter.get(wordKey) + "/"</div><div>                    + sumOfWordsInDocument));</div><div>        }</div><div>    }</div><div>}</div><div>
<br /></div></span></span></div><div><span class="Apple-style-span" style="font-size:x-large;">Job 2, Reducer Unit Test</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-size:medium;">
<br /></span></span></div><div><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-size:medium;"><div>// (c) Copyright 2009 Cloudera, Inc.</div><div>// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)</div><div>
<br /></div><div>package index;</div><div>
<br /></div><div>import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;</div><div>
<br /></div><div>import java.io.IOException;</div><div>import java.util.ArrayList;</div><div>import java.util.List;</div><div>
<br /></div><div>import junit.framework.TestCase;</div><div>
<br /></div><div>import org.apache.hadoop.io.Text;</div><div>import org.apache.hadoop.mapreduce.Reducer;</div><div>import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;</div><div>import org.apache.hadoop.mrunit.types.Pair;</div><div>import org.junit.Before;</div><div>import org.junit.Test;</div><div>
<br /></div><div>/**</div><div> * Test cases for the reducer of the word counts.</div><div> */</div><div>public class WordCountsForDocsReducerTest extends TestCase {</div><div>
<br /></div><div>    private Reducer&lt;Text, Text, Text, Text&gt; reducer;</div><div>    private ReduceDriver&lt;Text, Text, Text, Text&gt; driver;</div><div>
<br /></div><div>    @Before</div><div>    public void setUp() {</div><div>        reducer = new WordCountsForDocsReducer();</div><div>        driver = new ReduceDriver&lt;Text, Text, Text, Text&gt;(reducer);</div><div>    }</div><div>
<br /></div><div>    @Test</div><div>    public void testMultiWords() {</div><div>        List&lt;Pair&lt;Text, Text&gt;&gt; out = null;</div><div>
<br /></div><div>        try {</div><div>            List&lt;Text&gt; values = new ArrayList&lt;Text&gt;();</div><div>            values.add(new Text("car=50"));</div><div>            values.add(new Text("hadoop=15"));</div><div>            values.add(new Text("algorithms=25"));</div><div>            out = driver.withInput(new Text("document"), values).run();</div><div>        } catch (IOException ioe) {</div><div>            fail();</div><div>        }</div><div>
<br /></div><div>        List&lt;Pair&lt;Text, Text&gt;&gt; expected = new ArrayList&lt;Pair&lt;Text, Text&gt;&gt;();</div><div>        expected.add(new Pair&lt;Text, Text&gt;(new Text("car@document"), new Text("50/90")));</div><div>        expected.add(new Pair&lt;Text, Text&gt;(new Text("hadoop@document"), new Text("15/90")));</div><div>        expected.add(new Pair&lt;Text, Text&gt;(new Text("algorithms@document"), new Text("25/90")));</div><div>        assertListEquals(expected, out);</div><div>    }</div><div>
<br /></div><div>}</div><div>
<br /></div><div><span class="Apple-style-span" style=" ;font-family:Georgia, serif;font-size:16px;">Once again, following our Test-Driven Development approach, let's run the tests of the Mapper and Reducer classes to verify the correctness of the generated data. The JUnit 4 test suite is updated as follows:</span></div><div><span class="Apple-style-span" style=" ;font-family:Georgia, serif;font-size:16px;">
<br /></span></div><div><span class="Apple-style-span" style=" ;font-family:Georgia, serif;font-size:16px;"><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">// (c) Copyright 2009 Cloudera, Inc.</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">// Updated by Marcello de Sales (marcello.dsales@gmail.com)</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">package index;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import junit.framework.Test;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">import junit.framework.TestSuite;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">/**</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> * All tests for inverted index code</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> *</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> * @author aaron</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> */</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">public final class AllTests {</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> private AllTests() { }</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> public static Test suite() {</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> TestSuite suite = new TestSuite("Tests for the TF-IDF algorithm");</span></span></div><div><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-size:medium;">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> suite.addTestSuite(WordFreqMapperTest.class);</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> suite.addTestSuite(WordFreqReducerTest.class);</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> suite.addTestSuite(WordCountsForDocsMapperTest.class);</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> suite.addTestSuite(WordCountsForDocsReducerTest.class);</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> return suite;</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"> }</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">}</span></span></div></span></div><div><span class="Apple-style-span" style=" ;font-family:Georgia, serif;font-size:16px;">
<br /></span></div><div><span class="Apple-style-span" style="font-family:Georgia, serif;">We can run it all with the ANT test task, defined in the build.xml artifact:</span></div><div><span class="Apple-style-span" style="font-family:Georgia, serif;">
<br /></span></div><div><span class="Apple-style-span" style="font-family:Georgia, serif;"><div><span class="Apple-style-span" style="font-family:'courier new';">training@training-vm:~/git/exercises/shakespeare$ ant test</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">Buildfile: build.xml</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></div><div><span class="Apple-style-span" style="font-family:'courier new';">compile:</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"> [javac] Compiling 12 source files to /home/training/git/exercises/shakespeare/bin</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></div><div><span class="Apple-style-span" style="font-family:'courier new';">test:</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"> [junit] Running index.AllTests</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"> [junit] Testsuite: index.AllTests</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"> [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.424 sec</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"> [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.424 sec</span></div><div><span class="Apple-style-span" style="font-family:'courier new';"> [junit] </span></div><div><span class="Apple-style-span" style="font-family:'courier new';">
<br /></span></div><div><span class="Apple-style-span" style="font-family:'courier new';">BUILD SUCCESSFUL</span></div><div>
<br /></div></span></div></span></span></div><div><span class="Apple-style-span" style=" ;font-size:16px;">As in Part 1, it is safer to proceed to the execution of the Driver with tested classes. The Driver includes the definitions of the mapper and reducer classes, and also defines the combiner class to be the same as the reducer class. <b>Also, note that the outputKeyClass and outputValueClass must be the same as the ones defined by the Reducer class! Once again, Hadoop complains when they are different :)</b></span></div><div><span class="Apple-style-span" style=" ;font-size:16px;"><b>
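The wiring just described boils down to a handful of calls on the Job object. This is only a sketch against the 0.20.1 API: the output directory name "2-word-counts" and the return convention are my assumptions, following the naming pattern of Part 1.

<pre>
// Sketch (assumed names): the output key/value classes must match what the
// Reducer emits (Text, Text), or Hadoop rejects the job.
Configuration conf = getConf();
Job job = new Job(conf, "Word Counts For Docs");
job.setJarByClass(WordCountsInDocuments.class);

job.setMapperClass(WordCountsForDocsMapper.class);
job.setCombinerClass(WordCountsForDocsReducer.class);
job.setReducerClass(WordCountsForDocsReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job, new Path("1-word-freq"));
FileOutputFormat.setOutputPath(job, new Path("2-word-counts"));

return job.waitForCompletion(true) ? 0 : 1;
</pre>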
<br /></b></span></div><div><span class="Apple-style-span" style="font-size:x-large;">Job 2, Driver</span></div><div><span class="Apple-style-span" style=" ;font-size:16px;"><b>
<br /></b></span></div><div><span class="Apple-style-span" style="font-size:130%;"><span class="Apple-style-span" style="font-size:16px;"><b><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">// (c) Copyright 2009 Cloudera, Inc.</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">package index;</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">
<br /></span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">import org.apache.hadoop.conf.Configuration;</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">import org.apache.hadoop.conf.Configured;</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">import org.apache.hadoop.fs.Path;</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">import org.apache.hadoop.io.Text;</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">import org.apache.hadoop.mapreduce.Job;</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span 
class="Apple-style-span" style="font-weight: normal;">import org.apache.hadoop.util.Tool;</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">import org.apache.hadoop.util.ToolRunner;</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">
<br /></span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">/**</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> * WordCountsInDocuments counts the total number of words in each document and </span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> * produces data with the relative and total number of words for each document.</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> */</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">public class WordCountsInDocuments extends Configured implements Tool {</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">
<br /></span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> // where to put the data in hdfs when we're done</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> private static final String OUTPUT_PATH = "2-word-counts";</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">
<br /></span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> // where to read the data from.</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> private static final String INPUT_PATH = "1-word-freq";</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">
<br /></span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> public int run(String[] args) throws Exception {</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">
<br /></span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> Configuration conf = getConf();</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> Job job = new Job(conf, "Words Counts");</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">
<br /></span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> job.setJarByClass(WordCountsInDocuments.class);</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> job.setMapperClass(WordCountsForDocsMapper.class);</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> job.setReducerClass(WordCountsForDocsReducer.class);</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">
<br /></span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> job.setOutputKeyClass(Text.class);</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> job.setOutputValueClass(Text.class);</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">
<br /></span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> FileInputFormat.addInputPath(job, new Path(INPUT_PATH));</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">
<br /></span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> return job.waitForCompletion(true) ? 0 : 1;</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> }</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">
<br /></span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> public static void main(String[] args) throws Exception {</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> int res = ToolRunner.run(new Configuration(), new WordCountsInDocuments(), args);</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> System.exit(res);</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;"> }</span></span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';"><span class="Apple-style-span" style="font-weight: normal;">}</span></span></span></div><div>
<br /></div><div><span class="Apple-style-span" style="font-weight: normal; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; ">The input data is located in the output directory of the first step <b><div style="display: inline !important; "><span class="Apple-style-span" style="font-weight: normal; "><b><div style="display: inline !important; "><span class="Apple-style-span" style="font-weight: normal; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; display: inline !important; ">"1-word-freq"<b><div style="display: inline !important; "><span class="Apple-style-span" style="font-weight: normal; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; display: inline !important; ">, and the output is persisted in the directory "2-word-counts", both listed in the main training directory in HDFS. If you need to take a look at the ANT build and other classes, go to my personal resources at my <a href="http://code.google.com/p/programming-artifacts/source/browse/trunk/workspaces/hacking/hadoop-training/shakespeare/#shakespeare">Google Code project</a>. Recompile the project and generate the updated Jar with the driver.</div></span></div></b></div></span></div></b></span></div></b></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; ">
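The driver above wires in WordCountsForDocsMapper and WordCountsForDocsReducer, whose full source lives in the Google Code project linked above. The mapper's job is essentially line parsing: it splits each "word@document&lt;TAB&gt;count" record produced by Job 1 and regroups it under the document name. A minimal, Hadoop-free sketch of that parsing step (class and method names here are illustrative, not the actual training classes):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class WordCountsParsing {

    /**
     * Sketch of the map step of Job 2: turns one Job 1 output line
     * "word@document<TAB>n" into the pair (document, "word=n"), so that
     * the reducer receives all words of a document grouped under its name.
     */
    static Map.Entry<String, String> mapLine(String line) {
        String[] wordAndCount = line.split("\t");         // ["word@document", "n"]
        String[] wordAndDoc = wordAndCount[0].split("@"); // ["word", "document"]
        return new SimpleEntry<String, String>(
                wordAndDoc[1], wordAndDoc[0] + "=" + wordAndCount[1]);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> e = mapLine("therefore@all-shakespeare\t652");
        System.out.println(e.getKey() + " -> " + e.getValue());
        // prints: all-shakespeare -> therefore=652
    }
}
```

In the real job the same logic runs inside a Mapper&lt;LongWritable, Text, Text, Text&gt;, emitting the pair through context.write().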
<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">training@training-vm:~/git/exercises/shakespeare$ ant</span></span>
<br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">Buildfile: build.xml</span></span>
<br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span class="Apple-style-span" style="font-family:'Courier New', Courier, monospace;">
<br /></span></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">compile:</span></span>
<br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[javac] Compiling 5 source files to /home/training/git/exercises/shakespeare/bin</span></span>
<br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[javac] Note: Some input files use or override a deprecated API.</span></span>
<br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[javac] Note: Recompile with -Xlint:deprecation for details.</span></span>
<br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">
<br /></span></span>
<br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">jar:</span></span>
<br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar</span></span>
<br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">
<br /></span></span>
<br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">BUILD SUCCESSFUL</span></span>
<br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; "><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">Total time: 1 second</span></span>
<br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; ">
<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; ">Now, executing the driver...
<br /></div><div>
<br /></div><div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.WordCountsInDocuments</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:04 INFO input.FileInputFormat: Total input paths to process : 1</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:04 INFO mapred.JobClient: Running job: job_200912301017_0048</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:05 INFO mapred.JobClient: map 0% reduce 0%</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:12 INFO mapred.JobClient: map 100% reduce 0%</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:18 INFO mapred.JobClient: map 100% reduce 100%</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Job complete: job_200912301017_0048</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Counters: 17</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Job Counters </span></span></div><div><span class="Apple-style-span" 
style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Launched reduce tasks=1</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Launched map tasks=1</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Data-local map tasks=1</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: FileSystemCounters</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: FILE_BYTES_READ=1685803</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: HDFS_BYTES_READ=1588239</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3371638</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1920431</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Map-Reduce Framework</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier 
new';">10/01/06 16:28:20 INFO mapred.JobClient: Reduce input groups=0</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Combine output records=0</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Map input records=48779</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Reduce shuffle bytes=1685803</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Reduce output records=0</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Spilled Records=97558</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Map output bytes=1588239</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Combine input records=0</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Map output records=48779</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style="font-family:'courier new';">10/01/06 16:28:20 INFO mapred.JobClient: Reduce input 
records=48779</span></span></div><div>
<br /></div><div>Note from the job counters that the map phase emitted 48,779 records (about 1.6 MB of map output bytes), all shuffled to the single reducer. Let's check the result using the hadoop fs -cat command once again and navigate through it. The most important thing to note is that the ratio n/N is maintained throughout the results: each word's count n is paired with N, the total number of words in its document.</div><div>
<br /></div><div><div><span class="Apple-style-span" style="font-family:'courier new';">training@training-vm:~/git/exercises/shakespeare$ </span><b><div style="display: inline !important; "><span class="Apple-style-span" style="font-weight: normal; "><div style="display: inline !important; "><div style="display: inline !important; "><div style="display: inline !important; "><div style="display: inline !important; "><span class="Apple-style-span" style="font-family:'courier new';">hadoop fs -cat 2-word-counts/part-r-00000 | less</span></div></div></div></div></span></div></b></div><div><b><div style="display: inline !important; "><span class="Apple-style-span" style="font-weight: normal; "><div style="display: inline !important; "><div style="display: inline !important; "><div style="display: inline !important; "><div style="display: inline !important; "><span class="Apple-style-span" style="font-family:'courier new';">....</span></div></div></div></div></span></div></b></div></div><div><div><span class="Apple-style-span" style="font-family:'courier new';">relished@all-shakespeare 1/738781</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">therefore@all-shakespeare 652/738781</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">eastward@all-shakespeare 1/738781</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">....</span></div><div><div><span class="Apple-style-span" style="font-family:'courier new';">irrespective@leornardo-davinci-all.txt 1/149612</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">ignorance@leornardo-davinci-all.txt 12/149612</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">drawing@leornardo-davinci-all.txt 174/149612</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">relief@leornardo-davinci-all.txt 36/149612</span></div><div><span class="Apple-style-span" 
style="font-family:'courier new';">...</span></div><div><div><span class="Apple-style-span" style="font-family:'courier new';">answer@the-outline-of-science-vol1.txt 25/70650</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">sleeve@the-outline-of-science-vol1.txt 1/70650</span></div><div><span class="Apple-style-span" style="font-family:'courier new';">regard@the-outline-of-science-vol1.txt 22/70650</span></div><div>
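The reducer that produces these n/N pairs first sums all word counts of a document to obtain N, the document's total number of words, and then re-emits each word with its count over that total. A minimal, Hadoop-free sketch of that core computation (names are illustrative, and the sample uses a tiny two-word document rather than the real totals shown above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RelativeCounts {

    /**
     * Sketch of the reduce step of Job 2 for a single document: given each
     * word of the document with its count n, first computes N (the total
     * number of words in the document), then emits "word@document -> n/N".
     */
    static Map<String, String> reduceDocument(String document, Map<String, Integer> wordCounts) {
        int totalWordsInDoc = 0; // N
        for (int n : wordCounts.values()) {
            totalWordsInDoc += n;
        }
        Map<String, String> output = new LinkedHashMap<String, String>();
        for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
            output.put(e.getKey() + "@" + document, e.getValue() + "/" + totalWordsInDoc);
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        counts.put("therefore", 652);
        counts.put("relished", 1);
        System.out.println(reduceDocument("all-shakespeare", counts));
        // prints: {therefore@all-shakespeare=652/653, relished@all-shakespeare=1/653}
    }
}
```

Buffering a whole document's word list in the reducer is fine for this exercise, but it does assume a single document's vocabulary fits in memory.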
<br /></div></div><div>Part 3 will conclude this series by combining two different steps. I'm still following the original basic tutorial from Cloudera, but using the Hadoop 0.20.1 API. Any suggestions for improvements are welcome:</div><div>
<br /></div><div>- How to write data pipes between two different jobs?</div><div>- How to write a custom input format?</div><div>
<br /></div><div>Those questions might be answered after the training in Sunnyvale on January 19-21, during the Hadoop Training I'm excited to attend.</div></div></div></div></div></span></div></b></span></span></div></span></div></span></div>Marcello de Saleshttp://www.blogger.com/profile/12267056198563321351noreply@blogger.com1tag:blogger.com,1999:blog-8402055500123731062.post-69879265238982275522010-01-05T22:56:00.000-08:002010-01-06T16:43:06.978-08:00TF-IDF in Hadoop Part 1: Word Frequency in DocMy interest in parallel computing dates back to my undergraduate studies, just one or two years after Google published its paper on efficient data processing. From that time on, I wondered how they managed to index "the web".<br /><br /><span style="font-size:x-large;">This code uses the Hadoop 0.20.1 API</span>.<br /><br />Seven years passed, and while writing my thesis project I started dealing with the same questions regarding large datasets... How to process them at the database level? I mean, how to process them efficiently with the computational resources you've got? mongoDB gave me at least parallel data processing with its database partitioning scheme, using database shards in a cluster, where the data is stored in different shards depending on different properties of the data. And of course, one of the tools to process the distributed data is a MapReduce API. The problem is that mongoDB is in its early stages, with few people managing to run it in their production environments, while the community members support one another through the user and developer lists. The documentation on MapReduce is very simple, but it definitely requires some background...<br /><br />I finally found the Cloudera basic introduction training on MapReduce and Hadoop... and let me tell you, they made the nicest introduction to MapReduce I've seen :) The slides and documentation are very well structured and nice to follow... 
They actually worked with Google and the University of Washington to get to that level... I was very pleased to read and understand the concepts... My only need at that time was to apply that knowledge to the MapReduce engine from mongoDB... I did a simple application and it proved to be interesting...<br /><br />So, I've been studying the Cloudera basic training in Hadoop, and that was the only way I could learn MapReduce! That is my suggestion to anyone with the same desire I had: learn the programming model behind the scenes... The first implementation I did with Hadoop was indexing the words of the complete Shakespeare collection, as well as Da Vinci's writings and a science book downloaded from the <a href="http://www.gutenberg.org/wiki/Main_Page">Gutenberg project</a>. The input directory includes the collection of all Shakespeare books in a single text file. You can add the downloaded files to the Hadoop File System by using the copyFromLocal command:<br /><br /><span style="font-family:'Courier New', Courier, monospace;">training@training-vm:~/git/exercises/shakespeare$ hadoop fs -copyFromLocal the-outline-of-science-vol1.txt input</span><br /><span style="font-family:'Courier New', Courier, monospace;">training@training-vm:~/git/exercises/shakespeare$ hadoop fs -copyFromLocal leornardo-davinci-all.txt input</span><br /><br /><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">You can verify that the files were added by listing the contents of the "input" directory.<br /><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"></span><br /><span style="font-family:'Courier New', Courier, monospace;">training@training-vm:~/git/exercises/shakespeare$ hadoop fs -ls input<br />Found 3 items<br />-rw-r--r-- 1 training supergroup 5342761 2009-12-30 11:57 /user/training/input/all-shakespeare<br />-rw-r--r-- 1 training 
supergroup 1427769 2010-01-04 17:42 /user/training/input/leornardo-davinci-all.txt<br />-rw-r--r-- 1 training supergroup 674762 2010-01-04 17:42 /user/training/input/the-outline-of-science-vol1.txt<br /></span><br /><br /><span style="font-family:'Courier New', Courier, monospace;"><div><span style="font-family:Times;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span></span><br /></div></span><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">I started learning the API and the HDFS while exploring the implementation of the TF-IDF algorithm, as explained by the Cloudera training. I started this implementation after implementing the InvertedIndex example using both the Hadoop 0.18 and 0.20.1 APIs. My experiments are organized as follows:<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">Part 1: implements the "Job 1: Word Frequency in Doc". This is documented in this post;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><a href="http://distributed-agility.blogspot.com/2010/01/tf-idf-in-hadoop-part-2-word-counts-for.html">Part 2: implements the "Job 2: Word Counts For Docs"</a>;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">Part 3: "Job 3: Word Frequency In Corpus and Job 4: Calculate TF-IDF";<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div>Following the suggestion of the documentation, the approach I took to easily understand the concepts was divide-and-conquer. 
Each job is executed separately as an exercise, saving the generated reduced values into the HDFS.<br /><br /><span style="font-size:x-large;">Job 1: Word Frequency in Doc</span><br /><br />As mentioned before, the word frequency phase is designed as a Job whose task is to count the number of occurrences of each word in each of the documents in the input directory. In this case, the specification of the Map and Reduce is as follows:<br /><ul><li>Map: </li><ul><li>Input: (document, each line contents)</li><li>Output: (word@document, 1)</li></ul><li>Reducer</li><ul><li>n = sum of the values for each key "word@document"</li><li>Output: (word@document, n) </li></ul></ul><div>In order to decrease the payload received by the reducers, I'm filtering out very-high-frequency words such as "the", using Google's search stopword list. Also, the result of each job is saved as intermediate values in regular files, to be read by the next MapReduce pass. In general, the strategy is:<br /></div><br /><ol><li>Shrink the map output by using lower-case values, so that equal words are aggregated before the reduce phase;</li><li>Skip unnecessary words by checking them against the stopword dictionary (Google search stopwords);</li><li>Use a RegEx to select only words, removing punctuation and other data anomalies;</li></ol><span style="font-size:x-large;">Job1, Mapper</span><br /><br /><span style="font-family:'Courier New', Courier, monospace;">// (c) Copyright 2009 Cloudera, Inc.</span><br /><span style="font-family:'Courier New', Courier, monospace;">// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)</span><br /><span style="font-family:'Courier New', Courier, monospace;">package index;</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;">import java.io.IOException;</span><br /><span style="font-family:'Courier New', Courier, monospace;">import 
java.util.HashSet;</span><br /><span style="font-family:'Courier New', Courier, monospace;">import java.util.Set;</span><br /><span style="font-family:'Courier New', Courier, monospace;">import java.util.regex.Matcher;</span><br /><span style="font-family:'Courier New', Courier, monospace;">import java.util.regex.Pattern;</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.io.IntWritable;</span><br /><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.io.LongWritable;</span><br /><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.io.Text;</span><br /><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.mapreduce.Mapper;</span><br /><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.mapreduce.lib.input.FileSplit;</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;">/**</span><br /><span style="font-family:'Courier New', Courier, monospace;"> * WordFrequenceInDocMapper implements the Job 1 specification for the TF-IDF algorithm</span><br /><span style="font-family:'Courier New', Courier, monospace;"> */</span><br /><span style="font-family:'Courier New', Courier, monospace;">public class WordFrequenceInDocMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;"> public WordFrequenceInDocMapper() {</span><br /><span style="font-family:'Courier New', Courier, monospace;"> }</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;"> /**</span><br /><span 
style="font-family:'Courier New', Courier, monospace;"> * Google's search Stopwords</span><br /><span style="font-family:'Courier New', Courier, monospace;"> */</span><br /><span style="font-family:'Courier New', Courier, monospace;"> private static Set&lt;String&gt; googleStopwords;</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;"> static {</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords = new HashSet&lt;String&gt;();</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("I"); googleStopwords.add("a"); googleStopwords.add("about");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("an"); googleStopwords.add("are"); googleStopwords.add("as");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("at"); googleStopwords.add("be"); googleStopwords.add("by");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("com"); googleStopwords.add("de"); googleStopwords.add("en");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("for"); googleStopwords.add("from"); googleStopwords.add("how");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("in"); googleStopwords.add("is"); googleStopwords.add("it");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("la"); googleStopwords.add("of"); googleStopwords.add("on");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("or"); googleStopwords.add("that"); googleStopwords.add("the");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("this"); googleStopwords.add("to"); googleStopwords.add("was");</span><br 
/><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("what"); googleStopwords.add("when"); googleStopwords.add("where");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("who"); googleStopwords.add("will"); googleStopwords.add("with");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> googleStopwords.add("and"); googleStopwords.add("the"); googleStopwords.add("www");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> }</span><br /><span style="font-family:'Courier New', Courier, monospace;"> </span><br /><span style="font-family:'Courier New', Courier, monospace;"> /**</span><br /><span style="font-family:'Courier New', Courier, monospace;"> * @param key is the byte offset of the current line in the file;</span><br /><span style="font-family:'Courier New', Courier, monospace;"> * @param value is the line from the file</span><br /><span style="font-family:'Courier New', Courier, monospace;"> * @param output has the method "collect()" to output the key,value pair</span><br /><span style="font-family:'Courier New', Courier, monospace;"> * @param reporter allows us to retrieve some information about the job (like the current filename) </span><br /><span style="font-family:'Courier New', Courier, monospace;"> * </span><br /><span style="font-family:'Courier New', Courier, monospace;"> * POST-CONDITION: Output <"word@filename", 1> pairs</span><br /><span style="font-family:'Courier New', Courier, monospace;"> */</span><br /><span style="font-family:'Courier New', Courier, monospace;"> public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {</span><br /><span style="font-family:'Courier New', Courier, monospace;"> // Compile all the words using regex</span><br /><span style="font-family:'Courier New', Courier, monospace;"> Pattern p = Pattern.compile("\\w+");</span><br /><span 
style="font-family:'Courier New', Courier, monospace;"> Matcher m = p.matcher(value.toString());</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;"> // Get the name of the file from the inputsplit in the context</span><br /><span style="font-family:'Courier New', Courier, monospace;"> String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;"> // build the values and write <k,v> pairs through the context</k,v></span><br /><span style="font-family:'Courier New', Courier, monospace;"> StringBuilder valueBuilder = new StringBuilder();</span><br /><span style="font-family:'Courier New', Courier, monospace;"> while (m.find()) {</span><br /><span style="font-family:'Courier New', Courier, monospace;"> String matchedKey = m.group().toLowerCase();</span><br /><span style="font-family:'Courier New', Courier, monospace;"> // remove names starting with non letters, digits, considered stopwords or containing other chars</span><br /><span style="font-family:'Courier New', Courier, monospace;"> if (!Character.isLetter(matchedKey.charAt(0)) || Character.isDigit(matchedKey.charAt(0))</span><br /><span style="font-family:'Courier New', Courier, monospace;"> || googleStopwords.contains(matchedKey) || matchedKey.contains("_")) {</span><br /><span style="font-family:'Courier New', Courier, monospace;"> continue;</span><br /><span style="font-family:'Courier New', Courier, monospace;"> }</span><br /><span style="font-family:'Courier New', Courier, monospace;"> valueBuilder.append(matchedKey);</span><br /><span style="font-family:'Courier New', Courier, monospace;"> valueBuilder.append("@");</span><br /><span style="font-family:'Courier New', Courier, monospace;"> valueBuilder.append(fileName);</span><br /><span 
style="font-family:'Courier New', Courier, monospace;"> // emit the partial &lt;k,v&gt;</span><br /><span style="font-family:'Courier New', Courier, monospace;"> context.write(new Text(valueBuilder.toString()), new IntWritable(1));</span><br /><span style="font-family:'Courier New', Courier, monospace;"> valueBuilder.setLength(0);</span><br /><span style="font-family:'Courier New', Courier, monospace;"> }</span><br /><span style="font-family:'Courier New', Courier, monospace;"> }</span><br /><span style="font-family:'Courier New', Courier, monospace;">}</span><br /><br /><span style="font-size:x-large;">Job1, Mapper Unit Test</span><br /><span style="font-size:x-large;"><span style="font-size:medium;"></span></span><br /><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:Times, 'Times New Roman', serif;">Note that the unit tests use the JUnit 4 API. The MRUnit API is also updated to use the Hadoop 0.20.1 API for the Mapper and the respective MapDriver. 
Generics are used to emulate the actual implementation as well.</span><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"></span><br /></div><span style="font-family:'Courier New', Courier, monospace;"></span><br /><span style="font-family:'Courier New', Courier, monospace;"></span><br /><span style="font-family:'Courier New', Courier, monospace;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />// (c) Copyright 2009 Cloudera, Inc.<br />// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)<br />package index;<br /><br />import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;<br /><br />import java.io.IOException;<br />import java.util.ArrayList;<br />import java.util.List;<br /><br />import junit.framework.TestCase;<br /><br />import org.apache.hadoop.io.IntWritable;<br />import org.apache.hadoop.io.LongWritable;<br />import org.apache.hadoop.io.Text;<br />import org.apache.hadoop.mapreduce.Mapper;<br />import org.apache.hadoop.mrunit.mapreduce.MapDriver;<br />import org.apache.hadoop.mrunit.mock.MockInputSplit;<br />import org.apache.hadoop.mrunit.types.Pair;<br />import org.junit.Before;<br />import org.junit.Test;<br /><br />/**<br />* Test cases for the word frequency mapper.<br />*/<br />public class WordFreqMapperTest extends TestCase {<br /><br /> private Mapper<longwritable, intwritable=""> mapper;</longwritable,><br /> private MapDriver<longwritable, intwritable=""> driver;</longwritable,><br /><br /> /** We expect pathname@offset for the key from each of these */<br /> private final Text KEY_SUFIX = new Text("@" + MockInputSplit.getMockPath().toString());<br /><br /> @Before<br /> public 
void setUp() {<br /> mapper = new WordFrequenceInDocMapper();<br /> driver = new MapDriver<longwritable, intwritable="">(mapper);</longwritable,><br /> }<br /><br /> @Test<br /> public void testEmpty() {<br /> List<pair><text, intwritable="">> out = null;</text,></pair><br /> <br /> try {<br /> out = driver.withInput(new LongWritable(0), new Text("")).run();<br /> } catch (IOException ioe) {<br /> fail();<br /> }<br /><br /> List<pair><text, text="">> expected = new ArrayList<pair><text, text="">>();</text,></pair></text,></pair><br /> <br /> assertListEquals(expected, out);<br /> }<br /><br /> @Test<br /> public void testOneWord() {<br /> List<pair><text, intwritable="">> out = null;</text,></pair><br /><br /> try {<br /> out = driver.withInput(new LongWritable(0), new Text("foo")).run();<br /> } catch (IOException ioe) {<br /> fail();<br /> }<br /><br /> List<pair><text, intwritable="">> expected = new ArrayList<pair><text, intwritable="">>();</text,></pair></text,></pair><br /> expected.add(new Pair<text, intwritable="">(new Text("foo" + KEY_SUFIX), new IntWritable(1)));</text,><br /><br /> assertListEquals(expected, out);<br /> }<br /><br /> @Test<br /> public void testMultiWords() {<br /> List<pair><text, intwritable="">> out = null;</text,></pair><br /><br /> try {<br /> out = driver.withInput(new LongWritable(0), new Text("foo bar baz!!!! 
????")).run();<br /> } catch (IOException ioe) {<br /> fail();<br /> }<br /><br /> List<pair><text, intwritable="">> expected = new ArrayList<pair><text, intwritable="">>();</text,></pair></text,></pair><br /> expected.add(new Pair<text, intwritable="">(new Text("foo" + KEY_SUFIX), new IntWritable(1)));</text,><br /> expected.add(new Pair<text, intwritable="">(new Text("bar" + KEY_SUFIX), new IntWritable(1)));</text,><br /> expected.add(new Pair<text, intwritable="">(new Text("baz" + KEY_SUFIX), new IntWritable(1)));</text,><br /><br /> assertListEquals(expected, out);<br /> }<br />}<br /><br /></div></span><br /><span style="font-size:x-large;">Job1, Reducer</span><br /><br /><br /><span style="font-family:'Courier New', Courier, monospace;">// (c) Copyright 2009 Cloudera, Inc.</span><br /><span style="font-family:'Courier New', Courier, monospace;">// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;">package index;</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;">import java.io.IOException;</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.io.IntWritable;</span><br /><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.io.Text;</span><br /><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.mapreduce.Reducer;</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;">/**</span><br /><span style="font-family:'Courier New', Courier, monospace;"> * WordFrequenceInDocReducer.</span><br /><span 
style="font-family:'Courier New', Courier, monospace;"> */</span><br /><span style="font-family:'Courier New', Courier, monospace;">public class WordFrequenceInDocReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;"> public WordFrequenceInDocReducer() {</span><br /><span style="font-family:'Courier New', Courier, monospace;"> }</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;"> /**</span><br /><span style="font-family:'Courier New', Courier, monospace;"> * @param key is the key of the mapper</span><br /><span style="font-family:'Courier New', Courier, monospace;"> * @param values are all the values aggregated during the mapping phase</span><br /><span style="font-family:'Courier New', Courier, monospace;"> * @param context contains the context of the job run</span><br /><span style="font-family:'Courier New', Courier, monospace;"> * </span><br /><span style="font-family:'Courier New', Courier, monospace;"> * PRE-CONDITION: receive a list of <"word@filename",[1, 1, 1, ...]> pairs </span><br /><span style="font-family:'Courier New', Courier, monospace;"> * <"marcello@a.txt", [1, 1]> </span><br /><span style="font-family:'Courier New', Courier, monospace;"> * </span><br /><span style="font-family:'Courier New', Courier, monospace;"> * POST-CONDITION: emit a single key-value pair with the sum of the occurrences, e.g. </span><br /><span style="font-family:'Courier New', Courier, monospace;"> * <"marcello@a.txt", 2></span><br /><span style="font-family:'Courier New', Courier, monospace;"> */</span><br /><span style="font-family:'Courier New', Courier, monospace;"> protected void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {</span><br /><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /><span style="font-family:'Courier New', Courier, monospace;"> int sum = 0;</span><br /><span style="font-family:'Courier New', Courier, monospace;"> for (IntWritable val : values) {</span><br /><span style="font-family:'Courier New', Courier, monospace;"> sum += val.get();</span><br /><span style="font-family:'Courier New', Courier, monospace;"> }</span><br /><span style="font-family:'Courier New', Courier, monospace;"> // write the key and the summed value</span><br /><span style="font-family:'Courier New', Courier, monospace;"> context.write(key, new IntWritable(sum));</span><br /><span style="font-family:'Courier New', Courier, monospace;"> }</span><br /><span style="font-family:'Courier New', Courier, monospace;">}</span><br /><div><br /><span style="font-size:x-large;">Job1, Reducer Unit Test</span><br /><span style="font-size:x-large;"><span style="font-size:medium;"></span></span><br /><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"></span><br /></div><span style="font-family:'Courier New', Courier, monospace;"></span><br /><span style="font-family:'Courier New', Courier, monospace;"></span><br /><span style="font-family:'Courier New', Courier, monospace;"><div style="margin-bottom: 0px; 
margin-left: 0px; margin-right: 0px; margin-top: 0px;">// (c) Copyright 2009 Cloudera, Inc.<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">package index;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import java.io.IOException;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import java.util.ArrayList;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import java.util.List;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import junit.framework.TestCase;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import org.apache.hadoop.io.IntWritable;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import org.apache.hadoop.io.Text;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import org.apache.hadoop.mapreduce.Reducer;<br /></div><div 
style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import org.apache.hadoop.mrunit.types.Pair;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import org.junit.Before;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">import org.junit.Test;<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">/**<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> * Test cases for the inverted index reducer.<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> */<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">public class WordFreqReducerTest extends TestCase {<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> private Reducer<text, intwritable=""> reducer;</text,><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> private ReduceDriver<text, intwritable=""> driver;</text,><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> @Before<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> public void setUp() {<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> reducer = new 
WordFrequenceInDocReducer();<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> driver = new ReduceDriver<text, intwritable="">(reducer);</text,><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> }<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> @Test<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> public void testOneItem() {<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> List<pair><text, intwritable="">> out = null;</text,></pair><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> try {<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> out = driver.withInputKey(new Text("word")).withInputValue(new IntWritable(1)).run();<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> } catch (IOException ioe) {<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> fail();<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> }<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> List<pair><text, intwritable="">> expected = new ArrayList<pair><text, intwritable="">>();</text,></pair></text,></pair><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> expected.add(new Pair<text, intwritable="">(new Text("word"), new 
IntWritable(1)));</text,><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> assertListEquals(expected, out);<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> }<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> @Test<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> public void testMultiWords() {<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> List<pair><text, intwritable="">> out = null;</text,></pair><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> try {<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> List<intwritable> values = new ArrayList<intwritable>();</intwritable></intwritable><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> values.add(new IntWritable(2));<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> values.add(new IntWritable(5));<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> values.add(new IntWritable(8));<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> out = driver.withInput(new Text("word1"), values).run();<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> <br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> } catch 
(IOException ioe) {<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> fail();<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> }<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> List<pair><text, intwritable="">> expected = new ArrayList<pair><text, intwritable="">>();</text,></pair></text,></pair><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> expected.add(new Pair<text, intwritable="">(new Text("word1"), new IntWritable(15)));</text,><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> assertListEquals(expected, out);<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"> }<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">}<br /></div></span><br /><div></div><br /><span style="font-size:x-large;"><span style="font-size:medium;"><br /></span></span><br /></div>Before executing the hadoop application, make sure that the Mapper and Reducer classes are passing the unit tests for each of them. Test-Driven Development helps during the development of the Mappers and Reducers by identifying problems related to incorrect inherited methods (Generics in special), where wrong "map" or "reduce" method signatures may lead to skipping designed phases. 
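The "wrong signature" pitfall is easy to reproduce in plain Java, with no Hadoop types involved (the classes Base, WrongChild, RightChild and OverrideCheck below are made-up names for illustration): with generics, a method whose parameters do not match the superclass's method overloads it instead of overriding it, so callers keep reaching the superclass's default implementation — the "skipped phase" effect. Annotating map/reduce with @Override makes the compiler reject the mismatch at build time.

```java
/**
 * Illustration of override vs. overload with generics (hypothetical classes).
 * Base.process plays the role of Mapper's default identity map().
 */
class Base<K> {
    protected String process(K key) { return "base"; } // default behavior
}

class WrongChild extends Base<Integer> {
    // Signature mismatch (String instead of Integer): this OVERLOADS,
    // so Base's default still runs when process(Integer) is called.
    protected String process(String key) { return "child"; }
}

class RightChild extends Base<Integer> {
    @Override // the compiler verifies this signature actually overrides
    protected String process(Integer key) { return "child"; }
}

public class OverrideCheck {
    public static void main(String[] args) {
        System.out.println(new WrongChild().process(1)); // prints "base"
        System.out.println(new RightChild().process(1)); // prints "child"
    }
}
```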
Therefore, running the test cases before the actual execution of the driver classes is safer.<br /><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">training@training-vm:~/git/exercises/shakespeare$ ant test</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">Buildfile: build.xml</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;"><br /></span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">compile:</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[javac] Compiling 5 source files to /home/training/git/exercises/shakespeare/bin</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[javac] Note: Some input files use or override a deprecated API.</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[javac] Note: Recompile with -Xlint:deprecation for details.</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;"><br /></span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">test:</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[junit] Running index.AllTests</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[junit] Testsuite: index.AllTests</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.279 sec</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[junit] Tests run: 2, 
Failures: 0, Errors: 0, Time elapsed: 0.279 sec</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">[junit] </span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;"><br /></span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">BUILD SUCCESSFUL</span></span><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">Total time: 2 seconds</span></span><br /><br />Then, the execution of the Driver can proceed. It defines the mapper and reducer classes, and sets the combiner class to be the same as the reducer class. <b>Also, note that the outputKeyClass and outputValueClass must be the same as the ones defined by the Reducer class! If not, Hadoop will complain! :)</b><br /><br /><br /><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-size:x-large;">Job1, Driver</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div></div><div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">// (c) Copyright 2009 Cloudera, Inc.</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">package index;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span 
style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.conf.Configuration;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.conf.Configured;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.fs.Path;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.io.IntWritable;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.io.Text;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.mapreduce.Job;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', 
Courier, monospace;">import org.apache.hadoop.util.Tool;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">import org.apache.hadoop.util.ToolRunner;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">/**</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> * WordFrequenceInDocument Creates the index of the words in documents, </span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> * mapping each of them to their frequency.</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> */</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">public class WordFrequenceInDocument extends Configured implements Tool {</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> // where to put the data in hdfs when we're done</span><br /></div></div><div><div 
style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> private static final String OUTPUT_PATH = "1-word-freq";</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> // where to read the data from.</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> private static final String INPUT_PATH = "input";</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> public int run(String[] args) throws Exception {</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> Configuration conf = getConf();</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> Job job = new Job(conf, "Word Frequence In Document");</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span 
style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> job.setJarByClass(WordFrequenceInDocument.class);</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> job.setMapperClass(WordFrequenceInDocMapper.class);</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> job.setReducerClass(WordFrequenceInDocReducer.class);</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> job.setCombinerClass(WordFrequenceInDocReducer.class);</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> job.setOutputKeyClass(Text.class);</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> job.setOutputValueClass(IntWritable.class);</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> FileInputFormat.addInputPath(job, 
new Path(INPUT_PATH));</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> return job.waitForCompletion(true) ? 0 : 1;</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> }</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><br /></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> public static void main(String[] args) throws Exception {</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> int res = ToolRunner.run(new Configuration(), new WordFrequenceInDocument(), args);</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> System.exit(res);</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"> }</span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 
0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;">}</span><br /></div></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">As specified by the Driver class, the data is read from the books in the HDFS input directory, and the output is written to the directory of this first step, "1-word-freq". The training virtual machine contains the build scripts needed to compile the code and generate the jar that runs the MapReduce application, as well as to run the unit tests for the Mapper and Reducer classes.<br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">training@training-vm:~/git/exercises/shakespeare$ ant</span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">Buildfile: build.xml</span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;"><br /></span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', 
Courier, monospace;"><span style="font-size:small;">compile:</span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;"> [javac] Compiling 5 source files to /home/training/git/exercises/shakespeare/bin</span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;"> [javac] Note: Some input files use or override a deprecated API.</span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;"> [javac] Note: Recompile with -Xlint:deprecation for details.</span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;"><br /></span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">jar:</span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span 
style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;"> [jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar</span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;"><br /></span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">BUILD SUCCESSFUL</span></span><br /></div></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">Total time: 1 second</span></span><br /></div></div><div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">Now, executing the driver...<br /></div></div><br /><br /><span style="font-family:'Courier New', Courier, monospace;"><span style="font-size:small;">training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.WordFrequenceInDocument</span></span><br /><span style="font-family:'Courier New', Courier, monospace;">10/01/05 16:34:54 INFO input.FileInputFormat: Total input paths to process : 3<br />10/01/05 16:34:54 INFO mapred.JobClient: Running job: job_200912301017_0046<br />10/01/05 16:34:55 INFO mapred.JobClient: map 0% 
reduce 0%<br />10/01/05 16:35:10 INFO mapred.JobClient: map 50% reduce 0%<br />10/01/05 16:35:13 INFO mapred.JobClient: map 66% reduce 0%<br />10/01/05 16:35:16 INFO mapred.JobClient: map 100% reduce 0%<br />10/01/05 16:35:19 INFO mapred.JobClient: map 100% reduce 33%<br />10/01/05 16:35:25 INFO mapred.JobClient: map 100% reduce 100%<br />10/01/05 16:35:27 INFO mapred.JobClient: Job complete: job_200912301017_0046<br />10/01/05 16:35:27 INFO mapred.JobClient: Counters: 17<br />10/01/05 16:35:27 INFO mapred.JobClient: Job Counters<br />10/01/05 16:35:27 INFO mapred.JobClient: Launched reduce tasks=1<br />10/01/05 16:35:27 INFO mapred.JobClient: Launched map tasks=3<br />10/01/05 16:35:27 INFO mapred.JobClient: Data-local map tasks=3<br />10/01/05 16:35:27 INFO mapred.JobClient: FileSystemCounters<br />10/01/05 16:35:27 INFO mapred.JobClient: FILE_BYTES_READ=3129067<br />10/01/05 16:35:27 INFO mapred.JobClient: HDFS_BYTES_READ=7445292<br />10/01/05 16:35:27 INFO mapred.JobClient: FILE_BYTES_WRITTEN=4901739<br />10/01/05 16:35:27 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1588239<br />10/01/05 16:35:27 INFO mapred.JobClient: Map-Reduce Framework<br />10/01/05 16:35:27 INFO mapred.JobClient: Reduce input groups=0<br />10/01/05 16:35:27 INFO mapred.JobClient: Combine output records=94108<br />10/01/05 16:35:27 INFO mapred.JobClient: Map input records=220255<br />10/01/05 16:35:27 INFO mapred.JobClient: Reduce shuffle bytes=1772576<br />10/01/05 16:35:27 INFO mapred.JobClient: Reduce output records=0<br />10/01/05 16:35:27 INFO mapred.JobClient: Spilled Records=142887<br />10/01/05 16:35:27 INFO mapred.JobClient: Map output bytes=27375962<br />10/01/05 16:35:27 INFO mapred.JobClient: Combine input records=1004372<br />10/01/05 16:35:27 INFO mapred.JobClient: Map output records=959043<br />10/01/05 16:35:27 INFO mapred.JobClient: Reduce input records=48779<br /></span><br /><br />The execution generates the output as shown in the following listing (note that I had piped 
the cat output into less so that you can page through the results). Searching for the word "therefore" shows its use in the different documents.<br /><br /><span style="font-family:'Courier New', Courier, monospace;font-size:small;">training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat 1-word-freq/part-r-00000 | less</span><br /><br /><span style="font-family:'Courier New', Courier, monospace;">...</span><br /><span style="font-family:'Courier New', Courier, monospace;">therefore@all-shakespeare 652</span><br /><span style="font-family:'Courier New', Courier, monospace;">therefore@leornardo-davinci-all.txt 124</span><br /><span style="font-family:'Courier New', Courier, monospace;">therefore@the-outline-of-science-vol1.txt 36</span><br /><br />The results produced are the intermediate data necessary for the execution of Job 2, described in Part 2 of this tutorial, which counts the total number of words in each document using the data in the "1-word-freq" directory.Marcello de Saleshttp://www.blogger.com/profile/12267056198563321351noreply@blogger.com0tag:blogger.com,1999:blog-8402055500123731062.post-36444881311305107362010-01-05T00:08:00.000-08:002010-01-05T22:43:01.655-08:00Hadoop 0.20.1 API: refactoring the InvertedLine example, removing deprecated classes (JobConf, others)I've been learning Hadoop for the past 15 days and I have found lots of examples of source-code. The basic training offered by Cloudera uses the 0.18 API, as does the Yahoo developer's tutorial that describes the Inverted Line Index example. The input of this example is a list of one or more text files containing books, and the output is an index of the words appearing in each of the files, in the format &lt;"word", "fileName@offset"&gt;, meaning that the word was found in the given fileName at the given byte offset. Although the example works without a problem, I've read documentation about the Pig application showing that the majority of its warnings are caused by the API change. I'm particularly in favour of clean code without warnings, whenever possible. So, I started dissecting the API and was able to re-implement the examples using the Hadoop 0.20.1 API. Furthermore, the MRUnit tests must also be refactored in order to make use of the new API.<br /><br />Both the <a href="http://developer.yahoo.com/hadoop/tutorial/">Yahoo Hadoop Tutorial</a> and the <a href="http://www.cloudera.com/hadoop-training-exercise-writing-mapreduce-programs">Cloudera Basic Training documentation "Writing MapReduce Programs"</a> give the example of the InvertedIndex application. I used the <a href="http://www.cloudera.com/hadoop-training-virtual-machine">Cloudera VMWare implementation</a> and source-code as a starting point.<br /><br />The first major change was the introduction of the "mapreduce" package, containing the new implementations of the Mapper and Reducer classes, which were interfaces in the "mapred" package of the previous API.<br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">import org.apache.hadoop.mapreduce.Mapper;<br />import org.apache.hadoop.mapreduce.Reducer;<br /><br />public class MyMapper extends Mapper&lt;KEYIN, VALUEIN, KEYOUT, VALUEOUT&gt; {}<br />public class MyReducer extends Reducer&lt;KEYIN, VALUEIN, KEYOUT, VALUEOUT&gt; {}</span></span><div><br />Also, note that these classes use Java generics and, therefore, the signatures of the "map()" and "reduce()" methods in your implementation must match the declared type parameters. Both methods replaced the Reporter and the OutputCollector with a Context object, which is a nested class of each of the Mapper and Reducer classes.<br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">protected void map(KEYIN key, VALUEIN value, Context context)<br />protected void reduce(KEYIN key, Iterable&lt;VALUEIN&gt; values, Context context)</span></span><br /><br />Whatever the types of K and V are, they must be used exactly as declared in the implementation. For instance, I used to declare the reducer's values parameter as an Iterator, and the reduce method was never called because of the wrong method signature. 
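This overload-versus-override trap can be reproduced with plain Java, outside of Hadoop. The sketch below is illustrative only: FrameworkReducer, run(), and the class names are hypothetical stand-ins for Hadoop's Reducer and the framework that drives it, not actual Hadoop APIs. Declaring the values parameter as Iterator instead of Iterable creates an unrelated overload, so dynamic dispatch keeps hitting the base implementation.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for Hadoop's Reducer base class: the framework
// only ever calls reduce(key, Iterable<V>).
class FrameworkReducer<K, V> {
    protected String result = "base reduce ran (subclass method ignored)";

    protected void reduce(K key, Iterable<V> values) {
        // base implementation: leaves result untouched
    }

    // Simulates the framework driving the reducer.
    public String run(K key, List<V> values) {
        reduce(key, values);  // dispatches only to the Iterable overload
        return result;
    }
}

// Wrong: Iterator instead of Iterable is an overload, not an override.
class WrongSignatureReducer extends FrameworkReducer<String, Integer> {
    protected void reduce(String key, Iterator<Integer> values) {
        result = "wrong-signature reduce ran";  // never reached by run()
    }
}

// Right: the parameter types match the base class declaration exactly.
class RightSignatureReducer extends FrameworkReducer<String, Integer> {
    @Override
    protected void reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        result = "sum=" + sum;
    }
}

public class SignatureDemo {
    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println(new WrongSignatureReducer().run("word", values));
        // prints: base reduce ran (subclass method ignored)
        System.out.println(new RightSignatureReducer().run("word", values));
        // prints: sum=6
    }
}
```

Annotating the overriding method with @Override turns this silent mismatch into a compile-time error, which is the cheapest way to catch it.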
So, it is important to verify that you're using the Iterable class for the values instead.<br /><br /><span><span class="Apple-style-span" style="font-size: x-large;">Mapper Class</span></span><br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">// (c) Copyright 2009 Cloudera, Inc.<br />// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)<br /><br />package index;<br /><br />import java.io.IOException;<br />import java.util.HashSet;<br />import java.util.Set;<br />import java.util.regex.Matcher;<br />import java.util.regex.Pattern;<br /><br />import org.apache.hadoop.io.LongWritable;<br />import org.apache.hadoop.io.Text;<br />import org.apache.hadoop.mapreduce.Mapper;<br />import org.apache.hadoop.mapreduce.lib.input.FileSplit;<br /><br />/**<br /> * LineIndexMapper maps each observed word in a line to a (filename@offset) string.<br /> */<br />public class LineIndexMapper extends Mapper&lt;LongWritable, Text, Text, Text&gt; {<br /><br /> public LineIndexMapper() {<br /> }<br /><br /> /**<br /> * Google's search Stopwords<br /> */<br /> private static Set&lt;String&gt; googleStopwords;<br /><br /> static {<br /> googleStopwords = new HashSet&lt;String&gt;();<br /> googleStopwords.add("I"); googleStopwords.add("a"); googleStopwords.add("about");<br /> googleStopwords.add("an"); googleStopwords.add("are"); googleStopwords.add("as");<br /> googleStopwords.add("at"); googleStopwords.add("be"); googleStopwords.add("by");<br /> googleStopwords.add("com"); googleStopwords.add("de"); googleStopwords.add("en");<br /> googleStopwords.add("for"); googleStopwords.add("from"); googleStopwords.add("how");<br /> googleStopwords.add("in"); googleStopwords.add("is"); googleStopwords.add("it");<br /> googleStopwords.add("la"); googleStopwords.add("of"); googleStopwords.add("on");<br /> googleStopwords.add("or"); googleStopwords.add("that"); googleStopwords.add("the");<br /> googleStopwords.add("this"); googleStopwords.add("to"); googleStopwords.add("was");<br /> googleStopwords.add("what"); googleStopwords.add("when"); googleStopwords.add("where");<br /> googleStopwords.add("who"); googleStopwords.add("will"); googleStopwords.add("with");<br /> googleStopwords.add("and"); googleStopwords.add("www");<br /> }<br /><br /> /**<br /> * @param key is the byte offset of the current line in the file<br /> * @param value is the line from the file<br /> * @param context emits the &lt;key, value&gt; pairs and provides information about the job (like the current filename)<br /> *<br /> * POST-CONDITION: Output &lt;"word", "filename@offset"&gt; pairs<br /> */<br /> public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {<br /> // Compile all the words using regex<br /> Pattern p = Pattern.compile("\\w+");<br /> Matcher m = p.matcher(value.toString());<br /><br /> // Get the name of the file from the inputsplit in the context<br /> String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();<br /><br /> // build the values and write &lt;key, value&gt; pairs through the context<br /> StringBuilder valueBuilder = new StringBuilder();<br /> while (m.find()) {<br /> String matchedKey = m.group().toLowerCase();<br /> // skip words starting with non-letters or digits, stopwords, or words containing "_"<br /> if (!Character.isLetter(matchedKey.charAt(0)) || Character.isDigit(matchedKey.charAt(0))<br /> || googleStopwords.contains(matchedKey) || matchedKey.contains("_")) {<br /> continue;<br /> }<br /> valueBuilder.append(fileName);<br /> valueBuilder.append("@");<br /> valueBuilder.append(key.get());<br /> // emit the partial &lt;key, value&gt; pair<br /> context.write(new Text(matchedKey), new Text(valueBuilder.toString()));<br /> valueBuilder.setLength(0);<br /> }<br /> }<br />}</span></span><br /><br /><span><span class="Apple-style-span" style="font-size: x-large;">Reducer Class</span></span><br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">// (c) Copyright 2009 Cloudera, Inc.<br />// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)<br /><br />package index;<br /><br />import java.io.IOException;<br /><br />import org.apache.hadoop.io.Text;<br />import org.apache.hadoop.mapreduce.Reducer;<br /><br />/**<br /> * LineIndexReducer takes a list of filename@offset entries for a single word and concatenates them into a list.<br /> */<br />public class LineIndexReducer extends Reducer&lt;Text, Text, Text, Text&gt; {<br /><br /> public LineIndexReducer() {<br /> }<br /><br /> /**<br /> * @param key is the key emitted by the mapper<br /> * @param values are all the values aggregated during the mapping phase<br /> * @param context contains the context of the job run<br /> *<br /> * PRE-CONDITION: receive a list of &lt;"word", "filename@offset"&gt; pairs<br /> * &lt;"marcello", ["a.txt@3345", "b.txt@344", "c.txt@785"]&gt;<br /> *<br /> * POST-CONDITION: emit a single key-value pair where all the file names<br /> * are separated by a comma ",".<br /> * &lt;"marcello", "a.txt@3345,b.txt@344,c.txt@785"&gt;<br /> */<br /> protected void reduce(Text key, Iterable&lt;Text&gt; values, Context context) throws IOException, InterruptedException {<br /> StringBuilder valueBuilder = new StringBuilder();<br /><br /> for (Text val : values) {<br /> valueBuilder.append(val);<br /> valueBuilder.append(",");<br /> }<br /> // write the key and the adjusted value (removing the last comma)<br /> context.write(key, new Text(valueBuilder.substring(0, valueBuilder.length() - 1)));<br /> valueBuilder.setLength(0);<br /> }<br />}</span></span><br /><br />These are the changes necessary for the Mapper and Reducer classes, without the need to extend the old MapReduceBase class. In order to unit test these classes, changes to MRUnit are also necessary: the drivers were also given a new "mapreduce" package with the same counterparts.<br /><br />Instead of the mrunit.MapDriver, use the mapreduce.MapDriver. The same goes for the ReduceDriver. The rest of the code is just the same.<br /><br /><div style="color: red;"><strike>import org.apache.hadoop.mrunit.MapDriver;</strike><br /></div>import org.apache.hadoop.mrunit.mapreduce.MapDriver;<br /><br /><span><span class="Apple-style-span" style="font-size: x-large;">JUnit's MapperTest</span></span><br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">// (c) Copyright 2009 Cloudera, Inc.<br />// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)<br />package index;<br /><br />import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;<br /><br />import java.io.IOException;<br />import java.util.ArrayList;<br />import java.util.List;<br /><br />import junit.framework.TestCase;<br /><br />import org.apache.hadoop.io.LongWritable;<br />import org.apache.hadoop.io.Text;<br />import org.apache.hadoop.mapreduce.Mapper;<br />import org.apache.hadoop.mrunit.mapreduce.MapDriver;<br />import 
org.apache.hadoop.mrunit.mock.MockInputSplit;<br />import org.apache.hadoop.mrunit.types.Pair;<br />import org.junit.Before;<br />import org.junit.Test;<br /><br />/**<br /> * Test cases for the inverted index mapper.<br /> */<br />public class MapperTest extends TestCase {<br /><br /> private Mapper&lt;LongWritable, Text, Text, Text&gt; mapper;<br /> private MapDriver&lt;LongWritable, Text, Text, Text&gt; driver;<br /><br /> /** We expect pathname@offset for the key from each of these */<br /> private final Text EXPECTED_OFFSET = new Text(MockInputSplit.getMockPath().toString() + "@0");<br /><br /> @Before<br /> public void setUp() {<br /> mapper = new LineIndexMapper();<br /> driver = new MapDriver&lt;LongWritable, Text, Text, Text&gt;(mapper);<br /> }<br /><br /> @Test<br /> public void testEmpty() {<br /> List&lt;Pair&lt;Text, Text&gt;&gt; out = null;<br /><br /> try {<br /> out = driver.withInput(new LongWritable(0), new Text("")).run();<br /> } catch (IOException ioe) {<br /> fail();<br /> }<br /><br /> List&lt;Pair&lt;Text, Text&gt;&gt; expected = new ArrayList&lt;Pair&lt;Text, Text&gt;&gt;();<br /><br /> assertListEquals(expected, out);<br /> }<br /><br /> @Test<br /> public void testOneWord() {<br /> List&lt;Pair&lt;Text, Text&gt;&gt; out = null;<br /><br /> try {<br /> out = driver.withInput(new LongWritable(0), new Text("foo")).run();<br /> } catch (IOException ioe) {<br /> fail();<br /> }<br /><br /> List&lt;Pair&lt;Text, Text&gt;&gt; expected = new ArrayList&lt;Pair&lt;Text, Text&gt;&gt;();<br /> expected.add(new Pair&lt;Text, Text&gt;(new Text("foo"), EXPECTED_OFFSET));<br /><br /> assertListEquals(expected, out);<br /> }<br /><br /> @Test<br /> public void testMultiWords() {<br /> List&lt;Pair&lt;Text, Text&gt;&gt; out = null;<br /><br /> try {<br /> out = driver.withInput(new LongWritable(0), new Text("foo bar baz!!!! ????")).run();<br /> } catch (IOException ioe) {<br /> fail();<br /> }<br /><br /> List&lt;Pair&lt;Text, Text&gt;&gt; expected = new ArrayList&lt;Pair&lt;Text, Text&gt;&gt;();<br /> expected.add(new Pair&lt;Text, Text&gt;(new Text("foo"), EXPECTED_OFFSET));<br /> expected.add(new Pair&lt;Text, Text&gt;(new Text("bar"), EXPECTED_OFFSET));<br /> expected.add(new Pair&lt;Text, Text&gt;(new Text("baz"), EXPECTED_OFFSET));<br /><br /> assertListEquals(expected, out);<br /> }<br />}</span></span><br /><br /><span><span class="Apple-style-span" style="font-size: x-large;">JUnit's ReducerTest</span></span><br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">// (c) Copyright 2009 Cloudera, Inc.<br />// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)<br /><br />package index;<br /><br />import static org.apache.hadoop.mrunit.testutil.ExtendedAssert.assertListEquals;<br /><br />import java.io.IOException;<br />import java.util.ArrayList;<br />import java.util.List;<br /><br />import junit.framework.TestCase;<br /><br />import org.apache.hadoop.io.Text;<br />import org.apache.hadoop.mapreduce.Reducer;<br />import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;<br />import org.apache.hadoop.mrunit.types.Pair;<br />import org.junit.Before;<br />import org.junit.Test;<br /><br />/**<br /> * Test cases for the inverted index reducer.<br /> */<br />public class ReducerTest extends TestCase {<br /><br /> private Reducer&lt;Text, Text, Text, Text&gt; reducer;<br /> private ReduceDriver&lt;Text, Text, Text, Text&gt; driver;<br /><br /> @Before<br /> public void setUp() {<br /> reducer = new LineIndexReducer();<br /> driver = new ReduceDriver&lt;Text, Text, Text, Text&gt;(reducer);<br /> }<br /><br /> @Test<br /> public void testOneOffset() {<br /> List&lt;Pair&lt;Text, Text&gt;&gt; out = null;<br /><br /> try {<br /> out = driver.withInputKey(new Text("word")).withInputValue(new Text("offset")).run();<br /> } catch (IOException ioe) {<br /> fail();<br /> }<br /><br /> List&lt;Pair&lt;Text, Text&gt;&gt; expected = new ArrayList&lt;Pair&lt;Text, Text&gt;&gt;();<br /> expected.add(new Pair&lt;Text, Text&gt;(new Text("word"), new Text("offset")));<br /><br /> assertListEquals(expected, out);<br /> }<br /><br /> @Test<br /> public void testMultiOffset() {<br /> List&lt;Pair&lt;Text, Text&gt;&gt; out = null;<br /><br /> try {<br /> out = driver.withInputKey(new Text("word")).withInputValue(new Text("offset1")).withInputValue(new Text("offset2")).run();<br /> } catch (IOException ioe) {<br /> fail();<br /> }<br /><br /> List&lt;Pair&lt;Text, Text&gt;&gt; expected = new ArrayList&lt;Pair&lt;Text, Text&gt;&gt;();<br /> expected.add(new Pair&lt;Text, Text&gt;(new Text("word"), new Text("offset1,offset2")));<br /><br /> assertListEquals(expected, out);<br /> }<br /><br />}</span></span><br /><br />You can test them using the command "ant test" in the source-code directory, as usual, to confirm that the implementation is correct:<br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">training@training-vm:~/git/exercises/shakespeare$ ant test<br />Buildfile: build.xml<br /><br />compile:<br /> [javac] Compiling 4 source files to /home/training/git/exercises/shakespeare/bin<br /><br />test:<br /> [junit] Running index.AllTests<br /> [junit] Testsuite: index.AllTests<br /> [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.418 sec<br /> [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.418 sec<br /> [junit]<br /><br />BUILD SUCCESSFUL<br />Total time: 2 seconds</span></span><br /><br /><span 
style="font-size:large;"><span class="Apple-style-span" style="font-size: x-large;">Replacing JobConf and other deprecated classes</span><br /></span><br />Other API changes are related to the configuration and execution of the jobs. The class "JobConf" has been deprecated, but most of the tutorials have not been updated. So, here's the updated version of the main example driver, using the Configuration and Job classes. Note that the job is configured and executed with the default version of the configuration, the class responsible for configuring the execution of the tasks. Once again, replacing the classes located in the package "mapred" is important, since the new classes are located in the package "mapreduce".<br /><br /><span><span class="Apple-style-span" style="font-size: x-large;">InvertedIndex driver<br /></span></span><br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">// (c) Copyright 2009 Cloudera, Inc.<br />// Hadoop 0.20.1 API Updated by Marcello de Sales (marcello.desales@gmail.com)<br />package index;<br /><br />import org.apache.hadoop.conf.Configuration;<br />import org.apache.hadoop.conf.Configured;<br />import org.apache.hadoop.fs.Path;<br />import org.apache.hadoop.io.Text;<br />import org.apache.hadoop.mapreduce.Job;<br />import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;<br />import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;<br />import org.apache.hadoop.util.Tool;<br />import org.apache.hadoop.util.ToolRunner;<br /><br />/**<br /> * LineIndexer Creates an inverted index over all the words in a document corpus, mapping each observed word to a list<br /> * of filename@offset locations where it occurs.<br /> */<br />public class LineIndexer extends Configured implements Tool {<br /><br /> // where to put the data in hdfs when we're done<br /> private static final String OUTPUT_PATH = "output";<br /><br /> // where to read the data from.<br 
/> private static final String INPUT_PATH = "input";<br /></span></span><br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';"> public static void main(String[] args) throws Exception {<br /> int res = ToolRunner.run(new Configuration(), new LineIndexer(), args);<br /> System.exit(res);<br /> }<br /><br /> public int run(String[] args) throws Exception {<br /><br /> Configuration conf = getConf();<br /> Job job = new Job(conf, "Line Indexer 1");<br /><br /> job.setJarByClass(LineIndexer.class);<br /> job.setMapperClass(LineIndexMapper.class);<br /> job.setReducerClass(LineIndexReducer.class);<br /><br /> job.setOutputKeyClass(Text.class);<br /> job.setOutputValueClass(Text.class);<br /><br /> FileInputFormat.addInputPath(job, new Path(INPUT_PATH));<br /> FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));<br /><br /> return job.waitForCompletion(true) ? 
0 : 1;<br /> }<br />}</span></span><br /><br />After updating, make sure to generate a new jar, remove anything under the directory "output" (since the program does not clean that up), and execute the new version.<br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">training@training-vm:~/git/exercises/shakespeare$ ant jar<br />Buildfile: build.xml<br /><br />compile:<br /> [javac] Compiling 4 source files to /home/training/git/exercises/shakespeare/bin<br /><br />jar:<br /> [jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar<br /><br />BUILD SUCCESSFUL<br />Total time: 1 second</span></span><br /><br />I have added 2 ASCII books to the input directory: the works of Leonardo da Vinci and the first volume of the book "The Outline of Science".<br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">training@training-vm:~/git/exercises/shakespeare$ hadoop fs -ls input<br />Found 3 items<br />-rw-r--r-- 1 training supergroup 5342761 2009-12-30 11:57 /user/training/input/all-shakespeare<br />-rw-r--r-- 1 training supergroup 1427769 2010-01-04 17:42 /user/training/input/leornardo-davinci-all.txt<br />-rw-r--r-- 1 training supergroup 674762 2010-01-04 17:42 /user/training/input/the-outline-of-science-vol1.txt</span></span><br /><br />The execution and output of this example are shown as follows.<br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.LineIndexer<br />10/01/04 21:11:55 INFO input.FileInputFormat: Total input paths to process : 3<br />10/01/04 21:11:56 INFO mapred.JobClient: Running job: job_200912301017_0017<br />10/01/04 21:11:57 INFO 
mapred.JobClient: map 0% reduce 0%<br />10/01/04 21:12:07 INFO mapred.JobClient: map 33% reduce 0%<br />10/01/04 21:12:10 INFO mapred.JobClient: map 58% reduce 0%<br />10/01/04 21:12:13 INFO mapred.JobClient: map 63% reduce 0%<br />10/01/04 21:12:16 INFO mapred.JobClient: map 100% reduce 11%<br />10/01/04 21:12:28 INFO mapred.JobClient: map 100% reduce 77%<br />10/01/04 21:12:34 INFO mapred.JobClient: map 100% reduce 100%<br />10/01/04 21:12:36 INFO mapred.JobClient: Job complete: job_200912301017_0017<br />10/01/04 21:12:36 INFO mapred.JobClient: Counters: 17<br />10/01/04 21:12:36 INFO mapred.JobClient: Job Counters<br />10/01/04 21:12:36 INFO mapred.JobClient: Launched reduce tasks=1<br />10/01/04 21:12:36 INFO mapred.JobClient: Launched map tasks=3<br />10/01/04 21:12:36 INFO mapred.JobClient: Data-local map tasks=3<br />10/01/04 21:12:36 INFO mapred.JobClient: FileSystemCounters<br />10/01/04 21:12:36 INFO mapred.JobClient: FILE_BYTES_READ=58068623<br />10/01/04 21:12:36 INFO mapred.JobClient: HDFS_BYTES_READ=7445292<br />10/01/04 21:12:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=92132872<br />10/01/04 21:12:36 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=26638259<br />10/01/04 21:12:36 INFO mapred.JobClient: Map-Reduce Framework<br />10/01/04 21:12:36 INFO mapred.JobClient: Reduce input groups=0<br />10/01/04 21:12:36 INFO mapred.JobClient: Combine output records=0<br />10/01/04 21:12:36 INFO mapred.JobClient: Map input records=220255<br />10/01/04 21:12:36 INFO mapred.JobClient: Reduce shuffle bytes=34064153<br />10/01/04 21:12:36 INFO mapred.JobClient: Reduce output records=0<br />10/01/04 21:12:36 INFO mapred.JobClient: Spilled Records=2762272<br />10/01/04 21:12:36 INFO mapred.JobClient: Map output bytes=32068217<br />10/01/04 21:12:36 INFO mapred.JobClient: Combine input records=0<br />10/01/04 21:12:36 INFO mapred.JobClient: Map output records=997959<br />10/01/04 21:12:36 INFO mapred.JobClient: Reduce input records=997959</span></span><span 
class="Apple-style-span" style="font-family:'courier new';"><br /></span><br />The index entry for the word "abandoned" is an example of a word present in all of the books:<br /><br /><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">training@training-vm:~/git/exercises/shakespeare$ hadoop fs -cat output/part-r-00000 | less</span></span><span class="Apple-style-span" style="font-family:'courier new';"><br /></span><div><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';"> ...</span></span><span class="Apple-style-span" style="font-family:'courier new';"><br /></span></div><div><span style="font-size:x-small;"><span class="Apple-style-span" style="font-family:'courier new';">...<br /></span></span><span class="Apple-style-span" style="font-family:'courier new';"><br /></span></div><span style="font-size:x-small;"><span><span class="Apple-style-span" style="font-family:'courier new';">abandoned leornardo-davinci-all.txt@1257995,leornardo-davinci-all.txt@652992,all-shakespeare@4657862,all-shakespeare@738818,the-outline-of-science-vol1.txt@642211,the-outline-of-science-vol1.txt@606442,the-outline-of-science-vol1.txt@641585</span></span></span><span class="Apple-style-span" style="font-family:'courier new';"><br />...<br />...</span></div>Anonymousnoreply@blogger.com6tag:blogger.com,1999:blog-8402055500123731062.post-43562677262093026762009-12-31T03:47:00.000-08:002010-01-05T22:56:02.813-08:00Hadoop, mongoDB: MapReduce toolsThis blog post will eventually be written when time permits :)<br /><br />- Is there any use case for using mongoDB as a data server for Hadoop?Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-8402055500123731062.post-60424644421770667862009-12-17T17:57:00.000-08:002009-12-17T17:58:48.061-08:00Finished... 23 Months Gone!!! Got my M.S. degreeFinally, it is over!!! I got my M.S. 
degree in Computer Science at SFSU. It was a long, fun, and sometimes tiring journey, but I finally accomplished another dream of mine...<br />
<br />
The thesis write-up was a big rush... 4 months, no sleep... I got lots of experience in data persistence for sensor networks, cloud-computing techniques for data persistence using the key-value-pair data model, and database shards and partitioning...<br />
<br />
Well, that is all...<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/_xT0hoI9ybHE/Syrhz-B5cLI/AAAAAAAAECk/WZpK8w84-_A/s1600-h/2009-12-17+10.43.19.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/_xT0hoI9ybHE/Syrhz-B5cLI/AAAAAAAAECk/WZpK8w84-_A/s320/2009-12-17+10.43.19.jpg" /></a><br />
</div>Anonymousnoreply@blogger.com2tag:blogger.com,1999:blog-8402055500123731062.post-20548412050357692812009-11-16T12:57:00.000-08:002009-11-16T12:59:07.813-08:00Presentation of a Peer-Reviewed Paper...In the middle of my dissertation write-up, I went to present a peer-reviewed paper related to my M.S. dissertation... It was fun to get feedback from other researchers in the area of Sensor Networks...<br />
<br />
<span style="background-color: #fff2cc;">Arno Puder, Teresa Johnson, Kleber Sales, Marcello de Sales, and Dale Davidson. </span><b><i><span style="background-color: #fff2cc;">A Component-based Sensor Network for Environmental Monitoring</span></i></b><span style="background-color: #fff2cc;">. In SNA-2009: 1st International Conference on Sensor Networks and Applications, pages 54–60, San Francisco, CA, USA, 2009. The International Society for Computers and Their Applications - ISCA.</span><br />
<br />
Below is a picture of the event, which took place on November 4th, 2009.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/_xT0hoI9ybHE/SwG8uCcDffI/AAAAAAAAECU/LgREgcSbMQ8/s1600/IMG_0079.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/_xT0hoI9ybHE/SwG8uCcDffI/AAAAAAAAECU/LgREgcSbMQ8/s320/IMG_0079.JPG" /></a><br />
</div>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-8402055500123731062.post-30005099963984680052009-07-30T04:06:00.000-07:002009-07-30T06:11:36.738-07:00Building a Clustered Shared SubNetwork with VirtualBox 3.0.2 (Ubuntu 9.04 Host + Ubuntu 9.04 and Windows XP Guests)<div style="text-align: center;"><span class="Apple-style-span" style="color:#0000EE;"><span class="Apple-style-span" style="text-decoration: underline;"><span class="Apple-style-span" style="color: rgb(0, 0, 0); -webkit-text-decorations-in-effect: none; white-space: pre; font-family:Arial;font-size:13px;">VirtualBox 3.0.2 (Ubuntu 9.04 Host + Ubuntu & Windows Guests)</span></span></span></div><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_xT0hoI9ybHE/SnGDl1ZF_gI/AAAAAAAAD-U/WKSRAEgKcTw/s1600-h/Screenshot-VirtualBox+-+About.png"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px; height: 240px;" src="http://4.bp.blogspot.com/_xT0hoI9ybHE/SnGDl1ZF_gI/AAAAAAAAD-U/WKSRAEgKcTw/s320/Screenshot-VirtualBox+-+About.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5364213317185699330" /></a><br /><div><b>Problem</b>: to have a cluster of machines for development using VirtualBox guests. That is, I want to be able to have different operating systems running on my own development box, with each machine sharing the same Internet connection (my main one) and acquiring an IP address from the same DHCP server (aka wireless router, or whatever...)</div><div><br /></div><div><b>Description</b>:</div>Finally, something to make all of us happy! VirtualBox 3.0.2 delivers smooth support for shared NAT and subnetworking out-of-the-box!!! It has always been a dream of mine to use VirtualBox with this kind of agility, running as many development machines as my host memory allows. 
<div><br /></div><div>What I accomplished tonight was the setup of a shared subnetwork between my host Ubuntu 9.04 and my guests on Ubuntu 9.04 and Windows XP SP2. All I did was set up the network card of each of my guests to use a Bridged Adapter and choose the interface on which I have Internet connectivity. By the way, the connection I am using is my wireless card.</div><div><br /></div><div>This can help anyone who needs different boxes and installations of machines to develop and test applications that depend on the network, shared resources, etc. For example, I need to build a <a href="http://code.google.com/p/netbeams">Sensor Network</a> that sends collected data to the main data center as my thesis' simplest workspace. On a single box, my development environment contains the source code for both the Sensors and the Main node. However, with the use of Subversion and SCP I can transfer artifacts between the boxes and still test the transport layer of my sensor network. </div><div><br /></div><div><b>Setup</b>:</div><div>To better describe what I have, here's the description of each member of my network made with VirtualBox 3.0.2: </div><div><br /></div><div>* Install VirtualBox 3.0.2 from http://www.virtualbox.org/wiki/Downloads</div><div><br /></div><div><b>* Host</b>: Ubuntu 9.04 on an IBM ThinkPad T42, with 2GB (yes, my personal HP with 4GB recently burned out... 
NEVER BUY HP LAPTOPS).</div><div><br /></div><div><div><div>mdesales@mdesales-laptop:~$ ifconfig</div><div>eth0 Link encap:Ethernet HWaddr 00:11:25:81:48:6d </div><div> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1</div><div> RX packets:0 errors:0 dropped:0 overruns:0 frame:0</div><div> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0</div><div> collisions:0 txqueuelen:1000 </div><div> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)</div><div><br /></div><div><b>eth1</b> Link encap:Ethernet HWaddr 00:0e:35:c6:5a:0f </div><div> inet addr:<b>192.168.0.102</b> Bcast:192.168.0.255 Mask:255.255.255.0</div><div> inet6 addr: fe80::20e:35ff:fec6:5a0f/64 Scope:Link</div><div> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1</div><div> RX packets:583718 errors:235 dropped:235 overruns:0 frame:0</div><div> TX packets:316625 errors:0 dropped:5 overruns:0 carrier:0</div><div> collisions:0 txqueuelen:1000 </div><div> RX bytes:763972253 (763.9 MB) TX bytes:30157656 (30.1 MB)</div><div> Interrupt:11 Base address:0x4000 Memory:c0214000-c0214fff</div><div><br /></div></div><div><br /></div></div><div>Note that you will need to know through which device you currently have Internet connectivity. In my case, as you can see, I have it through eth1. 
I acquire IPs through my private Wireless Router DHCP, which is connected to the Cable modem, in the range of <b>192.168.0.100</b> - <b>192.168.0.199</b>.</div><div><br /></div><div>* Guest 1: Ubuntu 9.04 Desktop</div><div>* Guest 2: Ubuntu 9.04 Mini</div><div>* Guest 3: Windows XP SP3</div><div><br /></div><div><br /></div><div><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_xT0hoI9ybHE/SnGFgX40nyI/AAAAAAAAD-k/trkYNhb3EC4/s1600-h/Screenshot-guest-boxes.png"><img src="http://1.bp.blogspot.com/_xT0hoI9ybHE/SnGFgX40nyI/AAAAAAAAD-k/trkYNhb3EC4/s400/Screenshot-guest-boxes.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5364215422389624610" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 135px; " /></a></div><div><br /></div><div>For each of them, I just configured the network device to use the Bridged Interface and chose the correct device. To be clear, you want to choose the one through which you currently have an Internet connection. In my case, it is eth1, which happens to be my wireless card. However, yours can differ!</div><div><br /></div><div><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_xT0hoI9ybHE/SnGGi_kWHjI/AAAAAAAAD-s/42nNRJrvYIE/s1600-h/Screenshot-NetBeams+Development+-+Settings.png"><img src="http://1.bp.blogspot.com/_xT0hoI9ybHE/SnGGi_kWHjI/AAAAAAAAD-s/42nNRJrvYIE/s400/Screenshot-NetBeams+Development+-+Settings.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5364216566912523826" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 324px; " /></a></div><div><br /></div><div>* <b>Environment Evaluation</b>: I'd like to ping, ssh, share resources, build a MySQL cluster and a Hadoop grid, test my Grails app, write an Android app... 
The dream I have always had is to have my own "cloud" of pre-defined VirtualBox machines with the profiles I usually need. I must say that I can finally rest my mind! I could finally make it work... </div><div><br /></div><div>I have mixed 2 different projects: my thesis boxes and a Software Engineering research project developed in 2 different environments (that's why I love keeping my VirtualBox hard-disks in my backup storage for later reuse). Most people who have tried integrating guests could only PING a machine, but not SSH, samba, etc. I got frustrated 3 or 4 months ago about the same situation and I had to settle. Anyway, the truth is that tonight I got it working without any voodoo commands to turn off the firewall or to configure tunnels... I just wanted it working... See how many people were having related problems here (http://www.savvyadmin.com/virtualbox-host-interface-networking-with-nat/)</div><div><br /></div><div>Let's see each machine's IP address</div><div><br /></div><div><div>* Guest 2: Ubuntu 9.04 Mini: <b>192.168.0.100</b></div><div><br /></div><div><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_xT0hoI9ybHE/SnGOxpHPgaI/AAAAAAAAD-0/U3lGrxnoM8U/s1600-h/Screenshot-NetBeams+Sensor+Development+%5BRunning%5D+-+Sun+VirtualBox.png"><img src="http://1.bp.blogspot.com/_xT0hoI9ybHE/SnGOxpHPgaI/AAAAAAAAD-0/U3lGrxnoM8U/s400/Screenshot-NetBeams+Sensor+Development+%5BRunning%5D+-+Sun+VirtualBox.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5364225614675935650" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 237px; " /></a></div><div>* Guest 3: Windows XP SP3: <b>192.168.0.103</b>. Note that the ping is from the Guest 2 to the Guest 3 machine!!! 
I had never accomplished this with previous versions of VirtualBox without reading the documentation for hours...</div><div><br /></div><div><br /></div><div><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_xT0hoI9ybHE/SnGQGcP21_I/AAAAAAAAD_M/GEgEfgRIIvA/s1600-h/Screenshot-ping-Guest-2-Guest-3.png"><img src="http://3.bp.blogspot.com/_xT0hoI9ybHE/SnGQGcP21_I/AAAAAAAAD_M/GEgEfgRIIvA/s400/Screenshot-ping-Guest-2-Guest-3.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5364227071511287794" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 300px; " /></a></div><div><br /></div><div>* <b>Pinging the host from the Guest 2</b>: this step I could reproduce using the voodoo steps described on other pages. Here's an example of the guest 2 pinging the host. </div><div><br /></div><div><br /></div><div><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_xT0hoI9ybHE/SnGPjUMPWMI/AAAAAAAAD_E/534X4oWmj28/s1600-h/Guest2-to-Host.png"><img src="http://4.bp.blogspot.com/_xT0hoI9ybHE/SnGPjUMPWMI/AAAAAAAAD_E/534X4oWmj28/s400/Guest2-to-Host.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5364226468053211330" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 345px; " /></a></div><div><br /></div><div>* <b>SSH from among the machines</b>: one HUGE step was that, at this time, we can definitely SSH from each of the machines. I just installed Ubuntu with openSSH server in each of them. That's ALL!!! 
I'm making an SSH connection from my machine to the Guest 2.</div><div><br /></div><div><br /></div><div><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_xT0hoI9ybHE/SnGRRwDYjvI/AAAAAAAAD_U/dNAP1YbBZUU/s1600-h/Screenshot-marcello@ysisensor01:+~+ssh.png"><img src="http://3.bp.blogspot.com/_xT0hoI9ybHE/SnGRRwDYjvI/AAAAAAAAD_U/dNAP1YbBZUU/s400/Screenshot-marcello@ysisensor01:+~+ssh.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5364228365317869298" style="float: left; margin-top: 0px; margin-right: 10px; margin-bottom: 10px; margin-left: 0px; cursor: pointer; width: 400px; height: 152px; " /></a>This is the SSH session between the Guest 2 and the Host.</div><div><br /></div><div><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_xT0hoI9ybHE/SnGR9ewzM6I/AAAAAAAAD_k/2FqMUR1C7F4/s1600-h/Screenshot-guest-03-guest-02-host.png"><img src="http://4.bp.blogspot.com/_xT0hoI9ybHE/SnGR9ewzM6I/AAAAAAAAD_k/2FqMUR1C7F4/s400/Screenshot-guest-03-guest-02-host.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5364229116590764962" style="float: right; margin-top: 0px; margin-right: 0px; margin-bottom: 10px; margin-left: 10px; cursor: pointer; width: 400px; height: 227px; " /></a><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div>The SSH connection between the hosts works. In this screen, the user SSHes from the Guest 3 (Windows) to the Guest 2 (Ubuntu). </div><div><br /></div><div>Note that the window at the far side of the screenshot is the Host.</div><div><br /></div><div>After testing all these features, I wanted to verify that the communication works without any changes. 
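As a quick programmatic sanity check of the bridged setup described above, here is a minimal Java sketch (my own addition, not part of the original setup) that verifies the host and guest addresses listed in this post really sit on one subnet; the class name <b>SubnetCheck</b> is hypothetical:<br /><br />

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Minimal sketch: checks that the host and guests of the bridged setup
// share one subnet. The IPs are the ones listed in this post.
public class SubnetCheck {

    // True when both IPv4 addresses fall into the same subnet under the mask.
    static boolean inSameSubnet(String a, String b, String mask) throws UnknownHostException {
        byte[] pa = InetAddress.getByName(a).getAddress();
        byte[] pb = InetAddress.getByName(b).getAddress();
        byte[] pm = InetAddress.getByName(mask).getAddress();
        for (int i = 0; i < pa.length; i++) {
            if ((pa[i] & pm[i]) != (pb[i] & pm[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        String host = "192.168.0.102";                          // Ubuntu host, eth1
        String[] guests = { "192.168.0.100", "192.168.0.103" }; // Guest 2, Guest 3
        for (String guest : guests) {
            System.out.println(guest + " on host subnet: "
                    + inSameSubnet(host, guest, "255.255.255.0"));
            // A rough substitute for ping (only succeeds while the guest is up):
            // System.out.println(InetAddress.getByName(guest).isReachable(2000));
        }
    }
}
```

The commented-out isReachable() call is a rough "ping" and needs the guests running; the subnet check itself is purely local.<br />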
</div><div><br /></div><div>I was lucky to find out that everything works out-of-the-box, including the Internet connection!!!!</div><div><br /></div><div><br /></div><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_xT0hoI9ybHE/SnGTay3s1SI/AAAAAAAAD_s/vGbw0vgKP2Q/s1600-h/Screenshot-internet-connection-guest-03.png"><img src="http://3.bp.blogspot.com/_xT0hoI9ybHE/SnGTay3s1SI/AAAAAAAAD_s/vGbw0vgKP2Q/s400/Screenshot-internet-connection-guest-03.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5364230719716250914" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 290px; " /></a><div> </div></div><div><div> </div></div>Anonymousnoreply@blogger.com2tag:blogger.com,1999:blog-8402055500123731062.post-43790691079679290962008-07-28T21:01:00.000-07:002008-07-28T21:14:03.118-07:00Linux Magazine: "Discovering DCCP" article I contributed finally publishedI'm on my first vacation in more than 4 years and I got very good news... While my brother came to San Francisco to visit me, I helped him out with his experiments for his M.S. dissertation questions about the DCCP protocol. His findings about how that protocol helps multimedia applications over the Internet were published in Linux Magazine's August 2008 issue: "Congestion Control: Developing multimedia applications with DCCP". I just got my name spelled Marcello Junior :) instead of Marcello de Sales. 
I usually omit my suffix, commonly shortened to Jr.<br /><br />For more about the article, see <a href="http://www.linux-magazine.com/issues/2008/93">http://www.linux-magazine.com/issues/2008/93</a>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-8402055500123731062.post-43906105774311041112008-04-25T18:28:00.000-07:002008-07-28T21:14:47.524-07:00No Eclipse crash after FRESH INSTALLATION OF UBUNTU 8.04 64bits (sharing with Ubuntu 7.10 and Vista)Hello all,<br /><br />It has been a while since I last posted something... These days I can finally work with Eclipse without any crash!!! Yes, Ubuntu 8.04 was finally released yesterday, and I downloaded the 64bit alternate version for AMD 64bit processors and installed it on my Dual Core Intel 32bit. The reason is that I have 4GB of physical memory and a 32bit OS can only address 3GB...<br /><br />The installation process went smoothly using the alternate CD; it recognized all my other OS's, and I'm just migrating from one to another little by little... Although everything was just fine during the installation, Java 5 doesn't run on this Ubuntu, and Java 6 was the only way to have Java-based applications running... Another negative point is the shipment of Firefox 3 beta, which can be really fast on Gmail accounts, but high memory consumption is still an issue, so downgrading to Firefox 2.x was my solution.<br /><br />All the steps of the installation are described at <a href="http://ubuntuforums.org/showpost.php?p=4794923&postcount=458">http://ubuntuforums.org/showpost.php?p=4794923&postcount=458</a><br /><br />Good luck!Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-8402055500123731062.post-32351124672108563132008-02-22T09:19:00.000-08:002008-02-23T01:43:19.350-08:00AOP rescued me today!!!Agility has to do with applying programming techniques to solve common problems that would usually take a long time to develop. 
If it's not a time constraint, it is code design and this "bad smell" is going to rest there until you use refactoring methods to organize "your home".<br /><br />During the development of my <a href="http://code.google.com/p/v-octopus/">vOtopus</a>, my personal web server for my Advanced Internet Design and Engineering class, I got the usual requirement of loggin capability, where I should print out the execution path for debug purposes, and about the clients' request information. So I started adding the following Logger information:<br /><pre class="prettyprint">import java.util.logging.Logger;<br />...<br /></pre><pre class="prettyprint">private Logger vologger = Logger.getLogger("tracer");</pre> I started smelling the bad code again when I had pasted the same line of code into 3 classes... When you only know OOP, or Object-Oriented Programming, you can do everything using the same old Pojo classes. However, I thought that this would be the best time to rescue myself and put AOP, or <a href="http://en.wikipedia.org/wiki/Aspect-oriented_programming">Aspect-Oriented Programming</a>, into practice (Check the Wikipedia for a complete introduction to the concepts behind AOP).<br /><br />Anyway, it had been a long time when I first studied AOP, still in 2004 while taking my Internship at Motorola, and I was just using on simple and small applications by my own using AspectJ, one of the most used implementations and available at the Eclipse IDE as a plug-in. Today, I definitely felt the need to use this paradigm to quickly design and implement a simple and reusable solution, maintaining quality and integrity of my system already designed (<a href="http://code.google.com/p/v-octopus/wiki/ArchitecturalUMLDesign">VOtopus Architecture</a>). 
I tried to keep the design as loosely coupled as possible (most dependencies are weak) through the use of the "Program to an Interface, not to an Implementation" idea.<br /><ul><li>Log requirements</li></ul>I just need to print the execution path, or the access information, into my web server log. Instead of pasting the code snippet above everywhere, I created an Aspect whose responsibilities, implemented here as pointcuts, intercept all the classes of my already-designed system that I wanted to log, without adding invasive code, as I was doing while pasting the logging code snippet.<br /><pre class="prettyprint">pointcut loggableCalls() : execution(* edu.sfsu.cs.csc867.msales.voctopus..*(..)) || execution(edu.sfsu.cs.csc867.msales.voctopus..*.new(..));</pre> It's clear how Aspect-Oriented Programming deals with this cross-cutting concern. The "logging concern" isn't part of the web server business logic, but it is a concern that cross-cuts the entire application because each individual class needs that functionality.<br /><br />The next step was to create advices around the join points, the moments at which we are going to do something related to the pointcut: before the execution, after the execution, during the execution... Find out more about it at <a href="http://en.wikipedia.org/wiki/Aspect-oriented_programming">Aspect-Oriented Programming</a>. For my web server, I just want to give the advice saying "I'm advising you that my <span style="font-weight: bold;">join points</span> are the ones exactly <span style="font-weight: bold;">before</span> the <span style="font-weight: bold;">execution</span> of any method or constructor from the <span style="font-weight: bold;">pointcut</span> called loggableCalls".
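As an aside: since the design already follows "Program to an Interface", a JDK dynamic proxy can produce similar method-entry logging in plain Java — at the cost of wrapping every object by hand, which is exactly the invasive wiring the aspect avoids. A sketch with hypothetical names:

```java
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

public class TracingProxy {

    // Hypothetical request-handler interface, standing in for the
    // voctopus interfaces the pointcut intercepts.
    public interface RequestHandler {
        String handle(String path);
    }

    // Wraps any interface implementation and records a method-entry
    // line before delegating, mimicking the before() advice.
    @SuppressWarnings("unchecked")
    public static <T> T trace(Class<T> iface, T target, List<String> log) {
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] { iface },
                (Object proxy, Method method, Object[] args) -> {
                    log.add("Entering " + iface.getName() + "." + method.getName());
                    return method.invoke(target, args);
                });
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        RequestHandler handler = trace(RequestHandler.class, path -> "200 OK " + path, log);
        handler.handle("/index.html");
        System.out.println(log);  // one "Entering ... handle" line
    }
}
```

Every call through the proxied interface gets logged, but each object must be wrapped explicitly; the aspect does this for the whole package at once.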
It's really clear that this is the only code I needed to implement the "tracer" for method calls (private, default, protected, public):<br /><br /><pre class="prettyprint">before() : loggableCalls() {<br />    if (this.vologger.isLoggable(Level.INFO)) {<br />        Signature sig = thisJoinPointStaticPart.getSignature();<br />        this.vologger.logp(Level.INFO, sig.getDeclaringType().getName(), sig.getName(), "Entering");<br />    }<br />}</pre><br />There are a lot of discussions on this subject because of all the things you can do with AOP. Check the links above and have fun being agile with AspectJ.Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-8402055500123731062.post-32417506567261626332008-01-25T00:45:00.000-08:002008-01-25T01:09:42.459-08:00Off-topic: Admitted at the MS in Computer Science - SFSUWell, this is not an agile post, but the happiest post ever: I was admitted to the M.S. in Computer Science program at San Francisco State University... This is something I have dreamed of since high school, and I believe God makes everything in His own time.... I will create a new blog about the life of a student-professional-student back in the academic world...<br /><br />This will be fun... The ideas about a different Web 2.0 approach to my studies will be discussed on this new blog I'm thinking about, just to address information about "flattening the world of academic research"... Will discuss much more there...Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-8402055500123731062.post-55504617191054503642008-01-21T10:49:00.000-08:002008-01-22T01:32:44.140-08:00Never trust an IDE (shame on Eclipse's auto-completion feature)Being agile sometimes brings headaches that can make one lose hours going through debugging log files, sometimes without any clues to the newly created puzzle.
J2EE applications can give you that headache when you have to deal with applications and frameworks written without "proper" error handling/logging.<br /><br />My frustration started while finishing a personal application using the widely used MVC framework Struts. I already had lots of "controlled" Actions and I was just adding a new Action to the play, but I was trying to be Agile :D I had downloaded the newest version of Eclipse 3.3 and imported my project accordingly. Then, I promptly just fired the "Create New Class" wizard and extended Action. That's it!!! (I thought), but my headache had just started after I trusted my beloved IDE's auto-completion capabilities.<br /><br />Before I describe what I had implemented on my execute method (the one we must override in order to have our application "dance"), I found descriptions of which mistakes one can make in order to get the so-called blank/white page result after executing a Struts action. My application was also suffering from this well-known and poorly documented problem. According to a post at <a href="http://jmatrix.net/dao/case/case.jsp?case=7F000001-FA39D7-FDE4EE6C70-4">JMatrix</a>, there are two ways to get this blank-page side effect: the default Action is called instead of your Action class, or the input class on the action mapping is invalid. However, I'm writing this post just to add my 2 cents to the scope of this problem. Everything started when I trusted Eclipse to override the method signature below:<br /><span style="font-style: italic;"><blockquote>public ActionForward execute(ActionMapping mapping, ActionForm form, HttpServletRequest request, HttpServletResponse response)</blockquote></span>This is the correct signature of the method execute, the one that will be executed by the Action Executor.
However, after using the code-completion feature, I got the following signature:<br /> <span style="font-style: italic;"></span><blockquote><span style="font-style: italic;">/* (non-Javadoc)</span><br /><span style="font-style: italic;"> * @see org.apache.struts.action.Action#execute(org.apache.struts.action.ActionMapping, org.apache.struts.action.ActionForm, javax.servlet.ServletRequest, javax.servlet.ServletResponse)</span><br /><span style="font-style: italic;"> */</span><br /><span style="font-style: italic;"> public ActionForward execute(ActionMapping mapping, ActionForm form, ServletRequest request, ServletResponse response)</span><br /><span style="font-style: italic;"> throws Exception {</span></blockquote><span style="font-style: italic;"></span>Being agile is all about trusting the tools that help you be productive, and this was just a moment of sadness. To me, the code looked totally correct, but it had been completed with a different signature. Although the action is loaded, nothing will be executed in this scenario. I spent a few hours with nothing but, among others, the following debugger output lines, which proved that my Action class was being created and used by the Action executor class ActionCommandBase to execute the default "execute method".<br /><br /><span style="font-style: italic;"></span><blockquote><span style="font-style: italic;">00:48:11,658 DEBUG CreateForumPosterAction:27 - Starting the create forum poster action...</span><br /><span style="font-style: italic;">00:48:11,659 DEBUG AbstractCreateAction:93 - setting action to net.jsurfer.cryptonline.client.web.CreateForumPosterAction@192563a</span><br /><span style="font-style: italic;">00:48:11,660 DEBUG ActionCommandBase:49 - Executing org.apache.struts.chain.commands.servlet.ExecuteAction</span><br /></blockquote>Any clues?
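The symptom can be reproduced outside Struts in a few lines. Here is a sketch, with hypothetical classes standing in for the Servlet and Struts types, showing how an accidental overload silently hides the framework's call path:

```java
// Hypothetical stand-ins for the Servlet/Struts types, just to show
// the override-vs-overload trap outside of a container.
public class OverloadDemo {
    static class ServletRequest {}
    static class HttpServletRequest extends ServletRequest {}

    // Plays the role of org.apache.struts.action.Action: the framework
    // only ever calls execute(HttpServletRequest).
    static class Action {
        String execute(HttpServletRequest request) {
            return "default (blank page)";
        }
    }

    // Same method name, but the broader parameter type makes this a NEW
    // overload, not an override. Adding @Override here would turn the
    // mistake into a compile-time error.
    static class MyAction extends Action {
        String execute(ServletRequest request) {
            return "my logic";
        }
    }

    public static void main(String[] args) {
        Action action = new MyAction();
        // Dispatched through the superclass signature, so the subclass
        // method is never chosen: prints "default (blank page)".
        System.out.println(action.execute(new HttpServletRequest()));
    }
}
```

Putting the @Override annotation (available since Java 5) on the subclass method turns this silent bug into a compile error.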
The reason is that the added method was not overriding, but overloading the execute method from the class AbstractAction, which is just an empty method. Therefore, another way to get the blank page from the execution of a Struts Action class is to trust an IDE such as Eclipse and hope that it will auto-complete the overridden method when you need it. In other words, my 2 cents for the JMatrix list is:<br /><br />3. You have NOT overridden the method execute, but overloaded it with a different method signature. Be careful!!! It is HttpServletRequest and HttpServletResponse instead of ServletRequest and ServletResponse on the signature of the execute method. Never trust your beloved IDE.Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-8402055500123731062.post-32628930230122725892008-01-16T22:31:00.000-08:002008-01-22T01:40:02.273-08:00Installing SourceForge Enterprise Edition VMWare-based on Ubuntu 7.10 GutsyWhat about managing your team with a first-class distributed software development tool? SourceForge Enterprise Edition allows project managers to follow their project teams in a very convenient way. This post is just a hint on how to install a SourceForge Enterprise Edition VMware Player instance on Ubuntu 7.10...<br /><br />First of all, I'm assuming that you have Ubuntu 7.10 installed as your own desktop, remote server, etc... For remote servers, I recommend using <a href="http://ubuntuforums.org/showthread.php?t=620057">NXServers</a>, and in order to access your server just use the <a href="http://www.nomachine.com/documents/configuration/client-guide.php">NoMachine clients</a>.<br /><br />As a starting point, you can download SourceForge Enterprise Edition for 15 users from<a href="http://downloads.open.collab.net/sfee15.html"> http://downloads.open.collab.net/sfee15.html</a>.
I had downloaded the file SourceForge-4_4-DL6.zip and unzipped its contents to my home directory by running<br /><blockquote><span style="font-style: italic;">unzip SourceForge-4_4-DL6.zip</span></blockquote>With access to the GUI, you can proceed with the installation of VMware Player... Just follow the instructions on how to do so at the <a href="http://ubuntu-tutorials.com/2007/11/17/install-vmware-server-on-ubuntu-710-gutsy-gibbon-updated/">Ubuntu Forum</a>. Installing the build-essential package is mandatory!!! If the package update bugs you, complaining about the Ubuntu CD from which you installed Ubuntu, and you don't have access to one, just comment out the first line of the /etc/apt/sources.list file, the one about cdrom access... Finally, just complete the installation answering all the questions with the defaults, <span style="font-style: italic;">HAVING ONLY THE BRIDGED</span> configuration. If asked, verify which interface has your IP address by running "ifconfig" and use it in the configuration (mine was eth1)... Once configured, it will display a welcome message from the VMware team...<br /><br />With VMware up and running, go to Applications -> System Tools -> VMware Player. Then, click on "Open an existing Virtual Machine", go to the directory where you unzipped the download of SourceForge-4_4-DL6.zip, and select the only file for the SourceForge virtual machine... the VA Software CentOS server will begin to load... We are almost there... :D<br /><br />A lot of users (including me) just read the documentation available online, but for this installation the package includes the file "install_guide.pdf" with pre-installation instructions. I just needed to see which username and password allow me to make the first login to the SFEE server.
When prompted, just use the following:<br /><blockquote style="font-style: italic;">username: root<br />password: sourceforge</blockquote>At this step, since it's the first time you're running the installation, you will need to change the root password. Follow the instructions and go to the IP/Network configuration. I chose to have a static IP address in order to be able to have other HTTP servers on my network... After following the installation procedures, the SFEE server will be restarted and the configuration script will show the IP address and host name to use to access your SFEE installation from your local network. After that, you are all set: just hit the virtual host address in the browser and follow the welcome instructions... The following is the username and password for the admin account:<br /><blockquote>username: admin<br />password: admin<br /></blockquote>For management of projects using SFEE, a good starting point is the following documentation:<a href="http://sfee.open.collab.net/sf-help/en/doc/User_Guide.pdf"><br /><span style=""><span class="a">http://sfee.open.collab.net/sf-help/en/doc/User_Guide.pdf</span></span></a>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-8402055500123731062.post-3149062380707046822008-01-03T01:16:00.000-08:002008-01-03T01:28:30.838-08:00Axis2, Take 1... Developing POJO based ServicesI spent most of the last year developing with AXIS 1.3 ... Today I'm really happy with AXIS2 because of its plug-and-play capabilities when it comes to publishing services. I'm rewriting a J2EE application I wrote 6 years ago using the latest technologies, and of course publishing the Interfaces using AXIS2.<br /><br />First of all, install the Axis2 web application on your application server of choice (Tomcat, in my case). Then, start with the development of the aar POJO service: specifying the methods of the Interface, then describing the type of the messages (in-out, in only, out only) in the services descriptor file.
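For reference, the services descriptor mentioned above (services.xml, packaged inside the .aar under META-INF) is where those message types are declared, via message receivers per exchange pattern. A minimal sketch for a POJO service — the service and class names are made up for illustration, and the MEP URIs follow the Axis2 1.x samples:

```xml
<!-- META-INF/services.xml inside the .aar archive.
     Service and class names here are hypothetical. -->
<service name="LibraryService">
  <description>POJO service published through Axis2</description>
  <parameter name="ServiceClass">net.jsurfer.library.LibraryService</parameter>
  <messageReceivers>
    <!-- in-out operations (request/response) -->
    <messageReceiver mep="http://www.w3.org/2004/08/wsdl/in-out"
        class="org.apache.axis2.rpc.receivers.RPCMessageReceiver"/>
    <!-- in-only operations (one-way) -->
    <messageReceiver mep="http://www.w3.org/2004/08/wsdl/in-only"
        class="org.apache.axis2.rpc.receivers.RPCInOnlyMessageReceiver"/>
  </messageReceivers>
</service>
```

With this in place, every public method of the POJO becomes an operation, dispatched by the matching message receiver.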
After running the WSDL generator, one can customize the design or integrate it with your own schema catalog (in case one exists). The next step is just to publish the service through the Upload tool of the Axis2 application.<br /><br />The POJO implementation might be relatively slow, since there are other ways to implement the services using different techniques, such as an XML pull parser, which give direct performance improvements to the developer. I plan to measure the execution of different implementations of services, analyzing development and deployment strategies. Everything would differ only in the deployment automation scripts and patch generation.Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-8402055500123731062.post-82848125368134144292007-12-10T23:25:00.000-08:002007-12-30T13:47:16.729-08:00There's an encrypted message for you...How would you show the chair of the Computer Science department where you are applying that you are really interested in the program?
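(For the curious: the request below is a toy RSA message. Each number is one block x, and decryption is just Ascii(x) = x^D mod N with the private key given below. As a sketch, Java's BigInteger.modPow computes each block in one call:)

```java
import java.math.BigInteger;

public class RsaDecrypt {
    // Private key from the post: (N, D) = (4108901, 1641293).
    static final BigInteger N = BigInteger.valueOf(4108901);
    static final BigInteger D = BigInteger.valueOf(1641293);

    // Ascii(x) = x^D mod N, via fast modular exponentiation.
    static long decryptBlock(long x) {
        return BigInteger.valueOf(x).modPow(D, N).longValueExact();
    }

    public static void main(String[] args) {
        // First blocks of the encrypted request; the post reports
        // 1731, 39200 and 13220 for these.
        for (long b : new long[] {1277422, 2847316, 2693100}) {
            System.out.println(b + " -> " + decryptBlock(b));
        }
    }
}
```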
Although this was not Agile, I'd like to share my request...<br /><br />-> My Encrypted Request<br />1277422-2847316-2693100<div><wbr>-1261616-1-2938277-4074945<wbr>-3827899-25393-2962151-4074945<wbr>-1430646-1258669-59049-3200000<br />1629705-7776-3443146-3200000-1<wbr>-2899021-530979-3827899-161051<wbr>-371293-486282-2112274-39709<wbr>-1398153-3080659-3658437<br />1044731-3282995-212373-1812438<wbr>-3846068-1160197-530979<wbr>-1430646-32-580174-1768792<wbr>-1658195-1091755-2267810<wbr>-1497451<br />3200000-4046644-32-1419857<wbr>-573939-2041489-3647676-537178<wbr>-48909-3125-1091755-32-3640644<wbr>-1903585-1320821-2209028-1<br />4084101-32-3824505-1171281<wbr>-2228976-59049-3200000-48414<wbr>-4106401-2135263-2745785<wbr>-857024-3734162-4072015-4084101<br />159491-2228557-3846068-1776204<wbr>-2476099-59049-3200000-2812195<wbr>-3910998-25393-4084101-2256068<wbr>-7776-1430646-573939<br />161051-243-32-1-2556440<wbr>-3783996-4084101-806332-248832<wbr>-3853387-2644820-419895-3472022<br /><br />-> Setting the private key<br />(N , D) = (4108901 , 1641293)<br /><br />-> Decrypting each block<br />Ascii(x) = x ^ D mod N<br /><br />Ascii(1277422) = 1277422 ^ 1641293 mod 4108901 = 1731<br />Ascii(2847316) = 2847316 ^ 1641293 mod 4108901 = 39200<br />Ascii(2693100) = 2693100 ^ 1641293 mod 4108901 = 13220<br />Ascii(1261616) = 1261616 ^ 1641293 mod 4108901 = 821<br />Ascii(1) = 1 ^ 1641293 mod 4108901 = 1<br />Ascii(2938277) = 2938277 ^ 1641293 mod 4108901 = 21820<br />Ascii(4074945) = 4074945 ^ 1641293 mod 4108901 = 1132<br />Ascii(3827899) = 3827899 ^ 1641293 mod 4108901 = 2162<br />Ascii(25393) = 25393 ^ 1641293 mod 4108901 = 1113<br />Ascii(2962151) = 2962151 ^ 1641293 mod 4108901 = 219820<br />Ascii(4074945) = 4074945 ^ 1641293 mod 4108901 = 1132<br />Ascii(1430646) = 1430646 ^ 1641293 mod 4108901 = 197<br />Ascii(1258669) = 1258669 ^ 1641293 mod 4108901 = 20020<br />Ascii(59049) = 59049 ^ 1641293 mod 4108901 = 9<br />Ascii(3200000) = 3200000 ^ 1641293 mod 4108901 = 
20<br />Ascii(1629705) = 1629705 ^ 1641293 mod 4108901 = 521<br />Ascii(7776) = 7776 ^ 1641293 mod 4108901 = 6<br />Ascii(3443146) = 3443146 ^ 1641293 mod 4108901 = 216<br />Ascii(3200000) = 3200000 ^ 1641293 mod 4108901 = 20<br />Ascii(1) = 1 ^ 1641293 mod 4108901 = 1<br />Ascii(2899021) = 2899021 ^ 1641293 mod 4108901 = 200<br />Ascii(530979) = 530979 ^ 1641293 mod 4108901 = 132<br />Ascii(3827899) = 3827899 ^ 1641293 mod 4108901 = 2162<br />Ascii(161051) = 161051 ^ 1641293 mod 4108901 = 11<br />Ascii(371293) = 371293 ^ 1641293 mod 4108901 = 13<br />Ascii(486282) = 486282 ^ 1641293 mod 4108901 = 2216<br />Ascii(2112274) = 2112274 ^ 1641293 mod 4108901 = 20420<br />Ascii(39709) = 39709 ^ 1641293 mod 4108901 = 113<br />Ascii(1398153) = 1398153 ^ 1641293 mod 4108901 = 217<br />Ascii(3080659) = 3080659 ^ 1641293 mod 4108901 = 7183<br />Ascii(3658437) = 3658437 ^ 1641293 mod 4108901 = 13221<br />Ascii(1044731) = 1044731 ^ 1641293 mod 4108901 = 22<br />Ascii(3282995) = 3282995 ^ 1641293 mod 4108901 = 14211<br />Ascii(212373) = 212373 ^ 1641293 mod 4108901 = 2032<br />Ascii(1812438) = 1812438 ^ 1641293 mod 4108901 = 141<br />Ascii(3846068) = 3846068 ^ 1641293 mod 4108901 = 97<br />Ascii(1160197) = 1160197 ^ 1641293 mod 4108901 = 209<br />Ascii(530979) = 530979 ^ 1641293 mod 4108901 = 132<br />Ascii(1430646) = 1430646 ^ 1641293 mod 4108901 = 197<br />Ascii(32) = 32 ^ 1641293 mod 4108901 = 2<br />Ascii(580174) = 580174 ^ 1641293 mod 4108901 = 161<br />Ascii(1768792) = 1768792 ^ 1641293 mod 4108901 = 3221<br />Ascii(1658195) = 1658195 ^ 1641293 mod 4108901 = 6204<br />Ascii(1091755) = 1091755 ^ 1641293 mod 4108901 = 201<br />Ascii(2267810) = 2267810 ^ 1641293 mod 4108901 = 1321<br />Ascii(1497451) = 1497451 ^ 1641293 mod 4108901 = 67211<br />Ascii(3200000) = 3200000 ^ 1641293 mod 4108901 = 20<br />Ascii(4046644) = 4046644 ^ 1641293 mod 4108901 = 9212<br />Ascii(32) = 32 ^ 1641293 mod 4108901 = 2<br />Ascii(1419857) = 1419857 ^ 1641293 mod 4108901 = 17<br />Ascii(573939) = 
573939 ^ 1641293 mod 4108901 = 21620<br />Ascii(2041489) = 2041489 ^ 1641293 mod 4108901 = 121<br />Ascii(3647676) = 3647676 ^ 1641293 mod 4108901 = 413<br />Ascii(537178) = 537178 ^ 1641293 mod 4108901 = 2183<br />Ascii(48909) = 48909 ^ 1641293 mod 4108901 = 19920<br />Ascii(3125) = 3125 ^ 1641293 mod 4108901 = 5<br />Ascii(1091755) = 1091755 ^ 1641293 mod 4108901 = 201<br />Ascii(32) = 32 ^ 1641293 mod 4108901 = 2<br />Ascii(3640644) = 3640644 ^ 1641293 mod 4108901 = 101<br />Ascii(1903585) = 1903585 ^ 1641293 mod 4108901 = 99<br />Ascii(1320821) = 1320821 ^ 1641293 mod 4108901 = 20113<br />Ascii(2209028) = 2209028 ^ 1641293 mod 4108901 = 220020<br />Ascii(1) = 1 ^ 1641293 mod 4108901 = 1<br />Ascii(4084101) = 4084101 ^ 1641293 mod 4108901 = 21<br />Ascii(32) = 32 ^ 1641293 mod 4108901 = 2<br />Ascii(3824505) = 3824505 ^ 1641293 mod 4108901 = 1972<br />Ascii(1171281) = 1171281 ^ 1641293 mod 4108901 = 142<br />Ascii(2228976) = 2228976 ^ 1641293 mod 4108901 = 1620<br />Ascii(59049) = 59049 ^ 1641293 mod 4108901 = 9<br />Ascii(3200000) = 3200000 ^ 1641293 mod 4108901 = 20<br />Ascii(48414) = 48414 ^ 1641293 mod 4108901 = 12102<br />Ascii(4106401) = 4106401 ^ 1641293 mod 4108901 = 1613<br />Ascii(2135263) = 2135263 ^ 1641293 mod 4108901 = 21972<br />Ascii(2745785) = 2745785 ^ 1641293 mod 4108901 = 16132<br />Ascii(857024) = 857024 ^ 1641293 mod 4108901 = 18319<br />Ascii(3734162) = 3734162 ^ 1641293 mod 4108901 = 72<br />Ascii(4072015) = 4072015 ^ 1641293 mod 4108901 = 1013<br />Ascii(4084101) = 4084101 ^ 1641293 mod 4108901 = 21<br />Ascii(159491) = 159491 ^ 1641293 mod 4108901 = 70<br />Ascii(2228557) = 2228557 ^ 1641293 mod 4108901 = 2141<br />Ascii(3846068) = 3846068 ^ 1641293 mod 4108901 = 97<br />Ascii(1776204) = 1776204 ^ 1641293 mod 4108901 = 210<br />Ascii(2476099) = 2476099 ^ 1641293 mod 4108901 = 19<br />Ascii(59049) = 59049 ^ 1641293 mod 4108901 = 9<br />Ascii(3200000) = 3200000 ^ 1641293 mod 4108901 = 20<br />Ascii(2812195) = 2812195 ^ 1641293 mod 
4108901 = 52151<br />Ascii(3910998) = 3910998 ^ 1641293 mod 4108901 = 992<br />Ascii(25393) = 25393 ^ 1641293 mod 4108901 = 1113<br />Ascii(4084101) = 4084101 ^ 1641293 mod 4108901 = 21<br />Ascii(2256068) = 2256068 ^ 1641293 mod 4108901 = 8321<br />Ascii(7776) = 7776 ^ 1641293 mod 4108901 = 6<br />Ascii(1430646) = 1430646 ^ 1641293 mod 4108901 = 197<br />Ascii(573939) = 573939 ^ 1641293 mod 4108901 = 21620<br />Ascii(161051) = 161051 ^ 1641293 mod 4108901 = 11<br />Ascii(243) = 243 ^ 1641293 mod 4108901 = 3<br />Ascii(32) = 32 ^ 1641293 mod 4108901 = 2<br />Ascii(1) = 1 ^ 1641293 mod 4108901 = 1<br />Ascii(2556440) = 2556440 ^ 1641293 mod 4108901 = 852<br />Ascii(3783996) = 3783996 ^ 1641293 mod 4108901 = 10205<br />Ascii(4084101) = 4084101 ^ 1641293 mod 4108901 = 21<br />Ascii(806332) = 806332 ^ 1641293 mod 4108901 = 820<br />Ascii(248832) = 248832 ^ 1641293 mod 4108901 = 12<br />Ascii(3853387) = 3853387 ^ 1641293 mod 4108901 = 1421<br />Ascii(2644820) = 2644820 ^ 1641293 mod 4108901 = 520<br />Ascii(419895) = 419895 ^ 1641293 mod 4108901 = 52162<br />Ascii(3472022) = 3472022 ^ 1641293 mod 4108901 = 21133<br /><br />-> Complete message in ASCII<br />1731392001322082112182011322162<wbr>1113219820113219720020920521621<wbr>620120013221621113221<br />6204201132177183132212214211203<wbr>2141972091321972161322162042011<wbr>321672112092122172162<br />0121413218319920520121019920113<wbr>2200201212197214216209201210216<wbr>132197216132183197210<br />1321702141972101992052151992111<wbr>3218321619721620113218521020521<wbr>8201214215205216221133<br /><br />-> Original Message<br />I'd love to be admitted to the MS program at the Computer Science department at San Francisco State University!<br /></div>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-8402055500123731062.post-34316042784525746872007-11-28T22:10:00.000-08:002007-12-02T06:54:40.158-08:00Starting over again...Starting over again after having to quit CollabNet due to Visa regulations... 
However, nothing will let me down or stop me from doing what I love (I think I wrote the same thing in my last email to the company)...<br /><br />Anyway, I'm trying to be agile today after talking to my brother Leandro and accepting the position of CTO, WOW... Well, at least I will direct the company from the perspective of research and development and educate his startup with a flat view of the world of Software Development... It has been a really long day, after setting up the company's infrastructure at Google Apps (work for the CIO, anyone???)... I present to all of you Spry Software Factory, or just Spry Softwares, which will primarily be funded by services to Nokia...<br /><br />http://www.sprysoftwares.com<br /><br />The vision I have for this startup is to follow a mix of models, which includes a Software Factory and a Distributed Open-Source company. The imperatives to drive research and development will involve the following vertical markets:<br /><ul><li><span style="font-weight: bold;">Mobile Software Development</span>: primarily for the following platforms:</li><ul><li><a href="http://www.forum.nokia.com/main/platforms/maemo/index.html"><span style="font-weight: bold;">Nokia's Maemo Linux</span></a>: Leandro de Sales, the CEO, is about to defend his M.S. after contributing to the open-source world by implementing the DCCP protocol in the Linux kernel. Most solutions include maintenance of the Maemo platform and applications using UPnP;<br /><br /></li><li><a href="http://code.google.com/android/"><span style="font-weight: bold;">Google's Android</span></a>: After the announcement of US$10 million, who doesn't want to develop for this rich, promising platform...? I will take this opportunity to work on my M.S. dissertation (still looking to be admitted in the Bay Area) and implement my personal vision of Active User Interfaces, which I suggested to Motorola in 2004 and which unfortunately was rejected.
However, I am happy to see the results of limitation (After iPhone, Razor died...) and also happy to see Google's efforts to transform the world again, following what most of the 3G/4G applications in Japan already do... I'm a big fan of the NTT DoCoMo!<br /><br /></li></ul><li><span style="font-weight: bold;">E-Commerce Internet Applications</span>: I spent around a year and half developing osCommerce applications and I chose 2 applications to attack this imperative market:</li><ul><li><a href="http://www.oscommerce.com/community/contributions"><span style="font-weight: bold;">osCommerce Contributions</span></a>: developed in PHP, this shopping cart solution represents the most successful application for small companies starting their operations online;<br /><br /></li></ul><ul><li><a href="http://code.google.com/p/openkkart/"><span style="font-weight: bold;">KonaKart Open-Source Leadership</span></a>: I will drive the open-source Java development of the free components of this application. 
It will include the development of Contributions in a way of Java Plug-ins.<br /><br /></li></ul><li><span style="font-weight: bold;">Virtualization</span>: Transforming development teams with solutions of distributed software development and backup centers with the following technologies:</li><ul><li><span style="font-weight: bold;">Distributed VMWare</span>: Create profiles of operating systems, save their specification and let your employees reuse them accordingly;<br /><br /></li><li><a style="font-weight: bold;" href="http://www.collab.net/products/CUBiT/">CollabNet CUBi</a><a style="font-weight: bold;" href="http://www.collab.net/products/CUBiT/">T</a>: A highest level of control and customization for profiles, which includes the creation of online profiles based on Linux and Windows machines!<br /><br /></li></ul><li><span style="font-weight: bold;">Global, Agile, Distributed and Collaborative Development Process</span>: The most important of all verticals is the change in current development process of companies willing to step ahead the old processes used by using a proven way to run a business:</li><ul><li><span style="font-weight: bold;">Subversion on Demand</span>: Tracking changes from departments producing text-based artifacts through the use of Subversion, a version-control system which keeps track of every change you do;<br /><br /></li><li><span style="font-weight: bold;">ALM Tools</span>: Companies of any size can better manage their activities using a proven set of tools powered by Enterprise Application Life-cycle Management tools such as SourceForge Enterprise Edition, which includes tracking any activity of users from any department using Subversion on Demand technology;<br /><br /></li><li>Changes on an Enterprise development culture with Agile Methodologies, which can be applicable from Engineering to Management teams, in order to identify its own collaborative way with the highest level of productivity;<br /></li></ul></ul><br />Spent the 
day setting up the basic Google Apps for the company... Third-level domain names, or subdomains, are not a great feature on Google, since one has to go to the Advanced features in the setup section and, guess what? We are redirected to the place where our domain name was registered... In our case, eNom is the company, and everything has to be done there... But wait, after browsing for more than 2 hours I realized that the only way to link a given subdomain to a page is to create a domain URL frame instead of a regular CNAME, which would point to the Google Apps server... It's a shame, because I believe it would be easier to just offer that from its interface... I bought the domain from Google and I was expecting full integration....<br /><br />Anyway, I will link this blog as http://marcello.sprysoftwares.comAnonymousnoreply@blogger.com0