This blog demonstrates the use of big data and Hadoop with Pentaho Data Integration. I will walk through the basic Hadoop word-count example using PDI.
Prerequisites
- PDI version 5.0.0 or later installed on your system
- Apache Hadoop or any other Hadoop distribution installed in a VM or on a server
- A basic understanding of Big Data and Hadoop architecture. You can read some of my older posts on Big Data and Hadoop on codemphasis.
Steps
PDI provides very intuitive steps for working with HDFS and MapReduce. The steps under the Big Data group are used for all the Hadoop activities (if you are using PDI 4.x.x, they sit under the Hadoop group). Assuming you know a bit of Big Data and Hadoop, let me jump straight into demonstrating the Hadoop word-count example in PDI.

We will create one Job and two Transformations to perform the map-reduce activity. The transformations act like the Mapper and Reducer classes you would write in a Java map-reduce program, while the Job plays the role of the main/driver class, where all the configuration is done and the job is executed.
1. Creating a MAPPER transformation
The Mapper Transformation plays the same role as the Mapper class in map-reduce. It reads the data from HDFS as key/value pairs: the value is the block of data read by your Mapper, and the key is the sequence number (offset) associated with it.

In Kettle, we build the equivalent of the mapper class using the following steps in a transformation:
- MapReduce Input: Reads the data from HDFS as a key/value pair.
- Split Field to Rows: Splits the data set into individual words based on a delimiter.
- Add Constants: Adds a constant value to each row (the integer 1, representing a single count).
- MapReduce Output: The final output sent out by the Mapper as a key/value pair.
The work of the mapper .ktr ends with the MapReduce Output step. The Mapper then sends its data to the reducer .ktr as <Key, Value> pairs, where the key is the word that has been split out and the value is the constant integer representing a single count.
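For comparison, here is a minimal sketch of the hand-written Java mapper that this transformation mirrors. The class and field names (WordCountMapper, ONE, and so on) are illustrative and not part of the PDI example.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1); // the "Add Constants" step
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // "Split Field to Rows": break the incoming line into words
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                // "MapReduce Output": emit <word, 1>
                context.write(word, ONE);
            }
        }
    }
}
```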
2. Creating a REDUCER transformation
The Reducer Transformation plays the same role as the Reducer class in map-reduce programming. Its job is to aggregate (reduce) the data set coming in from the Mapper Transformation and finally give us a data set containing the total count of each word.

The image above shows a Reducer transformation. The steps we use are as follows:
- MapReduce Input: The data coming from the Mapper's output step is the input data here.
- Group By: The input data is grouped by the key (the word) and the values are summed to get the total count of each word.
- MapReduce Output: The final output is again a <Key, Value> pair containing each word and its total count.
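Again for comparison, this is a minimal sketch of the Java reducer that the transformation above stands in for; the class name WordCountReducer is illustrative only.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Equivalent of the "Group By" step: sum the 1s emitted for each word
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // "MapReduce Output": emit <word, total count>
        context.write(key, new IntWritable(sum));
    }
}
```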
3. Configuring the Pentaho MapReduce Step in Job
Pentaho MapReduce is the step we will use in our master Job. This Job is like the Driver class, or main class, of a map-reduce program, which handles all the necessary configuration.
Once we open the Pentaho MapReduce step in the Job, we are presented with a set of tabs to configure.
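To put that configuration in context, here is a minimal sketch of the Java driver class whose role the Pentaho MapReduce step takes over. It reuses the illustrative mapper and reducer classes sketched above, and it assumes the HDFS input and output paths are passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The Pentaho MapReduce step plays the same role as this driver:
        // it wires the mapper and reducer together and points them at HDFS paths.
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```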
