This blog demonstrates the use of big data and Hadoop with Pentaho Data Integration. I will walk through the basic Hadoop word-count example using PDI.
Prerequisites
- PDI version 5.0.0 or later installed on your system
- Apache Hadoop or any other Hadoop distribution installed in a VM or on a server
- A basic understanding of Big Data and Hadoop architecture. You can read some of my older posts on Big Data and Hadoop on codemphasis.
Steps
PDI provides very intuitive steps for working with HDFS and MapReduce. The steps under the Big Data group are used for all the Hadoop activities (if you are using PDI 4.x.x, they sit under the Hadoop group). Assuming you know a bit of Big Data and Hadoop, let me jump straight into demonstrating the Hadoop word-count example in PDI.

We will create one Job and two Transformations to perform the map-reduce activity. The transformations act like the Mapper and Reducer classes you would write in a Java map-reduce program, while the Job plays the role of the main/driver class, where all the configuration is done and the job is executed.
1. Creating a MAPPER transformation
The Mapper Transformation plays the same role as the Mapper class in map-reduce. It reads the data from HDFS as key/value pairs: the value is the block of data read by your Mapper, and the key is the sequence number (offset) associated with it.

In Kettle, we build the equivalent of the mapper class using the following steps in a transformation:
- MapReduce Input: Reads the data from HDFS as a key/value pair.
- Split Field to Rows: Splits the data set into individual words based on a delimiter.
- Add Constants: Adds a constant value to each row (the integer 1, representing a single count).
- MapReduce Output: The final output sent out by the Mapper as a key/value pair.
The work of the mapper .ktr ends with the MapReduce Output step. The Mapper then sends its data to the reducer .ktr as <Key, Value> pairs, where the key is the word that has been split out and the value is the constant integer representing a single count.
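For comparison, here is a minimal sketch of the hand-written Java mapper that this transformation mirrors. The class and field names (WordCountMapper, ONE, and so on) are illustrative and not part of the PDI example.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1); // the "Add Constants" step
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // "Split Field to Rows": break the incoming line into words
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                // "MapReduce Output": emit <word, 1>
                context.write(word, ONE);
            }
        }
    }
}
```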
2. Creating a REDUCER transformation
The Reducer Transformation plays the same role as the Reducer class in map-reduce programming. Its job is to aggregate (reduce) the data set coming in from the Mapper Transformation and finally give us a data set containing the total count of each word.

The image above shows a Reducer transformation. The steps we use are as follows:
- MapReduce Input: The data coming from the Mapper's output step is the input data here.
- Group By: The input data is grouped by the key (the word) and the values are summed to get the total count of each word.
- MapReduce Output: The final output is again a <Key, Value> pair containing each word and its total count.
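Again for comparison, this is a minimal sketch of the Java reducer that the transformation above stands in for; the class name WordCountReducer is illustrative only.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Equivalent of the "Group By" step: sum the 1s emitted for each word
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // "MapReduce Output": emit <word, total count>
        context.write(key, new IntWritable(sum));
    }
}
```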
3. Configuring the Pentaho MapReduce Step in Job
Pentaho MapReduce is the step we will use in our master Job. This Job is like the Driver class, or main class, of a map-reduce program, which handles all the necessary configuration.
Once we open the Pentaho MapReduce step in the Job, we are presented with a set of tabs to configure.
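To put that configuration in context, here is a minimal sketch of the Java driver class whose role the Pentaho MapReduce step takes over. It reuses the illustrative mapper and reducer classes sketched above, and it assumes the HDFS input and output paths are passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The Pentaho MapReduce step plays the same role as this driver:
        // it wires the mapper and reducer together and points them at HDFS paths.
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```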
