How to implement decision tree algorithms using MapReduce

Description

This example builds decision trees with the C4.5 algorithm, which produces a classification model in the form of a tree structure, implemented on the MapReduce framework. In the pseudocode below, the first MapReduce job counts the occurrences of each attribute value together with its class label. From these counts, the second MapReduce job computes each attribute's Gain Ratio using its information gain and split information; the reduce phase selects the attribute with the highest Gain Ratio as the best splitting attribute. The third MapReduce job then constructs the decision tree around the selected attributes.

Sample Code

//First MapReduce

//count the attribute for specific class label

Map Phase input: <k1, v1>

k1 - line number

v1 - record

Extract each attribute value and the class label from the record

k2 - attribute value with class label

v2 - 1

output(k2, v2)

Reduce Phase input: <k2, List<v2>>

//Count the number of occurrences of each attribute-class label combination

k3 - attribute value with class label

v3 - frequency

output(k3, v3)
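The first job can be sketched in plain Python as an in-memory map and reduce over a toy dataset. The attribute names and records below are illustrative, not from the original code; a real deployment would run the same logic as a Hadoop job.

```python
from collections import defaultdict

# Toy weather-style records: attribute values followed by the class label.
# These names and values are made up for illustration.
ATTRIBUTES = ["Outlook", "Windy"]
RECORDS = [
    ("Sunny", "False", "No"),
    ("Sunny", "True", "No"),
    ("Overcast", "False", "Yes"),
    ("Rainy", "False", "Yes"),
    ("Rainy", "True", "No"),
]

def map_phase(records):
    """Emit k2 = (attribute, value, class label), v2 = 1 for every record."""
    for record in records:
        *values, label = record
        for attribute, value in zip(ATTRIBUTES, values):
            yield (attribute, value, label), 1

def reduce_phase(pairs):
    """Sum the 1s per key: k3 = (attribute, value, class label), v3 = frequency."""
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return dict(counts)

counts = reduce_phase(map_phase(RECORDS))
print(counts[("Outlook", "Sunny", "No")])  # two "Sunny" records carry label "No"
```

The output dictionary plays the role of the <k3, v3> pairs consumed by the second job.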

//Second MapReduce

//Find the best splitting attribute (decision node)

//Calculate Gain Ratio for each attribute

Map Phase input: <k3, v3>

Compute Entropy(k3)

Compute InformationGain(k3)

Compute SplitInfo(k3)

k4-attribute

v4 - entropy, information gain, and split information

Output(k4,v4)

Reduce Phase input:<k4, List<v4>>

//Calculate Information Gain Ratio for each attribute

The attribute with the highest Gain Ratio is selected as the best splitting attribute (decision node)

k5-decision node

v5 - Information Gain Ratio

Output(k5,v5)
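The entropy, information gain, and split information steps above can be sketched as follows, starting from the (attribute, value, class label) frequencies that the first job produces. The counts are illustrative; the selection of the highest Gain Ratio mirrors the reduce phase.

```python
import math
from collections import defaultdict

# Frequencies as produced by the first job:
# (attribute, value, class label) -> count. Values are illustrative.
COUNTS = {
    ("Outlook", "Sunny", "No"): 2,
    ("Outlook", "Overcast", "Yes"): 1,
    ("Outlook", "Rainy", "Yes"): 1,
    ("Outlook", "Rainy", "No"): 1,
    ("Windy", "False", "Yes"): 2,
    ("Windy", "False", "No"): 1,
    ("Windy", "True", "No"): 2,
}

def entropy(class_counts):
    """Shannon entropy of a class-label distribution given as {label: count}."""
    total = sum(class_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in class_counts.values() if n)

def gain_ratio(counts, attribute):
    """Gain Ratio = information gain / split information for one attribute."""
    per_value = defaultdict(lambda: defaultdict(int))
    overall = defaultdict(int)
    for (attr, value, label), n in counts.items():
        if attr == attribute:
            per_value[value][label] += n
            overall[label] += n
    total = sum(overall.values())
    info_gain = entropy(overall)       # start from the dataset entropy
    split_info = 0.0
    for dist in per_value.values():
        weight = sum(dist.values()) / total
        info_gain -= weight * entropy(dist)   # subtract weighted child entropy
        split_info -= weight * math.log2(weight)
    return info_gain / split_info if split_info else 0.0

# The reduce phase keeps the attribute with the highest Gain Ratio.
best = max({attr for attr, _, _ in COUNTS}, key=lambda a: gain_ratio(COUNTS, a))
```

On this toy data "Windy" splits the labels more cleanly relative to its split information, so it wins the comparison.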

//Third MapReduce

//Tree construction

Map Phase input:<k5, v5>

//Read the records of the best splitting attribute

//Compute the node id for the attribute with the highest Gain Ratio

//Every node is represented by a list of attributes and their values.

k6 - node id

v6 - elements (attribute values)

This process is applied recursively to create non-leaf branches until all data is classified.
Output: <k6, v6>
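Putting the pieces together, the recursive construction can be sketched in plain Python. In the MapReduce version each tree level is one job, with the mapper keying records by node id and the reducer choosing the best split per node; the splitting logic below is the same. The dataset and attribute names are illustrative.

```python
import math
from collections import Counter, defaultdict

# Toy records: ({attribute: value}, class label). Names are illustrative.
RECORDS = [
    ({"Outlook": "Sunny", "Windy": "False"}, "No"),
    ({"Outlook": "Sunny", "Windy": "True"}, "No"),
    ({"Outlook": "Overcast", "Windy": "False"}, "Yes"),
    ({"Outlook": "Rainy", "Windy": "False"}, "Yes"),
    ({"Outlook": "Rainy", "Windy": "True"}, "No"),
]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(records, attr):
    """Gain Ratio of splitting these records on one attribute."""
    labels = [label for _, label in records]
    groups = defaultdict(list)
    for feats, label in records:
        groups[feats[attr]].append(label)
    gain, split_info = entropy(labels), 0.0
    for subset in groups.values():
        w = len(subset) / len(records)
        gain -= w * entropy(subset)
        split_info -= w * math.log2(w)
    return gain / split_info if split_info else 0.0

def build_tree(records, attrs):
    """Recursively split on the best attribute until every node is pure."""
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                       # pure node -> leaf
        return labels[0]
    if not attrs:                                   # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain_ratio(records, a))
    branches = {}
    for value in {feats[best] for feats, _ in records}:
        subset = [(f, l) for f, l in records if f[best] == value]
        branches[value] = build_tree(subset, [a for a in attrs if a != best])
    return {best: branches}

tree = build_tree(RECORDS, ["Outlook", "Windy"])
```

The result is a nested dictionary where each internal key is a decision node and each leaf is a class label, mirroring the node-id/attribute-value representation described above.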

Screenshots

Input - Weather Data

Output - Decision Tree