# Supervised Machine Learning with Python (Course VI)

July 13, 2020 2020-08-04 10:50

### Decision Tree Regression

So far, we have discussed regression algorithms that adjust the parameters of a function (its weights and bias) to fit the distribution of the data.

In this lesson, you will be introduced to a different kind of Machine Learning algorithm, called a decision tree.

#### What is a Decision Tree?

A decision tree is one of the most frequently used Machine Learning algorithms for solving **regression** as well as **classification** problems. As the name suggests, the algorithm uses a tree-like model of decisions to either predict the target value (regression) or predict the target class (classification). Before diving into how decision trees work, let us first get familiar with the basic terminology of a decision tree:

- **Root Node:** The topmost node of the tree, which represents the entire set of data points.
- **Splitting:** Dividing a node into two or more sub-nodes.
- **Decision Node:** A node that is further split into sub-nodes.
- **Leaf / Terminal Node:** A node that does not split. These nodes are often the final result of the tree.
- **Branch / Sub-Tree:** A subsection of the entire tree.
- **Parent and Child Node:** A node that is divided into sub-nodes is the parent of those sub-nodes, and the sub-nodes are its children. In the figure above, the decision node is the parent of the terminal nodes (children).
- **Pruning:** Removing the sub-nodes of a decision node. Pruning is often done in decision trees to prevent overfitting.
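To connect these terms to something concrete, here is a minimal sketch that fits a one-split tree with scikit-learn and inspects its nodes. The toy data is made up for illustration, and the variable names (`stump`, `is_leaf`) are not part of the lesson's code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (made up for illustration): one feature, two clearly separated groups
X_toy = np.array([[1.0], [2.0], [10.0], [11.0]])
y_toy = np.array([5.0, 5.0, 20.0, 20.0])

# max_depth=1 allows exactly one split: a root node with two leaf children
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_toy, y_toy)

tree = stump.tree_
is_leaf = tree.children_left == -1  # scikit-learn marks a leaf with child id -1

print("total nodes:", tree.node_count)         # root + its children
print("leaf nodes:", stump.get_n_leaves())     # nodes that do not split
print("depth:", stump.get_depth())             # number of splits from root to deepest leaf
print("root threshold:", tree.threshold[0])    # condition tested at the root node
```

Here the root node is also the only decision node, and it is the parent of the two leaf nodes.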

#### How does a Decision Tree work?

Splitting starts at the root node and continues down the branches of the tree until it reaches a leaf node (terminal node), which contains the prediction, i.e., the final outcome of the algorithm. Decision trees are usually constructed top-down, by choosing at each step the variable that best splits the set of items. Each sub-tree of the decision tree model can be represented as a binary tree in which a decision node splits into two nodes based on a condition.
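What "best splits" means depends on the chosen criterion. As a rough sketch, assuming a squared-error criterion and a single numeric feature, the search for one split could look like the following. The helper `best_split` and the toy arrays are hypothetical and only illustrate the idea; they are not how scikit-learn implements it internally.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_mse) of the best binary split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical feature values
        left, right = y_sorted[:i], y_sorted[i:]
        # Weighted average of each side's variance (the MSE around that side's mean)
        score = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if score < best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2  # midpoint between neighbours
            best_score = score
    return best_threshold, best_score

# Toy data (made up): low targets for small x, high targets for large x
x_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_toy = np.array([5.0, 6.0, 5.0, 20.0, 21.0, 19.0])

threshold, score = best_split(x_toy, y_toy)
print("best threshold:", threshold)
print("weighted MSE after split:", score, "vs. before:", y_toy.var())
```

A full tree repeats this search over every feature at every node, then recurses into the two resulting child nodes.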

Decision trees in which the target variable (the value at a terminal node) can take continuous values, typically real numbers, are called **regression trees**, which are discussed in this lesson. If the target variable can take a discrete set of values, the trees are called **classification trees**.
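The difference shows up directly in what each kind of tree predicts. The sketch below uses made-up toy data and scikit-learn's two tree estimators; with a single split, the regression tree returns the mean target of each leaf, while the classification tree returns a class label.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data (made up for illustration)
X_toy = np.array([[1.0], [2.0], [10.0], [11.0]])
y_continuous = np.array([1.0, 2.0, 8.0, 9.0])          # real-valued target
y_classes = np.array(["low", "low", "high", "high"])   # discrete target

reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_toy, y_continuous)
clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_classes)

X_new = np.array([[1.5], [10.5]])
reg_pred = reg.predict(X_new)  # mean of the training targets in the matching leaf
clf_pred = clf.predict(X_new)  # majority class in the matching leaf

print("regression tree:", reg_pred)
print("classification tree:", clf_pred)
```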

**Decision Tree Regression in Python**

We will now go through a step-wise Python implementation of the Decision Tree Regression algorithm that we just discussed.

**1. Importing necessary libraries**


First, let us import some essential Python libraries.

    # Importing the libraries
    import numpy as np  # for array operations
    import pandas as pd  # for working with DataFrames
    import requests, io  # for HTTP requests and I/O commands
    import matplotlib.pyplot as plt  # for data visualization
    %matplotlib inline

    # scikit-learn modules
    from sklearn.model_selection import train_test_split  # for splitting the data
    from sklearn.metrics import mean_squared_error  # for calculating the cost function
    from sklearn.tree import DecisionTreeRegressor  # for building the model

**2. Importing the dataset**

For this problem, we will load a CSV dataset through an HTTP request (you can also download the dataset directly from the same URL).

The dataset consists of data related to petrol consumption (in millions of gallons) for 48 US states. This value depends on several features such as the petrol tax (in cents), average income (in dollars), paved highways (in miles), and the proportion of the population with a driver’s license. We will load the dataset using the read_csv() function from the pandas module and store it as a pandas DataFrame object.

    # Importing the dataset from the url of the dataset
    url = "https://drive.google.com/u/0/uc?id=1mVmGNx6cbfvRHC_DvF12ZL3wGLSHD9f_&export=download"
    data = requests.get(url).content

    # Reading the data
    dataset = pd.read_csv(io.StringIO(data.decode('utf-8')))
    dataset.head()
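The `io.StringIO(data.decode('utf-8'))` step may look unusual: the HTTP response body arrives as raw bytes, so it is decoded to text and wrapped in a file-like object that `read_csv()` can read. The sketch below shows the same pattern without any network access. The bytes are a made-up placeholder standing in for the downloaded content, and the column names are hypothetical, not those of the petrol dataset.

```python
import io
import pandas as pd

# Placeholder for `requests.get(url).content`: raw bytes of a tiny made-up CSV
raw_bytes = b"feature_a,feature_b,target\n1.0,10,100\n2.0,20,200\n"

text = raw_bytes.decode("utf-8")            # bytes -> str
demo_df = pd.read_csv(io.StringIO(text))    # str -> file-like object -> DataFrame

print(demo_df.shape)
print(list(demo_df.columns))
```

For a real download, it may also be worth keeping the response object and calling its `raise_for_status()` method before parsing, so that a failed request is not silently parsed as CSV.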

**3. Separating the features and the target variable**

After loading the dataset, the independent variables (x) and the dependent variable (y) need to be separated. Our concern is to model the relationship between the features (Petrol_tax, Average_income, etc.) and the target variable (Petrol_Consumption) in the dataset.

    x = dataset.drop('Petrol_Consumption', axis = 1)  # Features
    y = dataset['Petrol_Consumption']  # Target

**4. Splitting the data into a train set and a test set**

We use the train_test_split() function of scikit-learn to split the data into a train set and a test set. We will use 20% of the available data as the testing set and the remaining data as the training set.

    # Splitting the dataset into training and testing set (80/20)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)
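To see what an 80/20 split means for a dataset of 48 rows, the following sketch applies the same call to placeholder arrays of the same shape (48 rows, 4 feature columns). The arrays hold dummy integers rather than the petrol data, and the `*_demo` names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins with the same shape as the lesson's data: 48 rows, 4 features
x_demo = np.arange(48 * 4).reshape(48, 4)
y_demo = np.arange(48)

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=28)
print("train:", x_tr.shape, "test:", x_te.shape)

# Fixing random_state makes the shuffle reproducible: the same rows land in the test set
_, x_te_again, _, _ = train_test_split(x_demo, y_demo, test_size=0.2, random_state=28)
print("same split again:", np.array_equal(x_te, x_te_again))
```

Since 20% of 48 is not a whole number, scikit-learn rounds the test set up, so 10 rows are held out and 38 remain for training.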

**5. Fitting the model to the training dataset**

After splitting the data, let us initialize a Decision Tree Regressor model and fit it to the training data. This is done with the DecisionTreeRegressor class of scikit-learn.

    # Initializing the Decision Tree Regression model
    model = DecisionTreeRegressor(random_state = 0)

    # Fitting the Decision Tree Regression model to the data
    model.fit(x_train, y_train)

    DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                          max_leaf_nodes=None, min_impurity_decrease=0.0,
                          min_impurity_split=None, min_samples_leaf=1,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          presort=False, random_state=0, splitter='best')
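The defaults shown above (max_depth=None, min_samples_leaf=1) place no limit on how far the tree grows, which is why an unrestricted tree can memorize its training data. The sketch below illustrates this on synthetic data (a noisy sine curve generated here purely for demonstration, not the petrol dataset) and compares it with a tree limited by `max_depth`, one simple way of keeping a tree small.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (made up): a noisy sine curve with 40 distinct x values
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(40, 1))
y_demo = np.sin(x_demo[:, 0]) + rng.normal(0, 0.3, size=40)

full_tree = DecisionTreeRegressor(random_state=0).fit(x_demo, y_demo)
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x_demo, y_demo)

def train_rmse(fitted_model):
    """RMSE of a fitted model on the data it was trained on."""
    return float(np.sqrt(mean_squared_error(y_demo, fitted_model.predict(x_demo))))

print("unrestricted:", full_tree.get_n_leaves(), "leaves, train RMSE", train_rmse(full_tree))
print("max_depth=2: ", shallow_tree.get_n_leaves(), "leaves, train RMSE", train_rmse(shallow_tree))
```

A training error of zero says nothing about performance on unseen data, which is why the next step evaluates the model on the test set.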

**6. Calculating the loss after training**

Let us now calculate the loss between the actual target values in the testing set and the values predicted by the model, using a cost function called the Root Mean Square Error (RMSE):

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where,

$y_i$ is the actual target value,

$\hat{y}_i$ is the predicted target value, and

$n$ is the total number of data points.

The RMSE of a model measures the absolute fit of the model to the data. In other words, it indicates how close the actual data points are to the model’s predicted values, expressed in the same units as the target variable. A lower RMSE indicates a better fit, which makes it a useful measure of the accuracy of the model’s predictions.

    # Predicting the target values of the test set
    y_pred = model.predict(x_test)

    # RMSE (Root Mean Square Error)
    rmse = float(format(np.sqrt(mean_squared_error(y_test, y_pred)), '.3f'))
    print("\nRMSE: ", rmse)

RMSE: 133.351
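To make the RMSE calculation concrete, here is a small worked example that computes it step by step and checks the result against scikit-learn. The three target values and predictions are made up for illustration and are unrelated to the petrol dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values
y_actual = np.array([3.0, 5.0, 7.0])
y_predicted = np.array([2.0, 5.0, 9.0])

squared_errors = (y_actual - y_predicted) ** 2   # per-point squared differences: 1, 0, 4
rmse_manual = np.sqrt(squared_errors.mean())     # square root of the mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_actual, y_predicted))

print("manual: ", rmse_manual)
print("sklearn:", rmse_sklearn)
```

Both routes take the square root of the same mean squared error, here 5/3, so they agree.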

**7. Visualizing the decision tree**

After training the model, we can also view its tree structure using a tool called **WebGraphviz**. The code below saves the tree to a ‘tree_structure.dot’ file in the local working directory; pasting the contents of that file into the input area of the **WebGraphviz** tool generates a visualization of our decision tree.

    from sklearn.tree import export_graphviz

    # export the decision tree model to a tree_structure.dot file
    # paste the contents of the file to webgraphviz.com
    export_graphviz(model, out_file ='tree_structure.dot',
                    feature_names =['Petrol_tax', 'Average_income', 'Paved_Highways', 'Population_Driver_licence(%)'])

This is a sample sub-branch of what our Decision Tree looks like.
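If an external tool is not convenient, scikit-learn also offers `export_text`, which prints the tree's rules as plain text. This is an alternative to the WebGraphviz route described in this lesson, not part of it. The sketch below uses a tiny tree fitted on made-up data, with a hypothetical feature name.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data (made up for illustration)
x_demo = np.array([[1.0], [2.0], [10.0], [11.0]])
y_demo = np.array([5.0, 5.0, 20.0, 20.0])

demo_model = DecisionTreeRegressor(max_depth=1, random_state=0).fit(x_demo, y_demo)

# One line per node: split conditions for decision nodes, predicted values for leaves
report = export_text(demo_model, feature_names=["feature_a"])
print(report)
```

For the model trained in this lesson, the same call would take `model` and the list of four feature names used above.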

**Putting it all together**

The final code for the implementation of **Decision Tree Regression in Python** is as follows.

    # Importing the libraries
    import numpy as np  # for array operations
    import pandas as pd  # for working with DataFrames
    import requests, io  # for HTTP requests and I/O commands
    import matplotlib.pyplot as plt  # for data visualization
    %matplotlib inline

    # scikit-learn modules
    from sklearn.model_selection import train_test_split  # for splitting the data
    from sklearn.metrics import mean_squared_error  # for calculating the cost function
    from sklearn.tree import DecisionTreeRegressor  # for building the model

    # Importing the dataset from the url of the data set
    url = "https://drive.google.com/u/0/uc?id=1mVmGNx6cbfvRHC_DvF12ZL3wGLSHD9f_&export=download"
    data = requests.get(url).content

    # Reading the data
    dataset = pd.read_csv(io.StringIO(data.decode('utf-8')))
    dataset.head()

    x = dataset.drop('Petrol_Consumption', axis = 1)  # Features
    y = dataset['Petrol_Consumption']  # Target

    # Splitting the dataset into training and testing set (80/20)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)

    # Initializing the Decision Tree Regression model
    model = DecisionTreeRegressor(random_state = 0)

    # Fitting the Decision Tree Regression model to the data
    model.fit(x_train, y_train)

    # Predicting the target values of the test set
    y_pred = model.predict(x_test)

    # RMSE (Root Mean Square Error)
    rmse = float(format(np.sqrt(mean_squared_error(y_test, y_pred)), '.3f'))
    print("\nRMSE:", rmse)

    # Visualizing the decision tree structure
    from sklearn.tree import export_graphviz

    # export the decision tree model to a tree_structure.dot file
    # paste the contents of the file to webgraphviz.com
    export_graphviz(model, out_file ='tree_structure.dot',
                    feature_names = ['Petrol_tax', 'Average_income', 'Paved_Highways', 'Population_Driver_licence(%)'])

In this lesson, we discussed how decision tree regression works, along with its implementation in Python.

Decision trees, however, tend to overfit and can generalize poorly. In the next lesson, we will discuss the Random Forest Regression algorithm, which uses a collection of decision trees and typically performs better than a single decision tree.