
Supervised Machine Learning with Python (Course VI)

Decision Tree Classifier

In the previous lessons, we discussed classification algorithms that classify data using a specific set of rules and functions that model the data distribution. In this lesson, we will implement classification using one of the most widely used algorithms in machine learning: the decision tree.

What is a decision tree?

We have briefly discussed the concept of decision trees in the Decision Tree Regression lesson in the previous section of this course. It is one of the most frequently used machine learning algorithms for solving regression and classification problems. The algorithm is based on a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is a way to display an algorithm in terms of conditional control statements.

A decision tree is a flowchart-like structure that can be described using the following terminology:

  • Root Node: This represents the entire set of data points, which can be further divided into subsets.
  • Splitting: Dividing a node into two or more sub-nodes.
  • Decision Node: A sub-node that splits into further sub-nodes based on certain conditions is called a decision node.
  • Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes. These nodes hold the final result of the tree.
  • Pruning: Removing sub-nodes of a decision node is called pruning. Pruning is often done in decision trees to prevent overfitting.
  • Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
  • Parent and Child Node: A node that is divided into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are its children. For example, a decision node is the parent of the terminal nodes it splits into.

Let us consider the problem of predicting the diagnosis of breast cancer. The prediction depends on a set of characteristics of patients that determine the criticality of the disease such as the mean perimeter, area, texture, radius, etc. The process starts at a root node and is followed by a branched tree that finally leads to a leaf node (Terminal node) that contains the prediction or the final outcome of the algorithm.
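
To make the path from root node to leaf node concrete, the sketch below fits a very shallow tree on the two features used later in this lesson and prints it as text with scikit-learn's export_text(). The depth limit of 2 and the variable names are assumptions chosen only to keep the printout small; this is not the model trained in the steps that follow.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# Look the two columns up by name rather than by position
names = list(data.feature_names)
feature_names = ["mean perimeter", "mean texture"]
cols = [names.index(name) for name in feature_names]
X, y = data.data[:, cols], data.target

# max_depth=2 is an illustrative choice so the whole tree fits on screen
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The first condition printed is the root node, indented conditions are
# decision nodes, and every "class: ..." line is a leaf node
text = export_text(tree, feature_names=feature_names)
print(text)
```

Reading the printout from top to bottom follows one branch at a time: each indented level is one split deeper, and the predicted class appears only at the leaves.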

Construction of decision trees usually works top-down, by choosing at each step the variable that best splits the set of items. Each sub-tree of the decision tree model can be represented as a binary tree in which a decision node splits into two nodes based on a condition. Decision tree classifiers have a target variable with a discrete set of values, and each terminal node represents a predicted class.
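
One common way to measure how well a split separates the classes is the Gini impurity, which is also the criterion passed to the classifier later in this lesson. The sketch below is a simplified, pure-Python illustration of searching for the best threshold on a single feature; the helper names gini() and best_threshold() and the toy values are hypothetical, and scikit-learn's own implementation is considerably more optimized.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Try every midpoint between sorted neighbours and return the
    (threshold, weighted impurity) pair with the lowest impurity."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical values
        threshold = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy, made-up values: small perimeters labelled 1, large perimeters labelled 0
perimeter = [60.0, 70.0, 80.0, 110.0, 120.0, 130.0]
label = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_threshold(perimeter, label)
print(threshold, impurity)  # 95.0 0.0
```

A tree builder repeats this search over every feature, keeps the best split, and then applies the same procedure to each of the two resulting subsets.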

Decision Tree Classifier in Python

Now that we know the basic idea of decision trees, let us walk through a step-by-step Python implementation of the algorithm.

1. Importing necessary libraries

Before we begin to build the model, let us import some essential Python libraries for mathematical calculations, data loading, preprocessing, and model development and prediction.

# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn modules
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# For plotting the classification results
from mlxtend.plotting import plot_decision_regions
2. Importing the dataset

For this problem, we will be loading the Breast Cancer dataset from scikit-learn. The dataset consists of data related to breast cancer patients and their diagnosis (malignant or benign).

# Importing the dataset
dataset = load_breast_cancer() 

# Converting to pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)
df.head()
[Output: first five rows of the breast cancer DataFrame]
print("Total samples in our dataset is: {}".format(df.shape[0]))
Total samples in our dataset is: 569
df.describe()
[Output: summary statistics of the breast cancer DataFrame]
3. Separating the features and target variable

After loading the data set, the independent variable ($x$) and the dependent variable ($y$) need to be separated. Our concern is to find the relationships between the features and the target variable from the above dataset.

For this implementation example, we will only be using the ‘mean perimeter’ and ‘mean texture’ features but you can certainly use all of them.

# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target Variable
y = df['target']
4. Splitting the dataset into training and test set

After separating the independent variables ($x$) and the dependent variable ($y$), the data is split into training and test sets so that the model can be trained on one part and evaluated on the other. We use the train_test_split() function of scikit-learn to make an 80-20 split: twenty percent of the available data becomes the test set and the remaining eighty percent the training set.

# Splitting the dataset into the training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25 )
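
As a quick sanity check on the split, the sketch below prints the sizes of the two sets and the share of class 1 in each. It also passes stratify=y, an optional argument that the split above does not use, which asks train_test_split() to keep the class proportions similar in both sets. The capitalised variable names are chosen here so they do not clash with the lesson's variables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same 80-20 split and seed as in the lesson, plus the optional stratify argument
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=25, stratify=y
)

# 569 samples in total: 455 for training and 114 for testing
print(len(X_train), len(X_test))

# Share of class 1 in each set; stratification keeps these close to each other
print(y_train.mean(), y_test.mean())
```

Whether to stratify is a judgment call; it matters most when one class is much rarer than the other.
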
5. Fitting the model to the training set

After splitting the data into training and test sets, the Decision Tree Classifier model is fitted to the training data using the DecisionTreeClassifier class from scikit-learn.

# Fitting Decision Tree Classifier to the Training set
model = DecisionTreeClassifier(criterion = 'gini', random_state = 0)
model.fit(x_train, y_train)
DecisionTreeClassifier(random_state=0)
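
The model above is grown without a depth limit, which is the default behaviour of DecisionTreeClassifier. As noted in the terminology section, limiting or pruning a tree is a common guard against overfitting. The sketch below compares the unrestricted tree with one capped at max_depth=3; the cap of 3 is an arbitrary value picked for illustration, not a tuned setting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean perimeter"), names.index("mean texture")]
X, y = data.data[:, cols], data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=25
)

# Unrestricted tree (as in the lesson) versus a depth-limited one
full = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(
    criterion="gini", max_depth=3, random_state=0
).fit(X_train, y_train)

for name, clf in [("full", full), ("max_depth=3", shallow)]:
    print(
        name,
        "depth:", clf.get_depth(),
        "leaves:", clf.get_n_leaves(),
        "train acc:", round(clf.score(X_train, y_train), 3),
        "test acc:", round(clf.score(X_test, y_test), 3),
    )
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting; comparing the two rows shows whether the simpler tree generalizes better on this particular split.
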
6. Predicting the test results

Finally, the trained model is used to predict the classes of the test set.

# Predicting the results
y_pred = model.predict(x_test)
7. Evaluating the model

Let us now evaluate the model using a confusion matrix and calculate its classification accuracy. The confusion matrix counts, for each actual class, how many samples were assigned to each predicted class. Other metrics such as precision, recall, and F1-score are given by the classification_report() function of scikit-learn.

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations; it tells us how many of the positive predictions were right. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. The F1 score is the harmonic mean of precision and recall and is often used in place of accuracy for imbalanced datasets.

# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Decision Tree Classification Accuracy of the model: {:.2f}%'.format(accuracy*100))
Confusion Matrix 
[[28 11] 
[ 8 67]]

Classification Report 
             precision    recall    f1-score    support 
           0      0.78      0.72        0.75         39 
           1      0.86      0.89        0.88         75 
    accuracy                            0.83        114 
   macro avg      0.82      0.81        0.81        114  
weighted avg      0.83      0.83        0.83        114

Decision Tree Classification Accuracy of the model: 83.33%

Hence, the model classifies 83.33% of the test samples correctly (95 of 114) while using only two of the available features.
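
The figures in the classification report can be checked by hand from the confusion matrix printed above. The sketch below treats class 1 as the positive class and recomputes its precision, recall, and F1 score, along with the overall accuracy, using plain arithmetic. The variable names are chosen here for illustration.

```python
# Confusion matrix from the output above:
# rows are actual classes (0, 1), columns are predicted classes (0, 1)
matrix = [[28, 11],
          [8, 67]]

tn, fp = matrix[0]  # actual class 0: predicted 0, predicted 1
fn, tp = matrix[1]  # actual class 1: predicted 0, predicted 1

precision = tp / (tp + fp)  # 67 / 78
recall = tp / (tp + fn)     # 67 / 75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 95 / 114

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
# precision=0.86 recall=0.89 f1=0.88 accuracy=0.8333
```

These values match the row for class 1 and the accuracy line of the report.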

8. Plotting the decision boundary

We will now plot the decision boundary of the model on test data.

# Plotting the decision boundary
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using Decision Tree Classification (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture")
[Figure: Decision boundary using Decision Tree Classification]

The plot shows how the decision tree separates the two classes. Because each split tests a single feature against a threshold, the decision regions are made up of axis-aligned rectangles.

Putting it all together

The final code for the implementation of Decision Tree Classification in Python is as follows.

# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn modules
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Plotting the classification results
from mlxtend.plotting import plot_decision_regions

# Importing the dataset
dataset = load_breast_cancer() 

# Converting to pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)

print("Total samples in our dataset is: {}".format(df.shape[0]))

# Describe the dataset
df.describe()

# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target variable
y = df['target']

# Splitting the dataset into the training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25 )

# Fitting Decision Tree Classifier to the Training set
model = DecisionTreeClassifier(criterion = 'gini', random_state = 0)
model.fit(x_train, y_train)

# Predicting the results
y_pred = model.predict(x_test)

# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Decision Tree Classification Accuracy of the model: {:.2f}%'.format(accuracy*100))

# Plotting the decision boundary
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using Decision Tree Classification (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture")

In this lesson, we discussed the Decision Tree Classifier along with its implementation in Python. In the next lesson, we will discuss the Random Forest Classifier, which is built upon the concept of decision trees.
