# Supervised Machine Learning with Python (Course VI)

July 13, 2020 2020-08-04 10:50

### Stochastic Gradient Descent (SGD) Classifier

Stochastic Gradient Descent (SGD) is an optimization algorithm used to find the values of the parameters (coefficients) of a function that minimize a cost function (objective function).

The algorithm is very similar to traditional (batch) Gradient Descent. However, at each step it calculates the derivative of the loss for a single randomly chosen data point rather than for all of the data points (hence the name, stochastic). This makes each update much cheaper than a Gradient Descent update, at the cost of noisier steps.
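To make the single-sample update concrete, here is a minimal sketch in plain Python. It is not part of the lesson's scikit-learn code: the function name `sgd_linear_fit`, the toy data, the learning rate, and the squared-error loss are all assumptions chosen for illustration.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, seed=0):
    """Fit y ≈ w*x + b with SGD on the squared error of one sample at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2, computed from this single sample only
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b


# Toy data generated from y = 2x + 1 (noise-free, for illustration only)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2 * x + 1 for x in xs]

w, b = sgd_linear_fit(xs, ys)
print(round(w, 2), round(b, 2))
```

Batch Gradient Descent would instead average the gradient over all five samples before making one update; here every sample triggers its own, cheaper update.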

Stochastic Gradient Descent is a popular algorithm for training a wide range of models in Machine Learning, including (linear) support vector machines, logistic regression, and graphical models. When combined with the backpropagation algorithm, it is the *de facto* standard algorithm for training artificial neural networks. Recently, SGD has been applied to large-scale and sparse machine learning problems often encountered in text classification and Natural Language Processing.

**Stochastic Gradient Descent Classifier in Python**

Now that we know the basic idea of the SGD Classifier, we will discuss a step-wise Python implementation of the algorithm.

###### 1. **Importing the libraries**

Before we begin to build a model, let us import some essential Python libraries for mathematical calculations, data loading, preprocessing, and model development and prediction.

```python
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn modules
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# For plotting the classification results
from mlxtend.plotting import plot_decision_regions
```

###### 2. **Importing the dataset**

For this problem, we will be loading the Breast Cancer dataset from scikit-learn. The dataset consists of data related to breast cancer patients and their diagnosis (malignant or benign).

```python
# Importing the dataset
dataset = load_breast_cancer()

# Converting to pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)
df.head()
```

```python
print("Total samples in our dataset is: {}".format(df.shape[0]))
```

```
Total samples in our dataset is: 569
```

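```python
# Summary statistics of the DataFrame (the object returned by
# load_breast_cancer() itself has no describe() method)
df.describe()
```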

###### 3. **Separating the features and target variable**

After loading the dataset, the independent variables ($x$) and the dependent variable ($y$) need to be separated. Our goal is to learn the relationship between the features and the target variable in the above dataset.

For this implementation example, we will only be using the ‘mean perimeter’ and ‘mean texture’ features but you can certainly use all of them.

```python
# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target Variable
y = df['target']
```

###### 4. **Splitting the data set into training and test set**

After separating the independent variables ($x$) and the dependent variable ($y$), we split them into training and test sets so that the model can be trained on one part and evaluated on data it has not seen. We use scikit-learn's `train_test_split()` function with `test_size = 0.20`: twenty percent of the available data becomes the test set and the remaining eighty percent the training set.
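As a quick check of what an 80-20 split produces on this dataset, the sketch below prints the resulting shapes. It uses all 30 features rather than the two selected in step 3, and the `stratify=y` argument is an addition that the lesson's own call does not use; it keeps the class proportions roughly equal in both subsets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y is an assumption added for illustration; the lesson's call omits it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=25, stratify=y
)

# 569 samples in total: 114 go to the test set, 455 remain for training
print(X_train.shape, X_test.shape)

# Fraction of class-1 samples in each subset (close to each other when stratified)
print(y_train.mean(), y_test.mean())
```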

```python
# Splitting the dataset into the training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25)
```

###### 5. **Fitting the model to the training set**

After splitting the data into training and test sets, the SGD Classifier model is fitted to the training data using the `SGDClassifier` class from scikit-learn.

```python
# Fitting SGD Classifier to the Training set
model = SGDClassifier(loss="hinge", alpha=0.01, max_iter=200)
model.fit(x_train, y_train)
```

```
SGDClassifier(alpha=0.01, max_iter=200)
```

###### 6. **Predicting the test results**
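With `loss="hinge"`, `SGDClassifier` fits a linear support vector machine. The following is a minimal sketch of the hinge loss for a single sample, assuming labels encoded as -1 and +1; the helper name `hinge_loss` and the example scores are hypothetical and only illustrate the shape of the loss.

```python
def hinge_loss(y_true, score):
    """Hinge loss for one sample; y_true is -1 or +1, score is w·x + b."""
    return max(0.0, 1.0 - y_true * score)


loss_confident = hinge_loss(+1, 2.5)   # correct and outside the margin: zero loss
loss_in_margin = hinge_loss(+1, 0.3)   # correct but inside the margin: small loss
loss_wrong = hinge_loss(-1, 0.3)       # wrong side of the boundary: larger loss

print(loss_confident, loss_in_margin, loss_wrong)
```

The loss is zero only when a sample is on the correct side of the boundary with a margin of at least 1, so SGD updates are driven by misclassified samples and by samples that lie close to the boundary.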

Finally, the fitted model is used to predict the labels of the test set.

```python
# Predicting the results
y_pred = model.predict(x_test)
```

###### 7. **Evaluating the model**
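`StandardScaler` is imported in step 1 but never applied, and SGD-based models are generally sensitive to the scale of the input features. The sketch below is one possible variant of steps 5 and 6 that standardizes the two features inside a pipeline before fitting. The names `scaled_model` and `y_pred_scaled` and the `random_state=0` argument are assumptions added for illustration; no claim is made here about the accuracy this variant reaches.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
names = list(data.feature_names)

# Same two features as in the lesson, selected by column index
cols = [names.index("mean perimeter"), names.index("mean texture")]
x = data.data[:, cols]
y = data.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=25
)

# The scaler is fitted on the training data only, then reused for the test data
scaled_model = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", alpha=0.01, max_iter=200, random_state=0),
)
scaled_model.fit(x_train, y_train)

y_pred_scaled = scaled_model.predict(x_test)
print(y_pred_scaled[:10])
```

Wrapping the scaler and the classifier in one pipeline means `predict()` applies the same scaling automatically, so the test set never influences the scaling statistics.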

Let us now evaluate the model using a confusion matrix and calculate its classification accuracy. The confusion matrix counts, for each actual class, how many samples were assigned to each predicted class. Other metrics such as precision, recall, and F1 score are given by scikit-learn's `classification_report()` function.

**Precision** is the ratio of correctly predicted positive observations to the total predicted positive observations; it measures how reliable the model's positive predictions are. **Recall** is the ratio of correctly predicted positive observations to all observations in the actual class. **F1 Score** is the harmonic mean of Precision and Recall and is often used as a metric in place of accuracy for imbalanced datasets.

```python
# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('SGD Classifier Accuracy of the model: {:.2f}%'.format(accuracy*100))
```

```
Confusion Matrix
[[22 17]
 [ 0 75]]

Classification Report
              precision    recall  f1-score   support

           0       1.00      0.56      0.72        39
           1       0.82      1.00      0.90        75

    accuracy                           0.85       114
   macro avg       0.91      0.78      0.81       114
weighted avg       0.88      0.85      0.84       114

SGD Classifier Accuracy of the model: 85.09%
```

The model reaches an accuracy of *85.09%* on the test set. The confusion matrix shows where the errors are: all 75 class-1 samples are classified correctly, but 17 of the 39 class-0 samples (malignant, in scikit-learn's encoding of this dataset) are predicted as class 1, which is why the recall for class 0 is only 0.56.
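The figures in the classification report can be recomputed by hand from the confusion matrix printed in this step. The short sketch below does this for class 1 as the positive class; the variable names are chosen for illustration.

```python
# Confusion matrix from this step's output: rows = actual, columns = predicted
tn, fp = 22, 17   # actual class 0
fn, tp = 0, 75    # actual class 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
# prints: precision=0.82 recall=1.00 f1=0.90 accuracy=0.8509
```

These values match the row for class 1 in the report and the overall accuracy of 85.09%.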

###### 8. **Plotting the decision boundary**

We will now plot the decision boundary of the model on test data.

```python
# Plotting the decision boundary
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using SGD Classifier (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture")
```

The plot shows the region that the SGD classifier assigns to each of the two classes, with the test samples overlaid.

**Putting it all together**

The final code for the implementation of **SGD Classification in Python** is as follows.

```python
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn modules
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Plotting the classification results
from mlxtend.plotting import plot_decision_regions

# Importing the dataset
dataset = load_breast_cancer()

# Converting to pandas dataframe
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)
print("Total samples in our dataset is: {}".format(df.shape[0]))

# Describe the dataset
df.describe()

# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target variable
y = df['target']

# Splitting the dataset into the training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25)

# Fitting SGD Classifier to the Training set
model = SGDClassifier(loss="hinge", alpha=0.01, max_iter=200)
model.fit(x_train, y_train)

# Predicting the results
y_pred = model.predict(x_test)

# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('SGD Classifier Accuracy of the model: {:.2f}%'.format(accuracy*100))

# Plotting the decision boundary
plt.figure(figsize=(10,6))
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using SGD Classifier (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture")
```

In this lesson, we discussed the concept of Stochastic Gradient Descent Classifier along with its implementation in Python.

This marks the end of our course on Supervised Machine Learning with Python.