Supervised Machine Learning with Python (Course VI)

Linear Regression with multiple variables

In the previous lesson, we learnt about Simple Linear Regression, where we modelled the relationship between a target variable and a single independent variable.

In practice, however, most regression problems have more than one independent variable that influences the value of the dependent variable. In this lesson, we will discuss how to solve such problems using Multiple Linear Regression.

What is Multiple Linear Regression?

Multiple Linear Regression is a linear regression algorithm used on datasets containing a single dependent variable and multiple independent variables.

If (x_1, x_2, x_3, \dots, x_n) is the set of independent variables, the Multiple Linear Regression algorithm models the value of the dependent variable (y) as,

    \[y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b\]

where,
y is the dependent variable,
x_1, x_2, \dots, x_n are the independent variables,
w_1, w_2, \dots, w_n are the weights,
b is the bias or the intercept, and
n is the number of independent variables (a positive integer).
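
To make the equation concrete, here is a minimal sketch that evaluates it with NumPy for a single sample. The weights, the bias and the feature values are made-up numbers chosen purely for illustration.

```python
import numpy as np

# Hypothetical values for a model with n = 3 independent variables
w = np.array([2.0, -1.0, 0.5])  # weights w_1, w_2, w_3
b = 4.0                         # bias (intercept)
x = np.array([3.0, 1.0, 2.0])   # one sample: x_1, x_2, x_3

# y = w_1*x_1 + w_2*x_2 + w_3*x_3 + b, written as a dot product
y = np.dot(w, x) + b
print(y)  # 2*3 - 1*1 + 0.5*2 + 4 = 10.0
```

Writing the sum as a dot product keeps the code the same no matter how many independent variables there are.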

For example, consider the problem of pricing a house. The selling price of a house can depend on a wide range of factors such as its location, area covered by the house, the number of rooms, the year it was built, etc.

House No.  Location          Area (in sq. feet)  Number of Rooms  Built Year  Price (in USD)
1          Kathmandu, Nepal  100,000             5                2018        300,000
2          Bhaktapur, Nepal  80,000              4                2018        250,000
3          New York, USA     50,000              3                2019        100,000
4          Kathmandu, Nepal  120,000             6                2020        ?

To predict the value of the dependent variable in such problems, we need to consider more than one variable. This is a typical problem where Multiple Linear Regression is used.

(Note: The training process of a Multiple Linear Regression model is the same as a Simple Linear Regression model.)
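
As a rough illustration of that note, the sketch below fits two weights and a bias by ordinary least squares using NumPy. The data is synthetic and noise-free, and the choice of np.linalg.lstsq is an assumption made for this example. It is not necessarily the training method used in the previous lesson.

```python
import numpy as np

# Synthetic data generated from known (hypothetical) weights and bias
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_w = np.array([3.0, -2.0])
true_b = 1.5
y = X @ true_w + true_b

# Append a column of ones so that the bias is learned as one extra weight
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution of X_aug @ [w_1, w_2, b] = y
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = solution[:2], solution[2]

print("Estimated weights:", w_hat)
print("Estimated bias:", b_hat)
```

The only change from the single-variable case is that there are more columns in X, which is why the same fitting procedure carries over.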

Multiple Linear Regression in Python

We have already discussed the concept of Multiple Linear Regression, and its application. We will now go through a step-wise Python implementation of the algorithm.

1. Importing necessary libraries

First, let us import some essential Python libraries.

# Importing necessary libraries
import numpy as np # for array operations
import matplotlib.pyplot as plt # for visualizing data
# Jupyter/IPython magic for inline plots; remove this line in a plain Python script
%matplotlib inline

# scikit-learn for model building and validation
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston # for loading the dataset
from sklearn.model_selection import train_test_split # for splitting the data
from sklearn.linear_model import LinearRegression # for building the model
from sklearn.metrics import mean_squared_error # for calculating the cost function

2. Importing the dataset

For this implementation example, we will be importing a sample dataset from scikit-learn, called the Boston housing prices dataset.

# Loading the dataset
dataset = load_boston()

# Getting the features (x) and target (y)
x = dataset.data
y = dataset.target

print("Total number of samples in the dataset: {}".format(x.shape[0]))
Total number of samples in the dataset: 506

We can have a proper look at the data by converting it into a pandas DataFrame and using the head() method to display the first five rows.

# Importing pandas for working with DataFrames
import pandas as pd

# Creating a pandas DataFrame from the loaded dataset
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['TARGET'] = dataset.target

# Printing the first five rows of the DataFrame
df.head()

As we can see, there are 13 features (x_1, x_2, \dots, x_{13}) in the dataset and a single target variable (y).

3. Splitting the dataset into a train set and a test set

We will use the train_test_split() function of scikit-learn to split the available data into a train set and a test set. We will use 20% of the available data as the test set and the remaining 80% as the training set.

# Splitting the dataset into training and testing set (80/20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)

If you are unsure why we are splitting the data, please make sure to go through 'Introduction to Supervised Machine Learning'.
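
To see what the split produces, here is a small sketch on stand-in data. The arrays x_demo and y_demo are made up for this example and are not part of the lesson's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 samples with 4 features each (hypothetical values)
x_demo = np.arange(400).reshape(100, 4)
y_demo = np.arange(100)

# Same arguments as in the lesson: 20% for testing, fixed random_state
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=28
)

print(x_tr.shape, x_te.shape)  # (80, 4) (20, 4)
print(y_tr.shape, y_te.shape)  # (80,) (20,)
```

Fixing random_state makes the shuffle reproducible, so the same rows end up in the train and test sets on every run.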

4. Fitting the model to the training dataset

After splitting the data, let us initialize a Linear Regression model and fit it to the training data. This is done with the help of the LinearRegression class of scikit-learn.

# Initializing the Linear Regression model
model = LinearRegression()

# Fitting the Multiple Linear Regression model to the data
model.fit(x_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

We have trained (fitted) the model in just two lines of code!

5. Summarizing the model

The goal of model training is to determine the values of the x-coefficients (weights) and the intercept (bias) that give the best linear fit to the data. With more than one feature, this fit is a hyperplane rather than a straight line. Let us print the values of these parameters from the fitted model.

# x-coefficients
print("\nCoefficients:\n", model.coef_)

# Intercept
print("\nIntercept:\n", model.intercept_)
Coefficients: 
[-9.41693929e-02 4.02843274e-02 4.38808541e-02 2.45921683e+00 
-1.66514077e+01 4.55748564e+00 -3.02324498e-03 -1.27668975e+00 
2.80805954e-01 -1.16199877e-02 -1.01204495e+00 1.00501337e-02 
-4.83886151e-01] 

Intercept: 
30.72849196987436 

Since we have 13 features in the training dataset, there are 13 different x-coefficients (weights), i.e., (w_1, w_2, \dots, w_{13}).
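
The link between these fitted values and the model equation can be checked directly. The sketch below uses a small synthetic dataset with 3 features (an assumption made for this example, not the lesson's data) and confirms that multiplying the features by coef_ and adding intercept_ reproduces the output of predict().

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with 3 features and a little noise (hypothetical values)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + 0.5 + rng.normal(scale=0.1, size=40)

demo_model = LinearRegression().fit(X_demo, y_demo)

# One weight per feature
print(demo_model.coef_.shape)  # (3,)

# y = w_1*x_1 + w_2*x_2 + w_3*x_3 + b, computed by hand for every sample
manual = X_demo @ demo_model.coef_ + demo_model.intercept_
print(np.allclose(manual, demo_model.predict(X_demo)))  # True
```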

6. Calculating the loss after training

Let us now calculate the loss between the actual target values in the testing set and the values predicted by the model with the use of a cost function called the Root Mean Square Error (RMSE).

    \[RMSE = \sqrt{(\frac{1}{n})\sum_{i=1}^{n}(y_{i} - \hat{y_{i}})^{2}}\]

where,
y_i is the actual target value, 
\hat{y_{i}} is the predicted target value, and
n is the total number of data points.

The RMSE of a model measures the absolute fit of the model to the data. In other words, it indicates how close the model's predicted values are to the actual data points, expressed in the same units as the target variable. A lower RMSE indicates a better fit, which makes it a useful measure of the accuracy of the model's predictions.
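
As a quick check of the formula, the sketch below computes the RMSE by hand for four made-up data points and compares it with the value obtained through scikit-learn's mean_squared_error. The arrays are hypothetical and chosen so that the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted target values
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

# Errors are 1, 0, -1, -2; their squares are 1, 0, 1, 4; the mean is 1.5
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# The same quantity as the square root of scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(round(rmse_manual, 3))   # 1.225
print(round(rmse_sklearn, 3))  # 1.225
```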

# Predicting the target values of the test set
y_pred = model.predict(x_test)

# RMSE (Root Mean Square Error) as the cost function
# mean_squared_error was imported directly, so it is called without the metrics prefix
rmse = float(format(np.sqrt(mean_squared_error(y_test, y_pred)), '.3f'))
print("\nRMSE:\n", rmse)
RMSE:
5.494

An RMSE value of 5.494 indicates that the model's predictions are not perfect. This is quite normal since we are modelling the relationship of 13 different features with the target variable using a linear function, which may not fit all the data points exactly.

There are various methods available to further reduce the loss of the model, but we will not be discussing those in this lesson.

7. Visualizing the results

Let us now visualize the test set results by plotting the actual target values in the test set vs the predicted target values to see how well the model is fitted.

# Plotting the result of actual target values in the test set vs the predicted target values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual target values')
plt.ylabel('Predicted target values')

[Figure: Multiple linear regression result plot]

Although the predicted target values and the actual target values are not exactly the same, the above graph looks roughly linear, and our Multiple Linear Regression model seems to be performing reasonably well.

Putting it all together

The final code for the implementation of Multiple Linear Regression in Python is as follows.

# Importing necessary libraries
import numpy as np # for array operations
import matplotlib.pyplot as plt # for visualizing data
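# Jupyter/IPython magic for inline plots; remove this line in a plain Python script
%matplotlib inline

# scikit-learn for model building and validation
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston # for loading the dataset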
from sklearn.model_selection import train_test_split # for splitting the data
from sklearn.linear_model import LinearRegression # for building the model
from sklearn.metrics import mean_squared_error # for calculating the cost function

# Loading the dataset
dataset = load_boston()

# Getting the features (x) and target (y)
x = dataset.data
y = dataset.target

print("Total number of samples in the dataset: {}".format(x.shape[0]))

# Splitting the dataset into training and testing set (80/20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)

# Initializing the Linear Regression model
model = LinearRegression()

# Fitting the Multiple Linear Regression model to the data
model.fit(x_train, y_train)

# x-coefficients
print("\nCoefficients:\n", model.coef_)

# Intercept
print("\nIntercept:\n", model.intercept_)

# Predicting the target values of the test set
y_pred = model.predict(x_test)

# RMSE (Root Mean Square Error) as the cost function
# mean_squared_error was imported directly, so it is called without the metrics prefix
rmse = float(format(np.sqrt(mean_squared_error(y_test, y_pred)), '.3f'))
print("\nRMSE:\n", rmse)

# Plotting the result of actual target values in the test set vs the predicted target values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual target values')
plt.ylabel('Predicted target values')
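
Since load_boston is not available in recent scikit-learn releases, the sketch below runs the same steps on a synthetic stand-in dataset generated with make_regression. The dataset is an assumption made for this example: it matches the Boston data only in size (506 samples, 13 features), so its coefficients and RMSE will differ from the values shown in this lesson. The plotting step is left out to keep the sketch short.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 506 samples, 13 features, some noise
x_syn, y_syn = make_regression(
    n_samples=506, n_features=13, noise=5.0, random_state=28
)

# Splitting the dataset into training and testing set (80/20)
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(
    x_syn, y_syn, test_size=0.2, random_state=28
)

# Initializing and fitting the Multiple Linear Regression model
model_s = LinearRegression()
model_s.fit(x_train_s, y_train_s)

# Predicting the target values of the test set and computing the RMSE
y_pred_s = model_s.predict(x_test_s)
rmse_s = np.sqrt(mean_squared_error(y_test_s, y_pred_s))

print("Number of coefficients:", model_s.coef_.shape[0])
print("Intercept:", model_s.intercept_)
print("RMSE:", round(float(rmse_s), 3))
```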

In this lesson, we discussed the basics of Multiple Linear Regression along with its implementation in Python. In the next lesson, we will discuss Polynomial Regression.
