Kernel Method With Python Full Tutorial


In machine learning, kernel methods are a class of algorithms for pattern analysis to find and study general types of relations (for example clusters, rankings, principal components, correlations, classifications) in datasets.

A kernel is a function that computes the similarity (inner product) of two data points as if they had been mapped from a lower-dimensional space into a higher-dimensional one. Algorithms capable of operating with kernels include the kernel perceptron, support vector machines (SVM), Gaussian processes, principal components analysis (PCA), canonical correlation analysis, ridge regression, spectral clustering, linear adaptive filters, and many others. Any linear model can be turned into a nonlinear model by applying the kernel trick: replacing the inner products between its features (predictors) with a kernel function.

Why use Kernel?

The idea is to use a higher-dimensional feature space in which data that is not linearly separable becomes (almost) linearly separable. The following example problem and its solution illustrate the need for kernels:

Suppose we have linearly separable data, so that a linear classifier can easily draw a line as the decision boundary separating the classes, as shown below:


Now consider data that is not linearly separable: a line, or any linear classifier, cannot separate the classes (i.e. there is no clear separating hyperplane), as shown in the figure below:


To classify data like this, let's move from a 2D view of the data to a 3D view. In three dimensions, the separating hyperplane is no longer a line but a plane. As the examples below show, we can easily draw a plane that separates the data. Finally, projecting this plane back to 2D gives a nonlinear decision boundary that separates the classes.
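The move from 2D to 3D can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the synthetic data (an inner disc surrounded by an outer ring), the lifting function z = x1^2 + x2^2, and the plane height `threshold` are all hypothetical choices made for this sketch, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data (hypothetical): class 0 lies inside radius 1,
# class 1 lies on a ring between radius 2 and radius 3.
angles = rng.uniform(0, 2 * np.pi, size=100)
radii = np.concatenate([rng.uniform(0.0, 1.0, 50), rng.uniform(2.0, 3.0, 50)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 50 + [1] * 50)

# Lift the data to 3D by adding a third coordinate z = x1^2 + x2^2.
z = (X ** 2).sum(axis=1)
X_3d = np.column_stack([X, z])

# In 3D the horizontal plane z = threshold separates the two classes.
# Any height between 1 and 4 would work for this particular data.
threshold = 2.25
y_pred = (X_3d[:, 2] > threshold).astype(int)

accuracy = (y_pred == y).mean()
print(accuracy)
```

Projected back to 2D, the plane z = 2.25 becomes the circle x1^2 + x2^2 = 2.25, which is the nonlinear decision boundary.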



One approach to finding such a decision boundary is the explicit method: compute the coordinates of the data in the feature space, which is computationally expensive. The other approach is the kernel method: compute only the inner products between the images of all pairs of data points in the feature space, which is often cheaper than computing the coordinates themselves. This is usually known as the kernel trick. (See the example presented below, which illustrates this.)

In addition to performing linear classification, SVMs can efficiently perform nonlinear classification using the kernel trick described above, implicitly mapping their inputs into high-dimensional feature spaces. This avoids the explicit mapping that would otherwise be needed for a linear learning algorithm to learn a nonlinear function or decision boundary.
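As a rough illustration of this difference, the sketch below fits a linear SVM and an RBF-kernel SVM to the same data. The dataset is an assumption made for this sketch: two concentric circles generated with Sklearn's make_circles, with hypothetical noise and size settings, scored on the training data only.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Hypothetical toy data: two concentric circles, not linearly separable in 2D.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

# Same classifier, two different kernels.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print("linear kernel accuracy:", linear_acc)
print("rbf kernel accuracy:", rbf_acc)
```

On data like this the RBF kernel should score clearly higher, because no straight line can separate an inner circle from the circle around it.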

How Do Kernels Work Faster?

Kernels can be mathematically defined as:

K(x, y) = <f(x), f(y)>

where K is the kernel function, x and y are n-dimensional inputs, f is a map from the n-dimensional space to an m-dimensional space, and <x, y> denotes the dot product.

The dot product of two vectors of the same dimension gives a single number. A kernel uses this property to compute the dot product in a different space without ever visiting that space. This can be illustrated by the following simple example:


Assume that, we have two features represented in the following form:

x_i = [x_{i1}, x_{i2}], where i indexes the data points and the subscripts 1 and 2 denote the two features.

Also, let us assume that we are applying some transformation function to convert two-dimensional input space (two features) into four-dimensional feature space.

Let us take two data points x_i and x_j.

To calculate the dot product of two vectors in the four dimensional space/transformed space, the standard way is:

\phi(x_i) = [(x_{i1})^2, x_{i1}x_{i2}, x_{i2}x_{i1}, (x_{i2})^2]

\phi(x_j) = [(x_{j1})^2, x_{j1}x_{j2}, x_{j2}x_{j1}, (x_{j2})^2]

\phi(x_i).\phi(x_j) = (x_{i1})^2(x_{j1})^2 + (x_{i1}x_{i2})(x_{j1}x_{j2}) + (x_{i2}x_{i1})(x_{j2}x_{j1}) + (x_{i2})^2(x_{j2})^2

A kernel function calculates this dot product without ever visiting the transformed space. The kernel function for the above transformation is:

K(x_i, x_j) = (T(x_i).x_j)^2

where T(x_i) is the transpose of x_i.

Let us now instantiate values of x_i and x_j as:

  • x_i = [1, 2]
  • x_j = [3, 4]

Computing the dot product in the four-dimensional space the standard way, using ϕ(x_i), ϕ(x_j) and the dot product formula above, gives:

[1, 2, 2, 4].[9, 12, 12, 16] = 9+24+24+64 = 121

The same dot product can be calculated with the kernel function K(x_i, x_j), without transforming the original space at all:

K(x_i, x_j) = (T(x_i).x_j)^2 = ([1, 2].[3, 4])^2 = (3+8)^2 = 121

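As this example shows, the standard method of calculating the dot product requires O(n^2) time, because the transformation creates n^2 features from n inputs. The kernel requires just O(n) time, because it only takes a dot product in the original n-dimensional space and squares the result.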
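The worked example above can be checked with a short NumPy sketch. The helper names `phi` and `poly2_kernel` are hypothetical, introduced only for this illustration.

```python
import numpy as np

def phi(x):
    """Explicit map from 2 features to the 4-dimensional feature space."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])

def poly2_kernel(a, b):
    """Kernel K(a, b) = (a . b)^2, computed in the original 2D space."""
    return np.dot(a, b) ** 2

x_i = np.array([1, 2])
x_j = np.array([3, 4])

# Standard way: transform both points, then take the dot product.
explicit = np.dot(phi(x_i), phi(x_j))

# Kernel way: never build the 4-dimensional vectors.
implicit = poly2_kernel(x_i, x_j)

print(explicit, implicit)  # both are 121
```

Both routes give the same number, but only the first one ever constructs the four-dimensional vectors.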

Some common Kernel Functions

Kernel functions come in different types, for example linear, polynomial, radial basis function (RBF), and sigmoid. Some common kernel functions are described below:

  • Linear Kernel
    • K(x_i, x_j) = sum(x_i * x_j). This defines the similarity or a distance measure between new data and the support vectors. The dot product is the similarity measure used for linear SVM or a linear kernel because the distance is a linear combination of the inputs.
  • Polynomial Kernel
    • K(x_i, x_j) = (1 + sum(x_i * x_j))^d, where the degree d of the polynomial must be specified by hand to the learning algorithm. When d=1, this is the same as the linear kernel up to the constant offset of 1. The polynomial kernel allows for curved lines in the input space.
  • Radial Kernel
    • K(x_i, x_j) = exp(-gamma * sum((x_i - x_j)^2)), where gamma is a parameter that must be specified to the learning algorithm. The radial kernel is very local and can create complex regions within the feature space, like closed polygons in two-dimensional space.
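The three kernels listed above can be written directly in NumPy. This is a minimal sketch, with the polynomial kernel taken in its usual form (1 + x_i.x_j)^d. The function names and the default values d=2 and gamma=0.5 are assumptions chosen for illustration.

```python
import numpy as np

def linear_kernel(a, b):
    """K(a, b) = sum(a * b), the plain dot product."""
    return np.sum(a * b)

def polynomial_kernel(a, b, d=2):
    """K(a, b) = (1 + sum(a * b))^d for a chosen degree d."""
    return (1 + np.sum(a * b)) ** d

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * sum((a - b)^2))."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, 4.0])

print(linear_kernel(x_i, x_j))      # 1*3 + 2*4 = 11
print(polynomial_kernel(x_i, x_j))  # (1 + 11)^2 = 144
print(rbf_kernel(x_i, x_j))         # exp(-0.5 * 8) = exp(-4)
print(rbf_kernel(x_i, x_i))         # identical points give 1.0
```

The last two lines show why the radial kernel is described as local: its value is 1 for identical points and decays quickly toward 0 as the points move apart.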

Implementation of Kernel SVM Classification

1. Import the necessary libraries/modules

Some essential Python libraries are needed, namely NumPy (for some mathematical calculations), Pandas (for data loading and preprocessing) and some modules of Sklearn (for model development and prediction). Let's import NumPy and Pandas first, before we import the Sklearn modules:

#Import necessary libraries
import numpy as np
import pandas as pd

2. Import and Inspect the dataset

After importing the necessary libraries, the pandas function read_csv() is used to load the CSV file and store it as a pandas dataframe object. Then, to inspect the dataset, the head() method of the dataframe object is used as shown below. This dataset consists of logs that record which users purchased or did not purchase a particular product, along with other features (Id, Gender, Age, Estimated Salary):

#Import and Inspect the dataset
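A hedged sketch of this step follows. The tutorial's CSV file is not named here, so the sketch reads a small in-memory stand-in instead. The column names and every row in `csv_text` are hypothetical placeholders that only mirror the layout described above (Id, Gender, Age, Estimated Salary, Purchased). With the real file, you would pass its path to read_csv().

```python
import io
import pandas as pd

# Hypothetical stand-in for the tutorial's CSV file. The rows are made up
# purely to show the expected column layout; they are not the real data.
csv_text = """User ID,Gender,Age,EstimatedSalary,Purchased
1,Male,19,19000,0
2,Female,35,20000,0
3,Female,26,43000,0
4,Male,47,25000,1
5,Female,45,26000,1
6,Male,46,28000,1
"""

# read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

# Inspect the first rows and the overall shape.
print(dataset.head())
print(dataset.shape)
```

With this layout, columns 2 and 3 hold Age and Estimated Salary and column 4 holds Purchased, which matches the iloc slicing used in the next step.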

3. Separate Dependent- Independent variables

After inspecting the dataset, the independent variables (X) and the dependent variable (y) are separated using the iloc indexer for slicing, as shown below. Our goal is to predict whether a user purchased the product, given Age and Estimated Salary from the above dataset. So the features Age and Estimated Salary (X) are the independent variables and Purchased (y) is the dependent variable.

#Separate Dependent and Independent variables
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values


4. Split the dataset into train-test sets and Feature Scale

After separating the independent variables (X) and the dependent variable (y), these values are split into train and test sets to train and evaluate the model. To do this, the train_test_split function from Sklearn is used, with the test set being 25 percent of the available data, as shown below. Here X_train and y_train are the train sets and X_test and y_test are the test sets. The data is also scaled using the StandardScaler class from Sklearn, which standardizes features by removing the mean and scaling to unit variance:

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
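To make "removing the mean and scaling to unit variance" concrete, here is a minimal NumPy sketch of the same arithmetic. The small arrays are hypothetical values, not rows from the tutorial's dataset, and are named differently from the tutorial's variables on purpose.

```python
import numpy as np

# Hypothetical (age, salary) rows used only to illustrate the arithmetic.
train_rows = np.array([[19.0, 19000.0],
                       [35.0, 20000.0],
                       [26.0, 43000.0],
                       [47.0, 25000.0]])
test_rows = np.array([[45.0, 26000.0]])

# Statistics are computed on the training rows only (the "fit" step).
mean = train_rows.mean(axis=0)
std = train_rows.std(axis=0)  # population standard deviation

# The same mean and std are applied to both sets (the "transform" step).
train_scaled = (train_rows - mean) / std
test_scaled = (test_rows - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
```

The test rows are scaled with the training statistics, which is why the tutorial calls fit_transform on X_train but only transform on X_test.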

5. Fit SVM kernel model to the dataset

After splitting and scaling the data, the kernel SVM model is fitted on the training set (i.e. X_train and y_train) using the SVC class from Sklearn, specifying the kernel function to be used, as shown below. In this problem, we use the radial kernel explained above.

# Fitting SVM with different kernels to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)  # i.e. radial basis function kernel
#classifier = SVC(kernel = 'poly', random_state = 0)  # i.e. polynomial kernel
classifier.fit(X_train, y_train)

6. Predict the test results

Finally, the model is run on the test data, and its predictions are compared with the actual values in a confusion matrix, as shown below:

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Output:
# [[64  4]
#  [ 3 29]]
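The confusion matrix can be turned into summary metrics with a little arithmetic. The sketch below copies the matrix printed above and assumes Sklearn's layout, where rows are the actual classes and columns are the predicted classes. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class.
cm = np.array([[64, 4],
               [3, 29]])

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()  # (64 + 29) / 100 = 0.93
precision = tp / (tp + fp)       # 29 / 33
recall = tp / (tp + fn)          # 29 / 32

print(f"accuracy={accuracy:.2f} precision={precision:.3f} recall={recall:.3f}")
```

So the RBF-kernel model classified 93 of the 100 test samples correctly, with 4 false positives and 3 false negatives.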

In Conclusion

With this, we have come to the end of our Machine Learning Full Course. We hope that this course helped you to get started with Machine Learning. Also, if you have any questions or feedback, please feel free to let us know in the comment section.

