Kernel Regression
Motivation
Now, we have already seen that to create a non-linear regression model you need to either create polynomial matured features on your own (polynomial regression) or create matured features with the help of a domain expert (domain expert regression). But both these methods have their disadvantages…
The disadvantage of polynomial regression is that creating these features is compute-intensive! The disadvantage of domain expert regression is that sometimes even the domain expert doesn’t know which new matured features (polynomial or non-polynomial) will be good.
Hence, to address these problems, people came up with something called kernel functions (the most famous family of which are the radial basis functions), which give us the matured features (both polynomial and non-polynomial) when applied to the original features, without the need of a domain expert and at a low compute cost. To understand how they do this, we will first look at the linear regression equations in a different format, shown below!
Linear Regression — Revamped
I would encourage you to first read linear regression here and then read further…
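For reference, the standard closed-form least-squares solution (the normal equation) for the weights is:

$$\hat{w} = (X^{T}X)^{-1}X^{T}y, \qquad \hat{y} = X\hat{w}$$

where X is the matrix of input features and y is the vector of targets.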
Now let’s try to decode the above weights formula. One thing to notice in all the above equations is that the term XᵀX appears everywhere.
- This same equation holds true when fitting a non-linear model too, but you would have to first calculate the matured feature set X, then Xᵀ, and finally XᵀX.
- The above approach is computationally highly inefficient, so researchers asked: instead of calculating X, then Xᵀ, and then XᵀX, why don’t we calculate XᵀX directly? Because if we can calculate it directly, then we never need to compute X, i.e. the matured feature set, at all! This is exactly what kernel functions enable us to do, and hence it is called the “Kernel Trick” (a concrete demonstration follows below).
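To make the trick concrete, here is a minimal sketch in Python (numpy only). The feature map shown is the standard explicit map for a degree-2 polynomial kernel, used purely for illustration:

```python
import numpy as np

# Two samples, each with 2 original features
x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

def phi(v):
    """Explicit degree-2 'matured' feature map for a 2-feature input:
    the expensive route, where every polynomial feature is materialized."""
    v1, v2 = v
    return np.array([1.0,
                     np.sqrt(2) * v1, np.sqrt(2) * v2,
                     v1**2, v2**2,
                     np.sqrt(2) * v1 * v2])

# Route 1: build the matured features, then take their inner product
explicit = phi(x) @ phi(z)

# Route 2 (kernel trick): the degree-2 polynomial kernel gives the SAME
# number directly from the 2 original features, without ever building phi
kernel = (x @ z + 1) ** 2

print(explicit, kernel)  # both are 0.25 (up to float rounding)
```

Route 2 never touches the 6-dimensional matured space, yet produces exactly the inner product that route 1 works so hard for; with higher degrees and more features, the gap in cost becomes enormous.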
Kernel Functions
Kernel functions have to be non-linear functions so that the transformed features are non-linear. This ensures that fitting a linear model on the non-linear matured features leads to a non-linear model on the original features. Hence not just any random function can be called a kernel function: for a function to be a kernel function, it should satisfy “Mercer’s condition”, stated below.
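In its commonly used finite-sample form, Mercer’s condition says that K must be symmetric and must induce a positive semi-definite Gram matrix on every finite set of points:

$$K(x, z) = K(z, x), \qquad \sum_{i=1}^{n}\sum_{j=1}^{n} c_i\, c_j\, K(x_i, x_j) \ge 0 \quad \text{for all } n,\; x_1,\dots,x_n,\; c \in \mathbb{R}^n$$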
There are thousands of such kernel functions defined by researchers, but some of the most famous ones are listed below:
Linear Kernel
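The linear kernel is simply the inner product of the two original feature vectors, so it matures nothing and just reproduces plain linear regression:

$$K(x, z) = x^{T}z$$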
Polynomial Kernel
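Its standard form, with the degree d and the constant c as hyperparameters, is:

$$K(x, z) = (x^{T}z + c)^{d}$$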
Hence a polynomial kernel takes us to some FINITE dimension: a degree-d kernel on m original features corresponds to an explicit matured feature set of dimension (m+d choose d).
Exponential Kernel / Gaussian Kernel / Gaussian Radial Basis Function (RBF)
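Its standard form, with the bandwidth σ as a hyperparameter, is:

$$K(x, z) = \exp\left(-\frac{\lVert x - z\rVert^{2}}{2\sigma^{2}}\right)$$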
Hence a Gaussian kernel takes us to an INFINITE dimension and is therefore often considered the best kernel function. Why is going to an infinite dimension the best? Because in an infinite-dimensional feature space a linear model becomes expressive enough to fit almost any pattern (in the classification setting, the data is sure to become linearly separable there).
The Gaussian kernel also satisfies Mercer’s condition, and notice what it achieves: you could never calculate an infinite set of matured features manually, yet the Gaussian kernel makes this possible, which is why this is the poster child of the kernel trick. A sketch of why the dimension is infinite follows.
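Shown in one dimension for simplicity: expanding the cross term of the Gaussian kernel with the Taylor series of the exponential gives

$$e^{-\frac{(x-z)^{2}}{2\sigma^{2}}} = e^{-\frac{x^{2}}{2\sigma^{2}}}\, e^{-\frac{z^{2}}{2\sigma^{2}}} \sum_{k=0}^{\infty} \frac{1}{k!}\left(\frac{xz}{\sigma^{2}}\right)^{k}$$

i.e. one matured feature per power xᵏ, for every k = 0, 1, 2, …: infinitely many features, all bought for the price of one exponential.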
Sigmoid Kernel
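Its usual form, with slope a and intercept b as hyperparameters, is shown below. One caveat worth knowing: unlike the kernels above, the sigmoid kernel satisfies Mercer’s condition only for some choices of a and b.

$$K(x, z) = \tanh(a\, x^{T}z + b)$$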
Keynotes
Now, given that there are so many kernel functions, how do we decide which one is best for our problem? Treat the kernel function as a hyperparameter and tune it like any other, as sketched below.
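As a minimal sketch, assuming scikit-learn’s KernelRidge (kernel regression with a ridge penalty), the kernel can be tuned with a plain cross-validated grid search; the data here is synthetic, purely for illustration:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear data: y = sin(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

# Search over the kernel (and its settings) exactly like any hyperparameter
param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "alpha": [0.01, 0.1, 1.0],
}
search = GridSearchCV(KernelRidge(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # e.g. the rbf kernel wins on this sine data
```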
Advantages
- The time complexity of the algorithm does not blow up, because the kernel is computed directly from the original features; the (possibly huge, or even infinite) matured feature set is never materialized.
- We saw that in domain expert regression we needed the help of a domain expert to tell us whether a feature like X1·(X2)^2.6 should be used or not; here we don’t need the domain expert, since the kernel function automatically does this for us.
Disadvantages
- Choosing the kernel function is still a manual, trial-and-error process.
- You need to perform normalization before applying kernel functions; otherwise (say, if the kernel is a polynomial kernel) big feature values will explode and small values will diminish after the kernel is applied (see the sketch after this list).
- The problem of local constancy creeps in due to kernel functions: local kernels like the Gaussian implicitly assume the target function barely changes within a small neighbourhood of the training points.
- Kernel functions are very generic and not specific to the problem, and hence they usually cause overfitting. To deal with overfitting you can use the following techniques (as explained in the polynomial regression post):
- Data Augmentation
- Early Stopping
- Penalty Regularization
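As a minimal sketch of the normalization point above (again assuming scikit-learn), scaling can be baked into a pipeline that runs before the kernel model; the built-in ridge penalty of KernelRidge also covers the penalty-regularization fix:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_ridge import KernelRidge

# Two features on wildly different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1000, 100),    # big-valued feature
                     rng.uniform(0, 0.001, 100)])  # tiny-valued feature
y = X[:, 0] * 0.01 + rng.normal(0, 0.1, 100)

# Without scaling, a polynomial kernel would let the big feature explode
# and the tiny one vanish; scaling first puts them on an equal footing.
model = make_pipeline(StandardScaler(),
                      KernelRidge(kernel="poly", degree=3, alpha=1.0))
model.fit(X, y)
print(model.predict(X[:3]))
```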
Final verdict
It is generally preferred not to use kernel regression, because we don’t get an exceptional increase in accuracy compared to the increase in computation cost, which grows steeply with the number of training samples (the kernel matrix alone grows quadratically).