Unfolding Math for Polynomial (Non-Linear) Regression

Discriminant Regression

Sarvesh Khetan
Jun 8, 2024

Table of Contents:

  1. Motivation
  2. Simple Polynomial Regression
    a. Early Stopping Regularizer
    b. Penalty Regularizer
    c. Data Augmentation Regularizer
  3. Multiple Polynomial Regression
  4. KeyNotes
    a. Model Explainability

Motivation

We have already seen how to solve a regression problem statement using Linear Regression. But the mapping function need not always be a straight line; a non-linear function may fit the dataset better than a linear one. Hence, here we attempt to derive the mathematics to fit a non-linear (polynomial) curve to the regression dataset !!

Simple Polynomial Regression

We will first build the intuition of this algorithm using a dataset with just one feature, so that we can draw plots and understand it visually, and then extrapolate these understandings to an N-feature dataset.

Hence you can conclude that fitting a non-linear curve to the original dataset containing only 1 independent feature X1 is equivalent to fitting a linear model on 'N' polynomial features, i.e. y = w0 + w1·X1 + w2·X1² + … + wN·X1^N, which is linear in the constructed features X1, X1², …, X1^N.

When we fit a degree-2 polynomial (N = 2), it is called quadratic regression.

These polynomial features X1², X1³, … are called matured features. This process of creating matured features is called Feature Construction / Feature Creation / Feature Extraction. Matured features are just another representation of the original features, and hence developing the matured features is also called Representation Learning. Matured features are non-linear features.
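Below is a minimal sketch of this feature-construction step in numpy. The dataset, variable names, and degree are illustrative assumptions, not from the article; the point is only that once the matured features are built, the problem becomes a multiple linear regression.

```python
import numpy as np

# Hypothetical toy dataset with a single original feature X1
rng = np.random.default_rng(0)
X1 = rng.uniform(-2, 2, size=100)
y = 1.5 + 0.8 * X1 - 2.0 * X1 ** 2 + rng.normal(0, 0.3, size=100)

N = 3  # degree of the polynomial we want to fit

# Feature construction: build the "matured" features X1, X1^2, ..., X1^N
X_matured = np.column_stack([X1 ** d for d in range(1, N + 1)])  # shape (100, N)

# Fitting a degree-N polynomial to X1 is now just multiple LINEAR regression
# on the matured features (plus an intercept column of ones)
X_design = np.column_stack([np.ones(len(X1)), X_matured])
```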

Now, we have already seen how to solve the Multiple Linear Regression problem: you need to solve the following convex, unconstrained, non-linear optimization problem, i.e. minimize the mean squared error between the predictions of the linear model (on the matured features) and the targets, with respect to the weights.

We have already discussed in detail several methods, like normal equations / GD / SGD / MGD / …, in earlier posts to solve this optimization problem. You can use any one of them here.
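As a hedged sketch of two of those solvers (closed-form normal equations and batch gradient descent on the MSE objective), on hypothetical data and with illustrative function names:

```python
import numpy as np

def design_matrix(x, N):
    """Intercept column plus the matured features x, x^2, ..., x^N."""
    return np.column_stack([x ** d for d in range(0, N + 1)])

def fit_normal_equation(X, y):
    """Closed-form least squares: w = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_gradient_descent(X, y, lr=0.05, epochs=20000):
    """Batch gradient descent on the mean-squared-error objective."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        grad = (2.0 / n) * X.T @ (X @ w - y)   # gradient of MSE w.r.t. weights
        w -= lr * grad
    return w

# Toy usage on hypothetical data
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 2.0 - 3.0 * x + 0.5 * x ** 2 + rng.normal(0, 0.1, 200)
X = design_matrix(x, N=2)
print(fit_normal_equation(X, y))    # both should recover roughly [2, -3, 0.5]
print(fit_gradient_descent(X, y))
```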

How do you decide the degree of the polynomial you want to fit (i.e. the value of N)? Another way of asking this is: 'in which higher-dimensional feature space will we be able to fit a linear model?' We could surely fit a linear model in an infinite-dimensional space, but an infinite-dimensional space is not usable in practice !!

  1. Use dimensionality reduction techniques like PCA to convert the features into 2 dimensions and then use data visualization methods to decide the degree of the polynomial you want to fit.
  2. Treat the value of N as a hyper-parameter, try out different values of N, and choose the N which gives the best results (a minimal validation-based sketch of this is shown below).

Remember, you cannot choose a very high value of N because it will lead to overfitting, and you cannot choose a very low value of N because it will lead to underfitting. More on this is explained below.
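Here is a minimal sketch of treating N as a hyper-parameter and picking the degree with the lowest validation error. The dataset, split, and range of candidate degrees are illustrative assumptions.

```python
import numpy as np

def design_matrix(x, N):
    return np.column_stack([x ** d for d in range(0, N + 1)])

# Hypothetical non-linear 1-feature dataset
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 300)
y = np.sin(2 * x) + rng.normal(0, 0.1, 300)

# Simple train / validation split
x_tr, x_val, y_tr, y_val = x[:200], x[200:], y[:200], y[200:]

best_N, best_err = None, np.inf
for N in range(1, 11):                                        # candidate degrees
    w = np.linalg.lstsq(design_matrix(x_tr, N), y_tr, rcond=None)[0]
    err = np.mean((design_matrix(x_val, N) @ w - y_val) ** 2)  # validation MSE
    # very small N under-fits (high bias); very large N over-fits (high variance)
    if err < best_err:
        best_N, best_err = N, err

print(best_N, best_err)
```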

Hence, to handle overfitting we need to convert high variance into low variance, and this can only be done by allowing a slight rise in bias. As you can see, bias increased from 0.09 to 0.21, i.e. by 0.12, but this caused a reduction in variance by 4.97. This is called the bias-variance tradeoff.

There are several ways to handle overfitting. All these techniques ultimately perform a bias-variance tradeoff. These methods are collectively called REGULARIZATION TECHNIQUES.

1. Early Stopping Regularizer

We can stop the optimization early, i.e. stop much before reaching close to the minima. Hence, instead of getting the optimal weights you will get sub-optimal weights. I urge you to introspect how this curbs overfitting; it is easy to infer if you have understood everything until now.

How do you decide how early you should stop? Treat it like a hyper-parameter and use trial and error to identify the best time to stop !! You can also use the ELBOW method to identify the best value of this hyper-parameter.
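A minimal sketch of early stopping, assuming batch gradient descent on the MSE and a held-out validation set; the function and its parameters (patience, learning rate) are illustrative, not from the article.

```python
import numpy as np

def early_stopped_gd(X_tr, y_tr, X_val, y_val, lr=0.01, max_epochs=50000, patience=50):
    """Batch gradient descent on MSE that stops once the validation error has not
    improved for `patience` consecutive epochs, returning the (sub-optimal) weights
    that scored best on the validation set."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, since_best = w.copy(), np.inf, 0
    n = len(y_tr)
    for _ in range(max_epochs):
        w -= lr * (2.0 / n) * X_tr.T @ (X_tr @ w - y_tr)
        val_err = np.mean((X_val @ w - y_val) ** 2)
        if val_err < best_val:
            best_val, best_w, since_best = val_err, w.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:   # stop well before the training minimum
                break
    return best_w
```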

2. Penalty Regularizer : This can be implemented in two ways :

Penalize Weights :

  • L1 or Lasso Regularization on weights
  • L2 or Ridge Regularization on weights
  • Elastic Net Regularization on weights

You can learn the in-depth mathematical derivations and code implementations of these here.

Penalize Features :

  • L1 or Lasso Regularization on features
  • L2 or Ridge Regularization on features
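As a small sketch of the weight-penalty idea, here is the closed-form Ridge (L2) solution; the regularization strength `lam` and the decision to penalize the intercept are illustrative simplifications, not the article's derivation.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """L2 / Ridge penalized least squares:
       minimize ||y - Xw||^2 + lam * ||w||^2,
       closed form: w = (X^T X + lam * I)^{-1} X^T y.
    For simplicity this sketch penalizes every weight, including the intercept;
    in practice the intercept is usually left un-penalized."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# L1 / Lasso has no closed form; in practice an iterative solver is used,
# e.g. sklearn.linear_model.Lasso(alpha=...) or ElasticNet(alpha=..., l1_ratio=...).
```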

3. Data Augmentation Regulariser

This basically means increasing the number of data points, i.e. rows. To do this we need to add some fake data points to our dataset. These fake data points are called noise. There are several ways to generate noise:

  • Method 1 : Add Gaussian noise to the original features and then calculate the matured polynomial features.
  • Method 2 : Calculate the matured polynomial features first and then add Gaussian noise to the matured polynomial features. (This is equivalent to L2 Regularization; proof below.)
  • Method 3 : A combination of the above two methods !!
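A minimal sketch of Methods 1 and 2, assuming a single original feature; the function name, noise scale, and number of copies are illustrative choices.

```python
import numpy as np

def augment_with_noise(x, y, degree=2, copies=3, sigma=0.05, noise_on="original"):
    """Noise-based data augmentation for polynomial regression (illustrative).

    noise_on="original": Method 1 - perturb the raw feature, then build x, x^2, ...
    noise_on="matured" : Method 2 - build x, x^2, ... first, then perturb those
                         columns (closely related to L2 regularization)."""
    rng = np.random.default_rng(0)
    feats, targets = [], []
    for _ in range(copies):
        if noise_on == "original":
            x_noisy = x + rng.normal(0, sigma, size=x.shape)
            F = np.column_stack([x_noisy ** d for d in range(1, degree + 1)])
        else:
            F = np.column_stack([x ** d for d in range(1, degree + 1)])
            F = F + rng.normal(0, sigma, size=F.shape)
        feats.append(F)
        targets.append(y)
    return np.vstack(feats), np.concatenate(targets)   # more rows than before
```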

Multiple Polynomial Regression

Extrapolating the same logic which we arrived at in simple polynomial regression, i.e. construct matured polynomial features (powers of each original feature, and optionally their cross terms) and then fit a multiple linear regression model on these matured features.
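A short sketch using scikit-learn's PolynomialFeatures to build the matured features (including cross terms) for several original features; the dataset and degree are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical dataset with 3 original features
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 3))
y = 1.0 + X[:, 0] - 2.0 * X[:, 1] * X[:, 2] + rng.normal(0, 0.1, 200)

# Construct the matured features: powers of each feature AND cross terms
# such as x1*x2, x1*x3, x2*x3
poly = PolynomialFeatures(degree=2, include_bias=False)
X_matured = poly.fit_transform(X)

# Multiple polynomial regression = multiple linear regression on matured features
model = LinearRegression().fit(X_matured, y)
```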

Here also overfitting can occur and we will use the same techniques to handle overfitting as discussed above.

KeyNotes

  1. Since we need to create polynomial features, the time complexity of the algorithm increases.
  2. Creating polynomial features is a manual process, not an automated one. (Don't calculate the matured features after normalizing the original features; first calculate the matured features from the unnormalized originals and then normalize the matured features.)
  3. Deciding the degree of the polynomial is very difficult.

Model Explainability

Feature weights can be used to explain which feature is important for the model !!
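One possible way to read the weights, continuing from the multiple-polynomial-regression sketch above (`poly` and `model` are assumed from there); this is a rough heuristic, not the article's prescribed method.

```python
import numpy as np

names = poly.get_feature_names_out()      # names of the matured features
importance = np.abs(model.coef_)          # weight magnitude as a rough importance
for name, w in sorted(zip(names, importance), key=lambda t: -t[1])[:5]:
    print(name, round(w, 3))

# Caveat: magnitudes are only comparable when the features are on a similar scale,
# so normalize the matured features before reading the weights this way.
```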

Written by Sarvesh Khetan

A deep learning enthusiast and a Masters Student at University of Maryland, College Park.
