K-Nearest Neighbours (KNN) for Classification
Lazy Learning
This is one of the simplest classification algorithms to understand and implement. It is also called instance-based learning (IBL), case-based reasoning (CBR), or lazy learning.
Intuition
KNN classifies a test datapoint by looking at the K training datapoints closest to it (under some distance measure) and assigning it the class that the majority of those neighbours belong to.
Why is K always taken as an odd number and not an even number?
If K were even (say 6), majority voting could end in a tie: 3 of the neighbours might belong to Class 1 and the other 3 to Class 2, leaving no way to decide the class of the test datapoint. Choosing an odd K avoids such ties (at least in binary classification), which is why K is conventionally taken as odd!
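A tiny sketch of that tie situation (the neighbour labels below are made up purely for illustration):

```python
from collections import Counter

# Hypothetical labels of a test point's 6 nearest neighbours (K = 6, even)
neighbour_labels = ["Class 1", "Class 2", "Class 1", "Class 2", "Class 1", "Class 2"]

votes = Counter(neighbour_labels)
print(votes.most_common())        # [('Class 1', 3), ('Class 2', 3)] -> a tie, no clear winner

# With an odd K (say 5), a two-class tie cannot happen
votes_odd = Counter(neighbour_labels[:5])
print(votes_odd.most_common(1))   # [('Class 1', 3)] -> unambiguous majority
```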
Matrix Equation to find the distance between M training points and N testing points
As seen above, we have to calculate the distance between a test point and every training datapoint (say there are M training points).
Now suppose at test time we have N datapoints to classify. We then have to compute the distance from each of these N test points to all M training points, i.e. M*N distance computations when done with two nested loops. The number of computations stays the same, but the wall-clock time drops significantly when the whole thing is expressed in matrix notation (one matrix multiplication plus two norm terms instead of explicit loops), as shown below.
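A minimal NumPy sketch of this vectorized computation (the function name and the random example data are my own, chosen purely for illustration), based on the identity ||a − b||² = ||a||² − 2·a·b + ||b||²:

```python
import numpy as np

def pairwise_sq_distances(X_train, X_test):
    """Squared Euclidean distances between N test points and M training points."""
    train_sq = np.sum(X_train ** 2, axis=1)    # shape (M,)
    test_sq = np.sum(X_test ** 2, axis=1)      # shape (N,)
    cross = X_test @ X_train.T                 # shape (N, M)
    # Broadcasting: result[i, j] = ||X_test[i] - X_train[j]||^2
    d = test_sq[:, None] - 2 * cross + train_sq[None, :]
    # Guard against tiny negative values caused by floating-point error
    return np.maximum(d, 0)

# Example usage with random data (illustrative only)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))   # M = 100 training points, d = 4 features
X_test = rng.normal(size=(10, 4))     # N = 10 test points
D = pairwise_sq_distances(X_train, X_test)
print(D.shape)                        # (10, 100)
```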
Key Notes
- Huge Dataset Effect: There is no training time, but the cost saved at training is paid at test time, since classifying a single test example requires comparing it against every training example. In practice we usually care much more about test-time efficiency than train-time efficiency, so avoid KNN on large datasets where inference time would be huge!
- Imbalanced Dataset and Outlier Effect: It is important to handle class imbalance and outliers before fitting KNN, because both can hurt the accuracy of the model.
- Feature Scaling Effect: Feature scaling is important before applying KNN because KNN relies on a distance measure, so features with a larger range can dominate the distance (see the sketch after this list).
- Dimensionality Issue: The more features there are, the longer it takes to compute the distance values, so KNN's computation time depends heavily on the number of features. Try applying dimensionality reduction techniques before applying KNN.
- Missing Value Impact: KNN is heavily impacted by missing values, so handling them before applying KNN is important.
- Model Explainability: Unlike many other classification models, KNN offers no feature-level interpretability, i.e. you cannot explain which feature contributed the most or the least to a particular prediction!
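As a small illustration of the feature-scaling point above, scaling can be bundled together with KNN in a scikit-learn pipeline (a minimal sketch; the Iris dataset and K = 5 are arbitrary choices for the example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale first so that no single feature dominates the distance computation
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```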
Decision Boundary
Let's first think about what the decision boundary of a 1-NN algorithm looks like. In 1-NN, the test sample looks only at its single nearest neighbour and is assigned that neighbour's class, as seen in the diagram below.
Hence, for any two given training datapoints (say A and B), the decision line between them is the line at equal distance from both (the perpendicular bisector of the segment joining them): if the test sample falls slightly on A's side of that line it belongs to A, while if it falls slightly on B's side it belongs to B.
Hence a 1-NN decision boundary looks something like this (called a piecewise linear decision boundary).
So if your dataset has M datapoints, the 1-NN algorithm partitions the space into M regions separated by piecewise linear boundaries, because each datapoint gets its own region.
Now let's extrapolate this conclusion to the K-NN algorithm: for K > 1, the decision boundary is made up of P piecewise linear regions, where P < M and M is the number of datapoints in the dataset.
Some people phrase this as: a K-NN algorithm with K > 1 always has fewer (i.e. simpler, smoother) decision boundaries than the 1-NN algorithm, as the sketch below illustrates.
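To see the piecewise linear boundaries, and how they get smoother as K grows, here is a small plotting sketch on a synthetic 2-D dataset (the dataset and the K values 1 and 15 are arbitrary choices for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D dataset, chosen only so the boundary can be plotted
X, y = make_blobs(n_samples=60, centers=3, cluster_std=1.5, random_state=7)

# Grid covering the feature space
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300),
)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, [1, 15]):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)                  # piecewise linear regions
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")   # training points
    ax.set_title(f"{k}-NN decision boundary")
plt.show()
```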
Hyperparameter Tuning
- K can take values 1 / 2 / … / 5 / …. Increasing K can reduce overfitting and improve accuracy, but beyond a certain value accuracy starts decreasing.
- Distance measures like Euclidean distance / Manhattan distance / ….
You can find the right values for these hyperparameters using a validation dataset: experiment with different combinations and choose the one that gives the highest validation accuracy. Plotting validation accuracy against K on a graph helps you choose the right combination; this curve is popularly known as the ELBOW curve.
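A minimal sketch of such a validation experiment (the Iris dataset, the hold-out split, and the range of K values are arbitrary choices for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Evaluate a range of K values on the validation set
k_values = range(1, 21)
val_accuracy = []
for k in k_values:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_accuracy.append(clf.score(X_val, y_val))

plt.plot(k_values, val_accuracy, marker="o")
plt.xlabel("K")
plt.ylabel("Validation accuracy")
plt.title("Accuracy vs K (elbow curve)")
plt.show()
```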
There are other techniques too for finding the right hyperparameters, such as grid search or Bayesian optimization.
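For instance, a grid search over K and the distance metric could look like this (a minimal sketch using scikit-learn's GridSearchCV; the dataset and parameter ranges are arbitrary example choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Search over K and the distance measure with 5-fold cross-validation
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```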
Does KNN Follow the Local Constancy Assumption?
Firstly, let's see what the local constancy assumption is: it is the assumption that the function we are trying to learn does not change much within a small neighbourhood, i.e. points that lie close to each other are likely to share the same label.
Yes, KNN follows the local constancy assumption to the T! Let's see how, taking 1-NN specifically: 1-NN assigns every test point the label of its single nearest training point, so within the region surrounding each training point the prediction stays constant.
Since KNN's predictions are constant within each such local region, its predictions are exactly in line with the local constancy assumption.
Hence, on datasets where the local constancy assumption does not hold, KNN will fail to give good results.
Code Implementation
Below you can find both a from-scratch implementation and a library-based implementation of the KNN algorithm.
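A minimal sketch of both: a brute-force, Euclidean-distance version written from scratch, compared against scikit-learn's KNeighborsClassifier (the Iris dataset and K = 5 are arbitrary example choices):

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

class KNNClassifier:
    """From-scratch KNN classifier (brute force, Euclidean distance)."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # Lazy learner: "training" just stores the data
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X):
            # Distances from this test point to every training point
            dists = np.linalg.norm(self.X_train - x, axis=1)
            # Indices of the k nearest neighbours
            nearest = np.argsort(dists)[: self.k]
            # Majority vote among their labels
            votes = Counter(self.y_train[nearest])
            preds.append(votes.most_common(1)[0][0])
        return np.array(preds)

# Compare the from-scratch version with scikit-learn on the Iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scratch = KNNClassifier(k=5).fit(X_train, y_train)
print("From-scratch accuracy:", (scratch.predict(X_test) == y_test).mean())

sk_model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("scikit-learn accuracy:", sk_model.score(X_test, y_test))
```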