Cross Sectional Tabular Data Pre-Processing Pipeline

Exploratory Data Analysis (EDA) => Feature Engineering

Sarvesh Khetan
20 min read · Sep 29, 2024

Prerequisites : Refer to the following GitHub links for an introduction to pandas / matplotlib.

What kind of dataset does a data scientist crave? One with fewer columns (to reduce time complexity) and more rows (to reduce overfitting).

There are many pre-processing steps that tabular cross-sectional data needs to go through. As a general rule, first we identify an issue in the dataset (Exploratory Data Analysis (EDA) / Data Analysis) and then we remove that issue from the dataset (Data Cleaning / Data Preprocessing / Feature Engineering). Following are some common issues that you should look for and handle.

  1. Duplicate Rows
  2. Missing Values
  3. Outliers / Anomalies
  4. Similar Independent Features (X vs X)
  5. Categorical Independent Features
  6. Feature Scaling / Feature Transformation / Normalization
  7. Large #Features
  8. Imbalanced Dataset

Following are some steps which always need to be done; these are not issues but simply preprocessing steps:

  1. Feature Subset Selection (X vs Y)
  2. Train — Validation — Test Split

Duplicate Rows

Step 1 : Identify Duplicate Rows

You can create a hash value using a hash function for each row and then compare the hash values to check whether there are duplicate rows in the dataset.
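
A minimal sketch of this idea with pandas, assuming df is your DataFrame (pd.util.hash_pandas_object hashes each row; df.duplicated() is the built-in shortcut that does the same comparison for you):

import pandas as pd

# hash every row (ignoring the index) and count repeated hash values
row_hashes = pd.util.hash_pandas_object(df, index=False)
print("duplicate rows (via hashing):", row_hashes.duplicated().sum())

# pandas also provides a built-in shortcut
print("duplicate rows (built-in):", df.duplicated().sum())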

Step 2 : Handling Duplicate Rows

You should strictly delete all duplicate or near-duplicate rows present in your dataset, because duplicate rows mislead the learning algorithm in the machine learning pipeline. A major cause of duplication is merging data from different sources.
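
A one-line sketch for the handling step, again assuming df is your DataFrame:

# keep the first occurrence of each row and drop the rest
df = df.drop_duplicates(keep='first').reset_index(drop=True)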

Missing Values

Step 1 : Identify Missing Values

Missing data, or Missing values, occur when no data / no value is stored for a certain observation within a feature.

# you can determine the total number of missing values using the isnull 
# method plus the sum method on the dataframe
df.isnull().sum()

# alternatively, you can call the mean method after isnull
# to visualise the percentage of the dataset that
# contains missing values for each variable
df.isnull().mean()

Missing Data can be categorized into 3 categories, namely :

  • Data Missing Completely at Random (MCAR) : Missing data is randomly distributed; there is no pattern to what data is missing. In each column, every observation has an equal chance of being missing. E.g.: a researcher loses some survey responses due to a technical glitch in the data collection software, with no pattern in which responses are lost.
  • Data Missing at Random (MAR) : the missing data is related to other observed data but not to the missing values themselves; knowing other data helps explain the gaps. E.g.: in a medical study, older patients are more likely to skip questions about their mental health status, but their actual mental health condition does not affect whether they skip the question. Older age increases the likelihood of missing Mental Health Status data, but Mental Health Status itself does not cause the missingness. The missing data relates to observed data (Age) but not to the missing data itself (Mental Health Status).
  • Data Missing not at Random (MNAR) : the missing data is related to the missing values themselves; the reason the data is missing depends on the values that are missing. E.g. 1: students who receive a low CGPA on an exam choose not to report their scores, leading to missing data. Hence the CGPA column has missing data because of the low CGPA of some students, not because of some other column's effect. E.g. 2: in a financial survey, people with very high incomes are less likely to report their income because they prefer to keep it confidential. Hence the missing data is directly related to the value of the missing data itself. MNAR data goes missing due to some systematic bias; following are some such biases :
  1. Missing rows in series : Data may be missing in consecutive rows or specific periods (e.g., weekends or certain months).
  2. Missing based on value : Higher / Lower values might lead to non-disclosure hence they go missing (e.g., cgpa, income). You handle such missing values by doing sensitivity analysis (explore how your results might change based on different ways you handle this missing data) or pattern mixture models (create separate models for groups with and
    without missing data to understand the impact of missingness)
  3. Boundary Conditions : Data may be missing when it hits certain thresholds (e.g., temperature data missing below a
    minimum or above a maximum). To handle missing values due to such bias you can either drop all the missing values because they are outside the boundary and you only need to perform analysis of things within the boundary or you can extrapolate the same distribution to outside boundary conditions.

Step 2 : Handling Missing Values

There are many ways to handle missing values in a dataset; below we will discuss a few. But before discussing those, let's think about the question: 'what if data points are not missing at random? What if they are meant to be missing, and by handling the missing values we are losing important information?' Hence, to capture this importance of missing data, we can introduce a new feature for each feature in the dataset indicating whether the value was missing for that observation (1) or not missing (0).

The issue with the above approach of adding a missing flag for each feature is scale: for small datasets it is fine, but if your original dataset has 1M features then after adding these indicator variables it becomes a 2M-feature dataset, which will be very difficult to process. Hence you will have to decide, based on the business requirement, whether or not to do this. A minimal sketch of the indicator idea is shown below.
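
A minimal sketch of the missing-indicator approach, assuming df is your DataFrame:

# add a 0/1 flag column for every column that actually contains missing values
for col in df.columns[df.isnull().any()]:
    df[col + '_missing'] = df[col].isnull().astype(int)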

Methods to Handle Missing Values in both Numerical / Categorical Variables
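
One method that applies to both numerical and categorical variables is complete-case analysis, i.e. simply dropping rows with missing values; a minimal sketch, assuming df is your DataFrame:

# complete-case analysis: drop any row with at least one missing value
# (only advisable when the fraction of affected rows is small)
df_complete = df.dropna(axis=0, how='any')
print("rows dropped:", len(df) - len(df_complete))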

Methods to Handle Missing Values in Numerical Variables
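
A minimal sketch of common numerical imputation strategies (mean / median imputation); 'age' is an illustrative column name:

# median imputation for a numerical column (the median is more robust to outliers than the mean)
df['age'] = df['age'].fillna(df['age'].median())

# the same thing with scikit-learn, which is easier to reuse inside a pipeline
from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy='median')
df[['age']] = num_imputer.fit_transform(df[['age']])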

Methods to Handle Missing Values in Categorical Variables
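
A minimal sketch of common categorical imputation strategies (mode imputation, or an explicit 'Missing' label); 'city' is an illustrative column name:

# option 1: replace missing values with the most frequent category (the mode)
df['city'] = df['city'].fillna(df['city'].mode()[0])

# option 2: treat missingness as its own category
df['city'] = df['city'].fillna('Missing')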

Outliers

Outliers are data points that deviate significantly from the majority of the dataset, often defined as being several standard deviations away from the mean. Examples: unusually high or low values that may be indicative of errors, anomalies, or rare events.

Identifying and Handling Outliers in Categorical Variables

Some categories appear a lot in the dataset, whereas other categories appear in only a few observations; the latter are called rare labels. There is no rule of thumb to determine how small is too small, but typically any label covering less than 5% of observations is considered rare. For instance, in a dataset with the variable "city where the applicant lives", we can imagine that cities like 'New York' appear a lot in the dataset just because New York has a huge population. On the other hand, smaller towns like 'Leavenworth' (population < 2000 people) will appear only on a few occasions, if at all, just because the population there is very small.

Are Rare Labels in a categorical variable a problem? Rare values can add a lot of information at the time of making an assessment or none at all. For example, consider a stockholder meeting where each person can vote in proportion to their number of shares. One of the shareholders owns 50% of the stock, and the other 999 shareholders own the remaining 50%. The outcome of the vote is largely influenced by the shareholder who holds the majority of the stock. The remaining shareholders may have an impact collectively, but they have virtually no impact individually.

The same occurs in real life datasets. The label that is over-represented in the dataset tends to dominate the outcome, and those that are under-represented may have no impact individually, but could have an impact if considered collectively. More specifically, rare values in categorical variables tend to cause over-fitting, particularly in tree based methods.

In situations where rare labels are present in variables with only a few categories, the rare label may be adding some information. On the other hand, in variables with a high number of categories, likely there will be many labels with a low frequency, which will quite likely add noise instead of information.

To identify whether there is a rare-label issue, you can use a count plot; to handle it, you perform re-categorisation, i.e. you re-categorise the observations that show rare labels for a certain variable. These observations can be re-categorised by (see the sketch after this list):

  • Removing the rare label from the dataset
  • Replacing the rare label with the most frequent label
  • [most commonly used] Grouping the observations that show rare labels into a unique category (with a new label like 'Rare' or 'Other')
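
A minimal sketch of both the identification (via normalized value counts, the numeric version of a count plot) and the grouping-into-'Other' fix; 'city' is an illustrative column in df:

# identify rare labels: categories that cover less than 5% of the rows
freq = df['city'].value_counts(normalize=True)
rare_labels = freq[freq < 0.05].index

# re-categorise: collapse all rare labels into a single 'Other' category
df['city'] = df['city'].where(~df['city'].isin(rare_labels), 'Other')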

Identifying and Handling Outliers in Numerical Variables

Univariate Outlier Handling Techniques

Here we remove outliers from the dataset considering 1 feature at a time. You will understand more about it as you go through it.

First, to identify whether outliers are present in a feature, we can simply plot a box plot for each feature and inspect it visually.

Once we know whether a feature has outliers, there are several ways to handle them:

1. Extreme Value Analysis — Using Z Score Value / Inter-Quartile Range (IQR)

Here you first standardize each feature of the dataset using the formula (X - Mean) / Standard_Deviation; the result is called the z-score. The z-score tells you how far a given datapoint is from the mean of the feature, in units of standard deviations. If it is too far, i.e. the z-score is highly positive or negative, we can consider it an outlier and remove it from the dataset. A commonly used threshold is a z-score greater than 2 or less than -2, meaning the datapoint is more than 2 standard deviations away from the mean.
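
A minimal sketch of z-score filtering for one illustrative numerical column 'income' in df, using the threshold of 2 mentioned above:

# z-score for a single column, then keep only rows within +/- 2 standard deviations
z = (df['income'] - df['income'].mean()) / df['income'].std()
df_no_outliers = df[z.abs() <= 2]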

Above Z-score based technique works well on variables which follow normal distribution. But what if the variable does not follow normal distribution?

  • You can either convert it to a Gaussian distribution and then apply the z-score, or
  • Use the IQR (= 75th percentile - 25th percentile) method instead. An outlier will sit outside the upper [75th percentile + (IQR * 1.5)] and lower [25th percentile - (IQR * 1.5)] boundaries; a sketch follows.
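
A minimal sketch of the IQR rule for the same illustrative 'income' column:

# IQR rule: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df['income'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_no_outliers = df[df['income'].between(lower, upper)]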

2. Smoothing by Discretization

Discretisation is the transformation of continuous variables into discrete variables. If a continuous variable is converted into a discrete variable with just 0 and 1, it is called binarization.

There are many methods to do discretisation (this process is also called binning). For example, for the variable age, we could group the observations (people) into bins / intervals / buckets / groups like 0–20, 21–40, 41–60, > 61. This grouping of values into ranges is called discretisation. As you can see, any outlier (extremely high) value of age would be included in the > 61 group, thereby minimising its impact.
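
A minimal sketch of binning the illustrative 'age' column into the ranges above using pd.cut:

import numpy as np
import pandas as pd

# bin age into fixed ranges; any extreme value simply falls into the last bucket
bins = [0, 20, 40, 60, np.inf]
labels = ['0-20', '21-40', '41-60', '>61']
df['age_binned'] = pd.cut(df['age'], bins=bins, labels=labels)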

Multivariate Outlier Handling Techniques

Here we remove outliers from the dataset considering all features at a time. There are multiple methods to do this :

Method 1 : Using a Clustering Algorithm

  • Cluster the data into groups of different density using any clustering algorithm
  • Choose points in small cluster as candidate outliers
  • Compute the distance between candidate points and non-candidate clusters.
  • If candidate points are far from all other non-candidate points, they are outliers
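
A simplified sketch of the clustering idea using DBSCAN from scikit-learn, where points that end up in no cluster (label -1) are treated as candidate outliers; this is a shortcut rather than the full distance-to-other-clusters procedure described above, and eps / min_samples are illustrative values that need tuning:

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# scale first, since DBSCAN is distance based
X_scaled = StandardScaler().fit_transform(df.select_dtypes('number'))

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_scaled)
outlier_mask = labels == -1
print("candidate outliers:", outlier_mask.sum())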

Method 2 : Nearest Neighbour based Algorithm

Compute the distance between every pair of data points. Now once we have this, there are various ways to define outliers:

  1. Data points for which there are fewer than p neighboring points within a distance D can be termed as outliers
  2. The top n data points whose distance to the kth nearest neighbor is greatest can be termed as outliers
  3. The top n data points whose average distance to the k nearest neighbors is greatest can be termed as outliers
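
A minimal sketch of the second variant (distance to the k-th nearest neighbour as an outlier score) using scikit-learn's NearestNeighbors; k and n are illustrative:

import numpy as np
from sklearn.neighbors import NearestNeighbors

k, n = 5, 10  # k-th neighbour to look at, number of outliers to report
X_num = df.select_dtypes('number').values

# the first neighbour of each point is the point itself (distance 0), so ask for k + 1
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_num)
distances, _ = nn.kneighbors(X_num)
kth_distance = distances[:, -1]

# indices of the n points whose k-th neighbour is farthest away
outlier_idx = np.argsort(kth_distance)[-n:]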

Method 3 : Isolation Forest (Ensemble Learning Method)
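
A minimal sketch using scikit-learn's IsolationForest; contamination (the expected fraction of outliers) is an assumption you have to set:

from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.01, random_state=42)
# fit_predict returns -1 for outliers and 1 for inliers
preds = iso.fit_predict(df.select_dtypes('number'))
df_clean = df[preds == 1]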

Method 4 : All Classification Algorithms : KNN / SVM / …

Similar (Redundant) Independent Features (X vs X)

Step 1 : Identify Similar Independent Features

Now there are several ways to find if two features are similar or not.

Similarity Between 2 Categorical Variables

Historically, the most popular method to measure the association between two categorical variables is the chi-squared test. Below I have explained the math behind how the chi-squared test helps calculate a similarity measure and decide whether to drop a variable or not.

Hence we will apply the chi-squared test to all pairs of categorical variables present in the dataset!
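
A minimal sketch for one illustrative pair of categorical columns, using a contingency table built with pd.crosstab:

import pandas as pd
from scipy.stats import chi2_contingency

# contingency table of the two categorical variables (column names are illustrative)
table = pd.crosstab(df['city'], df['segment'])
chi2, p_value, dof, expected = chi2_contingency(table)

# a small p-value => the variables are associated (one of them may be redundant)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")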

Note : the chi-squared test of independence is itself a non-parametric test, so it does not require the data to be normally distributed. The parametric vs non-parametric distinction matters more for the tests on numerical variables below: parametric tests (like the t-test) assume approximately normal data, and when that assumption does not hold you should use non-parametric (NP) alternatives instead.

There are other methods too (other than statistical tests) for calculating the similarity between categorical variables, like Information Gain / Gini / ….

Similarity Between 2 Numerical Variables

Here we have two famous methods, as discussed below :

  • Correlation / Covariance between two variables : If two variables have a correlation above 0.9 (90%), we can remove either of the two because they carry almost the same information. You can use a Pearson (continuous numerical variables) or polychoric (ordinal / discrete numerical variables) correlation matrix to identify the variables with high correlation and decide which one to drop using the VIF (Variance Inflation Factor); variables with a high value (VIF > 5) can be dropped. The only difference between correlation and covariance is that covariance is unbounded (its scale depends on the variables' units) while correlation always lies between -1 and 1. A correlation / VIF sketch follows the t-test code below.
  • 2-Sample T Test (also called the Independent Samples T Test) : Just like we had the chi-squared statistical test for comparing categorical variables, here we have the t-test. The null hypothesis in this case is that there is no difference between the means of the two variables; the alternate hypothesis is that there is a difference between the means of the two variables.
import scipy.stats as stats

alpha = 0.05
# Welch's t-test on the two columns (equal_var=False does not assume equal variances)
_, p_value = stats.ttest_ind(a=col_1, b=col_2, equal_var=False)

if p_value < alpha:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

Similarity Between 1 Categorical Variable and 1 Numerical Variable

Here we will make use of the ANOVA test (also called the F-test); a minimal sketch follows.
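
A sketch of a one-way ANOVA between an illustrative categorical column 'city' and a numerical column 'income', using scipy's f_oneway:

from scipy.stats import f_oneway

# one group of income values per category of 'city'
groups = [g['income'].dropna() for _, g in df.groupby('city')]
f_stat, p_value = f_oneway(*groups)

# a small p-value => the numerical variable differs across the categories,
# i.e. the two variables are related
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")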

Step 2: Handling Similar Independent Features

Simply drop one of the similar features.

Categorical Independent Features

Step 1 : Identify Categorical Features

You can use a simple rule of thumb: if the number of unique values in a feature is < 25, treat it as a categorical variable; otherwise treat it as a numerical variable. A sketch follows.
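
A minimal sketch of this rule of thumb with pandas, assuming df is your DataFrame:

# treat columns with fewer than 25 unique values as categorical (rule of thumb)
n_unique = df.nunique()
categorical_cols = n_unique[n_unique < 25].index.tolist()
numerical_cols = [c for c in df.columns if c not in categorical_cols]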

Feature Scaling / Feature Transformation / Normalization

Step 1 : Identify Features to Scale / Normalize

What is Feature Scaling? Feature scaling is just a fancy term for bringing the values of each column onto a scale that is uniform across all the columns.

When to do Feature Scaling?
1. This is not a compulsory step, but if your ML algorithm involves gradient descent or Manhattan / Euclidean distances, then this step becomes necessary. WHY? Because it helps the algorithm train faster and work more accurately (you will understand this later in ML).
2. Hence this step is necessary for algorithms like Linear Regression / ANN / CNN / KNN / PCA / … but not necessary for tree-based algorithms like Decision Tree / AdaBoost / Random Forest / XGBoost ….
3. If you do this step for algorithms that don't require it, it makes no difference; but if you skip it for algorithms that do require it, it makes a huge difference, because that ML algorithm will take a huge amount of time to train. Hence, do feature scaling ALWAYS.

How to identify features which need scaling?
Plot the data distribution of each column using a box plot / histogram / distplot from seaborn, or use the df.describe() function from pandas, which gives you the count, mean, standard deviation, min / max, and quartiles (the 50% row is the median) of each numerical column in the dataframe.

Step 2: Normalize the Features

  • Feature scaling for a column can be done simply by dividing all the values by the highest number in that column (or by some fixed constant like 1000).
  • Or we can use functions like 'Min-Max Scaler', which scales the values down to between 0 and 1.
  • Or we can use 'Mean Normalization', which scales the values down to between -1 and 1.
  • Or we can do it more effectively by converting the columns from an arbitrary distribution into a normal distribution, since a normal distribution concentrates most of its values within a small range around its mean.
    a1 ) Practically any continuous distribution can be pushed towards a normal distribution with a suitable transformation (e.g., rank-based / quantile transformations).
    a2 ) If the distribution is a lognormal distribution (skewed) (plot the data and see what distribution it follows), then you can apply the ln(x) function to convert it into a normal distribution.
    a3 ) If the distribution is exponential (right-skewed) (plot the data and see what distribution it follows), then a power transformation such as Box-Cox can bring it much closer to a normal distribution.
    a4 ) Similarly, you can use any other suitable transformation to convert a specific distribution into a normal distribution. Sometimes the distribution is not known; in such cases you can apply a simple scaling function, e.g., to bring a large range (say 300) down to a smaller range (say 10), dividing by 30 (x/30) will do the job.
  • Or by converting all the columns from an arbitrary distribution to a Standard Normal Distribution (preferred, because of its added advantages) (Standardization / Z-Score Normalization / Zero-Mean Normalization). A scaling sketch follows this list.
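
A minimal sketch of three of the options above using scikit-learn (min-max scaling, standardization, and a power transform towards a normal shape); the column names are illustrative and you would normally pick just one of these:

from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer

num_cols = ['age', 'income']  # illustrative numerical columns

X_minmax = MinMaxScaler().fit_transform(df[num_cols])        # values squashed into [0, 1]
X_standard = StandardScaler().fit_transform(df[num_cols])    # zero mean, unit std (z-score)
X_gaussian = PowerTransformer(method='yeo-johnson').fit_transform(df[num_cols])  # pushed towards a normal shape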

Large #Features

Step 1 : Identify Large #Features

Having a large number of dimensions is a big issue; this is called the 'curse of dimensionality'. There is no general way to identify it, but typically if your features number in the 1000s the feature count is considered large.

Irrespective of that, even if the number of features is small, you can apply the methods below because they do many other things besides just reducing the number of features:

  1. Too many independent features is undesirable because the training time of supervised learning increases and sometimes more features give less accuracy, while too few independent features is also undesirable because the model will not take all the relevant factors into consideration and will again have low accuracy; this trade-off is the curse of dimensionality. Hence we need to figure out the right number of features, and one stepping stone towards that is removing redundant (near-identical) features.
  2. Most supervised learning models assume that all the independent variables x1, x2, x3, … are independent of each other, i.e. the correlation between any Xi and Xj is 0 (hence the name independent variables). Some dimensionality reduction techniques like PCA help us achieve this, because as given, x1, x2, x3, … may or may not have 0 correlation.
  3. Visualisation: since we cannot plot more than 3D plots, we can use dimensionality reduction techniques to reduce the columns to 1 or 2 columns for making 2D plots for regression and classification problems respectively.

Step 2: Handling Large #Features

Since the number of features is huge, we handle this by reducing the number of features. This process is called Dimensionality Reduction.

What is dimensionality reduction? Say we have 1000 features and we need to reduce the number of features to 40. Remember, we are not saying that we simply pick the best 40 features and thereby reduce the dimensionality; that is feature subset selection. Here we say that, without eliminating the other 960 features, we need 40 features, and this can be done by constructing 40 components that capture the essence of all 1000 features. In generalised form, we need N such components that contain most of the essence of the 1000 features.

Techniques used for dimensionality reduction? Refer to the unsupervised learning section in machine learning; as one example, a PCA sketch is shown below.
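
A minimal PCA sketch with scikit-learn that reduces the numerical feature matrix to N components (N = 40 here is purely illustrative):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

N = 40  # illustrative target number of components
X_scaled = StandardScaler().fit_transform(df.select_dtypes('number'))

pca = PCA(n_components=N)
X_reduced = pca.fit_transform(X_scaled)
print("variance explained:", pca.explained_variance_ratio_.sum())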

Imbalanced Dataset (only for classification tasks)

Step 1: Identify an Imbalanced Dataset

It is fairly easy to identify an imbalanced dataset: if the count of datapoints in class 1 is significantly different from the count of datapoints in class 2, it is an imbalanced dataset.

Step 2: Handle an Imbalanced Dataset

There are two ways to handle imbalanced datasets:

  • Downsampling over-represented records
    Say class 0 has 900 records and class 1 has 100 records; then we randomly select 100 of the 900 class 0 records and keep all 100 class 1 records. Now the dataset is balanced. Use this method only if you have millions of records of the under-represented class, so that the new dataset still contains enough datapoints to build an ML model.
from imblearn.under_sampling import NearMiss
nm = NearMiss()  # recent imblearn versions no longer take random_state here
X_new, Y_new = nm.fit_resample(X, Y)
  • Upsampling under-represented records (Preferred)
    Generate new datapoints of the under-represented class using (conditional) data augmentation techniques discussed in unsupervised learning; a SMOTE sketch is shown below.
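
One common concrete instantiation is SMOTE from imblearn, which synthesizes new minority-class points by interpolating between existing ones; a minimal sketch, assuming X and y are the feature matrix and target:

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_new, y_new = smote.fit_resample(X, y)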

Feature Subset Selection (X vs Y)

Earlier, in Dropping Similar Independent Features, we removed independent features based on how similar they were to other independent features. Now we will remove independent features based on how impactful they are on the dependent feature.

Method 1 : Brute Force Approach

  • All subsets : say you want to choose between 5 features X1, X2, X3, X4, X5; then we pass every possible combination of these 5 features to our ML algorithm (2⁵ possibilities) and see which combination gives the best accuracy.
  • Forward chaining : say we have 5 features; at first we use only one feature in our ML algorithm, then keep increasing the number of features, and at the end select the feature set which gives the best results (a forward-selection sketch follows this list).
  • Backward elimination : say we have 5 features; at first we use all 5 features in our ML algorithm, then keep decreasing the number of features, and at the end select the feature set which gives the best results. This is essentially the opposite of forward chaining.
  • Wrapper Heuristic Approach : here we use feature selection algorithms like Genetic Algorithms (GA) and Particle Swarm Optimization (PSO).
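
A minimal sketch of forward chaining using scikit-learn's SequentialFeatureSelector wrapped around an illustrative estimator; X and y are assumed to be the feature matrix and target:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# greedily add features one at a time, keeping the subset that scores best in cross-validation
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,   # illustrative target size
    direction='forward',
    cv=5,
)
sfs.fit(X, y)
print("selected features:", sfs.get_support())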

Method 2 : Embedded Approach

Embedded approaches are ones in which feature selection occurs naturally as part of the machine learning algorithm, such as regression and decision trees. Now suppose after fitting your multiple linear regression you get the decision equation Y = -5X1 + 0.1X2 + 10X3 + 9. This equation suggests that X2 is not an important feature (assuming the features are on comparable scales) and can be dropped. Hence you create a new model again without the X2 feature.

But if you use Lasso Regression then you don't need to create the model again (go research why; hint: L1 regularisation shrinks the coefficients of unhelpful features exactly to zero). A Lasso sketch is shown below.
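
A minimal sketch of the Lasso idea (alpha is illustrative and should be tuned; X is assumed to be a DataFrame of features and y the target):

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# features whose coefficient was shrunk to exactly zero can be dropped
dropped = [col for col, coef in zip(X.columns, lasso.coef_) if coef == 0]
print("features to drop:", dropped)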

Method 3 : Feature Ranking Approach

Exactly the same as what we saw in Dropping Similar Independent Features, except that there we applied the methods to Xi vs Xj, while here we apply the same methods to Xi vs Y so that we can understand the impact of each X on Y; if a feature is not impactful, just remove it from the dataset.

For instance: if a numerical independent variable X has an absolute correlation of <= 0.1 with the dependent variable Y, we can drop that independent variable because it barely affects the dependent variable. Example: student id (independent variable) is irrelevant for predicting student CGPA (dependent variable); in such a case the correlation between the two will come out to be <= 0.1 in absolute value.

Train — Validation (/Development)— Test Split

In the early days, people only did a train-test split, as shown below. The most common ratio used for the train-test split is 80:20.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

But now people use a train-validate-test split, because the validation set is used for hyperparameter tuning; it also helps the model learn more generalized patterns by preventing the overfitting that would happen if you used the entire training data both to fit and to tune the model. A common ratio for the train-validate-test split is 70:10:20; a sketch follows.
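
A minimal sketch of a 70:10:20 split done with two successive train_test_split calls:

from sklearn.model_selection import train_test_split

# first carve out the 20% test set, then split the remaining 80% into 70/10
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.125, random_state=42  # 0.125 * 0.80 = 0.10 of the full data
)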
