Computer Vision — Object Localization Task

Non Generative CV Task

Sarvesh Khetan
6 min read · Oct 13, 2024

This task is similar to the image classification task that we saw earlier, except that here we also need to predict a bounding box around the classified object (see the image below).

Object Localization Task

There are many proposed models to solve this task. These can be broadly classified into two types :

  1. Single-Stage Models
    a. CNN Based Architecture
    b. Transformer Based Architecture
  2. Two-Stage Models
    a. Sliding Window Based CNN
    b. Region Based CNN (RCNN)
    c. Region Proposal Network (RPN) Based CNN

Note: there will be no full code implementation in this article, since all these methods are outdated. Later we will discuss a superset task called Object Detection, where we will study and implement SOTA models. These SOTA object detection models are also the best-performing models on the object localization task.

Single-Stage Models

1. CNN Based Architecture

In this approach we treat the task as a combination of a regression task (predicting the bounding-box coordinates) and a classification task (predicting the image class). Hence this is a multi-task model.

There are several ways to predict the bounding-box coordinates; the three most widely used are shown below.

Hence the final architecture for this task would look something like this
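
The multi-task objective described above can be sketched numerically: a cross-entropy term for the class prediction plus a regression term (here smooth L1) for the four box coordinates. This is a minimal NumPy sketch, not the exact loss from any specific paper; the weighting factor `lambda_box` is an assumed hyperparameter.

```python
import numpy as np

def cross_entropy(logits, label):
    """Classification loss: negative log softmax probability of the true class."""
    z = logits - logits.max()                     # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def smooth_l1(pred_box, true_box):
    """Regression loss over the 4 box coordinates (e.g. x, y, w, h)."""
    diff = np.abs(pred_box - true_box)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum()

def localization_loss(logits, label, pred_box, true_box, lambda_box=1.0):
    # Multi-task objective: classification term + weighted box-regression term
    return cross_entropy(logits, label) + lambda_box * smooth_l1(pred_box, true_box)
```

During training, both heads share the CNN backbone and the two losses are simply summed, so a single backward pass updates the shared features for both tasks.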

2. Transformer Based Architecture

Above we saw an architecture that used a CNN; we can similarly use a transformer-based architecture for the same use case. A transformer-based architecture is slower than a CNN-based one but gives better accuracy. (The loss function used to train this transformer-based model remains the same.)

Instead of using a vanilla transformer as shown in the diagram above, you can use any more efficient transformer variant, e.g. a window-attention transformer / ……

Two-Stage Models

Sliding Window Based CNN

### Step 1 : Predict Bounding Boxes

Select all possible bounding boxes in the image by varying position, size, and aspect ratio; these are the bounding-box predictions.
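
A naive enumeration of such windows might look like the sketch below (a toy version; real implementations sweep far denser grids of positions and scales, which is exactly why this approach is so expensive):

```python
def sliding_windows(img_w, img_h, scales, aspect_ratios, stride):
    """Enumerate candidate boxes (x1, y1, x2, y2) across the image."""
    boxes = []
    for scale in scales:                  # roughly the square root of window area
        for ar in aspect_ratios:          # ar = width / height
            w = int(scale * ar ** 0.5)
            h = int(scale / ar ** 0.5)
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    boxes.append((x, y, x + w, y + h))
    return boxes

# Even this tiny 64x64 example with 2 scales and 3 aspect ratios
# produces hundreds of candidate windows.
windows = sliding_windows(img_w=64, img_h=64, scales=[16, 32],
                          aspect_ratios=[0.5, 1.0, 2.0], stride=8)
```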

### Step 2 : Resizing

Now that we have the candidate bounding boxes, we need to perform classification on each of them. But these bounding boxes have different sizes, while the classification model can only take a fixed-size input. Hence we must convert all the bounding boxes to a single predefined size before feeding them to the classification model. There are several ways to do this:

  • Resizing : let’s say our predefined size is 4×4 while one of our bounding boxes is 16×16; we simply resize (downsample) the 16×16 crop to 4×4, i.e. make the image pixels smaller.
  • Region of Interest (ROI) Pooling Layer : divide the region into a fixed grid of cells and pool (e.g. max-pool) within each cell, so that any input size maps to the same fixed output size.
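
ROI max-pooling can be sketched in a few lines of NumPy: split the region's rows and columns into a fixed number of chunks and take the maximum inside each chunk. This is a simplified single-channel sketch, not a production implementation.

```python
import numpy as np

def roi_max_pool(region, out_h, out_w):
    """Max-pool an arbitrary (H, W) region down to a fixed (out_h, out_w) grid."""
    H, W = region.shape
    # Cell boundaries: split H rows into out_h chunks and W cols into out_w chunks
    ys = np.linspace(0, H, out_h + 1).astype(int)
    xs = np.linspace(0, W, out_w + 1).astype(int)
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

Because the output grid is fixed, any region (16×16, 7×23, …) ends up the same size, which is exactly what the downstream classifier needs.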

### Step 3 : Predict Class — Classification Model

Now perform classification on each resized bounding box using a neural network. Along with classifying the box, we also apply a correction to the bounding box found in step 1. You can skip the correction step, but researchers have found that this small extra computation drastically improves the overall model's accuracy!!
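
The "correction" is typically a small set of predicted offsets (deltas) applied to the candidate box. A common parameterization (used in the R-CNN family) shifts the box center by a fraction of the box size and rescales the width/height exponentially; a sketch:

```python
import math

def apply_box_deltas(box, deltas):
    """Refine an (x1, y1, x2, y2) box with predicted (dx, dy, dw, dh) offsets."""
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the center proportionally to the box size, rescale the dimensions
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * math.exp(dw), h * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```

Predicting deltas relative to the box (rather than absolute pixel coordinates) keeps the regression targets in a small, scale-invariant range, which is easier for the network to learn.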

### Step 4 : Post Processing : Non Max Suppression (NMS)
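
NMS keeps the highest-scoring box and discards any remaining box that overlaps it beyond an IoU (intersection-over-union) threshold, repeating until no boxes remain. A standard greedy sketch:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression; returns indices of the boxes to keep."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                 # highest-scoring remaining box
        keep.append(best)
        # Drop every box that overlaps the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

This step is needed because the previous steps typically produce many overlapping boxes around the same object; NMS collapses them to one detection each.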

Optimized Approach : researchers later observed that looping over every window and then extracting features with a CNN for each window during the classification step recomputes many features. To optimize this, they instead first pass the entire image through the CNN to extract features once, and then loop over this feature map in a sliding-window fashion, followed by resizing and classification. This was called Fast Sliding Window Based CNN!!
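
The compute-once idea can be illustrated with a toy "feature extractor" applied a single time, followed by window slicing on the resulting feature map. This is a NumPy sketch; the 2× average-pooling stands in for a real CNN backbone with stride.

```python
import numpy as np

def toy_features(image):
    """Stand-in for a CNN backbone: 2x2 average pooling (stride 2)."""
    H, W = image.shape
    return image[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

image = np.random.rand(64, 64)
feat = toy_features(image)        # expensive step, computed ONCE for the whole image

# Slide a window over the cheap-to-slice feature map instead of
# re-running the feature extractor once per window
win, stride = 8, 4
windows = [feat[y:y + win, x:x + win]
           for y in range(0, feat.shape[0] - win + 1, stride)
           for x in range(0, feat.shape[1] - win + 1, stride)]
```

The feature extraction cost is now independent of the number of windows, which is where the speedup comes from.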

Region Based CNN (RCNN)

### Step 1 : Predict Bounding Boxes (Anchor Boxes)

The above approach is computationally very inefficient, since in the first step you perform classification on every possible bounding box. Hence we would like to optimize that step. Instead of looping over all possible bounding boxes (which number in the millions / billions), we first find potential bounding boxes using some heuristic and then run the remaining pipeline only on these potential bounding boxes instead of all possible ones, thus saving significant time. These potential bounding boxes were called anchor boxes. One famous algorithm to find these potential bounding boxes is the Selective Search algorithm. It is not used anymore, so I would not recommend going deep into it, but if you still want to, this is a good tutorial.

### Step 2 : Resizing

### Step 3 : Predict Class — Classification Model

### Step 4 : Post Processing : Non Max Suppression (NMS)

Better Approach : in the above approach we first found all the potential bounding boxes using selective search and then ran the CNN model to convert each of those image sections into features for classification, so we were computing the same features multiple times; this leaves room for optimization. Researchers later developed a model called Fast R-CNN (~10x faster), which first runs the CNN model once to convert the entire image into a feature map and then extracts only the part of that feature map corresponding to each potential bounding box given by the selective search method, running classification on that!!
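
The key trick is mapping each proposal from image coordinates to feature-map coordinates by dividing by the backbone's total stride, then cropping the shared feature map. A simplified single-channel sketch; the stride of 16 is an assumption (typical of VGG-style backbones), and the real Fast R-CNN follows the crop with ROI pooling:

```python
import numpy as np

def crop_proposal(feature_map, box, stride=16):
    """Crop the shared feature map for one image-space proposal (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = [int(round(c / stride)) for c in box]
    # Guarantee at least a 1x1 crop even for tiny proposals
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)
    return feature_map[y1:y2, x1:x2]

feat = np.random.rand(14, 14)                      # features for a 224x224 image
crop = crop_proposal(feat, box=(32, 32, 128, 96))  # cheap slice, no CNN re-run
```

Each proposal now costs only an array slice (plus pooling and the classifier head), instead of a full CNN forward pass.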

Region Proposal Network (RPN) based CNN

### Step 1 : Predict Bounding Boxes (Anchor Boxes)

Using the selective search approach we did improve the inference time of the model, but researchers wanted to improve it further, so they came up with a new method, the Region Proposal Network (RPN), to find potential bounding boxes / anchor boxes. It is outdated and complicated, but if you still want to understand it, I would recommend watching this video tutorial.

### Step 2 : Resizing

### Step 3 : Predict Class — Classification Model

### Step 4 : Post Processing : Non Max Suppression (NMS)

Better Approach : in the above approach we first found all the potential bounding boxes using the RPN and then ran the CNN model to convert each of those image sections into features for classification, again computing the same features multiple times. Researchers therefore developed a model called Faster R-CNN (~100x faster), which first runs the CNN model once to convert the entire image into a feature map and then extracts only the part of that feature map corresponding to each potential bounding box given by the RPN method, running classification on that.


Written by Sarvesh Khetan

A deep learning enthusiast and a Masters Student at University of Maryland, College Park.
