Computer Vision — Object Detection Task
Non Generative CV Task
This is an advanced version of the object localization task: in object localization you find just one object and draw a bounding box around it, whereas here you find all the objects present in the image and put a bounding box around each of them!! Some instances are given below:
There are many models proposed to solve the object detection task. These can be broadly classified into two types:
- Two-Stage Models :
All the two-stage models discussed in the object localization task can be used to solve object detection too!! The exact same model developed for the localization task can simply be run for this task and it will work. But these are outdated and hardly anyone uses them anymore!!
- Single-Stage Models :
One-stage models are faster, easier to train, and achieve accuracy comparable to two-stage models. Hence these are the SOTA models.
Now after the model is built, we need to test how good it is, hence we have many model evaluation techniques for object detection models. One of the most used is the mean Average Precision (mAP) technique.
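At the heart of mAP (and of almost every matching step we will see later) sits the Intersection over Union (IoU) measure between two boxes. Below is a minimal sketch of IoU in plain Python; the `(x1, y1, x2, y2)` corner format is an assumption for illustration:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) corner format."""
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero so non-overlapping boxes give zero intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.1428...
```

mAP then fixes an IoU threshold (commonly 0.5), counts a prediction as correct when its IoU with a ground-truth box clears it, computes Average Precision per class, and averages over classes.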
Single Stage Models
To build a single-stage model we can apply the same approach we saw in single-stage models for the object localization task (a recommended read before this one, since we will use things we learnt there), except that here we need to detect multiple objects, whereas there we had only one object per image.
Now this brings a big issue with it: since you don't know beforehand how many objects you need to detect, you don't know how many neurons should be present in your output layer; it would be different for different images at test time!! In localization this was manageable because we knew exactly how many boxes we needed to predict, but in object detection we don't know how many objects we will have to find in a test image.
To address this problem, researchers proposed building a model that predicts a fixed number of bounding boxes, e.g. 100k bounding boxes per image, irrespective of how many objects are actually present. This fixed number is set so high that it is always greater than the actual number of objects in any test image!! Below you can see the CNN- and transformer-based architectures for this type of prediction; it is just an extended version of what we saw in the single-stage model for the object localization task!
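The fixed-size output idea can be sketched as a tensor-shape exercise. The numbers below are assumptions for illustration (here 100 prediction slots, which is what DETR actually uses in practice, and 20 classes plus a DETR-style "no object" class so empty slots have something to predict):

```python
import numpy as np

N_PRED = 100       # fixed number of predicted boxes per image (illustrative)
NUM_CLASSES = 20   # object classes, plus one extra "no object" class below
FEAT_DIM = 512     # stand-in for the backbone feature dimension

# A toy linear head: maps image features to a fixed-size output, i.e.
# for each of the N_PRED slots: 4 box coordinates + (NUM_CLASSES + 1) scores.
rng = np.random.default_rng(0)
features = rng.random(FEAT_DIM)
W = rng.random((N_PRED * (4 + NUM_CLASSES + 1), FEAT_DIM)) * 0.01
out = (W @ features).reshape(N_PRED, 4 + NUM_CLASSES + 1)

print(out.shape)  # (100, 25) — same shape no matter how many objects exist
```

The key point is that the output shape is constant, so a normal feed-forward network can produce it; deciding which slots correspond to real objects is deferred to the matching step discussed next.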
This sounds like a good idea, but it creates an issue while training such a model. Let's say the model is being trained on an image which has only 10 ground-truth bounding boxes, but the model predicts 100k bounding boxes. Now, to compute the loss and train your model, you have to decide which 10 out of these 100k predicted bounding boxes to compare with the 10 ground-truth bounding boxes!! Researchers solved this problem via two methods:
- Hungarian Matching Algorithm : the famous DEtection TRansformer (DETR) by Facebook used this method
- You Only Look Once (YOLO) Type Model Formulation : There are multiple versions of YOLO.
- YOLO — V1
- Single Shot Detector (SSD)
- YOLO — V2
- YOLO — V3
- ….
- YOLO — V9
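Before moving on to YOLO, the Hungarian matching route can be made concrete. A minimal sketch using `scipy.optimize.linear_sum_assignment` is given below; the boxes are made up for illustration, and the cost here is simply negative IoU (DETR's real cost also includes classification and L1 terms):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(pred, gt):
    """Pairwise IoU between predicted and ground-truth (x1, y1, x2, y2) boxes."""
    ious = np.zeros((len(pred), len(gt)))
    for i, p in enumerate(pred):
        for j, g in enumerate(gt):
            ix1, iy1 = max(p[0], g[0]), max(p[1], g[1])
            ix2, iy2 = min(p[2], g[2]), min(p[3], g[3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
            union = ((p[2] - p[0]) * (p[3] - p[1])
                     + (g[2] - g[0]) * (g[3] - g[1]) - inter)
            ious[i, j] = inter / union if union else 0.0
    return ious

pred = np.array([[0, 0, 10, 10], [50, 50, 60, 60], [20, 20, 30, 30]], float)
gt   = np.array([[21, 21, 31, 31], [1, 1, 11, 11]], float)

# Hungarian matching: maximise total IoU == minimise total negative IoU.
cost = -iou_matrix(pred, gt)
pred_idx, gt_idx = linear_sum_assignment(cost)
matches = [(int(i), int(j)) for i, j in zip(pred_idx, gt_idx)]
print(matches)  # [(0, 1), (2, 0)] — preds 0 and 2 matched, pred 1 unmatched
```

Each ground-truth box gets exactly one predicted box; the leftover predictions are trained to output "no object".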
YOLO — V1
Now, as you can see, the model will predict 9 bounding boxes (one for each grid cell) but you have only 2 ground-truth bounding boxes. This is the same training-time issue we discussed above, i.e. how will you choose which 2 out of the 9 predicted bounding boxes should be compared with the 2 ground-truth bounding boxes??? To solve this, researchers devised the following trick
Hence now we know exactly which 2 out of the 9 predicted bounding boxes need to be compared with the 2 ground-truth bounding boxes. But this idea leads us into another trap, as discussed below:
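The matching trick can be sketched in a few lines: each ground-truth box is assigned to the single grid cell that contains its centre, and only that cell's prediction is compared against it. The grid size, image size, and boxes below are assumptions chosen to mirror the 3×3-grid, 2-object example above:

```python
S = 3                      # the image is split into an S x S grid (assumed 3x3)
IMG_W, IMG_H = 300, 300    # illustrative image size in pixels

def responsible_cell(box):
    """box = (x1, y1, x2, y2); return the (row, col) of the grid cell
    that contains the box centre — that cell 'owns' this ground truth."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    col = min(int(cx / (IMG_W / S)), S - 1)   # clamp for centres on the edge
    row = min(int(cy / (IMG_H / S)), S - 1)
    return row, col

gt_boxes = [(10, 10, 90, 90), (150, 200, 250, 290)]   # two ground-truth boxes
print([responsible_cell(b) for b in gt_boxes])  # [(0, 0), (2, 2)]
```

So out of the 9 predicted boxes, only the ones from cells (0, 0) and (2, 2) are matched against the ground truth; the remaining 7 cells are trained to predict "no object".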
Hold on, we are not done yet; researchers pulled one last trick from up their sleeves. Read this carefully, as it is the most important trick they played!!
Now we have covered almost all the major components that went into YOLO — V1; below I will try to implement it from scratch. You can see the architecture diagram below that I will implement (I will divide the image into a 7×7 grid and make 2 bounding-box predictions for each grid cell).
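Before the full implementation, it helps to pin down the output tensor this architecture must produce. With the grid and box counts stated above (7×7 grid, 2 boxes per cell) and the 20 classes used in the original YOLO-V1 paper, each cell emits 2 boxes × 5 values (x, y, w, h, confidence) plus 20 class probabilities:

```python
import numpy as np

# YOLO-V1 output layout: an S x S grid where each cell predicts
# B boxes of (x, y, w, h, confidence) plus C class probabilities.
S, B, C = 7, 2, 20                    # grid size, boxes per cell, classes
out = np.zeros((S, S, B * 5 + C))     # the tensor the network must output

print(out.shape)  # (7, 7, 30)
```

Whatever backbone we choose, its final layers just have to reshape into this 7×7×30 tensor; the loss then reads boxes and class scores out of it per cell.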
Similarly, you can have a transformer-based architecture diagram (to my knowledge no one has tried this yet; you can try it and see whether it works better than the CNN-based architecture or not).