Convolution neural networks (CNN) have been commonly chosen for classification and localization tasks. Although these tasks require data collecting for training, data collecting for classification task is easier and more available than the one for object detection task which requires an intensive task of annotating every object in a given image. For this reason, many works move towards exploiting CNN classification models for object detection task without the guidance of bounding box annotations of objects, which is known as weakly-supervised object localization.
One of the proposed approaches was demonstrated by Researchers from MIT to provide a simple technique for exploiting an existent classification model to perform object localization and also identifying parts of the image which have the largest contribution in discrimination task using a Global Average Pooling (GAP) layer to generate a Class Activation Map (CAM).
In this post, we will explain this method along with Global Average Pooling and Class Activation Map. And then how to apply this technique to make a classification network performs both object classification and object localization in a single forward-pass.
Convolutional neural networks usually contain a successive series of convolution and pooling layers to generate feature maps which must be an efficient representative for the input image. Then these feature maps are flattened and fed to fully connected layers to classify input images. This design was common till the proposed design in Network in Network (NIN) model, which replaces the fully connected layers with Global Average Pooling layer. The avoidance of using fully connected layers is a result of their proneness to overfit as they have the largest number of trainable parameters in the network, resulting in a reduction in the generalization ability of the network. For example, in Resnet18, the last convolution layer produces feature map of size of 512x7x7 (512 feature map each with size 7x7), consider we want to feed them to a fully connected layer to classify the current 3 classes, so the total number of trainable parameters is (7x7x512x3) + 3 = 75267.
Instead of using fully conned layers, networks include a Global Average Pooling layer after the last convolution layer to create by taking the average of each generated feature map, as shown in figure 2. Using the previous example, the last feature maps 512x7x7 passed through global average pooling resulting a vector of 512x1x1, so the total number of trainable parameters is (1x1x512x3) + 3 = 1539.
Class activation map is a weighted features maps generated for each image to indicate the discriminative regions used by the CNN to identify the class this image belongs to. To describe the procedure to generate CAM, let's first for simplicity, suppose we have a network to classify a given image to 3 classes (Car, Bicycle, and Motorbike). It generates feature maps Fk(x,y) where k is the number of kernels after the last convolution. The procedure can be described with the aid of figure 2 as follows:
Till now, we do not have a bounding box for an existent object in a given image from CAM. To get it, we apply simple post-processing steps including, thresholding the CAM image and find contours (boundaries) of including objects as shown in figure 3.
We will use a pre-trained Resnt18 to be trained on a classification dataset which has a 3 classes Car, Bicycle, and Motorbike, from a kaggle competition (here). All you need is to download the dataset from (here) and run the training script train.py via this command.
All weights will be saved every 4 epoch, on a folder, its name follows this pattern (experiment_year-month-day_HH_MM), in experiments folder.
python train.py --dataset_train_path /path/to/train --classes 3
After finishing the training, you can test your classification model as an object detection one, by running demo.py script. All you need is to provide it by, a model weights path, an input image path, and the number of classes of the model such as the following command.
python demo.py --model_path path/to/model/weights --image_path /path/to/image --numclasses 3