Convolutional Neural Network for Object
Recognition and Detection
By : Antonius Robotsoft – www.robotsoft.co.id
Basically, computer vision has 4 main tasks:
1. Object Recognition/Classification
Classify the object in the image.
2. Object Detection
Is there any object that we want to detect in the image? If yes, draw a bounding box around the
object.
3. Object Localization
Is there any object that we want to detect in the image? If yes, draw a bounding box around the
object and report the coordinates of the bounding box:
(x1, y1) is the top-left corner and (x2, y2) the bottom-right (a minimal PIL sketch of drawing such a box appears after this list).
And finally … the latest one:
4. Object Segmentation
By using Mask R-CNN, we can get the exact pixel positions of each object. This kind of
development is very important for robotic vision.
Suppose you have a small robot, and we need to instruct it to pass between this woman's legs. By
using Mask R-CNN, the robot knows the exact position of her legs.
This kind of trick cannot be accomplished by object localization, which only gives a bounding box,
since we need the exact pixel positions of her legs.
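Here is the minimal PIL sketch of the localization output promised above (the file name and box coordinates are made-up placeholders):

#!/usr/bin/env python3
# Hypothetical example: drawing a localization bounding box with PIL.
# "car.png" and the coordinates below are made-up placeholders.
from PIL import Image, ImageDraw

im = Image.open("car.png")
draw = ImageDraw.Draw(im)
x1, y1, x2, y2 = 40, 60, 200, 180   # (x1, y1) top-left, (x2, y2) bottom-right
draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
im.save("car_localized.png")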
Currently, the most suitable type of neural network to perform these 4 tasks is the “convolutional
neural network”.
In a previous post, “Cardboard Box Detection using Retinanet (Keras)”, I wrote about training a
custom Keras RetinaNet model for cardboard localization in images. RetinaNet is a
convolutional neural network architecture.
Convolutional neural networks are commonly used in computer vision for object detection, object
localization, object recognition, analyzing the depth of image regions, etc.
This post will cover convolutional neural networks in general, including some of the math behind
convnets and common convnet architectures, and then continue with the RetinaNet architecture.
Convolutional Neural Network
“A convolutional neural network is a class of deep neural networks, most commonly applied to
analyzing visual imagery. CNN is an improved version of the multilayer perceptron.” It is a class of
deep neural network inspired by the human visual cortex.
Basically, a CNN works by extracting matrices of features and then predicting which class the
image contains based on those features, using softmax probabilities.
Convolutional Neural Network Architecture
Commonly, a convolutional neural network architecture consists of these layers:
1. Convolution Layer
The core idea behind the convolution operation is feature extraction, or we could say filtering.
Later, the network will try to match every possible feature from the input image against the
features learned for each class (a class is the name of an object we want to recognize, e.g. a car).
To get a sense of how a convolution layer operates, have a look at the image above.
Based on the picture, we have a 6×6 px input image and a 3×3 px 2D convolution kernel. The
kernel slides, 1 stride at a time, from the top-left pixel of the image to the bottom. The kernel is a
3×3 matrix of weights (each component of the matrix is a weight). This convolution operation is
used to extract features from the image. The most frequently used kernel type is the 2D
convolution kernel.
Suppose we have a 6×6 pixel image with RGB color channels, and we are going to do a convolution
operation using a 3×3 matrix as the kernel and stride = 1 (the stride defines how many pixels the
kernel steps at a time).
Here are the RGB channels of the 6×6 image, extracted as numpy arrays:
#!/usr/bin/env python3
from PIL import Image
import numpy as np

im = Image.open("6px.png")
imgarr = np.array(im)  # shape: (6, 6, channels)

print("R channel")
print(imgarr[:, :, 0])
print("_" * 30)
print("G channel")
print(imgarr[:, :, 1])
print("_" * 30)
print("B channel")
print(imgarr[:, :, 2])
print("_" * 30)
For this example, we are going to do a stride-1 convolution on the red channel using this 3×3
weight matrix (the horizontal Sobel kernel; its values can be read off the arithmetic below):
-1 0 1
-2 0 2
-1 0 1
As an example of the convolution operation, we are going to use “The Red Channel” matrix.
Here is the mathematical operation for the first kernel position:
#!/usr/bin/env python3
# Dot product of the 3x3 Sobel kernel with the top-left 3x3 patch
# of the red channel
res = ((46 * -1) + (67 * 0) + (161 * 1)
       + (48 * -2) + (41 * 0) + (114 * 2)
       + (101 * -1) + (165 * 0) + (216 * 1))
print(res)  # 362
The next 1-pixel stride shifts the kernel one column to the right, and so on; the stride continues
until the last pixel. The result is called the convolved feature map matrix.
Since a 3×3 kernel shrinks the 6×6 input to a 4×4 feature map (2 pixels are lost per dimension), a
1-pixel zero padding on the top, bottom, left, and right can be used if we want the output to keep
the input's size.
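To see the whole feature map at once, here is a minimal numpy sketch of the stride-1 convolution described above; it reuses "6px.png" from the earlier snippet:

#!/usr/bin/env python3
from PIL import Image
import numpy as np

kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

imgarr = np.array(Image.open("6px.png"))
red = imgarr[:, :, 0].astype(int)      # the 6x6 red channel

out = np.zeros((4, 4), dtype=int)      # 6 - 3 + 1 = 4 per dimension
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(red[i:i + 3, j:j + 3] * kernel)
print(out)  # out[0, 0] is the 362 computed above

# With 1 pixel of zero padding on each side, the output would stay 6x6:
padded = np.pad(red, 1)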
The Linearity
Algebraically, a convolution operation is a linear combination, so we need to introduce
non-linearity with an activation function. Right after the convolution operation, the “ReLU”
activation function is used to introduce non-linearity. If we keep everything linear, there is no
point in deep learning, since a stack of linear functions is just another linear function.
Mathematically, ReLU can be defined as
f(x) = max(0, x)
After ReLU, every negative value in the convolved feature map matrix is replaced by 0.
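A minimal numpy sketch (the values other than the 362 computed above are made up):

import numpy as np

fmap = np.array([[362, -120],
                 [-46,  215]])     # made-up convolved feature map values
print(np.maximum(0, fmap))         # negatives become 0, positives pass through
# [[362   0]
#  [  0 215]]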
Why Non-Linearity is Needed?
In math and statistics, non-linearity is commonly used to solve complex problems, while a linear
equation is simple: given the input of a linear equation, the output can be found by simple algebra.
Before we apply an activation function such as ReLU, the convolution operation is only a linear
function.
Consider this simple linear equation:
Y = a.x
No matter how many layers we stack, a composition of linear functions is still linear
(a2.(a1.x) = (a2.a1).x), so the network will always yield the exact same predictions as a single
linear layer. In this condition we do not need deep learning with many layers; a simple one-layer
neural network is enough.
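A quick numpy check makes this concrete: two stacked linear layers collapse into a single linear layer:

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.random((4, 3)), rng.random((2, 4))  # two "linear layers"
x = rng.random(3)
# Applying them in sequence equals applying their single product matrix
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))  # True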
By using the ReLU activation function right after a convolution operation, we introduce
non-linearity, so the network can learn to solve more complex problems.
Real-life image recognition is a complex problem which cannot be solved by literal pixel matching.
For example, suppose we have trained a single-layer neural network on a dataset of cars and a
dataset of bat logos:
class 1 is a Honda Civic
class 2 is a bat logo
Then if we give it an input image like this (with the same resolution as the dataset), the computer
will be able to make the correct prediction, since it is literally just matching an image with the
same pixel arrangement.
Unfortunately, when we give this input image, the computer will not be able to decide correctly
whether it is a bat logo or a Honda Civic.
To solve this kind of complex problem (since the object in the image might be slightly rotated,
posed differently, or in a somewhat different form), the neural network needs non-linearity.
With a different pose or a slightly different form, the prediction can no longer be solved by simple
linear regression: Y is no longer a.X, hence we need a non-linear function.
In a convolutional neural network, we update the weights and biases of the neurons based on the
error at the output. This process is known as back-propagation. Activation functions introduce
non-linearity to the system, making back-propagation possible, since gradients are supplied along
with the error/loss to update the weights and biases.
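As a toy illustration (this example is not from the original post), here is one such update for a single ReLU neuron in numpy:

#!/usr/bin/env python3
# One back-propagation step for a single ReLU neuron with squared-error loss.
import numpy as np

x, y_true = np.array([1.0, 2.0]), 3.0        # made-up input and target
w, b, lr = np.array([0.5, -0.5]), 0.6, 0.01  # made-up weights, bias, learning rate

z = w @ x + b                  # linear part of the neuron
y = max(0.0, z)                # ReLU activation (the non-linearity)
loss = (y - y_true) ** 2       # error at the output

# Backward pass: chain rule through the loss, the ReLU, and the linear part
dy = 2 * (y - y_true)
dz = dy * (1.0 if z > 0 else 0.0)  # ReLU gradient is 1 for z > 0, else 0
w = w - lr * dz * x            # weight update
b = b - lr * dz                # bias update
print(w, b, loss)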
2. Pooling Layer
Right after the ReLU, the next layer is a pooling layer. The pooling layer is basically used to
reduce the spatial size of the input, which reduces the number of parameters and the computational
complexity.
The most commonly used pooling method is max pooling.
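For example, a 2×2 max pooling with stride 2 keeps only the largest value in each 2×2 block, halving the spatial size; here is a minimal numpy sketch (the feature map values are made up):

import numpy as np

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 6, 8]])
# Group the 4x4 map into 2x2 blocks and take the max of each block
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 4]
#  [7 9]]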
3. Fully Connected Layer
The fully connected layer holds composite and aggregate information from the previous layers.
Before being given as input to the fully connected layers, the multi-dimensional outputs of the
previous layers are flattened into a one-dimensional vector.
And finally, the prediction (voting) is accomplished using an activation function, e.g. softmax.
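As a quick numpy sketch of that final step, softmax turns the last layer's raw scores (logits) into class probabilities that sum to 1:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # made-up scores for 3 classes
print(softmax(logits))               # approx. [0.659 0.242 0.099]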
Some Examples of CNN Architectures
LeNet
The input of LeNet-5 is a 32×32 px image. Here's a summary of the LeNet-5 architecture:
Instead of the tanh activation function, we can also use ReLU.
Here's an example LeNet implementation in Keras:
import keras
from keras import layers

# LeNet-5: two conv + average-pooling stages followed by dense layers.
# LeNet-5 uses 5x5 convolution kernels.
model = keras.Sequential()
model.add(layers.Conv2D(filters=6, kernel_size=(5, 5), activation='tanh',
                        input_shape=(32, 32, 1)))   # 32x32 -> 28x28
model.add(layers.AveragePooling2D())                 # 28x28 -> 14x14
model.add(layers.Conv2D(filters=16, kernel_size=(5, 5), activation='tanh'))  # -> 10x10
model.add(layers.AveragePooling2D())                 # 10x10 -> 5x5
model.add(layers.Flatten())
model.add(layers.Dense(units=120, activation='tanh'))
model.add(layers.Dense(units=84, activation='tanh'))
model.add(layers.Dense(units=10, activation='softmax'))  # 10 output classes
model.summary()
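As a hedged sketch of how this model could then be trained (assuming the `model` defined above), here is one way to fit it on MNIST, padding the 28×28 digits to the 32×32 input LeNet-5 expects:

from keras.datasets import mnist
from keras.utils import to_categorical
import numpy as np

(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Pad 28x28 -> 32x32, add the channel dimension, and scale to [0, 1]
x_train = np.pad(x_train, ((0, 0), (2, 2), (2, 2)))[..., np.newaxis] / 255.0
x_test = np.pad(x_test, ((0, 0), (2, 2), (2, 2)))[..., np.newaxis] / 255.0

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, to_categorical(y_train), epochs=5,
          validation_data=(x_test, to_categorical(y_test)))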
VGG16
The input of VGG16 is a 224×224 px image. Here's a summary of the VGG16 architecture:
Example implementation of VGG16 in Keras:
#!/usr/bin/env python3
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, Flatten
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras import models, layers
model = keras.Sequential()
# Block 1 (all VGG16 convs are 3x3 with 'same' padding)
model.add(layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu',
                        padding='same', input_shape=(224, 224, 3)))
model.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))  # 224 -> 112
# Block 2
model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))  # 112 -> 56
# Block 3
model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))  # 56 -> 28
# Block 4
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))  # 28 -> 14
# Block 5
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))  # 14 -> 7
# Classifier head
model.add(layers.Flatten())  # 7x7x512 = 25088
model.add(layers.Dense(4096, activation='relu'))
model.add(layers.Dense(4096, activation='relu'))
model.add(layers.Dense(1000, activation='softmax'))  # 1000 ImageNet classes
model.summary()
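Note that Keras also ships VGG16 with pretrained ImageNet weights, so in practice the network can be loaded directly instead of built by hand:

from keras.applications import VGG16

pretrained = VGG16(weights='imagenet')  # downloads ImageNet weights on first use
pretrained.summary()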
ResNet
The main purpose of the ResNet architecture is to allow a convolutional neural network with many
layers to train effectively.
The problem with a deep convolutional neural network is that as we increase the network depth, a
vanishing gradient problem appears: as the network goes deeper, its performance saturates or even
starts degrading in accuracy.
ResNet addresses this by splitting the deep network into small chunks of a few layers and adding a
skip connection that passes each chunk's input x straight through to its output. The chunk then only
has to learn the residual F(x), and it outputs F(x) + x, which gives gradients a direct path back
through the network.
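As an illustration, here is a minimal sketch of an identity residual block in Keras (my own sketch, not the code from the repository linked below); it assumes the input tensor already has `filters` channels so the addition shapes match:

from keras import layers

def residual_block(x, filters):
    shortcut = x                       # identity skip connection
    y = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.Add()([shortcut, y])    # output = F(x) + x
    return layers.Activation('relu')(y)

Because the identity path is untouched, gradients can flow straight back through the addition, which is what lets very deep stacks of such blocks train.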
A Keras implementation of ResNet:
https://github.com/raghakot/keras-resnet/blob/master/resnet.py
RetinaNet
The problem with a single-shot detection model such as YOLO is that “there is an extreme
foreground-background class imbalance problem in one-stage detectors.”
RetinaNet introduces the “Focal Loss” to handle this extreme foreground-background class
imbalance.
RetinaNet is a single-shot detection model, just like YOLO. In RetinaNet, a commonly used
backbone is ResNet-50; an FPN (Feature Pyramid Network) is added on top of it for feature
extraction, and the network uses the focal loss to handle the extreme foreground-background class
imbalance problem.
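As a sketch of the idea (a simplified binary version in numpy, not the keras-retinanet implementation), the focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) down-weights easy, well-classified background examples so they do not dominate the loss:

#!/usr/bin/env python3
import numpy as np

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    # y_true: 0/1 labels, y_pred: predicted foreground probabilities
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # p_t is the probability the model assigns to the true class
    p_t = np.where(y_true == 1, y_pred, 1.0 - y_pred)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks the loss of easy, well-classified examples
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))

print(focal_loss(np.array([0]), np.array([0.01])))  # easy negative, tiny loss
print(focal_loss(np.array([0]), np.array([0.9])))   # hard negative, large loss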
An example implementation of RetinaNet in Keras can be cloned from
https://github.com/fizyr/keras-retinanet
Example of custom object detection using RetinaNet:
Reference:
https://arxiv.org/abs/-