Convolutional Neural Networks: An Introduction
Convolutional Neural Networks (CNNs) are used for the majority of applications in computer vision. They are used for image and video classification and regression, object detection, image segmentation, and even playing Atari games.
Understanding the convolution layer is critical to building successful vision models. This walkthrough introduces the idea of convolution and explains the concepts of channels, padding, stride, and receptive field.
We can represent a picture as a matrix or a set of matrices of pixel values. A color (RGB) image converted to a tensor has three channels, corresponding to red, green, and blue, with pixel values between 0 and 255. The size of the tensor is Channels × Height × Width. In the example below, it’s 3 × 128 × 128.
Convolution is the operation of sliding a kernel (a small matrix of weights, e.g., 3×3) over an image grid and, at each position, computing the dot product between the kernel and the patch of the image beneath it.
The animation shows a 3×3 kernel convolved over a 4×4 input, resulting in a 2×2 output.
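As a rough illustration of this layout, the sketch below builds an image tensor from nested Python lists. The helper names make_image and shape are hypothetical, and placing the row index before the column index is an assumption made for the example.

```python
def make_image(channels, height, width, fill=0):
    """Return a channels x height x width tensor (nested lists) filled with `fill`."""
    return [[[fill for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


def shape(tensor):
    """Return (channels, height, width) of a nested-list tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


image = make_image(3, 128, 128)
# Channel 0 is taken to be red here: set the red value of the pixel at row 10, column 20.
image[0][10][20] = 255

print(shape(image))  # (3, 128, 128)
```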
We can generalize this into:
- An input of size W (assuming height and width are the same and equal W)
- Kernel of size K
- Output of size (W – K) + 1.
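The sliding-window operation and the output-size rule above can be sketched in a few lines of plain Python. This is a minimal, unoptimized illustration with a hypothetical helper name (conv2d) and made-up input values. Like most deep learning libraries, it does not flip the kernel before applying it.

```python
def conv2d(image, kernel):
    """Slide a square kernel over a square image with no padding and stride 1."""
    w, k = len(image), len(kernel)
    out = (w - k) + 1  # output size from the rule above
    result = [[0] * out for _ in range(out)]
    for i in range(out):
        for j in range(out):
            # Multiply the kernel with the patch beneath it element-wise, then sum.
            result[i][j] = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
    return result


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

output = conv2d(image, kernel)
print(output)  # [[-6, -6], [-6, -6]]
```

A 4×4 input and a 3×3 kernel give a 2×2 output, matching (4 – 3) + 1 = 2.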
Input, output channels, and kernels
The number of input channels in the first layer must equal the number of channels in the input data. The number of output channels is a hyperparameter chosen by the user. The output channels of one layer become the input channels of the next layer.
We can convert an input, a 3-dimensional tensor of size n_in × input_height × input_width, into an output tensor of size n_out × output_height × output_width by applying a set of n_out 3-dimensional kernels, each of size n_in × kernel_height × kernel_width.
Each kernel application produces an output of size 1 × output_height × output_width. Stacking these n_out tensors gives the final output of size n_out × output_height × output_width.
It’s a lot to process, so re-read the last couple of paragraphs multiple times if needed. Also, feel free to refer to the image below for further clarification.
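The shape bookkeeping described above can also be checked with a small sketch. The function name conv_layer and the all-ones input and kernels are assumptions chosen only to keep the arithmetic easy to verify. The sketch uses no padding and a stride of 1.

```python
def conv_layer(inputs, kernels):
    """Apply n_out kernels (each n_in x k_h x k_w) to an n_in x h x w input.

    Returns a tensor of size n_out x out_h x out_w (no padding, stride 1).
    """
    n_in, h, w = len(inputs), len(inputs[0]), len(inputs[0][0])
    output = []
    for kernel in kernels:  # one 3-dimensional kernel per output channel
        assert len(kernel) == n_in, "kernel depth must match the input channels"
        k_h, k_w = len(kernel[0]), len(kernel[0][0])
        out_h, out_w = h - k_h + 1, w - k_w + 1
        channel = [[0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                # Sum over all input channels and all kernel positions.
                channel[i][j] = sum(
                    inputs[c][i + a][j + b] * kernel[c][a][b]
                    for c in range(n_in)
                    for a in range(k_h)
                    for b in range(k_w)
                )
        output.append(channel)  # stack the 1 x out_h x out_w results
    return output


# Input of size 3 x 5 x 5 and four kernels of size 3 x 3 x 3, all filled with ones.
inputs = [[[1] * 5 for _ in range(5)] for _ in range(3)]
kernels = [[[[1] * 3 for _ in range(3)] for _ in range(3)] for _ in range(4)]

out = conv_layer(inputs, kernels)
print(len(out), len(out[0]), len(out[0][0]))  # 4 3 3
print(out[0][0][0])  # 27
```

Each output value is 27 because every kernel covers 3 × 3 × 3 input values of 1.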
The kernel values are parameters learned by the neural network; kernels are also called filters. We can also represent a convolution as a weight matrix, like that of a fully connected layer, in which the weights come in two special types:
- Zero – fixed at 0 and untrainable
- Tied – sharing the same value, but trainable
If you want a more in-depth overview of the mathematics behind this, refer to this explanation by Matthew Kleinsmith.
Using convolutions instead of fully connected layers has two benefits: because a convolution has far fewer parameters, the network trains faster and is less prone to overfitting.
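To make the zero and tied weights concrete, here is a minimal sketch that unrolls a kernel into the weight matrix of an equivalent fully connected layer. The helper name conv_as_matrix and the kernel values 1 to 9 are illustrative assumptions, not taken from the referenced explanation.

```python
def conv_as_matrix(kernel, w):
    """Build the (out*out) x (w*w) weight matrix equivalent to sliding `kernel`
    over a flattened w x w input with no padding and stride 1."""
    k = len(kernel)
    out = w - k + 1
    matrix = [[0] * (w * w) for _ in range(out * out)]
    for i in range(out):
        for j in range(out):
            row = i * out + j  # index of the output cell
            for a in range(k):
                for b in range(k):
                    col = (i + a) * w + (j + b)  # index of the input pixel
                    # The same kernel value is reused in every row: a tied weight.
                    matrix[row][col] = kernel[a][b]
    return matrix


kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
matrix = conv_as_matrix(kernel, 4)

zeros = sum(v == 0 for row in matrix for v in row)
distinct = len({v for row in matrix for v in row if v != 0})

print(len(matrix), len(matrix[0]))  # 4 16
print(zeros)     # 28
print(distinct)  # 9
```

Of the 64 entries in the matrix, 28 are fixed at zero and the remaining 36 share just 9 trainable values, whereas a fully connected layer of the same shape would have 64 independent weights.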
Convolution shrinks the spatial dimensions, which means every layer of the network has a smaller feature map than the one before it. Pixels at the image corners and edges are also covered by fewer kernel applications than pixels in the center, so the network loses information about them. We don’t want that. The solution is to introduce padding, which adds a frame of pixels around the image (usually 0-valued pixels).
We add padding of size P to each side, which results in an output size of (W – K) + 1 + 2P. It is common to set the padding to (K – 1) / 2 so that the input and output have the same dimensions. This value is a whole number only when K is odd, which is one reason why, in practice, we almost always use an odd-sized kernel.
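A short sketch of zero padding and its effect on the output size follows. The helper names pad and output_size are hypothetical, and the 3×3 example image is made up for illustration.

```python
def pad(image, p, value=0):
    """Add a frame of `p` pixels with the given value around a 2-D image."""
    width = len(image[0])
    blank = [value] * (width + 2 * p)
    padded = [blank[:] for _ in range(p)]            # rows above the image
    for row in image:
        padded.append([value] * p + list(row) + [value] * p)
    padded.extend(blank[:] for _ in range(p))        # rows below the image
    return padded


def output_size(w, k, p=0):
    """Output size of a stride-1 convolution: (W - K) + 1 + 2P."""
    return (w - k) + 1 + 2 * p


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
padded = pad(image, 1)

print(len(padded), len(padded[0]))  # 5 5
print(padded[1])                    # [0, 1, 2, 3, 0]

k = 3
print(output_size(7, k, p=(k - 1) // 2))  # 7
```

With K = 3 and P = (3 – 1) / 2 = 1, a 7×7 input produces a 7×7 output.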
Now let’s see what happens if we stack three convolution layers on top of each other. We apply a 3×3 kernel to an input image of size 7×7. The orange square in the top-left corner of the input matrix is the receptive field of cell (2, 2) in the first layer. The receptive field of a cell is the area of the input image that is involved in calculating that cell’s value.
We start with a receptive field of size 3×3, and with each convolution, the receptive field grows by K – 1 (K = kernel size). So in the final layer, we end up with a receptive field of size 7×7 (going from 3×3 to 5×5 to 7×7). This means that larger and larger areas of the input image are used to calculate the features as we go deeper into the network.
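The growth rule can be written as a small loop. This sketch assumes every layer uses the same kernel size and a stride of 1. The function name receptive_field_sizes is hypothetical.

```python
def receptive_field_sizes(kernel_size, num_layers):
    """Return the receptive field size after each stacked stride-1 convolution."""
    sizes = []
    field = 1  # a single input pixel only "sees" itself
    for _ in range(num_layers):
        field += kernel_size - 1  # each layer adds K - 1
        sizes.append(field)
    return sizes


print(receptive_field_sizes(3, 3))  # [3, 5, 7]
```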
Sliding the kernel one pixel at a time means that we would need many layers to build receptive fields large enough to capture complex features. One way to address this is to introduce a stride. Adding a stride to a layer means skipping pixels when applying the kernel. For example, we could move the kernel two pixels after each application. This is called stride-2 convolution.
After a stride-2 convolution with padding, the output size can be calculated as (W – K + 2P) / S + 1, where S is the stride size and the division is rounded down when it is not exact. In the example above, we start with a 5×5 input and apply a 3×3 kernel with padding 1 and stride 2, so (5 – 3 + 2) / 2 + 1 = 3 and we end up with an output of size 3×3.
As a result, we decreased the size of the activations. We already know that more sophisticated features are calculated deeper in the network, so we don’t want to reduce the number of calculations there. When we downsample the activations by adding a stride to a layer, we need to increase the number of output channels (the depth of the output) to retain the calculation complexity. For example, a stride-2 convolution halves the output width and height, so we double the number of output channels. The deeper in the network, the more output channels we will have.
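The formula translates directly into code. In this sketch, the function name conv_output_size is hypothetical, and integer (floor) division is used for cases where the stride does not divide evenly, which is an assumption about rounding that matches the behavior of common deep learning libraries.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output size of a convolution: (W - K + 2P) / S + 1, rounded down."""
    return (w - k + 2 * p) // s + 1


print(conv_output_size(5, 3, p=1, s=2))    # 3
print(conv_output_size(4, 3))              # 2
print(conv_output_size(128, 3, p=1, s=2))  # 64
```

The first call reproduces the 5×5 example, and the second reproduces the earlier unpadded, stride-1 case.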
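To see how sizes and channel counts evolve together, the sketch below traces a stack of stride-2 layers. The starting values (a 128×128 input with 16 channels), the 3×3 kernel with padding 1, and the function name layer_shapes are all assumptions chosen for illustration.

```python
def layer_shapes(size, channels, num_layers, k=3, p=1, s=2):
    """Trace (channels, height, width) through stacked strided convolutions,
    doubling the number of channels at every layer."""
    shapes = [(channels, size, size)]
    for _ in range(num_layers):
        size = (size - k + 2 * p) // s + 1  # spatial size after the convolution
        channels *= 2                        # compensate for the downsampling
        shapes.append((channels, size, size))
    return shapes


for shape in layer_shapes(128, 16, 3):
    print(shape)
# (16, 128, 128)
# (32, 64, 64)
# (64, 32, 32)
# (128, 16, 16)
```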
Neural networks are tough to understand at first, with convolutions being one of the most challenging topics in the field. Still, image data is everywhere, and knowing how to work with images can give a competitive advantage to both yourself and your company.
Thankfully, we have many resources to learn these complex topics, and the subject gets easier after a couple of repetitions. This article covered the essentials needed to move forward with the practical part. Modern libraries like TensorFlow and PyTorch won’t require you to code out things like padding and stride manually, but knowing what’s going on under the hood will make the debugging process that much easier.