Convolutional Neural Networks: An Introduction

<h2>tl;dr</h2>

<strong>Convolutional Neural Networks</strong> (CNNs) power the majority of applications in <a href="https://appsilon.com/computer-vision/" target="_blank" rel="noopener noreferrer">computer vision</a>. You can find them almost everywhere: image and video classification and regression, object detection, image segmentation, and even playing Atari games. Understanding the convolution layer is critical to building successful vision models. In this article, we'll walk you through the idea of <strong>convolution</strong> and explain the concepts of <strong>channels</strong>, <strong>padding</strong>, <strong>stride</strong>, and <strong>receptive field</strong>.

<blockquote>Curious about machine learning image recognition and object detection? Read <a href="https://appsilon.com/object-detection-yolo-algorithm/" target="_blank" rel="noopener noreferrer">YOLO Algorithm and YOLO Object Detection: An Introduction</a></blockquote>

We can represent a picture as a matrix or a set of matrices of pixel values. A color (RGB) image transformed into a <strong>tensor</strong> has three channels corresponding to the <em>Red</em>, <em>Green</em>, and <em>Blue</em> channels, with pixel values between 0 and 255. The size of such a tensor is <em>Channels</em> × <em>Height</em> × <em>Width</em>. In the example below, it's 3 × 128 × 128.

<img class="wp-image-5504 size-medium" src="https://wordpress.appsilon.com/wp-content/uploads/2020/10/01_imgpil-600x209.png" alt="" width="600" height="209" /> Source: <a href="https://discuss.pytorch.org/t/pytorch-pil-to-tensor-and-vice-versa/6312/8">PyTorch discussion forum</a>
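To make this concrete, here's a minimal PyTorch sketch (assuming <code>torch</code>, <code>torchvision</code>, and <code>Pillow</code> are installed) that converts an image into such a tensor; the file name <code>cat.jpg</code> is just a placeholder for any RGB image you have on disk.

<pre><code class="language-python">import torch
from PIL import Image
from torchvision import transforms

# Load an image and convert it to a tensor of shape (Channels, Height, Width).
# "cat.jpg" is a placeholder -- substitute any RGB image.
img = Image.open("cat.jpg").convert("RGB").resize((128, 128))

to_tensor = transforms.ToTensor()  # also rescales pixel values from [0, 255] to [0.0, 1.0]
x = to_tensor(img)

print(x.shape)  # torch.Size([3, 128, 128]) -> Channels x Height x Width
</code></pre>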
Convolution is the operation of sliding a <strong>kernel</strong> (a small matrix of weights, e.g., 3×3) over an image grid and computing the dot product at each position.

<img class="size-full wp-image-5548" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b021dfdf131218045902a2_1no_padding_no_strides.gif" alt="" width="244" height="259" /> Source: Vincent Dumoulin, Francesco Visin - <a href="https://arxiv.org/abs/1603.07285">A guide to convolution arithmetic for deep learning</a>

The animation shows a 3×3 kernel convolved over a 4×4 input, resulting in a 2×2 output. We can generalize this:

<ul><li>An input of size <em>W</em> (assuming height and width are equal)</li><li>A kernel of size <em>K</em></li><li>An output of size <em>(W - K) + 1</em></li></ul>
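To verify the arithmetic, here's a small sketch using PyTorch's functional convolution; the input and kernel values are arbitrary, chosen only to reproduce the 4×4-input, 3×3-kernel setup from the animation.

<pre><code class="language-python">import torch
import torch.nn.functional as F

W, K = 4, 3  # input size and kernel size from the animation above

x = torch.arange(W * W, dtype=torch.float32).reshape(1, 1, W, W)  # (batch, channels, H, W)
kernel = torch.ones(1, 1, K, K)  # a single 3x3 kernel of ones

out = F.conv2d(x, kernel)  # no padding, stride 1

print(out.shape)    # torch.Size([1, 1, 2, 2])
print((W - K) + 1)  # 2 -- matches the formula
</code></pre>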
<h3>Input, output channels, and kernels</h3>

The number of input channels in the first layer should equal the number of channels in the input data. The number of output channels is up to the user - it's a hyperparameter to set. The output channels of one layer become the input channels of the next layer.

We can convert an input - a 3-dimensional tensor of size <em>n_in</em> × <em>input_height</em> × <em>input_width</em> - into an output tensor of size <em>n_out</em> × <em>output_height</em> × <em>output_width</em> by applying a set of <em>n_out</em> 3-dimensional kernels, each of size <em>n_in</em> × <em>kernel_height</em> × <em>kernel_width</em>.

Each kernel application produces an output of size <em>1</em> × <em>output_height</em> × <em>output_width</em>. We can stack these <em>n_out</em> tensors together to get an output of final size <em>n_out</em> × <em>output_height</em> × <em>output_width</em>.

It's a lot to process, so re-read the last couple of paragraphs if needed. Also, feel free to refer to the image below for further clarification.

<img class="aligncenter size-large wp-image-5506" src="https://wordpress.appsilon.com/wp-content/uploads/2020/10/02_kernels-1024x443.png" alt="" width="1024" height="443" />

The kernel values, also called <strong>filters</strong>, are parameters learned by the neural network. One way to think of a convolution is as a fully connected layer whose weight matrix has two special kinds of entries:

<ul><li>Zero values - fixed at 0 and untrainable</li><li>Tied values - sharing the same value, trained together</li></ul>

If you want a more in-depth overview of the mathematics behind this, refer to this <a class="editor-rtfLink" href="https://medium.com/impactai/cnns-from-different-viewpoints-fab7f52d159c" target="_blank" rel="noopener noreferrer">explanation</a> by Matthew Kleinsmith. Using convolutions instead of fully connected layers has two benefits: the network trains faster and is less prone to overfitting.
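This channel bookkeeping is exactly what <code>torch.nn.Conv2d</code> handles for us. A short sketch (the channel counts are illustrative) showing that the layer holds <em>n_out</em> kernels, each of size <em>n_in</em> × <em>kernel_height</em> × <em>kernel_width</em>:

<pre><code class="language-python">import torch
import torch.nn as nn

n_in, n_out = 3, 16  # illustrative values: RGB input, 16 output channels

conv = nn.Conv2d(in_channels=n_in, out_channels=n_out, kernel_size=3)

# One 3-dimensional kernel per output channel:
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]) -> (n_out, n_in, kernel_h, kernel_w)

x = torch.randn(1, n_in, 128, 128)  # (batch, channels, height, width)
print(conv(x).shape)                # torch.Size([1, 16, 126, 126]) -- (128 - 3) + 1 = 126
</code></pre>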
<h3>Padding</h3>

There's a problem of shrinking dimensions: every convolution layer produces a smaller feature map than its input. The network also loses information about the image corners and edges. We don't want that. The solution is to introduce padding, which adds a frame of pixels (usually zero-valued) around the image.

<img class="wp-image-5549 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b021e050b6cc5df6a37f14_2Convolution_arithmetic_-_Same_padding_no_strides_transposed.gif" alt="" width="395" height="449" /> Source: Vincent Dumoulin, Francesco Visin - <a href="https://arxiv.org/abs/1603.07285">A guide to convolution arithmetic for deep learning</a>

Adding padding of size <em>P</em> results in an output of size <em>(W - K) + 1 + 2P</em>. It is common to set the padding to <em>(K - 1) / 2</em> so that the input and output have the same dimensions. In practice, we almost always use an odd-sized kernel.
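A quick sketch of this "same" padding rule in PyTorch (the channel counts and image size are arbitrary):

<pre><code class="language-python">import torch
import torch.nn as nn

K = 3
P = (K - 1) // 2  # padding that preserves the spatial dimensions

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=K, padding=P)

x = torch.randn(1, 3, 128, 128)
print(conv(x).shape)  # torch.Size([1, 8, 128, 128]) -- height and width unchanged
</code></pre>

Newer PyTorch versions also accept <code>padding="same"</code> for unstrided convolutions, which computes this value for you.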
<h3>Receptive field</h3>

Now let's see what happens when we stack three convolution layers on top of each other, applying a 3×3 kernel to a 7×7 input image. The orange square in the top-left corner of the input matrix is the receptive field of cell (2, 2) in the first layer. The receptive field is the area of the input image involved in calculating a given cell of a layer.

<img class="wp-image-5508 size-large" src="https://wordpress.appsilon.com/wp-content/uploads/2020/10/03_receptive_field-1024x297.png" alt="" width="1024" height="297" /> <em>Source: <a href="https://www.youtube.com/playlist?list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r" target="_blank" rel="noopener noreferrer">Deep Learning for Computer Vision</a> Lecture 7: Convolutional Networks, University of Michigan</em>

We start with a receptive field of size 3×3, and each additional convolution increases it by <em>K - 1</em> (<em>K</em> = kernel size). So in the final layer, we end up with a receptive field of size 7×7 (going from 3×3 to 5×5 to 7×7). In other words, the deeper we go into the network, the larger the area of the input image used to calculate each feature.

<h3>Stride</h3>

Sliding the kernel one pixel at a time means we would need many layers to build receptive fields large enough to capture complex features. One way to approach this problem is to introduce a stride: skipping pixels when moving the kernel. For example, we could move over two pixels after each kernel application. This is called a stride-2 convolution.

<img class="size-full wp-image-5550" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b021e2b4267bf870168f49_3Convolution_arithmetic_-_Padding_strides.gif" alt="" width="395" height="381" /> Source: Vincent Dumoulin, Francesco Visin - <a href="https://arxiv.org/abs/1603.07285">A guide to convolution arithmetic for deep learning</a>

With stride and padding, the output size is <em>(W - K + 2P) / S + 1</em>, where <em>S</em> is the stride. In the example above, we start with a 5×5 input, apply a 3×3 kernel with padding 1 and stride 2, and end up with a 3×3 output.

As a result, we decrease the size of the activations. We already know that more sophisticated features are calculated deeper in the network, so we don't want to reduce the amount of computation there. When we downsample the activations by adding a stride to a layer, we increase the number of output channels (the depth of the output) to retain the computational capacity. For example, stride-2 roughly halves the output height and width, so we double the output channels. The deeper in the network, the more output channels we have.
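Here's a sketch reproducing the stride-2 example above, followed by the downsample-and-double-the-channels pattern; the specific channel counts (16 and 32) are illustrative, not prescriptive:

<pre><code class="language-python">import torch
import torch.nn as nn

# The example from the animation: 5x5 input, 3x3 kernel, padding 1, stride 2.
x = torch.randn(1, 1, 5, 5)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
print(conv(x).shape)  # torch.Size([1, 1, 3, 3]) -- (5 - 3 + 2*1) / 2 + 1 = 3

# Downsampling while doubling channels to retain computational capacity
# (16 and 32 are illustrative channel counts):
downsample = nn.Conv2d(16, 32, kernel_size=3, padding=1, stride=2)
feat = torch.randn(1, 16, 64, 64)
print(downsample(feat).shape)  # torch.Size([1, 32, 32, 32])
</code></pre>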

<h2>Conclusion</h2>

Neural networks are tough to understand at first, and convolutions are one of the most challenging topics in the field. Still, image data is everywhere, and knowing how to work with images can give a competitive advantage to both you and your company.

Thankfully, we have many resources for learning these complex topics, and the subject gets easier with a couple of repetitions. This article covered the essentials needed to move forward with the practical part. Modern libraries like <em>TensorFlow</em> and <em>PyTorch</em> won't require you to code things like padding and stride manually, but knowing what's going on under the hood makes debugging that much easier.

<h2>Learn More</h2>

<ul><li><a href="https://appsilon.com/pp-yolo-object-detection/" target="_blank" rel="noopener noreferrer">PP-YOLO Object Detection Algorithm: Why It's Faster Than YOLOv4</a></li><li><a href="https://appsilon.com/weight-poisoning-computer-vision/" target="_blank" rel="noopener noreferrer">Are Computer Vision Models Vulnerable to Weight-Poisoning Attacks?</a></li><li><a href="https://appsilon.com/want-to-build-an-ai-model-for-your-business-read-this/" target="_blank" rel="noopener noreferrer">Want to Build an AI Model for Your Business? Read This.</a></li></ul>