Convolutional Neural Networks: An Introduction
<h2>tl;dr</h2>

<strong>Convolutional Neural Networks</strong> (CNNs) are used for the majority of applications in <a href="https://appsilon.com/computer-vision/" target="_blank" rel="noopener noreferrer">computer vision</a>. You can find them almost everywhere: image and video classification and regression, object detection, image segmentation, and even playing Atari games. Understanding the convolution layer is critical to building successful vision models. In this walkthrough, we'll explain the idea of <strong>convolution</strong> and the concepts of <strong>channels</strong>, <strong>padding</strong>, <strong>stride</strong>, and <strong>receptive field</strong>.

<blockquote>Curious about machine learning image recognition and object detection? Read <a href="https://appsilon.com/object-detection-yolo-algorithm/" target="_blank" rel="noopener noreferrer">YOLO Algorithm and YOLO Object Detection: An Introduction</a></blockquote>

We can represent pictures as a matrix or a set of matrices of pixel values. A color (RGB) image transformed into a <strong>tensor</strong> has three channels, corresponding to the <em>Red</em>, <em>Green</em>, and <em>Blue</em> channels, with pixel values between 0 and 255. The size of such a tensor is <em>Channels</em> × <em>Height</em> × <em>Width</em>. In the example below, it's 3 × 128 × 128.

<img class="wp-image-5504 size-medium" src="https://wordpress.appsilon.com/wp-content/uploads/2020/10/01_imgpil-600x209.png" alt="" width="600" height="209" />

Source: <a href="https://discuss.pytorch.org/t/pytorch-pil-to-tensor-and-vice-versa/6312/8">PyTorch discussion forum</a>

Convolution is the operation of sliding a <strong>kernel</strong> (a small matrix of weights, e.g., 3×3) over an image grid and computing the dot product at each position.

<img class="size-full wp-image-5548" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b021dfdf131218045902a2_1no_padding_no_strides.gif" alt="" width="244" height="259" />

Source: Vincent Dumoulin, Francesco Visin - <a href="https://arxiv.org/abs/1603.07285">A guide to convolution arithmetic for deep learning</a>

The animation shows convolving a 3×3 kernel over a 4×4 input, resulting in a 2×2 output.
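To see this in code, here's a minimal PyTorch sketch of the same operation (the kernel values are random, purely for illustration):

<pre><code>import torch
import torch.nn.functional as F

# 4x4 input and 3x3 kernel, shaped as (batch, channels, height, width)
image = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
kernel = torch.randn(1, 1, 3, 3)  # random weights, purely for illustration

output = F.conv2d(image, kernel)  # no padding, stride 1
print(output.shape)  # torch.Size([1, 1, 2, 2]) -- the 2x2 output from the animation
</code></pre>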
We can generalize this as follows:

<ul><li>An input of size <em>W</em> (assuming height and width are the same and equal to <em>W</em>)</li><li>A kernel of size <em>K</em></li><li>An output of size <em>(W - K) + 1</em></li></ul>

<h3>Input, output channels, and kernels</h3>

The number of input channels in the first layer should equal the number of channels in the input data. The number of output channels is defined by the user - it's a hyperparameter to set. The output channels of one layer become the input channels of the next layer.

We can convert an input - a 3-dimensional tensor of size <em>n_in</em> × <em>input_height</em> × <em>input_width</em> - into an output tensor of size <em>n_out</em> × <em>output_height</em> × <em>output_width</em> by applying a set of <em>n_out</em> 3-dimensional kernels of size <em>n_in</em> × <em>kernel_height</em> × <em>kernel_width</em>.

Each kernel application produces an output of size <em>1</em> × <em>output_height</em> × <em>output_width</em>. We can stack these <em>n_out</em> tensors together to get a final output of size <em>n_out</em> × <em>output_height</em> × <em>output_width</em>.

It's a lot to process, so re-read the last couple of paragraphs if needed.
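The PyTorch sketch below makes these shapes concrete (the choice of 16 output channels is arbitrary - it's the hyperparameter mentioned above):

<pre><code>import torch
import torch.nn as nn

# 3 input channels (RGB) -> 16 output channels, 3x3 kernels
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# The layer stores n_out kernels of size n_in x kernel_height x kernel_width
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3])

x = torch.randn(1, 3, 128, 128)  # a batch with one 3 x 128 x 128 image
y = conv(x)
print(y.shape)  # torch.Size([1, 16, 126, 126]), since (W - K) + 1 = 126
</code></pre>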
Also, feel free to refer to the image below for further clarification.

<img class="aligncenter size-large wp-image-5506" src="https://wordpress.appsilon.com/wp-content/uploads/2020/10/02_kernels-1024x443.png" alt="" width="1024" height="443" />

The kernel values, also called <strong>filters</strong>, are parameters learned by the neural network. We can represent a kernel as a weight matrix with two types of parameters:

<ul><li>Values fixed at 0 - untrainable</li><li>Tied values - shared across positions, but trainable</li></ul>

If you want a more in-depth overview of the mathematics behind this, refer to this <a class="editor-rtfLink" href="https://medium.com/impactai/cnns-from-different-viewpoints-fab7f52d159c" target="_blank" rel="noopener noreferrer">explanation</a> by Matthew Kleinsmith. Using convolutions instead of fully connected layers has two benefits: because the kernel weights are shared across the whole image, the network has far fewer parameters, so it trains faster and is less prone to overfitting.

<h3>Padding</h3>

There's a problem of shrinking dimensions: every layer of the network has a smaller feature space than the previous one. The network also loses information about the image corners and edges. We don't want that. The solution is to introduce padding, which adds a frame of pixels (usually zero-valued) around the image.

<img class="wp-image-5549 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b021e050b6cc5df6a37f14_2Convolution_arithmetic_-_Same_padding_no_strides_transposed.gif" alt="" width="395" height="449" />

Source: Vincent Dumoulin, Francesco Visin - <a href="https://arxiv.org/abs/1603.07285">A guide to convolution arithmetic for deep learning</a>

Adding padding of size <em>P</em> results in an output of size <em>(W - K) + 1 + 2P</em>. It is common to set padding to <em>(K - 1) / 2</em>, so the input and output have the same dimensions. In practice, we almost always use an odd-sized kernel.
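Here's the "same size" padding rule as a short PyTorch sketch (the 7×7 input is an arbitrary choice):

<pre><code>import torch
import torch.nn as nn

K = 3  # odd-sized kernel
conv = nn.Conv2d(in_channels=1, out_channels=1,
                 kernel_size=K, padding=(K - 1) // 2)  # P = (K - 1) / 2

x = torch.randn(1, 1, 7, 7)
print(conv(x).shape)  # torch.Size([1, 1, 7, 7]) -- same spatial size as the input
</code></pre>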
<h3>Receptive field</h3>

Now let's see what happens if we stack three convolution layers on top of each other. We apply a 3×3 kernel to an input image of size 7×7. The orange square in the top-left corner of the input matrix is the receptive field of cell (2, 2) in the first layer. A <strong>receptive field</strong> is the area of the input image involved in the calculation of a given unit of a layer.

<img class="wp-image-5508 size-large" src="https://wordpress.appsilon.com/wp-content/uploads/2020/10/03_receptive_field-1024x297.png" alt="" width="1024" height="297" />

<em>Source: <a href="https://www.youtube.com/playlist?list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r" target="_blank" rel="noopener noreferrer">Deep Learning for Computer Vision</a>, Lecture 7: Convolutional Networks, University of Michigan</em>

We start with a receptive field of size 3×3, and each convolution increases it by <em>K - 1</em> (<em>K</em> = kernel size). So in the final layer, we end up with a receptive field of size 7×7 (going from 3×3 to 5×5 to 7×7). In other words, the deeper we go into the network, the larger the area of the initial input image used to calculate each feature.

<h3>Stride</h3>

Sliding the kernel 1 pixel at a time means we would need many layers to build receptive fields big enough for complex features. One way to approach this problem is to introduce a stride, which means skipping pixels when applying the kernel. If we move the kernel over two pixels after each application, we call it a stride-2 convolution.

<img class="size-full wp-image-5550" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b021e2b4267bf870168f49_3Convolution_arithmetic_-_Padding_strides.gif" alt="" width="395" height="381" />

Source: Vincent Dumoulin, Francesco Visin - <a href="https://arxiv.org/abs/1603.07285">A guide to convolution arithmetic for deep learning</a>

With stride and padding, the output size can be calculated as <em>(W - K + 2P) / S + 1</em>, where <em>S</em> is the stride size. In the example above, we start with a 5×5 input, apply a 3×3 kernel with padding 1 and stride 2, and end up with a 3×3 output: (5 - 3 + 2) / 2 + 1 = 3.

As a result, we decreased the size of the activations. We already know that more sophisticated features are calculated deeper in the network, so we don't want to reduce the amount of computation there. When we downsample the activations by adding a stride to a layer, we increase the number of output channels (the depth of the output) to retain the computational capacity. For example, stride-2 roughly halves the output's height and width, so we typically double the number of output channels. The deeper we go in the network, the more output channels we have.
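Here's the stride-2 example above as a minimal PyTorch sketch, plus the common "halve the resolution, double the channels" pattern (the 16 and 32 channel counts are arbitrary):

<pre><code>import torch
import torch.nn as nn

# 3x3 kernel, padding 1, stride 2 -- as in the animation above
conv = nn.Conv2d(in_channels=1, out_channels=1,
                 kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 1, 5, 5)
print(conv(x).shape)  # torch.Size([1, 1, 3, 3]) -- (5 - 3 + 2*1) / 2 + 1 = 3

# Downsampling while doubling the channels (16 -> 32) to retain capacity
down = nn.Conv2d(in_channels=16, out_channels=32,
                 kernel_size=3, stride=2, padding=1)
print(down(torch.randn(1, 16, 128, 128)).shape)  # torch.Size([1, 32, 64, 64])
</code></pre>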
<h2>Conclusion</h2>

Neural networks are tough to understand at first, and convolutions are one of the most challenging topics in the field. Still, image data is everywhere, and knowing how to work with it can give a competitive advantage to both you and your company.

Thankfully, there are many resources for learning these complex topics, and the subject gets easier after a couple of repetitions. This article covered the essentials needed to move forward with the practical part. Modern libraries like <em>TensorFlow</em> and <em>PyTorch</em> won't require you to code things like padding and stride manually, but knowing what's going on under the hood will make the debugging process that much easier.

<!--more-->

<h2>Learn More</h2><ul><li><a href="https://appsilon.com/pp-yolo-object-detection/" target="_blank" rel="noopener noreferrer">PP-YOLO Object Detection Algorithm: Why It's Faster Than YOLOv4</a></li><li><a href="https://appsilon.com/weight-poisoning-computer-vision/" target="_blank" rel="noopener noreferrer">Are Computer Vision Models Vulnerable to Weight-Poisoning Attacks?</a></li><li><a href="https://appsilon.com/want-to-build-an-ai-model-for-your-business-read-this/" target="_blank" rel="noopener noreferrer">Want to Build an AI Model for Your Business? Read This.</a></li></ul>