22 August 2018

Some time ago, I was exploring the exciting world of convolutional neural networks and wondered how can we use them for image classification. (If this sounds interesting check out this post too.) Beside simple image classification, there’s no shortage of fascinating problems in computer vision, with object detection being one of the most interesting. Most commonly it’s associated with self driving cars where systems blend computer vision, LIDAR and other technologies to generate a multidimensional representation of road with all its participants. On the other hand object detection is used in video surveillance, especially  in crowd monitoring to prevent terrorist attacks, count people for general statistics or analyze customer experience with walking paths within shopping centers.

Ok, so what exactly is object detection? To answer that question let’s start with image classification. In this task we’ve got an image and we want to assign it to one of many different categories (e.g. car, dog, cat, human,…), so basically we want to answer the question “What is in this picture?”. Note that one image has only one category assigned to it. After completing this task we do something more difficult and try to locate our object in the image, so our question changes to “What is it and where it is?”. This task is called object localization. So far so good, but in a real-life scenario, we won’t be interested in locating only one object but rather multiple objects in one image. For example let’s think of a self-driving car, that in the real-time video stream has to find the location of other cars, traffic lights, signs, humans and then having this information take appropriate action. It’s a great example of object detection. In object detection tasks we are interested in finding all object in the image and drawing so-called bounding boxes around them. There are also some situations where we want to find exact boundaries of our objects in the process called instance segmentation, but this is a topic for another post.

YOLO algorithm

There are a few different algorithms for object detection and they can be split into two groups:

  1.    Algorithms based on classification – they work in two stages. In the first step, we’re selecting from the image interesting regions. Then we’re classifying those regions using convolutional neural networks. This solution could be very slow because we have to run prediction for every selected region. Most known example of this type of algorithms is the Region-based convolutional neural network (RCNN) and their cousins Fast-RCNN and Faster-RCNN.
  2.    Algorithms based on regression – instead of selecting interesting parts of an image, we’re predicting classes and bounding boxes for the whole image in one run of the algorithm. Most known example of this type of algorithms is YOLO (You only look once) commonly used for real-time object detection.

Before we go into YOLOs details we have to know what we are going to predict. Our task is to predict a class of an object and the bounding box specifying object location. Each bounding box can be described using four descriptors:

  1. center of a bounding box (bx by)
  2. width (bw)
  3. height (bh)
  4. value c is corresponding to a class of an object (f.e. car, traffic lights,…).

We’ve got also one more predicted value pc which is a probability that there is an object in the bounding box, I will explain in a moment why do we need this.

Like I said before with YOLO algorithm we’re not searching for interested regions on our image that could contain some object. Instead of that we are splitting our image into cells, typically its 19×19 grid. Each cell will be responsible for predicting 5 bounding boxes (in case there’s more than one object in this cell). This will give us 1805 bounding boxes for an image and that’s a really big number!

Majority of those cells and boxes won’t have an object inside and this is the reason why we need to predict pc. In the next step, we’re removing boxes with low object probability and bounding boxes with the highest shared area in the process called non-max suppression.

nonmax suppression


There are a few different implementations of YOLO algorithm on the web, but today I want to briefly introduce you to an open source neural network framework called Darknet. Darknet was written in C language and CUDA technology, what makes it really fast and allows you to make computations on a GPU, which is essential for real-time predictions.
darknet computer vision

Installation is very simple, just run these 3 lines (in order to use GPU modify settings in Makefile script after cloning the repository). For more details go here

git clone https://github.com/pjreddie/darknet
cd darknet

After installation, we can use a pre-trained model or build a new one from scratch. For example here’s how you can detect objects on your image using model pre-trained on COCO dataset:

./darknet detect cfg/yolov3.cfg yolov3.weights data/my_image.jpg

<code class="language-bash" data-lang="bash">./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg</code>

If you want to see more, go to
Darknet website.