 22 August, 2018

YOLO (“You Only Look Once”) is an effective real-time object detection algorithm, first described in the seminal 2015 paper by Joseph Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”. In this article we introduce the concept of object detection, the YOLO algorithm itself, and one of its open source implementations.

Image classification is one of the many exciting applications of convolutional neural networks. Aside from simple image classification, there are plenty of fascinating problems in computer vision, with object detection being one of the most interesting. It is commonly associated with self-driving cars, where systems blend computer vision, LIDAR and other technologies to generate a multidimensional representation of the road and all its participants. Object detection is also commonly used in video surveillance, especially in crowd monitoring to prevent terrorist attacks, to count people for general statistics, or to analyze customer experience by tracking walking paths within shopping centers.

Object detection

To explore the concept of object detection, it is useful to begin with image classification and work up through three levels of increasing complexity.

Image classification (1) aims at assigning an image to one of a number of different categories (e.g. car, dog, cat, human), essentially answering the question “What is in this picture?”. Each image is assigned exactly one category.

Object localization (2) then allows us to locate the object in the image, so the question becomes “What is it and where is it?”

In a real-life scenario, we need to go beyond locating a single object and instead locate multiple objects in one image. For example, a self-driving car has to find the locations of other cars, traffic lights, signs and humans, and take appropriate action based on this information.

Object detection (3) provides the tools for doing just that: finding all the objects in an image and drawing so-called bounding boxes around them. In some situations we also want to find the exact boundaries of the objects, a process called instance segmentation, but this is a topic for another post.


YOLO algorithm

There are a few different algorithms for object detection and they can be split into two groups:

  1. Algorithms based on classification. They are implemented in two stages. First, they select regions of interest in an image. Second, they classify these regions using convolutional neural networks. This solution can be slow because we have to run predictions for every selected region. Widely known examples of this type of algorithm are the Region-based Convolutional Neural Network (R-CNN) and its cousins Fast R-CNN and Faster R-CNN.
  2. Algorithms based on regression. Instead of selecting interesting parts of an image, they predict classes and bounding boxes for the whole image in one run of the algorithm. The best-known example of this type of algorithm is YOLO (“You Only Look Once”), which is commonly used for real-time object detection.

To understand the YOLO algorithm, it is necessary to establish what is actually being predicted. Ultimately, we aim to predict the class of an object and the bounding box specifying the object’s location. Each bounding box can be described using four descriptors:

  1. center of the bounding box (b_x, b_y)
  2. width (b_w)
  3. height (b_h)
  4. value c corresponding to the class of the object (such as car, traffic lights, etc.).

In addition, we have to predict the value p_c, which is the probability that there is an object in the bounding box.
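
The descriptors listed above can be collected into a single record. The following is a minimal sketch, assuming coordinates normalized to the range 0 to 1; the names `BoxPrediction` and `to_corners` are hypothetical and are not taken from any YOLO implementation.

```python
from dataclasses import dataclass


@dataclass
class BoxPrediction:
    """One predicted bounding box (hypothetical container for illustration)."""
    pc: float  # probability that the box contains an object
    bx: float  # x coordinate of the box center
    by: float  # y coordinate of the box center
    bw: float  # box width
    bh: float  # box height
    c: int     # index of the predicted class

    def to_corners(self):
        """Convert center/size form to (x_min, y_min, x_max, y_max)."""
        half_w, half_h = self.bw / 2, self.bh / 2
        return (self.bx - half_w, self.by - half_h,
                self.bx + half_w, self.by + half_h)


# Example: a box centered in the image, a quarter as wide and half as tall.
box = BoxPrediction(pc=0.9, bx=0.5, by=0.5, bw=0.25, bh=0.5, c=2)
print(box.to_corners())  # (0.375, 0.25, 0.625, 0.75)
```

The corner form is convenient when drawing a box or computing the overlap between two boxes, while the center form matches the descriptors used in the text.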

As we mentioned above, when working with the YOLO algorithm we are not searching for interesting regions in our image that could potentially contain an object. 

Instead, we split the image into cells, typically using a 19×19 grid. Each cell is responsible for predicting 5 bounding boxes (in case there is more than one object in the cell). This gives 19 × 19 × 5 = 1805 bounding boxes for one image.
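
The grid arithmetic can be checked with a few lines of Python. This is only a sketch under stated assumptions: the class count of 80 (as in COCO) and the layout of one score per class are not given in the text, and the rule that an object belongs to the cell containing its center follows the convention of the original YOLO paper. The function names are hypothetical.

```python
GRID_SIZE = 19       # cells per side, as in the text
BOXES_PER_CELL = 5   # boxes predicted by each cell, as in the text
NUM_CLASSES = 80     # assumption: COCO-style class count, not stated in the text


def total_boxes(grid_size, boxes_per_cell):
    """Number of bounding boxes predicted for one image."""
    return grid_size * grid_size * boxes_per_cell


def values_per_cell(boxes_per_cell, num_classes):
    """Numbers predicted per cell, assuming each box carries
    pc, bx, by, bw, bh plus one score per class."""
    return boxes_per_cell * (5 + num_classes)


def responsible_cell(bx, by, grid_size):
    """Return (row, col) of the cell containing a normalized center point."""
    col = min(int(bx * grid_size), grid_size - 1)
    row = min(int(by * grid_size), grid_size - 1)
    return row, col


print(total_boxes(GRID_SIZE, BOXES_PER_CELL))        # 1805
print(values_per_cell(BOXES_PER_CELL, NUM_CLASSES))  # 425
print(responsible_cell(0.5, 0.5, GRID_SIZE))         # (9, 9)
```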

Most of these cells and bounding boxes will not contain an object. We therefore use the predicted value p_c to remove boxes with a low object probability. Among the remaining boxes, those that share a large area with a higher-scoring box are removed in a process called non-max suppression.
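
The two filtering steps can be sketched in plain Python. This is an illustrative version, not Darknet’s implementation: the function names and both threshold values (0.5) are assumptions, boxes are given as corner coordinates, and the shared area is measured as intersection over union (IoU). In practice the suppression is usually applied separately for each class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    # Step 1: drop boxes with a low object probability.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: visit the remaining boxes from most to least confident.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Here the second box is suppressed because it overlaps the first, more confident box almost entirely, and the last box is dropped by the score threshold before suppression even starts.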

Darknet – a YOLO implementation

There are a few different implementations of the YOLO algorithm on the web. Darknet is one such open source neural network framework. It is written in C and CUDA, which makes it fast and allows computations to run on a GPU, which is essential for real-time predictions.

Installation is simple and requires running just 3 lines of code (to use a GPU, it is necessary to modify the settings in the Makefile after cloning the repository). For more details, see the Darknet installation instructions.

git clone <Darknet repository URL>
cd darknet
make

After installation, we can use a pre-trained model or build a new one from scratch. For example, here’s how you can detect objects in an image using a model pre-trained on the COCO dataset:

./darknet detect cfg/yolov3.cfg yolov3.weights data/my_image.jpg

If you want to see more, go to the Darknet website.


