This site uses cookies. Read more.

Last Updated: May 2020. This article was originally written by Michał Maj with further contributions from the Appsilon Data Science team.

What is YOLO Object Detection?

YOLO (“You Only Look Once”) is an effective real-time object recognition algorithm, first described in the seminal 2015 paper by Joseph Redmon et al. In this article we introduce the concept of object detection, the YOLO algorithm itself, and one of the algorithm’s open source implementations: Darknet. 

Image classification is one of the many exciting applications of convolutional neural networks. Aside from simple image classification, there are plenty of fascinating problems in computer vision, with object detection being one of the most interesting. It is commonly associated with self-driving cars where systems blend computer vision, LIDAR and other technologies to generate a multidimensional representation of the road with all its participants. Object detection is also commonly used in video surveillance, especially in crowd monitoring to prevent terrorist attacks, count people for general statistics or analyze customer experience with walking paths within shopping centers.

Object Detection Overview

To explore the concept of object detection it is useful to begin with image classification. It goes through levels of incremental complexity. 

Image classification (1) aims at assigning an image to one of a number of different categories (e.g. car, dog, cat, human, etc.), essentially answering the question “What is in this picture?”. One image has only one category assigned to it. 

Object localization (2) then allows us to locate our object in the image, so our question changes to “What is it and where it is?”

In a real real-life scenario, we need to go beyond locating just one object but rather multiple objects in one image. For example, a self-driving car has to find the location of other cars, traffic lights, signs, humans and to take appropriate action based on this information.

Object detection (3) provides the tools for doing just that –  finding all the objects in an image and drawing the so-called bounding boxes around them. There are also some situations where we want to find exact boundaries of our objects in the process called instance segmentation, but this is a topic for another post.


YOLO algorithm

There are a few different algorithms for object detection and they can be split into two groups:

  1. Algorithms based on classification. They are implemented in two stages. First, they select regions of interest in an image. Second, they classify these regions using convolutional neural networks. This solution can be slow because we have to run predictions for every selected region. A widely known example of this type of algorithm is the Region-based convolutional neural network (RCNN) and its cousins Fast-RCNN, Faster-RCNN and the latest addition to the family: Mask-RCNN. Another example is RetinaNet.
  2. Algorithms based on regression – instead of selecting interesting parts of an image, they predict classes and bounding boxes for the whole image in one run of the algorithm. The two best known examples from this group are the YOLO (You Only Look Once) family algorithms and SSD (Single Shot Multibox Detector). They are commonly used for  real-time object detection as, in general, they trade a bit of accuracy for large improvements in speed.

To understand the YOLO algorithm, it is necessary to establish what is actually being predicted. Ultimately, we aim to predict a class of an object and the bounding box specifying object location. Each bounding box can be described using four descriptors:

  1. center of a bounding box (bxby)
  2. width (bw)
  3. height (bh)
  4. value cis corresponding to a class of an object (such as: car, traffic lights, etc.).

In addition, we have to predict the pc value, which is the probability that there is an object in the bounding box.

As we mentioned above, when working with the YOLO algorithm we are not searching for interesting regions in our image that could potentially contain an object. 

Instead, we are splitting our image into cells, typically using a 19×19 grid. Each cell is responsible for predicting 5 bounding boxes (in case there is more than one object in this cell). Therefore, we arrive at a large number of 1805 bounding boxes for one image.

Most of these cells and bounding boxes will not contain an object. Therefore, we predict the value pc, which serves to remove boxes with low object probability and bounding boxes with the highest shared area in a process called non-max suppression.

nonmax suppression

Darknet – a YOLO implementation

There are a few different implementations of the YOLO algorithm on the web. Darknet is one such open source neural network framework  (a PyTorch implementation can be found here or with some extra fast.ai functionality here; a Keras implementation can be found here). Darknet was written in the C Language and CUDAtechnology, which makes it really fast and provides for making computations on a GPU, which is essential for real-time predictions.
darknet computer vision

Installation is simple and requires running just 3 lines of code (in order to use GPU it is necessary to modify the settings in the Makefile script after cloning the repository). For more details go here.

git clone https://github.com/pjreddie/darknet
cd darknet

After installation, we can use a pre-trained model or build a new one from scratch. For example here’s how you can detect objects on your image using model pre-trained on COCO dataset:

./darknet detect cfg/yolov3.cfg yolov3.weights data/my_image.jpg

The algorithm deals well even with object representations.

If you want to see more, go to the Darknet website


Follow Appsilon for More