Updated on 10 October 2020

Image classification is one of the many exciting applications of convolutional neural networks. Aside from simple image classification, there are plenty of fascinating problems in computer vision, with object detection being one of the most interesting. YOLO (“You Only Look Once”) is an effective real-time object recognition algorithm, first described in the seminal 2015 paper by Joseph Redmon et al. In this article, we introduce the concept of object detection, the YOLO algorithm itself, and one of the algorithm’s open-source implementations: Darknet.

If you’re ready to live life to the fullest and Carpe Imaginem, continue reading. We promise to minimize our use of outdated slang terms.

Object detection overview

Object detection is commonly associated with self-driving cars, where systems blend computer vision, LIDAR, and other technologies to build a multidimensional representation of the road and all its participants. It is also widely used in video surveillance, especially in crowd monitoring, whether to prevent terrorist attacks, count people for general statistics, or analyze how customers move along walking paths in shopping centers.

How does object detection work

To explore the concept of object detection, it’s useful to begin with image classification. From there, the task can be broken into three levels of increasing complexity:

  1. Image classification
    aims at assigning an image to one of a number of different categories (e.g. car, dog, cat, human, etc.), essentially answering the question “What is in this picture?”. One image has only one category assigned to it.
  2. Object localization
    allows us to locate our object in the image, so our question changes to “What is it and where is it?”.
  3. Object detection
    goes one step further: it finds all the objects in an image and draws the so-called bounding boxes around them.

In a real-life scenario, we usually need to locate not just one object but multiple objects in one image, which is exactly what object detection does. For example, a self-driving car has to find the location of other cars, traffic lights, signs, and humans, and take appropriate action based on this information.
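The difference between the three tasks is easiest to see in the shape of their outputs. The sketch below is only an illustration: the `Box` and `Detection` types, the labels, and the pixel coordinates are all made up for this example and do not come from any particular library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class Detection:
    """One labelled box (hypothetical type)."""
    label: str
    box: Box


# 1. Image classification: a single label for the whole image.
classification_output: str = "car"

# 2. Object localization: a single label plus a single box.
localization_output = Detection("car", Box(40, 60, 200, 180))

# 3. Object detection: a variable-length list of labelled boxes.
detection_output: List[Detection] = [
    Detection("car", Box(40, 60, 200, 180)),
    Detection("traffic light", Box(220, 10, 250, 90)),
    Detection("person", Box(300, 80, 340, 200)),
]

for detection in detection_output:
    print(detection.label, detection.box)
```

The key point is the last case: the number of results is not known in advance, which is what makes detection harder than the first two tasks.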

Bounding boxes are not always enough: in some situations we want to find the exact boundaries of our objects. This process is called instance segmentation, but this is a topic for another post.


Object detection algorithms

There are a few different algorithms for object detection and they can be split into two groups.

Algorithms based on classification

They are implemented in two stages:

  1. They select regions of interest in an image.
  2. They classify these regions using convolutional neural networks.

This solution can be slow because we have to run predictions for every selected region. A widely known example of this type of algorithm is the Region-based Convolutional Neural Network (R-CNN) and its cousins Fast R-CNN, Faster R-CNN, and a later addition to the family: Mask R-CNN. RetinaNet is sometimes mentioned alongside them, although it is a single-stage detector and fits better in the second group below.

Algorithms based on regression

Instead of selecting interesting parts of an image, they predict classes and bounding boxes for the whole image in one run of the algorithm. The two best-known examples from this group are the YOLO (You Only Look Once) family of algorithms and SSD (Single Shot MultiBox Detector). They are commonly used for real-time object detection as, in general, they trade a bit of accuracy for large improvements in speed.

Understanding YOLO object detection: the YOLO algorithm

To understand the YOLO algorithm, it is necessary to establish what is actually being predicted. Ultimately, we aim to predict a class of an object and the bounding box specifying object location. Each bounding box can be described using four descriptors:

  1. center of a bounding box (bx, by)
  2. width (bw)
  3. height (bh)
  4. value c corresponding to a class of an object (e.g., car, traffic lights, etc.)

To learn more about PP-YOLO (or PaddlePaddle YOLO), which is an improvement on YOLOv4, read our explanation of why PP-YOLO is faster than YOLOv4.

In addition, we have to predict the pc value, which is the probability that there is an object in the bounding box.
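The values described above can be collected into one small record per predicted box. The following is a minimal sketch, not the layout used by any specific YOLO implementation: the `Prediction` class, the `corners` helper, the `CLASS_NAMES` list, and the example numbers are assumptions made for illustration, and coordinates are assumed to be relative to the image size (between 0 and 1).

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical class list; a real model defines its own.
CLASS_NAMES = ["car", "traffic light", "person"]


@dataclass
class Prediction:
    pc: float  # probability that the box contains an object
    bx: float  # x coordinate of the box center
    by: float  # y coordinate of the box center
    bw: float  # box width
    bh: float  # box height
    c: int     # index of the predicted class

    def corners(self) -> Tuple[float, float, float, float]:
        """Convert center/size form to (x_min, y_min, x_max, y_max)."""
        half_w = self.bw / 2
        half_h = self.bh / 2
        return (self.bx - half_w, self.by - half_h,
                self.bx + half_w, self.by + half_h)


pred = Prediction(pc=0.9, bx=0.5, by=0.5, bw=0.25, bh=0.5, c=0)
print(CLASS_NAMES[pred.c], pred.pc, pred.corners())
```

The corner form is convenient for drawing boxes and for computing how much two boxes overlap, while the center/size form is what the text above describes.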

As we mentioned above, when working with the YOLO algorithm we are not searching for interesting regions in our image that could potentially contain an object.

Instead, we are splitting our image into cells, typically using a 19×19 grid. Each cell is responsible for predicting 5 bounding boxes (in case there is more than one object in this cell). Therefore, we arrive at a large number of bounding boxes for one image: 19 × 19 × 5 = 1805. Rather than seizing the day with #YOLO and Carpe Diem, we’re looking to seize object probability. The exchange of accuracy for more speed isn’t reckless behavior, but a requirement for faster real-time object detection.
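The grid arithmetic can be checked in a few lines. The sketch below also shows how a box center can be mapped to a grid cell; the rule that an object belongs to the cell containing its center comes from the YOLO paper rather than from this article, and the `responsible_cell` helper and its relative-coordinate convention are assumptions made for illustration.

```python
from typing import Tuple

GRID_SIZE = 19       # cells per side, as in the text
BOXES_PER_CELL = 5   # boxes predicted by each cell, as in the text

# Total number of boxes predicted for one image.
total_boxes = GRID_SIZE * GRID_SIZE * BOXES_PER_CELL


def responsible_cell(bx: float, by: float,
                     grid_size: int = GRID_SIZE) -> Tuple[int, int]:
    """Return (row, col) of the cell that contains a box center.

    bx and by are assumed to be relative coordinates in [0, 1].
    A value of exactly 1.0 is clamped into the last cell.
    """
    col = min(int(bx * grid_size), grid_size - 1)
    row = min(int(by * grid_size), grid_size - 1)
    return row, col


print(total_boxes)                  # 1805
print(responsible_cell(0.5, 0.5))   # (9, 9), the middle cell
```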

Most of these cells and bounding boxes will not contain an object. Therefore, we predict the value pc, which is used in two filtering steps. First, boxes with a low object probability are removed. Then, among the remaining boxes that share a large area with each other, only the one with the highest probability is kept, in a process called non-max suppression.
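A minimal sketch of this filtering in plain Python is shown below. It is not the code used by Darknet: the function names, the corner-format boxes, and both threshold values (0.5) are assumptions chosen for illustration. The overlap between two boxes is measured with intersection over union (IoU), the usual choice for non-max suppression.

```python
from typing import List, Tuple

BoxT = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: BoxT, b: BoxT) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: List[BoxT], scores: List[float],
                        score_threshold: float = 0.5,
                        iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes that survive both filtering steps."""
    # Step 1: drop boxes with a low object probability.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: visit boxes from highest to lowest score.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        # Discard every remaining box that overlaps the kept one too much.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (21, 21, 31, 31)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

In this example, box 1 is suppressed because it overlaps the higher-scoring box 0, and box 3 is removed by the probability threshold before overlap is even considered. In practice the procedure is usually run separately for each class, so that overlapping objects of different classes do not suppress each other.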

[Image: non-max suppression]

Darknet: a YOLO implementation

There are a few different implementations of the YOLO algorithm on the web. Darknet is one such open-source neural network framework. Darknet is written in C and CUDA, which makes it really fast and allows computations to run on a GPU, which is essential for real-time predictions.

If you’re curious about other examples of the YOLO algorithm in action, you can take a look at a PyTorch implementation or check out YOLOv3 with some extra fast.ai functionality. For a complete overview, explore the Keras implementation.

[Image: Darknet computer vision]

Installation is simple and requires running just 3 lines of code (to use a GPU, it is necessary to modify the settings in the Makefile, for example by setting GPU=1, after cloning the repository and before building). For more details, see the Darknet installation instructions.

git clone https://github.com/pjreddie/darknet
cd darknet
make

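After installation, we can use a pre-trained model or build a new one from scratch. For example, here’s how you can detect objects in your image using a model pre-trained on the COCO dataset (the command assumes the pre-trained yolov3.weights file has already been downloaded into the darknet directory):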

./darknet detect cfg/yolov3.cfg yolov3.weights data/my_image.jpg

As you can see in the image above, the algorithm deals well even with object representations.

If you want to see more, go to the Darknet website.

You don’t have to build your Machine Learning model from scratch. In fact, it’s usually better not to. Read our Introduction to Transfer Learning to find out why.

This article was originally written by Michał Maj with further contributions from the Appsilon ML team.

