Real-Time Object Detection using YOLO

4 July 2022
One of the most fascinating and rapidly advancing AI disciplines is automated object recognition. This term refers to the autonomously learned recognition and classification of objects in images or videos. Thanks to tremendous performance advances in GPU-accelerated computing, AI-based object recognition algorithms are already being used in real-world applications. Some of the most prominent application areas are:
- Autonomous driving or driving assistance systems
- Optical quality control, automated barcode recognition
- Support in medical imaging processes (CT, MRT)
- Traffic-flow analysis or license-plate recognition
- Safety systems (e.g. persons in the track bed)
While the practical use of object recognition systems is still in its infancy, the potential of this technology is virtually unlimited.
From a technical perspective, there are two different approaches: single-stage and two-stage methods.
In the two-stage methods, which were developed first, regions of interest in the image are isolated in the first stage and checked for the presence of objects in the second stage using an ordinary classification model. A simple algorithm for generating image sections is the sliding-window algorithm, in which the sections to be classified are produced by a window that moves across the entire image. More advanced systems use artificial intelligence to isolate interesting image sections. Regardless of the exact structure of the first stage, however, the second stage always requires multiple evaluations per image. As a result, these AI systems achieve high precision, but the computing time required for evaluation makes them of limited use for real-time applications. An established two-stage AI architecture is the Region-Based Convolutional Neural Network (R-CNN), which has since evolved into the variants Fast R-CNN, Faster R-CNN and Granulated R-CNN.
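The sliding-window idea mentioned above can be sketched in a few lines. This is a minimal illustration, not production code; the window size and stride values are arbitrary:

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x, y, w, h) crops that scan the entire image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

# Each crop would then be passed to an ordinary classifier,
# which is exactly why two-stage methods need many evaluations per image.
windows = list(sliding_windows(640, 480, 128, 128, 64))
print(len(windows))  # 54 crops for a single 640x480 image
```

Even this toy configuration produces 54 classifier calls for one image, which illustrates why two-stage approaches struggle with real-time requirements.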
This contrasts with the single-stage methods developed later, in which the complete evaluation, i.e. object detection, classification and output of a limiting object frame (bounding box), takes place in a single step with only one evaluation. One of the most successful AI architectures ever, YOLO ("You Only Look Once"), belongs to this class of single-stage object-detection systems and carries the single-pass property in its very name. Like no other architecture, YOLO paved the way for real-time analysis of (moving) images and thus became a standard algorithm for object-detection tasks. Today, the YOLO architecture is available in versions ranging from YOLOv3 and YOLOv4 up to YOLOR. Due to its popularity, the main features of this object-detection system are discussed in the following.
The output vector
To understand how the YOLO architecture works, it is worthwhile to take a closer look at the output vector of the model or to construct the desired output vector (ground truth) for an example.
For this purpose, let us consider the following example photo in which birds and rocks are to be detected:
For the moment, we focus on the seagull on the left and imagine there to be no other seagulls or rocks in the photo. The general form of the YOLO output vector for a model with two classes and at most one object per image is

y = (p_c, b_x, b_y, b_h, b_w, c_bird, c_rock)^T
Here, p_c corresponds to the confidence level that an object is present in the predicted bounding box. b_x, b_y are the coordinates of the center of the bounding box. The quantities b_h, b_w denote the height and width of the bounding box, respectively. The last components of the vector denote the class labels. They provide information about whether the object is a bird or a rock. More precisely, c_bird and c_rock are conditional probabilities, i.e. given the condition that the bounding box contains an object, how likely is it that the object is a bird or a rock, respectively. For the above example photo with only the left seagull, the corresponding ground truth vector is thus

y = (1, b_x, b_y, b_h, b_w, 1, 0)^T
with corresponding numerical values for the coordinates and dimensions of the bounding box.
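As a concrete illustration, such a ground-truth vector could be written out as a plain list. The coordinate and size values below are made-up placeholders, not measurements from the actual photo:

```python
# Hypothetical ground-truth vector for the single-seagull example.
# Layout: [p_c, b_x, b_y, b_h, b_w, c_bird, c_rock]
# All coordinates are normalized to the image dimensions.
ground_truth = [
    1.0,   # p_c:    an object is present
    0.30,  # b_x:    x-coordinate of the box center (illustrative)
    0.55,  # b_y:    y-coordinate of the box center (illustrative)
    0.20,  # b_h:    box height (illustrative)
    0.15,  # b_w:    box width (illustrative)
    1.0,   # c_bird: the object is a bird
    0.0,   # c_rock: the object is not a rock
]
```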
The image displayed above, however, contains not one but several objects, so a single seven-component vector is obviously not sufficient to describe it.
YOLO solves this problem by rasterizing the image:
Instead of using a single seven-component vector, the above example uses a 6x9 grid of seven-component vectors, one vector for each cell. In each case, the cell in which the center of an object lies is responsible for detecting, classifying and bounding that object. The bounding box can, of course, extend beyond the cell itself. In practice, the vector, which is seven-component in our example, is replicated several times per cell so that several objects can be detected in the same cell. In order to keep this blog article as comprehensible as possible, we will refrain from more detailed explanations at this point.
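The grid-based target described above can be sketched as follows. This is a simplified assignment, assuming normalized coordinates and ignoring the per-cell replication (anchor boxes) mentioned above; the object values are the same illustrative placeholders as before:

```python
def build_target(objects, rows=6, cols=9):
    """Build a rows x cols grid of 7-component target vectors.

    objects: list of (x, y, h, w, class_index) with coordinates
    normalized to [0, 1); class_index 0 = bird, 1 = rock.
    Simplified sketch: one object slot per cell, no anchor boxes.
    """
    grid = [[[0.0] * 7 for _ in range(cols)] for _ in range(rows)]
    for x, y, h, w, cls in objects:
        col = int(x * cols)  # the cell containing the box center ...
        row = int(y * rows)  # ... is responsible for this object
        grid[row][col] = [1.0, x, y, h, w,
                          1.0 if cls == 0 else 0.0,
                          1.0 if cls == 1 else 0.0]
    return grid

# One bird with its center at (0.30, 0.55) lands in cell (row 3, col 2).
target = build_target([(0.30, 0.55, 0.20, 0.15, 0)])
```

All remaining cells keep p_c = 0, encoding "no object center lies here".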
The network architecture
With a dataset of images prepared as before, in which bounding boxes and class labels for these boxes have been manually added, the system can now be trained.
The actual architecture consists of several groups of convolutional layers, followed by a smaller number of fully-connected layers, where the output layer in this case has the dimension 6x9x7. For the initial training of the AI architecture, a large number of prepared images is required as well as a computer system/cluster with sufficient GPU power. Due to these circumstances, it often makes sense in practice to work with pre-trained systems that have already been trained on a large dataset and are freely available. If the desired class is not contained in the pre-trained model, it is often sufficient to modify the output layer of the system and re-train it with a few hundred examples of the new class. This approach also ensures the practical applicability of the technology on a smaller scale, since the labor-intensive step of data preparation is reduced to a minimum.
If a system trained in this way is applied to images not contained in the training set, it is not uncommon for the algorithm to output several bounding boxes with different probabilities per object.
This can happen, for example, if an object extends over several cells.
A simple global non-maximum suppression is clearly out of the question, since only the single box with the highest confidence in the entire image would survive this operation.
A non-maximum suppression must therefore be performed in the individual groups. But how to identify these groups algorithmically?
The concept used for this purpose in YOLO is known as Intersection Over Union (IOU).
This method divides the area of the intersection of two bounding boxes by the area of their union:
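In code, IOU for two axis-aligned boxes is only a few lines. Here boxes are given as (x_min, y_min, x_max, y_max) corner tuples:

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # partially overlapping: 1/7
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes: 0.0
```

The result is always between 0 (disjoint boxes) and 1 (identical boxes), which makes it a natural similarity measure for thresholding.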
By appropriately choosing a threshold value, this concept can be used to identify the individual groups that belong together and then determine the final bounding boxes and class probabilities via group-wise non-maximum suppression. As already mentioned, the forward path of the network has to be traversed only once per image, which makes the network suitable for almost all object detection tasks, especially real-time applications.
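Putting both ideas together, the group-wise suppression can be sketched as a greedy loop: keep the highest-confidence box, discard every box whose IOU with it exceeds the threshold (these form its group), and repeat. This is a minimal sketch of the standard greedy NMS procedure, with made-up example boxes:

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    Keep the highest-scoring box, drop all boxes overlapping it
    above the IOU threshold, then repeat with the remainder.
    Returns the indices of the surviving boxes.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two near-duplicate detections of the same object plus one distant object:
boxes = [(0.0, 0.0, 2.0, 2.0), (0.1, 0.0, 2.1, 2.0), (5.0, 5.0, 6.0, 6.0)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the duplicate (index 1) is suppressed
```

The distant box survives because its IOU with the winner is zero, which is exactly why the suppression acts group-wise rather than globally.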