This chapter explains how to use object detection based on deep learning.
With object detection we want to find the different instances in an image and assign them to a class. The instances can partially overlap and still be distinguished as distinct. This is illustrated in the following schema.
Object detection leads to two different tasks: finding the instances and classifying them. In order to do so, we use a combined network consisting of three main parts. The first part, called backbone, consists of a pretrained classification network. Its task is to generate various feature maps, so the classifying layer is removed. These feature maps encode different kinds of information at different scales, depending on how deep they are in the network, see also the chapter Deep Learning. Thereby, feature maps with the same width and height are said to belong to the same level. In the second part, we take feature maps of different levels and combine them. As a result we obtain feature maps containing information of lower and higher levels. These are the feature maps we will use in the third part. This second part is also called feature pyramid and together with the first part it constitutes the feature pyramid network. The third part consists of additional networks, which get the selected feature maps as input and learn how to localize and classify potential objects. Additionally, this third part includes the reduction of overlapping predicted bounding boxes. An overview of the three parts is shown in the following figure.
Let us have a look at what happens in this third part.
In object detection, the location in the image of an instance is given by a rectangular, axis-parallel bounding box.
Hence, the first task is to find a suitable bounding box for every single instance.
To do so, the network generates reference bounding boxes and learns how to modify them so that they fit the instances as well as possible.
While the bounding boxes enclosing instances are all rectangular, they may have different sizes and aspect ratios.
Thus, the network has to learn where such bounding boxes may be located and which shape they may have.
Within the approach taken in HALCON, the network proposes for every pixel of every feature map of the feature pyramid a set of reference bounding boxes. The shape of those boxes is affected by the parameter 'aspect_ratios' and their size by the parameter 'num_subscales', see the illustration below and get_dl_model_param. In this way, 'aspect_ratios' times 'num_subscales' reference bounding boxes are generated for every pixel mentioned before. These reference bounding boxes are the base positions of potential objects.
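For example, with the illustrative values 'aspect_ratios' = [1.0, 0.5, 2.0] and 'num_subscales' = 3, the network proposes 3 · 3 = 9 reference bounding boxes for every pixel of every selected feature map.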
The network predicts offsets that describe how to modify the reference bounding boxes in order to obtain bounding boxes fitting the potential instances better. It learns this by comparing the proposed bounding boxes with the corresponding ground truth bounding boxes, thus with the information where in the image the single instances are found. An illustration is shown in the figure below. The better these reference bounding boxes represent the shapes of the different ground truth bounding boxes, the easier the network can learn them.
As mentioned before, feature maps of different levels are used.
Depending on the size of your instances in comparison to the total image, it can be beneficial to include or exclude early feature maps (where the feature map is not very compressed and therefore small features are still visible) and deeper feature maps (where the feature map is very compressed and only large features are visible).
This can be controlled by the parameters 'min_level' and 'max_level', which determine the levels of the feature pyramid.
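As a sketch, these architecture parameters can be collected in the generic parameter dictionary that is later passed to create_dl_model_detection; all values below are purely illustrative:

    * Purely illustrative values for the reference boxes and the pyramid levels.
    create_dict (DLModelDetectionParam)
    set_dict_tuple (DLModelDetectionParam, 'aspect_ratios', [1.0, 0.5, 2.0])
    set_dict_tuple (DLModelDetectionParam, 'num_subscales', 3)
    set_dict_tuple (DLModelDetectionParam, 'min_level', 2)
    set_dict_tuple (DLModelDetectionParam, 'max_level', 4)
    * Pass DLModelDetectionParam to create_dl_model_detection, see the workflow below.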
With these bounding boxes we have the localization of a potential instance,
but the instance is not classified yet.
Hence, the second task consists of classifying the content of the image
part within the bounding boxes. For more information about classification
in general, see the chapter Deep Learning / Classification and the
“Solution Guide on Classification”.
Most probably the network will find several promising bounding boxes for a single object. The reduction of overlapping predicted bounding boxes is done by non-maximum suppression, which is controlled by the parameters 'max_overlap' and 'max_overlap_class_agnostic' and set using set_dl_model_param.
An illustration is given in the figure below.
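As a sketch, both thresholds can be adjusted on an existing model (the values below are illustrative):

    * Illustrative thresholds for the non-maximum suppression.
    set_dl_model_param (DLModelHandle, 'max_overlap', 0.5)
    set_dl_model_param (DLModelHandle, 'max_overlap_class_agnostic', 0.7)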
As output you get bounding boxes proposing possible localizations of objects, together with confidence values expressing the affinity of the respective image part to one of the classes.
In HALCON, object detection with deep learning is implemented within the more general deep learning model. For more information on the latter, see the chapter Deep Learning / Model.
The following sections are introductions to the general workflow needed for object detection, information related to the involved data and parameters, and explanations of the evaluation measures.
In this paragraph, we describe the general workflow for an object
detection task based on deep learning.
Thereby we assume that your dataset is already labeled; see also the section
“Data” below.
Have a look at the HDevelop example series
detect_pills_deep_learning
for an application.
Note that this example is split into the four parts 'Preparation',
'Training', 'Evaluation', and 'Inference', which give guidance on
possible implementations.
This part covers the creation of a DL object detection model and the
adaptation of the data for this model.
The single steps are also shown in the HDevelop example
detect_pills_deep_learning_1_prepare.hdev
.
Create a model using the operator create_dl_model_detection. Thereby you will have to specify at least the backbone and the number of classes to be distinguished. Further parameters can be set over the dictionary DLModelDetectionParam. Their values should be well chosen for the specific task, not least to possibly reduce memory consumption and runtime. See the operator documentation for more information.
Note that after the creation of the model, its underlying network architecture is fixed to the specified input values. As a result the operator returns a handle 'DLModelHandle'.
Alternatively you can also use read_dl_model to read in a model you have already saved with write_dl_model.
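A minimal HDevelop sketch of this step; the backbone name, the number of classes, and the file name are illustrative, see the reference of create_dl_model_detection for the exact signature:

    * Create a detection model with, e.g., 3 classes on a pretrained backbone.
    * Empty dictionary, i.e., default parameters; see the sketch above for setting them.
    create_dict (DLModelDetectionParam)
    create_dl_model_detection ('pretrained_dl_classifier_compact.hdl', 3, DLModelDetectionParam, DLModelHandle)
    * Optionally store the model and read it in again later.
    write_dl_model (DLModelHandle, 'model_detection.hdl')
    read_dl_model ('model_detection.hdl', DLModelHandle)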
The information about what is to be found in which image of your training dataset needs to be read in and converted. This is done by the procedure read_dl_dataset_from_coco. Thereby a dictionary DLDataset is created, which serves as a database and stores all necessary information about your data.
For more information about the data and the way it is transferred, see the section “Data” below and the chapter Deep Learning / Model.
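A sketch of this step; the file and directory names are placeholders and the argument list shown here is an assumption, see the documentation of read_dl_dataset_from_coco:

    * Read a COCO-style annotation file and build the DLDataset dictionary.
    * (File name, image directory, and argument order are assumptions.)
    create_dict (GenParam)
    read_dl_dataset_from_coco ('pill_annotations.json', 'pill_images', GenParam, DLDataset)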
Split the dataset represented by the dictionary DLDataset. This can be done using the procedure split_dl_dataset. The resulting split will be saved over the key 'split' in each sample entry of DLDataset.
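A sketch of this step; the percentages are illustrative and the argument list is an assumption, see the documentation of split_dl_dataset:

    * Split into 70 % training, 15 % validation, and the remaining 15 % test data.
    create_dict (GenParamSplit)
    split_dl_dataset (DLDataset, 70, 15, GenParamSplit)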
The network imposes requirements on the images, as e.g., the image width and height. You can retrieve every single value using the operator get_dl_model_param, or you can retrieve all necessary parameters using the procedure create_dl_preprocess_param_from_model.
Now you can preprocess your dataset. For this, you can use the procedure preprocess_dl_dataset.
In case of custom preprocessing, this procedure offers guidance on the implementation. We recommend preprocessing and storing all images used for the training before starting the training, since this speeds up the training significantly.
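A sketch of these two steps; the directory name is a placeholder and the argument lists are assumptions, see the documentation of the two procedures:

    * Derive the preprocessing parameters from the model (argument list assumed).
    create_dl_preprocess_param_from_model (DLModelHandle, 'none', 'full_domain', [], [], [], DLPreprocessParam)
    * Preprocess the whole dataset and store the result on disk (argument list assumed).
    create_dict (GenParamPreprocess)
    preprocess_dl_dataset (DLDataset, 'detect_pills_data', DLPreprocessParam, GenParamPreprocess, DLDatasetFileName)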
To visualize the preprocessed data, the procedure
dev_display_dl_data
is available.
This part covers the training of a DL object detection model.
The single steps are also shown in the HDevelop example
detect_pills_deep_learning_2_train.hdev
.
Set the training parameters and store them in the dictionary TrainingParam.
These parameters include:
the hyperparameters, for an overview see the section “Model Parameters and Hyperparameters” below and the chapter Deep Learning.
parameters for possible data augmentation
Train the model. This can be done using the procedure
train_dl_model
.
The procedure expects:
the model handle DLDetectionHandle
the dictionary with the data information DLDataset
the dictionary with the training parameters TrainingParam
the information over how many epochs the training shall run.
During the training you should see how the total loss decreases.
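A sketch of this step; the hyperparameter values are illustrative, and the content of TrainingParam as well as the argument list of train_dl_model are assumptions derived from the inputs listed above, see the procedure documentation:

    * Set central hyperparameters directly on the model (values illustrative).
    set_dl_model_param (DLModelHandle, 'batch_size', 2)
    set_dl_model_param (DLModelHandle, 'learning_rate', 0.001)
    * Collect further training parameters, e.g., for data augmentation.
    create_dict (TrainingParam)
    * Train for, e.g., 60 epochs (argument order and outputs assumed).
    train_dl_model (DLModelHandle, DLDataset, TrainingParam, 60, TrainResults)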
In this part we evaluate the object detection model.
The single steps are also shown in the HDevelop example
detect_pills_deep_learning_3_evaluate.hdev
.
Set the model parameters which may influence the evaluation.
The evaluation can conveniently be done using the procedure evaluate_dl_model.
This procedure expects a dictionary GenParamEval with the evaluation parameters. Set the parameter 'detailed_evaluation' to 'true' to get the data necessary for the visualization.
You can visualize your evaluation results using the procedure
dev_display_detection_detailed_evaluation
.
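A sketch of this step; the argument list of evaluate_dl_model is an assumption, see the procedure documentation:

    * Request the detailed evaluation in order to visualize the results afterwards.
    create_dict (GenParamEval)
    set_dict_tuple (GenParamEval, 'detailed_evaluation', true)
    * Evaluate the model, e.g., on the samples of the test split (arguments assumed).
    evaluate_dl_model (DLDataset, DLModelHandle, 'split', 'test', GenParamEval, EvaluationResult, EvalParams)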
This part covers the application of a DL object detection model.
The single steps are also shown in the HDevelop example
detect_pills_deep_learning_4_infer.hdev
.
Request the requirements the network imposes on the images using the operator get_dl_model_param or the procedure create_dl_preprocess_param_from_model.
Set the model parameters described in the section “Model Parameters and Hyperparameters” below, using the operator set_dl_model_param. Thereby, one should set the 'batch_size' according to the number of images to be inferred.
Generate a data dictionary DLSample for each image. This can be done using the procedure gen_dl_samples_from_images.
Every image has to be preprocessed as done for the training. For this, you can use the procedure
preprocess_dl_samples
.
Apply the model using the operator apply_dl_model.
Retrieve the results from the dictionary 'DLResultBatch'.
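These inference steps can be sketched as follows; the image name is a placeholder and the procedure signatures are assumptions:

    * Adapt the batch size to the number of images inferred at once.
    set_dl_model_param (DLModelHandle, 'batch_size', 1)
    * Read an image, wrap it in a DLSample, and preprocess it as in training.
    read_image (Image, 'pill_image_01')
    gen_dl_samples_from_images (Image, DLSampleBatch)
    preprocess_dl_samples (DLSampleBatch, DLPreprocessParam)
    * Apply the model and access the result dictionary of the first sample.
    apply_dl_model (DLModelHandle, DLSampleBatch, [], DLResultBatch)
    DLResult := DLResultBatch[0]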
We distinguish between data used for training and evaluation, consisting of images with information about their instances, and data for inference, which consists of bare images. For the former, you provide the information defining for each instance to which class it belongs and where it is in the image (via its bounding box).
As a basic concept, the model handles data over dictionaries, meaning it receives the input data over a dictionary DLSample and returns a dictionary DLResult and DLTrainResult, respectively. More information on the data handling can be found in the chapter Deep Learning / Model.
The dataset consists of images and corresponding information. They have to be provided in a way the model can process them. Concerning the image requirements, find more information in the section “Images” below.
The training data is used to train and evaluate a network for your specific task. With the aid of this data the network can learn which classes are to be distinguished, what such examples look like, and how to find them.
The necessary information is provided by indicating for each object in every image to which class this object belongs and where it is located. This is done by providing a class label and an enclosing axis-aligned bounding box for every object.
There are different possible ways to store and retrieve this information. How the data has to be formatted in HALCON for a DL model is explained in the chapter Deep Learning / Model.
To format your data accordingly in the case of object detection, you most conveniently use the procedure read_dl_dataset_from_coco. It reads the standard COCO data format and creates a dictionary DLDataset. The latter works as a database for the information needed. For further information on the needed part of the COCO data format, please refer to the documentation of the procedure.
You also want enough training data to split it into three subsets, used for training, validation and testing the network. These subsets are preferably independent and identically distributed, see the section “Data” in the chapter Deep Learning.
Note that in object detection the network has to learn how to find possible locations and sizes of the instances. That is why the instance locations and sizes that are important later also need to appear representatively in your training dataset.
Regardless of the application, the network poses requirements on the images regarding, e.g., the image dimensions.
The specific values depend on the network itself and can be queried with get_dl_model_param.
In order to fulfill these requirements, you may have to preprocess your images.
Standard preprocessing of the entire dataset and therewith also the images is implemented in preprocess_dl_dataset and in preprocess_dl_samples for a single sample, respectively.
In case of custom preprocessing these procedures offer guidance on the
implementation.
As training output, the operator train_dl_model_batch will return a dictionary DLTrainResult with the current value of the total loss as well as values for all other losses included in your model.
As inference and evaluation output, the operator apply_dl_model will return a dictionary DLResult for every image.
For object detection, this dictionary will include for every detected instance its bounding box and the confidence value of the assigned class. Thereby several instances may be detected for the same object in the image, see the explanation of the non-maximum suppression above.
The resulting bounding boxes are determined over the top left corner (bbox_row1, bbox_col1) and the bottom right corner (bbox_row2, bbox_col2), given in pixel centered, sub-pixel accurate coordinates.
For more information on the coordinate system, see the chapter Transformations / 2D Transformations.
Further information on the output dictionary can be found in the chapter Deep Learning / Model.
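As a sketch, the bounding box coordinates can be read from the result dictionary with the keys named above; the key names used here for the class and the confidence are assumptions:

    * Read the bounding box corners of all detected instances from DLResult.
    get_dict_tuple (DLResult, 'bbox_row1', BboxRow1)
    get_dict_tuple (DLResult, 'bbox_col1', BboxCol1)
    get_dict_tuple (DLResult, 'bbox_row2', BboxRow2)
    get_dict_tuple (DLResult, 'bbox_col2', BboxCol2)
    * Class and confidence of each instance (key names assumed).
    get_dict_tuple (DLResult, 'bbox_class_id', BboxClassID)
    get_dict_tuple (DLResult, 'bbox_confidence', BboxConfidence)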
Next to the general DL hyperparameters explained in Deep Learning, there is a further hyperparameter relevant for object detection: 'class_weights'. For further information, see the documentation of the operator get_dl_model_param.
The hyperparameters are set using set_dl_model_param.
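As a sketch, this hyperparameter is set like any other model parameter; the format of the value shown here (one weight per class, values illustrative) is an assumption, see set_dl_model_param:

    * Weight the classes, e.g., to counter class imbalance (value format assumed).
    set_dl_model_param (DLModelHandle, 'class_weights', [1.0, 2.0, 1.0])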
For an object detection model, there are two different types of model parameters:
Parameters defining your architecture. They cannot be changed anymore once your model is created. These parameters are all set using the operator create_dl_model_detection when creating your model.
Parameters influencing your evaluation results. Those relevant only for object detection are 'max_num_detections', 'max_overlap', 'max_overlap_class_agnostic', and 'min_confidence'. They are explained in more detail in get_dl_model_param. To set them you can use create_dl_model_detection when creating your model or set_dl_model_param afterwards.
For object detection, the following evaluation measures are supported in HALCON. Note that for computing such a measure for an image, the related ground truth information is needed.
Mean average precision, mAP, and average precision (AP) of a class for an IoU threshold, ap_iou_classname
The AP value is an average of maximum precisions at different recall values. In simple words, it tells us whether the objects predicted for this class are generally correct detections or not. Thereby we pay more attention to the predictions with high confidence values. The higher the value, the better.
To count a prediction as a hit, we require both its top-1 classification and its localization to be correct. The measure telling us the correctness of the localization is the intersection over union, IoU: an instance is localized correctly if the IoU is higher than the demanded threshold. The IoU is explained in more detail below. For this reason, the AP value depends on the class and on the IoU threshold.
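As an illustrative example: a predicted bounding box of 100 pixels that overlaps its ground truth bounding box of 100 pixels on 60 pixels has an IoU of 60 / (100 + 100 − 60) ≈ 0.43 and therefore does not count as a hit for an IoU threshold of 0.5.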
You can obtain the specific AP values, the averages over the classes, the averages over the IoU thresholds, and the average over both, the classes and the IoU thresholds. The latter one is the mean average precision, mAP, a measure to tell us how well instances are found and classified.
True Positives, False Positives, False Negatives
The concept of true positives, false positives, and false negatives is explained in Deep Learning. It applies to object detection with the exception that there are different kinds of false positives, as e.g.:
An instance got classified wrongly.
An instance was found where there is only background.
An instance was localized badly, meaning the IoU between the instance and its ground truth is lower than the evaluation IoU threshold.
There is a duplicate, meaning at least two predicted instances overlap mainly with the same ground truth bounding box, but they do not overlap more than 'max_overlap' with each other, so none of them got suppressed.
Note that these values are only available from the detailed evaluation. This means, in evaluate_dl_model the parameter 'detailed_evaluation' has to be set to 'true'.