When I started experimenting with machine learning, almost all the tutorials I followed used one of the few compact standard datasets available. That was understandable, as the datasets had to be well documented with reliable results. They also had to be small enough to allow a wide range of computer systems to train the models within the lifetime of the scholar. The tutorials emphasized, however, that for a practical example or production use, the number of images required to train a model is measured not in thousands but in tens of thousands. Generating such a dataset by hand means months of work, a herculean task.
I started planning to automate the process of image generation when I realized that the quality and performance of an AI model were almost entirely dependent on the dataset used during training. Improvements in model performance were usually achieved by modifying the dataset and repeating the training. The more you change and the more you train, the more you learn about the strengths and weaknesses of your model, and the better the model gets. To speed this process up, image generation had to be automated, and synthetic images had to be, at least for some use cases, the better solution.
The ideal pipeline
An ideal pipeline should generate isolated images from geometry data, material settings, camera settings, light settings, and animations. I call these collections of images, along with their associated image classes, “imagesets”. Material, camera, and light settings are usually known, or can at least be roughly estimated, at the start of the model design process: spotlight or more ambient light, the angle of view of the camera lens, and so on. These parameters can be fine-tuned to determine how realistic the synthetic images appear to us, that is, how they compare to the real object in its detection environment. These parameters are important. However, as they can be modified even after the first models have been trained, we do not have to spend much time finding the right light settings at this stage. The real magic of synthetic images starts with the animation settings.
The animation settings define the number of views of the object generated, the distance of the object from the camera and its angle of rotation (z, a, b and c). If the object has an A-side and a B-side that we require as two separate image classes, we define two separate animations. This is another advantage of synthetic images: the resulting model will have zero bias between the object sides and image classes.
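As a rough illustration of what an animation definition could look like, the sketch below enumerates the views for two separate animations. It assumes that z is the camera distance and that a, b and c are rotation angles in degrees. The names `Pose`, `animation_poses` and `evenly_spaced`, and all numeric values, are hypothetical and not taken from any particular rendering tool.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Pose:
    """One rendered view: camera distance plus three rotation angles (degrees)."""
    z: float
    a: float
    b: float
    c: float


def evenly_spaced(start, stop, count):
    """Return `count` values from start (inclusive) to stop (exclusive)."""
    step = (stop - start) / count
    return [start + i * step for i in range(count)]


def animation_poses(distances, a_angles, b_angles, c_angles):
    """Return every combination of distance and rotation as a list of poses."""
    return [
        Pose(z, a, b, c)
        for z, a, b, c in product(distances, a_angles, b_angles, c_angles)
    ]


# Hypothetical settings: one animation per image class (A-side and B-side).
# Assumption: the B-side is reached by turning the object 180 degrees about b.
side_a = animation_poses([0.5, 1.0], evenly_spaced(0, 360, 12), [0.0, 15.0], [0.0])
side_b = animation_poses([0.5, 1.0], evenly_spaced(0, 360, 12), [180.0, 195.0], [0.0])

# Both animations use the same grid, so each class gets the same number of views.
print(len(side_a), len(side_b))
```

Because both animations are built from the same grid of distances and angles, each image class receives the same number of views, which is one simple way to keep the two classes balanced.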
I continued the model generation by merging several hundred isolated images into carefully selected background images to create a dataset. While an isolated image is merged into a background, it can be subjected to a number of transformations. These increase the number of image permutations, which enhances the robustness of the model.
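The merging step can be sketched as follows. This is a minimal sketch that only plans the placement and computes the label for each merged image; the actual pixel compositing is left to an image library. It assumes the transformations are a random scale and a random position, and that labels are stored as normalized center/size boxes. Both are assumptions, as are the function names and file names.

```python
import random


def place_object(bg_size, obj_size, rng, scale_range=(0.5, 1.0)):
    """Pick a random scale and position that keep the object inside the background.

    Returns the pixel box (left, top, width, height) of the pasted object.
    """
    bg_w, bg_h = bg_size
    obj_w, obj_h = obj_size
    scale = rng.uniform(*scale_range)
    w = max(1, round(obj_w * scale))
    h = max(1, round(obj_h * scale))
    if w > bg_w or h > bg_h:
        raise ValueError("scaled object does not fit into the background")
    left = rng.randint(0, bg_w - w)
    top = rng.randint(0, bg_h - h)
    return left, top, w, h


def normalized_box(box, bg_size):
    """Convert a pixel box to (x_center, y_center, width, height) in the 0..1 range."""
    left, top, w, h = box
    bg_w, bg_h = bg_size
    return ((left + w / 2) / bg_w, (top + h / 2) / bg_h, w / bg_w, h / bg_h)


def build_samples(backgrounds, isolated, per_pair, seed=0):
    """Pair every isolated image with every background `per_pair` times."""
    rng = random.Random(seed)  # fixed seed so a dataset can be regenerated
    samples = []
    for bg_name, bg_size in backgrounds.items():
        for obj_name, (obj_size, class_id) in isolated.items():
            for _ in range(per_pair):
                box = place_object(bg_size, obj_size, rng)
                label = normalized_box(box, bg_size)
                samples.append((bg_name, obj_name, class_id, label))
    return samples


# Hypothetical file names, image sizes (width, height) and class ids.
backgrounds = {"bench.jpg": (640, 480), "floor.jpg": (800, 600)}
isolated = {
    "hammer_a_000.png": ((200, 120), 0),
    "hammer_b_000.png": ((200, 120), 1),
}
samples = build_samples(backgrounds, isolated, per_pair=3)
print(len(samples))  # 2 backgrounds x 2 isolated images x 3 placements = 12
```

Under this pairing scheme the dataset size is the product of the number of isolated images, backgrounds and placements per pair, so each of the three settings scales the dataset directly.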
As soon as a dataset is completed, it can be scheduled for training. Ten minutes or ten hours later the resulting model can be tested. By modifying animations, transformations and the number of backgrounds used, one can directly influence training time and therefore the number of training iterations possible per day. This is another real benefit of synthetic images. Start with a lightweight fast model with known limitations but fast training times. Modify, train, test, and document until sufficient, then remove the limitations and roll out your completed model.
Note that datasets can be trained and tested separately and merged into multiclass models at a later date. This simplifies the training and testing process. One model can be designed to detect a human hand and a second model to detect a tool, for example a hammer. Their datasets can then be merged into one, allowing a system to monitor the use of a tool during an assembly process.
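A minimal sketch of such a merge is shown below. It assumes that the labeled datasets are what gets combined, and that each sample carries a class id local to its own dataset. The function name, the sample layout and the file names are hypothetical.

```python
def merge_datasets(datasets):
    """Merge separately built datasets into one multiclass dataset.

    `datasets` maps a dataset name to (class_names, samples), where each
    sample is (image_name, local_class_id, box). Class ids are shifted by an
    offset so that they stay unique in the merged dataset.
    """
    merged_classes = []
    merged_samples = []
    for name, (class_names, samples) in datasets.items():
        offset = len(merged_classes)
        merged_classes.extend(class_names)
        for image_name, class_id, box in samples:
            # Prefix the image name to avoid clashes between datasets.
            merged_samples.append((f"{name}/{image_name}", class_id + offset, box))
    return merged_classes, merged_samples


# Hypothetical single-purpose datasets: one for a hand, one for a hammer.
hand = (["hand"], [("0001.jpg", 0, (0.5, 0.5, 0.2, 0.3))])
hammer = (
    ["hammer_a", "hammer_b"],
    [("0001.jpg", 0, (0.4, 0.6, 0.3, 0.1)), ("0002.jpg", 1, (0.7, 0.3, 0.3, 0.1))],
)

classes, samples = merge_datasets({"hand": hand, "hammer": hammer})
print(classes)  # ['hand', 'hammer_a', 'hammer_b']
```

Shifting the class ids rather than renaming classes keeps each source dataset unchanged, so it can still be trained and tested on its own.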
Synthetic images are not just an alternative method of training object detection models. They enrich the training process, giving the model designer more variables to adjust when exploring and testing the model. They accelerate model design and, consequently, the learning curve of the model designer. After two years of working with machine learning, I have come to the conclusion that synthetic images have become a catalyst for object detection with machine learning.