Justyna Toton May 17, 2024 - Artificial Intelligence

Synthetic data 101: applications, advantages, and challenges

In the tech-powered world, data is everything. Machine learning models are data-centric, which means that their performance heavily relies on the volume and quality of data used for training. Consequently, training effective models requires vast amounts of high-quality data.

Unfortunately, live data collection is often relatively expensive and time-consuming. In high-risk environments, such as mining sites, it is also dangerous. How does the industry handle data scarcity? The secret lies in artificially generated synthetic data that closely mimics real data.

Once algorithms have learned the patterns, correlations, and behaviors of real data sets, they can create new data points with the same characteristics. The resulting synthetic data is an accurate, yet far more easily obtained, alternative to natural data. At Robotec.ai, we are especially interested in the potential of synthetic data for the robotics industry.

Autonomous robots, vehicles, and machines cannot safely navigate their environment without effective perception algorithms. The catch is that training such algorithms requires huge amounts of annotated data. Once the training material is available, models can be trained for different purposes: detection, segmentation, and tracking. These algorithms are transforming industries ranging from mining and warehouse logistics to agriculture and automotive.

Our Machine Learning & Cloud team at Robotec.ai has mastered the process of synthetic data creation. They rely on the capabilities of RoSi, our simulation platform for confined spaces. The platform supports the most widely used sensors, such as lidar, camera, depth camera, and GNSS. Because it contains randomization and generation features for scenes and scenarios, it can be used to generate virtually unlimited amounts of annotated data. This is how you can leverage the power of data for safe and cost-effective automation.

How do you generate synthetic data?

  • First, you create a 3D simulation of the scene with attention to objects and their physics.
  • In the next step, you simulate sensors – lidars, cameras, and radars.
  • Next, it is time for scenario randomization: object placement/spawning and movement, as well as the trajectory of the sensor-equipped vehicle. In later steps, the simulation also has to be configured for domain adaptation. Proper randomization, together with control over object sizes and class distributions, is crucial to obtaining useful synthetic datasets.
  • Once everything is configured, you record the scenarios.
  • In the end, you convert the data to the required format, suitable for perception model training.
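
To make the randomization step more concrete, here is a minimal sketch in plain Python. It is not RoSi code: the `SpawnedObject` type, the `randomize_scenario` function, the class names, and all numeric ranges are hypothetical and chosen only for illustration. It shows the idea of drawing object classes from a controlled distribution and randomizing placement and size, with a seed so that each scenario can be reproduced.

```python
import random
from dataclasses import dataclass


@dataclass
class SpawnedObject:
    """One object placed in the scene (hypothetical structure)."""
    label: str
    x: float
    y: float
    scale: float


def randomize_scenario(seed, class_weights, n_objects,
                       area=(50.0, 50.0), scale_range=(0.8, 1.2)):
    """Draw n_objects with classes sampled according to class_weights."""
    rng = random.Random(seed)  # seeded, so the scenario is reproducible
    labels = list(class_weights)
    weights = [class_weights[label] for label in labels]
    objects = []
    for _ in range(n_objects):
        label = rng.choices(labels, weights=weights, k=1)[0]
        objects.append(SpawnedObject(
            label=label,
            x=rng.uniform(0.0, area[0]),    # random placement in the area
            y=rng.uniform(0.0, area[1]),
            scale=rng.uniform(*scale_range),  # controlled object size
        ))
    return objects


scenario = randomize_scenario(
    seed=7,
    class_weights={"tree": 0.7, "person": 0.2, "tractor": 0.1},
    n_objects=20,
)
```

Changing the seed yields a new scenario, while changing the weights shifts the class distribution of the recorded dataset.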

The datasets are provided under a Creative Commons Attribution 4.0 International Public License (“CC BY 4.0”). 


The range of industries benefiting from synthetic data is vast. Our Machine Learning Engineer Magdalena Kotynia briefly describes a few of them:

  • Autonomous vehicles – synthetic data generated from 3D simulations of autonomous vehicles equipped with a perception stack, such as lidars and cameras, can help us improve 3D detection models, which brings us closer to safer and cheaper autonomous cars.
  • Agriculture – synthetic data gathered from 3D simulations of orchards or fields is used to improve the systems and machines for crop condition monitoring and management.
  • Mining – 3D simulations of mines help plan the optimal workflow and infrastructure of the mining area. Performance estimation, safety validation, productivity assessment, and optimization are easier than ever.
  • Warehouse – we can easily and efficiently optimize the “work” of robots in warehouses using lidar and camera synthetic data gathered from 3D simulations based on ROS 2.
A screenshot from the orchard simulation including the sensor view.

Advantages of using synthetic data

Synthetic data solves the problem of data scarcity while decreasing the need for live data collection. It is essential in high-risk industries like mining. Autonomous robots must learn to navigate paths that may pose a safety threat to human workers. If certain facilities are deemed dangerous for the data-gathering team, the generation of synthetic data helps to minimize the use of natural data. What are the other advantages of synthetic data?

  • Operational continuity in facilities: no operations are disrupted to collect data when you take advantage of digital twins. It is especially important for warehouses, where any disruptions may translate to lost revenue.
  • Minimized privacy concerns: everything from logos and license plates to faces can become the subject of a legal dispute. Synthetic data eliminates the risk of infringing on someone’s right to privacy. Without the need for complicated, privacy-compliant licensing, data sharing is easier.
  • Faster, easier, and cheaper changes to the environment: synthetic data can be modified while natural data would have to be collected again to provide different results.
  • Environmentally friendly: testing vehicles in real life consumes more resources than the computing power required for simulation.

Real data problems solved by synthetic data

In the real world, you cannot fully control the environment around you. We are subjected to the whims of weather and cannot modify every single element of our surroundings. Here is how synthetic data helps us control the uncontrollable:

🔴 Weather conditions: real (natural) data is subject to natural limitations such as weather conditions. Heavy rain, fog, or snow can completely sabotage the data collection process if they scatter or absorb laser beams. In such a situation, sensors cannot map out their surroundings accurately.

➡️ How do we solve it? It is possible to simulate extreme weather conditions such as snow.

🔴 Discrepancies between different lidar systems: lidar systems may have different ranges, resolutions, and scanning patterns. Differences between them may lead to discrepancies in data quality affecting the system performance.

➡️ At Robotec.ai, we use our Robotec GPU Lidar to easily configure lidar patterns and range. With synthetic data, it is possible to build robust models with simulated outputs from various lidar systems.
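
To illustrate what configuring a lidar pattern and range can mean in practice, below is a minimal sketch in plain Python. It does not use the Robotec GPU Lidar API: the `LidarConfig` type, the `ray_directions` function, and the two example sensors are hypothetical. It assumes a simple rotating lidar whose beams are evenly spaced in elevation and azimuth, and generates one unit direction vector per beam.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LidarConfig:
    """Hypothetical description of a rotating lidar."""
    channels: int               # number of vertical beams
    vertical_fov_deg: tuple     # (lowest, highest) elevation in degrees
    horizontal_step_deg: float  # azimuth resolution in degrees
    max_range_m: float          # returns beyond this distance are lost


def ray_directions(cfg):
    """Return unit direction vectors for one full 360-degree sweep."""
    lo, hi = cfg.vertical_fov_deg
    if cfg.channels == 1:
        elevations = [(lo + hi) / 2.0]
    else:
        step = (hi - lo) / (cfg.channels - 1)
        elevations = [lo + i * step for i in range(cfg.channels)]
    n_steps = round(360.0 / cfg.horizontal_step_deg)
    rays = []
    for s in range(n_steps):
        az = math.radians(s * cfg.horizontal_step_deg)
        for el_deg in elevations:
            el = math.radians(el_deg)
            rays.append((math.cos(el) * math.cos(az),
                         math.cos(el) * math.sin(az),
                         math.sin(el)))
    return rays


# Two made-up sensors with different resolution, field of view, and range.
sparse = LidarConfig(channels=16, vertical_fov_deg=(-15.0, 15.0),
                     horizontal_step_deg=2.0, max_range_m=100.0)
dense = LidarConfig(channels=64, vertical_fov_deg=(-25.0, 15.0),
                    horizontal_step_deg=0.5, max_range_m=200.0)

print(len(ray_directions(sparse)), len(ray_directions(dense)))  # 2880 46080
```

Recording the same scene with several such configurations is one way to expose a model to the differences between lidar systems.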

🔴 Hard to obtain Ground Truth (GT): collecting high-quality GT often involves extensive fieldwork with specialized equipment or extensive preparation for different data collection scenarios. It is a complex, time-consuming, and most of all – expensive process. Higher accuracy usually comes at a higher cost. There is a trade-off between the desire for highly accurate ground truth data and the need to minimize costs.

➡️ Point cloud segmentation in simulation accurately links objects with the points they “belong” to. Consequently, it boosts the data quality.

➡️ Easily defined bounding boxes of the objects in the simulation can serve as the Ground Truth. As a result, we do not have to manually annotate objects at the level of the data frame.
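
The relationship between per-point labels and bounding boxes can be shown with a short sketch in plain Python. The `boxes_from_labeled_points` function and the object identifiers are hypothetical. Note the assumption: a simulator can read a box directly from the object itself, whereas this sketch derives an axis-aligned box from the labeled points only, so it covers just the part of the object that the sensor actually hit.

```python
from collections import defaultdict


def boxes_from_labeled_points(points):
    """Group (x, y, z, object_id) points and compute one box per object.

    Returns {object_id: ((min_x, min_y, min_z), (max_x, max_y, max_z))}.
    """
    grouped = defaultdict(list)
    for x, y, z, object_id in points:
        grouped[object_id].append((x, y, z))

    boxes = {}
    for object_id, pts in grouped.items():
        xs, ys, zs = zip(*pts)
        boxes[object_id] = ((min(xs), min(ys), min(zs)),
                            (max(xs), max(ys), max(zs)))
    return boxes


# A tiny segmented point cloud: every point already knows its object.
cloud = [
    (0.0, 0.0, 0.0, "tree_1"),
    (1.0, 2.0, 3.0, "tree_1"),
    (5.0, 5.0, 0.0, "crate_7"),
    (6.0, 4.0, 1.0, "crate_7"),
]
boxes = boxes_from_labeled_points(cloud)
```

No manual annotation step appears anywhere: the labels come for free with every simulated point.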

🔴 Imbalanced classes: some classes of objects may be underrepresented in the training set (for example, certain types of vehicles, animals, or people with disabilities). Why does it matter? It may impact the system’s performance when rare but important objects must be detected (what if a giant bear crosses the road?).

➡️ In simulation, it is possible to add many rare objects to balance the distribution of classes in the dataset.
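
As a rough illustration of the arithmetic behind class balancing, the sketch below counts how many extra instances of each class would have to be spawned to match the most frequent class. The `extra_spawns_needed` function and the example counts are made up for this example and do not come from a real dataset.

```python
from collections import Counter


def extra_spawns_needed(labels, target_per_class=None):
    """Return how many more instances each class needs to reach the target.

    By default the target is the count of the most frequent class.
    """
    counts = Counter(labels)
    if target_per_class is None:
        target_per_class = max(counts.values())
    return {cls: max(0, target_per_class - n) for cls, n in counts.items()}


# Hypothetical annotation counts from a real-world recording.
labels = ["car"] * 900 + ["pedestrian"] * 80 + ["bear"] * 2
print(extra_spawns_needed(labels))
# {'car': 0, 'pedestrian': 820, 'bear': 898}
```

Full balancing is not always the goal; passing a smaller `target_per_class` only tops up the rarest classes.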

A visual example of point-cloud segmentation.

Synthetic data challenges

Generating synthetic data comes with its own set of challenges connected to:

🔴 Proper randomization: realistic randomness is a key to effective simulation. This includes varying object placements to enhance authenticity; for instance, trees in orchards should differ in size, spacing, and species to more accurately reflect the real world.

Context-aware spawning (placing objects in contextually appropriate locations) is essential to ensure realistic scenarios. In mining, where safety is a priority, it is necessary for accurate UWB (Ultra-Wideband) signal simulations.
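
The orchard example can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the `spawn_orchard` function, the row layout, the species list, and the service track that must stay free of trees. The point is that trees vary in size, spacing, and species, yet still appear only where trees belong.

```python
import random


def spawn_orchard(seed, n_rows=4, row_length=40.0, row_gap=5.0,
                  mean_spacing=3.0, jitter=0.5, track=(18.0, 22.0)):
    """Place trees along rows with irregular spacing, keeping a track clear."""
    rng = random.Random(seed)
    trees = []
    for row in range(n_rows):
        y = row * row_gap
        x = rng.uniform(0.0, mean_spacing)  # rows do not start aligned
        while x < row_length:
            # Context rule: no trees on the (hypothetical) service track.
            if not (track[0] <= x <= track[1]):
                trees.append({
                    "x": x,
                    "y": y + rng.uniform(-0.3, 0.3),   # slightly off the row line
                    "height": rng.uniform(2.0, 4.0),   # varied size
                    "species": rng.choice(["apple", "pear"]),
                })
            # Irregular spacing; always positive because jitter < mean_spacing.
            x += mean_spacing + rng.uniform(-jitter, jitter)
    return trees


orchard = spawn_orchard(seed=3)
```

The same pattern applies to other contexts: the random draw proposes a location, and a context rule decides whether an object may appear there.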

🔴 Accurate physics: creating a realistic copy of the real world requires complex models of the overall environment that respect the physical properties of materials, terrain types, and weather conditions. Fortunately, a precise 1:1 physics simulation is not always necessary; the level of detail should be tailored to the specific requirements of the use case. That is not all: accurate sensor and vehicle models are just as important. Sensors must be simulated with their specific characteristics (for example, range or field of view), while the physical behavior of vehicles (like acceleration and interaction with surfaces) must also be modeled accurately.

🔴 Perception: simulated sensors must be similar to their physical counterparts without compromising efficiency. All sensors have a maximum detection range, which may result in incomplete information about the environment. Additionally, smooth surfaces are enemies of lidars, which measure the reflections of their laser beams off surfaces. Mirror-like reflections send the laser beams away from the sensor and result in incorrect data points. The jitter of incident angles (a slight irregular variation of angles) caused by mechanical imperfections is another thing that can impact the accuracy of the measurement. Robotec GPU Lidar allows us to create noise that imitates the noise in real data, such as noise from limited sensor range, mirror reflection (instead of diffuse reflection) produced by sensing lasers on smooth surfaces, and jitter of the incident angles. It results in a reduced difference between the real and synthetic domains.
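
The three noise sources named above can be mimicked with a deliberately simplified sketch in plain Python. This is not how Robotec GPU Lidar works internally: the `add_lidar_noise` function, the hit format, and every threshold are assumptions. In particular, the sketch simply drops beams that hit a smooth surface at a steep angle, and it applies Gaussian jitter to the beam azimuth as a stand-in for angle jitter.

```python
import random


def add_lidar_noise(hits, max_range, rng, range_sigma=0.02,
                    angle_jitter_deg=0.05, mirror_threshold_deg=60.0):
    """Degrade ideal hits so they look more like real lidar returns.

    Each hit is a dict with 'range' (m), 'azimuth_deg', 'incidence_deg'
    (angle between beam and surface normal) and 'smooth' (bool).
    """
    noisy = []
    for hit in hits:
        if hit["range"] > max_range:
            continue  # limited sensor range: no return at all
        if hit["smooth"] and hit["incidence_deg"] > mirror_threshold_deg:
            continue  # mirror-like reflection: beam never comes back
        noisy.append({
            "range": hit["range"] + rng.gauss(0.0, range_sigma),
            "azimuth_deg": hit["azimuth_deg"] + rng.gauss(0.0, angle_jitter_deg),
        })
    return noisy


ideal_hits = [
    {"range": 12.0, "azimuth_deg": 10.0, "incidence_deg": 20.0, "smooth": False},
    {"range": 150.0, "azimuth_deg": 11.0, "incidence_deg": 5.0, "smooth": False},
    {"range": 8.0, "azimuth_deg": 12.0, "incidence_deg": 75.0, "smooth": True},
]
noisy_hits = add_lidar_noise(ideal_hits, max_range=100.0, rng=random.Random(0))
```

Only the first hit survives here: the second is out of range and the third is lost to the mirror-like surface.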

🔴 Computational resources: recording synthetic data scenes may take a lot of time and computational power. Fortunately, this process can be significantly accelerated by leveraging services like AWS Batch.

At Robotec.ai, our Machine Learning & Cloud team collaborates with RGS, 3D modeling, and Simulation Engines teams to effectively tackle these challenges.

Synthetic data = real possibilities

The generation of synthetic data opens up new possibilities for many industries. In high-risk facilities, where live data collection raises safety concerns, synthetic data is a safer path to effective automation. At Robotec.ai, we assist companies with safe & human-centered automation powered by synthetic data. The team lead of our Machine Learning & Cloud department, Bartłomiej Boczek, recorded a special video that briefly introduces synthetic data. It is short but packed with information. Show it to your colleagues to discuss the potential of synthetic data for your company’s needs!


The article was created in collaboration with the Machine Learning & Cloud department. Special thanks to:
✦ Bartłomiej Boczek, Machine Learning & Cloud team lead, Robotec.ai
✦ Magdalena Kotynia, Machine Learning Engineer, Robotec.ai
✦ Maciej Majek, Machine Learning Engineer, Robotec.ai

🔴 Curious to see what we could do together? Contact us at office@robotec.ai! 

🔴 Interested in our latest projects? Follow us on LinkedIn to always stay updated!

