NVIDIA launches open-source physical AI dataset

Wednesday, January 8, 2025

The initial release of this standardized synthetic data, expected to grow into the world's largest dataset of its kind, is now available as open source to robotics developers.
Training autonomous robots and vehicles to interact effectively with the physical world requires massive amounts of high-quality data. To give researchers and developers a head start, NVIDIA has released a large open-source dataset for building the next generation of physical AI. The commercial-grade, pre-validated dataset was announced at NVIDIA GTC, the global AI conference held in San Jose, California. It can help researchers and developers launch physical AI projects that would otherwise be difficult to start from scratch. Developers can use the dataset directly for model pre-training, testing, and validation, or use it in post-training to fine-tune world foundation models and speed up deployment.
The initial dataset can now be downloaded from Hugging Face. It offers developers 15 TB of data, including more than 320,000 robot training trajectories and up to 1,000 Universal Scene Description (OpenUSD) assets, among them a SimReady collection. Dedicated data to support end-to-end autonomous vehicle development will be released soon, consisting of 20-second clips that cover diverse traffic scenarios in more than 1,000 cities across the United States and more than 20 European countries.
The NVIDIA Physical AI Dataset includes hundreds of SimReady assets that can be used to build rich scenes.
Over time, this dataset is expected to grow into the world's largest unified, open-source dataset for physical AI development. It could support the development of AI models for a range of applications, including autonomous mobile robots that safely navigate warehouse environments, surgical assistant robots, and autonomous vehicles that can handle complex traffic scenarios such as construction zones.
The NVIDIA Physical AI Dataset is slated to include a subset of the real-world and synthetic data that NVIDIA uses to train, test, and validate physical AI for the NVIDIA Cosmos world model development platform, the NVIDIA DRIVE AV software stack, the NVIDIA Isaac AI robot development platform, and the NVIDIA Metropolis application framework for smart cities.
The Berkeley DeepDrive Center at the University of California, Berkeley, the Carnegie Mellon Safe AI Lab, and the Contextual Robotics Institute at the University of California, San Diego are among the first to use this dataset.
Henrik Christensen, director of multiple robotics and autonomous vehicle labs at the University of California, San Diego, said: "With this dataset, we can do a lot of work, such as training predictive AI models that help autonomous vehicles better track the movements of vulnerable road users such as pedestrians, thus improving safety. Compared with existing open-source resources, this dataset provides more diverse scenes and longer video clips, which will significantly advance research in robotics and autonomous vehicles."
Meeting the data needs of physical AI
The NVIDIA Physical AI Dataset can help developers scale AI performance during pre-training, where more data helps build a more robust model, and during post-training, where an AI model is trained on additional data to improve its performance for a specific use case.
Building a dataset of diverse scenarios that accurately reflects the physics and dynamics of the real world takes a significant investment of time in collecting, curating, and annotating data, which has become a bottleneck for most developers. For academic researchers and small enterprises, running a fleet of vehicles for months to collect data for autonomous vehicle AI is impractical and costly, and because most of the footage shows uneventful driving, typically only 10% of the data is used for training.
But data collection at this scale is crucial for building safe, accurate, commercial-grade models. NVIDIA Isaac GR00T robot models require thousands of hours of video clips for post-training. GR00T N1, for example, was trained on a humanoid robot dataset containing a large amount of real and synthetic data. The end-to-end AI model for NVIDIA DRIVE AV autonomous vehicles requires tens of thousands of hours of driving data to develop.
This open-source dataset contains thousands of hours of multi-view video with unprecedented scene diversity, scale, and geographic coverage. It is expected to particularly benefit safety research, especially emerging directions such as identifying abnormal behavior and evaluating model generalization. This effort contributes to NVIDIA Halos, NVIDIA's full-stack autonomous vehicle safety system.
Besides drawing on the NVIDIA Physical AI Dataset to help meet their data needs, developers can advance AI development further with tools such as NVIDIA NeMo Curator, which efficiently processes the large datasets used to train and customize models. With NeMo Curator, 20 million hours of video can be processed in just two weeks on NVIDIA Blackwell GPUs, compared with 3.4 years on an unoptimized CPU workflow.
Robotics developers can also use the new NVIDIA Isaac GR00T blueprint for synthetic motion trajectory generation, a reference workflow built on NVIDIA Omniverse and NVIDIA Cosmos. Starting from a small amount of human demonstration data, it can generate synthetic robot motion trajectories at scale.
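To get a feel for why collection becomes a bottleneck, the 10% figure can be turned into a rough estimate of how much raw footage a team would have to record. The Python sketch below is a back-of-the-envelope illustration only: the function name is hypothetical, and the 10,000-hour target is an assumed example value at the low end of "tens of thousands of hours", not a number published by NVIDIA.

```python
def raw_hours_needed(training_hours: float, usable_percent: float = 10.0) -> float:
    """Estimate the raw footage (in hours) to collect so that
    `training_hours` of it end up usable for training.

    usable_percent defaults to 10, the share quoted in the article.
    """
    if not 0 < usable_percent <= 100:
        raise ValueError("usable_percent must be in the range (0, 100]")
    return training_hours * 100 / usable_percent


if __name__ == "__main__":
    target = 10_000  # assumed example target, in hours of usable driving data
    print(f"Raw hours to collect: {raw_hours_needed(target):,.0f}")
```

Under these assumptions, every usable hour costs about ten recorded hours, which is the overhead a shared, pre-validated dataset is meant to spare smaller teams.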
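As a rough check on the scale of that difference, the two processing times can be compared directly. The Python sketch below only does that arithmetic. The function names are illustrative, and the 365.25-day year is an assumption, since the article does not say how the 3.4 years were counted.

```python
HOURS_OF_VIDEO = 20_000_000   # from the article
GPU_WEEKS = 2                 # from the article
CPU_YEARS = 3.4               # from the article
WEEKS_PER_YEAR = 365.25 / 7   # assumption: average calendar year


def speedup(cpu_years: float, gpu_weeks: float) -> float:
    """Ratio of CPU processing time to GPU processing time."""
    return cpu_years * WEEKS_PER_YEAR / gpu_weeks


def video_hours_per_day(total_hours: float, weeks: float) -> float:
    """Average hours of video processed per calendar day."""
    return total_hours / (weeks * 7)


if __name__ == "__main__":
    print(f"Approximate speedup: {speedup(CPU_YEARS, GPU_WEEKS):.1f}x")
    print(f"Video hours per day: {video_hours_per_day(HOURS_OF_VIDEO, GPU_WEEKS):,.0f}")
```

Under these assumptions the GPU workflow comes out a little under 90 times faster, processing on the order of 1.4 million hours of video per day.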
University laboratories use the dataset for AI development
The robotics labs at UC San Diego include teams focused on medical applications, humanoid robots, and in-home assistive technology. Christensen expects the robotics data in the Physical AI Dataset to help develop semantic AI models that understand the context of spaces such as homes, hotel rooms, and hospitals.
He said, "One of our core goals is to achieve deep scene understanding. If a robot is asked to put away groceries, it should know exactly which items need to be refrigerated and which belong in the freezer."
In the field of autonomous vehicles, Christensen's lab can use the dataset to train AI models that understand the intentions of different road users and predict the best response. His research team can also use the dataset to support the development of digital twins that simulate edge cases and challenging weather conditions. These simulations can be used to train and test autonomous driving models in situations that are rare in the real world.
Berkeley DeepDrive, a leading research center on AI for autonomous systems, is using this dataset to develop policy models and world foundation models for autonomous vehicles.
Wei Zhang, co-director of Berkeley DeepDrive, said: "Data diversity is very important for training foundation models. This dataset can support public and private sector teams in carrying out cutting-edge research and help them develop AI models for autonomous vehicles and robots."
Researchers at the Safe AI Lab at Carnegie Mellon University plan to use this dataset to advance their work on evaluating and certifying the safety of autonomous vehicles. The team plans to test how a physical AI foundation model trained on this dataset performs in a simulation environment with rare scenarios, and to compare its performance with that of an autonomous driving model trained on existing datasets.
Ding Zhao, associate professor and head of the Safe AI Lab at Carnegie Mellon University, said, "This dataset covers different types of roads, geographic locations, infrastructure, and weather conditions, and its diversity provides important support for training models with causal reasoning capabilities in the physical world, especially in understanding and handling edge cases and long-tail problems."
Access the NVIDIA Physical AI Dataset on Hugging Face. To build foundational knowledge, take the courses in the OpenUSD learning path and the robotics fundamentals learning path.