Terra: Autonomous Driving Fundamental World Model

Jun 8, 2024 /

pxiaoer

Preface:

Terra is a foundational world model for autonomous driving being developed by JoinAI. It can provide high-quality and diverse data generation for autonomous driving models, including long-tail data that is difficult to collect in the real world. This article starts with the background and development of autonomous driving and end-to-end driving, then introduces the design concept of the Terra model and what Terra can bring to autonomous driving. We hope you enjoy it.

Background on Autonomous Driving

Autonomous driving technology has made significant progress in recent years. From initial driver assistance systems to today's advanced autonomous driving features, this technology is steadily advancing towards the goal of fully autonomous driving. In this process, deep learning has played a key role, especially with the rise of large models in recent years, bringing new directions for the development of autonomous driving.

Looking back at the evolution of autonomous driving technology, we can see several important technical milestones:

2022 was the year BEV (Bird's Eye View) technology went into application. BEV simplifies the vehicle's perception and understanding of its surroundings by representing the scene as if viewed vertically downward from directly above.
2023 was the year OCC (Occupancy Network) began to be applied in urban scenarios. This technology divides the world into tiny cubes, or voxels, and predicts whether each voxel is occupied. This provides more precise object shape information and helps vehicles understand complex scenes and avoid collisions.
In 2024, the industry expects to see end-to-end large models applied: every module becomes a neural network, raw data is fed directly into the neural network system, and the system directly outputs driving instructions.
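
To make the voxel idea concrete, here is a minimal sketch of the occupancy representation itself. It is not an occupancy network: a real network predicts this grid from camera images, whereas the sketch simply marks voxels that contain known 3D points. The function names and the 0.5 m voxel size are assumptions chosen for illustration.

```python
import math
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxel_index(point: Point, voxel_size: float) -> Voxel:
    """Return the integer index of the voxel that contains the point."""
    x, y, z = point
    return (
        math.floor(x / voxel_size),
        math.floor(y / voxel_size),
        math.floor(z / voxel_size),
    )


def voxelize(points: Iterable[Point], voxel_size: float = 0.5) -> Set[Voxel]:
    """Build a sparse occupancy grid: the set of voxels holding at least one point."""
    return {voxel_index(p, voxel_size) for p in points}


def is_occupied(grid: Set[Voxel], point: Point, voxel_size: float = 0.5) -> bool:
    """Check whether the voxel containing the point is marked occupied."""
    return voxel_index(point, voxel_size) in grid


# Hypothetical points, in meters, relative to the vehicle.
points = [(0.1, 0.2, 0.3), (0.4, 0.4, 0.4), (1.2, -0.1, 0.0)]
grid = voxelize(points)
```

The first two points fall into the same voxel, so the grid holds two occupied voxels. A planner can then query `is_occupied` for any location along a candidate path, regardless of what kind of object occupies the space.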

Tesla FSD's Technological Development

In the development trend of autonomous driving, Tesla has always been at the forefront. Since 2021, Tesla has been continuously simplifying the architecture of its autonomous driving system, pursuing "a simpler, more universal, and lower-cost architecture to achieve stronger scalability". In 2021, Tesla introduced BEV technology based on Transformer, processing information from 8 cameras under a Transformer backbone, solving the problem of information loss due to image stitching in pure BEV perspective. In 2022, to further address issues such as object height and occlusion, Tesla introduced occupancy network technology.

In early 2023, Tesla began exploring more end-to-end autonomous driving solutions. The question was whether, given enough data, a foundation model could be trained with enough generalization capability to handle all scenarios. Based on this idea, Tesla began replacing more rules with neural networks and further simplifying the model architecture.

Currently, Tesla's latest FSD Beta V12.3 version is moving towards a completely end-to-end direction. Although specific architectural details have not been disclosed, it is generally believed in the industry that compared to earlier versions, the V12 series uses more neural networks to replace rules in modules such as perception, prediction, and planning. Tesla's goal is to achieve an architecture of "unified backbone network for perception, prediction, and planning, more thoroughly end-to-end".

World Models

Tesla is also actively exploring the application of "world models". Tesla defines a world model as "a universal model that can understand and predict various complex situations in the real world". Yann LeCun believes that world models are closer to AGI, and that true artificial intelligence systems should be able to learn how the world works and generate content based on this understanding.

At the CVPR conference in 2023, Tesla researchers introduced their exploration of world models. They believe that occupancy network technology is an important foundation for building world models, and world models can be applied not only to autonomous driving but also to fields such as robotics. The end-to-end generative model being developed by Tesla is described as a universal model capable of understanding and predicting various complex situations in the real world.

From public information, Tesla appears to have used its data flywheel to build world models, and has mentioned that world models helped with corner case data and with data where human and model decisions differ.

Multimodal Large Models

Meanwhile, breakthroughs in the field of multimodal large models by companies like OpenAI have also brought new ideas to the development of autonomous driving technology.

The Sora model released by OpenAI demonstrates AI's amazing ability in video generation, which could potentially be applied to simulation and prediction of autonomous driving scenarios.

However, Sora currently has some limitations in understanding and simulating the laws of motion in the physical world, which also reflects the challenges faced by current AI models in fully understanding and simulating the real world. Nevertheless, OpenAI is also investing in embodied intelligence companies like Figure to obtain sufficient data, aiming to use better multimodal large models to improve the simulation and understanding capabilities of the physical world.

Elon Musk's xAI announced a preview of Grok-1.5V, which is also believed to be involved in the development of Tesla's end-to-end autonomous driving system. End-to-end intelligent driving driven by large models is considered a smarter and more powerful solution, and multimodal large models are expected to provide stronger reasoning, decision-making, and interaction capabilities for intelligent driving systems.

Example from the official xAI Grok-1.5V website

Future Trends in Autonomous Driving

Wayve, a British autonomous driving company, takes an approach similar to Tesla's research on end-to-end intelligent driving models, and has conducted relatively in-depth research on end-to-end driving and embodied intelligence.

Both companies are trying to represent the world in a structured way and predict the future through models, turning the autonomous driving decision problem into predicting the next frame of video.

Wayve's GAIA-1 uses a Transformer + world model architecture, predicting the next image token based on past image, text, and action tokens. Although Tesla has not disclosed its specific model architecture, its job postings show that it is exploring various generative model architectures, including diffusion models, VAEs, autoregressive models, and GANs.
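
The next-token idea can be sketched in a few lines. This is an illustration of autoregressive rollout only, not GAIA-1's implementation: `toy_predict_next` is a hypothetical stand-in for a trained Transformer, the vocabulary size is made up, and concatenating text, action, and image tokens into one context is a simplification of how a real model orders its modalities.

```python
from typing import Callable, List, Sequence

VOCAB_SIZE = 16  # hypothetical size of the image-token vocabulary


def toy_predict_next(context: Sequence[int]) -> int:
    """Stand-in for a trained model: derive a token id from the last three tokens."""
    return sum(context[-3:]) % VOCAB_SIZE


def rollout(
    predict_next: Callable[[Sequence[int]], int],
    text_tokens: Sequence[int],
    action_tokens: Sequence[int],
    image_tokens: Sequence[int],
    n_new: int,
) -> List[int]:
    """Generate n_new image tokens, feeding each prediction back into the context."""
    context = list(text_tokens) + list(action_tokens) + list(image_tokens)
    generated: List[int] = []
    for _ in range(n_new):
        next_token = predict_next(context)
        generated.append(next_token)
        context.append(next_token)  # the prediction conditions the next step
    return generated


future_tokens = rollout(
    toy_predict_next,
    text_tokens=[1, 2],
    action_tokens=[3],
    image_tokens=[4, 5],
    n_new=3,
)
```

In a real system the generated image tokens would be decoded back into video frames. The key property shown here is that each predicted token is appended to the context, so the model's own output shapes what it predicts next.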

Recently, Wayve also announced its 4D scene reconstruction model PRISM-1. Together with the multimodal intelligent driving large model LINGO-2 and the data generation world model GAIA-1, which was announced last year, it ties together Wayve's entire data-driven end-to-end intelligent driving solution.

Wayve PRISM-1 video demonstration

The development of autonomous driving technology may currently follow two parallel paths. First, companies like OpenAI continue to advance multimodal foundation models, enhancing AI's ability to understand and simulate the world. Second, autonomous driving companies like Tesla use the massive data accumulated in real road environments to develop more vertical world models and end-to-end autonomous driving systems. The intersection and fusion of these two paths may bring about a qualitative leap in autonomous driving technology. At present, Tesla is leading the way.

Challenges in Autonomous Driving

Despite the significant developments in autonomous driving technology, fully autonomous driving still faces many challenges.

First is the issue of data. Although companies like Tesla have a large amount of actual road data, how to effectively use this data to train intelligent driving models remains a complex problem.

End-to-end models have a very strong demand for high-quality data, with the main requirement being high-quality video data, including various long-tail scenario data such as oncoming vehicles, crossing motor vehicles, pedestrians suddenly appearing, adverse weather conditions, etc. Long-tail data is difficult to collect in reality, and most manufacturers are still relying on traditional simulation to obtain some data.

Secondly, ensuring the interpretability and safety of the model is also an important issue. In a completely end-to-end system especially, it is hard to ensure that the AI's decision-making process is understandable and controllable.

In addition, legal and ethical issues also need to be resolved, such as how AI should make decisions in emergency situations, and how to define responsibility when accidents occur.

Why Terra Was Created

The training and testing of autonomous driving systems require a large amount of data, but data collection in the real world is costly and time-consuming, and it can hardly cover all scenarios. For this reason, JoinAI has from the beginning explored data generation methods to obtain more diverse data.

With the development of GenAI, we chose to integrate the two paths, adopting a transformer+diffusion+world model architecture to train Terra to provide synthetic data for autonomous driving models. Subsequently, we also tested data generation for fields such as robotics and industrial vision, all of which yielded good results.

Introduction to Terra-1 Model

Terra-1 is a foundational world model specifically designed to generate high-quality synthetic data for the autonomous driving domain. It consists of two core modules: the Diffusion-Transformer module responsible for generating high-quality image data, and the world model module used to simulate environmental dynamics and physical rules. Terra-1 can generate visually and physically highly realistic driving environments and scenarios.

What Can Terra-1 Do

Terra-1 mainly generates high-quality training and testing data for autonomous driving systems efficiently. It primarily simulates complex driving environments, including traffic flow, weather conditions, and road types, and generates driving scenario data under rare or extreme conditions.

Terra-1 results video, Q2 2024

Generating training data purely from a model in this way not only significantly reduces the cost and difficulty of data collection but also creates more diverse and extreme scenarios, thereby enhancing the robustness of the models trained on it.

What Terra-1 Brings

Terra-1 technology shows enormous potential and application prospects in the field of autonomous driving. By reducing the cost and time of data collection, it provides an economically efficient solution for autonomous driving systems. This technology not only enhances system performance but also improves its safety, which is crucial for the stable operation of autonomous vehicles in complex traffic environments.

Terra-1 has a wide range of applications, including scene understanding, risk assessment, and simulation testing. These tasks are critical for autonomous vehicles to correctly understand their surroundings, predict potential risks, and conduct testing and validation in simulated environments. As end-to-end development progresses, the demand for high-quality synthetic data will continue to grow, and Terra-1 can provide this data for end-to-end systems, supporting the continuous optimization and improvement of autonomous driving systems.

Looking to the future, as the Terra model iterates, it can serve as a foundational world model not only for autonomous driving but, through fine-tuning, potentially for fields such as robotics and industrial vision, broadening its application scope and influence.

Development of End-to-End Autonomous Driving in China

Chinese intelligent driving manufacturers have always invested a lot of resources in the research and development of end-to-end intelligent driving. This year, Tesla's FSD is expected to enter China. This is pushing domestic manufacturers to implement end-to-end solutions more actively, and to recognize that progress in end-to-end implementation will, to some extent, reshuffle the intelligent driving industry.

The most obvious advantage of end-to-end autonomous driving models is the lossless transmission of information, which gives them a much higher ceiling than traditional solutions. If such a model can understand complex traffic environments as comprehensively as humans do, it will bring more possibilities.

More and more manufacturers are choosing to rely on model + data-driven end-to-end solutions, which will also allow domestic manufacturers to iterate more efficiently and achieve the goal of driving nationwide more quickly, whether in urban areas or on rural roads.

JoinAI's Expectations for Terra

The primary goal of Terra-1 is to provide diverse and rich training data for autonomous driving AI. It generates various complex road situations, weather conditions, and traffic scenarios, as well as extreme situations that are difficult to capture in the real world, with the aim of fundamentally eliminating corner cases for autonomous driving.

Admittedly, Terra-1 is still in the research and development iteration stage, but we are full of confidence in its potential and future. We believe that through continuous optimization and improvement, the Terra series of models will become an important driver of autonomous driving technology, providing powerful support for large model-driven end-to-end autonomous driving.

Reference Resources:

World Models https://worldmodels.github.io/
Planning-oriented Autonomous Driving https://arxiv.org/pdf/2212.10156
Sora https://openai.com/index/sora/
Grok-1.5 Vision Preview https://x.ai/blog/grok-1.5v
Scaling GAIA-1: 9-billion parameter generative world model for autonomous driving https://wayve.ai/thinking/scaling-gaia-1/
Building Vision Foundation Models for Autonomous Driving by Phil Duan https://www.youtube.com/watch?v=OKDRsVXv49A
Foundation Models for Autonomy by Ashok Elluswamy https://www.youtube.com/watch?v=6x-Xb_uT7ts&t=472s&ab_channel=WADatCVPR