Redefining How Physical AI Training Begins

JoinAI builds next-generation training infrastructure for embodied intelligence, autonomous driving, industrial vision, and other Physical AI scenarios through generative world modeling, spatial composition, and training-oriented systems.
Today, we are standing on the fault line of a new era.

AI no longer lives only in pixels and text.
It is moving into the physical world we inhabit.

Sensors gave machines eyes.
Robotic arms gave them hands and touch.

But seeing the world and reaching its boundaries do not mean truly learning how the world works.

From perception to understanding to action, the key lies in real command of space.



As AI enters the physical world, the way it learns must change.

It must learn the structure and rules of the physical world from vision, space, time, and signals of interaction.

And this cannot stop at surface-level representation. What it needs is physically grounded, high-precision understanding and control — together with object-level generation, editing, and reconstruction.

More importantly, it needs spatial scenes that can iterate as quickly as models do — and training data that can keep expanding on top of them.



That is why next-generation Physical AI can no longer rely on simulation alone.

Traditional Simulation

Define the world first. Then run the world.

Built on manually constructed scenes, rules, and parameters, traditional simulation uses engines to produce test environments, scene outcomes, or training data. At its core, it is an explicitly constructed, simulator-centric system.

Generative AI-Enhanced Simulation

The simulator remains the center. AI only improves how it is built. By introducing generative AI into the simulation framework, this approach improves the efficiency and realism of asset creation, scene construction, and content expansion. But fundamentally, it is still a simulation-centric system — only with stronger tools.

The problem is that even when generative AI is added, enhanced simulation is still only a linear extension of the existing path.

It can be useful — even essential.

But it still does not change the underlying nature of construction.



If the foundation doesn’t change, neither do the bottlenecks.

Cross-Embodiment Transfer Remains Costly

Traditional simulation often relies on point-to-point engineering for specific robots and environments, making transfer costly and generalization fragile. In contrast, video-first and world-action approaches offer a stronger path toward cross-embodiment transfer by learning more general patterns of physical dynamics.

Task Relevance Is Not Built In

A runnable world is not automatically a valuable training world. Traditional simulation is effective at constructing environments, but it does not naturally answer which spatial relations, interactions, or future states matter most for learning.

The Content Gap Still Persists

The Sim-to-Real challenge is not only about visual fidelity. It is also about the gap in object variety, interaction patterns, and real-world content distribution. Better rendering alone does not automatically close the Content Gap.

Long-Horizon Modeling Remains Fragile

For occlusion tracking, task-phase switching, and failure recovery, traditional simulation struggles to form naturally continuous world representations. Video-based and world-model approaches place much greater emphasis on predicting future world states over time.

Long-Tail Coverage Remains Expensive

Expanding 3D assets, rules, and parameters by hand to cover long-tail scenarios is expensive and slow. Even with generative AI, gains remain limited as long as the system is still centered on manual construction.

Adding AI Does Not Change the Route

Adding generative AI to an old framework does not automatically change its nature. The real difference is not whether AI is used, but whether AI remains an external enhancement tool or becomes the system core for organizing world representation and training loops.

Generation Doesn’t End with a Result Image
What Matters Is a Spatial Representation You Can Enter
3D generation demo
One Prompt, and Spatial Composition Begins

Object-Level Editing Shouldn't Be the Hard Part

Before
After

Prompt: “add a glass and place it on the magazine”
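As a hypothetical sketch only (the endpoint URL, payload fields, and scene handle below are illustrative assumptions, not JoinAI's published API), a prompt-driven object-level edit could be issued programmatically like this:

import requests

# Hypothetical request shape for a prompt-driven object-level edit.
# The endpoint, payload fields, and scene_id are illustrative assumptions.
response = requests.post(
    "https://api.example.com/v1/scene/edit",  # placeholder endpoint
    json={
        "scene_id": "dining_room_001",        # assumed scene handle
        "prompt": "add a glass and place it on the magazine",
        "outputs": ["render", "depth", "segmentation"],
    },
    timeout=60,
)
response.raise_for_status()
edited = response.json()  # e.g. URLs of the edited render and labels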

Image Data Should Never Stop at Images.

Original render
Depth map
Segmentation map
robot_view_camera_params
{ "camera_name": "Robot-View", "timestamp_seconds": 222.375000, "location": { "x": -2.763041, "y": -2.039186, "z": 1.500000 }, "quaternion": { "qw": 0.6509729028, "qx": 0.3388750553, "qy": -0.3136486709, "qz": -0.6025134921 }, "robot view_parameters": { "class_name": "RobotView Parameters", "extrinsic": [ 0.07720405608415604, -0.5718645453453064, 0.816707193851471, -0.0, -0.9970153570175171, -0.04428243264555931, 0.0632418692111969, 0.0, 3.725290742551124e-09, -0.8191520571708679, -0.5735765099525452, -0.0, -1.8197823762893677, -0.4416574239730835, 3.245922565460205, 1.0 ], "intrinsic": { "height": 2160, "intrinsic_matrix": [ 2306.217300415039, 0.0, 0.0, 0.0, 2306.217300415039, 0.0, 2088.0, 1080.0, 1.0 ], "width": 4176 }, "version_major": 1, "version_minor": 0 }, "camera_details": { "sensor_width_mm": 36.0, "sensor_height_mm": 27.0, "focal_length_mm": 19.881183624267578, "sensor_fit": "HORIZONTAL", "fx_pixels": 2306.217300415039, "fy_pixels": 2306.217300415039, "cx_pixels": 2088.0, "cy_pixels": 1080.0 }, "notes": { "coordinate_system": "Extrinsic matrix uses computer vision convention (Y-up, -Z forward)", "extrinsic_format": "Column-major 4x4 matrix (world-to-camera transformation)", "intrinsic_format": "Row-major 3x3 matrix [fx, 0, 0, 0, fy, 0, cx, cy, 1]", "quaternion_order": "WXYZ (scalar first)", "location_units": "Blender units (meters)" } }

Video Data Shouldn't Stop at Video Alone

robot_view_camera_params
(per-frame camera parameters; same schema as shown above)
video_task_annotation
{ "video_id": "CupMove.mp4", "task": "Robotic Pick and Place", "scene": "Dining room, round wooden table with magazines and decor", "steps": [ { "start_frame": 0, "end_frame": 90, "skill": "Pick", "description": "The robotic gripper descends from above, aligns with the white patterned cup on the table, and closes its jaws to grasp the cup." }, { "start_frame": 90, "end_frame": 180, "skill": "Transport", "description": "The robotic arm lifts the cup vertically, moves it horizontally to the left side of the table, and lowers it towards the surface." }, { "start_frame": 180, "end_frame": 210, "skill": "Place", "description": "The gripper opens to release the cup onto the table surface and the arm retracts upwards to complete the task." } ] }

A Learnable World Should Never Output Just One Kind of Data

No Physical Signal Can Be Missing

Simulaix & Terra

Bringing World Knowledge into Training, Made Simple

Simulaix Demo Interface
One Capability, Infinite Possibilities.
Across different Physical AI scenarios, it answers different problems and reveals different forms of value.

Embodied Manipulation

Embodied interaction can become part of the training interface.

Home Spaces

Training data can finally be defined proactively.

Open Roads

Long-tail scenarios no longer need to be left to chance.

Sensitive Spaces

Some training data should never rely on real-world collection.

What ultimately defines the ceiling is the edge case.

For example, try petting a cat.

The only boundary for Physical AI should be imagination.

As AI moves from the world of language into the physical world, the real question is no longer just whether data exists, but whether there is an effective way to construct, represent, generate, and continuously expand world data.

That is exactly what JoinAI is building: the infrastructure Physical AI needs to learn the physical world.

We don’t stop at data. We deliver outcomes.

Explore How Much More Your AI Could Become
JoinAI
Better data, Better AI,
For everyone, Forever.
Substack
JoinAI (Hangzhou Join Intelligence Technology Co., Ltd.)
Room 448, 4th Floor, Building 4, No. 66 Dongxin Avenue, Binjiang District, Hangzhou, Zhejiang 310000
China | © 2026 JoinAI. All rights reserved. 浙ICP备2021040718号-2
Privacy & Cookie