LivingWorld: Interactive 4D World Generation with Environmental Dynamics

Pusan National University
*Equal Contribution · Corresponding Author


Overview of LivingWorld. LivingWorld generates dynamic 4D scenes from a single image by progressively expanding the scene while maintaining globally coherent environmental dynamics. We demonstrate results with scene-scale motions such as water, clouds, and smoke, rendered under moving viewpoints. Our method preserves spatial and temporal consistency across views, enabling interactive and physically plausible 4D world generation.

Abstract

We introduce LivingWorld, an interactive framework for generating 4D worlds with environmental dynamics from a single image. While recent advances in 3D scene generation enable large-scale environment creation, most approaches focus primarily on reconstructing static geometry, leaving scene-scale environmental dynamics such as clouds, water, or smoke largely unexplored. Modeling such dynamics is challenging because motion must remain coherent across an expanding scene while supporting low-latency user feedback. LivingWorld addresses this challenge by progressively constructing a globally coherent motion field as the scene expands. To maintain global consistency during expansion, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities across views. We further represent motion using a compact hash-based motion field, enabling efficient querying and stable propagation of dynamics throughout the scene. This representation also supports bidirectional motion propagation during rendering, producing long and temporally coherent 4D sequences without relying on expensive video-based refinement. On a single RTX 5090 GPU, each scene-expansion step takes 9 seconds, followed by 3 seconds for motion alignment and motion-field updates, enabling interactive 4D world generation with globally coherent environmental dynamics. Additional dynamic results are provided in the supplementary video.

Method

Interactive 4D World Generation

LivingWorld constructs a globally consistent motion field during scene expansion. Starting from a single image, we progressively expand the scene and estimate motion from user interactions. To ensure consistency across views, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities. We further represent motion using a compact hash-based field, enabling efficient and stable propagation of dynamics. This allows temporally coherent and view-consistent 4D world generation.
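The expand-estimate-align-update loop described above can be sketched as follows. This is a toy illustration only: every name here (`expand_scene`, `estimate_motion`, `MotionField`) is a hypothetical stand-in for exposition, not the paper's actual API, and the stubs return synthetic data in place of real generation and motion estimation.

```python
# Illustrative sketch of the LivingWorld expansion loop; all names are
# hypothetical stand-ins, and the stubs below return toy data.

class MotionField:
    """Toy global motion store keyed by quantized world coordinates."""
    def __init__(self, cell=0.1):
        self.cell = cell
        self.store = {}

    def _key(self, p):
        return tuple(round(c / self.cell) for c in p)

    def update(self, points, motions):
        for p, m in zip(points, motions):
            self.store[self._key(p)] = m

    def query(self, p):
        # Points never observed default to zero motion.
        return self.store.get(self._key(p), (0.0, 0.0, 0.0))


def expand_scene(step):
    """Stub: pretend each expansion step contributes one new surface point."""
    return [(float(step), 0.0, 0.0)]

def estimate_motion(points):
    """Stub: constant rightward drift, e.g. wind-driven clouds."""
    return [(0.05, 0.0, 0.0) for _ in points]

def generate_world(num_steps):
    field = MotionField()
    for step in range(num_steps):
        pts = expand_scene(step)      # ~9 s/step in the paper's setup
        mots = estimate_motion(pts)   # per-view dynamics estimate
        field.update(pts, mots)       # ~3 s alignment + field update
    return field

field = generate_world(3)
print(field.query((1.0, 0.0, 0.0)))   # motion at a point added in step 1
```

Keeping the field as a single world-space store, rather than per-view motion maps, is what lets later expansion steps query and stay consistent with motion committed by earlier ones.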


LivingWorld Pipeline

Geometry-Aware Alignment

Our alignment module resolves directional and scale ambiguities across views. By matching overlapping regions between newly expanded and existing scenes, we enforce globally consistent motion during scene growth. This enables stable integration of motion across the entire scene.
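As a minimal numerical illustration of resolving scale and direction ambiguity from an overlap region, one can fit a single signed scale factor to the corresponding motion vectors by least squares. The closed form below, s = Σ⟨v_new, v_old⟩ / Σ‖v_new‖², is our illustrative choice, not necessarily the paper's actual alignment objective; a negative s simultaneously corrects a flipped motion direction.

```python
# Sketch: least-squares scale alignment over motion vectors in the overlap
# between a newly expanded view and the existing scene (illustrative only).
import numpy as np

def align_scale(v_new, v_old):
    """Signed scale s minimizing sum ||s * v_new - v_old||^2."""
    v_new = np.asarray(v_new, dtype=float)
    v_old = np.asarray(v_old, dtype=float)
    return float((v_new * v_old).sum() / (v_new * v_new).sum())

# Overlap vectors: the new view's motion came out at half scale and with
# flipped direction relative to the existing field.
v_old = np.array([[0.2, 0.0, 0.0], [0.0, 0.2, 0.0]])
v_new = -0.5 * v_old
s = align_scale(v_new, v_old)
print(s)                 # -2.0: corrects both scale and direction
aligned = s * v_new      # now matches v_old
```

In practice one would solve this over all matched points in the overlap, but the principle is the same: the existing scene's motion defines the global frame, and each new view is rescaled into it before its dynamics are merged.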

Hash-based Motion Field


We represent motion as a continuous hash-based field, enabling efficient querying and stable propagation of dynamics. The field supports bidirectional motion integration, producing long and temporally coherent 4D sequences during rendering.
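One way to realize such a compact field is a spatial hash table over quantized world coordinates, here using the prime-XOR hash popularized by Instant-NGP as an assumed design; the paper's exact encoding, resolution schedule, and blending rule may differ. Each cell keeps a running mean of the motion observations that land in it, which keeps propagation stable as views accumulate.

```python
# Sketch of a hash-based motion field (assumed design, Instant-NGP-style
# spatial hash); not the authors' actual implementation.
import numpy as np

PRIMES = (1, 2654435761, 805459861)  # per-axis hashing primes

class HashMotionField:
    def __init__(self, table_size=2**16, cell=0.1, dim=3):
        self.T, self.cell = table_size, cell
        self.table = np.zeros((table_size, dim))   # stored motion vectors
        self.counts = np.zeros(table_size)         # observations per cell

    def _index(self, p):
        ijk = [int(np.floor(c / self.cell)) for c in p]
        h = 0
        for c, prime in zip(ijk, PRIMES):
            h ^= c * prime
        return h % self.T

    def update(self, p, motion):
        """Blend a new observation into its cell via a running mean."""
        i = self._index(p)
        self.counts[i] += 1
        self.table[i] += (np.asarray(motion, float) - self.table[i]) / self.counts[i]

    def query(self, p):
        return self.table[self._index(p)]

field = HashMotionField()
field.update((0.33, 1.25, -0.5), (0.1, 0.0, 0.0))
field.update((0.34, 1.25, -0.5), (0.3, 0.0, 0.0))  # same cell, averaged
print(field.query((0.31, 1.26, -0.49)))            # ~ [0.2, 0, 0]
```

Because queries are O(1) dictionary-style lookups rather than dense volume reads, the field stays cheap to query even as the scene keeps expanding, and the same lookup can serve both forward and backward passes of bidirectional motion propagation.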

Results

LivingWorld generates dynamic 4D scenes from a single image and user interactions, predicting the environmental dynamics that result. Here, we present video results rendered with a moving camera, overlaid with visualizations of the input interactions.



Integration with Object-Centric Motion

While LivingWorld primarily models scene-scale environmental dynamics such as water, clouds, and smoke, it can also incorporate object-centric motion. Rather than representing all motion within a single field, we align object and environmental motions in world coordinates by enforcing consistent scale and spatial alignment. This allows independently modeled motions to be coherently integrated within the same scene, maintaining consistency across viewpoints and scene expansion.
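The composition above can be illustrated with a small world-coordinate sketch: an object-frame motion vector is mapped into world space with the object's rotation and scale, then summed with the environmental field's motion at the same point. The similarity-transform formulation here is our assumed illustration, not the paper's stated equations.

```python
# Toy sketch of composing an independently modeled object motion with the
# environmental motion field in shared world coordinates (assumed form).
import numpy as np

def to_world(motion_local, R, s):
    """Map an object-frame motion vector into world coordinates using the
    object's rotation R and scale s (translation does not affect vectors)."""
    return s * (np.asarray(R, float) @ np.asarray(motion_local, float))

def composed_motion(point, env_field, obj_motion_world):
    """World-space motion at `point`: environment plus object contribution."""
    return np.asarray(env_field(point), float) + obj_motion_world

# Object drifts forward in its own frame; its world pose is rotated 90°
# about z and scaled by 2.
R = np.array([[0., -1., 0.],
              [1.,  0., 0.],
              [0.,  0., 1.]])
obj_world = to_world([0.1, 0.0, 0.0], R, s=2.0)       # -> [0., 0.2, 0.]
env = lambda p: (0.05, 0.0, 0.0)                      # stub wind field
print(composed_motion((1., 2., 0.), env, obj_world))  # wind + object drift
```

Enforcing the shared scale and frame before summation is what keeps the two independently modeled motions mutually consistent as the camera moves and the scene expands.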

Comparisons with Video Generation Models

We compare our method with video generation models under identical camera motion prompts. While these models often suffer from inconsistent 3D geometry, unnatural motion dynamics, and limited camera controllability, our method maintains a coherent 3D representation with physically consistent dynamics and precise control over camera motion.

Human Study (2AFC)

Values indicate the percentage of participants who preferred LivingWorld over each baseline.

Baseline      Physics Plausibility    Motion Fidelity
CogVideoX     78%                     73%
Tora          75%                     80%
Veo 3.1       72%                     72%

Camera prompt: "Pan to the right"



GUI Demo