LivingWorld: Interactive 4D World Generation with Environmental Dynamics

1Pusan National University
*Equal Contribution Corresponding Author

Overview of LivingWorld. LivingWorld generates dynamic 4D scenes from a single image by progressively expanding the scene while maintaining globally coherent environmental dynamics. We demonstrate results with scene-scale motions such as water, clouds, and smoke, rendered under moving viewpoints. Our method preserves spatial and temporal consistency across views, enabling interactive and physically plausible 4D world generation.

Abstract

We introduce LivingWorld, an interactive framework for generating 4D worlds with environmental dynamics from a single image. While recent advances in 3D scene generation enable large-scale environment creation, most approaches focus primarily on reconstructing static geometry, leaving scene-scale environmental dynamics such as clouds, water, or smoke largely unexplored. Modeling such dynamics is challenging because motion must remain coherent across an expanding scene while supporting low-latency user feedback. LivingWorld addresses this challenge by progressively constructing a globally coherent motion field as the scene expands. To maintain global consistency during expansion, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities across views. We further represent motion using a compact hash-based motion field, enabling efficient querying and stable propagation of dynamics throughout the scene. This representation also supports bidirectional motion propagation during rendering, producing long and temporally coherent 4D sequences without relying on expensive video-based refinement. On a single RTX 5090 GPU, generating each new scene expansion step requires 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation with globally coherent environmental dynamics. Additional dynamic results are provided in the supplementary video.

Method

Interactive 4D World Generation

LivingWorld constructs a globally consistent motion field during scene expansion. Starting from a single image, we progressively expand the scene and estimate motion from user interactions. To ensure consistency across views, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities. We further represent motion using a compact hash-based field, enabling efficient and stable propagation of dynamics. This allows temporally coherent and view-consistent 4D world generation.


LivingWorld Pipeline
Geometry-Aware Alignment

Our alignment module resolves directional and scale ambiguities across views. By matching overlapping regions between newly expanded and existing scenes, we enforce globally consistent motion during scene growth. This enables stable integration of motion across the entire scene.
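As a rough illustration of how a directional and scale ambiguity could be resolved from an overlap region, the sketch below fits one signed scalar that maps newly estimated motion vectors onto the existing ones. This is a minimal least-squares stand-in under stated assumptions, not the alignment module itself: the function names `fit_signed_scale` and `align_motion`, the use of a single global scalar, and the sample vectors are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def fit_signed_scale(ref: Sequence[Vec], new: Sequence[Vec]) -> float:
    """Return the scalar s minimizing sum ||ref_i - s * new_i||^2.

    The sign of s captures a flipped direction, its magnitude the scale.
    Assumption: ref and new are motion vectors sampled at matched points
    in the overlap between the existing and the newly expanded scene.
    """
    if len(ref) != len(new) or not ref:
        raise ValueError("need equal-length, non-empty matched samples")
    num = sum(r[k] * n[k] for r, n in zip(ref, new) for k in range(3))
    den = sum(n[k] * n[k] for n in new for k in range(3))
    if den == 0.0:
        raise ValueError("new motion is zero everywhere in the overlap")
    return num / den


def align_motion(new: Sequence[Vec], s: float) -> List[Vec]:
    """Apply the fitted signed scale to every new motion vector."""
    return [(s * v[0], s * v[1], s * v[2]) for v in new]


# Hypothetical overlap: the new estimate points the opposite way
# and has half the magnitude of the existing motion.
existing = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(-0.5, 0.0, 0.0), (-1.0, 0.0, 0.0)]

s = fit_signed_scale(existing, estimated)
aligned = align_motion(estimated, s)
```

A single scalar is the simplest possible model; a geometry-aware variant would presumably also account for depth and camera pose, which this sketch omits.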

Hash-based Motion Field

We represent motion as a continuous hash-based field, enabling efficient querying and stable propagation of dynamics. The field supports bidirectional motion integration, producing long and temporally coherent 4D sequences during rendering.
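To make the idea of a hash-indexed motion field with bidirectional integration concrete, here is a small sketch. It is an illustrative stand-in only: it stores one velocity per voxel in a plain hashed table and integrates with Euler steps, whereas the field described above is continuous and its actual encoding is not specified here. The class `HashMotionField`, its parameters, and the hashing primes are assumptions made for the example.

```python
from typing import Dict, Tuple

Vec = Tuple[float, float, float]


class HashMotionField:
    """Toy voxel-hashed velocity field (hypothetical, for illustration)."""

    def __init__(self, cell_size: float = 1.0, table_size: int = 2 ** 16) -> None:
        self.cell_size = cell_size
        self.table_size = table_size
        self.table: Dict[int, Vec] = {}

    def _slot(self, p: Vec) -> int:
        # Quantize the point to an integer voxel, then hash the voxel index.
        # Distinct voxels may collide in the same slot; this sketch ignores that.
        cx, cy, cz = (int(c // self.cell_size) for c in p)
        h = (cx * 73856093) ^ (cy * 19349663) ^ (cz * 83492791)
        return h % self.table_size

    def write(self, p: Vec, velocity: Vec) -> None:
        self.table[self._slot(p)] = velocity

    def query(self, p: Vec) -> Vec:
        # Unwritten regions are treated as static.
        return self.table.get(self._slot(p), (0.0, 0.0, 0.0))

    def advect(self, p: Vec, dt: float, steps: int) -> Vec:
        """Euler-integrate a point through the field.

        A positive dt moves forward in time, a negative dt moves backward.
        """
        for _ in range(steps):
            v = self.query(p)
            p = (p[0] + dt * v[0], p[1] + dt * v[1], p[2] + dt * v[2])
        return p


# Hypothetical uniform flow along +x across three neighboring voxels.
field = HashMotionField()
for x in (-0.5, 0.5, 1.5):
    field.write((x, 0.5, 0.5), (1.0, 0.0, 0.0))

start = (0.5, 0.5, 0.5)
forward = field.advect(start, dt=0.25, steps=4)
backward = field.advect(start, dt=-0.25, steps=4)
```

Querying is a single table lookup per sample, which is what makes this kind of representation cheap to evaluate; running the same integrator with both signs of `dt` is one simple way to obtain positions both before and after a reference frame.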

Results

LivingWorld generates dynamic 4D scenes from a single image. Here, we present video results with scene-scale motions such as water, clouds, and smoke, rendered with a moving camera.



Integration with Object-Centric Motion

While LivingWorld primarily models scene-scale environmental dynamics such as water, clouds, and smoke, it can also incorporate object-centric motion. Rather than representing all motion within a single field, we align object and environmental motions in world coordinates by enforcing consistent scale and spatial alignment. This allows independently modeled motions to be coherently integrated within the same scene, maintaining consistency across viewpoints and scene expansion.

Comparisons with Baseline Methods

We compare LivingWorld against video and 4D scene generation baselines. Video models often suffer from inconsistent 3D geometry, unnatural motion dynamics, and limited camera controllability, while 4D scene baselines are limited in scene scale, motion diversity, or runtime. LivingWorld maintains a coherent 3D representation with physically consistent dynamics and precise control over camera motion.

Quantitative Evaluation

We evaluate LivingWorld against video and 4D scene generation baselines on 60 expanded scenes. VBench measures imaging quality, aesthetic quality, motion smoothness, and temporal flicker; PhysReal is a GPT-based assessment of physical realism; Runtime is the per-scene generation time in seconds. LivingWorld attains the best aesthetic, temporal-flicker, and PhysReal scores, ranks second in motion smoothness and third in imaging quality, and is roughly 12× to 300× faster than the baselines (12 s versus 140 s to 3580 s).

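| Category | Method | VBench Imaging (↑) | VBench Aesthetic (↑) | VBench Motion (↑) | VBench Flicker (↑) | PhysReal (↑) | Runtime (s, ↓) |
|----------|--------|--------------------|----------------------|-------------------|--------------------|--------------|----------------|
| Video    | Veo 3.1           | 0.694 | 0.625 | 0.992 | 0.979 | 0.622 | 140  |
|          | CogVideoX         | 0.677 | 0.611 | 0.991 | 0.983 | 0.575 | 1510 |
|          | Tora              | 0.649 | 0.609 | 0.992 | 0.976 | 0.571 | 550  |
| 4D Scene | 4DGS-Cinemagraphy | 0.637 | 0.604 | 0.996 | 0.988 | 0.605 | 1980 |
|          | PerpetualWonder   | 0.553 | 0.553 | 0.979 | 0.972 | 0.554 | 3580 |
|          | LivingWorld       | 0.673 | 0.639 | 0.995 | 0.989 | 0.655 | 12   |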

Bold: best   underline: second best   Runtime is in seconds.

Human Study (2AFC)

Values indicate the percentage of participants who preferred LivingWorld over each baseline. Outer arcs show 95% Wilson confidence intervals over participants.

95-participant Study

| Baseline          | Imaging  | Aesthetic | Motion   | Flicker  |
|-------------------|----------|-----------|----------|----------|
| CogVideoX         | 68% ±9.0 | 73% ±8.6  | 66% ±9.1 | 78% ±8.0 |
| Tora              | 77% ±8.2 | 79% ±7.9  | 76% ±8.3 | 84% ±7.2 |
| Veo 3.1           | 58% ±9.5 | 66% ±9.1  | 72% ±8.7 | 75% ±8.4 |
| 4DGS-Cinemagraphy | 73% ±8.6 | 75% ±8.4  | 68% ±9.0 | 74% ±8.5 |
| PerpetualWonder   | 82% ±7.5 | 87% ±6.6  | 85% ±7.0 | 92% ±5.4 |
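For readers unfamiliar with the Wilson score interval used for the preference percentages above, the following sketch shows the standard formula for a binomial proportion. The function name `wilson_interval` and the example counts are hypothetical and are not taken from the study; the half-width of the interval depends on the number of responses, which is passed in as a parameter.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    successes: number of responses preferring one option
    n:         total number of responses
    z:         normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts for illustration only: 50 of 100 responses prefer A.
low, high = wilson_interval(50, 100)
```

Unlike the plain normal approximation, the Wilson interval is centered on a value pulled slightly toward 50% and always stays within [0, 1], which is why it is often preferred for 2AFC preference rates near the extremes.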




Interactivity User Study

We conduct an interactivity user study to evaluate our GUI for interactive 4D world generation. Participants used our interface to author and explore dynamic scenes, and rated their experience on a 7-point Likert scale (1 = Strongly Disagree, 7 = Strongly Agree) across three dimensions: Usability, Controllability, and Usefulness. LivingWorld achieves consistently positive ratings across all dimensions, suggesting that the added 4D controls remain practically usable.

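| Dimension       | Rating (7-point Likert) |
|-----------------|-------------------------|
| Usability       | 5.65 (±1.25)            |
| Controllability | 5.90 (±1.42)            |
| Usefulness      | 6.00 (±1.21)            |
| Avg.            | 5.85 (±1.30)            |
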
Survey form distributed to participants in the interactivity user study.

GUI Demo

Citation

@article{mun2026livingworld,
  title={LivingWorld: Interactive 4D World Generation with Environmental Dynamics},
  author={Mun, Hyeongju and Jin, In-Hwan and Kim, Sohyeong and Kong, Kyeongbo},
  journal={arXiv preprint arXiv:2604.01641},
  year={2026}
}