Optimizing 4D Gaussians for Dynamic Scene Video from Single Landscape Images

International Conference on Learning Representations (ICLR), 2025

In-Hwan Jin 1* Haesoo Choo 2* Seong-Hun Jeong 1
Heemoon Park 3 Junghwan Kim 4 Oh-joon Kwon 5 Kyeongbo Kong 1†


1 Pusan National University 2 Pukyong National University
3 Busan MBC 4 Korea University 5 DM Studio


* Equal contribution     † Corresponding author





Overview Video




Dynamic Scene Video

Recently, a field known as dynamic scene video has emerged, which creates videos with natural animation from specific camera perspectives by combining single-image animation with 3D photography. These methods represent a pseudo 3D space with Layered Depth Images (LDIs), which divide a single image into multiple layers based on depth. However, discretely separating the elements of a continuous landscape, including fluids, has inherent limitations, and 3D space cannot be fully represented this way. A complete virtualization of 4D space through an explicit representation is therefore needed, and we propose such an approach for the first time.


Abstract

To achieve realistic immersion in landscape images, fluids such as water and clouds need to move within the image while new scenes are revealed from various camera perspectives. Recently, a field called dynamic scene video has emerged, which combines single-image animation with 3D photography. These methods use a pseudo 3D space implicitly represented with Layered Depth Images (LDIs). LDIs separate a single image into depth-based layers, which enables elements like water and clouds to move within the image while revealing new scenes from different camera perspectives. However, since landscapes typically consist of continuous elements, including fluids, separating a landscape image into discrete layers can diminish depth perception and introduce distortions depending on camera movement. Furthermore, because the 3D space is modeled implicitly, the output may be limited to videos in the 2D domain, potentially reducing its versatility. In this paper, we propose representing a complete 3D space for dynamic scene video by modeling explicit representations, specifically 4D Gaussians, from a single image. The framework focuses on optimizing 3D Gaussians by generating multi-view images from a single image and creating 3D motion to optimize 4D Gaussians. The most important part of the proposed framework is consistent 3D motion estimation, which estimates motion common to the multi-view images to bring the motion in 3D space closer to actual motion. To the best of our knowledge, this is the first attempt to consider animation while representing a complete 3D space from a single landscape image. Our model demonstrates the ability to provide realistic immersion in various landscape images through diverse experiments and metrics.



Method overview

Our goal is to optimize 4D Gaussians to represent a complete 3D space, including animation, from a single image. (a) A depth map is estimated from the given single image and converted into a point cloud. To optimize the 3D Gaussians, multi-view RGB images are rendered along a defined camera trajectory. (b) Similarly, multi-view motion masks are rendered from the input motion mask. Together with the rendered RGB images, these are used to estimate multi-view 2D motion maps. 3D motion is obtained by unprojecting the estimated 2D motion into the 3D domain, where the proposed 3D Motion Optimization Module (3D-MOM) ensures consistent 3D motion across the multiple views. (c) Using the optimized 3D Gaussians and the generated 3D motion, 4D Gaussians are optimized for changes in position, rotation, and scaling over time.
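As a rough illustration of step (a), the sketch below back-projects an estimated depth map into a point cloud under a pinhole camera model. The intrinsics K and the function name are assumptions made for this example, not the authors' released code.

import numpy as np

def unproject_depth_to_points(depth, K):
    """Back-project a depth map of shape (H, W) into an (H*W, 3) point cloud.
    Assumes a pinhole camera with intrinsics K (illustrative sketch only)."""
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)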



3D Motion Optimization Module

To maintain motion consistency across views, a 3D motion field is defined on the point cloud and projected into each 2D view using the camera parameters. The L1 loss between the projected 2D motion and the estimated 2D motion map, which serves as the ground truth, is computed for each view, and the 3D motion is optimized by minimizing the sum of these losses over all views.
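A minimal sketch of this optimization is given below, assuming per-point 2D motion maps and differentiable camera projection functions are available; the variable names and optimizer settings are illustrative, not the released implementation.

import torch

def optimize_3d_motion(points, cameras, motion_maps_2d, num_iters=500, lr=1e-2):
    # points:         (N, 3) point cloud lifted from the single input image
    # cameras:        list of differentiable callables projecting (N, 3) -> (N, 2)
    # motion_maps_2d: list of (N, 2) estimated 2D motions, one per view (ground truth)
    motion_3d = torch.zeros_like(points, requires_grad=True)  # 3D motion per point
    optim = torch.optim.Adam([motion_3d], lr=lr)
    for _ in range(num_iters):
        loss = 0.0
        for project, flow_2d in zip(cameras, motion_maps_2d):
            # 2D motion induced in this view by the current 3D motion estimate
            proj_motion = project(points + motion_3d) - project(points)
            loss = loss + torch.abs(proj_motion - flow_2d).sum()  # L1 loss per view
        optim.zero_grad()
        loss.backward()
        optim.step()
    return motion_3d.detach()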






Experiments Results

Results of Holynski Dataset

We present a qualitative comparison with other baseline methods and diffusion-based methods. For this comparison, our proposed model, being an explicit representation, is projected to 2D video. The process of separating the input image into LDIs in 3D-Cinemagraphy (Li et al., 2023) leads to artifacts in animated regions and fails to provide natural motion, which reduces realism. Similarly, Make-It-4D (Shen et al., 2023) also uses LDIs to represent 3D space for multi-view generation, which results in lower visual quality. Additionally, due to unclear layer separation, objects appear fragmented or exhibit ghosting effects, where objects seem to leave behind afterimages. Likewise, DynamiCrafter (Xing et al., 2025) and Motion-I2V (Shi et al., 2024), though capable of producing cinemagraphs, encounter challenges in accurately rendering the desired views due to limited view-manipulation capabilities. In contrast, the proposed model represents a complete 3D space with animation, providing fewer visual artifacts and high rendering quality from various camera viewpoints. Therefore, our method provides more photorealistic results than the others across various input images.


Results of "In-the-Wild" Dataset

Additionally, to compare the performance of our method with the baseline models, we use our "in-the-wild" dataset of global landmarks collected from online sources. The figure above demonstrates that our model outperforms the baseline models by producing more realistic and stable videos across a variety of complex scenarios.



Quantitative Results

We show the quantitative results of our method compared to other baselines on reference and non-reference metrics. Our approach outperforms the other baselines on all metrics in the context of view generation. In particular, our method achieves the best PSNR, SSIM, and LPIPS scores, indicating that the generated views are of high fidelity and perceptually similar to the ground-truth views. Furthermore, our proposed method outperforms existing methods on non-reference metrics, measured by locally quantifying noise and distortion in images with PIQE (Venkatanath et al., 2015). Additionally, we conduct a user study on the generated videos, confirming that our model not only leads on quantitative metrics but also surpasses the baselines in user experience across four visual aspects.
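For context, the reference metrics can be computed roughly as follows with scikit-image and the lpips package; this is only an illustration of how PSNR, SSIM, and LPIPS are typically evaluated, not the paper's evaluation script.

import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reference_metrics(pred, gt):
    # pred, gt: float arrays in [0, 1] with shape (H, W, 3)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1]
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lpips_val = lpips.LPIPS(net='alex')(to_tensor(pred), to_tensor(gt)).item()
    return psnr, ssim, lpips_val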




Application: 4D Scene Generation

3D Scene Generation + Dynamic Scene Video

Our framework is designed to seamlessly incorporate 3D scene generation models, facilitating straightforward spatial expansion. The figure below shows the results of incorporating the 3D scene generation model LucidDreamer (Chung et al., 2023) into our method. LucidDreamer converts single images into point clouds and progressively fills empty areas with an inpainting model, enabling spatio-temporal expansion when incorporated into our framework. This incorporation enables the creation of videos with more natural motion and more expansive views.


Comparative Result with VividDream

Additionally, we compare our method with recent work on 4D scene generation, VividDream (Lee et al., 2024). Unlike our approach, VividDream directly generates multi-view videos through a T2V model without utilizing motion estimation for temporal expansion. We render the comparisons using the closest matching images and cameras available, as the code and data have not been released. Since VividDream (Lee et al., 2024) generates videos independently from multiple views, this approach results in motion ambiguity that leads to blurred reconstructions in fluid scenes, failing to accurately capture the various motions. In contrast, our method estimates consistent 3D motion based on 2D motion and subsequently generates high-quality videos with more natural motion.





3D Motion Optimization Module

Independently estimated 2D motions from multi-view images can yield different motion values for the same region of 3D space. Table II shows EPE results comparing multi-view flows with and without the 3D Motion Optimization Module. Without 3D motion optimization, the estimated flows differ significantly at the same positions, whereas our 3D motion module achieves remarkable consistency across all viewpoints with almost no variance. Similarly, the motions visualized in Fig. (a) demonstrate that the 3D motion accurately represents motion information in 3D space, ensuring consistency when projected to different viewpoints. Lastly, Fig. (b) illustrates that directly using these 2D motions to animate viewpoint videos prevents the 4D Gaussians from representing natural motion.
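As a point of reference, the EPE between flows estimated for two viewpoints at corresponding positions can be computed as in the short sketch below; this is illustrative only, and the paper's exact evaluation protocol may differ.

import numpy as np

def endpoint_error(flow_a, flow_b):
    # flow_a, flow_b: (H, W, 2) flow fields compared at corresponding positions
    return np.linalg.norm(flow_a - flow_b, axis=-1).mean()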






Single Image Animation

In our model, it is crucial to use a single-image animation model that precisely estimates 2D motion from multi-view images and generates multi-view videos that accurately reflect the 3D motion. The figure below shows the results of 4D Gaussians trained with videos animated by different single-image animation models: SLR-SFS (Fan et al., 2023), Text2Cinemagraph (Mahapatra et al., 2023), and StyleCineGAN (Choi et al., 2024). The Eulerian flows estimated by each model enable the generation of consistent 3D motion through 3D-MOM, which facilitates the restoration of natural motion in the 4D scene. Additionally, StyleCineGAN (Choi et al., 2024) can generate natural videos not only of fluids like water but also of clouds and smoke, allowing various motions to be generated when it is used in our framework.
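The Eulerian flows mentioned above are static motion fields that are integrated over time to displace image content. A minimal sketch of this Euler integration is given below, assuming the field stores per-pixel (x, y) displacements; it omits the feature warping and splatting used by the cited animation models.

import numpy as np

def euler_integrate(motion, num_frames):
    # motion: (H, W, 2) static Eulerian flow with (x, y) displacement per pixel
    # Returns per-frame displacement fields D_t with
    # D_t(p) = D_{t-1}(p) + M(p + D_{t-1}(p)).
    H, W, _ = motion.shape
    grid = np.stack(np.meshgrid(np.arange(W), np.arange(H)), axis=-1)  # (H, W, 2) as (x, y)
    disp = np.zeros_like(motion, dtype=float)
    displacements = []
    for _ in range(num_frames):
        # Sample the motion field at the displaced positions (nearest neighbor)
        pos = np.clip((grid + disp).round().astype(int), 0, [W - 1, H - 1])
        disp = disp + motion[pos[..., 1], pos[..., 0]]
        displacements.append(disp.copy())
    return displacements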






3D Motion Initialization

To verify the effect of 3D motion in training 4D Gaussians, we compare the results of our method with and without 3D motion initialization. Without it, repeated patterns occurred when animating fluids. Table III shows the EPE score of the optical flow estimated from each rendered video. It indicates that the explicit representation of 4D Gaussians, which trains multi-views and motion jointly, has difficulty capturing the overall motion accurately when trained only with viewpoint videos. The figure shows the optical flow estimated from the rendered videos of the 4D Gaussians, demonstrating that motion is difficult to learn accurately without 3D motion initialization. In contrast, our method is able to learn the overall 3D motion.




4D Gaussian Splatting

Since our framework utilizes 4D Gaussians to model a complete 3D space with motion, the expressiveness of the 4D representation itself significantly influences the quality of the final results. The figure below shows the results of using Deformable-3D (Yang et al., 2024) within our framework. Compared to our previous results using 4D-GS (Wu et al., 2024), it reconstructs low-fidelity 4D scenes and generates videos with reduced realism. Therefore, by utilizing 4D-GS (Wu et al., 2024), our framework is capable of producing more immersive dynamic scene videos. This experiment demonstrates the adaptability of our model: as advancements are made in the field of 4D Gaussians, the performance of our framework also improves.





Effect of Two-stage training

To achieve faster and more stable results, we separate the 4D Gaussian learning process along the viewpoint and time axes. In step 1, we train 3D Gaussians using all viewpoints, and in step 2, we train 4D Gaussians using videos from sampled viewpoints. The top of the figure shows the results of training 4D Gaussians with animated videos for all viewpoints, while the bottom shows the results of our two-stage training approach trained on only three viewpoint videos. This demonstrates that our training method produces results almost identical to those obtained by training with videos from all viewpoints. Additionally, as shown in Table IV, which was evaluated on a sample validation set, our method not only maintains high performance but also achieves a significant efficiency improvement: it is over 30 times faster in generating videos and requires about one-third less time to train the 4D Gaussians, demonstrating an optimal balance between speed and accuracy.
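A compact sketch of this two-stage schedule is shown below; the renderers and data samplers are placeholder callables standing in for the actual 3D/4D Gaussian pipeline, so the snippet illustrates the training split rather than the released code.

import torch

def two_stage_training(static_params, dynamic_params, render_view, render_frame,
                       view_images, sampled_videos, iters_stage1=3000, iters_stage2=3000):
    # Stage 1: fit static 3D Gaussians against all rendered viewpoints (no time axis).
    opt1 = torch.optim.Adam(static_params, lr=1e-3)
    for _ in range(iters_stage1):
        cam, gt = view_images.sample()                    # random training view
        loss = (render_view(static_params, cam) - gt).abs().mean()
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: fit the temporal deformation using videos from a few sampled viewpoints.
    opt2 = torch.optim.Adam(dynamic_params, lr=1e-3)
    for _ in range(iters_stage2):
        cam, t, gt = sampled_videos.sample()              # view plus timestamp
        loss = (render_frame(static_params, dynamic_params, cam, t) - gt).abs().mean()
        opt2.zero_grad(); loss.backward(); opt2.step()
    return static_params, dynamic_params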





Multi-view Rendered RGB Images






Long Video Results

Recently, diffusion-based T2V models capable of simultaneously generating multi-angle images and cinemagraphs, similar to dynamic scene videos, have emerged (Shi et al., 2024; Xing et al., 2025). However, these models suffer a sharp increase in computational load as the number of frames grows, limiting them to a maximum of 30 frames per inference and requiring lengthy inference times for each new view. In contrast, our framework can reconstruct long durations using explicit 4D Gaussians, allowing novel-view videos to be created in a shorter time and at a lower cost. Fig. 14 demonstrates that our framework can produce long videos that maintain natural motion and high fidelity, generating up to 330 frames. The high compatibility of our framework ensures that as the field of 4D Gaussians advances, the performance of our framework also improves, enabling the production of even longer dynamic scene videos.





BibTeX

@inproceedings{jinoptimizing,
        title={Optimizing 4D Gaussians for Dynamic Scene Video from Single Landscape Images},
        author={Jin, In-Hwan and Choo, Haesoo and Jeong, Seong-Hun and Park, Heemoon and Kim, Junghwan and Kwon, Oh-joon and Kong, Kyeongbo},
        booktitle={The Thirteenth International Conference on Learning Representations},
        year={2025}
    }