HiSplat: Hierarchical 3D Gaussian Splatting for Generalizable Sparse-View Reconstruction


Shengji Tang1,2, Weicai Ye*2,3, Peng Ye*2, Weihao Lin1, Yang Zhou1,2,
Tao Chen1, Wanli Ouyang2

1Fudan University, 2Shanghai AI Lab, 3State Key Lab of CAD & CG, Zhejiang University
*Corresponding Authors

Abstract

Reconstructing 3D scenes from multiple viewpoints is a fundamental task in stereo vision. Recently, advances in generalizable 3D Gaussian Splatting have enabled high-quality novel view synthesis for unseen scenes from sparse input views by predicting per-pixel Gaussian parameters in a single feed-forward pass, without per-scene optimization. However, existing methods typically generate single-scale 3D Gaussians, which cannot represent both large-scale structures and fine texture details, resulting in mislocations and artefacts. In this paper, we propose a novel framework, HiSplat, which introduces a hierarchical paradigm into generalizable 3D Gaussian Splatting, constructing hierarchical 3D Gaussians via a coarse-to-fine strategy. Specifically, HiSplat first generates large, coarse-grained Gaussians to capture large-scale structures, then adds fine-grained Gaussians to enhance delicate texture details. To promote inter-scale interactions, we propose an Error Aware Module for Gaussian compensation and a Modulating Fusion Module for Gaussian repair. Our method jointly optimizes the hierarchical representations, enabling novel view synthesis from only two reference images. Comprehensive experiments on various datasets demonstrate that HiSplat significantly improves reconstruction quality and cross-dataset generalization compared to prior single-scale methods. The corresponding ablation study and analysis of different-scale 3D Gaussians reveal the mechanism behind its effectiveness.
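As background for the per-pixel, feed-forward parameterization described above, the sketch below shows one standard way pixel-aligned Gaussian centers are obtained in such pipelines: unprojecting a predicted per-pixel depth along each camera ray using the pinhole model. The function name, intrinsics, and constant depth map are illustrative assumptions, not HiSplat's actual code.

```python
import numpy as np

def unproject_pixel_gaussians(depth, K):
    """Lift every pixel to a 3D Gaussian center via a predicted depth map.
    depth: (H, W) per-pixel depths; K: 3x3 camera intrinsics.
    A real network regresses depth (plus scale, rotation, opacity, color);
    here a constant depth stands in."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, shape (3, H*W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix          # camera rays through each pixel
    centers = (rays * depth.reshape(-1)).T  # (H*W, 3) camera-space means
    return centers

K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
centers = unproject_pixel_gaussians(np.full((64, 64), 2.0), K)
print(centers.shape)  # (4096, 3): one Gaussian center per pixel
```

With a constant depth of 2.0, every unprojected center lies on the plane z = 2 in camera space; a learned depth map would instead place each Gaussian on the estimated scene surface.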


Fig. 1: Overview of HiSplat. HiSplat constructs hierarchical 3D Gaussians that better represent both large-scale structures (more accurate locations, fewer cracks) and texture details (fewer artefacts, less blurriness).



Fig. 2: Framework of HiSplat. For simplicity, the two-input-image case is illustrated. HiSplat uses a shared U-Net backbone to extract features at different scales. From these features, three processing stages predict pixel-aligned Gaussian parameters at their respective scales. The Error Aware Module and Modulating Fusion Module perceive errors from earlier stages and guide the Gaussians of later stages toward compensation and repair. Finally, the fused hierarchical Gaussians reconstruct both large-scale structure and texture details.
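The three-stage, coarse-to-fine prediction can be sketched as follows. All names, shapes, and the random stand-in predictor are assumptions for illustration; only the overall structure (per-view, per-scale pixel-aligned Gaussians fused into one set) mirrors the framework, and the error-aware guidance between stages is reduced to a comment.

```python
import numpy as np

def predict_gaussians(feat):
    """Map a (C, h, w) feature map to one Gaussian per pixel.
    A real network regresses these parameters; random values stand in."""
    c, h, w = feat.shape
    n = h * w
    rng = np.random.default_rng(0)
    return {
        "mean": rng.standard_normal((n, 3)),  # 3D centers
        "scale": np.full((n, 3), 4.0 / h),    # coarser stage -> larger Gaussians
        "opacity": np.full((n, 1), 0.5),
        "color": rng.random((n, 3)),
    }

def hierarchical_gaussians(images, base_res=(64, 64), n_stages=3):
    """Coarse-to-fine: stage s works at 1/2**(n_stages-1-s) of full resolution."""
    H, W = base_res
    all_gauss = []
    for _img in images:                       # two reference views
        for s in range(n_stages):
            h, w = H >> (n_stages - 1 - s), W >> (n_stages - 1 - s)
            feat = np.zeros((32, h, w))       # placeholder for U-Net features
            g = predict_gaussians(feat)
            # In HiSplat, errors from earlier stages would modulate this stage
            # (Error Aware Module / Modulating Fusion Module).
            all_gauss.append(g)
    # Fuse: union of all pixel-aligned Gaussians across stages and views.
    return {k: np.concatenate([g[k] for g in all_gauss]) for k in all_gauss[0]}

fused = hierarchical_gaussians([None, None])
# Two views, with stages at 16x16, 32x32, and 64x64 pixels each:
print(fused["mean"].shape[0])  # 2 * (256 + 1024 + 4096) = 10752
```

The point of the sketch is the count structure: each stage contributes one Gaussian per pixel of its feature map, so finer stages dominate the fused set numerically while coarser stages contribute larger Gaussians covering broad structure.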

Qualitative Comparison

We provide a selection of qualitative comparison results for PixelSplat, MVSplat, HiSplat (Ours), and the ground truth.


Quantitative Results

We provide quantitative comparison results. For Novel View Synthesis (RealEstate10K, ACID), we train and test the models on the same dataset. To evaluate cross-dataset generalization, we train the models on RealEstate10K and directly test them on target datasets (DTU, ACID, Replica) without fine-tuning.

Tab. 1: Evaluation on RealEstate10K and ACID. Compared with previous NeRF-based and generalizable Gaussian-Splatting-based methods, HiSplat consistently achieves higher rendering quality on unseen views.



Tab. 2: Evaluation of cross-dataset generalization. We train models on RealEstate10K and test them on the object-centric dataset DTU, the outdoor dataset ACID, and the indoor dataset Replica in a zero-shot setting. Compared with previous methods, HiSplat better handles diverse scenes with different distributions and scales.