REVIVE 3D

Refinement via Encoded Voluminous Inflated Prior for Volume Enhancement

REVIVE 3D generates voluminous 3D assets from flat images.

Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Hankyeol Lee   Wooyeol Baek   Seongdo Kim   Jongyoo Kim
Yonsei University
Paper (Coming Soon) · Code

Abstract

Recent generative models have shown strong performance in generating diverse 3D assets from 2D images, a fundamental problem in computer vision and graphics. However, these models still struggle to generate voluminous 3D assets when the input is a flat image that provides limited 3D cues. We introduce REVIVE 3D, a two-stage, plug-and-play pipeline for generating voluminous 3D assets from flat images. In Stage 1, we construct an Inflated Prior by inflating the foreground silhouette to recover global volume and superimposing part-aware details to capture local structure. In Stage 2, 3D Latent Refinement injects Gaussian noise into the Inflated Prior's latent and then denoises it, using the prior's geometric cues to leverage the backbone's pretrained 3D knowledge. Our framework also supports image-conditioned 3D editing. To quantify volume and surface flatness, we propose two metrics, Compactness and Normal Anisotropy, and validate them through a user study showing that they align with human perception of volume and quality. Extensive qualitative and quantitative evaluations show that REVIVE 3D achieves state-of-the-art performance on a challenging flat-image dataset.

Flat Image to 3D Generation

Browse four generated meshes per page; we provide twelve image-conditioned generation results in total.

Comparison with Baselines

Editing Results

Method

REVIVE 3D architecture
Stage 1

Inflated Prior

We recover missing global volume by inflating the foreground silhouette and add part-aware local cues via superimposing.
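The page does not specify the inflation operator. As a rough illustration only, one classical way to inflate a binary silhouette into a volume is to derive a heightfield from the mask's distance transform and mirror it front-to-back; the function name, the square-root height profile, and the voxel representation below are our own assumptions, not the paper's method.

import numpy as np
from scipy.ndimage import distance_transform_edt

def inflate_silhouette(mask: np.ndarray, depth: int = 64) -> np.ndarray:
    """Turn a binary foreground mask (H, W) into a voxel occupancy grid
    (H, W, depth) by inflating it symmetrically about the image plane.

    Illustrative sketch: a sqrt-of-distance-transform heightfield is one
    classical inflation profile; REVIVE 3D's actual operator may differ.
    """
    # Distance (in pixels) from each foreground pixel to the silhouette edge.
    dist = distance_transform_edt(mask)

    # Height profile: thickest near the medial axis, zero at the boundary.
    height = np.sqrt(dist)
    height = height / (height.max() + 1e-8) * (depth / 2)

    # Build a front/back-symmetric occupancy grid around the mid-plane.
    z = np.abs(np.arange(depth) - depth / 2)           # (depth,)
    occupancy = height[..., None] >= z[None, None, :]  # (H, W, depth)
    return occupancy & mask[..., None].astype(bool)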

Stage 2

3D Latent Refinement

We refine the Inflated Prior in the backbone's latent space to obtain geometry that is both more plausible and more consistent with the input image.
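This noise-then-denoise refinement matches an SDEdit-style partial-noising loop. The sketch below assumes a hypothetical image-conditioned 3D latent diffusion interface (encode, eps, decode, and a DDIM-like schedule); none of these names come from the paper.

import torch

@torch.no_grad()
def refine_latent(prior_mesh, image_cond, backbone, t0: float = 0.6):
    """SDEdit-style refinement of the Inflated Prior (illustrative sketch).

    `backbone` is a hypothetical image-conditioned 3D latent diffusion
    model exposing .encode(mesh), .eps(z, t, cond), .decode(z), and
    precomputed .alpha_bars; the real REVIVE 3D interfaces may differ.
    """
    z = backbone.encode(prior_mesh)                 # mesh -> 3D latent

    # Start from an intermediate timestep: inject only partial Gaussian
    # noise so the prior's coarse geometry survives the denoising.
    t_start = int(t0 * (backbone.num_timesteps - 1))
    a_bar = backbone.alpha_bars[t_start]
    z_t = a_bar.sqrt() * z + (1 - a_bar).sqrt() * torch.randn_like(z)

    # Image-conditioned denoising from t_start down to 0 (DDIM, eta = 0).
    for t in range(t_start, 0, -1):
        eps = backbone.eps(z_t, t, image_cond)      # predicted noise
        a_t, a_prev = backbone.alpha_bars[t], backbone.alpha_bars[t - 1]
        z0_hat = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        z_t = a_prev.sqrt() * z0_hat + (1 - a_prev).sqrt() * eps

    return backbone.decode(z_t)                     # latent -> refined mesh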

Overview of our method. Stage 1 generates the Inflated Prior. We create a Base 3D from the Silhouette Mask and Detail 3D from Segmentation Masks, then combine them via superimposing. Stage 2 refines the Inflated Prior by encoding the mesh, injecting noise, denoising it with the image condition, and decoding the result into the Refined 3D mesh.
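The page does not define the superimposing operator. If Base 3D and Detail 3D are represented on a shared signed-distance grid, one natural reading is a pointwise CSG union, sketched below; the SDF representation and function name are our assumptions, not a confirmed detail of REVIVE 3D.

import numpy as np

def superimpose(base_sdf: np.ndarray, detail_sdf: np.ndarray) -> np.ndarray:
    """Combine Base 3D and Detail 3D on a shared voxel grid (illustrative).

    With signed distance fields (negative inside), a pointwise minimum is
    the standard CSG union: every point inside either shape stays inside
    the combined shape.
    """
    assert base_sdf.shape == detail_sdf.shape
    return np.minimum(base_sdf, detail_sdf)  # union of the two shapes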

Quantitative Results

Uni3D and ULIP show that our results are the most semantically aligned with the input image, while Compactness (C) and Normal Anisotropy (NA) show that they are the most voluminous and the least flat.

Models               | Uni3D ↑ | ULIP ↑ | C ↑    | NA ↓
Trellis              | 0.2736  | 0.1241 | 0.1748 | 0.1282
DrawingSpinUp        | 0.2335  | 0.1164 | 0.1604 | 0.1332
Hunyuan3D-Omni       | 0.2816  | 0.1257 | 0.1707 | 0.1120
Direct3D             | 0.2796  | 0.1315 | 0.2012 | 0.1019
Hunyuan3D-2.1        | 0.2759  | 0.1193 | 0.1408 | 0.1347
Ours (Hunyuan3D-2.1) | 0.3043  | 0.1265 | 0.2179 | 0.0767
Ours (Direct3D)      | 0.3097  | 0.1375 | 0.2178 | 0.0908