OccFusion: Rendering Occluded Humans with Generative Diffusion Priors

Adam Sun; Tiange Xiang; Scott Delp; Li Fei-Fei; Ehsan Adeli

Author manuscript; available in PMC: 2025 Jun 26.

Published in final edited form as: Adv Neural Inf Process Syst. 2024;37:92184–92209.



PMCID: PMC12199745 NIHMSID: NIHMS2086915 PMID: 40575631

Abstract

Most existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings where obstructions are common, resulting in only partial visibility of the human. Considering this, we present OccFusion, an approach that utilizes 3D Gaussian splatting supervised by pretrained 2D diffusion models for efficient and high-fidelity human rendering. We propose a pipeline consisting of three stages. In the Initialization stage, complete human masks are generated from partial visibility masks. In the Optimization stage, human 3D Gaussians are optimized with additional supervision by Score-Distillation Sampling (SDS) to create a complete geometry of the human. Finally, in the Refinement stage, in-context inpainting is designed to further improve rendering quality on the less observed human body parts. We evaluate OccFusion on ZJU-MoCap and challenging OcMotion sequences and find that it achieves state-of-the-art performance in the rendering of occluded humans.

1. Introduction

Rendering 3D humans from monocular in-the-wild videos has been a persistent challenge, with significant implications in virtual/augmented reality, healthcare, and sports. Given a video of a human moving around a scene, this task involves reconstructing the appearance and geometry of the human, allowing for the rendering of the human from novel views.

When faced with the problem of human reconstruction from monocular video, several works based on neural radiance fields (NeRFs) have achieved promising results [37, 57, 19, 9]. 3D Gaussian splatting [24] further improves upon NeRF-based rendering methods for better performance. By representing the human not as an implicit radiance field but as a set of explicit 3D Gaussians, methods like GauHuman [14] and 3DGS-Avatar [46] are able to render humans comparable in quality to NeRF methods, while taking only a few minutes to train and less than a second to render.

While most human rendering studies assume clean, unobstructed environments, real-world settings like hospitals, stadiums, and construction sites involve frequent occlusions. Current methods struggle in these conditions, often producing artifacts, floating elements, or incomplete body parts. Solutions like OccNeRF [64] and Wild2Avatar [63] attempt to address occlusions but are limited by high computational demands and lengthy training times, which restricts their real-world applicability.

In this work, we introduce OccFusion, an efficient yet high quality method for rendering occluded humans. To gain improved training and rendering speed, OccFusion represents the human as a set of 3D Gaussians. Like almost all other human rendering methods [64, 63, 67], OccFusion assumes accurate priors such as human segmentation masks and poses are provided for each frame, which can be obtained with state-of-the-art off-the-shelf estimators such as SAM [25] and HMR 2.0 [8]. However, to ensure complete and high-quality renderings under occlusion, OccFusion proposes to utilize generative diffusion priors, more specifically pose-conditioned Stable Diffusion 1.5 [48] with ControlNet [76] plugins, to aid in the reconstruction process.

Our approach consists of three stages: (1) The Initialization Stage: we utilize segmentation and pose priors to inpaint occluded human visibility masks into complete human occupancy masks to supervise later stages. (2) The Optimization Stage: we initialize a set of 3D Gaussians and optimize them based on observed regions of the human, applying pose-conditioned Score-Distillation Sampling (SDS) to help ensure completeness of the modeled human body in both the posed and canonical space. (3) The Refinement Stage: we utilize pretrained generative models to inpaint unobserved regions of the human with context from partial observations and renderings from the previous stage, further improving the quality of the renderings. Despite taking only 10 minutes to train, our method outperforms the state-of-the-art in rendering humans from occluded videos.

In summary, our contributions are: (i) We propose OccFusion, the first method to combine Gaussian splatting with diffusion priors for the rendering of occluded humans from monocular videos. Multiple novel components are proposed along with a three-stage pipeline consisting of Initialization, Optimization, and Refinement stages. (ii) We demonstrate that OccFusion achieves state-of-the-art efficiency and rendering quality of occluded humans on both simulated and real-world occlusions.

2. Related Work

2.1. Neural Human Rendering

Traditional methods to reconstruct humans usually require dense arrays of cameras [10, 4, 2] or depth information [71, 50, 4, 5], both of which are unobtainable for in-the-wild scenes. To solve this problem, Neural Radiance Fields (NeRFs) [37] have recently been used to model dynamic humans from monocular videos [57, 9, 19, 17, 72, 52]. These methods achieve high-quality novel view synthesis by parametrizing the human body using an SMPL [35] pose prior and modeling it as a radiance field. However, since NeRFs depend on large Multi-Layer Perceptrons (MLPs), they are computationally expensive, usually taking days to train and minutes to render [24, 14, 46]. To speed up NeRF-based models, multi-resolution hash encoding [40, 43, 6, 18] and generalizable formulations [41, 2, 13, 27] have been proposed. However, these methods face either a rendering bottleneck [14] or an expensive pre-training process, both of which affect their efficiency.

Point-based rendering methods like 3D Gaussian splatting [24] greatly accelerate the rendering of static and dynamic scenes. Recently, there has been an abundance of work applying 3D Gaussian splatting to human rendering tasks [46, 14, 26, 38, 74, 33, 68, 20, 29, 28, 42, 12]. Like NeRF-based approaches, Gaussian splatting-based approaches represent the human in a canonical space and use Linear Blend Skinning (LBS) to transform the human into the posed space. Gaussian splatting methods achieve state-of-the-art performance on dynamic humans with fast training times and real-time rendering, making them the preferred approach [46, 14].

2.2. Occluded Human Rendering

Reconstructing complex scenes in the wild is a well-studied problem. NeRF-W [36] and other works [3, 47, 75] are able to account for photometric variation and transient occluders, allowing them to render consistent representations from unconstrained image collections. However, these works are not designed to handle dynamic objects like humans.

Rendering humans in occluded settings, on the other hand, is relatively understudied. Sun et al. [51] utilize a layer-wise scene decoupling model to decouple humans from occluding objects. OccNeRF [64] combines geometric and visibility priors with surface-based rendering to train a human NeRF model, while Wild2Avatar [63] proposes an occlusion-aware scene parametrization scheme to decouple the human from the background and occlusions. While these works provide decent renderings of humans free of occlusions, they are slow and impractical due to their usage of NeRFs. A concurrent work to ours is OccGaussian [67], which also proposes to model occluded humans with 3D Gaussians by performing an occlusion feature query in occluded regions. We provide comparisons to their published results in Table 1.

Table 1:

Quantitative comparison on the ZJU-MoCap and OcMotion datasets. LPIPS values are scaled by ×1000. We color cells that have the best and second best metric values.

Methods	ZJU-MoCap [44]			OcMotion [15]

	PSNR↑	SSIM↑	LPIPS↓	PSNR^*↑	SSIM^*↑	LPIPS^*↓

HumanNeRF [57]	20.67^‡	0.9509^‡	-	-	-	-
3DGS-Avatar [46]	17.29^†	0.9410^†	63.25^†	9.788^†	0.7203^†	188.1^†
GauHuman [14]	21.55	0.9430	55.88	15.09	0.8525	107.1

OccNeRF [64]	22.40^‡	0.9562^‡	43.01^‡	15.71	0.8523	82.90
OccGaussian [67]	23.29^‡	0.9482^‡	41.93^‡	-	-	-
Wild2Avatar [63]	-	-	-	14.09^§	0.8484^§	93.31^§

OccGauHuman	22.71	0.9492	54.60	18.85	0.8863	86.53
OccFusion	23.96	0.9548	32.34	18.28	0.8875	82.42

^* Metrics calculated on visible pixels only.
^† Model trained for 5k iterations with ×3 training time.
^‡ Results taken from OccGaussian [67], using ×5 training frames.
^§ Model trained under the default setting [63] using ×2 training frames.

2.3. Generative Diffusion Priors

Inferring the appearance of unobserved regions of 3D scenes requires the usage of generative models. The recent success of 2D diffusion models has made them the preferred model to use for generation [54, 21, 34, 48, 32]. To lift 2D diffusion models for 3D content generation, DreamFusion [45] proposed Score Distillation Sampling (SDS), a commonly used method for utilizing a pre-trained 2D diffusion model to supervise 3D content generation [30, 70, 53, 55].

Diffusion models can also be used as priors for training NeRFs and Gaussian splatting, combining reconstruction with generation [61, 78, 79, 62, 66, 73]. ReconFusion [61] uses SDS in conjunction with multi-view conditioning to synthesize the appearance of unobserved regions of a scene from sparse views, while BAGS [78] utilizes SDS to supervise a Gaussian splatting model.

3. Preliminaries

Before introducing our method, we provide an overview of key fundamentals in 3D human modeling using SMPL (subsection 3.1). Then, we discuss 3D Gaussian splatting, and how it can be applied to human modeling (subsection 3.2). Finally, we propose OccGauHuman, a simple improvement of GauHuman [14] that is better designed for occluded human rendering (subsection 3.3).

3.1. 3D Human Modeling

SMPL [35] is a model that parametrizes the human body with a 3D surface mesh. To transform a point from the canonical space to the posed space, the Linear Blend Skinning (LBS) algorithm is used. Given a 3D point $x_{c}$ in the canonical space and the shape $β$ and pose $θ$ parameters of the human, the corresponding point in the posed space can be calculated as:

x_{p} = \sum_{k = 1}^{K} w_{k} (G_{k} (J, θ) x_{c} + b_{k} (J, θ, β)),

(1)

where $J$ contains $K$ joint locations, $G_{k}$ and $b_{k}$ are the transformation matrix and translation vector, and $w_{k} \in [0, 1]$ are a set of skinning weights. The SMPL representation is commonly used as a geometric prior for human rendering [64, 63, 46, 14, 57, 19, 72].
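As an illustration of Equation 1, the following is a minimal NumPy sketch of LBS, not the SMPL implementation. It assumes the per-joint transforms $G_{k}$ and translations $b_{k}$ have already been computed from $J$, $θ$, and $β$; the function name and array layouts are hypothetical.

```python
import numpy as np

def linear_blend_skinning(x_c, weights, G, b):
    """Map canonical points to the posed space following Equation 1.

    x_c:     (N, 3) canonical points
    weights: (N, K) skinning weights w_k, each row summing to 1
    G:       (K, 3, 3) per-joint transformation matrices (assumed precomputed)
    b:       (K, 3) per-joint translation vectors (assumed precomputed)
    """
    # Transform every point by every joint: result has shape (N, K, 3).
    per_joint = np.einsum("kij,nj->nki", G, x_c) + b[None, :, :]
    # Blend the K candidate positions with the skinning weights.
    return np.einsum("nk,nki->ni", weights, per_joint)
```

For example, a point shared equally between an identity joint and a joint that translates by one unit along x ends up shifted by half a unit.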

3.2. Human Rendering with 3D Gaussian Splatting

3D Gaussian splatting.

3D Gaussian splatting [24] models a scene as a set of 3D Gaussians $Π$. Each Gaussian is defined by its 3D location $p_{i}$, opacity $o_{i} \in [0, 1]$, center $μ_{i}$, covariance matrix $Σ_{i}$, and spherical harmonic coefficients. The $i$-th Gaussian is defined as $o_{i} e^{- \frac{1}{2} {(p - μ_{i})}^{T} Σ_{i}^{- 1} (p - μ_{i})}$. During rendering, these 3D Gaussians are projected from the 3D world space to the 2D image space and composited via $α$-blending, with the color of each pixel being calculated across the $N$ 3D Gaussians as:

C = \sum_{j = 1}^{N} c_{j} α_{j} \prod_{k = 1}^{j - 1} (1 - α_{k}),

(2)

where $c_{j}$ is the color and $α_{j}$ is the opacity of the $j$-th Gaussian in $z$-depth order. During the training process, 3D Gaussians are adaptively controlled via densification (splitting and cloning) and pruning until they achieve the optimal density to adequately represent the scene.
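The compositing rule in Equation 2 can be sketched for a single pixel as follows. This is an illustrative NumPy version rather than the rasterizer used by 3D Gaussian splatting; the function name is hypothetical, and the inputs are assumed to be non-empty and already sorted from nearest to farthest.

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back alpha blending for one pixel (Equation 2).

    colors: (N, 3) colors c_j, sorted from nearest to farthest
    alphas: (N,) opacities alpha_j in [0, 1], in the same order
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance in front of Gaussian j: prod_{k<j} (1 - alpha_k).
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    # Weighted sum of colors, each scaled by its own opacity and transmittance.
    return (colors * (alphas * transmittance)[:, None]).sum(axis=0)
```

A half-transparent red Gaussian in front of an opaque blue one yields an equal mix of the two, since the second Gaussian only receives the remaining 50% transmittance.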

GauHuman [14].

In the line of work that uses 3D Gaussian splatting for human rendering [46, 26, 29, 12], GauHuman is a representative approach due to its balance between efficiency and rendering quality. After initializing 3D Gaussians on the vertices of the SMPL mesh, GauHuman learns a representation of the human in canonical space and utilizes LBS to transform each individual Gaussian into the posed space. A pose refinement module $MLP_{Φ_{pose}}$ and an LBS weight field module $MLP_{Φ_{lbs}}$ are used to learn the LBS transformation, and a merge operation based on KL divergence is used along with splitting, cloning, and pruning to help the 3D Gaussians reach convergence.

We base our method on GauHuman due to its fast training and state-of-the-art representative ability. GauHuman’s code is distributed under the S-Lab license.

3.3. OccGauHuman: An Improved Baseline for Occlusion Handling

In common human rendering tasks, videos are captured in a clean environment, with every pixel in the image belonging to either the human or the background. By using a semantic segmentation model such as SAM [25] to preprocess a video, we can train the human rendering model only on pixels labeled as “human”. However, occlusions in the videos may lead to sparse observations of the human. As a result, fitting NeRF-based human rendering models on only the visible human pixels results in an incomplete geometry with lots of artifacts [64, 63].

Gaussian splatting-based rendering models [24] are especially suitable for human modeling tasks due to their explicit geometry and point-based representation. In this section, we present three straightforward tweaks of GauHuman [14] to make it perform better on videos with occlusions: (1) As discussed above, we train the model on visible human pixels only, ensuring that occlusions do not result in learned sparsity on the human model. (2) We adjust the loss weights to put more weight on the mask loss computed between rendered human occupancy maps and the segmentation masks; we found that this helps learn crisper human boundaries. (3) We disable the densification and pruning of 3D Gaussians during training, which helps maintain a largely complete human geometry based on the SMPL initialization.

The resulting OccGauHuman model serves as an improved baseline for occluded human reconstruction and as a starting point for our method. Benefits brought by our updates compared to the original GauHuman are presented in Table 1, as well as in Figure 7.

Figure 7: — Qualitative comparisons on **simulated occlusions** in the ZJU-MoCap dataset [44] (left column) and **real-world occlusions** in the OcMotion dataset [16] (right column). GH denotes GauHuman [14] and OGH denotes OccGauHuman.

4. OccFusion

In our approach, we train a Gaussian splatting-based human rendering model on the visible pixels of a human. However, recovering occluded content for a dynamically moving human is not trivial: humans are often in challenging poses, and complex occlusions can cause additional issues. It is also essential to preserve a consistent human appearance and geometry across different frames. Considering these challenges, we structure OccFusion as three sequential stages. In the Initialization Stage (section 4.1), we inpaint occluded binary human masks for more reliable geometric guidance. In the Optimization Stage (section 4.2), we use the inpainted masks to train a human rendering model based on GauHuman [14] while using Score Distillation Sampling (SDS) constraints on both the posed space and canonical space. In the Refinement Stage (section 4.3), we fine-tune the trained model from the Optimization Stage with in-context inpainting to further refine the appearance of the human. An overview of OccFusion is shown in Figure 2.

Figure 2: — **OccFusion** achieves occluded human rendering via three sequential stages. In the **Initialization Stage**, we recover complete binary human masks ${\hat{M}}$ from occluded partial observations ${I}$ with the help of segmentation priors ${M}$ and pose priors ${P}$ . ${\hat{M}}$ will be further used to help optimize the 3D Gaussians $Π$ in subsequent stages. In the **Optimization Stage**, we apply ${P}$ conditioned SDS on both posed human and canonical human to enforce the human occupancy to remain complete. In the **Refinement Stage**, we use the coarse human renderings ${\hat{I}}$ from the Optimization Stage to help generate missing RGB values in ${I}$ through our proposed in-context inpainting. Through this process, both the appearance and geometry of the human are fine-tuned to be in high fidelity. Training of all three stages takes only **10 minutes** on a single Titan RTX GPU.

4.1. Initialization Stage: Recovering Human Geometry from Partial Observations

Generative diffusion models [48] have shown promise as priors for a range of tasks [22, 53]. The most straightforward way to recover the occluded human is to condition a pretrained diffusion model $Φ$ [39, 76] on a precomputed segmentation prior $M$ and pose prior $P$ and inpaint $1 − M$, the image regions that are not occupied by the human. However, there are two significant barriers to such a straightforward approach.

Conditioned human generation cannot handle challenging poses.

It is true that a conditioned diffusion prior $Φ$ is able to generate detailed images while staying consistent with the condition. However, since diffusion models like $Φ$ tend to overfit to more commonly seen poses, $Φ$ usually fails to generate reasonable images when conditioned on challenging poses (see Figure 3 middle column). We attribute this limitation to the inappropriate 2D representation of $P$: when joints occlude each other, it is impossible to tell which joints are closest to the camera once they are projected to 2D. We therefore propose to simplify the 2D representation of $P$. For each joint, we apply a Z-buffer test on the depth map rendered from the SMPL mesh [35], calculating the distance $d$ between the joint's z-axis location and the z-buffer value at the joint's 2D projection. Given a pre-defined threshold $σ$, we deem a joint self-occluded if $d > σ$. Self-occluded joints are ignored when projecting 3D joints onto the 2D canvas for conditioning $Φ$ (see Figure 3 right column). Our simplification improves the generation quality of $Φ$ for challenging poses.
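The Z-buffer test described above might be sketched as follows. This is a minimal illustration under several assumptions: the depth map is taken as already rendered from the SMPL mesh, every joint is assumed to project inside the image, $d$ is treated as a signed depth gap, and the function and argument names are hypothetical.

```python
import numpy as np

def visible_joint_mask(joints_cam, joints_px, depth_map, sigma):
    """Flag joints that pass the Z-buffer test (True means keep the joint).

    joints_cam: (K, 3) joint positions in camera space (z is depth)
    joints_px:  (K, 2) integer pixel coordinates (col, row) of each joint
    depth_map:  (H, W) depth of the nearest SMPL surface at each pixel
    sigma:      threshold on the depth gap d
    """
    cols, rows = joints_px[:, 0], joints_px[:, 1]
    # d: how far each joint lies behind the nearest rendered surface.
    d = joints_cam[:, 2] - depth_map[rows, cols]
    # A joint is treated as self-occluded when d > sigma, so keep d <= sigma.
    return d <= sigma
```

Only the joints flagged `True` would then be drawn on the 2D pose canvas used to condition $Φ$.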

Figure 3: — Stable Diffusion 1.5 generations [48] conditioned on a challenging pose $P$ . While conditioning on the original pose results in multiple limbs and other abnormalities, our method of simplifying pose by removing self-occluded joints results in more feasible generations.

Per-frame inpainting cannot guarantee cross-frame consistency.

Compared to image generation models, video generation models [11, 60, 7] are less accessible and much more expensive to run. Without an explicit modeling of object motion in the video, frame-by-frame generation with an image generative model leads to cross-frame inconsistency, which is not desirable for human reconstruction (see Figure 4 middle column). Instead of inpainting the occluded parts of the human directly with $Φ$, we argue that it is more feasible to inpaint binary human masks, since small variations in the human silhouette are more acceptable (see Figure 4 right column). We first inpaint the RGB image $I$ and then rely on an off-the-shelf segmentation model [23] to obtain the inpainted binary human masks ${\hat{M}}$, which are used to assist the training of the rendering model in subsequent stages.

Figure 4: — While generative models provide inconsistent inpainting results, the binary masks that can be extracted from these generated images are much more consistent.

4.2. Optimization Stage: Enforcing Human Completeness with SDS Regularization

After obtaining the inpainted masks ${\hat{M}}$ that outline a reasonable human silhouette, we build a Gaussian splatting model similar to the one described in section 3.2 for human rendering. The 3D Gaussians $Π$ are initialized at the SMPL mesh vertices and can be deformed to adapt to different poses through SMPL-based LBS (Equation 1). With the help of ${\hat{M}}$, the training of $Π$ consists of multiple photometric loss terms $ℒ_{photo}$:

ℒ_{photo} = λ_{rgb} L_{1} (M \cdot I, M \cdot I^{'}) + λ_{mask} L_{2} (\hat{M}, A) + λ_{ssim} SSIM (M \cdot I, M \cdot I^{'}) + λ_{lpips} LPIPS (M \cdot I, M \cdot I^{'}),

(3)

where $L_{1}$ is the L-1 loss, $L_{2}$ is the L-2 loss, $SSIM(\cdot)$ is the structural similarity function [56], $LPIPS$ is the VGG-based perceptual loss [77], $I^{'}$ is the rendered image from $Π$, and $A$ is the rendered human occupancy map. Each of the loss terms is scaled by a weight hyperparameter $λ$.
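To make the structure of Equation 3 concrete, here is a hedged NumPy sketch of the weighted sum. The SSIM and LPIPS terms are passed in as placeholder callables (`ssim_term` and `lpips_term` are hypothetical names, not a specific library API), and the mean reduction over pixels is an assumption.

```python
import numpy as np

def photometric_loss(I, I_render, M, M_hat, A, lam, ssim_term, lpips_term):
    """Weighted sum of the loss terms in Equation 3.

    I, I_render: (H, W, 3) input image and image rendered from the Gaussians
    M:           (H, W) visible-human mask
    M_hat:       (H, W) inpainted complete human mask
    A:           (H, W) rendered human occupancy map
    lam:         dict of weights with keys "rgb", "mask", "ssim", "lpips"
    ssim_term, lpips_term: callables (img_a, img_b) -> scalar (placeholders)
    """
    # RGB terms only see pixels where the human is visible.
    vis_in = M[..., None] * I
    vis_out = M[..., None] * I_render
    l1 = np.abs(vis_in - vis_out).mean()
    # The mask term compares the rendered occupancy with the complete mask.
    l2 = ((M_hat - A) ** 2).mean()
    return (lam["rgb"] * l1
            + lam["mask"] * l2
            + lam["ssim"] * ssim_term(vis_in, vis_out)
            + lam["lpips"] * lpips_term(vis_in, vis_out))
```

Note that the RGB terms are masked by the visibility mask $M$, while the occupancy term is supervised by the inpainted mask ${\hat{M}}$; this is what lets geometry be supervised in occluded regions where no color is observed.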

Even with the supervision of ${\hat{M}}$ , geometry inconsistency still exists. Although inconsistent human masks affect the training of $Π$ much less than inconsistent images, human completeness cannot be guaranteed without further steps.

Using diffusion priors to enforce human completeness.

We build off of the insights from [53, 59, 70] and apply Score Distillation Sampling (SDS) [45] to improve the quality of human renderings and reduce artifacts. Instead of applying SDS on RGB images $I^{'}$ , which causes appearance inconsistency, we apply it directly to the rendered human occupancy maps $A$ so that diffusion scores are propagated to encourage complete $A$ :

\nabla_{Π} ℒ_{SDS}^{(P)} = E_{t, ϵ} [w (t) (ϵ_{ϕ} (A; t, P) - ϵ) \frac{\partial A}{\partial Π}],

(4)

where $t$ is a scheduled timestep, $w (\cdot)$ is a weighting function, $ϵ_{ϕ} (\cdot)$ is the UNet noise estimator in $Φ$, and $ϵ$ is the injected Gaussian noise.
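The term inside the expectation of Equation 4 can be sketched for a single noise sample as below. This is a simplified pixel-space illustration, not the latent-space Stable Diffusion pipeline used in the paper: `predict_noise` is a hypothetical stand-in for $ϵ_{ϕ}(\cdot; t, P)$, `alpha_bar_t` is the cumulative noise-schedule coefficient of a standard DDPM forward process, and the chain rule through $\partial A / \partial Π$ is left to an autodiff framework.

```python
import numpy as np

def sds_grad_wrt_render(A, predict_noise, alpha_bar_t, w_t, rng):
    """One-sample estimate of w(t) * (eps_hat - eps) from Equation 4.

    A:             rendered occupancy map (any array shape)
    predict_noise: callable noisy_A -> predicted noise (hypothetical stand-in)
    alpha_bar_t:   cumulative noise-schedule coefficient at timestep t
    w_t:           value of the weighting function w(t)
    rng:           numpy random Generator
    """
    # Sample the Gaussian noise that is injected into the rendering.
    eps = rng.standard_normal(A.shape)
    # Standard DDPM forward process: noisy version of the rendering.
    A_t = np.sqrt(alpha_bar_t) * A + np.sqrt(1.0 - alpha_bar_t) * eps
    # The residual between predicted and injected noise drives the update.
    eps_hat = predict_noise(A_t)
    return w_t * (eps_hat - eps)
```

When the predictor recovers the injected noise exactly, the residual vanishes and the rendering receives no update, which matches the intuition that SDS only pushes renderings the diffusion prior finds implausible.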

Using diffusion priors to regularize canonical pose.

In-the-wild videos often involve very sparse observations of the human, with only incomplete regions of the human visible in each frame. To further enforce completeness, we propose to render the human in the canonical Da-pose $\hat{P}$ with the human oriented at a random angle $\in \{k \frac{π}{9}, k \in Z\}$ . Applying SDS on the canonical renderings serves as regularization and is randomly activated during training. Overall, at each training step in the Optimization Stage, the 3D Gaussians $Π$ are optimized towards:

\nabla_{Π} [ℒ_{photo} + ρ \cdot λ_{pose} ℒ_{SDS}^{(P)} + (1 - ρ) \cdot λ_{can} ℒ_{SDS}^{(\hat{P})}],

(5)

where $ρ$ is a random variable that has a 75% chance to be 1 and 0 otherwise. The Optimization stage results in a complete and coherent geometry regardless of the viewing angle.
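The random switch between the two SDS terms in Equation 5 might be sketched as follows. This is an illustration only: the function name is hypothetical, and restricting $k$ to 18 values (one full turn in steps of $\frac{π}{9}$) is an assumption, since the text only states $k \in Z$.

```python
import math
import random

def sample_sds_branch(rng, p_posed=0.75, num_angles=18):
    """Pick which SDS term is active for one training step (Equation 5).

    Returns (rho, angle): rho is 1 for posed-space SDS and 0 for
    canonical-space SDS; angle is the canonical view orientation in
    radians, or None when the posed-space term is active.
    """
    rho = 1 if rng.random() < p_posed else 0
    if rho == 1:
        return rho, None  # posed-space SDS needs no canonical view
    # Canonical-space SDS: orientation is a multiple of pi / 9.
    k = rng.randrange(num_angles)
    return rho, k * math.pi / 9
```

A training loop would then add $λ_{pose} ℒ_{SDS}^{(P)}$ to the photometric loss when `rho` is 1 and $λ_{can} ℒ_{SDS}^{(\hat{P})}$ otherwise.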

4.3. Refinement Stage: Refining Human Appearance via In-context Inpainting

As shown in Figure 6 Exp. C and D, applying diffusion priors on rendered human occupancy maps is not able to recover the missing appearances of the human. This motivates the need for a subsequent stage that keeps refining $Π$ for better appearance.

The refinement of the appearance of 3D objects is not a new topic [53, 31, 70]. However, no existing generative models are capable of handling the consistency of appearance of a human across different frames and poses. We attribute this difficulty to the denoising process used in generative priors: random noise is injected into the rendering at each SDS step, which leads to uncertain results. This is infeasible for reconstruction tasks, which require frame-consistent representations that agree with all observations.

Our approach focuses on generating inpainted images of the occluded human offline to use as references. We first identify the occluded regions to be inpainted $R$ by using the rendered human occupancy masks $A$ from the Optimization Stage and pre-computed human visibility masks $M$: $R = (1 - M) \cdot A$. In order to encourage the generated regions to be more consistent with the partial observations, we propose in-context references inspired by in-context learning in language models [1]. Although renderings from the Optimization Stage lack sharp and high-fidelity details, they resemble complete human geometries and possess good enough features that can be used as a coarse reference to guide $Φ$ to inpaint similar contents at occluded body regions. To achieve this, we stack $\hat{I}$ and $I$ together as a single image input to $Φ$ with an additional prompt phrase: “the same person standing in two different rooms”.
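The inpainting region and the stacked in-context input could be assembled roughly as follows. This sketch assumes a side-by-side layout with the coarse rendering on the left, which the text does not specify; the function names are hypothetical, and the call to the diffusion inpainting model itself is omitted.

```python
import numpy as np

def inpainting_region(M, A):
    """R = (1 - M) * A: pixels rendered as human but not visible in the input."""
    return (1.0 - M) * A

def stack_in_context(I_hat, I, R):
    """Build the stacked image and inpainting mask for in-context inpainting.

    I_hat: (H, W, 3) coarse rendering from the Optimization Stage
    I:     (H, W, 3) occluded input image
    R:     (H, W) inpainting region for the input image
    """
    # Reference on the left, observation on the right (assumed layout).
    image = np.concatenate([I_hat, I], axis=1)             # (H, 2W, 3)
    # Only the observation half is marked for inpainting.
    mask = np.concatenate([np.zeros_like(R), R], axis=1)   # (H, 2W)
    return image, mask
```

Leaving the reference half unmasked means the generative model sees the complete coarse human as context while it fills in only the occluded pixels of the observation.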

We use the inpainted RGB images ${\tilde{I}}$ along with other priors to finetune $Π$ via photometric losses. Since diffusion models still tend to be somewhat inconsistent, we smooth training by putting more weight on perceptual loss terms and use L1 loss for the pixel-wise loss terms for its high robustness to variance:

\nabla_{Π} [λ_{rgb} L_{1} (M \cdot I, M \cdot I^{'}) + λ_{mask} L_{2} (\hat{M}, A) + λ_{gen} L_{1} (\tilde{I}, R \cdot I^{'}) + λ_{lpips} LPIPS (I, I^{'})] .

(6)

We train our entire pipeline for only 10 minutes on a single TITAN RTX GPU. More implementation details are provided in supplementary materials.

5. Experiments

In this section, we conduct quantitative and qualitative evaluation of our approach against state-of-the-art methods. Then, we conduct ablation studies of our entire pipeline, demonstrating that each stage is necessary for optimal performance. More experiments and results can be found in supplementary materials.

5.1. Datasets and Evaluation

ZJU-MoCap.

ZJU-MoCap [44] is a dataset consisting of 6 dynamic humans captured with a synchronized multi-camera system. Since the humans are in a lab environment free of occlusions, we follow OccNeRF’s [64] protocol to simulate occlusion of the human, masking out the center 50% of the human pixels for the first 80% of the frames. To challenge OccFusion on videos with even sparser frames, we use only 100 frames from the first camera with a sampling rate of 5 to train the models and use the other 22 cameras for evaluation.

OcMotion.

OcMotion [15] comprises 48 videos of humans interacting with real objects in indoor environments. Experiments are conducted on the same 6 sequences adopted by Wild2Avatar [63], which are selected to provide a diverse coverage of real-world occlusions. We form sparser subsequences by sampling only 50 frames from each sequence to train the models.

Evaluation.

We compare our OccFusion to OccNeRF [64], OccGaussian [67], and Wild2Avatar [63], the state-of-the-art in occluded human rendering. We also compare our results to GauHuman [14], HumanNeRF [57], and 3DGS-Avatar [46], popular human rendering methods not designed for occlusion. For fairness of comparison, all methods use the same set of segmentation masks and pose priors. We train GauHuman and OccGauHuman for 10 minutes each. We evaluate the methods both quantitatively and qualitatively. For our quantitative evaluations, we calculate the Peak Signal-to-Noise Ratio (PSNR), Structural SIMilarity (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) metrics against the ground truth images. Since no ground truth is provided for OcMotion, we calculate the metrics on visible pixels only. For qualitative evaluations, we render the human from novel views and assess the quality of the renderings.
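For reference, computing a metric on visible pixels only can be sketched for PSNR as below. This is a minimal NumPy illustration; the function name and the assumption that images are scaled to $[0, 1]$ are ours, and the SSIM and LPIPS counterparts are not shown.

```python
import numpy as np

def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR computed only over pixels where the mask is nonzero.

    pred, target: (H, W, 3) images with values in [0, max_val]
    mask:         (H, W) binary visibility mask
    """
    visible = mask.astype(bool)
    # Mean squared error over the visible pixels and their color channels.
    mse = np.mean((pred[visible] - target[visible]) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Errors in pixels outside the mask do not affect the result, so this kind of score reflects only the regions for which a reference is available.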

5.2. Results on Simulated and Real-world Occlusions

We provide quantitative metrics averaged over all the sequences in Table 1. Overall, methods designed for occluded human rendering tend to outperform their traditional counterparts. Among those methods, OccFusion consistently performs on par with or better than the state-of-the-art on both datasets while significantly beating all the baselines on LPIPS.

Qualitative results on novel view synthesis can be found in Figure 5. OccNeRF [64] has trouble generating unseen regions and renders significant discoloration and floaters when faced with occlusion. On the other hand, OccGauHuman’s renderings are blurry and occasionally incomplete. We observe that OccFusion is the only method to consistently produce sharp and high-quality renderings free of occlusions.

5.3. Additional Studies

Ablation studies.

We study the effect of each of our proposed components by adding them one by one and report average metrics on ZJU-MoCap in Table 2. Each stage plays a part towards optimal performance. Qualitative results on our ablations are included in Figure 6. We can see that the Initialization Stage helps enforce completeness for the initially incomplete human. The SDS regularization provided in the Optimization Stage helps remove floaters and artifacts in the posed and canonical space, further improving the shape of the human and enforcing completeness of the body. Finally, the Refinement Stage helps make the renderings more detailed in less observed regions, improving the rendering quality and greatly reducing the LPIPS.

Table 2:

Ablation results on the ZJU-MoCap [44] dataset. LPIPS values are scaled by ×1000.

Exp.	Methods	PSNR↑	SSIM↑	LPIPS↓	Train time

-	GauHuman [14]	21.55	0.9430	55.88	10 mins

A	OccGauHuman	22.54	0.9457	54.88	2 mins
B	+ Init Stage generated masks ${\hat{M}}$	23.52	0.9516	52.35	5 mins
C	+ Posed space SDS	23.90	0.9510	55.47	7 mins
D	+ Canonical space SDS (Optim Stage)	23.91	0.9514	55.35	7 mins

E	+ Refinement Stage	23.96	0.9548	32.34	10 mins


Does the proposed OccGauHuman perform better than GauHuman [14] in rendering occluded humans?

In section 3.3, we present a simple upgrade for the state-of-the-art 3DGS-based human rendering model GauHuman [14] to help it better handle occlusions. Our improvements are straightforward but effective. We show quantitative results in Figure 1 (Left) and Table 1. As shown in Figure 7, our improved OccGauHuman reconstructs a more complete human body than the vanilla GauHuman.

Figure 1: — Reconstructing humans from monocular videos frequently fails under occlusion. In this paper, we introduce **OccFusion**, a method that combines 3D Gaussian splatting with 2D diffusion priors for modeling occluded humans. Our method outperforms the state-of-the-art in rendering quality and efficiency, resulting in clean and complete renderings free of artifacts.

6. Discussions and Conclusion

Limitations.

Recovering occluded dynamic humans is challenging. As mentioned in section 4.3, reconstructing a 3D human requires adhering to multiple consistencies. However, even with state-of-the-art generative models, it is still impossible to perfectly maintain those consistencies for 4D content (3D + motion) generation. Although our proposed methods are specifically designed to eliminate potential variances when using generative priors, we can still observe that some generations are less coherent (e.g. Figure 4 and Figure 8), which may hurt the training of the rendering model in all stages. Moreover, we found that conditioning generative models on 2D poses provides only weak control: the pose of the generated human does not always align with the condition pose, which may introduce even more uncertainty for training. In future work, we hope to train our own consistency-aware diffusion model specifically finetuned on human data.

Societal Impacts.

Being able to reconstruct a human from an occluded monocular video can have a great societal impact. For example, having a high-fidelity 3D reconstruction of a human can help telemedicine practitioners become more immersed in the 3D space. While our research could lead to privacy concerns if humans are reconstructed without their consent, we believe that the benefits can be harnessed responsibly with appropriate safeguards.

Conclusion.

In this work, we propose OccFusion, one of the first works that utilize 3D Gaussian splatting for occluded human rendering. Our approach consists of three stages: the Initialization, Optimization, and Refinement stages. By combining the efficiency and representative ability of 3D Gaussian splatting with the generation capabilities of diffusion priors, our method achieves state-of-the-art in occluded human rendering quality as measured by the PSNR, SSIM, and LPIPS metrics while only taking around 10 minutes to train. We hope our work inspires further exploration into the capabilities of diffusion priors to aid in human reconstruction.


7. Acknowledgment

This work was partially funded by NIH Grants R01AG089169 and P41EB027060, Panasonic Holdings Corporation, the Gordon and Betty Moore Foundation, the Jaswa Innovator Award, Stanford HAI, a Stanford HAI graduate fellowship, and the Stanford Wu Tsai Human Performance Alliance.

A. Table of Symbols

For notational simplicity, we use alphabetic symbols in this paper to represent the essential components of our framework. To make the correspondence between symbols and names explicit, Table 3 lists the meaning of every symbol used in the paper.

Table 3: Table of symbols.

Symbols	Explanations

Preliminaries

$x_{c}$	3D points in the canonical human space
$x_{P}$	3D points in the posed human space
$w$	skinning weights used in LBS
$G$	transformation matrix used in LBS
$b$	translation vector used in LBS
$J$	3D locations of human joints
$θ$	pose parameters used in SMPL [35]
$β$	shape parameters used in SMPL [35]
$p$	center of a 3D Gaussian
$o$	opacity of a 3D Gaussian
$μ$	mean value of a 3D Gaussian
$Σ$	covariance matrix of a 3D Gaussian

OccFusion

$Π$	optimizable human 3D Gaussians
$Φ$	a pretrained generative model [48], used as prior
$M$	precomputed binary human mask, used as prior
$P$	precomputed human pose, used as prior
$\hat{P}$	the canonical articulation of $P$
$I$	input image with occluded human
$\hat{M}$	Init Stage generated complete human mask
$Δ$	SDS gradients, used as a guidance in the Optim Stage
$\hat{I}$	Optim Stage rendered human RGB image
$A$	human occupancy map rendered from $Π$ in all stages
$C$	Refine Stage rendered human RGB image
$R$	inpainting mask computed by $(1 - M) \cdot A$
$ρ$	a random variable $\in [0, 1]$ that controls Optim Stage SDS
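
The inpainting mask $R$ listed above can be illustrated with a minimal NumPy sketch. It assumes $M$ and $A$ are arrays of the same shape with values in $[0, 1]$; the function name `inpainting_mask` and the toy arrays are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def inpainting_mask(visible_mask: np.ndarray, occupancy: np.ndarray) -> np.ndarray:
    """Compute R = (1 - M) * A elementwise.

    R is non-zero where the rendered human occupies a pixel (A)
    that the precomputed visible-human mask (M) does not cover.
    """
    return (1.0 - visible_mask) * occupancy


# Toy 2x4 example (hypothetical values): M is the visible mask, A the rendered occupancy.
M = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=np.float32)
A = np.array([[1, 1, 1, 0],
              [1, 1, 0, 0]], dtype=np.float32)

R = inpainting_mask(M, A)
print(R)
```

In this toy case, $R$ selects exactly the two pixels that the rendering covers but the visible mask misses, which are the regions that would be handed to the inpainting model.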


B. Implementation Details

OccFusion requires several priors. We run SAM [25] to obtain all the human masks $M$. While we follow previous work [64, 63] and use the ground truth poses provided by ZJU-MoCap and OcMotion, for in-the-wild videos the pose priors $P$ can be obtained via occlusion-robust SMPL prediction/optimization methods such as HMR 2.0 [8] and SLAHMR [69]. Improving the quality of the priors is not the focus of this work. We use the pretrained Stable Diffusion 1.5 model [48] with ControlNet [76] plugins for SDS in all the stages.

In the Initialization Stage, instead of inpainting incomplete human masks directly, we run the pretrained diffusion model to inpaint RGB images with 10 inference steps and a ControlNet conditioning scale of 1.0. We use the positive prompt “clean background, high contrast to the background, a person only, plain clothes, simple clothes, natural body, natural limbs, no texts, no overlay” and the negative prompt “multiple objects, occlusions, complex pattern, fancy clothes, longbody, lowres, bad anatomy, bad hands, bad feet, missing fingers, cropped, worst quality, low quality, blurry”. After inpainting the RGB images, we run SAM-HQ [23] with $P$ as the prompts to obtain $\hat{M}$.

In the Optimization Stage, we train the human 3D Gaussians $Π$ from scratch following the objective in Equation 5. We set $λ_{rgb} = 1 \times 10^{4}$, $λ_{mask} = 2 \times 10^{4}$, $λ_{ssim} = 1 \times 10^{3}$, and $λ_{lpips} = 1 \times 10^{3}$. At each training step, we randomly apply the SDS regularization in either the posed human space or the canonical Da-pose space, with probabilities of 75% and 25%, respectively. When applying SDS regularization in the canonical human space, we randomly rotate the human horizontally by an angle sampled uniformly from $\{k \frac{π}{9}, k \in \mathbb{Z}\}$. We set the SDS loss weights as $λ_{pose} = 2 \times 10^{5}$ and $λ_{can} = 2 \times 10^{5}$. In this stage, we train $Π$ for 1200 steps.
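
The per-step choice of SDS space described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_sds_view` is hypothetical, the draw `rho` plays the role of the random variable $ρ$ from Table 3, and restricting $k$ to $0, \dots, 17$ (one full turn in steps of $π/9$) is our assumption, since the text only states $k \in \mathbb{Z}$.

```python
import math
import random

P_POSED = 0.75                 # probability of SDS in the posed space (from the text)
ROTATION_STEP = math.pi / 9    # canonical rotations are multiples of pi/9 (20 degrees)
NUM_ROTATIONS = 18             # assumption: k in 0..17 covers one full horizontal turn


def sample_sds_view(rng: random.Random):
    """Return ("posed", None) or ("canonical", rotation angle in radians)."""
    rho = rng.random()  # uniform in [0, 1)
    if rho < P_POSED:
        return "posed", None
    k = rng.randrange(NUM_ROTATIONS)
    return "canonical", k * ROTATION_STEP


rng = random.Random(0)
draws = [sample_sds_view(rng) for _ in range(1200)]  # one draw per training step
canonical_share = sum(space == "canonical" for space, _ in draws) / len(draws)
print(f"canonical share over 1200 steps: {canonical_share:.3f}")
```

Over the 1200 steps of this stage, roughly a quarter of the draws fall in the canonical space, each with one of the 18 horizontal rotations.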

In the Refinement Stage, we first generate the RGB human inpaintings via the proposed in-context inpainting method. We run the pretrained diffusion model conditioned on $M$, with 10 inference steps and a ControlNet conditioning scale of 0.3. We did not use positive prompts for the inpainting but used the same negative prompts as in the Optimization Stage. During training, we set the loss weights as $λ_{rgb} = 1$, $λ_{mask} = 0.1$, $λ_{gen} = 0.1$, and $λ_{lpips} = 0.2$. In this stage, we finetune $Π$ for another 1800 steps, with Gaussian densification and pruning enabled for the first 1000 steps.
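
The step counts reported for the two training stages can be summarized in a small scheduling sketch. The helper names `stage_at` and `refinement_densify_enabled` are hypothetical, and 0-based step indexing is an assumption; only the numbers 1200, 1800, and 1000 come from the text. Whether densification is also active during the Optimization Stage is not stated here, so the sketch covers the Refinement Stage only.

```python
OPTIM_STEPS = 1200     # Optimization Stage length (from the text)
REFINE_STEPS = 1800    # Refinement Stage length (from the text)
DENSIFY_STEPS = 1000   # densification/pruning window within the Refinement Stage


def stage_at(global_step: int):
    """Map a 0-based global step to (stage name, step index within that stage)."""
    if not 0 <= global_step < OPTIM_STEPS + REFINE_STEPS:
        raise ValueError("step outside the training schedule")
    if global_step < OPTIM_STEPS:
        return "optimization", global_step
    return "refinement", global_step - OPTIM_STEPS


def refinement_densify_enabled(local_step: int) -> bool:
    """Densification and pruning run only in the first 1000 refinement steps."""
    return 0 <= local_step < DENSIFY_STEPS


total_steps = OPTIM_STEPS + REFINE_STEPS
print(total_steps, stage_at(1200), refinement_densify_enabled(999))
```

Under these assumptions the full schedule is 3000 steps, with densification and pruning switched off for the last 800 refinement steps.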

C. Additional Studies

Effectiveness of in-context inpainting.

We compare the Refinement Stage results with and without in-context inpainting; qualitative comparisons are shown in Figure 8. While renderings from the Optimization Stage are less detailed in occluded areas, our proposed in-context inpainting is able to generate the missing content and greatly increase the rendering quality in these areas.

Figure 8: Comparison of the inpainted human in the Refinement Stage with and without using the proposed in-context inpainting technique. Major differences are highlighted with red arrows.

Figure 9: Applying SDS on RGB images vs. on human occupancy maps. As mentioned in Sec. 4.1 and Fig. 4 of the main paper, generated RGB appearances are much more inconsistent than generated silhouettes. As a result, applying SDS on RGB leads to defective rendering results.

Figure 10: Training with complete unoccluded masks vs. with inpainted masks in the Optimization Stage. Although inpainted masks are slightly more inconsistent compared to the complete masks, our training pipeline converges to the same level of rendering quality.

Figure 11: Novel view synthesis results from InstantMesh [65] conditioned on the least occluded frame. Discrepancies are circled in red.

Applying SDS on RGB vs on Human Occupancy Maps.

We include additional experiments comparing the rendering results of applying SDS on RGB vs. on human occupancy maps (as proposed). It is clear that applying SDS on RGB leads to defective renderings as well as inferior quantitative results. This experiment validates our claim made in Sec. 4.1 and Fig. 4 in the main paper.

Robustness of training to inpainted masks.

For in-the-wild occluded videos, there are no ground truth masks for the occluded body regions due to unknown human/garment deformations. Because we rely on state-of-the-art pretrained priors from the Segment Anything Model (SAM) [25] and Stable Diffusion [48], the segmented/inpainted masks are expected to be reasonable and coherent across frames. To test the robustness of our method to variance in the inpainted masks, we add comparison experiments on ZJU-MoCap that instead supervise with the complete SAM masks obtained from the unoccluded humans, which have minimal variance. Please see the qualitative results in Figure 10. We find that using the inpainted $\hat{M}$ yields rendering quality comparable to using masks derived from the unoccluded images, validating the robustness of our model.

Can existing generative models recover an occluded human?

While some works use generative diffusion models to render 3D humans conditioned on a single image [58] or on multiple images [49], none can condition on a monocular video of the person.

Since the code for [58] has not been released, we include results from InstantMesh [65]. We use the provided segmentation mask to composite the least occluded frame onto a white background and use it as conditioning. Novel view synthesis results are included in Figure 11. InstantMesh is unable to recover a complete human geometry and fails to generate a reasonable appearance from the single image.
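
The preprocessing of the conditioning image described above can be sketched in NumPy. The function names are hypothetical, and picking the least occluded frame as the one with the largest visible mask area is our assumption; the text does not say how that frame was chosen.

```python
import numpy as np


def least_occluded_index(masks) -> int:
    """Assumption: the least occluded frame has the largest visible mask area."""
    return int(np.argmax([np.count_nonzero(m) for m in masks]))


def composite_on_white(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep pixels inside the mask and set all other pixels to white.

    image: H x W x 3 uint8 array, mask: H x W array with non-zero = human.
    """
    keep = mask.astype(bool)[..., None]  # H x W x 1, broadcast over channels
    return np.where(keep, image, np.uint8(255)).astype(np.uint8)


# Toy data (hypothetical): two 2x2 frames filled with a constant color.
frames = [np.full((2, 2, 3), 10, dtype=np.uint8),
          np.full((2, 2, 3), 20, dtype=np.uint8)]
masks = [np.array([[1, 0], [0, 0]]),
         np.array([[1, 0], [0, 1]])]

idx = least_occluded_index(masks)
conditioning = composite_on_white(frames[idx], masks[idx])
print(idx, conditioning[:, :, 0])
```

The resulting image keeps the segmented human and replaces every other pixel with white, which is the form of input used to condition the single-image model in this comparison.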

D. Video Studies

For a more comprehensive presentation of the results, we include video renderings on all the training frames for both datasets. For the ZJU-MoCap videos (named with the prefix zju), from left to right, we show the occluded human, OccGauHuman rendering, Optimization Stage rendering, Refinement Stage rendering, and the reference. For the OcMotion videos (named with the prefix ocmotion), without references for real-world occlusions, from left to right, we show the occluded human, OccGauHuman rendering, Optimization Stage rendering, and Refinement Stage rendering.

References

[1].Brown Tom, Mann Benjamin, Ryder Nick, Subbiah Melanie, Kaplan Jared D, Dhariwal Prafulla, Neelakantan Arvind, Shyam Pranav, Sastry Girish, Askell Amanda, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. [Google Scholar]
[2].Chen Mingfei, Zhang Jianfeng, Xu Xiangyu, Liu Lijuan, Cai Yujun, Feng Jiashi, and Yan Shuicheng. Geometry-guided progressive nerf for generalizable and efficient neural human rendering. In European Conference on Computer Vision, pages 222–239. Springer, 2022. [Google Scholar]
[3].Chen Xingyu, Zhang Qi, Li Xiaoyu, Chen Yue, Feng Ying, Wang Xuan, and Wang Jue. Hallucinated neural radiance fields in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12943–12952, 2022. [Google Scholar]
[4].Collet Alvaro, Chuang Ming, Sweeney Pat, Gillett Don, Evseev Dennis, Calabrese David, Hoppe Hugues, Kirk Adam, and Sullivan Steve. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (ToG), 34(4):1–13, 2015. [Google Scholar]
[5].Dou Mingsong, Khamis Sameh, Degtyarev Yury, Davidson Philip, Fanello Sean Ryan, Kowdle Adarsh, Escolano Sergio Orts, Rhemann Christoph, Kim David, Taylor Jonathan, et al. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (ToG), 35(4):1–13, 2016. [Google Scholar]
[6].Geng Chen, Peng Sida, Xu Zhen, Bao Hujun, and Zhou Xiaowei. Learning neural volumetric representations of dynamic humans in minutes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8759–8770, 2023. [Google Scholar]
[7].Girdhar Rohit, Singh Mannat, Brown Andrew, Duval Quentin, Azadi Samaneh, Rambhatla Sai Saketh, Shah Akbar, Yin Xi, Parikh Devi, and Misra Ishan. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023. [Google Scholar]
[8].Goel Shubham, Pavlakos Georgios, Rajasegaran Jathushan, Kanazawa Angjoo, and Malik Jitendra. Humans in 4d: Reconstructing and tracking humans with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14783–14794, 2023. [Google Scholar]
[9].Guo Chen, Jiang Tianjian, Chen Xu, Song Jie, and Hilliges Otmar. Vid2avatar: 3d avatar reconstruction from videos in the wild via self-supervised scene decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12858–12868, 2023. [Google Scholar]
[10].Guo Kaiwen, Lincoln Peter, Davidson Philip, Busch Jay, Yu Xueming, Whalen Matt, Harvey Geoff, Orts-Escolano Sergio, Pandey Rohit, Dourgarian Jason, et al. The relightables: Volumetric performance capture of humans with realistic relighting. ACM Transactions on Graphics (ToG), 38(6):1–19, 2019. [Google Scholar]
[11].Gupta Agrim, Yu Lijun, Sohn Kihyuk, Gu Xiuye, Hahn Meera, Fei-Fei Li, Essa Irfan, Jiang Lu, and Lezama José. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023. [Google Scholar]
[12].Hu Liangxiao, Zhang Hongwen, Zhang Yuxiang, Zhou Boyao, Liu Boning, Zhang Shengping, and Nie Liqiang. Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians. arXiv preprint arXiv:2312.02134, 2023. [Google Scholar]
[13].Hu Shoukang, Hong Fangzhou, Pan Liang, Mei Haiyi, Yang Lei, and Liu Ziwei. Sherf: Generalizable human nerf from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9352–9364, 2023. [Google Scholar]
[14].Hu Shoukang and Liu Ziwei. Gauhuman: Articulated gaussian splatting from monocular human videos. arXiv preprint arXiv:2312.02973, 2023. [Google Scholar]
[15].Huang Buzhen, Shu Yuan, Ju Jingyi, and Wang Yangang. Occluded human body capture with self-supervised spatial-temporal motion prior. arXiv preprint arXiv:2207.05375, 2022. [Google Scholar]
[16].Huang Buzhen, Zhang Tianshu, and Wang Yangang. Object-occluded human shape and pose estimation with probabilistic latent consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. [DOI] [PubMed] [Google Scholar]
[17].Jiang Boyi, Hong Yang, Bao Hujun, and Zhang Juyong. Selfrecon: Self reconstruction your digital avatar from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5605–5615, 2022. [Google Scholar]
[18].Jiang Tianjian, Chen Xu, Song Jie, and Hilliges Otmar. Instantavatar: Learning avatars from monocular video in 60 seconds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16922–16932, 2023. [Google Scholar]
[19].Jiang Wei, Yi Kwang Moo, Samei Golnoosh, Tuzel Oncel, and Ranjan Anurag. Neuman: Neural human radiance field from a single video. In European Conference on Computer Vision, pages 402–418. Springer, 2022. [Google Scholar]
[20].Jung HyunJun, Brasch Nikolas, Song Jifei, Perez-Pellitero Eduardo, Zhou Yiren, Li Zhihao, Navab Nassir, and Busam Benjamin. Deformable 3d gaussian splatting for animatable human avatars. arXiv preprint arXiv:2312.15059, 2023. [Google Scholar]
[21].Karnewar Animesh, Mitra Niloy J, Vedaldi Andrea, and Novotny David. Holofusion: Towards photo-realistic 3d generative modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22976–22985, 2023. [Google Scholar]
[22].Ke Bingxin, Obukhov Anton, Huang Shengyu, Metzger Nando, Daudt Rodrigo Caye, and Schindler Konrad. Repurposing diffusion-based image generators for monocular depth estimation. arXiv preprint arXiv:2312.02145, 2023. [Google Scholar]
[23].Ke Lei, Ye Mingqiao, Danelljan Martin, Tai Yu-Wing, Tang Chi-Keung, Yu Fisher, et al. Segment anything in high quality. Advances in Neural Information Processing Systems, 36, 2024. [Google Scholar]
[24].Kerbl Bernhard, Kopanas Georgios, Leimkühler Thomas, and Drettakis George. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023. [Google Scholar]
[25].Kirillov Alexander, Mintun Eric, Ravi Nikhila, Mao Hanzi, Rolland Chloe, Gustafson Laura, Xiao Tete, Whitehead Spencer, Berg Alexander C, Lo Wan-Yen, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023. [Google Scholar]
[26].Kocabas Muhammed, Chang Jen-Hao Rick, Gabriel James, Tuzel Oncel, and Ranjan Anurag. Hugs: Human gaussian splats. arXiv preprint arXiv:2311.17910, 2023. [Google Scholar]
[27].Kwon Youngjoong, Kim Dahun, Ceylan Duygu, and Fuchs Henry. Neural human performer: Learning generalizable radiance fields for human performance rendering. Advances in Neural Information Processing Systems, 34:24741–24752, 2021. [Google Scholar]
[28].Li Mengtian, Yao Shengxiang, Xie Zhifeng, Chen Keyu, and Jiang Yu-Gang. Gaussianbody: Clothed human reconstruction via 3d gaussian splatting. arXiv preprint arXiv:2401.09720, 2024. [Google Scholar]
[29].Li Mingwei, Tao Jiachen, Yang Zongxin, and Yang Yi. Human101: Training 100+ fps human gaussians in 100s from 1 view. arXiv preprint arXiv:2312.15258, 2023. [Google Scholar]
[30].Lin Chen-Hsuan, Gao Jun, Tang Luming, Takikawa Towaki, Zeng Xiaohui, Huang Xun, Kreis Karsten, Fidler Sanja, Liu Ming-Yu, and Lin Tsung-Yi. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023. [Google Scholar]
[31].Lin Yuanze, Clark Ronald, and Torr Philip. Dreampolisher: Towards high-quality text-to-3d generation via geometric diffusion. arXiv preprint arXiv:2403.17237, 2024. [Google Scholar]
[32].Liu Ruoshi, Wu Rundi, Van Hoorick Basile, Tokmakov Pavel, Zakharov Sergey, and Vondrick Carl. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023. [Google Scholar]
[33].Liu Yang, Huang Xiang, Qin Minghan, Lin Qinwei, and Wang Haoqian. Animatable 3d gaussian: Fast and high-quality reconstruction of multiple human avatars. arXiv preprint arXiv:2311.16482, 2023. [Google Scholar]
[34].Liu Yuan, Lin Cheng, Zeng Zijiao, Long Xiaoxiao, Liu Lingjie, Komura Taku, and Wang Wenping. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023. [Google Scholar]
[35].Loper Matthew, Mahmood Naureen, Romero Javier, Pons-Moll Gerard, and Black Michael J. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023. [Google Scholar]
[36].Martin-Brualla Ricardo, Radwan Noha, Sajjadi Mehdi SM, Barron Jonathan T, Dosovitskiy Alexey, and Duckworth Daniel. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7210–7219, 2021. [Google Scholar]
[37].Mildenhall Ben, Srinivasan Pratul P, Tancik Matthew, Barron Jonathan T, Ramamoorthi Ravi, and Ng Ren. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. [Google Scholar]
[38].Moreau Arthur, Song Jifei, Dhamo Helisa, Shaw Richard, Zhou Yiren, and Pérez-Pellitero Eduardo. Human gaussian splatting: Real-time rendering of animatable avatars. arXiv preprint arXiv:2311.17113, 2023. [Google Scholar]
[39].Mou Chong, Wang Xintao, Xie Liangbin, Wu Yanze, Zhang Jian, Qi Zhongang, and Shan Ying. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4296–4304, 2024. [Google Scholar]
[40].Müller Thomas, Evans Alex, Schied Christoph, and Keller Alexander. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022. [Google Scholar]
[41].Pan Xiao, Yang Zongxin, Ma Jianxin, Zhou Chang, and Yang Yi. Transhuman: A transformer-based human representation for generalizable neural human rendering. In Proceedings of the IEEE/CVF International conference on computer vision, pages 3544–3555, 2023. [Google Scholar]
[42].Pang Haokai, Zhu Heming, Kortylewski Adam, Theobalt Christian, and Habermann Marc. Ash: Animatable gaussian splats for efficient and photoreal human rendering. arXiv preprint arXiv:2312.05941, 2023. [Google Scholar]
[43].Peng Bo, Hu Jun, Zhou Jingtao, Gao Xuan, and Zhang Juyong. Intrinsicngp: Intrinsic coordinate based hash encoding for human nerf. IEEE Transactions on Visualization and Computer Graphics, 2023. [DOI] [PubMed] [Google Scholar]
[44].Peng Sida, Zhang Yuanqing, Xu Yinghao, Wang Qianqian, Shuai Qing, Bao Hujun, and Zhou Xiaowei. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9054–9063, 2021. [Google Scholar]
[45].Poole Ben, Jain Ajay, Barron Jonathan T, and Mildenhall Ben. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022. [Google Scholar]
[46].Qian Zhiyin, Wang Shaofei, Mihajlovic Marko, Geiger Andreas, and Tang Siyu. 3dgs-avatar: Animatable avatars via deformable 3d gaussian splatting. arXiv preprint arXiv:2312.09228, 2023. [Google Scholar]
[47].Ren Weining, Zhu Zihan, Sun Boyang, Chen Jiaqi, Pollefeys Marc, and Peng Songyou. Nerf on-the-go: Exploiting uncertainty for distractor-free nerfs in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8931–8940, 2024. [Google Scholar]
[48].Rombach Robin, Blattmann Andreas, Lorenz Dominik, Esser Patrick, and Ommer Björn. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. [Google Scholar]
[49].Shao Ruizhi, Zheng Zerong, Zhang Hongwen, Sun Jingxiang, and Liu Yebin. Diffustereo: High quality human reconstruction via diffusion-based stereo using sparse cameras. In European Conference on Computer Vision, pages 702–720. Springer, 2022. [Google Scholar]
[50].Su Zhuo, Xu Lan, Zheng Zerong, Yu Tao, Liu Yebin, and Fang Lu. Robustfusion: Human volumetric capture with data-driven visual cues using a rgbd camera. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 246–264. Springer, 2020. [Google Scholar]
[51].Sun Guoxing, Chen Xin, Chen Yizhang, Pang Anqi, Lin Pei, Jiang Yuheng, Xu Lan, Yu Jingyi, and Wang Jingya. Neural free-viewpoint performance rendering under complex human-object interactions. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, pages 4651–4660, New York, NY, USA, 2021. Association for Computing Machinery. [Google Scholar]
[52].Sun Wenzhang, Che Yunlong, Huang Han, and Guo Yandong. Neural reconstruction of relightable human model from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 397–407, 2023. [Google Scholar]
[53].Tang Jiaxiang, Ren Jiawei, Zhou Hang, Liu Ziwei, and Zeng Gang. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023. [Google Scholar]
[54].Wang Guangcong, Chen Zhaoxi, Loy Chen Change, and Liu Ziwei. Sparsenerf: Distilling depth ranking for few-shot novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9065–9076, 2023. [Google Scholar]
[55].Wang Haochen, Du Xiaodan, Li Jiahao, Yeh Raymond A, and Shakhnarovich Greg. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12619–12629, 2023. [Google Scholar]
[56].Wang Zhou, Bovik Alan C, Sheikh Hamid R, and Simoncelli Eero P. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. [DOI] [PubMed] [Google Scholar]
[57].Weng Chung-Yi, Curless Brian, Srinivasan Pratul P, Barron Jonathan T, and Kemelmacher-Shlizerman Ira. Humannerf: Free-viewpoint rendering of moving people from monocular video. In Proceedings of the IEEE/CVF conference on computer vision and pattern Recognition, pages 16210–16220, 2022. [Google Scholar]
[58].Weng Zhenzhen, Liu Jingyuan, Tan Hao, Xu Zhan, Zhou Yang, Yeung-Levy Serena, and Yang Jimei. Single-view 3d human digitalization with large reconstruction models. arXiv preprint arXiv:2401.12175, 2024. [Google Scholar]
[59].Weng Zhenzhen, Wang Zeyu, and Yeung Serena. Zeroavatar: Zero-shot 3d avatar generation from a single image. arXiv preprint arXiv:2305.16411, 2023. [Google Scholar]
[60].Wu Jay Zhangjie, Ge Yixiao, Wang Xintao, Lei Stan Weixian, Gu Yuchao, Shi Yufei, Hsu Wynne, Shan Ying, Qie Xiaohu, and Shou Mike Zheng. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023. [Google Scholar]
[61].Wu Rundi, Mildenhall Ben, Henzler Philipp, Park Keunhong, Gao Ruiqi, Watson Daniel, Srinivasan Pratul P, Verbin Dor, Barron Jonathan T, Poole Ben, et al. Reconfusion: 3d reconstruction with diffusion priors. arXiv preprint arXiv:2312.02981, 2023. [Google Scholar]
[62].Wynn Jamie and Turmukhambetov Daniyar. Diffusionerf: Regularizing neural radiance fields with denoising diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4180–4189, 2023. [Google Scholar]
[63].Xiang Tiange, Sun Adam, Delp Scott, Kozuka Kazuki, Fei-Fei Li, and Adeli Ehsan. Wild2avatar: Rendering humans behind occlusions. arXiv preprint arXiv:2401.00431, 2023. [Google Scholar]
[64].Xiang Tiange, Sun Adam, Wu Jiajun, Adeli Ehsan, and Fei-Fei Li. Rendering humans from object-occluded monocular videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3239–3250, 2023. [Google Scholar]
[65].Xu Jiale, Cheng Weihao, Gao Yiming, Wang Xintao, Gao Shenghua, and Shan Ying. Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191, 2024. [Google Scholar]
[66].Yang Xiaofeng, Chen Yiwen, Chen Cheng, Zhang Chi, Xu Yi, Yang Xulei, Liu Fayao, and Lin Guosheng. Learn to optimize denoising scores for 3d generation: A unified and improved diffusion prior on nerf and 3d gaussian splatting. arXiv preprint arXiv:2312.04820, 2023. [Google Scholar]
[67].Ye Jingrui, Zhang Zongkai, Jiang Yujiao, Liao Qingmin, Yang Wenming, and Lu Zongqing. Occgaussian: 3d gaussian splatting for occluded human rendering. arXiv preprint arXiv:2404.08449, 2024. [Google Scholar]
[68].Ye Keyang, Shao Tianjia, and Zhou Kun. Animatable 3d gaussians for high-fidelity synthesis of human motions. arXiv preprint arXiv:2311.13404, 2023. [Google Scholar]
[69].Ye Vickie, Pavlakos Georgios, Malik Jitendra, and Kanazawa Angjoo. Decoupling human and camera motion from videos in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21222–21232, 2023. [Google Scholar]
[70].Yi Taoran, Fang Jiemin, Wang Junjie, Wu Guanjun, Xie Lingxi, Zhang Xiaopeng, Liu Wenyu, Tian Qi, and Wang Xinggang. Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models, 2024.
[71].Yu Tao, Zheng Zerong, Guo Kaiwen, Liu Pengpeng, Dai Qionghai, and Liu Yebin. Function4d: Real-time human volumetric capture from very sparse consumer rgbd sensors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5746–5756, 2021. [Google Scholar]
[72].Yu Zhengming, Cheng Wei, Liu Xian, Wu Wayne, and Lin Kwan-Yee. Monohuman: Animatable human neural field from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16943–16953, 2023. [Google Scholar]
[73].Yu Zhongrui, Wang Haoran, Yang Jinze, Wang Hanzhang, Xie Zeke, Cai Yunfeng, Cao Jiale, Ji Zhong, and Sun Mingming. Sgd: Street view synthesis with gaussian splatting and diffusion prior. arXiv preprint arXiv:2403.20079, 2024. [Google Scholar]
[74].Yuan Ye, Li Xueting, Huang Yangyi, De Mello Shalini, Nagano Koki, Kautz Jan, and Iqbal Umar. Gavatar: Animatable 3d gaussian avatars with implicit mesh learning. arXiv preprint arXiv:2312.11461, 2023. [Google Scholar]
[75].Zhang Dongbin, Wang Chuming, Wang Weitao, Li Peihao, Qin Minghan, and Wang Haoqian. Gaussian in the wild: 3d gaussian splatting for unconstrained image collections. arXiv preprint arXiv:2403.15704, 2024. [Google Scholar]
[76].Zhang Lvmin, Rao Anyi, and Agrawala Maneesh. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023. [Google Scholar]
[77].Zhang Richard, Isola Phillip, Efros Alexei A, Shechtman Eli, and Wang Oliver. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. [Google Scholar]
[78].Zhang Tingyang, Gao Qingzhe, Li Weiyu, Liu Libin, and Chen Baoquan. Bags: Building animatable gaussian splatting from a monocular video with diffusion priors. arXiv preprint arXiv:2403.11427, 2024. [Google Scholar]
[79].Zhou Zhizhuo and Tulsiani Shubham. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12588–12597, 2023. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement

NIHMS2086915-supplement-Supplement.pdf^{(3.2MB, pdf)}

[R1] [1].Brown Tom, Mann Benjamin, Ryder Nick, Subbiah Melanie, Kaplan Jared D, Dhariwal Prafulla, Neelakantan Arvind, Shyam Pranav, Sastry Girish, Askell Amanda, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. [Google Scholar]

[R2] [2].Chen Mingfei, Zhang Jianfeng, Xu Xiangyu, Liu Lijuan, Cai Yujun, Feng Jiashi, and Yan Shuicheng. Geometry-guided progressive nerf for generalizable and efficient neural human rendering. In European Conference on Computer Vision, pages 222–239. Springer, 2022. [Google Scholar]

[R3] [3].Chen Xingyu, Zhang Qi, Li Xiaoyu, Chen Yue, Feng Ying, Wang Xuan, and Wang Jue. Hallucinated neural radiance fields in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12943–12952, 2022. [Google Scholar]

[R4] [4].Collet Alvaro, Chuang Ming, Sweeney Pat, Gillett Don, Evseev Dennis, Calabrese David, Hoppe Hugues, Kirk Adam, and Sullivan Steve. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (ToG), 34(4):1–13, 2015. [Google Scholar]

[R5] [5].Dou Mingsong, Khamis Sameh, Degtyarev Yury, Davidson Philip, Fanello Sean Ryan, Kowdle Adarsh, Escolano Sergio Orts, Rhemann Christoph, Kim David, Taylor Jonathan, et al. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (ToG), 35(4):1–13, 2016. [Google Scholar]

[R6] [6].Geng Chen, Peng Sida, Xu Zhen, Bao Hujun, and Zhou Xiaowei. Learning neural volumetric representations of dynamic humans in minutes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8759–8770, 2023. [Google Scholar]

[R7] [7].Girdhar Rohit, Singh Mannat, Brown Andrew, Duval Quentin, Azadi Samaneh, Rambhatla Sai Saketh, Shah Akbar, Yin Xi, Parikh Devi, and Misra Ishan. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023. [Google Scholar]

[R8] [8].Goel Shubham, Pavlakos Georgios, Rajasegaran Jathushan, Kanazawa Angjoo, and Malik Jitendra. Humans in 4d: Reconstructing and tracking humans with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14783–14794, 2023. [Google Scholar]

[9] Guo Chen, Jiang Tianjian, Chen Xu, Song Jie, and Hilliges Otmar. Vid2avatar: 3d avatar reconstruction from videos in the wild via self-supervised scene decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12858–12868, 2023.

[10] Guo Kaiwen, Lincoln Peter, Davidson Philip, Busch Jay, Yu Xueming, Whalen Matt, Harvey Geoff, Orts-Escolano Sergio, Pandey Rohit, Dourgarian Jason, et al. The relightables: Volumetric performance capture of humans with realistic relighting. ACM Transactions on Graphics (ToG), 38(6):1–19, 2019.

[11] Gupta Agrim, Yu Lijun, Sohn Kihyuk, Gu Xiuye, Hahn Meera, Fei-Fei Li, Essa Irfan, Jiang Lu, and Lezama José. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023.

[12] Hu Liangxiao, Zhang Hongwen, Zhang Yuxiang, Zhou Boyao, Liu Boning, Zhang Shengping, and Nie Liqiang. Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians. arXiv preprint arXiv:2312.02134, 2023.

[13] Hu Shoukang, Hong Fangzhou, Pan Liang, Mei Haiyi, Yang Lei, and Liu Ziwei. Sherf: Generalizable human nerf from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9352–9364, 2023.

[14] Hu Shoukang and Liu Ziwei. Gauhuman: Articulated gaussian splatting from monocular human videos. arXiv preprint arXiv:2312.02973, 2023.

[15] Huang Buzhen, Shu Yuan, Ju Jingyi, and Wang Yangang. Occluded human body capture with self-supervised spatial-temporal motion prior. arXiv preprint arXiv:2207.05375, 2022.

[16] Huang Buzhen, Zhang Tianshu, and Wang Yangang. Object-occluded human shape and pose estimation with probabilistic latent consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

[17] Jiang Boyi, Hong Yang, Bao Hujun, and Zhang Juyong. Selfrecon: Self reconstruction your digital avatar from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5605–5615, 2022.

[18] Jiang Tianjian, Chen Xu, Song Jie, and Hilliges Otmar. Instantavatar: Learning avatars from monocular video in 60 seconds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16922–16932, 2023.

[19] Jiang Wei, Yi Kwang Moo, Samei Golnoosh, Tuzel Oncel, and Ranjan Anurag. Neuman: Neural human radiance field from a single video. In European Conference on Computer Vision, pages 402–418. Springer, 2022.

[20] Jung HyunJun, Brasch Nikolas, Song Jifei, Perez-Pellitero Eduardo, Zhou Yiren, Li Zhihao, Navab Nassir, and Busam Benjamin. Deformable 3d gaussian splatting for animatable human avatars. arXiv preprint arXiv:2312.15059, 2023.

[21] Karnewar Animesh, Mitra Niloy J, Vedaldi Andrea, and Novotny David. Holofusion: Towards photo-realistic 3d generative modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22976–22985, 2023.

[22] Ke Bingxin, Obukhov Anton, Huang Shengyu, Metzger Nando, Daudt Rodrigo Caye, and Schindler Konrad. Repurposing diffusion-based image generators for monocular depth estimation. arXiv preprint arXiv:2312.02145, 2023.

[23] Ke Lei, Ye Mingqiao, Danelljan Martin, Tai Yu-Wing, Tang Chi-Keung, Yu Fisher, et al. Segment anything in high quality. Advances in Neural Information Processing Systems, 36, 2024.

[24] Kerbl Bernhard, Kopanas Georgios, Leimkühler Thomas, and Drettakis George. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.

[25] Kirillov Alexander, Mintun Eric, Ravi Nikhila, Mao Hanzi, Rolland Chloe, Gustafson Laura, Xiao Tete, Whitehead Spencer, Berg Alexander C, Lo Wan-Yen, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

[26] Kocabas Muhammed, Chang Jen-Hao Rick, Gabriel James, Tuzel Oncel, and Ranjan Anurag. Hugs: Human gaussian splats. arXiv preprint arXiv:2311.17910, 2023.

[27] Kwon Youngjoong, Kim Dahun, Ceylan Duygu, and Fuchs Henry. Neural human performer: Learning generalizable radiance fields for human performance rendering. Advances in Neural Information Processing Systems, 34:24741–24752, 2021.

[28] Li Mengtian, Yao Shengxiang, Xie Zhifeng, Chen Keyu, and Jiang Yu-Gang. Gaussianbody: Clothed human reconstruction via 3d gaussian splatting. arXiv preprint arXiv:2401.09720, 2024.

[29] Li Mingwei, Tao Jiachen, Yang Zongxin, and Yang Yi. Human101: Training 100+ fps human gaussians in 100s from 1 view. arXiv preprint arXiv:2312.15258, 2023.

[30] Lin Chen-Hsuan, Gao Jun, Tang Luming, Takikawa Towaki, Zeng Xiaohui, Huang Xun, Kreis Karsten, Fidler Sanja, Liu Ming-Yu, and Lin Tsung-Yi. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.

[31] Lin Yuanze, Clark Ronald, and Torr Philip. Dreampolisher: Towards high-quality text-to-3d generation via geometric diffusion. arXiv preprint arXiv:2403.17237, 2024.

[32] Liu Ruoshi, Wu Rundi, Van Hoorick Basile, Tokmakov Pavel, Zakharov Sergey, and Vondrick Carl. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.

[33] Liu Yang, Huang Xiang, Qin Minghan, Lin Qinwei, and Wang Haoqian. Animatable 3d gaussian: Fast and high-quality reconstruction of multiple human avatars. arXiv preprint arXiv:2311.16482, 2023.

[34] Liu Yuan, Lin Cheng, Zeng Zijiao, Long Xiaoxiao, Liu Lingjie, Komura Taku, and Wang Wenping. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.

[35] Loper Matthew, Mahmood Naureen, Romero Javier, Pons-Moll Gerard, and Black Michael J. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023.

[36] Martin-Brualla Ricardo, Radwan Noha, Sajjadi Mehdi SM, Barron Jonathan T, Dosovitskiy Alexey, and Duckworth Daniel. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7210–7219, 2021.

[37] Mildenhall Ben, Srinivasan Pratul P, Tancik Matthew, Barron Jonathan T, Ramamoorthi Ravi, and Ng Ren. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

[38] Moreau Arthur, Song Jifei, Dhamo Helisa, Shaw Richard, Zhou Yiren, and Pérez-Pellitero Eduardo. Human gaussian splatting: Real-time rendering of animatable avatars. arXiv preprint arXiv:2311.17113, 2023.

[39] Mou Chong, Wang Xintao, Xie Liangbin, Wu Yanze, Zhang Jian, Qi Zhongang, and Shan Ying. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4296–4304, 2024.

[40] Müller Thomas, Evans Alex, Schied Christoph, and Keller Alexander. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (TOG), 41(4):1–15, 2022.

[41] Pan Xiao, Yang Zongxin, Ma Jianxin, Zhou Chang, and Yang Yi. Transhuman: A transformer-based human representation for generalizable neural human rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3544–3555, 2023.

[42] Pang Haokai, Zhu Heming, Kortylewski Adam, Theobalt Christian, and Habermann Marc. Ash: Animatable gaussian splats for efficient and photoreal human rendering. arXiv preprint arXiv:2312.05941, 2023.

[43] Peng Bo, Hu Jun, Zhou Jingtao, Gao Xuan, and Zhang Juyong. Intrinsicngp: Intrinsic coordinate based hash encoding for human nerf. IEEE Transactions on Visualization and Computer Graphics, 2023.

[44] Peng Sida, Zhang Yuanqing, Xu Yinghao, Wang Qianqian, Shuai Qing, Bao Hujun, and Zhou Xiaowei. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9054–9063, 2021.

[45] Poole Ben, Jain Ajay, Barron Jonathan T, and Mildenhall Ben. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.

[46] Qian Zhiyin, Wang Shaofei, Mihajlovic Marko, Geiger Andreas, and Tang Siyu. 3dgs-avatar: Animatable avatars via deformable 3d gaussian splatting. arXiv preprint arXiv:2312.09228, 2023.

[47] Ren Weining, Zhu Zihan, Sun Boyang, Chen Jiaqi, Pollefeys Marc, and Peng Songyou. Nerf on-the-go: Exploiting uncertainty for distractor-free nerfs in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8931–8940, 2024.

[48] Rombach Robin, Blattmann Andreas, Lorenz Dominik, Esser Patrick, and Ommer Björn. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[49] Shao Ruizhi, Zheng Zerong, Zhang Hongwen, Sun Jingxiang, and Liu Yebin. Diffustereo: High quality human reconstruction via diffusion-based stereo using sparse cameras. In European Conference on Computer Vision, pages 702–720. Springer, 2022.

[50] Su Zhuo, Xu Lan, Zheng Zerong, Yu Tao, Liu Yebin, and Fang Lu. Robustfusion: Human volumetric capture with data-driven visual cues using a rgbd camera. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 246–264. Springer, 2020.

[51] Sun Guoxing, Chen Xin, Chen Yizhang, Pang Anqi, Lin Pei, Jiang Yuheng, Xu Lan, Yu Jingyi, and Wang Jingya. Neural free-viewpoint performance rendering under complex human-object interactions. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, pages 4651–4660, New York, NY, USA, 2021. Association for Computing Machinery.

[52] Sun Wenzhang, Che Yunlong, Huang Han, and Guo Yandong. Neural reconstruction of relightable human model from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 397–407, 2023.

[53] Tang Jiaxiang, Ren Jiawei, Zhou Hang, Liu Ziwei, and Zeng Gang. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.

[54] Wang Guangcong, Chen Zhaoxi, Loy Chen Change, and Liu Ziwei. Sparsenerf: Distilling depth ranking for few-shot novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9065–9076, 2023.

[55] Wang Haochen, Du Xiaodan, Li Jiahao, Yeh Raymond A, and Shakhnarovich Greg. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12619–12629, 2023.

[56] Wang Zhou, Bovik Alan C, Sheikh Hamid R, and Simoncelli Eero P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[57] Weng Chung-Yi, Curless Brian, Srinivasan Pratul P, Barron Jonathan T, and Kemelmacher-Shlizerman Ira. Humannerf: Free-viewpoint rendering of moving people from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16210–16220, 2022.

[58] Weng Zhenzhen, Liu Jingyuan, Tan Hao, Xu Zhan, Zhou Yang, Yeung-Levy Serena, and Yang Jimei. Single-view 3d human digitalization with large reconstruction models. arXiv preprint arXiv:2401.12175, 2024.

[59] Weng Zhenzhen, Wang Zeyu, and Yeung Serena. Zeroavatar: Zero-shot 3d avatar generation from a single image. arXiv preprint arXiv:2305.16411, 2023.

[60] Wu Jay Zhangjie, Ge Yixiao, Wang Xintao, Lei Stan Weixian, Gu Yuchao, Shi Yufei, Hsu Wynne, Shan Ying, Qie Xiaohu, and Shou Mike Zheng. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.

[61] Wu Rundi, Mildenhall Ben, Henzler Philipp, Park Keunhong, Gao Ruiqi, Watson Daniel, Srinivasan Pratul P, Verbin Dor, Barron Jonathan T, Poole Ben, et al. Reconfusion: 3d reconstruction with diffusion priors. arXiv preprint arXiv:2312.02981, 2023.

[62] Wynn Jamie and Turmukhambetov Daniyar. Diffusionerf: Regularizing neural radiance fields with denoising diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4180–4189, 2023.

[63] Xiang Tiange, Sun Adam, Delp Scott, Kozuka Kazuki, Fei-Fei Li, and Adeli Ehsan. Wild2avatar: Rendering humans behind occlusions. arXiv preprint arXiv:2401.00431, 2023.

[64] Xiang Tiange, Sun Adam, Wu Jiajun, Adeli Ehsan, and Fei-Fei Li. Rendering humans from object-occluded monocular videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3239–3250, 2023.

[65] Xu Jiale, Cheng Weihao, Gao Yiming, Wang Xintao, Gao Shenghua, and Shan Ying. Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191, 2024.

[66] Yang Xiaofeng, Chen Yiwen, Chen Cheng, Zhang Chi, Xu Yi, Yang Xulei, Liu Fayao, and Lin Guosheng. Learn to optimize denoising scores for 3d generation: A unified and improved diffusion prior on nerf and 3d gaussian splatting. arXiv preprint arXiv:2312.04820, 2023.

[67] Ye Jingrui, Zhang Zongkai, Jiang Yujiao, Liao Qingmin, Yang Wenming, and Lu Zongqing. Occgaussian: 3d gaussian splatting for occluded human rendering. arXiv preprint arXiv:2404.08449, 2024.

[68] Ye Keyang, Shao Tianjia, and Zhou Kun. Animatable 3d gaussians for high-fidelity synthesis of human motions. arXiv preprint arXiv:2311.13404, 2023.

[69] Ye Vickie, Pavlakos Georgios, Malik Jitendra, and Kanazawa Angjoo. Decoupling human and camera motion from videos in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21222–21232, 2023.

[70] Yi Taoran, Fang Jiemin, Wang Junjie, Wu Guanjun, Xie Lingxi, Zhang Xiaopeng, Liu Wenyu, Tian Qi, and Wang Xinggang. Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models, 2024.

[71] Yu Tao, Zheng Zerong, Guo Kaiwen, Liu Pengpeng, Dai Qionghai, and Liu Yebin. Function4d: Real-time human volumetric capture from very sparse consumer rgbd sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5746–5756, 2021.

[72] Yu Zhengming, Cheng Wei, Liu Xian, Wu Wayne, and Lin Kwan-Yee. Monohuman: Animatable human neural field from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16943–16953, 2023.

[73] Yu Zhongrui, Wang Haoran, Yang Jinze, Wang Hanzhang, Xie Zeke, Cai Yunfeng, Cao Jiale, Ji Zhong, and Sun Mingming. Sgd: Street view synthesis with gaussian splatting and diffusion prior. arXiv preprint arXiv:2403.20079, 2024.

[74] Yuan Ye, Li Xueting, Huang Yangyi, De Mello Shalini, Nagano Koki, Kautz Jan, and Iqbal Umar. Gavatar: Animatable 3d gaussian avatars with implicit mesh learning. arXiv preprint arXiv:2312.11461, 2023.

[75] Zhang Dongbin, Wang Chuming, Wang Weitao, Li Peihao, Qin Minghan, and Wang Haoqian. Gaussian in the wild: 3d gaussian splatting for unconstrained image collections. arXiv preprint arXiv:2403.15704, 2024.

[76] Zhang Lvmin, Rao Anyi, and Agrawala Maneesh. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

[77] Zhang Richard, Isola Phillip, Efros Alexei A, Shechtman Eli, and Wang Oliver. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[78] Zhang Tingyang, Gao Qingzhe, Li Weiyu, Liu Libin, and Chen Baoquan. Bags: Building animatable gaussian splatting from a monocular video with diffusion priors. arXiv preprint arXiv:2403.11427, 2024.

[79] Zhou Zhizhuo and Tulsiani Shubham. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12588–12597, 2023.
