NSFF Quality on Custom Dataset #18

Open
benattal opened this issue May 25, 2021 · 17 comments
@benattal

benattal commented May 25, 2021

Hi all. I'm trying to run this method on custom data, with mixed success so far. I was wondering if you had any insight about what might be happening. I'm attaching some videos and images to facilitate discussion.

  1. First of all, we're able to run static NeRF using poses from COLMAP, and it seems to do fine.
01_nerf_result.mp4
  2. Likewise, setting the dynamic blending weight to 0 in your model, and using only the color reconstruction loss, produces plausible results (novel view synthesis result below, for fixed time).

02_nsff_static_only

  3. Using the dynamic model while setting all the frame indices to 0 should also emulate a static NeRF. It does alright, but includes some strange haze.

03_nsff_frame_zero

  4. Finally, running NSFF on our full video sequence with all losses for 130k iterations produces a lot of ghosting (04_nsff_result.mp4).
04_nsff_result.mp4
Even though the data-driven monocular depth / flow losses are phased out during training, I wonder if monocular depth is perhaps causing these issues? Although, again, both the monocular depth and flow look reasonable.

05_depth
06_flow

Let me know if you have any insights about what might be going on, and how we can improve quality here -- I'm a bit stumped at the moment. I'm also happy to send you the short dataset / video sequence that we're using if you'd like to take a look.

All the best,
~Ben

@zhengqili
Owner

zhengqili commented May 25, 2021

Hi,

Could you share the input video sequence?

I recently found that for the dynamic model, if the camera ego-motion is small (i.e., the camera baseline between consecutive video frames is small), our local representation sometimes has difficulty reconstructing the scene well due to our neighborhood local temporal warping (it usually works better if the camera is moving fast, as in the examples we show).

@benattal
Author

Thanks for the quick response, and for your insight about when the method struggles. I've uploaded the video sequence to google drive here: https://drive.google.com/file/d/1C7XHilFxdpc9pcgfMieo7BuH0Y3Y1qeR/view?usp=sharing

@kwea123

kwea123 commented May 25, 2021

@zhengqili Is there any theoretical reason why this happens, or is this a pure observation? In my opinion the camera baseline shouldn't matter if COLMAP still estimates the poses correctly. Or do you mean that a small baseline causes larger error for COLMAP (relative to the baseline)?

@zhengqili
Owner

zhengqili commented May 26, 2021

My thought is that a small camera baseline could lead to a degenerate solution for geometry reconstruction. In classical SfM, SLAM, or MVS, we have to choose input frames that contain enough motion parallax (either through keyframe selection or by subsampling the input frames from the original video) before performing triangulation; otherwise the triangulation solver can be ill-conditioned.

NeRF is very beautiful for a static scene because it has no such problem, thanks to its unified global 3D representation. But for our 4D approach, you can think of it as doing local triangulation in a small time window, so if the camera baseline is small, or the background is better modeled by a homography than by an essential matrix (which is actually the case in this video), the reconstruction can get stuck in a degenerate minimum.

@kwea123

kwea123 commented May 26, 2021

Got it, but in my opinion

the background is better modeled by a homography than by an essential matrix

this happens when there is no prior knowledge of the geometry.
However, here we have a monodepth estimate that seems reasonably accurate -- is that still not enough? In your opinion, what can we do to solve this issue, or is there nothing we can do? Like the other issue #15, it seems that we have to be very selective about the input video for NSFF to work.

@kwea123

kwea123 commented May 26, 2021

Yes, I also have the impression that the weight is decayed too fast, and at later epochs the depth becomes weird. What I have tried is to decay the weight very slowly; I find this helpful in my scenes, but I don't know if it generalizes to other scenes.
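For concreteness, a minimal sketch of what a slower decay could look like (the variable names here are illustrative, not the repo's actual ones):

```python
def data_term_weight(global_step, w_init=0.04, half_life=25000):
    # Exponential decay with a long half-life: the weight halves every
    # `half_life` iterations instead of being phased out quickly.
    return w_init * 0.5 ** (global_step / half_life)

# e.g. loss = rgb_loss + data_term_weight(global_step) * (depth_loss + flow_loss)
```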

@zhengqili
Owner

zhengqili commented May 26, 2021

Hi @breuckelen, I quickly tried this video. Since I have graduated and don't have a lot of GPUs at the moment, I only used 3-frame consistency by disabling chained scene flow during training (but the results should be very similar for view interpolation). I also subsampled the input frames by 1/2, since all the hyperparameters were validated on ~30-frame sequences.

If I use the default view interpolation camera path in my original github code, it does not look so bad (see the first video below), although it still contains some ghosting. However, if I switch to a larger viewpoint change, the ghosting becomes more severe. To investigate this, I tried rendering images from the dynamic model only (in the function "render_bullet_time", change "rgb_map_ref" to "rgb_map_ref_dy"), and the ghosting seems to disappear (see the second video).
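A rough sketch of that swap (the two key names come from this thread; the surrounding code is purely illustrative):

```python
import torch

def pick_render_output(ret: dict, dynamic_only: bool = False) -> torch.Tensor:
    # Select either the composited image ('rgb_map_ref') or the image rendered
    # from the dynamic model alone ('rgb_map_ref_dy').
    key = 'rgb_map_ref_dy' if dynamic_only else 'rgb_map_ref'
    return ret[key]

# Inside the bullet-time rendering loop (sketch):
# rgb = pick_render_output(ret, dynamic_only=True)  # reproduces the second video
```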

My feeling is that the blending weights sometimes do not interpolate very well for larger viewpoint change in this video.

moving-box-full.mp4
moving-box-dynamic.mp4

@benattal
Author

Thanks! To confirm, is the second video a rendering of the dynamic model only? Also, what is your N_rand (ray batch size) set to here? 1024?

@zhengqili
Owner

Yes. In the function "render_bullet_time()", you can change "rgb_map_ref" to "rgb_map_ref_dy" to render images from the dynamic model only. My N_rand is still 1024; num_extra_sample is set to 128 due to the limited GPU memory of my own machine :), but it should not make any difference in this case.

@zhengqili
Owner

zhengqili commented May 30, 2021

This is not relevant to the current implementation, but if you are interested in how to fix ghosting for the full model, I found some simple modifications that can help reduce it:

(1) Adding an entropy loss on the blending weight to the total loss:
entropy_loss = 1e-3 * torch.mean(- ret['raw_blend_w'] * torch.log(ret['raw_blend_w'] + 1e-8))

This loss encourages the blending weight to be either 0 or 1, which can help reduce the ghosting caused by learned semi-transparent blending weights.

(2) Conditioning the predicted time-invariant blending weights on the RGBA output from the dynamic (time-dependent) model. This helps the static model interpolate better in unseen regions during rendering. You need to modify the rigid_nerf class similar to the following:
In __init__:
self.w_linear = nn.Sequential(nn.Linear(W + 4, W), nn.ReLU(), nn.Linear(W, 1))
In the forward function:
blend_w = nn.functional.sigmoid(self.w_linear(torch.cat([input_rgba_dy, h], -1)))
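Putting the two modifications together, here is a small self-contained sketch (the tensor names raw_blend_w, input_rgba_dy, h and the hidden width W follow the snippets above; the rest is illustrative rather than a drop-in patch for the rigid_nerf class):

```python
import torch
import torch.nn as nn

W = 256  # hidden feature width of the static NeRF MLP (assumed)

# (2) blend-weight head conditioned on the dynamic model's RGBA in addition to
#     the static model's feature vector h
w_linear = nn.Sequential(nn.Linear(W + 4, W), nn.ReLU(), nn.Linear(W, 1))

def blend_weight(h, input_rgba_dy):
    # h: [..., W] static features, input_rgba_dy: [..., 4] dynamic RGBA
    return torch.sigmoid(w_linear(torch.cat([input_rgba_dy, h], -1)))

def blend_entropy_loss(raw_blend_w, weight=1e-3):
    # (1) pushes the blending weight toward 0 or 1, discouraging the
    #     semi-transparent blends that cause ghosting
    return weight * torch.mean(-raw_blend_w * torch.log(raw_blend_w + 1e-8))
```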

The rendering results (after training for 150K iterations) shown below are much better. I haven't tried these modifications on a lot of videos, but they are worth trying if you see ghosting effects.

moving_box_bt-15.mp4
moving_box_slowmo-bt.mp4

@zhangchi3

If I use the default view interpolation camera path in my original github code

Hi, what do you mean by "the default view interpolation camera path", and how do I modify the viewpoint change? Could you please give some instructions on the COLMAP SfM step? I read the project you referenced and followed its procedure, but I still didn't get the same camera parameters as you provide for the kid-running data.
Thanks a lot!

@benattal
Author

benattal commented Jun 1, 2021

I'm curious -- is the spiral meant to track the input camera motion in the second video? Or is the entire scene moving (and being reproduced in the dynamic network) due to inaccurate input poses from COLMAP?

@zhengqili
Owner

The output camera trajectory is a circular camera pose offset that I create, composed with input camera poses interpolated at the current fractional time indices; you can check how it works in the function render_slowmo_bt.
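Roughly, the idea is the following (a sketch with illustrative names, not the exact code in render_slowmo_bt):

```python
import numpy as np

def circular_offset(t, radius=0.1, period=60):
    # Small circular translation offset in the camera plane for frame index t.
    theta = 2.0 * np.pi * t / period
    offset = np.eye(4)
    offset[0, 3] = radius * np.cos(theta)
    offset[1, 3] = radius * np.sin(theta)
    return offset

def interpolate_pose(poses, frac_t):
    # Linear blend of neighboring input camera-to-world matrices; a more careful
    # version would slerp the rotations.
    i0 = int(np.floor(frac_t))
    i1 = min(i0 + 1, len(poses) - 1)
    a = frac_t - i0
    return (1.0 - a) * poses[i0] + a * poses[i1]

# render pose for fractional time t (sketch):
# c2w = circular_offset(t) @ interpolate_pose(input_poses, t)
```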

@kwea123

kwea123 commented Jun 10, 2021

@zhengqili For this scene I observe that sometimes the dynamic part gets explained by the viewing direction, so I'm trying to remove view dependency during training. Have you encountered this problem before? Theoretically, there is no way to distinguish dynamic effects from view-dependent ones in my opinion; for example, a shadow can be explained by both.

@zhengqili
Owner

zhengqili commented Jun 11, 2021

Yes. Shadows or other dynamic volumetric effects such as smoke can go either way. So if you don't care about modeling view-dependent effects in a dynamic region (in most cases they are indeed hard to model from a monocular camera), it's a good idea to turn off view-dependent conditioning.

@kwea123

kwea123 commented Jun 23, 2021

Hi, I tried this using my latest implementation, and it works as well as zhengqi's modifications.

box_sp20.mp4

I did not find

My thought is that a small camera baseline could lead to a degenerate solution for geometry reconstruction. In classical SfM, SLAM, or MVS, we have to choose input frames that contain enough motion parallax (either through keyframe selection or by subsampling the input frames from the original video) before performing triangulation; otherwise the triangulation solver can be ill-conditioned.

NeRF is very beautiful for a static scene because it has no such problem, thanks to its unified global 3D representation. But for our 4D approach, you can think of it as doing local triangulation in a small time window, so if the camera baseline is small, or the background is better modeled by a homography than by an essential matrix (which is actually the case in this video), the reconstruction can get stuck in a degenerate minimum.

to be a problem. I use all 60 frames to reconstruct the poses and to train.

To show you a better comparison, I do not use a blending weight and instead let the network learn how to separate static and dynamic objects. This is the background learnt by the network (quite reasonable: the real background, plus some parts of the body that only move a little):

box_sp20_bg.mp4
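Concretely, this kind of composition looks roughly like the following (an assumed formulation for illustration, not the exact code in my implementation): the static and dynamic densities are summed, and their colors are mixed in proportion to density, so there is no separate blending weight to learn.

```python
import torch

def composite_no_blend(sigma_s, rgb_s, sigma_d, rgb_d, deltas):
    # sigma_*: [N_rays, N_samples] densities, rgb_*: [N_rays, N_samples, 3] colors,
    # deltas: [N_rays, N_samples] distances between consecutive samples.
    sigma = sigma_s + sigma_d
    rgb = (sigma_s[..., None] * rgb_s + sigma_d[..., None] * rgb_d) / (sigma[..., None] + 1e-8)
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[:, :-1]
    weights = alpha * trans                              # standard volume rendering weights
    return torch.sum(weights[..., None] * rgb, dim=-2)   # composited color, [N_rays, 3]
```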

The advantage is that for static regions we know it is going to perform as well as a normal NeRF, so there is no ghosting in those regions, and only the dynamic part might be subject to artifacts. In NSFF, on the other hand, aside from the static network the final rendering also depends on a blending weight that we cannot control. Like @zhengqili said, maybe

My feeling is that the blending weights sometimes do not interpolate very well for larger viewpoint change in this video.

Time interpolation also looks good (there are artifacts around the face and the body, which actually moves but looks static). I'm wondering if I can use some prior to encourage the network to learn the whole body as dynamic...

box_fv20.mp4

@guanfuchen

@kwea123 Hello, can you share the related tutorial for using this project on our own custom video data?
