
How to use custom camera model and camera parameters using onnx inference #103

Closed
alpereninci opened this issue Jun 4, 2024 · 5 comments

Comments

@alpereninci

alpereninci commented Jun 4, 2024

I have a custom camera with an intrinsic calibration, and I have a problem with ONNX inference. Should I pass the "cam_model" parameter to the model, or is post-processing enough?

I think the post-processing in test_onnx.py is not complete.

I would like to run inference with the "metric3d_vit_small" model.

I see this in "do_test.py":

ori_focal = (intrinsic[0] + intrinsic[1]) / 2
canonical_focal = canonical_space['focal_length']
cano_label_scale_ratio = canonical_focal / ori_focal
..
rgb, _, pad, resize_label_scale_ratio = resize_for_input(rgb, forward_size, canonical_intrinsic, [ori_h, ori_w], 1.0)
label_scale_factor = cano_label_scale_ratio * resize_label_scale_ratio

And in the vit.raft5.small.py config file:

max_value = 200
# configs of the canonical space
data_basic=dict(
    canonical_space = dict(
        # img_size=(540, 960),
        focal_length=1000.0,
    ),
    depth_range=(0, 1),
    depth_normalize=(0.1, max_value),
    crop_size = (616, 1064),  # % 28 == 0
    clip_depth_range=(0.1, 200),
    vit_size=(616, 1064)
)

During ONNX inference, should I use canonical_space['focal_length'] = 1000 (from the config) and a normalize scale of 1 (also from the config)?
How should I use cx and cy? Are they important parameters?

Also, what should I do if I change the input resolution? The given input resolution is (H, W) = (616, 1064). What would happen, and what should I do, if I downsample the image to (308, 532)?
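
For a frame that is already at the (616, 1064) input size, my current understanding of the de-canonical scaling is roughly the following sketch (pred_depth here stands for the raw network output, which is my assumption about what the ONNX model returns):

fx, fy = intrinsic[0], intrinsic[1]            # my calibrated focal lengths, in pixels
ori_focal = (fx + fy) / 2
canonical_focal = 1000.0                       # canonical_space['focal_length'] from the config
cano_label_scale_ratio = canonical_focal / ori_focal

# the network predicts depth in the canonical (focal = 1000) space, so I assume
# the metric depth for my camera is recovered by inverting that ratio
pred_depth_metric = pred_depth / cano_label_scale_ratio   # = pred_depth * ori_focal / 1000.0

Is that the right way to combine the quoted pieces?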

@YvanYin
Owner

YvanYin commented Jun 6, 2024

@Owen-Liuyuxuan does ONNX support a custom camera?

@Owen-Liuyuxuan
Collaborator

@alpereninci cc: @YvanYin
For now, the ONNX scripts in this repo and the provided ONNX model do not directly support a custom camera, so we may have to compute the scale outside the ONNX computation.

In ros2_vision_inference, I demonstrate how to feed the camera matrix $P$ as an additional input to the ONNX model and get correctly scaled depth for any perspective camera.

I trimmed the projection and coordinate-transform code from ros2_vision_inference down to showcase the changes we need:

## Change the model export script
import torch

class Metric3DExportModel(torch.nn.Module):
    def __init__(self, meta_arch, is_export_rgb=True):
        super().__init__()
        self.meta_arch = meta_arch
        self.register_buffer('rgb_mean', torch.tensor([123.675, 116.28, 103.53]).view(1, 3, 1, 1).cuda())
        self.register_buffer('rgb_std', torch.tensor([58.395, 57.12, 57.375]).view(1, 3, 1, 1).cuda())
        self.input_size = (616, 1064)

    def normalize_image(self, image):
        image = image - self.rgb_mean
        image = image / self.rgb_std
        return image

    def forward(self, image, P):
        original_image = image.clone()
        image = self.normalize_image(image)
        with torch.no_grad():
            pred_depth, confidence, output_dict = self.meta_arch.inference({'input': image})
            canonical_to_real_scale = (P[:, 0, 0, None, None] + P[:, 1, 1, None, None]) / 2.0 / 1000.0  # 1000.0 is the focal length of the canonical camera
            print(canonical_to_real_scale.shape, pred_depth.shape)
            pred_depth = pred_depth * canonical_to_real_scale # now the depth is metric
        return pred_depth

## In testing
dummy_P = np.zeros([1, 3, 4], dtype=np.float32)
outputs = ort_session.run(None, {"image": dummy_image, "P": dummy_P})
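
With a real calibration, $P$ is simply filled with the pinhole parameters instead of zeros. A minimal usage sketch (the intrinsic values and the ONNX file name below are placeholders, not from the repo):

import numpy as np
import onnxruntime as ort

fx, fy, cx, cy = 1000.0, 1000.0, 532.0, 308.0              # placeholder intrinsics for the resized 1064x616 frame
P = np.array([[[fx, 0.0, cx, 0.0],
               [0.0, fy, cy, 0.0],
               [0.0, 0.0, 1.0, 0.0]]], dtype=np.float32)   # shape (1, 3, 4), same as dummy_P

ort_session = ort.InferenceSession("metric3d_vit_small.onnx")          # placeholder path
image = np.random.rand(1, 3, 616, 1064).astype(np.float32) * 255.0     # placeholder frame: NCHW RGB in 0-255, normalization happens inside the graph
pred_depth = ort_session.run(None, {"image": image, "P": P})[0]        # metric depth, thanks to the in-graph rescaling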

Any resize/reshape operations applied before reaching the (616, 1064) input size should be accompanied by corresponding changes in the camera matrix $P$.
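
For example, if the frame is simply resized to the network input without keeping the aspect ratio, a sketch of keeping $P$ consistent could look like this (function and variable names are mine; the repo's resize_for_input additionally keeps the aspect ratio and pads, which would need the same treatment):

import cv2
import numpy as np

def resize_image_and_P(image, P, target_hw=(616, 1064)):
    # rows 0 and 1 of a projection matrix scale with the pixel coordinates they produce,
    # so a resize by (sx, sy) multiplies fx/cx by sx and fy/cy by sy
    h, w = image.shape[:2]
    th, tw = target_hw
    resized = cv2.resize(image, (tw, th))
    P_new = P.copy().astype(np.float32)
    P_new[0, :] *= tw / w
    P_new[1, :] *= th / h
    return resized, P_new

Here P is a single 3x4 matrix; add the batch dimension afterwards.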

But if you are considering changing the input size of the network, I have not succeeded in doing that myself; I am afraid there could be errors inside the ViT network. @YvanYin any ideas on changing the network input size?

@alpereninci
Author

Thanks for your reply @Owen-Liuyuxuan. Actually, I am considering changing the input size of the network.

@Owen-Liuyuxuan
Collaborator

After slightly changing the input shape of the network, the ONNX model works (at least with no obvious errors). However, I am not sure about the generalization ability and the metric accuracy. I will give it a test.

@Owen-Liuyuxuan
Collaborator

Owen-Liuyuxuan commented Jun 6, 2024

I have tried it on personal data. It works, but the canonical camera focal length is not necessarily 500. I believe you could try to fine-tune the parameter for your use case.

For my scene, it is about 1000/sqrt(2); I don't know why, though.
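
If you want to calibrate it for your own scene, one simple option (just a sketch, not something in the repo) is to estimate the effective canonical focal length from a few pixels whose true distance you know, using depths predicted by the exported model above (which bakes in 1000.0):

import numpy as np

def estimate_canonical_focal(pred_depth_at_points, measured_depth, assumed_canonical_focal=1000.0):
    # exported model: pred = true * assumed_canonical_focal / effective_canonical_focal
    # => effective_canonical_focal = assumed_canonical_focal * pred / true
    ratios = np.asarray(pred_depth_at_points) / np.asarray(measured_depth)
    return assumed_canonical_focal * float(np.median(ratios))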

BTW, for TensorRT usage, we need to clear the cache every time before changes to constant parameters take effect, so I suggest doing the tuning on GPU first.
