No output | Custom dataset | TensorRT #42

Closed
mahesh-sudhakar opened this issue Jan 27, 2021 · 14 comments

Comments

@mahesh-sudhakar

Hi @haotian-liu,

I'm working on a custom instance segmentation task with three classes. While I get output segmentations on my Jetson Xavier when using the --disable_tensorrt flag, there is no output when I run the model with TensorRT.

I'm training a ResNet50 model on my PC and transferring the trained model to the Jetson for inference.

Initially, I suspected the error was similar to issue #27, as I got IndexError warnings when enabling TensorRT. But while debugging I found that the except blocks themselves do no harm.

My detection.py file with debug prints added:

# This try-except block aims to fix the IndexError that we might encounter when we train on custom datasets and evaluate with TensorRT enabled. See https://github.com/haotian-liu/yolact_edge/issues/27.
       try:
           classes = classes[keep]
           boxes = boxes[keep]
           masks = masks[keep]
           scores = scores[keep]

           print("Passed first Try/Except")

       except IndexError:
           from utils.logging_helper import log_once
            log_once(self, "issue_27_flatten", name="yolact.layers.detect", message="Encountered IndexError as mentioned in https://github.com/haotian-liu/yolact_edge/issues/27. Flattening predictions to avoid error, please verify the outputs. If there are any problems you met related to this, please report an issue.")

           classes = torch.flatten(classes, end_dim=1)
           boxes = torch.flatten(boxes, end_dim=1)
           masks = torch.flatten(masks, end_dim=1)
           scores = torch.flatten(scores, end_dim=1)
           keep = torch.flatten(keep, end_dim=1)

           idx = torch.nonzero(keep, as_tuple=True)[0]
           print(f"\nIdx: {idx}")
           print(f"Idx_min: {idx.min()} and Idx_max: {idx.max()}")

           classes = torch.index_select(classes, 0, idx)
           boxes = torch.index_select(boxes, 0, idx)
           masks = torch.index_select(masks, 0, idx)
           scores = torch.index_select(scores, 0, idx)

       # Only keep the top cfg.max_num_detections highest scores across all classes
       scores, idx = scores.sort(0, descending=True)
       idx = idx[:cfg.max_num_detections]
       scores = scores[:cfg.max_num_detections]

       print(f"\nIdx: {idx}")
       print(f"Idx_min: {idx.min()} and Idx_max: {idx.max()}")

       try:
           print(f"\nInside second Try")

           print(f"Classes: {classes}")
           print(f"Boxes: {boxes}")

           classes = classes[idx]
           print(f"Classes updated: {classes}")

            boxes = boxes[idx]
           print(f"Boxes updated: {boxes}")
           
           masks = masks[idx]

           print(f"Scores: {scores}")
       except IndexError:
           from utils.logging_helper import log_once
           log_once(self, "issue_27_index_select", name="yolact.layers.detect", message="Encountered IndexError as mentioned in https://github.com/haotian-liu/yolact_edge/issues/27. Using `torch.index_select` to avoid error, please verify the outputs. If there are any problems you met related to this, please report an issue.")

           print(f"\nSecond Try/Except")

           classes = torch.index_select(classes, 0, idx)
           boxes = torch.index_select(boxes, 0, idx)
           masks = torch.index_select(masks, 0, idx)

           print(f"Classes updated: {classes}")
           print(f"Boxes updated: {boxes}")
           print(f"Scores: {scores}")

       return boxes, masks, classes, scores

Command that I use to run evaluation:

~/Projects/yolact_edge$ python3 eval.py --config=yolact_edge_config --trained_model=weights/yolact_edge_2115_110000.pth --score_threshold=0.3 --top_k=20 --image=./test_input/020801_2020_11_25_11_54_18.png
[01/27 15:04:38 yolact.eval]: Loading model...
[01/27 15:04:42 yolact.eval]: Model loaded.
[01/27 15:04:42 yolact.eval]: Converting to TensorRT...
[01/27 15:04:42 yolact.eval]: Converting backbone to TensorRT...
[01/27 15:04:44 yolact.eval]: Converting protonet to TensorRT...
[01/27 15:04:44 yolact.eval]: Converting FPN to TensorRT...
[01/27 15:04:44 yolact.eval]: Converting PredictionModule to TensorRT...
[01/27 15:04:55 yolact.eval]: Converted to TensorRT.
WARNING [01/27 15:04:56 yolact.layers.detect]: Encountered IndexError as mentioned in https://github.com/haotian-liu/yolact_edge/issues/27. Flattening predictions to avoid error, please verify the outputs. If there are any problems you met related to this, please report an issue.

Idx: tensor([ 0,  1,  2,  9, 11, 13, 15, 16, 18, 24, 26, 27, 30, 31, 34])
Idx_min: 0 and Idx_max: 34

Idx: tensor([ 0, 10,  1,  2, 11, 12,  5, 13,  6, 14,  7,  3,  8,  9,  4])
Idx_min: 0 and Idx_max: 14

Inside second Try
Classes: tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
Boxes: tensor([[ 0.7628,  0.4625,  0.7711,  0.5278],
        [ 0.9843,  0.7867,  0.9961,  0.8066],
        [ 0.8758,  0.4074,  0.9116,  0.5297],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.7592,  0.1906,  0.7628,  0.2080],
        [ 0.9848,  0.7869,  0.9971,  0.8092],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.7628,  0.4625,  0.7711,  0.5278],
        [ 0.8758,  0.4074,  0.9116,  0.5297],
        [ 0.7592,  0.1906,  0.7628,  0.2080],
        [ 0.7592,  0.1906,  0.7633,  0.2087],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.8753,  0.4015,  0.9084,  0.5413],
        [ 0.9848,  0.7869,  0.9971,  0.8092],
        [ 0.7628,  0.4625,  0.7711,  0.5278]])
WARNING [01/27 15:04:56 yolact.layers.detect]: Encountered IndexError as mentioned in https://github.com/haotian-liu/yolact_edge/issues/27. Using `torch.index_select` to avoid error, please verify the outputs. If there are any problems you met related to this, please report an issue.

Second Try/Except
Classes updated: tensor([0, 2, 0, 0, 2, 2, 1, 2, 1, 2, 1, 0, 1, 1, 0])
Boxes updated: tensor([[ 0.7628,  0.4625,  0.7711,  0.5278],
        [ 0.7592,  0.1906,  0.7633,  0.2087],
        [ 0.9843,  0.7867,  0.9961,  0.8066],
        [ 0.8758,  0.4074,  0.9116,  0.5297],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.8753,  0.4015,  0.9084,  0.5413],
        [ 0.9848,  0.7869,  0.9971,  0.8092],
        [ 0.9848,  0.7869,  0.9971,  0.8092],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.7628,  0.4625,  0.7711,  0.5278],
        [ 0.7628,  0.4625,  0.7711,  0.5278],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.8758,  0.4074,  0.9116,  0.5297],
        [ 0.7592,  0.1906,  0.7628,  0.2080],
        [ 0.7592,  0.1906,  0.7628,  0.2080]])
Scores: tensor([9.8882e-01, 9.0431e-01, 8.8910e-01, 7.1460e-01, 5.0766e-01, 1.0031e-03,
        4.2995e-04, 3.9997e-04, 2.7969e-04, 1.7686e-04, 1.9921e-05, 1.6216e-05,
        5.4856e-06, 1.4189e-06, 1.8612e-07])

Please note: if I add the --disable_tensorrt flag, I get results, since the code then executes the try blocks.
I also make sure to remove the cached '.trt' files in the ./weights folder.
Can you help me here? Thank you!

@haotian-liu
Collaborator

Hi, can you try --use_fp16_tensorrt to see if the issue persists?

@mahesh-sudhakar
Author

Hi,

No, the issue is not resolved yet.

@haotian-liu
Copy link
Collaborator

I suspect it is related to #38 and we are currently investigating this issue (it seems TensorRT related). I will let you know about the progress, and we might also need some information from you, thanks.

@mahesh-sudhakar
Author

Okay, thank you! I'll continue to debug in the meantime and will update here in case I see any progress.

@haotian-liu
Collaborator

Thanks!

@mahesh-sudhakar
Author

Hi, just an update! I found that I do get outputs for a few of my test samples, provided they pass the second try block. Why is that? For most of my test images, however, both except blocks are triggered and hence there are no outputs for them.

You can replicate the issue by changing cfg.max_num_detections to a lower value.
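
For example, assuming the yolact-style Config and its copy() helper (this is just a hypothetical override for reproduction, not code from the repo), something like:

    from data.config import yolact_edge_config

    # Hypothetical config override: lowering max_num_detections makes the
    # second IndexError much easier to trigger on my dataset.
    debug_config = yolact_edge_config.copy({
        'max_num_detections': 5,
    })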

@mahesh-sudhakar
Author

mahesh-sudhakar commented Feb 2, 2021

Hi @haotian-liu, after much debugging I figured out that there's no issue with TensorRT!

While using TRT, when I specify --score_threshold I get no results. I simply removed it and was able to get detections. Maybe this is because the confidence scores are naturally lower when using TRT; you could make this clearer in the README if you agree.

Now I'm facing another issue where I get duplicate detections (both with TensorRT and with --disable_tensorrt), i.e. if I have 2 classes, I get 2 boxes around every object.

I believe the issue is in this line below,
https://github.com/haotian-liu/yolact_edge/blob/545bc33d162a5d3cabf992186001ca63bfa7eda2/layers/functions/detection.py#L94

After selecting the conf_scores higher than my threshold at
https://github.com/haotian-liu/yolact_edge/blob/545bc33d162a5d3cabf992186001ca63bfa7eda2/layers/functions/detection.py#L93
keep has n non-zero values, indicating that I have n objects above my threshold.

While the shapes of boxes and masks are [n, 4] and [n, 32] respectively before being passed to self.fast_nms(), why is the shape of scores not [n, 1]?

Is my understanding and finding correct? I think if we fix this, we can permanently fix issue #27 as well.
Thanks!

@haotian-liu
Collaborator

Hi @smahesh2694, with TensorRT the scores might differ from those predicted by the pure PyTorch model, but they are not necessarily lower. All of our demo code and benchmarks are evaluated and generated with the same hyperparameters for the TensorRT (FP16/INT8) and PyTorch code. But for some reason, on some models trained on custom datasets, the TensorRT predictions cause issues (and there are quite weird and complicated phenomena related to the CUDA/TensorRT engine).

For your second issue (with both TRT and PyTorch), I currently don't follow, as they should all have the same size n in their first dimension since they are filtered with the same keep here. Can you elaborate?

@mahesh-sudhakar
Author

Thanks @haotian-liu! Yes, I agree with your first answer regarding TensorRT.

For the second issue, the shape of my scores is [num_classes, n], which is very weird even though we use the same keep as you mentioned. To be clear, I'm working on a custom dataset.

In my detect() function,

Shape of conf_scores is [X]
Shape of keep is [X]
Shape of cur_scores is [num_classes, X]
Shape of scores is [num_classes, n] where n = count_nonzero(keep)
Shape of boxes is [n, 4]
Shape of masks is [n, 32]

Because of the mismatch in the shape of scores, the final boxes, scores, masks, and classes end up with num_classes x n in their first dimension instead of just n, which leads to the duplicate detections.
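
To make the shapes concrete, here is a toy sketch of the indexing pattern I mean (made-up sizes and random tensors, not the actual detection.py code):

    import torch

    # Toy sizes, purely for illustration
    num_classes, num_priors, conf_thresh = 3, 8, 0.5
    cur_scores = torch.rand(num_classes, num_priors)  # per-class score for every prior
    boxes = torch.rand(num_priors, 4)
    masks = torch.rand(num_priors, 32)

    conf_scores, _ = cur_scores.max(dim=0)            # best score per prior -> [num_priors]
    keep = conf_scores > conf_thresh                  # boolean mask         -> [num_priors]
    n = int(keep.sum())

    scores = cur_scores[:, keep]                      # -> [num_classes, n]
    boxes = boxes[keep]                               # -> [n, 4]
    masks = masks[keep]                               # -> [n, 32]
    print(scores.shape, boxes.shape, masks.shape, n)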

I also wanted to let you know that I have found a quick fix for my case: I replace the line
https://github.com/haotian-liu/yolact_edge/blob/545bc33d162a5d3cabf992186001ca63bfa7eda2/layers/functions/detection.py#L109
with

return {'box': boxes[:torch.count_nonzero(scores > self.conf_thresh)], 'mask': masks[:torch.count_nonzero(scores > self.conf_thresh)], 'class': classes[:torch.count_nonzero(scores > self.conf_thresh)], 'score': scores[:torch.count_nonzero(scores > self.conf_thresh)]} 
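
An equivalent but slightly tidier form of the same workaround (same behavior, just computing the count once) would be something like:

    # Same local variables as in the one-liner above, inside detect(); behavior is identical.
    n = int(torch.count_nonzero(scores > self.conf_thresh))
    return {'box': boxes[:n], 'mask': masks[:n], 'class': classes[:n], 'score': scores[:n]}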

But I would appreciate your comments here, as I don't want to keep this ugly fix forever!

@haotian-liu
Collaborator

Hi @smahesh2694, I still don't see what is wrong with the shapes you printed. Why is the shape [num_classes, n] improper for scores?

@mahesh-sudhakar
Author

For your second issue (with both TRT and PyTorch), I currently don't follow, as they should all have the same size n in their first dimension since they are filtered with the same keep here. Can you elaborate?

As per your previous comment, I thought the shape of scores should be [n, 1]. Isn't that the case?

@haotian-liu
Collaborator

@smahesh2694 scores before NMS should be two-dimensional ([num_classes, n] in your notation above), as that shape is required for computing the per-class NMS.
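
A rough illustration of the shape convention with toy tensors (not the actual fast_nms code):

    import torch

    # Per-class NMS needs one score row per class so candidates can be ranked class-wise.
    num_classes, n, top_k = 3, 15, 5
    scores = torch.rand(num_classes, n)                # [num_classes, n], as in the thread above
    boxes = torch.rand(n, 4)

    scores, idx = scores.sort(dim=1, descending=True)  # rank candidates independently per class
    idx = idx[:, :top_k]                               # keep the top_k candidates per class
    boxes_per_class = boxes[idx]                       # -> [num_classes, top_k, 4]
    # fast_nms then computes a per-class IoU matrix over boxes_per_class and suppresses
    # overlapping candidates within each class.
    print(scores[:, :top_k].shape, boxes_per_class.shape)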

@mahesh-sudhakar
Author

Thanks for the help @haotian-liu! I'm closing the issue now!

@haotian-liu
Collaborator

haotian-liu commented Feb 7, 2021

I figured out the cause and applied a fix; details of the solution are explained in #47. Please take a look to see if it resolves your issue.
If the issue persists, please reply directly to #47 (that will be the main thread for related issues for now) with your experiment configuration (details are also explained there). Thanks.
