No output | Custom dataset | TensorRT #42

mahesh-sudhakar · 2021-01-27T20:54:50Z

I'm working on a custom instance segmentation task with three classes. I get output segmentations on my Jetson Xavier when I use the --disable_tensorrt flag, but there's no output when I run the model with TensorRT.

I'm training a ResNet50 model on my PC and transferring the learned model to Jetson for inference.

Initially, I suspected the error was similar to issue #27, as I got IndexError warnings when enabling TensorRT. But while debugging I found that the except blocks do no harm.

My commented detection.py file:

# This try-except block aims to fix the IndexError that we might encounter when we train on custom datasets and evaluate with TensorRT enabled. See https://github.com/haotian-liu/yolact_edge/issues/27.
       try:
           classes = classes[keep]
           boxes = boxes[keep]
           masks = masks[keep]
           scores = scores[keep]

           print("Passed first Try/Except")

       except IndexError:
           from utils.logging_helper import log_once
           log_once(self, "issue_27_flatten", name="yolact.layers.detect", 
           message="Encountered IndexError as mentioned in https://github.com/haotian-liu/yolact_edge/issues/27. Flattening predictions to avoid error, please verify the outputs. If there are any problems you met related to this, please report an issue.")

           classes = torch.flatten(classes, end_dim=1)
           boxes = torch.flatten(boxes, end_dim=1)
           masks = torch.flatten(masks, end_dim=1)
           scores = torch.flatten(scores, end_dim=1)
           keep = torch.flatten(keep, end_dim=1)

           idx = torch.nonzero(keep, as_tuple=True)[0]
           print(f"\nIdx: {idx}")
           print(f"Idx_min: {idx.min()} and Idx_max: {idx.max()}")

           classes = torch.index_select(classes, 0, idx)
           boxes = torch.index_select(boxes, 0, idx)
           masks = torch.index_select(masks, 0, idx)
           scores = torch.index_select(scores, 0, idx)

       # Only keep the top cfg.max_num_detections highest scores across all classes
       scores, idx = scores.sort(0, descending=True)
       idx = idx[:cfg.max_num_detections]
       scores = scores[:cfg.max_num_detections]

       print(f"\nIdx: {idx}")
       print(f"Idx_min: {idx.min()} and Idx_max: {idx.max()}")

       try:
           print(f"\nInside second Try")

           print(f"Classes: {classes}")
           print(f"Boxes: {boxes}")

           classes = classes[idx]
           print(f"Classes updated: {classes}")

           boxes = boxes[idx]
           print(f"Boxes updated: {boxes}")
           
           masks = masks[idx]

           print(f"Scores: {scores}")
       except IndexError:
           from utils.logging_helper import log_once
           log_once(self, "issue_27_index_select", name="yolact.layers.detect", message="Encountered IndexError as mentioned in https://github.com/haotian-liu/yolact_edge/issues/27. Using `torch.index_select` to avoid error, please verify the outputs. If there are any problems you met related to this, please report an issue.")

           print(f"\nSecond Try/Except")

           classes = torch.index_select(classes, 0, idx)
           boxes = torch.index_select(boxes, 0, idx)
           masks = torch.index_select(masks, 0, idx)

           print(f"Classes updated: {classes}")
           print(f"Boxes updated: {boxes}")
           print(f"Scores: {scores}")

       return boxes, masks, classes, scores

Command that I use to run evaluation:

~/Projects/yolact_edge$ python3 eval.py --config=yolact_edge_config --trained_model=weights/yolact_edge_2115_110000.pth --score_threshold=0.3 --top_k=20 --image=./test_input/020801_2020_11_25_11_54_18.png
[01/27 15:04:38 yolact.eval]: Loading model...
[01/27 15:04:42 yolact.eval]: Model loaded.
[01/27 15:04:42 yolact.eval]: Converting to TensorRT...
[01/27 15:04:42 yolact.eval]: Converting backbone to TensorRT...
[01/27 15:04:44 yolact.eval]: Converting protonet to TensorRT...
[01/27 15:04:44 yolact.eval]: Converting FPN to TensorRT...
[01/27 15:04:44 yolact.eval]: Converting PredictionModule to TensorRT...
[01/27 15:04:55 yolact.eval]: Converted to TensorRT.
WARNING [01/27 15:04:56 yolact.layers.detect]: Encountered IndexError as mentioned in https://github.com/haotian-liu/yolact_edge/issues/27. Flattening predictions to avoid error, please verify the outputs. If there are any problems you met related to this, please report an issue.

Idx: tensor([ 0,  1,  2,  9, 11, 13, 15, 16, 18, 24, 26, 27, 30, 31, 34])
Idx_min: 0 and Idx_max: 34

Idx: tensor([ 0, 10,  1,  2, 11, 12,  5, 13,  6, 14,  7,  3,  8,  9,  4])
Idx_min: 0 and Idx_max: 14

Inside second Try
Classes: tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
Boxes: tensor([[ 0.7628,  0.4625,  0.7711,  0.5278],
        [ 0.9843,  0.7867,  0.9961,  0.8066],
        [ 0.8758,  0.4074,  0.9116,  0.5297],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.7592,  0.1906,  0.7628,  0.2080],
        [ 0.9848,  0.7869,  0.9971,  0.8092],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.7628,  0.4625,  0.7711,  0.5278],
        [ 0.8758,  0.4074,  0.9116,  0.5297],
        [ 0.7592,  0.1906,  0.7628,  0.2080],
        [ 0.7592,  0.1906,  0.7633,  0.2087],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.8753,  0.4015,  0.9084,  0.5413],
        [ 0.9848,  0.7869,  0.9971,  0.8092],
        [ 0.7628,  0.4625,  0.7711,  0.5278]])
WARNING [01/27 15:04:56 yolact.layers.detect]: Encountered IndexError as mentioned in https://github.com/haotian-liu/yolact_edge/issues/27. Using `torch.index_select` to avoid error, please verify the outputs. If there are any problems you met related to this, please report an issue.

Second Try/Except
Classes updated: tensor([0, 2, 0, 0, 2, 2, 1, 2, 1, 2, 1, 0, 1, 1, 0])
Boxes updated: tensor([[ 0.7628,  0.4625,  0.7711,  0.5278],
        [ 0.7592,  0.1906,  0.7633,  0.2087],
        [ 0.9843,  0.7867,  0.9961,  0.8066],
        [ 0.8758,  0.4074,  0.9116,  0.5297],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.8753,  0.4015,  0.9084,  0.5413],
        [ 0.9848,  0.7869,  0.9971,  0.8092],
        [ 0.9848,  0.7869,  0.9971,  0.8092],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.7628,  0.4625,  0.7711,  0.5278],
        [ 0.7628,  0.4625,  0.7711,  0.5278],
        [ 0.7709, -0.0089,  0.9846,  0.9502],
        [ 0.8758,  0.4074,  0.9116,  0.5297],
        [ 0.7592,  0.1906,  0.7628,  0.2080],
        [ 0.7592,  0.1906,  0.7628,  0.2080]])
Scores: tensor([9.8882e-01, 9.0431e-01, 8.8910e-01, 7.1460e-01, 5.0766e-01, 1.0031e-03,
        4.2995e-04, 3.9997e-04, 2.7969e-04, 1.7686e-04, 1.9921e-05, 1.6216e-05,
        5.4856e-06, 1.4189e-06, 1.8612e-07])

Please note: if I add the --disable_tensorrt flag, I get results, as the code executes in the try blocks.
I also make sure to remove the cached '.trt' files in the ./weights folder.
Can you help me here? Thank you!
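
For readers following the two Idx printouts in the log above (maxima of 34 and 14), here is a minimal plain-Python sketch of what the flatten, nonzero, index_select and sort steps do to the indices. The helper names and the toy scores are hypothetical; they only mimic the torch calls in the snippet and are not code from the repository.

```python
def nonzero(mask):
    """Positions of truthy entries, like torch.nonzero(mask, as_tuple=True)[0] on a 1-D tensor."""
    return [i for i, flag in enumerate(mask) if flag]


def index_select(values, idx):
    """Pick values[i] for each i in idx, like torch.index_select(values, 0, idx)."""
    return [values[i] for i in idx]


def argsort_descending(values):
    """Indices that would sort `values` from highest to lowest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Hypothetical flattened scores and keep mask (toy values, not from the log).
flat_scores = [0.9, 0.0, 0.2, 0.0, 0.0, 0.7, 0.0, 0.4]
keep = [score > 0.1 for score in flat_scores]

idx_keep = nonzero(keep)                     # indexes into the flattened tensor
kept = index_select(flat_scores, idx_keep)   # only the surviving scores
idx_sort = argsort_descending(kept)          # indexes into the shorter `kept` list
top = index_select(kept, idx_sort)

print(idx_keep)  # [0, 2, 5, 7]
print(idx_sort)  # [0, 2, 3, 1]
print(top)       # [0.9, 0.7, 0.4, 0.2]
```

The first index list can reach the length of the flattened tensor, while the second is always a permutation of 0..len(kept)-1, which matches the pattern of the two printouts.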

haotian-liu · 2021-01-27T21:11:37Z

Hi, can you try --use_fp16_tensorrt to see if the issue persists?

mahesh-sudhakar · 2021-01-27T21:18:54Z

Hi,

No, the issue is not resolved yet.

haotian-liu · 2021-01-27T21:20:48Z

I suspect it is related to #38, and we are currently investigating this issue (it seems TensorRT related). I will let you know about the progress; we might also need some information from you. Thanks.

mahesh-sudhakar · 2021-01-27T21:23:08Z

Okay, thank you! I'll continue to debug in the meantime and will update here in case I see any progress.

haotian-liu · 2021-01-27T21:23:46Z

Thanks!

mahesh-sudhakar · 2021-01-29T18:52:24Z

Hi, just an update! I found that I do get outputs for a few of my test samples, provided that the code passes the second try block. Why is that? For most of my test images, though, I hit both exceptions and hence get no outputs.

You can replicate the issue by changing cfg.max_num_detections to a lower value.

mahesh-sudhakar · 2021-02-02T17:16:19Z

Hi @haotian-liu, after much debugging I figured out that there's no issue with TensorRT!

While using TRT, I don't get any results when I specify score_threshold. Once I removed it, I was able to get detections. Maybe this is because the confidence scores are naturally lower when using TRT; if you agree, you could make this clearer in the README.

Now I'm facing another issue where I get duplicate detections (both with TensorRT and with --disable_tensorrt), i.e., if I have 2 classes, I get 2 boxes around every object.

I believe the issue is in the line below:
https://github.com/haotian-liu/yolact_edge/blob/545bc33d162a5d3cabf992186001ca63bfa7eda2/layers/functions/detection.py#L94

After selecting the conf_scores higher than my threshold at
https://github.com/haotian-liu/yolact_edge/blob/545bc33d162a5d3cabf992186001ca63bfa7eda2/layers/functions/detection.py#L93
keep has n non-zero values, indicating that I have n objects above my threshold.

While the shapes of boxes and masks are [n, 4] and [n, 32] respectively before they are passed to self.fast_nms(), the shape of scores is not [n, 1].

Is my understanding and finding correct? I think if we fix this issue, we can permanently fix Issue #27 as well.
Thanks!

haotian-liu · 2021-02-02T18:24:00Z

Hi @smahesh2694, with TensorRT, the scores might differ from the ones predicted by the pure PyTorch model, but they are not necessarily lower. All of our demo code and benchmarks are evaluated and generated with the same hyperparameters for TensorRT (FP16/INT8) and PyTorch code. But for some reason, on some of the models trained on custom datasets, the TensorRT predictions cause issues (and there are some quite weird and complicated phenomena related to the CUDA/TensorRT engine).

For your second issue (for both TRT and PyTorch), I don't currently follow, as they should have the same size N in the first dimension, since they are filtered with the same keep as here. Can you elaborate more?

mahesh-sudhakar · 2021-02-02T18:48:05Z

Thanks @haotian-liu! Yes, I agree with your first answer regarding TensorRT.

For the second issue, the shape of my scores is [num_classes, n], which is very weird given that we use the same keep, as you mentioned. To be clear, I'm working on a custom dataset!

In my detect() function,

Shape of conf_score is [X]
Shape of keep is [X]
Shape of cur_scores is [num_classes, X]
Shape of scores is [num_classes, n] where n = count_nonzero(keep)
Shape of boxes is [n, 4]
Shape of masks is [n, 32]

Because of the mismatch in the shape of scores, the final boxes, scores, masks and classes have num_classes x n in their first dimension instead of just n, which leads to multiple detections.
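
To illustrate the shape argument above, here is a minimal plain-Python sketch, assuming toy values and a hypothetical helper name (flatten_candidates), of how scores of shape [num_classes, n] turn n boxes into num_classes x n candidates once flattened. It is not code from detection.py.

```python
from collections import Counter

# Hypothetical toy data: n = 2 boxes kept by `keep`, 3 classes.
boxes = [(0.10, 0.10, 0.40, 0.40), (0.50, 0.50, 0.90, 0.90)]
# scores[c][i] is the confidence of box i for class c, i.e. shape [num_classes, n].
scores = [
    [0.95, 0.02],
    [0.03, 0.90],
    [0.01, 0.04],
]


def flatten_candidates(scores, boxes):
    """Pair every class with every box, as a flattened [num_classes * n] view does."""
    candidates = []
    for class_id, row in enumerate(scores):
        for box_id, score in enumerate(row):
            candidates.append((score, class_id, boxes[box_id]))
    return candidates


candidates = flatten_candidates(scores, boxes)
box_counts = Counter(box for _, _, box in candidates)

print(len(candidates))           # 6
print(list(box_counts.values())) # [3, 3]
```

Every box shows up once per class, so unless the low-scoring (class, box) pairs are filtered out later, each object ends up with num_classes boxes around it. The log in the first post, with 15 entries for three classes, appears consistent with this pattern.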

I also wanted to let you know that I have found a quick fix for my case for now: I replace the line
https://github.com/haotian-liu/yolact_edge/blob/545bc33d162a5d3cabf992186001ca63bfa7eda2/layers/functions/detection.py#L109
with

return {'box': boxes[:torch.count_nonzero(scores > self.conf_thresh)], 'mask': masks[:torch.count_nonzero(scores > self.conf_thresh)], 'class': classes[:torch.count_nonzero(scores > self.conf_thresh)], 'score': scores[:torch.count_nonzero(scores > self.conf_thresh)]}

But I would appreciate your comments here, as I don't want to keep this ugly fix forever!
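
The quick fix above can be written more compactly by computing the count once. This is a plain-Python sketch of the same idea rather than the repository's torch code; truncate_to_confident is a hypothetical helper, and it assumes the detections are already sorted by descending score, since a prefix slice only equals a threshold filter under that ordering. The scores and classes are the first seven entries of the "Scores" and "Classes updated" printouts in the log from the first post.

```python
def truncate_to_confident(detections, conf_thresh):
    """Keep the leading entries whose score exceeds conf_thresh.

    Assumes every list in `detections` is ordered by descending score,
    so slicing off a prefix is equivalent to filtering by the threshold.
    """
    count = sum(1 for score in detections["score"] if score > conf_thresh)
    return {key: values[:count] for key, values in detections.items()}


detections = {
    "score": [0.98882, 0.90431, 0.88910, 0.71460, 0.50766, 0.0010031, 0.00042995],
    "class": [0, 2, 0, 0, 2, 2, 1],
}

confident = truncate_to_confident(detections, conf_thresh=0.3)
print(confident["class"])  # [0, 2, 0, 0, 2]
```

If the scores were not sorted at that point, the slice would keep the wrong entries, so a boolean mask would be the safer choice.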

haotian-liu · 2021-02-02T19:04:05Z

Hi @smahesh2694, I still don't see what is wrong with the shape you printed. Why is the shape [num_classes, n] improper for scores?

mahesh-sudhakar · 2021-02-02T19:09:23Z

"For your second issue (for both TRT and PyTorch), I don't currently follow, as they should have the same size N in the first dimension, since they are filtered with the same keep as here. Can you elaborate more?"

Going by your previous comment, I thought the shape of scores should be [n, 1]. Shouldn't it?

haotian-liu · 2021-02-02T19:10:37Z

@smahesh2694 scores before NMS should be [X, N] as it is required for computing NMS.

mahesh-sudhakar · 2021-02-04T19:57:13Z

Thanks for the help @haotian-liu! I'm closing the issue now!

haotian-liu · 2021-02-07T02:44:06Z

I figured out the cause and applied a fix; details of the solution are explained in #47. Please take a look to see if the issue is resolved.
If the issue persists, please reply directly to #47 (this will be the main thread for related issues for now) with your experiment configurations (details are also explained there). Thanks.

haotian-liu mentioned this issue Jan 31, 2021

RuntimeError: CUDA error: an illegal memory access was encountered #44

Closed

mahesh-sudhakar closed this as completed Feb 4, 2021
