Fix CI after torchmetrics update #567

Merged 4 commits on Dec 9, 2022

Contributor

@danthe3rd danthe3rd commented Dec 8, 2022

Stack from ghstack (oldest at bottom):

The torchmetrics `Accuracy` metric now takes an argument: https://torchmetrics.readthedocs.io/en/stable/classification/accuracy.html

Upstream change in torchmetrics (Lightning-AI):
Lightning-AI/torchmetrics@20eab43
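
For readers who need to support both the old and the new constructor, here is a minimal sketch of one way to do it. It assumes the new argument is the `task` selector described on the linked docs page. `LegacyAccuracy`, `TaskAccuracy` and `build_accuracy` are hypothetical names used only for illustration; the two classes are stand-ins, not the real torchmetrics classes, and this is not necessarily what this PR does.

```python
import inspect


class LegacyAccuracy:
    """Hypothetical stand-in for the older constructor: no `task` parameter."""

    def __init__(self, num_classes=None):
        self.task = None
        self.num_classes = num_classes


class TaskAccuracy:
    """Hypothetical stand-in for the newer constructor: `task` is required."""

    def __init__(self, task, num_classes=None):
        self.task = task
        self.num_classes = num_classes


def build_accuracy(metric_cls, task, num_classes):
    """Pass `task` only when the constructor of `metric_cls` declares it."""
    kwargs = {"num_classes": num_classes}
    if "task" in inspect.signature(metric_cls).parameters:
        kwargs["task"] = task
    return metric_cls(**kwargs)


# Both call sites use the same helper, whichever signature is in effect.
old = build_accuracy(LegacyAccuracy, task="multiclass", num_classes=10)
new = build_accuracy(TaskAccuracy, task="multiclass", num_classes=10)
print(old.task, new.task)
```

If only the updated torchmetrics needs to be supported, passing the argument directly at the call site (something like `Accuracy(task="multiclass", num_classes=...)`, per the linked docs) is simpler than a shim.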

Somehow this is failing with a SEGFAULT on my A100 (in a triton kernel):

#0  0x00007fffc0f62e10 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#1  0x00007fffc0f9303c in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007fffc0f2ea13 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007fffc0f94603 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007fffc119e4a0 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#5  0x00007fffc0f3728f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#6  0x00007fffc0f3999f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#7  0x00007fffc0fdb1c2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#8  0x00007fff502234c0 in _launch ()
   from /data/home/XXXXX/.triton/cache/704a3e6949e60326bc68d18a620bee50/layer_norm_fw.so
#9  0x00007fff3c0eea25 in launch ()
   from /data/home/XXXXX/.triton/cache/2cebb5590a024a2e06fe9de08c6b7079/k_dropout_bw.so
#10 0x0000555555698422 in cfunction_call (func=0x7fff3c6e5760, args=<optimized out>, kwargs=<optimized out>)
    at /usr/local/src/conda/python-3.10.6/Objects/methodobject.c:552

Contributor

@fmassa fmassa left a comment

Thanks!

danthe3rd added 3 commits December 8, 2022 10:28
@danthe3rd danthe3rd merged commit 8a6130d into gh/danthe3rd/63/base Dec 9, 2022
danthe3rd pushed a commit that referenced this pull request Dec 9, 2022
It now takes an argument: https://torchmetrics.readthedocs.io/en/stable/classification/accuracy.html

ghstack-source-id: 124e71e10d9e2f513512cdc08e158a0e1f485239
Pull Request resolved: #567
@danthe3rd danthe3rd deleted the gh/danthe3rd/63/head branch December 9, 2022 15:17
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.