
CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected #666

Closed
adefaria opened this issue Jul 25, 2022 · 61 comments
Labels: bug (Something isn't working)

Comments

@adefaria commented Jul 25, 2022

Now that I've finally converted all of my videos using Tdarr, I like how I can leave it running and have it compress new videos as they're downloaded. And this works... but then it dies with "CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected". I've configured my Docker container to run Tdarr using Nvidia, and that part works. But then it breaks with that error. The workaround is simply to restart the Docker container and re-queue the videos, but why does it break in the first place?

QjALSDImR-log.txt

@adefaria (Author) commented Aug 2, 2022

Another data point: when a transcode fails in the Docker container, I can replicate the ffmpeg command on the desktop (i.e., outside of the Docker container) and it works fine.

Why does it break in the docker container?

@HaveAGitGat (Owner)

Hmm, seems like a bug between FFmpeg/Docker/hardware; perhaps an FFmpeg update will fix the issue. The dev container has an FFmpeg update, and I'm looking to get it out soon.

@HaveAGitGat added the bug (Something isn't working) label Aug 8, 2022
@adefaria (Author) commented Aug 9, 2022

I can try out the new version when it becomes available. I suspected it might be the Nvidia runtime for Docker containers (I have a foggy memory of installing that to get GPU transcoding to work in a Docker container) and that something there broke down, causing the CUDA_ERROR_NO_DEVICE. Lately it transcodes a video or two and then suddenly there are no devices left for GPU transcoding. Restarting the Docker container always fixes the problem, but that's essentially babysitting Tdarr, which shouldn't be necessary.

@Sc0th commented Aug 18, 2022

FWIW I am seeing this exact behavior; happy to supply logs/test if it helps at all. A restart of the Docker container fixes it every time.

@adefaria (Author)

I had cron'ed a restart of the Docker container at midnight in an attempt to mitigate this issue but it remains.

It might be nice if Tdarr allowed access to its database so I could interrogate the state and restart the Docker container if need be, but then again, if the bug is fixed it would be unnecessary.

Out of curiosity, @Sc0th, what's your environment? Where are you running the Docker container? What OS/machine? And where's your server?

@Sc0th commented Aug 19, 2022

Apologies, I could have made that post slightly more useful!

I also tried the automated reboot, also to no avail. I am running the container using podman on a VM running on Proxmox with PCI pass-through.

Some (maybe) useful detail:

GPU - Nvidia 1050 Ti
Proxmox 7.2-7

Linux infra01 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Tue Aug 2 13:42:59 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
AlmaLinux 8.6 (Sky Tiger)

nvidia-container-toolkit-1.10.0-1.x86_64
kmod-nvidia-515.65.01-1.el8_6.elrepo.x86_64

podman run --name tdarr-node -v /appdata01/INFRA/tdarr-node/configs:/app/configs -v /appdata01/INFRA/tdarr-node/logs:/app/logs -v /MEDIA:/media -v /appdata01/PROD/tdarr/transcode:/transcode -e "nodeID=infra01-tdarr-node" -e "serverIP=x.x.x.x" -e "serverPort=8266" --net=host -e PUID=1000 -e PGID=1000 -e "NVIDIA_DRIVER_CAPABILITIES=all" -e "NVIDIA_VISIBLE_DEVICES=all" --gpus=all -d ghcr.io/haveagitgat/tdarr_node

Tdarr Server & Node 2.00.18

I am using the following line monitored by Zabbix to alert me when it gets stuck:

cat /appdata01/PROD/tdarr/conf/server/Tdarr/DB2/StatisticsJSONDB/*.json | jq . | grep -ic err

A result higher than 0 indicates it has gotten stuck. I did look at using this to 'self heal' by restarting the container on trigger; however, the jobs do not appear to requeue automatically, so that did not quite go to plan.
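(As a sketch only, that same check could be wrapped in a cron-driven self-heal script. The container name and statistics path below are illustrative and assume the setup described above, with the caveat already mentioned that failed jobs don't appear to requeue on their own.)

#!/bin/sh
# Hypothetical watchdog: restart the node container when the statistics DB reports errors.
# Paths and container name are examples; adjust to your setup.
STATS_GLOB="/appdata01/PROD/tdarr/conf/server/Tdarr/DB2/StatisticsJSONDB/*.json"

# Same check as the Zabbix item above: count lines matching "err".
ERRORS=$(cat $STATS_GLOB | jq . | grep -ic err)

if [ "$ERRORS" -gt 0 ]; then
    echo "$(date): $ERRORS error entries found, restarting tdarr-node"
    podman restart tdarr-node
fi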

I wait with grateful anticipation for the next release in the hope of a fix!

@adefaria (Author)

Yeah, I think it will end up being some interaction between Tdarr and the Nvidia runtime for the Docker container that causes Nvidia to lose track of available devices, thus reporting CUDA_ERROR_NO_DEVICE.

I'm running on my Thelio desktop running Ubuntu 22.04. I have the following:

Earth:apt list| grep nvidia-container

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-container-dev/bionic 1.10.0-1 amd64
libnvidia-container-tools/bionic,now 1.10.0-1 amd64 [installed,automatic]
libnvidia-container1-dbg/bionic 1.10.0-1 amd64
libnvidia-container1/bionic,now 1.10.0-1 amd64 [installed,automatic]
nvidia-container-runtime/bionic 3.10.0-1 all
nvidia-container-toolkit/bionic,now 1.10.0-1 amd64 [installed]
Earth:

@adefaria (Author)

I still have this issue; however, today I think I captured a log of Tdarr working on a video when the CUDA_ERROR_NO_DEVICE happened right in the middle. Maybe this log will help in debugging this bug.

to7uvYRwN-log.txt

@Lebo77 commented Aug 27, 2022

I am trying a downgrade of nvidia-docker2 back to 2.10.0-1 from 2.11.0-1 to see if that makes any difference as a workaround. If not, I may try 2.9.1-1.

I am having this same issue on Ubuntu 22.04 LTS.

@Lebo77 commented Aug 28, 2022

Downgrading to nvidia-docker2 version 2.9.1-1 seems to be a workaround for this issue, at least in a day of testing. If it STOPS working I will let you know.

It's annoying to have my updates reporting a held-back package, but it's better than Tdarr-node breaking every hour or two.
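(For anyone else trying this, a downgrade plus an apt hold is one way to stop apt from immediately re-upgrading the package. The version shown is the one discussed above; adjust to whatever your repo actually offers.)

# Pin nvidia-docker2 at the older release; apt may ask to confirm the downgrade,
# and the package will then show up as held back during upgrades.
sudo apt-get install nvidia-docker2=2.9.1-1
sudo apt-mark hold nvidia-docker2

# Later, to go back to normal upgrades:
# sudo apt-mark unhold nvidia-docker2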

@HaveAGitGat (Owner)

Ok thanks for the update 👍

@adefaria (Author)

Thanks. Downgraded to 2.9.1-1. Will report what happens.

@Sc0th commented Aug 29, 2022

Have you seen this: 'With the release of Docker 19.03, usage of nvidia-docker2 packages is deprecated since NVIDIA GPUs are now natively supported as devices in the Docker runtime.' (from https://docs.nvidia.com/ai-enterprise/deployment-guide/dg-docker.html)?

I am using podman, with:

nvidia-container-toolkit-1.10.0-1.x86_64
nvidia-x11-drv-515.65.01-1.el8_6.elrepo.x86_64
libnvidia-container1-1.10.0-1.x86_64
nvidia-x11-drv-libs-515.65.01-1.el8_6.elrepo.x86_64
libnvidia-container-tools-1.10.0-1.x86_64
kmod-nvidia-515.65.01-1.el8_6.elrepo.x86_64

And am seeing the same issue.

@adefaria (Author)

So are you saying I can just remove nvidia-docker2 and restart the docker container and it'll all work?

@Sc0th commented Aug 29, 2022

That would depend on the version of docker you are running.
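(As an aside, one quick way to sanity-check whether the native --gpus path works on a given host, assuming Docker 19.03+ and nvidia-container-toolkit are installed; the CUDA image tag below is only an example, any image that provides or gets injected with nvidia-smi will do:)

# Should print the same table as running nvidia-smi on the host.
docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu22.04 nvidia-smi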

@adefaria (Author)

Earth:apt list | grep ^docker

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

docker-clean/jammy,jammy 2.0.4-4 all
docker-compose/jammy,jammy 1.29.2-1 all
docker-doc/jammy,jammy 20.10.12-0ubuntu4 all
docker-registry/jammy 2.8.0+ds1-4 amd64
docker.io/jammy,now 20.10.12-0ubuntu4 amd64 [installed]
docker2aci/jammy 0.17.2+dfsg-2.1 amd64
docker/jammy,jammy 1.5-2 all
Earth:

@Sc0th commented Aug 29, 2022

I'm not massively familiar with the apt set of commands, but wouldn't that show what is available rather than what is installed? apt list --installed | grep -i docker might work? Or perhaps docker --version?

Edit - duh, I just saw the [installed]. I guess that means version 20.x, so in theory.....

@adefaria (Author)

Earth:apt list --installed | grep -i docker

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

docker.io/jammy,now 20.10.12-0ubuntu4 amd64 [installed]
nvidia-docker2/bionic,now 2.9.1-1 all [installed,upgradable to: 2.11.0-1]
Earth:

The [installed] marks these as installed. Note the downgrade of nvidia-docker2 to 2.9.1-1 from 2.11.0-1.

@Sc0th commented Aug 29, 2022

Apologies, I noticed that a little late. I have downgraded my installs by one release also; we'll see if it has any impact...

nvidia-x11-drv-515.65.01-1.el8_6.elrepo.x86_64
libnvidia-container-tools-1.9.0-1.x86_64
libnvidia-container1-1.9.0-1.x86_64
nvidia-x11-drv-libs-515.65.01-1.el8_6.elrepo.x86_64
kmod-nvidia-515.65.01-1.el8_6.elrepo.x86_64
nvidia-container-toolkit-1.9.0-1.x86_64
podman version 4.1.1

🤞

@Lebo77 commented Aug 29, 2022

OK, I think I have got it working WITHOUT the nvidia-docker2 package at all.

It requires docker-compose 1.29 and version 20 of Docker itself to use the built-in NVIDIA functionality.

Docker-compose needs to be modified to let it know about the GPU(s):

tdarr-node:
  container_name: tdarr-node
  image: haveagitgat/tdarr_node:latest
  restart: unless-stopped
  network_mode: service:tdarr
  # runtime: nvidia # Comment this out. Not needed with the built-in NVIDIA support
  deploy: # ADD this section
    resources:
      reservations:
        devices:
          - capabilities: [gpu]
  environment:
    - TZ=America/New_York
    - PUID=1000
    - PGID=1000
    - UMASK_SET=002
    - nodeID=Odin
    - nodeIP=0.0.0.0
    - nodePort=8267
    - serverIP=0.0.0.0
    - serverPort=8266
    - NVIDIA_VISIBLE_DEVICES=all # Not sure if these are still needed
    - NVIDIA_DRIVER_CAPABILITIES=all
  volumes:
    - <Your volumes here>
  depends_on:
    - tdarr

So far (just an hour and 3-4 transcodes) the GPU is still working in the node.

Thanks to @Sc0th for pointing out that nvidia-docker2 is deprecated. I recommend that the examples and documentation be updated to match the new approach and eliminate the need for nvidia-docker2, at least on newer versions of Docker.
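(For anyone on plain docker run rather than compose, the rough equivalent of the deploy section above is the built-in --gpus flag. The names and paths below are placeholders; env vars mirror the ones used elsewhere in this thread.)

docker run -d --name tdarr-node \
  --gpus all \
  -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e PUID=1000 -e PGID=1000 \
  -e serverIP=x.x.x.x -e serverPort=8266 \
  -v /path/to/configs:/app/configs \
  -v /path/to/logs:/app/logs \
  -v /path/to/media:/media \
  ghcr.io/haveagitgat/tdarr_node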

@adefaria (Author) commented Aug 29, 2022

How are you verifying that the GPUs are being used? I ask because I had to compile nvtop to monitor GPU usage.

Also, when I recreate my Tdarr_node Docker container (I use docker run), it basically does a chown for like an hour, which I find useless: it seems to be chown-ing the database to set the owner to 1000:1000, but with 31k entries in the database from transcoding my entire library, it takes about an hour before this process stops and the Tdarr node is registered. Until this process finishes, the Tdarr node is not registered and does not show up in the web interface.

Screenshot at 2022-08-29 07-06-18

Do you experience this too?

@Lebo77 commented Aug 29, 2022

> How are you verifying that the GPUs are being used? I ask because I had to compile nvtop to monitor GPU usage.

I kicked off a transcode and ran nvidia-smi in another window. The ffmpeg process showed up in the process list. Plus, the transcode plug-in I use only works with NVENC.
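(Another rough check, assuming the container is named tdarr-node: run nvidia-smi inside the container while a transcode is active and the ffmpeg process should appear in its process table.)

# Works with docker or podman; the container name is an example.
docker exec -it tdarr-node nvidia-smi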

> Also, when I recreate my Tdarr_node Docker container (I use docker run), it basically does a chown for like an hour, which I find useless: it seems to be chown-ing the database to set the owner to 1000:1000, but with 31k entries in the database from transcoding my entire library, it takes about an hour before this process stops and the Tdarr node is registered. Until this process finishes, the Tdarr node is not registered and does not show up in the web interface.
>
> Screenshot at 2022-08-29 07-06-18
>
> Do you experience this too?

No, but all my files were already owned by 1000:1000. I just copied a chunk from my compose file; it would need to be modified to match your configuration. Your Tdarr server is likely not named Odin either.

@adefaria (Author)

> No, but all my files were already owned by 1000:1000. I just copied a chunk from my compose file; it would need to be modified to match your configuration. Your Tdarr server is likely not named Odin either.

Correction: the script I have that does docker run first stops the old container, removes it, then uses docker run to re-create the container. I believe this is what causes the long-running chown.

Not sure what you mean by tdarr server being named Odin...

@Lebo77 commented Aug 29, 2022

> No, but all my files were already owned by 1000:1000. I just copied a chunk from my compose file; it would need to be modified to match your configuration. Your Tdarr server is likely not named Odin either.
>
> Correction: the script I have that does docker run first stops the old container, removes it, then uses docker run to re-create the container. I believe this is what causes the long-running chown.
>
> Not sure what you mean by tdarr server being named Odin...

The "NodeID" in my chunk of compose file above. The server I run this on is named "Odin" and it says that in the file. You will likely want to change that name.

Also: really consider moving to docker-compose from docker run. There is a learning curve but MAN is it easier to manage once you get it working.

@adefaria (Author) commented Aug 29, 2022

I'm not looking to move over to docker-compose. I have a simple script; I run it once, and Tdarr then runs in a container in the background with no more input from me.

There's also podman and lots of other Docker technologies. I'm not really that interested in having yet another learning curve; I have enough of them already.

On the plus side, the Tdarr node seems to be holding up without that pesky nvidia-docker2 package...

@adefaria (Author)

Alas, this seemed to be working but it just failed today. Note that I'm running without nvidia-docker2 installed or used by the Docker container. I did remove it, but I didn't fully remove it:

Earth:apt list | grep nvidia-docker

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-docker2/bionic,now 2.11.0-1 all [residual-config]
Earth:

I'm not sure why the Docker container would use nvidia-docker2 if it's not installed, but there was residual configuration, so I did a complete removal of the nvidia-docker2 configuration and restarted the Docker container.

IDjtt46V_-log.txt

@Sc0th commented Sep 1, 2022

Since dropping libnvidia-container-tools down to version 1.9.0-1 I have not seen this issue reoccur, will report back if that changes.

@adefaria (Author) commented Sep 1, 2022

As I am still seeing CUDA_ERROR_NO_DEVICE, I've downgraded libnvidia-container-tools down to version 1.9.0-1 too. Fingers crossed...

Spoke too soon:

Earth:docker start tdarr_node
Error response from daemon: exec: "nvidia-container-runtime-hook": executable file not found in $PATH
Error: failed to start containers: tdarr_node
Earth:sudo find / -mount -type f -name nvidia-container-runtime-hook
Earth:

Hmmm... Now Docker won't start. It seems that apt purge nvidia-docker2 also removed nvidia-container-toolkit, which provided /bin/nvidia-container-runtime-hook, which is apparently still required.
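(If the hook goes missing after a purge like that, reinstalling the toolkit should presumably put it back; package name as in the apt listings above.)

# Reinstall the package that ships nvidia-container-runtime-hook.
sudo apt-get install --reinstall nvidia-container-toolkit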

More insights: if I downgrade libnvidia-container-tools to 1.9.0-1, then /bin/nvidia-container-runtime-hook goes away, as does nvidia-container-toolkit. If I then install nvidia-container-toolkit, it updates libnvidia-container-tools to 1.10.0-1.

So, I have tdarr_node running now with nvidia-container-toolkit and libnvidia-container-tools both set to 1.10.0-1. We'll see what happens, but I think I'm still getting the CUDA_ERROR_NO_DEVICE error after about a day.

@Sc0th, perhaps you could detail what versions/setup you have and whether that's still working for you.

@adefaria (Author)

Two new videos were also transcoded without error. I think it may be fixed, but I'm not exactly sure how I fixed it.

@adefaria (Author)

I spoke too soon. The errors are back! I still have this problem.

@Sc0th commented Sep 16, 2022

I moved to 1.11.0-1 this morning and the very next transcode failed (first failure in a couple of weeks) dropped back to 1.9.0-1 and all is well again.

@adefaria (Author)

Backed down to 1.9.0-1 on both libnvidia-container-tools AND nvidia-container-toolkit. I think that's the combo that will work. I'll report back.

@adefaria (Author)

Doesn't fix it. Same error! :-(

@saqebakhter

Is there any working combination of packages/drivers confirmed to work over extended time?

@adefaria (Author)

The only one claiming to have it working is @Sc0th and he uses docker-compose. I guess I'll have to check out docker-compose...

@Sc0th commented Sep 19, 2022

podman rather than docker compose, not that it should make any difference, my node is configured thus:

nvidia-x11-drv-515.65.01-1.el8_6.elrepo.x86_64
nvidia-container-toolkit-1.9.0-1.x86_64
libnvidia-container1-1.11.0-1.x86_64
nvidia-x11-drv-libs-515.65.01-1.el8_6.elrepo.x86_64
libnvidia-container-tools-1.11.0-1.x86_64
kmod-nvidia-515.65.01-1.el8_6.elrepo.x86_64

podman version 4.1.1

ghcr.io/haveagitgat/tdarr_node  latest      d556192e63f7  4 months ago  1.58 GB

Mon Sep 19 09:08:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:00:10.0 Off |                  N/A |
|  0%   47C    P0    N/A / 120W |      0MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@Lebo77 commented Sep 19, 2022

My tdarr-node docker container has been up for 4 days now with no problems and I did a LOT of transcoding yesterday with no issues.

Docker-compose extract:

tdarr-node:
  container_name: tdarr-node
  image: haveagitgat/tdarr_node:latest
  restart: unless-stopped
  network_mode: service:tdarr
  deploy:
    resources:
      reservations:
        devices:
          - capabilities: [gpu]
  environment:
    - TZ=America/New_York
    - PUID=1000
    - PGID=1000
    - UMASK_SET=002
    - nodeID=node1
    - nodeIP=0.0.0.0
    - nodePort=8267
    - serverIP=0.0.0.0
    - serverPort=8266
    - NVIDIA_VISIBLE_DEVICES=all
    - NVIDIA_DRIVER_CAPABILITIES=all
  volumes:
    ...

Not sure WHY it started working, but I did update the docker container packages last week as part of my usual apt-get upgrade.

@Lebo77 commented Sep 29, 2022

> My tdarr-node docker container has been up for 4 days now with no problems and I did a LOT of transcoding yesterday with no issues.
>
> (docker-compose extract quoted above)
>
> Not sure WHY it started working, but I did update the docker container packages last week as part of my usual apt-get upgrade.

Ignore this. Not working again.

@tedcook94

I was having a very similar problem for a long time where my Tdarr transcodes would start failing after a while, and restarting the Docker container would fix it temporarily. Whenever my transcodes would start failing, I would run nvidia-smi inside of the Tdarr container and get the error Failed to initialize NVML: Unknown Error.

After finally getting fed up, I did some research and eventually came across a solution that seems to have fixed it for me (2 weeks with no failed transcodes). This issue in the nvidia-docker repo suggested setting the systemd.unified_cgroup_hierarchy kernel parameter to either 0 or 1 to force the use of cgroup v1 or v2, respectively. I tried using cgroupv2 at first, but the behavior didn't change, so then I set systemd.unified_cgroup_hierarchy=0 and haven't had any issues since rebooting.

I realize my error wasn't the exact same as the one being discussed here, but it might be worth trying the same solution. Hopefully it helps!

@adefaria (Author) commented Oct 6, 2022

How exactly do you set systemd.unified_cgroup_hierarchy=0? I've researched a bit and I think I'm merely supposed to set

GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"

run update-grub, and then reboot. But I want to be sure, since this is modifying the boot process and I don't want to get locked out.

BTW, what I found at https://askubuntu.com/questions/1389602/how-to-set-cgroup-memory-enable-on-ubuntu-21-10 said:

A note to anyone enabling cgroups for AWS Greengrass: Edit GRUB_CMDLINE_LINUX to be:

GRUB_CMDLINE_LINUX="cgroup_enable=memory cgroup_memory=1 systemd.unified_cgroup_hierarchy=0"

And I'm also not sure if I should include the cgroup_enable=memory cgroup_memory=1 portion.

@tedcook94

Assuming you're using Ubuntu, here is the relevant documentation on setting a kernel parameter. It discusses how to set a temporary parameter, which you could do first as a test if you're concerned about problems booting. If that goes well, you can then follow the instructions for permanently setting it.

As for those other parameters you mentioned, I didn't set those. It seems like that other post is talking about configuring a server for AWS's Greengrass service, which doesn't seem relevant to Tdarr in any way.

@Lebo77 commented Oct 6, 2022

Well, I just added the recommended parameter to grub and rebooted. We will see if that fixes it.

@adefaria (Author) commented Oct 6, 2022

Hmmm... I still haven't found how to get to the GRUB menu on my System76 Thelio, but I figured I'd simply try update-grub with the change put into /etc/default/grub. I rebooted and went through a few transcodes that had queued up without a problem. That was yesterday. Today a new download came in and it transcoded successfully! That's the first time my Tdarr node has survived overnight! Looks like this may be the fix.

Of course, I wonder if Nvidia will fix this and use the newer cgroup hierarchy, and how I might be notified so I can remove this setting and switch to the newer one.

@tedcook94

I'm not sure if this is a Nvidia or a Tdarr issue. I have two other Docker containers (plex and foldingathome) that both use the same GPU as my Tdarr container, and they never had any issues before or after I added the kernel parameter. Only Tdarr would "lose" the GPU from time to time.

I'm no expert on cgroups, but it looks like there were some deprecated features when going from v1 to v2. Maybe Tdarr is either directly or indirectly using some deprecated v1 feature?

@adefaria (Author) commented Oct 6, 2022

Interesting observation! I didn't think about it, but yes, my Plex also runs in a Docker container and uses the GPU for hardware transcoding. This sounds like good info for the developer.

@Sc0th commented Oct 7, 2022

FWIW this completely broke it for me, so it may not be the full fix.

@adefaria (Author) commented Oct 7, 2022

Well, for me it's working. What OS/distro are you on? If you didn't type in the right thing, didn't run update-grub, or didn't reboot, the Linux kernel will not warn you that you got it wrong.

I'm on Ubuntu 22.04 and I have the following in /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"

I've run update-grub and rebooted. My Tdarr node has stayed up for about 2 days, much longer than ever before, without complaining about this error, and it has been transcoding videos. In fact, it just transcoded 4 videos.
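(To confirm the parameter actually took effect after the reboot, two generic checks: the running kernel's command line should contain it, and the cgroup filesystem should be v1 rather than the pure-v2 cgroup2fs.)

# Should include systemd.unified_cgroup_hierarchy=0:
cat /proc/cmdline

# Reports tmpfs for cgroup v1 (hybrid) and cgroup2fs for pure cgroup v2:
stat -fc %T /sys/fs/cgroup/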

@Lebo77 commented Oct 12, 2022

The grub parameter change seems to be working for me. I am on Ubuntu 22.04.1 LTS.

This is an OK workaround, but hopefully this can get fixed so that users are not required to change how the kernel is operating just to make this work.

@adefaria (Author)

Agreed. Not sure if I should close this as fixed with this workaround or leave it open to be fixed more properly.

@tedcook94

+1 for this being a workaround, but not an actual fix. However, since we have at least 3 people here whose issue this seemed to resolve, I'm sure there are others who have had similar issues and didn't find this solution.

@HaveAGitGat, could we possibly document this workaround somewhere in the README/on docs.tdarr.io until a better or more permanent solution is found?

@tomgehrke

Can also confirm the workaround worked for me.

So another vote for the "not fixed, but here's a workaround" option.

@yajrendrag commented Dec 31, 2022

I'd been having the same issues: Tdarr would work fine for about a day, and then I'd have to restart the containers in order for transcodes to occur. When I finally chased it down, I was getting the same error as the OP and found a solution on the Arch wiki, though I am running Ubuntu 22.04 and Docker 20.10.22. It seems to be an nvidia/docker issue; see Howard O-Neil's reply here. About halfway through his answer he gives an alternative method where he details two ways to make this work that don't require adding kernel parameters:

(1) Set the container to privileged. After doing this, I never got the "CUDA_ERROR_NO_DEVICE" error, and transcoding continues to work just fine without requiring container restarts.
(2) Alternatively, if you don't wish to run the container with elevated privileges, you can change the way you do GPU passthrough to use nvidia-container-runtime. nvidia-container-toolkit on Ubuntu now contains nvidia-container-runtime, so there is nothing to install, and /etc/docker/daemon.json exists and doesn't require changes. You just need to change the docker run invocation: change --gpus=all to --runtime nvidia and then explicitly add all the /dev/nvidia devices:

   --device=/dev/nvidia-caps:/dev/nvidia-caps \
   --device=/dev/nvidia0:/dev/nvidia0 \
   --device=/dev/nvidiactl:/dev/nvidiactl \
   --device=/dev/nvidia-modeset:/dev/nvidia-modeset \
   --device=/dev/nvidia-uvm:/dev/nvidia-uvm \
   --device=/dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools \

Similarly, you can change the compose file to effect these same changes / additions.
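(A rough compose-file sketch of those two alternatives, reusing the same device paths as above; untested, so adjust the image and paths to your setup.)

tdarr-node:
  image: ghcr.io/haveagitgat/tdarr_node:latest
  # (1) simplest but most privileged option:
  # privileged: true
  # (2) nvidia runtime with explicit device passthrough:
  runtime: nvidia
  devices:
    - /dev/nvidia-caps:/dev/nvidia-caps
    - /dev/nvidia0:/dev/nvidia0
    - /dev/nvidiactl:/dev/nvidiactl
    - /dev/nvidia-modeset:/dev/nvidia-modeset
    - /dev/nvidia-uvm:/dev/nvidia-uvm
    - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
  environment:
    - NVIDIA_VISIBLE_DEVICES=all
    - NVIDIA_DRIVER_CAPABILITIES=all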

Running the containers this way has also resulted in no "CUDA_ERROR_NO_DEVICE" errors and transcoding works without requiring daily-ish restarts.

EDIT - with this second alternative, I found I had to create an override for my docker.service systemd unit that basically waits for the above nvidia devices to exist.

Run systemctl edit docker, add [Unit], and then add a line for each device: ConditionPathExists=/dev/nvidia0, etc., for each of the other devices. Whereas /dev/nvidia-caps is a directory, I just listed the whole path to its 2 files, e.g. ConditionPathExists=/dev/nvidia-caps/nvidia-cap1.
OR you could add [Service] with an ExecStartPre=/usr/bin/sleep 30 to delay the start of docker, giving it long enough for the device files to exist; 30 seconds is probably overkill...
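(For reference, the override described above would look roughly like this; device paths are the ones listed earlier, and the exact nvidia-caps file names may differ on your system.)

# systemctl edit docker   (creates /etc/systemd/system/docker.service.d/override.conf)
[Unit]
# Only start docker if the nvidia device nodes already exist:
ConditionPathExists=/dev/nvidia0
ConditionPathExists=/dev/nvidiactl
ConditionPathExists=/dev/nvidia-modeset
ConditionPathExists=/dev/nvidia-uvm
ConditionPathExists=/dev/nvidia-uvm-tools
ConditionPathExists=/dev/nvidia-caps/nvidia-cap1
ConditionPathExists=/dev/nvidia-caps/nvidia-cap2

# Or, instead of the conditions above, simply delay the start:
# [Service]
# ExecStartPre=/usr/bin/sleep 30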

EDIT 2 - the 2nd method did fail after a few days, so the only way I could keep the container working was to make it privileged.


@HaveAGitGat (Owner)

The issue is open on Jellyfin here, so will let the solution be handled there, ty: jellyfin/jellyfin#9287
