CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected #666
Another data point. When a transcode fails in the Docker container, I can replicate the ffmpeg command on the desktop (in other words, outside the Docker container) and it works fine. Why does it break in the Docker container? |
Hmm seems like a bug between FFmpeg/Docker/Hardware, perhaps an FFmpeg update will fix the issue. The dev container has an FFmpeg update, looking to get it out soon. |
I can try out the new version when it becomes available. I was suspecting it may be the NVIDIA runtime for Docker containers (I have a foggy memory of installing that to get GPU transcoding to work in a Docker container) and that something there broke down, causing the CUDA_ERROR_NO_DEVICE. Lately it transcodes a video or two and then suddenly there are no devices left for GPU transcoding. Restarting the Docker container always fixes the problem, but it's essentially babysitting Tdarr, which shouldn't have to be the case. |
FWIW I am seeing this exact behavior; happy to supply logs/tests if it helps at all. A restart of the Docker container fixes it every time. |
I had cron'ed a restart of the Docker container at midnight in an attempt to mitigate this issue, but it remains. It might be nice if Tdarr allowed access to its database so I could interrogate the state and restart the Docker container if need be, but then again, if the bug is fixed it would be unnecessary. Out of curiosity, @Sc0th, what's your environment? Where are you running the Docker container? What OS/machine? And where's your server? |
Apologies, I could have made that post slightly more useful! I also tried the automated restart, likewise to no avail. I am running the container using podman on a VM on Proxmox with PCI pass-through. Some (maybe) useful detail:
I am using the following line, monitored by Zabbix, to alert me when it gets stuck:
A result higher than 0 indicates it has got stuck. I did look at using this to 'self heal' by restarting the container on trigger; however, the jobs do not appear to requeue automatically, so that did not quite go to plan. I wait with grateful anticipation for the next release in the hope of a fix! |
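The exact Zabbix item wasn't captured above; the following is only a hypothetical sketch of that kind of check, assuming a podman-managed container named tdarr-node (both the container name and the log source are assumptions, not the poster's actual line):

```sh
# Hypothetical sketch: count occurrences of the error in the last hour of container logs.
# Per the comment above, a result greater than 0 would indicate the node is stuck.
podman logs --since 1h tdarr-node 2>&1 | grep -c 'CUDA_ERROR_NO_DEVICE'
```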
Yeah, I think it will end up being some interaction between Tdarr and the NVIDIA runtime for the Docker container that causes NVIDIA to lose track of available devices, thus reporting CUDA_ERROR_NO_DEVICE. I'm running on my Thelio desktop under Ubuntu 22.04. I have the following:
|
Still have this issue; however, today I think I captured a log of Tdarr working on a video when the CUDA_ERROR_NO_DEVICE happened right in the middle. Maybe this log will help in debugging the bug. |
I am trying a downgrade of nvidia-docker2 back to 2.10.0-1 from 2.11.0-1 to see if that makes any difference as a workaround. If not, I may try 2.9.1-1. I am having this same issue on Ubuntu 22.04 LTS. |
Downgrading to nvidia-docker2 version 2.9.1-1 seems to be a workaround for this issue, at least after a day of testing. If it STOPS working I will let you know. Annoying to have my updates reporting a held-back package, but better than Tdarr-node breaking every hour or two. |
Ok thanks for the update 👍 |
Thanks. Downgraded to 2.9.1-1. Will report what happens. |
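For reference, a hedged sketch of the downgrade-and-hold approach being tried here; the version string comes from this thread, and `apt-mark hold` is what produces the "held back" notice mentioned above:

```sh
# Check which versions the repository actually offers before pinning:
apt-cache policy nvidia-docker2

# Downgrade to the specific version (apt will warn that this is a downgrade)
# and hold it so a routine apt upgrade won't move it forward again:
sudo apt-get install nvidia-docker2=2.9.1-1
sudo apt-mark hold nvidia-docker2

# Restart the Docker daemon so the change takes effect:
sudo systemctl restart docker
```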
You have seen this: 'With the release of Docker 19.03, usage of nvidia-docker2 packages is deprecated since NVIDIA GPUs are now natively supported as devices in the Docker runtime.' (https://docs.nvidia.com/ai-enterprise/deployment-guide/dg-docker.html)? I am using podman, with:
And am seeing the same issue. |
So are you saying I can just remove nvidia-docker2 and restart the docker container and it'll all work? |
That would depend on the version of docker you are running. |
|
Not massively familiar with the 'apt' set of commands, but would that not show what is available rather than what is installed? Edit: duh, I just saw the [installed]. I guess that means version 20.x, so in theory.....
The [installed] marks these as installed. Note the downgrade of nvidia-docker2 to 2.9.1-1 from 2.11.0-1. |
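For anyone checking their own host the same way, these are standard apt queries (nothing here is quoted from the comments above):

```sh
# List every available version of the package; installed ones are tagged [installed]:
apt list -a nvidia-docker2 2>/dev/null

# Or limit the listing to what is actually installed:
apt list --installed 2>/dev/null | grep -E 'nvidia|docker'
```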
Apologies, I noticed that a little late. I have downgraded my installs by one release also, to see if it has any impact...
🤞 |
OK, I think I have got it working WITHOUT the nvidia-docker2 package at all. It requires updating docker-compose to 1.29 and Docker itself to version 20 in order to use the built-in NVIDIA functionality. The compose file needs to be modified to tell it about the GPU(s):
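The poster's compose snippet was not captured here; the following is a minimal sketch of the standard Compose GPU reservation syntax (the "built-in NVIDIA functionality" mentioned above, available in docker-compose 1.28+), with a placeholder service name and an assumed image tag rather than the poster's actual file:

```yaml
services:
  tdarr-node:                                      # placeholder service name
    image: ghcr.io/haveagitgat/tdarr_node:latest   # assumed image tag
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1                             # or `count: all`
              capabilities: [gpu]
```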
So far (just an hour and 3-4 transcodes) the GPU is still working in the node. Thanks to Sc0th for pointing out that nvidia-docker2 was deprecated. I recommend that the examples and documentation be updated to match the new approach and eliminate the need for nvidia-docker2, at least on newer versions of Docker. |
How are you verifying that the GPUs are being used? I ask because I had to compile nvtop to monitor GPU usage. Also, when I recreate my Tdarr_node Docker container (I use docker run), it basically does a chown for about an hour, which I find useless: it seems to be chowning the database to set the owner to 1000:1000, but with 31k entries in the database from transcoding my entire library, it takes about an hour before this process stops and the Tdarr_node is registered. Until this process finishes, the Tdarr_node is not registered, nor does it show up in the web interface. Do you experience this too? |
I kicked off a transcode and ran nvidia-smi in another window. The ffmpeg process showed up in the process list. Plus, the transcode plugin I use only works with NVENC.
No, but all my files were already owned by 1000:1000. I just copied a chunk from my compose file; this would need to be modified to match your configuration. Your Tdarr server is likely not named Odin either. |
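Two standard nvidia-smi invocations for confirming the GPU is actually in use (generic commands, not quoted from this thread):

```sh
# Live view; the ffmpeg process should appear in the process list while transcoding:
watch -n 1 nvidia-smi

# Per-second utilisation, including the encoder/decoder (enc/dec) columns:
nvidia-smi dmon -s u
```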
Correction, the script I have that does docker run first stops the old container, removes it then uses docker run to re-create the container. I believe this causes this long-running chown to happen. Not sure what you mean by tdarr server being named Odin... |
The "NodeID" in my chunk of compose file above. The server I run this on is named "Odin" and it says that in the file. You will likely want to change that name. Also: really consider moving to docker-compose from docker run. There is a learning curve but MAN is it easier to manage once you get it working. |
Not looking to move over to docker compose. I have a simple script. I run it once. Tdarr should then run in a container in the background with no more input from me. There's also podman or something like that and lots of other Docker technologies. I'm not really that interested in having yet another learning curve. I have enough of them already. On the plus side, Tdarr node seems to be holding up without that pesky nvidia-docker2 module... |
Alas, this seemed to be working but just failed today. Note that I'm running with nvidia-docker2 neither installed nor used by the Docker container. I did remove it, but I didn't fully purge it.
Not sure why the Docker container would use nvidia-docker2 if it's not installed, but there was residual configuration, so I did a complete removal of the nvidia-docker2 configuration and restarted the Docker container. |
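A hedged sketch of what a full removal looks like on Debian/Ubuntu (standard apt and systemctl commands, not quoted from the comment above):

```sh
# Purge removes the package AND its configuration files, unlike a plain remove:
sudo apt-get purge nvidia-docker2
sudo apt-get autoremove --purge

# Restart the Docker daemon and then recreate the container:
sudo systemctl restart docker
```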
Since dropping ... |
As I'm still seeing CUDA_ERROR_NO_DEVICE, I've downgraded ... Spoke too soon:
Hmmm... Now docker won't start. Seems to do an ... More insights: if I downgrade ... So, I have tdarr_node running now with ... @Sc0th, perhaps you could detail to me what versions/setup you have and whether it's still working for you. |
Two new videos were also transcoded without error. I think I may be fixed but I'm not exactly sure how I fixed it. |
I spoke too soon. The errors are back! I still have this problem. |
I moved to 1.11.0-1 this morning and the very next transcode failed (the first failure in a couple of weeks); I dropped back to 1.9.0-1 and all is well again. |
Backed down to 1.9.0-1 on both libnvidia-container-tools AND nvidia-container-toolkit. I think that's the combo that will work. I'll report back. |
Doesn't fix it. Same error! :-( |
Is there any working combination of packages/drivers confirmed to work over extended time? |
The only one claiming to have it working is @Sc0th and he uses docker-compose. I guess I'll have to check out docker-compose... |
podman rather than docker-compose, not that it should make any difference. My node is configured thus:
|
My tdarr-node Docker container has been up for 4 days now with no problems, and I did a LOT of transcoding yesterday with no issues. Docker-compose extract: tdarr-node: ... Not sure WHY it started working, but I did update the Docker container packages last week as part of my usual apt-get upgrade. |
Ignore this. Not working again. |
I was having a very similar problem for a long time where my Tdarr transcodes would start failing after a while, and restarting the Docker container would fix it temporarily. Whenever my transcodes would start failing, I would run ... After finally getting fed up, I did some research and eventually came across a solution that seems to have fixed it for me (2 weeks with no failed transcodes). This issue in the nvidia-docker repo suggested setting the ... kernel parameter. I realize my error wasn't exactly the same as the one being discussed here, but it might be worth trying the same solution. Hopefully it helps! |
How exactly do you set ...?
Run ... BTW, what I found at https://askubuntu.com/questions/1389602/how-to-set-cgroup-memory-enable-on-ubuntu-21-10 said:
And I'm also not sure if I should include the ... |
Assuming you're using Ubuntu, here is the relevant documentation on setting a kernel parameter. It discusses how to set a temporary parameter, which you could do first as a test if you're concerned about problems booting. If that goes well, you can then follow the instructions for permanently setting it. As for those other parameters you mentioned, I didn't set those. It seems like that other post is talking about configuring a server for AWS's Greengrass service, which doesn't seem relevant to Tdarr in any way. |
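The exact kernel parameter was lost in the quoting above; based on the linked nvidia-docker issue and the cgroup discussion that follows, it is most likely the commonly cited systemd.unified_cgroup_hierarchy=false, but treat that as an assumption. A sketch of the Ubuntu procedure being described:

```sh
# Assumed parameter (verify against the linked nvidia-docker issue):
#   systemd.unified_cgroup_hierarchy=false

# 1. Edit /etc/default/grub and append the parameter, e.g.:
#      GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=false"
sudo nano /etc/default/grub

# 2. Regenerate the grub config and reboot:
sudo update-grub
sudo reboot

# 3. After rebooting, confirm the parameter is active:
grep -o 'systemd.unified_cgroup_hierarchy=[^ ]*' /proc/cmdline
```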
Well, I just added the recommended parameter to grub and rebooted. We will see if that fixes it. |
Hmmm... Of course, I wonder if Nvidia will fix this and use the newer cgroup hierarchy and how I might be notified so I can remove this setting and start using the newer setting. |
I'm not sure if this is an NVIDIA or a Tdarr issue. I have two other Docker containers (...). I'm no expert on cgroups, but it looks like there were some deprecated features when going from v1 to v2. Maybe Tdarr is either directly or indirectly using some deprecated v1 feature? |
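A quick, standard way to see which cgroup hierarchy the host is actually running, useful for confirming whether the workaround above took effect (generic command, not taken from this thread):

```sh
# cgroup2fs => unified hierarchy (cgroup v2); tmpfs => legacy/hybrid hierarchy (cgroup v1)
stat -fc %T /sys/fs/cgroup/
```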
Interesting observation! I didn't think about it, but yes, my Plex also runs in a Docker container and uses the GPU for hardware transcoding. This sounds like good info for the developer. |
FWIW this completely broke it for me, so it may not be the full fix. |
Well, for me it's working. What OS/distro are you on? If you didn't type in the right stuff, didn't run update-grub, or didn't reboot, the Linux kernel will not warn you that you got it wrong. I'm on Ubuntu 22.04 and I have the following in ...:
I've run ... |
The grub parameter change seems to be working for me. I am on Ubuntu 22.04.1 LTS. This is an OK workaround, but hopefully this can get fixed so that users are not required to change how the kernel is operating just to make this work. |
Agreed. Not sure if I should close this as fixed with this workaround or leave it open to be fixed more properly. |
+1 for this being a workaround, but not an actual fix. However, since we have at least 3 people here whose issue this seemed to resolve, I'm sure there are others who have had similar issues and didn't find this solution. @HaveAGitGat, could we possibly document this workaround somewhere in the README/on docs.tdarr.io until a better or more permanent solution is found? |
Can also confirm the workaround worked for me. So another vote for the "not fixed, but here's a workaround" option. |
I'd been having the same issues: tdarr would work fine for about a day, and then I'd have to restart the containers in order for transcodes to occur. When I finally chased it down, I was getting the same error as OP and found a solution on the Arch wiki, though I am running Ubuntu 22.04 and Docker 20.10.22. Seems to be an nvidia/docker issue. See Howard O-Neil's reply here. About halfway through his answer he details two ways to make this work that don't require adding kernel parameters: (1) set the container to privileged. After doing this, I never got the "CUDA_ERROR_NO_DEVICE" error and transcoding continues to work just fine without requiring container restarts.
Similarly, you can change the compose file to effect these same changes/additions. Running the containers this way has also resulted in no "CUDA_ERROR_NO_DEVICE" errors, and transcoding works without requiring daily-ish restarts. EDIT: with this 2nd alternative, I found I had to create an override for my docker.service systemd unit which basically waits for the above nvidia devices to exist.
EDIT 2: the 2nd method did fail after a few days, so the only way I could make the container stay up was to make it privileged. |
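A minimal sketch of method (1) above (privileged container) in compose form; the service name and image tag are assumptions rather than the poster's actual file, and note that `privileged: true` grants the container broad access to the host:

```yaml
services:
  tdarr-node:                                      # placeholder service name
    image: ghcr.io/haveagitgat/tdarr_node:latest   # assumed image tag
    privileged: true          # method (1): the change that stopped CUDA_ERROR_NO_DEVICE per the comment above
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```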
@tedcook94 added to the troubleshooting section: |
An issue is open on Jellyfin here, so will let the solution be handled there, ty: jellyfin/jellyfin#9287 |
Now that I've finally converted all of my videos using Tdarr, I like how I can leave it running and it compresses new videos as they are downloaded. And this works... but then it dies with "CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected". I've configured my Docker container to run Tdarr using NVIDIA, and it is set up properly so that GPU transcoding works. But then it breaks with that error. The solution is simply to restart the Docker container and then re-queue the videos, but why does it break in the first place?
QjALSDImR-log.txt
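For context, one common shape of the docker run GPU configuration the issue describes; this is a hedged sketch (image tag assumed, Tdarr-specific options such as the server IP, node name, and volume mounts omitted), not the reporter's actual command:

```sh
# Requires the NVIDIA container toolkit; --gpus all exposes the host GPUs to the container.
docker run -d --name tdarr-node \
  --gpus all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  ghcr.io/haveagitgat/tdarr_node:latest
```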