CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected #666
Another data point. When a transcode fails in the Docker container, I can replicate the ffmpeg command on the desktop (in other words, outside the Docker container) and it works fine. Why does it break in the Docker container? |
Hmm seems like a bug between FFmpeg/Docker/Hardware, perhaps an FFmpeg update will fix the issue. The dev container has an FFmpeg update, looking to get it out soon. |
I can try out the new version when it becomes available. I was suspecting it may be the NVIDIA runtime for Docker containers (I have a foggy memory of installing that to get GPU transcoding to work in a Docker container) and that something there broke down, causing the CUDA_ERROR_NO_DEVICE. Lately it transcodes a video or two and then suddenly there are no devices left for GPU transcoding. Restarting the Docker container always fixes the problem, but it's essentially babysitting Tdarr, which shouldn't have to be the case. |
FWIW I am seeing this exact behavior; happy to supply logs/tests if it helps at all. A restart of the Docker container fixes it every time. |
I had cron'ed a restart of the Docker container at midnight in an attempt to mitigate this issue, but it remains. It might be nice if Tdarr allowed access to its database so I could interrogate the state and restart the Docker container if need be, but then again, if the bug is fixed it would be unnecessary. Out of curiosity, @Sc0th, what's your environment? Where are you running the Docker container? What OS/machine? And where's your server? |
Apologies, I could have made that post slightly more useful! I also tried the automated restart, likewise to no avail. I am running the container using podman on a VM on Proxmox with PCI pass-through. Some (maybe) useful detail:
I am using the following line, monitored by Zabbix, to alert me when it gets stuck:
A result higher than 0 indicates it has got stuck. I did look at using this to 'self heal' by restarting the container on trigger; however, the jobs do not appear to requeue automatically, so that did not quite go to plan. I wait with grateful anticipation for the next release in the hope of a fix! |
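The exact Zabbix item wasn't captured above; the following is only a hypothetical sketch of that kind of check, assuming a podman-managed container named tdarr-node (both the container name and the log source are assumptions, not the poster's actual line):

```sh
# Hypothetical sketch: count occurrences of the error in the last hour of container logs.
# Per the comment above, a result greater than 0 would indicate the node is stuck.
podman logs --since 1h tdarr-node 2>&1 | grep -c 'CUDA_ERROR_NO_DEVICE'
```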
Yeah, I think it will end up being some interaction between Tdarr and the NVIDIA runtime for the Docker container that causes NVIDIA to lose track of available devices, thus reporting CUDA_ERROR_NO_DEVICE. I'm running on my Thelio desktop under Ubuntu 22.04. I have the following:
|
Still have this issue; however, today I think I captured a log of Tdarr working on a video when the CUDA_ERROR_NO_DEVICE happened right in the middle. Maybe this log will help in debugging the bug. |
I am trying a downgrade of nvidia-docker2 back to 2.10.0-1 from 2.11.0-1 to see if that makes any difference as a workaround. If not, I may try 2.9.1-1. I am having this same issue on Ubuntu 22.04 LTS. |
Downgrading to nvidia-docker2 version 2.9.1-1 seems to be a workaround for this issue, at least after a day of testing. If it STOPS working I will let you know. Annoying to have my updates reporting a held-back package, but better than Tdarr-node breaking every hour or two. |
Ok thanks for the update 👍 |
Thanks. Downgraded to 2.9.1-1. Will report what happens. |
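For reference, a hedged sketch of the downgrade-and-hold approach being tried here; the version string comes from this thread, and `apt-mark hold` is what produces the "held back" notice mentioned above:

```sh
# Check which versions the repository actually offers before pinning:
apt-cache policy nvidia-docker2

# Downgrade to the specific version (apt will warn that this is a downgrade)
# and hold it so a routine apt upgrade won't move it forward again:
sudo apt-get install nvidia-docker2=2.9.1-1
sudo apt-mark hold nvidia-docker2

# Restart the Docker daemon so the change takes effect:
sudo systemctl restart docker
```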
You have seen this: 'With the release of Docker 19.03, usage of nvidia-docker2 packages is deprecated since NVIDIA GPUs are now natively supported as devices in the Docker runtime.' (https://docs.nvidia.com/ai-enterprise/deployment-guide/dg-docker.html)? I am using podman, with:
And am seeing the same issue. |
So are you saying I can just remove nvidia-docker2 and restart the docker container and it'll all work? |
That would depend on the version of docker you are running. |
|
Not massively familiar with the 'apt' set of commands, but would that not show what is available rather than what is installed? Edit: duh, I just saw the [installed]. I guess that means version 20.x, so in theory.....
The [installed] marks these as installed. Note the downgrade of nvidia-docker2 to 2.9.1-1 from 2.11.0-1. |
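For anyone checking their own host the same way, these are standard apt queries (nothing here is quoted from the comments above):

```sh
# List every available version of the package; installed ones are tagged [installed]:
apt list -a nvidia-docker2 2>/dev/null

# Or limit the listing to what is actually installed:
apt list --installed 2>/dev/null | grep -E 'nvidia|docker'
```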
Apologies, I noticed that a little late. I have downgraded my installs by one release also, to see if it has any impact...
🤞 |
OK, I think I have got it working WITHOUT the nvidia-docker2 package at all. It requires updating docker-compose to 1.29 and Docker itself to version 20 in order to use the built-in NVIDIA functionality. The compose file needs to be modified to tell it about the GPU(s):
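The poster's compose snippet was not captured here; the following is a minimal sketch of the standard Compose GPU reservation syntax (the "built-in NVIDIA functionality" mentioned above, available in docker-compose 1.28+), with a placeholder service name and an assumed image tag rather than the poster's actual file:

```yaml
services:
  tdarr-node:                                      # placeholder service name
    image: ghcr.io/haveagitgat/tdarr_node:latest   # assumed image tag
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1                             # or `count: all`
              capabilities: [gpu]
```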
So far (just an hour and 3-4 transcodes) the GPU is still working in the node. Thanks to Sc0th for pointing out that nvidia-docker2 was deprecated. I recommend that the examples and documentation be updated to match the new approach and eliminate the need for nvidia-docker2, at least on newer versions of Docker. |
How are you verifying that the GPUs are being used? I ask because I had to compile nvtop to monitor GPU usage. Also, when I recreate my Tdarr_node Docker container (I use docker run), it basically does a chown for about an hour, which I find useless: it seems to be chowning the database to set the owner to 1000:1000, but with 31k entries in the database from transcoding my entire library, it takes about an hour before this process stops and the Tdarr_node is registered. Until this process finishes, the Tdarr_node is not registered, nor does it show up in the web interface. Do you experience this too? |
I kicked off a transcode and ran nvidia-smi in another window. The ffmpeg process showed up in the process list. Plus, the transcode plugin I use only works with NVENC.
No, but all my files were already owned by 1000:1000. I just copied a chunk from my compose file; this would need to be modified to match your configuration. Your Tdarr server is likely not named Odin either. |
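Two standard nvidia-smi invocations for confirming the GPU is actually in use (generic commands, not quoted from this thread):

```sh
# Live view; the ffmpeg process should appear in the process list while transcoding:
watch -n 1 nvidia-smi

# Per-second utilisation, including the encoder/decoder (enc/dec) columns:
nvidia-smi dmon -s u
```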
Correction, the script I have that does docker run first stops the old container, removes it then uses docker run to re-create the container. I believe this causes this long-running chown to happen. Not sure what you mean by tdarr server being named Odin... |
The "NodeID" in my chunk of compose file above. The server I run this on is named "Odin" and it says that in the file. You will likely want to change that name. Also: really consider moving to docker-compose from docker run. There is a learning curve but MAN is it easier to manage once you get it working. |
Not looking to move over to docker compose. I have a simple script. I run it once. Tdarr should then run in a container in the background with no more input from me. There's also podman or something like that and lots of other Docker technologies. I'm not really that interested in having yet another learning curve. I have enough of them already. On the plus side, Tdarr node seems to be holding up without that pesky nvidia-docker2 module... |
Alas, this seemed to be working but just failed today. Note that I'm running with nvidia-docker2 neither installed nor used by the Docker container. I did remove it, but I didn't fully purge it.
Not sure why the Docker container would use nvidia-docker2 if it's not installed, but there was residual configuration, so I did a complete removal of the nvidia-docker2 configuration and restarted the Docker container. |
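A hedged sketch of what a full removal looks like on Debian/Ubuntu (standard apt and systemctl commands, not quoted from the comment above):

```sh
# Purge removes the package AND its configuration files, unlike a plain remove:
sudo apt-get purge nvidia-docker2
sudo apt-get autoremove --purge

# Restart the Docker daemon and then recreate the container:
sudo systemctl restart docker
```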
Since dropping ... |
As I'm still seeing CUDA_ERROR_NO_DEVICE, I've downgraded ... Spoke too soon:
Hmmm... Now docker won't start. Seems to do an ... More insights: if I downgrade ... So, I have tdarr_node running now with ... @Sc0th, perhaps you could detail to me what versions/setup you have and whether it's still working for you. |
Two new videos were also transcoded without error. I think I may be fixed but I'm not exactly sure how I fixed it. |
I spoke too soon. The errors are back! I still have this problem. |
I moved to 1.11.0-1 this morning and the very next transcode failed (the first failure in a couple of weeks); I dropped back to 1.9.0-1 and all is well again. |
Backed down to 1.9.0-1 on both libnvidia-container-tools AND nvidia-container-toolkit. I think that's the combo that will work. I'll report back. |
Doesn't fix it. Same error! :-( |
Is there any working combination of packages/drivers confirmed to work over extended time? |
The only one claiming to have it working is @Sc0th and he uses docker-compose. I guess I'll have to check out docker-compose... |
podman rather than docker-compose, not that it should make any difference. My node is configured thus:
|
My tdarr-node Docker container has been up for 4 days now with no problems, and I did a LOT of transcoding yesterday with no issues. Docker-compose extract: tdarr-node: ... Not sure WHY it started working, but I did update the Docker container packages last week as part of my usual apt-get upgrade. |
Ignore this. Not working again. |
I was having a very similar problem for a long time where my Tdarr transcodes would start failing after a while, and restarting the Docker container would fix it temporarily. Whenever my transcodes would start failing, I would run ... After finally getting fed up, I did some research and eventually came across a solution that seems to have fixed it for me (2 weeks with no failed transcodes). This issue in the nvidia-docker repo suggested setting the ... kernel parameter. I realize my error wasn't exactly the same as the one being discussed here, but it might be worth trying the same solution. Hopefully it helps! |
How exactly do you set ...?
Run ... BTW, what I found at https://askubuntu.com/questions/1389602/how-to-set-cgroup-memory-enable-on-ubuntu-21-10 said:
And I'm also not sure if I should include the ... |
Assuming you're using Ubuntu, here is the relevant documentation on setting a kernel parameter. It discusses how to set a temporary parameter, which you could do first as a test if you're concerned about problems booting. If that goes well, you can then follow the instructions for permanently setting it. As for those other parameters you mentioned, I didn't set those. It seems like that other post is talking about configuring a server for AWS's Greengrass service, which doesn't seem relevant to Tdarr in any way. |
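The exact kernel parameter was lost in the quoting above; based on the linked nvidia-docker issue and the cgroup discussion that follows, it is most likely the commonly cited systemd.unified_cgroup_hierarchy=false, but treat that as an assumption. A sketch of the Ubuntu procedure being described:

```sh
# Assumed parameter (verify against the linked nvidia-docker issue):
#   systemd.unified_cgroup_hierarchy=false

# 1. Edit /etc/default/grub and append the parameter, e.g.:
#      GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=false"
sudo nano /etc/default/grub

# 2. Regenerate the grub config and reboot:
sudo update-grub
sudo reboot

# 3. After rebooting, confirm the parameter is active:
grep -o 'systemd.unified_cgroup_hierarchy=[^ ]*' /proc/cmdline
```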
Well, I just added the recommended parameter to grub and rebooted. We will see if that fixes it. |
Hmmm... Of course, I wonder if Nvidia will fix this and use the newer cgroup hierarchy and how I might be notified so I can remove this setting and start using the newer setting. |
I'm not sure if this is an NVIDIA or a Tdarr issue. I have two other Docker containers (...). I'm no expert on cgroups, but it looks like there were some deprecated features when going from v1 to v2. Maybe Tdarr is either directly or indirectly using some deprecated v1 feature? |
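A quick, standard way to see which cgroup hierarchy the host is actually running, useful for confirming whether the workaround above took effect (generic command, not taken from this thread):

```sh
# cgroup2fs => unified hierarchy (cgroup v2); tmpfs => legacy/hybrid hierarchy (cgroup v1)
stat -fc %T /sys/fs/cgroup/
```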
Interesting observation! I didn't think about it, but yes, my Plex also runs in a Docker container and uses the GPU for hardware transcoding. This sounds like good info for the developer. |
FWIW this completely broke it for me, so it may not be the full fix. |
Well, for me it's working. What OS/distro are you on? If you didn't type in the right stuff, didn't run update-grub, or didn't reboot, the Linux kernel will not warn you that you got it wrong. I'm on Ubuntu 22.04 and I have the following in ...:
I've run ... |
The grub parameter change seems to be working for me. I am on Ubuntu 22.04.1 LTS. This is an OK workaround, but hopefully this can get fixed so that users are not required to change how the kernel is operating just to make this work. |
Agreed. Not sure if I should close this as fixed with this workaround or leave it open to be fixed more properly. |
+1 for this being a workaround, but not an actual fix. However, since we have at least 3 people here whose issue this seemed to resolve, I'm sure there are others who have had similar issues and didn't find this solution. @HaveAGitGat, could we possibly document this workaround somewhere in the README/on docs.tdarr.io until a better or more permanent solution is found? |
Can also confirm the workaround worked for me. So another vote for the "not fixed, but here's a workaround" option. |
I'd been having the same issues: tdarr would work fine for about a day, and then I'd have to restart the containers in order for transcodes to occur. When I finally chased it down, I was getting the same error as OP and found a solution on the Arch wiki, though I am running Ubuntu 22.04 and Docker 20.10.22. Seems to be an nvidia/docker issue. See Howard O-Neil's reply here. About halfway through his answer he details two ways to make this work that don't require adding kernel parameters: (1) set the container to privileged. After doing this, I never got the "CUDA_ERROR_NO_DEVICE" error and transcoding continues to work just fine without requiring container restarts.
Similarly, you can change the compose file to effect these same changes/additions. Running the containers this way has also resulted in no "CUDA_ERROR_NO_DEVICE" errors, and transcoding works without requiring daily-ish restarts. EDIT: with this 2nd alternative, I found I had to create an override for my docker.service systemd unit which basically waits for the above nvidia devices to exist.
EDIT 2: the 2nd method did fail after a few days, so the only way I could make the container stay up was to make it privileged. |
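A minimal sketch of method (1) above (privileged container) in compose form; the service name and image tag are assumptions rather than the poster's actual file, and note that `privileged: true` grants the container broad access to the host:

```yaml
services:
  tdarr-node:                                      # placeholder service name
    image: ghcr.io/haveagitgat/tdarr_node:latest   # assumed image tag
    privileged: true          # method (1): the change that stopped CUDA_ERROR_NO_DEVICE per the comment above
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```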
@tedcook94 added to the troubleshooting section: |
An issue is open on Jellyfin here, so will let the solution be handled there, ty: jellyfin/jellyfin#9287 |
Now that I've finally converted all of my videos using Tdarr, I like how I can leave it running and it compresses new videos as they are downloaded. And this works... but then it dies with "CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected". I've configured my Docker container to run Tdarr using NVIDIA, and it is set up properly so that GPU transcoding works. But then it breaks with that error. The solution is simply to restart the Docker container and then re-queue the videos, but why does it break in the first place?
QjALSDImR-log.txt
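For context, one common shape of the docker run GPU configuration the issue describes; this is a hedged sketch (image tag assumed, Tdarr-specific options such as the server IP, node name, and volume mounts omitted), not the reporter's actual command:

```sh
# Requires the NVIDIA container toolkit; --gpus all exposes the host GPUs to the container.
docker run -d --name tdarr-node \
  --gpus all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  ghcr.io/haveagitgat/tdarr_node:latest
```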