IGP causes NVMe Kernel Panic CSTS=0xffffffff #1193
Comments
I forgot to mention that I spent a lot of time troubleshooting this before discovering the cause. I tried different NVMe cards, different motherboards, NVMe heatsinks, built-in M.2 slots vs. PCI adapter cards, UEFI PCI power settings, enabling/disabling ASPM, etc., and the kernel panic always recurred. Sometimes the VID/PID would read as 0xffff. Onboard PCH IGE, AHCI, and USB never had an issue at all, only NVMe. I'm guessing it's some kind of UEFI firmware bug? |
That's an extremely curious bug, thanks for suggesting a fix. I think force disabling RC6 by default in the FeatureControl dict of the framebuffer IORegistryEntry is a good immediate solution. Were you able to isolate the issue just to a single key of this dictionary? Worth mentioning you can also disable render standby by passing bootarg |
Thanks for the tip on the bootarg. I am pretty sure that's it. It can take hours for the panic to happen, but I set RenderStandby back to 1 and got a panic almost immediately. I have reverted the previous changes and am testing with just forceRenderStandby=0 right now, and it hasn't panicked so far. I am not sure what the power impact of this change is. This is a desktop system, but the same problem could be happening on laptops. One of the Linux posts mentions disabling coarse power gating as the better option. There is a key CoarsePowerGatingSelect, but I haven't deduced what the values mean yet. |
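For anyone scripting their configuration, here is a minimal Python sketch of adding `forceRenderStandby=0` to an existing boot-args string without duplicating it. The helper name `set_boot_arg` and the sample string are hypothetical, made up for illustration; only the `forceRenderStandby=0` argument itself comes from this thread.

```python
def set_boot_arg(boot_args: str, name: str, value: str) -> str:
    """Return boot_args with name=value set, replacing any existing entry for name."""
    # Drop any previous occurrence of the argument (bare flag or name=...).
    kept = [
        arg for arg in boot_args.split()
        if arg != name and not arg.startswith(name + "=")
    ]
    kept.append(f"{name}={value}")
    return " ".join(kept)


# Hypothetical existing boot-args; yours will differ.
current = "-v keepsyms=1 forceRenderStandby=1"
updated = set_boot_arg(current, "forceRenderStandby", "0")
print(updated)
```

The function only edits the string; writing the result back to NVRAM or to a bootloader config is left out because that step depends on the setup.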
Coarse power gating is another mechanism used in GEN9 to transition Render and Media engines to sleep. The two appear to be independent in principle. The |
Thanks for the info, it has saved me a lot of time! I did some testing with Setting Intel Power Gadget reports that the IGP frequency never drops below 350 MHz and total power consumption is approximately 1 W higher than with RenderStandby enabled. I'm still at a loss as to why RC6 on the IGP would be affecting the NVMe at all, though. |
It's a complete mystery why there is interference between GPU and PCI. If you can reproduce it on Linux with i915, then this could be reported to Intel. |
By the way, value A similar bug in Linux: https://bugs.freedesktop.org/show_bug.cgi?id=108546. Apparently, it is a BIOS issue, although in that case |
Thanks for your help! Added a comment to WhateverGreen FAQ. Other FAQs will also need to be updated. |
I added the forceRenderStandby=0 boot arg as well, and the IGPU is stuck at 0.3 GHz. |
Maybe this state is when TRIM runs and it is crashing? Try |
It's back doing it again on my machine after a month or so of no issues. It's getting more consistent, too. |
I haven't had this panic since I disabled TRIM |
Will try that - thank you! |
How did you disable TRIM? |
|
I ended up having to do a fresh Big Sur install and restore from Time Machine. That all went great, and I'm back up and running with no freezes again. I've used @1alessandro1's tips/settings above in the hope that they cure it long term. I don't think I will really know for a month or so, as that's how long the freezing issue took to reappear after the last time I did all this. I'll report back in the hope of helping anyone else down the line. Thank you all. |
Let me start with the fact that this is not a bug in NVMeFix or WhateverGreen, but this seems like the best place to document the issue.
I have an Intel 9600K/H370 system that experiences kernel panics in IONVMeController that manifest as a generic timeout:
I have tried to debug this timeout, which always happens at random times, but there is a commonality: it only happens when using the IGP while the display is sleeping.
The IGP going into a low-power mode seems to disrupt power to the NVMe, causing it to crash/reset and thus causing the timeout. The NVMe keeps SMART statistics on power-offs, and I have recorded this anomaly:
I have not been able to figure out exactly how the IGP is causing the NVMe to lose power, but I suspect it may be related to this issue (RC6).
I modified the CFL FB kext with these changes, which seem to completely solve the KP issue:
<key>RenderStandby</key><integer>0</integer>
<key>SetRC6Voltage</key><integer>1</integer>
<key>SupportPSRwithExternalDisplay</key><integer>0</integer>
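For anyone who would rather script this change than edit the plist by hand, here is a minimal Python sketch that applies the three overrides to a FeatureControl dictionary using the standard `plistlib` module. The sample plist, its starting values, and the function name `apply_overrides` are placeholders for illustration; where FeatureControl actually sits inside the framebuffer kext's Info.plist is not shown here and would need to be checked against the real file.

```python
import plistlib

# The three overrides listed in the post above.
OVERRIDES = {
    "RenderStandby": 0,
    "SetRC6Voltage": 1,
    "SupportPSRwithExternalDisplay": 0,
}

# Hypothetical stand-in for the relevant part of an Info.plist.
# The starting values are assumptions, not the kext's real defaults.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>FeatureControl</key>
    <dict>
        <key>RenderStandby</key><integer>1</integer>
        <key>SetRC6Voltage</key><integer>0</integer>
        <key>SupportPSRwithExternalDisplay</key><integer>1</integer>
    </dict>
</dict>
</plist>
"""


def apply_overrides(plist_bytes: bytes, overrides: dict) -> bytes:
    """Parse a plist, merge overrides into its FeatureControl dict, and re-serialize."""
    root = plistlib.loads(plist_bytes)
    feature_control = root.setdefault("FeatureControl", {})
    feature_control.update(overrides)
    return plistlib.dumps(root)


patched = plistlib.loads(apply_overrides(SAMPLE, OVERRIDES))
print(patched["FeatureControl"])
```

The sketch works on an in-memory copy only, so nothing on disk is touched until the output is written somewhere deliberately.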
Have you guys seen IGP power saving cause any similar problems? I'm thinking there might be a way to work around this in WhateverGreen or NVMeFix, to avoid having to create a plist-only kext to change these settings.