XMR-Stak stopping across multiple rigs at same time #2398

Closed
fbmoose48 opened this issue Apr 7, 2019 · 10 comments

@fbmoose48

2.10.3 on Windows 10, no overclock/undervolt, blockchain gpus

My setup of 4 Windows 10 and 2 Debian rigs with a mix of 480s and 580s has demonstrated something curious:

XMR-Stak seems to drop all GPUs and revert to CPU-only mining or, worse, stop entirely. Both failures occur at what seems to be nearly the same time across all the W10 rigs (within a few minutes to under an hour; I'm not sitting in front of them actively watching), while the Debian rigs run for days unattended.

The W10 rigs all run the blockchain drivers. The Debian rigs don't. Switching entirely to Debian is tempting, but I get 10% better hash rates on the same cards with blockchain drivers on W10. I've tested this: same card, different OS, consistently 10% less. I've heard ROCm drivers on Debian might resolve this, but not all my PCIe slots are 3.0, so I'm not sure they're compatible.

Rather than give up on the blockchain drivers on W10, I'd like to understand why this is happening before I just trade reduced hash rate for stability on Debian.

Any ideas? The timing part of the stoppages always seems suspicious to me.

Also, since the Debian rigs continue running through all this, I don't think it's a network issue.

I haven't been able to correlate what type of restart is necessary yet to get up and running.

Sometimes a restart works, which is great if I'm remote. Other times I have to do a full shutdown-startup to get everything reinitialized - sometimes more than once per rig, but sometimes one shutdown-startup gets it working again.

Blaming the Beta blockchain drivers on Windows seems too obvious. I stick with them over Adrenalin because they have always worked (up until the fork). The hours of downtime on Windows aren't worth a 10% boost while up, compared to running Debian at a 10% reduction with no user intervention. That all W10 rigs stop together and in the same manner (either the GPUs drop out or XMR-Stak closes entirely) seems not to be coincidence.

These failures seem to occur after 2 - 12 hours of running, so multiple times per day. I've noticed that after one of these failures the CPUs, if they remain hashing, seem to be at 25% of their typical rate. I wonder if that is related? What could cause the GPUs to drop off and CPUs to be restricted? Memory issue? I've never "affined to cpu", maybe I should.

One last thought is that it could be unfortunate coincidence. Since the first time they all stopped together, they've all consistently been started within a minute of each other. They all have similar builds, so maybe Windows just reaches its breaking point independently on each rig, and since they start together and are similarly built it just happens to occur near-simultaneously? I doubt it, but I don't want to rule anything out while trying to resolve this.

I have reason to believe this may be happening to other users' Windows rigs simultaneously with my own.
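
One way to test the coincidence idea is to line up the times at which each rig stopped and see whether stops on different rigs fall inside a short window. This is only a minimal sketch, assuming the stop times are noted by hand or pulled from your own logs; the rig names, the timestamps, and the `clustered_stops` helper are all made up for illustration and are not part of XMR-Stak.

```python
from datetime import datetime, timedelta


def clustered_stops(stops, window=timedelta(minutes=60)):
    """Return groups of stop events that involve more than one rig.

    `stops` maps a rig name to a list of datetimes at which that rig stopped.
    A group collects events that fall within `window` of the group's first event.
    """
    # Flatten to (time, rig) pairs and sort chronologically.
    events = sorted((t, rig) for rig, times in stops.items() for t in times)

    groups, current = [], []
    for t, rig in events:
        # Start a new group once an event falls outside the current window.
        if current and t - current[0][0] > window:
            groups.append(current)
            current = []
        current.append((t, rig))
    if current:
        groups.append(current)

    # Only groups with at least two distinct rigs suggest a shared cause.
    return [g for g in groups if len({rig for _, rig in g}) > 1]


# Hypothetical stop times, not real data from this issue.
stops = {
    "w10-rig-1": [datetime(2019, 4, 6, 21, 30)],
    "w10-rig-2": [datetime(2019, 4, 6, 21, 42)],
    "debian-rig-1": [],
}

for group in clustered_stops(stops):
    print([(t.isoformat(), rig) for t, rig in group])
```

If stops keep clustering even after the rigs are deliberately started at staggered times, the "same uptime, same breaking point" explanation becomes less likely and an external trigger more likely.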

@psychocrypt
Collaborator

psychocrypt commented Apr 7, 2019 via email

@CryptoBroke550

CryptoBroke550 commented Apr 7, 2019

I run 4 Windows 10 rigs, version 1809, with the latest patches and updates paused. Mining XMR.
Last night at 9:30pm, both miners crashed on my RX580 and RX470 rigs running XMR-Stak 2.10.3; I have just restarted both. I have an RX550 rig still running 2.10.3 and an Nvidia 1050Ti rig running 2.10.4 that both worked past this point with no issues.
I do overclock the memory, underclock the core and lower the voltage, but HWiNFO shows no errors when mining XMR, so they are as stable as possible. I use Seasonic power supplies with over double the wattage required, so the power supplies should be OK. I also use UPSs on all the rigs. Driver is 18.6.1.

I also think the miner or card hit something it didn't like.

@fbmoose48
Author

fbmoose48 commented Apr 7, 2019

I would absolutely love to try some updated source code. I very much appreciate your help, @psychocrypt.

I've been stable for the last few hours after switching back to 2.10.2, if that helps you isolate the problem at all.

@Zarkoob

Zarkoob commented Apr 7, 2019

Had two miners crash for me as well at the same time. The pool also showed a huge drop in hash rate at the same time. I think the pool is causing a crash somehow. I'd love to help solve this.

@psychocrypt psychocrypt self-assigned this Apr 8, 2019
@psychocrypt psychocrypt added the bug label Apr 8, 2019
@fbmoose48
Author

fbmoose48 commented Apr 8, 2019

I run 2.10.3 until it crashes, then start 2.10.2 until it crashes. I've been alternating like this for 48 hours, with 2-10 hours of uptime each time. I can't get a version to start again after it has crashed, but the other version will, usually without even needing a restart.

I have no idea why this is, but thought it might help you debug.

Edit: this workaround stopped working. The last 2 times I have had to reboot between failures before restarting the miner.

@StuieG

StuieG commented Apr 10, 2019

I've noticed the same thing: rigs with AMD cards stopping at the same time. Weirdly enough, I have a few rigs with Nvidia cards as well, and I've noticed they've stopped at the same time as each other a couple of times too. Not at the same times as the AMD rigs though.

@PasternakMichal

Seems very similar to what happened to me in #2377, but I was on Adrenalin drivers and also had a Vega 64 mixed in.

@lordhugo7880

What is the current status of this bug?

@psychocrypt
Collaborator

For NVIDIA I fixed it in #2390. For AMD GPUs it was mostly an out-of-memory bug and was also fixed.
@lordhugo7880 If you still have issues like that, please open a new PR with all the details, such as system, GPUs and the exact error.

I will close this issue.

@fbmoose48
Author

I suspected it was a Windows-specific driver issue. I switched the same rig over to Debian, and it has run uninterrupted on the same version of XMR-Stak for almost 8 weeks.
