Redis crash on Azure XS VM #167
Since this is a pseudo-production server, I have not yet been able to replace the binaries, but I may have additional information. Shall I upgrade to 2.8.14 and see if the problem remains, or should I wait?
What does INFO on the master return? I recently fixed a bug where a master/slave sync that has to exchange more than 2 GB of data causes the sync to fail. This might be the same problem.
Sorry, I should have explained the file names: redis-host-1 is the master server and redis-host-2 is the slave. The dataset is around 10 MB, so it shouldn't be related to the 2 GB data problem (I think).
I upgraded to 2.8.17, but unfortunately this did not solve the problem. No further crash has occurred, but on the slave (redis-host-2) I get the following log output several times per second, and Explorer shows that Redis keeps trying to write to the output directory (temporary files are created and then deleted).
Unfortunately, I do not see a way to get more info about this "error"; the output is somewhat lacking in that regard ;)
I have recently seen a crash that looks a bit like heap corruption. I have set up an automated test environment to reproduce this issue; once I have a few more crash dumps, I may be able to identify the cause. You can try running with "--loglevel verbose" to capture more diagnostic information about the problem you are seeing.
Just to update you on this issue: I have a scenario that replicates this problem consistently. My debugging tools show a heap corruption event that appears to come from outside the Redis process. I am discussing this issue with the Windows product group and will update you as I get more information.
Is there an estimate of when this will be fixed? It's been one and a half months, we have to move to the production phase soon, and I will be forced to switch to an alternative solution that, to be frank, does not crash (sometimes) several times a day :/
At this point I don't have an estimate.
It's been five and a half months. As you said yourself, you can consistently replicate this problem. Is there any progress on this issue?
After using a Linux-based Redis instance for a few months, I tried again with a newer and bigger Azure instance (D1, 3.5 GB RAM), the newest Windows Redis version (2.8.19-rc1), and completely new config files.
This occurs endlessly without pause: as soon as the "rewriteAppendOnlyFile failed..." line appears, the next rewrite is scheduled and run.
The problem vanished for a while when AOF rewriting was disabled, but it has now occurred two days in a row. Is this project still supported? A reproducible bug that results in data loss is severe for a data storage server, so I don't understand how this issue can be ignored like that for more than half a year.
Application Error
Faulting application name: redis-server.exe, version: 0.0.0.0, time stamp: 0x54edbbf6
0 1001 EVENTLOG_INFORMATION_TYPE redis#796
Windows Error Reporting
Fault bucket , type 0
Problem signature:
Attached files:
These files may be available here:
Analysis symbol:
Hi @nmehlei, I have tracked down the cause of this issue and I have a fix for it. I will be publishing a new release for it asap (in the next day or two). Apologies for taking so long. This wasn't an easy one to track down, and the investigation hit a couple of dead ends.
Fix for #167
RejoinCOWPages used to call QueryWorkingSetEx to find out which pages had been dirtied since the memory map was protected with PAGE_WRITECOPY. But dirty pages that had been swapped out to the system page file were reported as not valid (VirtualAttributes.Valid == 0), so they were not restored into the file map. QueryWorkingSetEx only gives information about pages that are in the working set at the time it is called. Pages can be forced into the working set with VirtualLock, but that seems like a potentially risky and expensive solution. I implemented a solution that uses VirtualQuery to find out which regions have changed protection from PAGE_WRITECOPY.
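The difference between the two strategies in this commit message can be illustrated with a small model. This is a sketch only. The `Page` type and the `query_working_set_ex` / `virtual_query_regions` stand-ins are hypothetical and do not call the Win32 API. The old filter (Valid and not PAGE_WRITECOPY) is an assumption about what the real code checked. What the model does rely on is documented Windows behaviour: a write to a PAGE_WRITECOPY page produces a private copy marked PAGE_READWRITE.

```python
# A pure-Python model of the two RejoinCOWPages strategies from the commit message.
# Everything here is hypothetical: real code calls QueryWorkingSetEx / VirtualQuery
# on the mapped view; these stand-ins only mimic what those calls report.
from dataclasses import dataclass
from typing import List, Tuple

PAGE_SIZE = 4096
PAGE_READWRITE = 0x04  # protection of the private copy made after a write to a COW page
PAGE_WRITECOPY = 0x08  # protection applied to the mapped view (per the commit message)


@dataclass
class Page:
    protect: int    # current page protection
    resident: bool  # True while the page is in the process working set


def query_working_set_ex(pages: List[Page]) -> List[Tuple[int, bool, int]]:
    """Stand-in for QueryWorkingSetEx: (address, Valid, protection) per page.
    Pages that were paged out report Valid == False."""
    return [(i * PAGE_SIZE, p.resident, p.protect) for i, p in enumerate(pages)]


def virtual_query_regions(pages: List[Page]) -> List[Tuple[int, int, int]]:
    """Stand-in for walking the view with VirtualQuery: consecutive pages with the
    same protection are merged into (base, size, protection) regions, whether or
    not they are resident."""
    regions: List[Tuple[int, int, int]] = []
    for i, p in enumerate(pages):
        if regions and regions[-1][2] == p.protect:
            base, size, prot = regions[-1]
            regions[-1] = (base, size + PAGE_SIZE, prot)
        else:
            regions.append((i * PAGE_SIZE, PAGE_SIZE, p.protect))
    return regions


def dirty_pages_old(pages: List[Page]) -> List[int]:
    """Old strategy (modeled): only pages reported Valid are considered, so a dirty
    page that was swapped out is silently skipped."""
    return [addr for addr, valid, prot in query_working_set_ex(pages)
            if valid and prot != PAGE_WRITECOPY]


def dirty_ranges_new(pages: List[Page]) -> List[Tuple[int, int]]:
    """New strategy (modeled): any region whose protection is no longer
    PAGE_WRITECOPY has been written, regardless of residency."""
    return [(base, size) for base, size, prot in virtual_query_regions(pages)
            if prot != PAGE_WRITECOPY]


if __name__ == "__main__":
    view = [
        Page(PAGE_WRITECOPY, True),   # untouched, resident
        Page(PAGE_READWRITE, True),   # written, resident
        Page(PAGE_READWRITE, False),  # written, then swapped to the page file
        Page(PAGE_WRITECOPY, False),  # untouched, swapped out
    ]
    print(dirty_pages_old(view))   # [4096]: the swapped-out dirty page is missed
    print(dirty_ranges_new(view))  # [(4096, 8192)]: both dirty pages are found
```

The fix works because protection is a property of the virtual address range, and VirtualQuery reports it whether or not the page is currently resident. The Valid bit only describes residency. That is also why no VirtualLock call is needed to force pages back into the working set.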
Fixed in release 2.8.19.1.
Hello, I found a similar issue on 2.8.19 as well. Does the new release fix the crash issues above too? Thanks a lot.
While the related issue of the failing AOF rewrite does seem to be fixed, the main issue, the crash with error code 0xc0000409, is NOT fixed in 2.8.19.1. As there are multiple GitHub issues and users complaining about this, what is the roadmap and expected timetable for a fix? And please reopen this ticket.
Thanks for the reply.
@bruceliu2008 The issue you reported is a different one, and it's now tracked here: #242. Release 2.8.19.1 is the most stable version, but it is still affected by #242.
@nmehlei Are you still experiencing crashes with error code 0xc0000409 with 2.8.19.1?
@orangemocha Yes, we upgraded to 2.8.19.1 on Saturday morning and have already experienced one crash (Saturday evening). Details:
Application Error
Faulting application name: redis-server.exe, version: 0.0.0.0, time stamp: 0x5547a2d5
It looks very similar. Could this also be related to the known issue in https://github.com/MSOpenTech/redis#known-issues ?
We currently have no process scanning software enabled, and "RejoinCOWPages" appears nowhere in our logs, so I am pretty sure that at least my issue (and thus this ticket) is not the same as #242.
That known issue is the same as #242. And if "RejoinCOWPages" is not in your logs, we can rule that out. I'll look at this new report id and get back to you asap. |
Understood. Could you reopen this ticket then? |
Can Redis Watcher be a workaround for this issue? |
@orangemocha As redis crashed 3 times this morning - with data loss - I'm now in a difficult position, possibly forced to migrate our storage servers to Linux to use the native redis binaries. Can you give me an estimate? @bruceliu2008 Redis watcher could restart redis after the crash, but it would not prevent the outage itself or the data loss associated with it :/ |
Thanks for the comment. |
Well... I'd be very interested in that answer myself. Currently I have none.
Thanks.
@nmehlei I am still investigating. I can confirm that this is not the same issue that manifested itself after the AOF rewrite, so I will be opening a new issue. The crash reports collected by Windows Error Reporting contain very limited information, and in this case they don't make it easy to determine the cause of the problem. Would it be possible for you to configure your machine to collect full memory dumps? The instructions are here: https://msdn.microsoft.com/en-us/library/windows/desktop/bb787181(v=vs.85).aspx . You can configure it for redis-server.exe only (the article explains how to do so).
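The per-application setup in the MSDN article comes down to a few registry values. Here is a minimal Python sketch. It assumes the key layout from that article: a `LocalDumps\redis-server.exe` subkey with `DumpFolder`, `DumpCount` and `DumpType`, where `DumpType` 2 selects a full dump. The dump folder `C:\redis-dumps` and the helper names are hypothetical examples. Writing under HKEY_LOCAL_MACHINE needs an elevated (administrator) process.

```python
# Sketch: enable Windows Error Reporting "LocalDumps" full dumps for redis-server.exe only.
# Key/value names follow the "Collecting User-Mode Dumps" article; the dump folder and the
# helper names are hypothetical. Writing under HKEY_LOCAL_MACHINE requires administrator rights.
import sys

LOCALDUMPS_KEY = r"SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"
DUMP_TYPE_FULL = 2  # DumpType: 0 = custom dump, 1 = mini dump, 2 = full dump


def local_dump_settings(exe_name="redis-server.exe", folder=r"C:\redis-dumps", count=10):
    """Return the per-application subkey and its (name, registry type, data) values."""
    subkey = LOCALDUMPS_KEY + "\\" + exe_name
    values = [
        ("DumpFolder", "REG_EXPAND_SZ", folder),  # where WER writes the .dmp files
        ("DumpCount", "REG_DWORD", count),        # maximum number of dumps kept
        ("DumpType", "REG_DWORD", DUMP_TYPE_FULL),
    ]
    return subkey, values


def apply_settings(subkey, values):
    """Write the values under HKEY_LOCAL_MACHINE (Windows only)."""
    import winreg  # imported here so the module still loads on non-Windows hosts
    kinds = {"REG_EXPAND_SZ": winreg.REG_EXPAND_SZ, "REG_DWORD": winreg.REG_DWORD}
    # KEY_WOW64_64KEY targets the 64-bit registry view even from a 32-bit Python.
    access = winreg.KEY_SET_VALUE | winreg.KEY_WOW64_64KEY
    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, subkey, 0, access) as key:
        for name, kind, data in values:
            winreg.SetValueEx(key, name, 0, kinds[kind], data)


if __name__ == "__main__" and sys.platform == "win32":
    apply_settings(*local_dump_settings())
```

Scoping the key to `redis-server.exe` keeps WER from writing full dumps for every crashing process on the machine. Full dumps of a Redis process are roughly the size of its memory footprint, so the folder needs room for `DumpCount` of them.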
@orangemocha Thanks. If you need any more information, I'm happy to assist. I changed the Windows Error Reporting settings, though I'm not sure whether the crash will occur again in the next few days.
Closing this issue. Opened #244 for the new crash.
We just released |
I think you meant 2.8.21 ;)
Yes :) We just released 2.8.21, which fixes many stability issues, including the ones reported here.
Split from issue #161.
@nmehlei
I need to get a full dump for analysis. Please do the following:
procdump -e -ma redis-server.exe
Once Redis throws the exception, a dump will be created in the directory you launched procdump from. Send me the generated file.