Redis crash on Azure XS VM #167

jepickett · 2014-09-11T21:26:01Z

split from issue 161

I need to get a full dump for analysis. Please do the following:

Build the debug x64 version from source
Run this version and verify that it crashes in the same manner.
Download ProcDump from http://technet.microsoft.com/en-us/sysinternals/dd996900.aspx
Relaunch Redis, then launch procdump from a command prompt with the following command line:

procdump -e -ma redis-server.exe

Once Redis throws the exception, a dump will be created in the directory from which you launched procdump. Send me the generated file.

nmehlei · 2014-09-16T07:29:45Z

Since this is a pseudo-production server, I have not yet been able to swap the binaries, but I may have additional information:
I have two Azure XS VMs, one master and one slave, both with the same 2.8.12 binaries. Since the crashes I already mentioned, redis has not crashed again, but on the slave, memory and cpu usage increase dramatically over time, which seems to be related to aof rewriting.
Master has ram usage of ~13 MiB and ~0% cpu usage, while slave currently has ~196 MiB and a cpu usage of 35-40%. INFO on slave also shows aof_rewrite_in_progress and aof_rewrite_scheduled both at 1, but it doesn't seem to be doing anything. aof_last_bgrewrite_status is at "err".
I sent you both INFO extracts and the corresponding redis.conf via mail.
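
For anyone who wants to check for the same state, here is a minimal Python sketch that parses the text of an INFO reply and flags the combination described above. The helper names and the SAMPLE text are made up for illustration (only the three aof_* values come from this report), and "stuck" here is just a heuristic, not an official diagnosis.

```python
def parse_info(text):
    """Parse the 'key:value' lines of a Redis INFO reply into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Persistence".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def aof_rewrite_looks_stuck(fields):
    """Heuristic: a rewrite is in progress AND scheduled, and the last one failed."""
    return (
        fields.get("aof_rewrite_in_progress") == "1"
        and fields.get("aof_rewrite_scheduled") == "1"
        and fields.get("aof_last_bgrewrite_status") == "err"
    )


# Hypothetical excerpt; only the three aof_* values are taken from the report.
SAMPLE = """# Persistence
aof_rewrite_in_progress:1
aof_rewrite_scheduled:1
aof_last_bgrewrite_status:err
"""

print(aof_rewrite_looks_stuck(parse_info(SAMPLE)))
```

Running it against the INFO output captured a few minutes apart would show whether the state ever clears.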

Shall I upgrade to 2.8.14 and see if the problem remains or should I wait?

jepickett · 2014-09-16T07:49:45Z

What does INFO on master return? I recently fixed a bug where a master/slave sync that has to exchange more than 2 GB of data fails. This might be the same problem.

nmehlei · 2014-09-16T07:53:43Z

Sorry, should’ve explained the file names, redis-host-1 is the master server, redis-host-2 is slave.

The size of the dataset is around 10 MB, so it shouldn’t be related to the 2 GB data problem (I think).

nmehlei · 2014-09-22T09:26:31Z

I upgraded to 2.8.17, but unfortunately this did not solve the problem. No further crash has occurred, but on the slave (redis-host-2) I get the following log output several times per second, and it's visible in explorer that redis tries to write to the output directory (temporary files are created and then deleted):

# fork operation complete
# Background AOF rewrite terminated with error
* Background append only file rewriting started by pid 2404

Unfortunately, I do not see a way to get more info about this "error", the output is somewhat lacking in that regard ;)

jepickett · 2014-10-01T00:26:05Z

I have seen a crash recently that looks a bit like heap corruption. I have set up an automated test environment in order to reproduce this issue. Once I have a few more crash dumps I may be able to identify the reason for the problem. You can try running with "--loglevel verbose" in order to capture more diagnostic information about the problem you are seeing.

jepickett · 2014-10-20T21:57:58Z

Just to update you on this issue. I have a scenario that can replicate this problem consistently. My debugging tools are showing a heap corruption event coming from what looks to be outside of the Redis process. I am in discussion with the Windows product group about this issue. I will update you as I get more information.

nmehlei · 2014-10-26T14:07:23Z

Is there an estimate as to when this will probably be fixed? It's been 1 1/2 months, we have to move to the production phase soon, and I will be forced to move to an alternative solution that, to be frank, does not crash (sometimes) several times a day :/

jepickett · 2014-10-29T03:50:27Z

At this point I don't have an estimate.

nmehlei · 2015-02-24T09:34:00Z

It's been 5 1/2 months. As you said yourself, you can consistently replicate this problem. Is there any progress on this issue?

nmehlei · 2015-03-06T14:23:35Z

After using a linux-based redis instance for a few months, I tried again with a newer and bigger Azure instance (D1, 3.5 GB ram), the newest windows redis version (2.8.19-rc1), completely new config files, etc., but when it's time for redis to rewrite its AOF file, it still throws the following errors:

» 14:51:09.163  [476] 06 Mar 13:51:04.660 * Background append only file rewriting started by pid 3580
» 14:51:09.163  [476] 06 Mar 13:51:04.770 # fork operation complete
» 14:51:09.163  [476] 06 Mar 13:51:04.785 # Background AOF rewrite terminated with error
» 14:51:09.163  [476] 06 Mar 13:51:04.895 * Starting automatic rewriting of AOF on 8438166500% growth
» 14:51:09.163  [1672] 06 Mar 13:51:04.910 # Write error writing append only file on disk: Invalid argument
» 14:51:09.164  [1672] 06 Mar 13:51:04.910 # rewriteAppendOnlyFile failed in qfork: Invalid argument

This occurs endlessly without any pause in between: as soon as the "rewriteAppendOnlyFile failed..." line shows, the next rewrite is scheduled/run.
To clarify, this is a completely new environment, so even if the aforementioned XS vm was just damaged or misconfigured, this would not be the case for this D1 VM.
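
To quantify a loop like this from a saved log, a small Python sketch along the following lines could help. It assumes the "[pid] DD Mon HH:MM:SS.mmm marker message" layout seen in the lines above; the function name and the regular expression are my own, and the sample is the excerpt from this comment with the log viewer's prefix removed.

```python
import re

# Matches e.g. "[476] 06 Mar 13:51:04.660 * Background append only file ..."
# search() is used below, so any prefix added by a log viewer is ignored.
LOG_LINE = re.compile(
    r"\[(\d+)\] (\d{2} \w{3} \d{2}:\d{2}:\d{2})\.(\d{3}) ([*#-]) (.*)"
)


def summarize_rewrites(lines):
    """Count AOF rewrite starts and failures and collect write-error reasons."""
    started = 0
    failed = 0
    errors = []
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if not match:
            continue
        message = match.group(5).strip()
        if message.startswith("Background append only file rewriting started"):
            started += 1
        elif message == "Background AOF rewrite terminated with error":
            failed += 1
        elif message.startswith("Write error writing append only file"):
            # Keep only the reason after the last ": ".
            errors.append(message.rsplit(": ", 1)[-1])
    return {"started": started, "failed": failed, "errors": errors}


SAMPLE_LOG = """[476] 06 Mar 13:51:04.660 * Background append only file rewriting started by pid 3580
[476] 06 Mar 13:51:04.770 # fork operation complete
[476] 06 Mar 13:51:04.785 # Background AOF rewrite terminated with error
[476] 06 Mar 13:51:04.895 * Starting automatic rewriting of AOF on 8438166500% growth
[1672] 06 Mar 13:51:04.910 # Write error writing append only file on disk: Invalid argument
[1672] 06 Mar 13:51:04.910 # rewriteAppendOnlyFile failed in qfork: Invalid argument
"""

summary = summarize_rewrites(SAMPLE_LOG.splitlines())
print(summary)
```

Feeding it a full log file instead of the excerpt gives the number of failed rewrites and the distinct error reasons.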

nmehlei · 2015-04-20T07:48:18Z

The problem vanished for a while when AOF rewriting was disabled, but has now occurred 2 days in a row.

Is this project still supported? A reproducible bug that results in data loss seems pretty severe for a data storage server, so I don't understand how this issue can be ignored like that for more than half a year.

Application Error Faulting application name: redis-server.exe, version: 0.0.0.0, time stamp: 0x54edbbf6
Faulting module name: redis-server.exe, version: 0.0.0.0, time stamp: 0x54edbbf6
Exception code: 0xc0000409
Fault offset: 0x0000000000032e00
Faulting process id: 0x538
Faulting application start time: 0x01d078245c3175f7
Faulting application path: C:\Program Files\Redis\redis-2.8.19\redis-server.exe
Faulting module path: C:\Program Files\Redis\redis-2.8.19\redis-server.exe
Report Id: 895e2f76-e599-11e4-80bc-000d3a20bbd9
Faulting package full name:
Faulting package-relative application ID:

0 1001 EVENTLOG_INFORMATION_TYPE redis#796 Windows Error Reporting Fault bucket , type 0
Event Name: BEX64
Response: Not available
Cab Id: 0

Problem signature:
P1: redis-server.exe
P2: 0.0.0.0
P3: 54edbbf6
P4: redis-server.exe
P5: 0.0.0.0
P6: 54edbbf6
P7: 0000000000032e00
P8: c0000409
P9: 0000000000000007
P10:

Attached files:
C:\Windows\Temp\WER761E.tmp.appcompat.txt
C:\Windows\Temp\WER764E.tmp.WERInternalMetadata.xml
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_redis-server.exe_f4325f7735cbd49e77642c85562c0cc8788b788_d372a104_cab_0e79767b\memory.hdmp
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_redis-server.exe_f4325f7735cbd49e77642c85562c0cc8788b788_d372a104_cab_0e79767b\triagedump.dmp

These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_redis-server.exe_f4325f7735cbd49e77642c85562c0cc8788b788_d372a104_cab_0e79767b

Analysis symbol:
Rechecking for solution: 0
Report Id: 895e2f76-e599-11e4-80bc-000d3a20bbd9
Report Status: 4
Hashed bucket:

orangemocha · 2015-05-01T06:14:29Z

Hi @nmehlei ,

I have tracked down the cause of this issue and I have a fix for it. I will be publishing a new release for it asap (in the next day or two).

Apologies for taking so long. This wasn’t an easy one to track down and the investigation hit a couple of dead ends.

Fix for #167: RejoinCOWPages used to call QueryWorkingSetEx to figure out which pages had been dirtied since the memory map was protected with PAGE_WRITECOPY. But dirty pages that had been swapped out to the system page file would be reported as not valid (VirtualAttributes.Valid == 0) and so we wouldn't restore them into the file map.

QueryWorkingSetEx only gives information about pages that are in the working set at the time it is called. Pages can be forced into the working set using VirtualLock, but that seems like a potentially risky / expensive solution. I implemented a solution that uses VirtualQuery to find out which regions have changed protection from PAGE_WRITECOPY.
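
The difference between the two approaches can be illustrated with a toy model in Python. This is not the qfork code and it calls no Windows API: the Page class and both helper functions are hypothetical and only mimic the behaviour described above, where a working-set query says nothing useful about a page that has been paged out, while a protection query still shows that a written page is no longer PAGE_WRITECOPY.

```python
from dataclasses import dataclass

PAGE_WRITECOPY = "PAGE_WRITECOPY"
PAGE_READWRITE = "PAGE_READWRITE"


@dataclass
class Page:
    """Toy stand-in for one page of the copy-on-write memory map."""
    protection: str = PAGE_WRITECOPY
    resident: bool = True  # True while the page is in the working set

    def write(self):
        # A write to a copy-on-write page changes its protection.
        self.protection = PAGE_READWRITE

    def swap_out(self):
        # The page leaves the working set, e.g. under memory pressure.
        self.resident = False


def dirty_by_working_set(pages):
    """Models the old approach: non-resident pages are reported as not valid."""
    return [
        i for i, page in enumerate(pages)
        if page.resident and page.protection != PAGE_WRITECOPY
    ]


def dirty_by_protection(pages):
    """Models the new approach: any page whose protection changed is dirty."""
    return [i for i, page in enumerate(pages) if page.protection != PAGE_WRITECOPY]


pages = [Page() for _ in range(4)]
pages[1].write()
pages[3].write()
pages[3].swap_out()  # dirty, but no longer in the working set

print(dirty_by_working_set(pages))  # misses page 3
print(dirty_by_protection(pages))   # finds pages 1 and 3
```

In the model, page 3 is exactly the kind of page that used to be lost: it was written to and then swapped out before the pages were rejoined.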

orangemocha · 2015-05-04T17:09:34Z

Fixed in release 2.8.19.1

bruceliu2008 · 2015-05-08T02:19:03Z

Hello, I found a similar issue on 2.8.19 as well.
Here are the symptoms.
When I installed it in a Windows Server 2008 VM, it crashed a few times and I got this error:
"[4248] 05 May 22:48:18.497 # EndForkOperation: 0x000001e7 – RejoinCOWPages: MapViewOfFileEx failed. Please upgrade your OS to Win8 or newer.: unknown error"
After that we replaced the server with a new physical server with Windows Server 2012 installed.
But after running for a few hours, it crashed twice. The first time no error was logged; the second time the error was:
"Accepting client connection: accept: Unknown error"

May I know whether the new release fixes the above crash issues as well?

Thanks a lot,
Bruce

nmehlei · 2015-05-09T20:57:20Z

While the related issue of the failing aof rewrite does indeed seem to be fixed, the main issue, the crash with error code 0xc0000409, is NOT fixed in 2.8.19.1. As there are multiple github issues and users complaining about this, what is the roadmap and expected timetable for this?

And please reopen this ticket.

bruceliu2008 · 2015-05-11T04:39:27Z

Thanks for the reply.
May I know which version is the most stable?
Are there any crash issues with version 2.8.17.4?
Thanks a lot.

orangemocha · 2015-05-11T16:53:11Z

@bruceliu2008 the issue you reported is a different one, and it's now tracked here: #242

Release 2.8.19.1 is the most stable version. But it's still affected by #242

orangemocha · 2015-05-11T16:58:06Z

@nmehlei are you still experiencing crashes with error code 0xc0000409 with 2.8.19.1?
The issue you reported above (report id 895e2f76-e599-11e4-80bc-000d3a20bbd9) is the one that got fixed in 2.8.19.1.

nmehlei · 2015-05-11T17:05:50Z

@orangemocha Yes, we upgraded to 2.8.19.1 on Saturday morning and already experienced one crash (Saturday evening).

Details:

Application Error Faulting application name: redis-server.exe, version: 0.0.0.0, time stamp: 0x5547a2d5
Faulting module name: redis-server.exe, version: 0.0.0.0, time stamp: 0x5547a2d5
Exception code: 0xc0000409
Fault offset: 0x0000000000032de0
Faulting process id: 0xf9c
Faulting application start time: 0x01d08a4e9d574822
Faulting application path: C:\Program Files\Redis\redis-2.8.19.1\redis-server.exe
Faulting module path: C:\Program Files\Redis\redis-2.8.19.1\redis-server.exe
Report Id: f0e8be3a-f680-11e4-80bf-000d3a20bbd9
Faulting package full name:
Faulting package-relative application ID:

Looks very similar. Could this also be related to the known issue in https://github.com/MSOpenTech/redis#known-issues ?
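
A quick way to compare the two reports is to pull out the exception code and fault offset from each. The Python sketch below does that for the values posted in this thread; the function name is made up, and the two report strings are trimmed to the relevant lines. Exception code 0xc0000409 is the NTSTATUS value STATUS_STACK_BUFFER_OVERRUN.

```python
import re


def crash_signature(report):
    """Return (exception_code, fault_offset) as integers from an event log text."""
    code = re.search(r"Exception code: (0x[0-9a-fA-F]+)", report)
    offset = re.search(r"Fault offset: (0x[0-9a-fA-F]+)", report)
    return int(code.group(1), 16), int(offset.group(1), 16)


# Values copied from the 2.8.19 and 2.8.19.1 reports in this thread.
REPORT_2_8_19 = """Exception code: 0xc0000409
Fault offset: 0x0000000000032e00"""
REPORT_2_8_19_1 = """Exception code: 0xc0000409
Fault offset: 0x0000000000032de0"""

old_code, old_offset = crash_signature(REPORT_2_8_19)
new_code, new_offset = crash_signature(REPORT_2_8_19_1)

same_exception = old_code == new_code
offset_delta = old_offset - new_offset

print(same_exception, hex(offset_delta))
```

The exception code is identical and the offsets are close, but since the two binaries are different builds, that alone does not prove both crashes come from the same code path.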

nmehlei · 2015-05-11T17:12:54Z

We neither currently have process scanning software enabled nor can "RejoinCOWPages" be found anywhere in our logs, so I am pretty sure that at least my issue (and thus this ticket) is not the same as #242

orangemocha · 2015-05-11T17:22:46Z

> Could this also be related to the known issue in https://github.com/MSOpenTech/redis#known-issues ?

That known issue is the same as #242. And if "RejoinCOWPages" is not in your logs, we can rule that out.

I'll look at this new report id and get back to you asap.

nmehlei · 2015-05-11T17:31:07Z

Understood. Could you reopen this ticket then?

bruceliu2008 · 2015-05-12T01:28:48Z

Can Redis Watcher be a workaround for this issue?
Is there any other way to fix this issue?

nmehlei · 2015-05-12T08:39:53Z

@orangemocha As redis crashed 3 times this morning - with data loss - I'm now in a difficult position, possibly forced to migrate our storage servers to Linux to use the native redis binaries. Can you give me an estimate?

@bruceliu2008 Redis watcher could restart redis after the crash, but it would not prevent the outage itself or the data loss associated with it :/

bruceliu2008 · 2015-05-12T08:44:46Z

Thanks for the comment.
So from your point of view, what is the best way to deal with it before a more stable version is released?
We are using Redis in production environment now.

nmehlei · 2015-05-12T08:50:50Z

Well...I'd be very interested in that answer myself. Currently I have none.
One might downgrade to an older version in the hope that these do not crash, though those don't have the bugfixes for AOF, so these are (at least for us) not really viable alternatives.

bruceliu2008 · 2015-05-12T09:04:06Z

Thanks.
Anyway, we started using Redis Watcher a few hours ago, and it did automatically restart the redis server after it crashed.

orangemocha · 2015-05-12T16:13:41Z

@nmehlei : I am still investigating. I can confirm that this is not the same issue that manifested itself after aof rewrite, so I will be opening a new issue.

The crash reports collected by Windows Error Reporting contain very limited information, and in this case they don't make it easy to determine the cause of the problem. Would it be possible for you to configure your machine to collect full memory dumps? The instructions are here: https://msdn.microsoft.com/en-us/library/windows/desktop/bb787181(v=vs.85).aspx . You can configure it for redis-server.exe only (the article explains how to do so).

nmehlei · 2015-05-13T08:30:32Z

@orangemocha Thanks. If you need any more information then I'm happy to assist. I changed the Windows Error Reporting settings, though I'm not sure if it'll occur in the next few days.
We have a release cycle of one month, during which usage increases every day until it peaks in the last few days, after which we reset our data and the cycle begins again. Our cycle ended yesterday (thus the frequent crashes, because of high usage) and now we're down to relatively low usage. Unfortunately, we can't have crashes that frequent again in the last few days of this new cycle, so I might have to migrate to Linux before usage rises that high again if it's not fixed by then :/
So, like I said, if I can assist in any way, I'm here.

orangemocha · 2015-05-14T15:02:05Z

Closing this issue. Opened: #244

orangemocha · 2015-06-24T22:00:04Z

We just released ~~2.8.1~~ 2.8.21, which fixes many stability issues including the ones reported here.

nmehlei · 2015-06-25T06:40:05Z

I think you meant 2.8.21 ;)

orangemocha · 2015-06-25T10:47:07Z

Yes :)

We just released 2.8.21, which fixes many stability issues including the ones reported here.

jepickett self-assigned this Sep 11, 2014

jepickett added the Bug label Oct 27, 2014

jepickett assigned orangemocha and unassigned jepickett Dec 3, 2014

orangemocha mentioned this issue May 1, 2015

QFork fixes #236

Merged

orangemocha closed this as completed May 4, 2015

nmehlei mentioned this issue May 11, 2015

Very high memory commit amount #241

Closed

orangemocha reopened this May 11, 2015

orangemocha mentioned this issue May 14, 2015

Redis-server abort in dlmalloc #244

Closed

nmehlei mentioned this issue May 29, 2015

Crash during relatively high utilization, related to string content #247

Closed

orangemocha closed this as completed Jun 24, 2015

msftgits unassigned orangemocha Jul 6, 2017
