This repository has been archived by the owner on Feb 20, 2021. It is now read-only.

Redis crash on Azure XS VM #167

Closed
jepickett opened this issue Sep 11, 2014 · 33 comments

@jepickett

split from issue 161

@nmehlei

I need to get a full dump for analysis. Please do the following:

  1. Build the debug x64 version from source
  2. Run this version and verify that it crashes in the same manner.
  3. Download ProcDump from http://technet.microsoft.com/en-us/sysinternals/dd996900.aspx
  4. Relaunch Redis, then launch procdump from a command prompt with the following command line:

procdump -e -ma redis-server.exe

Once Redis throws the exception a dump will be created in the directory where you launched procdump from. Send me the generated file.

@jepickett jepickett self-assigned this Sep 11, 2014
@nmehlei

nmehlei commented Sep 16, 2014

Since this is a pseudo-production server, I have not yet been able to swap the binaries, but I may have additional information:
I have two Azure XS VMs, one master and one slave, both with the same 2.8.12 binaries. Since the crashes I already mentioned, Redis has not crashed again, but on the slave, memory and CPU usage increase dramatically over time, which seems to be related to AOF rewriting.
The master has RAM usage of ~13 MiB and ~0% CPU usage, while the slave currently has ~196 MiB and CPU usage of 35-40%. INFO on the slave also shows aof_rewrite_in_progress and aof_rewrite_scheduled both at 1, but it doesn't seem to be doing anything. aof_last_bgrewrite_status is at "err".
I sent you both INFO extracts and the corresponding redis.conf via mail.

Shall I upgrade to 2.8.14 and see if the problem remains or should I wait?
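
For readers who want to check for the same state, here is a minimal Python sketch that parses the text returned by INFO and flags the combination described above (rewrite in progress, rewrite scheduled, last rewrite status "err"). The function names and the sample text are hypothetical; only the three field names and their values come from the comment.

```python
def parse_info(text):
    """Parse the text returned by the Redis INFO command into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# Section" headers.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def aof_rewrite_looks_stuck(fields):
    """True when a rewrite is running and queued while the last one failed."""
    return (
        fields.get("aof_rewrite_in_progress") == "1"
        and fields.get("aof_rewrite_scheduled") == "1"
        and fields.get("aof_last_bgrewrite_status") == "err"
    )


# Hypothetical excerpt built from the values quoted in the comment above.
SAMPLE_INFO = """\
# Persistence
aof_rewrite_in_progress:1
aof_rewrite_scheduled:1
aof_last_bgrewrite_status:err
"""

if __name__ == "__main__":
    print(aof_rewrite_looks_stuck(parse_info(SAMPLE_INFO)))
```

This only describes a snapshot; a rewrite that is legitimately running would also show aof_rewrite_in_progress at 1, so the check is most useful when repeated over time.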

@jepickett
Author

What does INFO on the master return? I recently fixed a bug where a master/slave sync that has to exchange more than 2 GB of data fails. This might be the same problem.

@nmehlei

nmehlei commented Sep 16, 2014

Sorry, I should have explained the file names: redis-host-1 is the master server, redis-host-2 is the slave.

The size of the dataset is around 10 MB, so it shouldn't be related to the 2 GB data problem (I think).

@nmehlei

nmehlei commented Sep 22, 2014

I upgraded to 2.8.17, but unfortunately this did not solve the problem. No further crash occurred, but on the slave (redis-host-2) I get the following log output several times per second, and Explorer shows that Redis tries to write to the output directory (temporary files are created and then deleted):

# fork operation complete
# Background AOF rewrite terminated with error
* Background append only file rewriting started by pid 2404

Unfortunately, I do not see a way to get more info about this "error", the output is somewhat lacking in that regard ;)

@jepickett
Author

I have seen a crash recently that looks a bit like heap corruption. I have set up an automated test environment in order to reproduce this issue. Once I have a few more crash dumps I may be able to identify the reason for the problem. You can try running with "--loglevel verbose" in order to capture more diagnostic information about the problem you are seeing.

@jepickett
Author

Just to update you on this issue. I have a scenario that can replicate this problem consistently. My debugging tools are showing a heap corruption event coming from what looks to be outside of the Redis process. I am in discussion with the Windows product group about this issue. I will update you as I get more information.

@nmehlei

nmehlei commented Oct 26, 2014

Is there an estimate as to when this will probably be fixed? It's been 1 1/2 months, we have to move to the production phase soon, and I will be forced to move to an alternative solution that, to be frank, does not (sometimes) crash several times a day :/

@jepickett jepickett added the Bug label Oct 27, 2014
@jepickett
Author

At this point I don't have an estimate.

@jepickett jepickett assigned orangemocha and unassigned jepickett Dec 3, 2014
@nmehlei

nmehlei commented Feb 24, 2015

It's been 5 1/2 months. As you said yourself, you can consistently replicate this problem. Is there any progress on this issue?

@nmehlei

nmehlei commented Mar 6, 2015

After using a Linux-based Redis instance for a few months, I tried again with a newer and bigger Azure instance (D1, 3.5 GB RAM), the newest Windows Redis version (2.8.19-rc1), completely new config files, etc., but when it's time for Redis to rewrite its AOF file, it still throws the following errors:

» 14:51:09.163  [476] 06 Mar 13:51:04.660 * Background append only file rewriting started by pid 3580
» 14:51:09.163  [476] 06 Mar 13:51:04.770 # fork operation complete
» 14:51:09.163  [476] 06 Mar 13:51:04.785 # Background AOF rewrite terminated with error
» 14:51:09.163  [476] 06 Mar 13:51:04.895 * Starting automatic rewriting of AOF on 8438166500% growth
» 14:51:09.163  [1672] 06 Mar 13:51:04.910 # Write error writing append only file on disk: Invalid argument
» 14:51:09.164  [1672] 06 Mar 13:51:04.910 # rewriteAppendOnlyFile failed in qfork: Invalid argument

This repeats endlessly with no pause in between: as soon as the "rewriteAppendOnlyFile failed..." line shows, the next rewrite is scheduled and run.
To clarify, this is a completely new environment, so even if the aforementioned XS VM was just damaged or misconfigured, that would not be the case for this D1 VM.
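
To quantify a rewrite loop like the one in the log above, the lines can be parsed and the start and failure messages counted. The following is a rough Python sketch, not a tool from this project; the regular expression assumes the `[pid] DD Mon HH:MM:SS.mmm <level> message` layout seen in the excerpt, and the function names are hypothetical.

```python
import re

# Matches the "[pid] 06 Mar 13:51:04.660 * message" part of a log line.
# Anything before the "[pid]" (such as the "» 14:51:09.163" prefix) is ignored.
LOG_LINE = re.compile(
    r"\[(?P<pid>\d+)\] (?P<when>\d{2} \w{3} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[.\-*#]) (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict with pid, when, level and message, or None if no match."""
    match = LOG_LINE.search(line)
    return match.groupdict() if match else None


def count_rewrites(lines):
    """Count how many background AOF rewrites were started and how many failed."""
    started = failed = 0
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if "append only file rewriting started" in entry["message"]:
            started += 1
        elif "Background AOF rewrite terminated with error" in entry["message"]:
            failed += 1
    return started, failed


# Lines copied from the log excerpt in the comment above.
SAMPLE_LOG = [
    "» 14:51:09.163  [476] 06 Mar 13:51:04.660 * Background append only file rewriting started by pid 3580",
    "» 14:51:09.163  [476] 06 Mar 13:51:04.770 # fork operation complete",
    "» 14:51:09.163  [476] 06 Mar 13:51:04.785 # Background AOF rewrite terminated with error",
    "» 14:51:09.163  [476] 06 Mar 13:51:04.895 * Starting automatic rewriting of AOF on 8438166500% growth",
    "» 14:51:09.163  [1672] 06 Mar 13:51:04.910 # Write error writing append only file on disk: Invalid argument",
    "» 14:51:09.164  [1672] 06 Mar 13:51:04.910 # rewriteAppendOnlyFile failed in qfork: Invalid argument",
]

if __name__ == "__main__":
    print(count_rewrites(SAMPLE_LOG))  # (1, 1)
```

Run over a full log file, a failed count that tracks the started count one-for-one would match the "endless" behaviour described here.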

@nmehlei

nmehlei commented Apr 20, 2015

The problem vanished for a while when AOF rewriting was disabled, but has now occurred 2 days in a row.

Is this project still supported? A reproducible bug that results in data loss seems pretty severe for a data storage server, so I don't understand how this issue can be ignored like that for more than half a year.


Application Error Faulting application name: redis-server.exe, version: 0.0.0.0, time stamp: 0x54edbbf6
Faulting module name: redis-server.exe, version: 0.0.0.0, time stamp: 0x54edbbf6
Exception code: 0xc0000409
Fault offset: 0x0000000000032e00
Faulting process id: 0x538
Faulting application start time: 0x01d078245c3175f7
Faulting application path: C:\Program Files\Redis\redis-2.8.19\redis-server.exe
Faulting module path: C:\Program Files\Redis\redis-2.8.19\redis-server.exe
Report Id: 895e2f76-e599-11e4-80bc-000d3a20bbd9
Faulting package full name:
Faulting package-relative application ID:


0 1001 EVENTLOG_INFORMATION_TYPE redis#796 Windows Error Reporting Fault bucket , type 0
Event Name: BEX64
Response: Not available
Cab Id: 0

Problem signature:
P1: redis-server.exe
P2: 0.0.0.0
P3: 54edbbf6
P4: redis-server.exe
P5: 0.0.0.0
P6: 54edbbf6
P7: 0000000000032e00
P8: c0000409
P9: 0000000000000007
P10:

Attached files:
C:\Windows\Temp\WER761E.tmp.appcompat.txt
C:\Windows\Temp\WER764E.tmp.WERInternalMetadata.xml
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_redis-server.exe_f4325f7735cbd49e77642c85562c0cc8788b788_d372a104_cab_0e79767b\memory.hdmp
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_redis-server.exe_f4325f7735cbd49e77642c85562c0cc8788b788_d372a104_cab_0e79767b\triagedump.dmp

These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_redis-server.exe_f4325f7735cbd49e77642c85562c0cc8788b788_d372a104_cab_0e79767b

Analysis symbol:
Rechecking for solution: 0
Report Id: 895e2f76-e599-11e4-80bc-000d3a20bbd9
Report Status: 4
Hashed bucket:
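
As a reading aid for the report above: exception code 0xc0000409 is the NTSTATUS value STATUS_STACK_BUFFER_OVERRUN, and "Faulting application start time" is a Windows FILETIME (100-nanosecond ticks since 1601-01-01 UTC). The Python sketch below decodes both from a few lines of the report; the helper names are hypothetical and the lookup table deliberately contains only the one code seen in this thread.

```python
from datetime import datetime, timedelta, timezone

# NTSTATUS names; only the code that appears in this report is listed.
NTSTATUS_NAMES = {0xC0000409: "STATUS_STACK_BUFFER_OVERRUN"}

# FILETIME counts 100 ns ticks from this instant.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime):
    """Convert a Windows FILETIME integer to a timezone-aware UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def parse_report(text):
    """Split 'Key: value' lines of an Application Error record into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Three lines copied from the report above.
SAMPLE_REPORT = """\
Exception code: 0xc0000409
Fault offset: 0x0000000000032e00
Faulting application start time: 0x01d078245c3175f7
"""

if __name__ == "__main__":
    report = parse_report(SAMPLE_REPORT)
    code = int(report["Exception code"], 16)
    print(NTSTATUS_NAMES.get(code, "unknown"))
    start = int(report["Faulting application start time"], 16)
    print(filetime_to_datetime(start).isoformat())
```

The status name alone does not identify the root cause, which is consistent with the later request in this thread for full memory dumps.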

@orangemocha

Hi @nmehlei ,

I have tracked down the cause of this issue and I have a fix for it. I will be publishing a new release for it asap (in the next day or two).

Apologies for taking so long. This wasn’t an easy one to track down and the investigation hit a couple of dead ends.

orangemocha added a commit that referenced this issue May 1, 2015
Fix for #167

RejoinCOWPages used to call QueryWorkingSetEx to figure out
which pages had been dirtied since the memory map was protected
with PAGE_WRITECOPY. But dirty pages that had been swapped out to
the system page file would be reported as not valid
(VirtualAttributes.Valid == 0) and so we wouldn't restore them into
the file map.
QueryWorkingSetEx only gives information about pages that are in the
working set at the time it is called. Pages can be forced into the
working set using VirtualLock, but that seems like a potentially
risky / expensive solution.
I implemented a solution that uses VirtualQuery to find out which
regions have changed protection from PAGE_WRITECOPY.
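
The difference between the two detection strategies in this commit message can be illustrated with a toy model. The Python sketch below does not call any Windows API; it is a simplified simulation under the assumption that a written copy-on-write page stops reporting PAGE_WRITECOPY, and that a swapped-out page is simply absent from the working set. All names in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    index: int
    protection: str       # "WRITECOPY" until first written, then "READWRITE"
    in_working_set: bool  # False once the page has been swapped out


def dirty_pages_via_working_set(pages):
    """Old approach (modelled): pages outside the working set are reported
    as not valid, so they are skipped even if they were written."""
    return [p.index for p in pages
            if p.in_working_set and p.protection != "WRITECOPY"]


def dirty_pages_via_protection(pages):
    """New approach (modelled): any page whose protection changed from
    WRITECOPY counts as dirty, whether or not it is resident."""
    return [p.index for p in pages if p.protection != "WRITECOPY"]


PAGES = [
    Page(0, "WRITECOPY", True),    # never written
    Page(1, "READWRITE", True),    # written, still resident
    Page(2, "READWRITE", False),   # written, then swapped out to the page file
]

if __name__ == "__main__":
    print(dirty_pages_via_working_set(PAGES))  # [1]
    print(dirty_pages_via_protection(PAGES))   # [1, 2]
```

In this model page 2 is the kind of page the commit message describes: it was dirtied but is missed by the working-set check, so its contents would not be restored into the file map.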
@orangemocha orangemocha mentioned this issue May 1, 2015
@orangemocha

Fixed in release 2.8.19.1

@bruceliu2008

Hello, I found a similar issue on 2.8.19 as well.
Here are the symptoms.
When I installed it on a Windows Server 2008 VM, it crashed a few times and I got this error:
"[4248] 05 May 22:48:18.497 # EndForkOperation: 0x000001e7 – RejoinCOWPages: MapViewOfFileEx failed. Please upgrade your OS to Win8 or newer.: unknown error"
After that we replaced the server with a new physical server with Windows Server 2012 installed.
But after running for a few hours, it crashed twice. The first time no error was logged; the second time the error was:
"Accepting client connection: accept: Unknown error"

May I know whether the new release fixes the above crash issues as well?

Thanks a lot,
Bruce

@nmehlei

nmehlei commented May 9, 2015

While the related issue of the failing AOF rewrite does indeed seem to be fixed, the main issue, the crash with error code 0xc0000409, is NOT fixed in 2.8.19.1. As there are multiple GitHub issues and users complaining about this, what is the roadmap and expected timetable for it?

And please reopen this ticket.

@bruceliu2008

Thanks for the reply.
May I know which version is the most stable?
Are there any crash issues with version 2.8.17.4?
Thanks a lot.

@orangemocha

@bruceliu2008 the issue you reported is a different one, and it's now tracked here: #242

Release 2.8.19.1 is the most stable version. But it's still affected by #242

@orangemocha

@nmehlei are you still experiencing crashes with error code 0xc0000409 with 2.8.19.1?
The issue you reported above (report id 895e2f76-e599-11e4-80bc-000d3a20bbd9) is the one that got fixed in 2.8.19.1.

@nmehlei

nmehlei commented May 11, 2015

@orangemocha Yes, we upgraded to 2.8.19.1 on Saturday morning and already experienced one crash (Saturday evening).

Details:


Application Error Faulting application name: redis-server.exe, version: 0.0.0.0, time stamp: 0x5547a2d5
Faulting module name: redis-server.exe, version: 0.0.0.0, time stamp: 0x5547a2d5
Exception code: 0xc0000409
Fault offset: 0x0000000000032de0
Faulting process id: 0xf9c
Faulting application start time: 0x01d08a4e9d574822
Faulting application path: C:\Program Files\Redis\redis-2.8.19.1\redis-server.exe
Faulting module path: C:\Program Files\Redis\redis-2.8.19.1\redis-server.exe
Report Id: f0e8be3a-f680-11e4-80bf-000d3a20bbd9
Faulting package full name:
Faulting package-relative application ID:


Looks very similar. Could this also be related to the known issue in https://github.com/MSOpenTech/redis#known-issues ?

@nmehlei

nmehlei commented May 11, 2015

We neither currently have process scanning software enabled nor can "RejoinCOWPages" be found anywhere in our logs, so I am pretty sure that at least my issue (and thus this ticket) is not the same as #242

@orangemocha

Could this also be related to the known issue in https://github.com/MSOpenTech/redis#known-issues ?

That known issue is the same as #242. And if "RejoinCOWPages" is not in your logs, we can rule that out.

I'll look at this new report id and get back to you asap.

@nmehlei

nmehlei commented May 11, 2015

Understood. Could you reopen this ticket then?

@orangemocha orangemocha reopened this May 11, 2015
@bruceliu2008

Can Redis Watcher be a workaround for this issue?
Is there any other way to fix or work around this issue?

@nmehlei

nmehlei commented May 12, 2015

@orangemocha As redis crashed 3 times this morning - with data loss - I'm now in a difficult position, possibly forced to migrate our storage servers to Linux to use the native redis binaries. Can you give me an estimate?

@bruceliu2008 Redis watcher could restart redis after the crash, but it would not prevent the outage itself or the data loss associated with it :/

@bruceliu2008

Thanks for the comment.
So from your point of view, what is the best way to deal with it until a more stable version is released?
We are using Redis in a production environment now.

@nmehlei

nmehlei commented May 12, 2015

Well...I'd be very interested in that answer myself. Currently I have none.
One might downgrade to an older version in the hope that it does not crash, but older versions don't have the bugfixes for AOF, so they are (at least for us) not really viable alternatives.

@bruceliu2008

Thanks.
Anyway, we started using Redis Watcher a few hours ago, and it did automatically restart the Redis server after it crashed.

@orangemocha

@nmehlei : I am still investigating. I can confirm that this is not the same issue that manifested itself after aof rewrite, so I will be opening a new issue.

The crash reports collected by Windows Error Reporting contain very limited information, and in this case they don't make it easy to determine the cause of the problem. Would it be possible for you to configure your machine to collect full memory dumps? The instructions are here: https://msdn.microsoft.com/en-us/library/windows/desktop/bb787181(v=vs.85).aspx . You can configure it for redis-server.exe only (the article explains how to do so).
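
As a rough illustration of the per-application setup that article describes, the Python sketch below only builds the `reg add` command lines for a `LocalDumps\redis-server.exe` key with DumpType set to 2 (full dump); it does not run them. The dump folder `C:\CrashDumps`, the dump count, and the function name are assumptions chosen for the example, and the commands would need to be run from an elevated prompt on the Windows machine.

```python
# Registry key under which Windows Error Reporting looks for local dump settings.
LOCAL_DUMPS_KEY = (
    r"HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"
)


def local_dump_commands(exe_name, dump_folder, dump_count=10):
    """Build reg.exe argument lists that enable full dumps for one executable."""
    key = LOCAL_DUMPS_KEY + "\\" + exe_name
    return [
        ["reg", "add", key, "/v", "DumpFolder", "/t", "REG_EXPAND_SZ",
         "/d", dump_folder, "/f"],
        ["reg", "add", key, "/v", "DumpCount", "/t", "REG_DWORD",
         "/d", str(dump_count), "/f"],
        # DumpType 2 requests a full dump rather than a minidump.
        ["reg", "add", key, "/v", "DumpType", "/t", "REG_DWORD",
         "/d", "2", "/f"],
    ]


if __name__ == "__main__":
    # The folder below is a hypothetical example path.
    for command in local_dump_commands("redis-server.exe", r"C:\CrashDumps"):
        print(command)
```

Each list could be passed to `subprocess.run` on the target machine; building the commands separately makes it easy to review them before changing the registry.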

@nmehlei

nmehlei commented May 13, 2015

@orangemocha Thanks. If you need any more information, I'm happy to assist. I changed the Windows Error Reporting settings, though I'm not sure whether a crash will occur in the next few days.
We have a release cycle of one month, during which usage increases every day until it peaks in the last few days, after which we reset our data and the cycle begins again. Our cycle ended yesterday (hence the frequent crashes, because of high usage) and now we're down to relatively low usage. Unfortunately, we cannot afford crashes that frequent in the last few days of this new cycle, so I might have to migrate to Linux before usage rises that high again if it's not fixed by then :/
So, like I said, if I can assist in any way, I'm here.

@orangemocha

Closing this issue. Opened: #244

@orangemocha

We just released 2.8.21 (first posted as 2.8.1 by mistake), which fixes many stability issues including the ones reported here.

@nmehlei

nmehlei commented Jun 25, 2015

I think you meant 2.8.21 ;)

@orangemocha

Yes :)

We just released 2.8.21, which fixes many stability issues including the ones reported here.
