roachtest: schemachange/index/tpcc/w=1000 failed #62320
Node 4 is dead. Why? I don't know.
Feels similar to #54934.
I wonder what it would take to go "meta" and run a roachtest where CRDB intentionally leaks memory and the job of the test is to verify traces of the OOM killer. Yes, desperate.
Ack, will give it a shot.
I wasn't suggesting that! But it is appreciated. You may want to crib the
In a repro, I don't think the machine was hosed. The process was just gone without a trace.
That's what it looks like in the above run too; I just figure that when you do this programmatically you'll end up with a broken VM some fraction of the time. But yeah, if it's a semi-manual experiment, it probably doesn't help too much.
This change unearths runtime out-of-memory errors. Before this change, a runtime-detected failure to allocate memory would not make it to the logs. This was tested by injecting a rapid memory leak into the process and verifying first that the failure was not present in the logs, and then, with this change, that the stack traces were present in the stderr log. I'm omitting a release note primarily because I do not know how to frame one. Relates to debugging cockroachdb#62320. Release note: None
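For concreteness, a leak injection of the sort described in the commit message could look roughly like the sketch below. This is a hypothetical illustration (the function name leakRapidly and the allocation sizes are made up for this example), not the code that was actually used:

```go
package main

import "time"

// leakRapidly grows the heap without bound so that the Go runtime's
// allocator eventually fails (or the kernel OOM killer intervenes),
// which is the failure mode the logging change is meant to surface.
func leakRapidly() {
	var sink [][]byte
	for {
		// Retain ~64 MiB per iteration so the process blows up quickly.
		sink = append(sink, make([]byte, 64<<20))
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	go leakRapidly()
	select {} // block until the leak kills the process
}
```

Whether the Go runtime prints its own out-of-memory stack traces to stderr or the kernel OOM killer terminates the process first depends on the limits in play, which is exactly why capturing stderr in the logs matters here.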
I'm inclined to take this off the schema board, as there isn't much evidence that this OOM lives in schema land. Happy to continue to provide assistance.
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 9e821aa456bfbe8b5ec3a8ef79ae3954d52cb675:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 3cd29547884feb24b150b67ca2a88073798a35fd:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 2d7f9a08bac49040a6d77c5465e2b4fd8fdac1f4:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
The first new failure (#62320 (comment)) seems to have happened after the CREATE INDEX completed:
The CRDB logs say "unhappy": nodes go in and out of liveness, Raft ready handling is slow, etc. The Pebble log on n1 ends with:
and that lines up with the end of the CRDB log, and likely with the time the node died. Here's the inuse_space:
Second one: #62320 (comment).
We're not getting anything for n5, which I think is the one hit hardest and browned out.
The third one, #62320 (comment), failed before statement 3 (CREATE INDEX) returned, with n4 dying. Not the OOM killer, but also nothing in the logs, unfortunately, despite #63472. Note that work is ongoing to improve our infra (#64177). I'm not seeing any clear offenders in inuse_space, but I'm also not sure what's expected. From the looks of it, we're past backfilling here.
There is currently a regression on master, #64214, which started April 21st. The first failure in this last burst is April 24th, which lines up. @RaduBerinde, could you (or someone on Queries) get 10 passing runs of this once the regression is fixed? If they still fail, report back here; otherwise, close. This is straightforward via the script in https://go.crdb.dev/roachstress.
@ajwerner says:
I started some runs with my fix for #64214 and I already got an "unexpected node event: 4: dead" failure.
I kicked off 10 runs of this test at the commit right before c3b1617 and 10 runs including c3b1617. CPU is very high in both cases; however, the baseline RAM usage is definitely higher with that change in (before: 7.5-9GB, after: 9-11GB), so that commit does somehow make things noticeably worse. Unfortunately, heap profiles don't provide many clues (they look basically identical with and without it). All other metrics also seem about the same.
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ d84625c42d11e38775b6a1ff3f4089e839fe3e82:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
I think I have a fix: I got 10 successful runs with e05af1c.
Interesting. I know you said in the commit message that you didn't know, but it feels like it would be really instructive if, after the fact, we could use hindsight to arrive at a way to catch this kind of problem by inspecting profiles next time.
I tried to find a microbenchmark in
The profiles have been quite unhelpful in this investigation. One thing that was kind of a clue was the fact
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 42e2e4265087751695afc05fc5ad8ec534fe0121:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
We should try to figure out which part of the fix is actually needed and why. For example, slices of pointers shouldn't help unless we're relying on some kind of aliasing.
Is that change in allocations fixed by e05af1c? I would be surprised; I can't see why that commit would reduce any allocations (except one small thing: we are now reusing
By the way, note that when we are reusing slices by slicing with
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ 684e753c15f3fc58df79b6ea70e7b6715eae4835:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
@RaduBerinde yeah, I think you're right that
is the problem. The following diff fixes the memory leak:
I am aware that slicing to
Yes, but only up to however many entries will be needed in the new instance. It's possible that occasionally the slices get way longer than normal.
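For readers following along, the slice-reuse pitfall under discussion is a standard Go one: re-slicing a buffer back to length zero keeps its backing array, and any pointers still stored in it, reachable. Below is a minimal, hypothetical illustration of the pattern and of a common mitigation (clearing entries before reuse); the type row and the variable buf are invented for this sketch and are not the actual CockroachDB code:

```go
package main

// row stands in for whatever element type the reused buffer holds.
type row struct {
	payload []byte
}

// buf is reused across batches.
var buf []*row

// resetLeaky re-slices to length zero. The backing array and every *row
// it still references stay reachable, so one unusually large batch pins
// its memory for the lifetime of the buffer.
func resetLeaky() {
	buf = buf[:0]
}

// resetClearing nils out stale entries first, letting the garbage
// collector reclaim the rows even though the backing array is kept.
func resetClearing() {
	for i := range buf {
		buf[i] = nil
	}
	buf = buf[:0]
}

func main() {
	_ = resetLeaky
	_ = resetClearing
}
```

A variant of the same fix is to drop the backing array entirely and allocate a fresh slice whenever its capacity has grown far beyond what a typical batch needs, which addresses the "occasionally way longer than normal" case mentioned above.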
So can we revisit why we didn't see this in profiles? If I am reading the analysis correctly, we see a difference in memory use on the order of gigabytes. That should certainly show up in heap profiles such as inuse_space, shouldn't it? This kind of problem is bound to occur again, and we need to be able to use the Go tooling to hunt it down proactively.
Hm, I thought that the diff above fixes the issue (my "reproduction" is running TPCC over a couple of minutes, restarting the cluster with different binaries, and looking at RAM usage; the fix above was sufficient to bring the usage down to what is expected), but
Hm, I think that the diff above (or 0100ba7) fixes the OOM scenario of this issue with
My reproduction was running TPCC1000 for 5 minutes with different binaries and observing the reported RAM usage. For example, here I alternated "before" (without 0100ba7) and "after" (with 0100ba7) twice: this clearly shows an increased RAM usage in the "before" case. I also collected the heap profiles with
Looking at the
In particular,
Additionally, I'm confused by the following on the "before" profile:
Why does the top part report 3.5MB while the bottom part reports over 100MB? They refer to the same code, so I'm quite confused by the profiles. Should I be taking profiles with some other parameters? Am I missing something?
Note that #64177 added a 95% hard memory limit to cockroach in roachprod. Maybe the test was just squeezing by before.
That would be explained if
Oh yeah, that must be it. Then I think we're good.
Indeed, silly me. I was too focused on a particular code path. I guess looking at
On the former, we see only "positive" changes over the 2-minute period (IIUC), whereas on the latter we see incremental changes.
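For future investigations of this kind, one workable recipe (a sketch under the assumption that profiles are snapshotted via the standard Go tooling; the helper writeHeapProfile and the file names are made up here) is to capture inuse_space before and after a load window and diff the two:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// writeHeapProfile forces a GC so that inuse_space reflects only live
// memory, then writes a heap profile to the given path.
func writeHeapProfile(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	runtime.GC()
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}

func main() {
	writeHeapProfile("heap_before.pb.gz")
	time.Sleep(2 * time.Minute) // run the workload in between
	writeHeapProfile("heap_after.pb.gz")
}
```

The two snapshots can then be compared with go tool pprof -sample_index=inuse_space -diff_base=heap_before.pb.gz heap_after.pb.gz, which highlights growth in live memory rather than cumulative allocations; alloc_space only ever increases over the window, which matches the "only positive changes" observation above.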
Hm, I am still very confused by this: those 100MB of unreleased converters come from
Maybe it's a similar problem -
Yeah, quite possibly; I added another commit auditing our
BTW, I meant my previous comment
as a rough summary of my understanding of how we could have investigated the memory leak if we didn't know which commit broke things or what the fix is.
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ dbf711a83969210bcf706d65f53f888856da6ad0:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
The last failure is interesting: n4 crashed with OOM during the check queries. I don't think I've seen a case like this before, but unfortunately the debug.zip collection from n4 wasn't successful. Anyway, we should fix the known problems, and likely the OOMs will go away.
roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on master @ f9dca40d3afb0a605eeee76bec4a07ff65d6fec0:

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches
(roachtest).schemachange/index/tpcc/w=1000 failed on master@893643b63ea0b1cfa4888c6b73b5c68a9c100c3a:
Artifacts: /schemachange/index/tpcc/w=1000