Resiliency can be improved when facing constrained-memory situations #5515
Comments
Unfortunately, there are also other places where we use this (line 1017 and line 617 in 06f10e7).
I think the fix here is to preallocate the next vector in tx_construct_user_buffer, somewhere around line 708 in 06f10e7.
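The reserve-early idea can be sketched generically. The sketch grows the vector at a point where an allocation failure can still be returned to the caller, so that a later append on a path that must not fail (such as commit) never needs to allocate. struct ptr_vec, its helpers and the fail_next_alloc fault-injection flag are hypothetical and simplified; they are not PMDK's vec.h macros or the actual tx_construct_user_buffer code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified growable array (not PMDK's vec.h). */
struct ptr_vec {
	void **buffer;
	size_t size;
	size_t capacity;
};

/* Hypothetical test hook: when set, the next allocation fails. */
static int fail_next_alloc;

static int
ptr_vec_reserve(struct ptr_vec *v, size_t ncap)
{
	if (ncap <= v->capacity)
		return 0;
	if (fail_next_alloc) {
		fail_next_alloc = 0;
		return -1;
	}
	void **nbuf = realloc(v->buffer, ncap * sizeof(*nbuf));
	if (nbuf == NULL)
		return -1; /* the old buffer is left untouched */
	v->buffer = nbuf;
	v->capacity = ncap;
	return 0;
}

/*
 * Called where errors can still be reported (e.g. when a user buffer is
 * registered): make sure one more element will fit later.
 */
static int
ptr_vec_reserve_next(struct ptr_vec *v)
{
	if (v->size + 1 <= v->capacity)
		return 0;
	size_t ncap = v->capacity ? v->capacity * 2 : 4;
	return ptr_vec_reserve(v, ncap);
}

/*
 * Called on a path that must not fail (e.g. commit): relies on the slot
 * reserved earlier, so it never allocates.
 */
static void
ptr_vec_push_reserved(struct ptr_vec *v, void *p)
{
	assert(v->size < v->capacity);
	v->buffer[v->size++] = p;
}
```

The design choice is to move the only fallible step (realloc) to the function that already has an error-return path, leaving the commit-time push infallible.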
Ah yes, I forgot to check whether out_init() is also executed in the context of each thread! I will try to find another place in the early steps of each thread.
So, should I at least fix this one this way?
I will have a look at these other "simple" cases and fix them the same way.
Sorry, I don't follow. Is this related to the very special case you mentioned for the "tx_commit path"?
Sorry for the late reply.
OK, after looking more closely at how our DAOS code uses PMDK, I understand that thread creation is not handled by PMDK, so I had better follow your idea ;-)
That includes the other occurrences you already pointed out.
Hmm, I have also tried to do something in this direction in PR-5537.
Environment Information
Please provide a reproduction of the bug:
Run DAOS servers as a regular (non-root) user with memory limits.
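The report does not say which mechanism imposed the memory limits. As one possible way to provoke the same kind of allocation failure in a small test program, the sketch below lowers RLIMIT_AS with setrlimit() in a forked child and checks that a large malloc() then returns NULL. The helper malloc_fails_under_limit() is hypothetical, and the behaviour assumes a glibc-style allocator without sanitizers (AddressSanitizer, for example, reserves large address ranges and may abort instead of returning NULL).

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper. The limit is applied in a forked child so it does
 * not affect the caller. Returns 1 if malloc(request) failed under the
 * limit, 0 if it succeeded, -1 if the experiment could not be run.
 */
static int
malloc_fails_under_limit(rlim_t limit_bytes, size_t request)
{
	pid_t pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: cap the address space, then try the allocation. */
		struct rlimit rl = {
			.rlim_cur = limit_bytes,
			.rlim_max = limit_bytes,
		};
		if (setrlimit(RLIMIT_AS, &rl) != 0)
			_exit(2);
		void *p = malloc(request);
		_exit(p == NULL ? 1 : 0);
	}
	/* Parent: collect the child's verdict. */
	int status;
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	int code = WEXITSTATUS(status);
	return code == 2 ? -1 : code;
}
```

For example, malloc_fails_under_limit((rlim_t)256 << 20, (size_t)1 << 30) asks for 1 GiB under a 256 MiB address-space cap, which is expected to fail.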
How often the bug is revealed (always, often, rare): rare
Actual behavior:
A silent abort() or SEGV, which wastes time on debugging.
Expected behavior:
Graceful handling of the allocation failure.
Details
1st case: abort() encountered with a stack trace like the following:
Further core-file analysis, along with a review of the associated PMDK/libpmemobj source code, indicates that the abort()/FATAL() comes from the error handling of the malloc() call in Last_errormsg_get():
Further source code analysis also indicates that this behaviour could be avoided if this allocation/initialisation were done upfront, during out_init(). What do you think?
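A minimal sketch of the eager-initialisation idea, on a simplified model of the per-thread error buffer: C11 _Thread_local stands in for the TLS key PMDK uses, and errormsg_thread_init(), errormsg_get(), errormsg_thread_fini() and the fail_next_alloc fault-injection flag are hypothetical names, not PMDK functions. Because the buffer is per-thread, an allocation done once in out_init() would only cover the thread that calls it; other threads still need an early init call of their own or a lazy path. The sketch therefore also makes the lazy path report failure (NULL) instead of calling abort().

```c
#include <assert.h>
#include <stdlib.h>

#define ERRMSG_MAX 256

/* Simplified per-thread error-message buffer (hypothetical layout). */
struct errormsg {
	char msg[ERRMSG_MAX];
};

/* One pointer per thread; each thread starts with NULL. */
static _Thread_local struct errormsg *thread_errormsg;

/* Hypothetical test hook: when set, the next allocation fails. */
static int fail_next_alloc;

static void *
test_malloc(size_t size)
{
	if (fail_next_alloc) {
		fail_next_alloc = 0;
		return NULL;
	}
	return malloc(size);
}

/*
 * Eager initialisation, meant to be called early in each thread.
 * Returns -1 on allocation failure instead of aborting.
 */
static int
errormsg_thread_init(void)
{
	if (thread_errormsg != NULL)
		return 0;
	struct errormsg *m = test_malloc(sizeof(*m));
	if (m == NULL)
		return -1;
	m->msg[0] = '\0'; /* start with an empty message */
	thread_errormsg = m;
	return 0;
}

/* Lazy fallback: initialises on first use, reports failure as NULL. */
static struct errormsg *
errormsg_get(void)
{
	if (thread_errormsg == NULL && errormsg_thread_init() != 0)
		return NULL;
	return thread_errormsg;
}

/* Releases the calling thread's buffer. */
static void
errormsg_thread_fini(void)
{
	free(thread_errormsg);
	thread_errormsg = NULL;
}
```

Callers of errormsg_get() then have to cope with NULL, for instance by skipping the message text, which is the trade-off for not aborting.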
2nd case: SEGV encountered with a stack trace like the following:
Further core-file analysis, along with a review of the associated PMDK/libpmemobj source code, indicates that the SEGV occurred because tx_action_add() returned an invalid action pointer (0xffffffffffffff80), which pmemobj_tx_xfree() then passed as the 3rd parameter to palloc_defer_free(). Looking further into the associated code (core/out.c, libpmemobj/palloc.c, common/mem.c, libpmemobj/tx.c, common/vec.h), this appears to be caused by tx->actions->buffer being NULL after an earlier Realloc() failure in vec_reserve().
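The specific value 0xffffffffffffff80 fits that explanation. Assuming VEC_BACK() designates buffer[size - 1], that size was 0 with buffer NULL, and that one action element is 128 bytes (the size the observed value implies; not checked against the sources here), the sketch reproduces the address with unsigned integer arithmetic. back_element_address() is a hypothetical helper; doing the same with a real NULL pointer would be undefined behaviour in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: the address that &buffer[size - 1] would designate
 * for elements of elem_size bytes, computed with unsigned integers because
 * indexing through a NULL pointer is undefined behaviour in C.
 */
static uintptr_t
back_element_address(uintptr_t buffer, size_t size, size_t elem_size)
{
	/* When size == 0, size - 1 wraps around to SIZE_MAX. */
	uintptr_t index = (uintptr_t)(size - 1);
	/* Unsigned multiplication and addition wrap modulo 2^N. */
	return buffer + index * (uintptr_t)elem_size;
}
```

With buffer 0, size 0 and 128-byte elements on a 64-bit target, the index is 2^64 - 1, and (2^64 - 1) * 128 modulo 2^64 is 2^64 - 128, i.e. 0xffffffffffffff80.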
For this issue, I believe it could be handled by checking for a negative return value from VEC_INC_BACK() in tx_action_add(), and having tx_action_add() return NULL in that case instead of an invalid pointer. What do you think?
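The proposed check can be sketched on a simplified model. struct action, struct action_vec, action_vec_inc_back() and action_add() are hypothetical stand-ins for struct pobj_action, tx->actions, VEC_INC_BACK() and tx_action_add(), and fail_next_alloc is a fault-injection flag for testing. As in the proposal, the growth step is assumed to return a negative value on failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pobj_action (real layout differs). */
struct action {
	int type;
	unsigned long data[15];
};

/* Hypothetical stand-in for the tx->actions vector. */
struct action_vec {
	struct action *buffer;
	size_t size;
	size_t capacity;
};

/* Hypothetical test hook: when set, the next allocation fails. */
static int fail_next_alloc;

/* Grow by one element: 0 on success, -1 on allocation failure. */
static int
action_vec_inc_back(struct action_vec *v)
{
	if (v->size == v->capacity) {
		size_t ncap = v->capacity ? v->capacity * 2 : 8;
		struct action *nbuf = NULL;
		if (!fail_next_alloc)
			nbuf = realloc(v->buffer, ncap * sizeof(*nbuf));
		fail_next_alloc = 0;
		if (nbuf == NULL)
			return -1; /* buffer, size and capacity unchanged */
		v->buffer = nbuf;
		v->capacity = ncap;
	}
	v->size++;
	return 0;
}

/*
 * Checked version of the "grow, then return the back element" pattern:
 * on failure it returns NULL instead of &buffer[size - 1].
 */
static struct action *
action_add(struct action_vec *v)
{
	if (action_vec_inc_back(v) < 0)
		return NULL;
	return &v->buffer[v->size - 1];
}
```

With such a check, the caller (pmemobj_tx_xfree() in the report) would receive NULL rather than a bogus pointer, and could fail the operation with an out-of-memory error instead of handing the pointer to palloc_defer_free(), assuming it tests for NULL.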
I would be happy to push a PR for both issues if you agree with my analysis; just let me know.
Additional information about Priority and Help Requested:
Are you willing to submit a pull request with a proposed change? Yes
Requested priority: Medium