[C++] Possible data race when reading metadata of a parquet file #40068
Comments
Well, I'm not sure why it would be a problem in the snippet above.
In this call, sure. But by the same reasoning the locking is not needed at all.
Hmm, then please post a corresponding TSAN report, this would help understand the issue.
Arrow is built from commit 3c655df
Thanks! So, trying to sum it up:
What seems weird is that two
One possible explanation would be if cc @westonpace
In any case, it would probably be good for
Dataset creation itself looks quite vanilla to me
However, when reading it I do this (for filtering columns and such)
I don't know if this is idiomatic enough
That looks reasonable to me.
Correct, this is a sort of lazy cache. Parquet datasets will cache parquet metadata the first time it is loaded. So to trigger this you would need to:
So it is possible that we have one thread testing if a shared_ptr is null at the same time another thread is assigning to it. Assigning to a shared_ptr is maybe one of those things that is probably atomic but not necessarily documented to be atomic. Either way, suppressing TSAN reports is a valid enough reason to fix this. The
… file (#40111)

### Rationale for this change

The `ParquetFileFragment` will cache the parquet metadata when loading it. The `metadata()` method accesses this metadata (a shared_ptr) but does not grab the lock used to set that shared_ptr. It's possible then that we are reading a shared_ptr at the same time some other thread is setting the shared_ptr which is technically (I think) undefined behavior.

### What changes are included in this PR?

Guard access to the metadata by grabbing the mutex first

### Are these changes tested?

Existing tests should regress this change

### Are there any user-facing changes?

No

* Closes: #40068

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
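For readers who want to see the shape of the fix, here is a minimal sketch of a lazily cached `shared_ptr` whose getter takes the same mutex as the setter. It is not the Arrow code: `FakeMetadata`, `LazyFragment`, `EnsureMetadataCached`, `load_count` and `RunConcurrently` are hypothetical names made up for illustration, and the "load" is just a hard-coded value. The C++ standard only guarantees thread safety for the shared control block, so an unsynchronized read and write of the same `shared_ptr` object counts as a data race, which is why the getter locks too.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the parquet file metadata.
struct FakeMetadata {
  int num_row_groups = 0;
};

// Hypothetical stand-in for a fragment that lazily caches its metadata.
class LazyFragment {
 public:
  // Loads the metadata on the first call; later calls return early.
  void EnsureMetadataCached() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (metadata_ != nullptr) return;
    auto loaded = std::make_shared<FakeMetadata>();
    loaded->num_row_groups = 3;  // pretend this was read from a file
    metadata_ = std::move(loaded);
    ++load_count_;
  }

  // Takes the same mutex as the writer and returns a copy of the pointer,
  // so the caller never touches metadata_ itself outside the lock.
  std::shared_ptr<const FakeMetadata> metadata() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return metadata_;
  }

  // Number of times the metadata was actually loaded.
  int load_count() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return load_count_;
  }

 private:
  mutable std::mutex mutex_;
  std::shared_ptr<const FakeMetadata> metadata_;
  int load_count_ = 0;
};

// Starts num_threads threads that each read, populate, and re-read the cache.
// Returns how many loads happened in total.
int RunConcurrently(LazyFragment& fragment, int num_threads) {
  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back([&fragment] {
      auto before = fragment.metadata();  // may or may not be null yet
      (void)before;
      fragment.EnsureMetadataCached();
      auto after = fragment.metadata();   // must be populated now
      assert(after != nullptr);
      (void)after;
    });
  }
  for (auto& t : threads) t.join();
  return fragment.load_count();
}
```

Returning the `shared_ptr` by value under the lock keeps the critical section short; callers then work on their own copy without holding the mutex.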
Describe the bug, including details regarding any error messages, version, and platform.
The first and the last line of this block of code access the same `metadata` variable but only one of them does so holding a lock. I assume this means the other one should too.
There are some other places in this file that access metadata in tricky ways (e.g. it is not clear from a first glance at a method whether nullptr is allowed or not). They could also race.
arrow/cpp/src/arrow/dataset/file_parquet.cc
Lines 607 to 618 in 0dbbd43
Component(s)
C++, Parquet