Correctness integration test for parquet filter pushdown #3976
Conversation
(force-pushed from 2ca7a31 to 1faf508)
```
@@ -263,251 +182,15 @@ fn gen_data(
    scale_factor: f32,
    page_size: Option<usize>,
    row_group_size: Option<usize>,
) -> Result<(ObjectStoreUrl, ObjectMeta)> {
```
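The real `gen_data` lives in the DataFusion test utilities and writes a Parquet file with configurable page and row-group sizes. As a rough, dependency-free illustration of the same shape (all names and behavior here are hypothetical, not the PR's actual code), a generator whose output size scales with a `scale_factor` and whose writes are chunked to mimic row-group boundaries might look like:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::{Path, PathBuf};

// Hypothetical, simplified stand-in for the PR's `gen_data` helper.
// The real function writes Parquet and returns an
// `ObjectStoreUrl`/`ObjectMeta` pair; this sketch writes CSV and
// returns the path plus the number of rows written.
fn gen_data(
    dir: &Path,
    scale_factor: f32,
    row_group_size: Option<usize>,
) -> std::io::Result<(PathBuf, usize)> {
    let n_rows = (1000.0 * scale_factor) as usize;
    let path = dir.join("data.csv");
    let mut w = BufWriter::new(File::create(&path)?);
    // Chunk writes to mimic row-group boundaries in the real helper.
    let chunk = row_group_size.unwrap_or(n_rows.max(1));
    for start in (0..n_rows).step_by(chunk) {
        for i in start..(start + chunk).min(n_rows) {
            writeln!(w, "{},{}", i, i % 10)?;
        }
        w.flush()?; // flush at each "row group" boundary
    }
    Ok((path, n_rows))
}
```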
All this code was moved into parquet-test-utils/src/lib.rs so I could reuse it.
(force-pushed from 3ab17f3 to 43af7a9)
(force-pushed from 653a8ef to 644bdf5)
OK, I am pretty happy with the state of pushdown predicates with this test -- I think we are go for launch.
(force-pushed from 3176d02 to 585c85b)
(force-pushed from 585c85b to c1b42f2)
```
        .await
}

#[cfg(not(target_family = "windows"))]
```
I am skipping Windows for now as these tests fail with some sort of path issue I don't have time to debug. Since there is no logic difference between Linux/Windows/Mac, there is no reason this also needs to run on Windows.
Example failure:
https://github.com/apache/arrow-datafusion/actions/runs/3369837748/jobs/5589984001
```
---- basic_conjunction stdout ----
Writing test data to "C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\.tmpDGlfus\\data.parquet"
Generated test dataset with 53819 rows
thread 'basic_conjunction' panicked at 'Error writing data: IO error: The filename, directory name, or volume label syntax is incorrect. (os error 123)', datafusion\core\tests\parquet_filter_pushdown.rs:66:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
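The `#[cfg(not(target_family = "windows"))]` gate above uses Rust's conditional compilation. A minimal sketch of the same pattern (the functions here are illustrative, not part of the PR): `cfg` removes an item at compile time, while the `cfg!` macro evaluates to a compile-time boolean usable in ordinary code.

```rust
// Gate an item per target family, as the PR does for the pushdown tests.
#[cfg(not(target_family = "windows"))]
fn path_separator() -> char {
    '/'
}

#[cfg(target_family = "windows")]
fn path_separator() -> char {
    '\\'
}

// `cfg!` is handy when you want a runtime-visible flag instead of
// conditionally compiled items.
fn is_windows() -> bool {
    cfg!(target_family = "windows")
}
```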
This is finally ready for review! cc @Ted-Jiang @tustvold and @thinkharderdev
I'm approving, but I think we should fix the temporary file handling prior to merge. I already have severe memory issues running the DataFusion tests on my laptop, leaking large files will make this exponentially worse.
Perhaps we could merge the cases into a single test to avoid needing to use lazy_static?
```rust
// Only create the parquet file once as it is fairly large
lazy_static! {
    static ref TEMPDIR: TempDir = TempDir::new().unwrap();
```
I could be mistaken, but I believe static destructors are not run, so this will leak the file. This is likely not ideal, particularly on Linux machines where /tmp is typically backed by memory.
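The concern is that `Drop` never runs for `'static` items, so a `TempDir` held in a `lazy_static!` never deletes its directory. A std-only sketch of the difference (the `TempFile` guard and file names are hypothetical; `Box::leak` stands in for a `static` item):

```rust
use std::fs::{self, File};
use std::path::PathBuf;

// A guard that deletes its file on drop, like `tempfile::TempDir`
// deletes its directory.
struct TempFile {
    path: PathBuf,
}

impl Drop for TempFile {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path);
    }
}

fn make_temp(name: &str) -> TempFile {
    let path = std::env::temp_dir().join(name);
    File::create(&path).unwrap();
    TempFile { path }
}

fn demo() -> (bool, bool) {
    // Scoped value: Drop runs, the file is removed.
    let scoped = make_temp("scoped_demo.tmp");
    let scoped_path = scoped.path.clone();
    drop(scoped);
    let scoped_gone = !scoped_path.exists();

    // `Box::leak` gives the value a 'static lifetime, mimicking a
    // `lazy_static!` item: its destructor never runs, so the file
    // would survive the process unless deleted out of band.
    let leaked: &'static TempFile = Box::leak(Box::new(make_temp("leaked_demo.tmp")));
    let leaked_exists = leaked.path.exists();
    // Clean up manually so this sketch does not itself litter /tmp.
    let _ = fs::remove_file(&leaked.path);
    (scoped_gone, leaked_exists)
}
```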
```rust
    assert_eq!(no_pushdown, pushdown_and_reordering);

    // page index filtering is not correct:
```
I think this can be enabled now?
I agree -- done in #4062
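The `assert_eq!(no_pushdown, pushdown_and_reordering)` pattern above is a differential test: run the same predicate through an unoptimized and an optimized path and require identical results. A self-contained sketch of the idea, comparing a naive scan against a scan that prunes whole "row groups" by min/max statistics (all names here are illustrative, not DataFusion APIs):

```rust
// Naive path: scan every value and apply the range predicate.
fn scan_naive(groups: &[Vec<i64>], lo: i64, hi: i64) -> Vec<i64> {
    groups
        .iter()
        .flatten()
        .copied()
        .filter(|v| (lo..=hi).contains(v))
        .collect()
}

// "Optimized" path: skip whole groups whose min/max stats cannot
// match (mimicking Parquet row-group pruning), then filter survivors.
fn scan_pruned(groups: &[Vec<i64>], lo: i64, hi: i64) -> Vec<i64> {
    let mut out = Vec::new();
    for g in groups {
        let (min, max) = match (g.iter().min(), g.iter().max()) {
            (Some(&min), Some(&max)) => (min, max),
            _ => continue, // empty group
        };
        if max < lo || min > hi {
            continue; // stats prove no row can match
        }
        out.extend(g.iter().copied().filter(|v| (lo..=hi).contains(v)));
    }
    out
}
```

A differential test then asserts both paths agree on the same input, which is exactly what catches pruning bugs without needing a hand-written expected answer.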
I think you are correct. I added be746ac to drop the temp file while still only creating it once. Can you give it a look? I tested this code locally: you can see the file is created once and then cleaned up afterwards:

```
$ ls /var/folders/s3/h5hgj43j0bv83shtmz_t_w400000gn/T/.tmpeaNruZ/data.parquet
ls: /var/folders/s3/h5hgj43j0bv83shtmz_t_w400000gn/T/.tmpeaNruZ/data.parquet: No such file or directory
```

Full logs:
The use of the weak pointer is effectively making an assumption about test execution order, namely that more than one test is executed in parallel and tests are run in module order. I'm not hugely comfortable about this. What is the hesitation to just bundle them into a single test? We don't really need the additional runner parallelism (on my machine at least I have to throttle the parallelism to avoid running out of memory).
For me it was a matter of running the tests more quickly when there is more than one core available -- it takes a while to run them serially, and running them concurrently made it faster. I can make a single (single-threaded) test if you prefer, but that seems like it would reduce the speed at which the CI will run 🤔
Right -- the weak pointer is an optimization for this case, to avoid recreating the same file multiple times.
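The caching pattern under discussion can be sketched with std alone (the cached `Vec<u8>` stands in for the large parquet file; names are hypothetical): a `Weak` in a static lets concurrently running tests share one expensive value while any of them holds it, but rebuilds it if all holders have dropped -- which is also where the test-ordering assumption comes from.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, Weak};

// Shared cache: a Weak reference that does not keep the value alive.
static CACHE: Mutex<Option<Weak<Vec<u8>>>> = Mutex::new(None);
// Count how many times the expensive value was (re)built.
static BUILDS: AtomicUsize = AtomicUsize::new(0);

fn expensive_build() -> Vec<u8> {
    BUILDS.fetch_add(1, Ordering::SeqCst);
    vec![0u8; 1024] // stand-in for writing the large parquet file
}

fn get_or_create() -> Arc<Vec<u8>> {
    let mut guard = CACHE.lock().unwrap();
    if let Some(weak) = guard.as_ref() {
        if let Some(strong) = weak.upgrade() {
            return strong; // still alive somewhere: reuse it
        }
    }
    // Nobody holds it (or it was never built): build and cache a Weak.
    let fresh = Arc::new(expensive_build());
    *guard = Some(Arc::downgrade(&fresh));
    fresh
}
```

If tests overlap, every caller upgrades the same `Arc` and the build runs once; if they run strictly one after another, each test finds the `Weak` dead and rebuilds, losing the optimization.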
I sincerely doubt that it will be the tall pole when running tests, especially when compilation is so slow. I'd vote for simple first, and if it is an issue then try to be clever.
Right -- I think it is on the order of a few more seconds on my machine. Given you feel so strongly about it, I will change the test to be serial.
😭 Running these tests serially takes more than 3x longer on my laptop. Oh well, it is the price of progress I suppose. We can optimize the tests later if/when the people writing them get frustrated with the lack of time.

Multi-threaded time on my laptop:

Single-threaded time:
Benchmark runs are scheduled for baseline = 60f3ef6 and contender = 695cedc. 695cedc is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
* parquet filter pushdown correctness tests
* Do not run tests on windows
* Drop shared file after tests are over
* Rework to be single threaded
Which issue does this PR close?
Part of #3463
Rationale for this change
I want to avoid tricky wrong results bugs related to applying predicates during the parquet scans. It also found at least four bugs (see list below).
What changes are included in this PR?
* `parquet_predicate_pushdown` integration test
* `parquet_predicate_pushdown` performance test

In case you are curious, and want to play around, here is the generated parquet file: data.zip
This test creates a parquet file and then
Are there any user-facing changes?
No
Future items
I will file a ticket for some additional testing
Draft until:
* `parquet_predicate_pushdown` benchmark
* `parquet-test-util` crate #4042