[FEA] dynamic sizing of input data #4164
The file schema is already known, fetched by Spark as part of planning and validating the query, and available via the …
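The comment above is cut off in the source, but assuming the schema is surfaced as a Spark `StructType`, a planning-time row-width estimate from it could be as simple as this sketch (the helper name is ours, not from the plugin):

```scala
import org.apache.spark.sql.types.StructType

// Rough per-row byte estimate from a Spark schema, using each data type's
// built-in defaultSize. Variable-length types such as strings are only
// approximated, so treat this as a planning-time heuristic.
def estimatedRowBytes(schema: StructType): Long =
  schema.fields.map(_.dataType.defaultSize.toLong).sum
```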
I would start off with the simplest possible approach and see how far that gets us, and only make it more complicated if we run into real-world situations where it is too far off for us to get the benefit we want. The simplest approach I can think of is to keep targeting …
I have been working on this for a while, and there is no simple way to do what we want with only the information we have ahead of time. I tried a few heuristics based on information that we currently have access to through Spark when creating the splits. I mainly looked at the number and types of columns to try and increase the maximum batch size config per-input, so that the batches would get larger and take better advantage of the sub-linear scaling of the GPU. This worked well for a few extreme cases, NDS queries 9, 7, and 88, but not as well in the general case. There are a number of problems with the initial approach that I took.
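For illustration, that kind of column-based heuristic might look like the following sketch. The function name, the clamping bound, and the use of `defaultSize` as a row-width proxy are assumptions for the example, not the actual patch:

```scala
import org.apache.spark.sql.types.StructType

// Sketch of the heuristic described above: widen the target batch size when
// the read schema is narrow relative to the file schema, so decoded batches
// stay large enough to benefit from the GPU's sub-linear scaling.
def scaledBatchSizeBytes(
    baseBatchSizeBytes: Long,
    fileSchema: StructType,
    readSchema: StructType): Long = {
  // defaultSize is only a rough estimate, especially for strings.
  def rowBytes(s: StructType): Long =
    math.max(1L, s.fields.map(_.dataType.defaultSize.toLong).sum)
  // If we only read a fraction of each row, we can afford a larger batch.
  val scale = rowBytes(fileSchema).toDouble / rowBytes(readSchema).toDouble
  // Clamp so an extreme schema difference cannot blow up the batch size.
  math.min((baseBatchSizeBytes * scale).toLong, baseBatchSizeBytes * 8)
}
```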
Despite all of this I have hope that this can be useful, and we should look into it more. Even this imperfect code saved over 3,400 seconds of GPU compute time from decoding parquet. That is about 27.8% of the total compute time used to parse the parquet data on the GPU, and about 1.7% of the total run time of all of the queries, assuming that the computation could be spread evenly among all of the tasks. So there is potential here to make a decent improvement, but it has not worked out perfectly.

We are looking at ways to improve the decoding performance of the parquet kernels themselves. In addition to this we might want to look at more of a control-system approach instead. We lack information up front, and it looks to be expensive to try to get that information early on. It might be better to dynamically adjust the sizes of the batches we buffer/send to the GPU based on the throughput rates we are able to achieve while buffering and guesses about how much data the GPU can process efficiently. But we first need to do a fair amount of benchmarking to understand what really impacts the performance of the buffering and the performance of decode. We also would need to look at how AQE will impact downstream processing. …
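A minimal sketch of that control-system idea, assuming a per-task controller that is told how long each batch took to decode (all names and constants here are illustrative, not a proposed API):

```scala
// Adjust the target batch size between batches based on observed decode
// throughput: grow while throughput keeps improving, back off once it stops.
class BatchSizeController(
    initialTargetBytes: Long,
    minBytes: Long,
    maxBytes: Long) {
  private var targetBytes = initialTargetBytes
  private var lastBytesPerSec = 0.0

  def currentTarget: Long = targetBytes

  // Called after each batch with the bytes decoded and the time it took.
  def record(decodedBytes: Long, elapsedNanos: Long): Unit = {
    val bytesPerSec = decodedBytes * 1e9 / math.max(1L, elapsedNanos)
    if (bytesPerSec > lastBytesPerSec * 1.05) {
      targetBytes = math.min(maxBytes, (targetBytes * 1.25).toLong)
    } else if (bytesPerSec < lastBytesPerSec * 0.95) {
      targetBytes = math.max(minBytes, (targetBytes * 0.8).toLong)
    }
    lastBytesPerSec = bytesPerSec
  }
}
```

A multiplicative increase/decrease rule like this is simple and self-correcting, but the growth and back-off factors would have to come out of the benchmarking mentioned above.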
This is something that might be interesting to push back to Spark too. When we are trying to figure out the splits for reading, can we look at the read schema and the file schema to get an idea of how much data is going to be thrown out during a read, so we can better size the splits? I can see a few options for this.
One option is to look at a small amount of data (1 or 2 files at most) to try to determine the file schema. The other option is to launch a job whose only purpose is to read the file schemas, to get exact knowledge about what is happening. Option 1 is nice because it can be fast and probably gives us a decent estimate of what things will look like. Option 2 fits better with systems like Delta Lake that read and cache metadata before running a much larger query. We might even be able to switch between the two based on the number of files, or the number of directories, involved in a single query.
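Option 1 could be as cheap as reading a footer or two with the standard parquet-hadoop API. A hedged sketch, with error handling elided and the two-file sample size as an arbitrary choice:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.parquet.schema.MessageType

// Sample the footers of at most two files to learn the file schema before
// planning splits.
def sampleFileSchemas(paths: Seq[Path], conf: Configuration): Seq[MessageType] =
  paths.take(2).map { p =>
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(p, conf))
    try reader.getFooter.getFileMetaData.getSchema
    finally reader.close()
  }
```

Option 2 would wrap the same footer read in a small Spark job over all of the files so the schemas are known exactly, much like Delta Lake's metadata handling, at the cost of an extra job before the real query runs.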