Use external metadata to improve filter/cache performance #4
Labels

- inframundo
- intake: Intake data catalogs
- metadata: Data about our liberated data
- parquet: Apache Parquet is an open columnar data file format.
- performance: Make data go faster by using less memory, disk, network, compute, etc.
Maybe this shouldn't be surprising, but when you query the whole collection of Parquet files with caching on, they all get downloaded, even if you're only reading data from a few of them. As things stand, you still need to access the metadata inside each Parquet file to figure out which ones contain the data you're looking for.

This defeats some of the purpose of caching, since the first time you do a query/filter you have to wait 10+ minutes for everything to download. This probably wouldn't be an issue on cloud resources with 1-10 Gb of network bandwidth, but it's a pain on our home network connections.
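For concreteness, this is roughly the kind of filtered read being described; the bucket path and filter columns are placeholders, and reading over `gcs://` assumes `gcsfs` is installed:

```python
import pandas as pd

# With caching on, a filtered read like this still downloads every Parquet file
# in the directory, because the row-group statistics needed to decide which
# files match the filter live inside each file's footer.
epacems = pd.read_parquet(
    "gcs://example-bucket/hourly_emissions_epacems/",  # placeholder path
    filters=[("year", "=", 2020), ("state", "=", "CO")],  # placeholder filter
)
```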
It looks like pyarrow supports `_metadata` sidecar files, which would hopefully speed up scanning the whole dataset considerably. But it also looks like that support is tied to writing out a PyArrow dataset, rather than just a collection of files with the same schema in the same directory (which means all the columns are in all the files, and the schema applies simply to all of them without needing to munge around in the partitioning columns).
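For reference, the pattern from the PyArrow docs looks roughly like this: collect per-file metadata while writing the dataset, then emit the sidecar files afterward. The paths, columns, and data below are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2019, 2020], "so2_mass_lbs": [1.0, 2.0]})  # stand-in data

# Write the dataset, collecting FileMetaData for every file written.
collector = []
pq.write_to_dataset(table, root_path="epacems", metadata_collector=collector)

# Schema-only sidecar.
pq.write_metadata(table.schema, "epacems/_common_metadata")
# Sidecar with row-group statistics for all of the files written above.
pq.write_metadata(table.schema, "epacems/_metadata", metadata_collector=collector)
```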
So far as I can tell, writing `pandas_metadata` into the Parquet files (see #7) also requires using `df.to_parquet()` rather than a `ParquetWriter` directly or other methods for writing the dataframes out to Parquet files, which is frustrating. Could we write the files out individually with `df.to_parquet()`, using the same schema for all of them, and then generate the metadata sidecar file after the fact?
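If that does work, a minimal sketch of generating the sidecar after the fact might look like this, reading the footer metadata back out of the already-written files with `pq.read_metadata()`. The file names are placeholders, and this assumes all the files really do share one schema:

```python
import pyarrow.parquet as pq

# Files written individually (e.g. with df.to_parquet()) into one directory.
files = ["epacems/epacems-2019.parquet", "epacems/epacems-2020.parquet"]

collector = []
for path in files:
    md = pq.read_metadata(path)
    # Store paths relative to the dataset root so readers can locate each file.
    md.set_file_path(path.removeprefix("epacems/"))
    collector.append(md)

# All files share a schema, so grab it from the first one.
schema = pq.read_schema(files[0])
pq.write_metadata(schema, "epacems/_metadata", metadata_collector=collector)
```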
**Using `pd.read_parquet()`**
When using `pd.read_parquet()`, reading data from a collection of remote Parquet files over the `gcs://` protocol takes twice as long as reading from a single Parquet file, but no similar slowdown occurs locally. In the `%%time` output, the user time does double locally for the partitioned data, but the elapsed time doesn't. Is it working with multiple threads locally, but only a single thread remotely?
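A rough stand-in for those `%%time` cells, outside of Jupyter, could compare wall-clock and CPU time directly. The paths are placeholders and the remote reads assume `gcsfs`:

```python
import time

import pandas as pd


def timed_read(path):
    """Print wall-clock and CPU time for a pd.read_parquet() call."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    df = pd.read_parquet(path)
    print(f"{path}: wall={time.perf_counter() - wall0:.1f}s "
          f"cpu={time.process_time() - cpu0:.1f}s")
    return df


single_remote = timed_read("gcs://example-bucket/epacems-2020.parquet")
multi_remote = timed_read("gcs://example-bucket/epacems/")
single_local = timed_read("local/epacems-2020.parquet")
multi_local = timed_read("local/epacems/")
```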
**Using `intake_parquet`**
Even ignoring the close to 12 minutes of apparent network transfer time, the same query only took 25 seconds with `pd.read_parquet()`, while here it took 3 minutes. I really need to be able to toggle caching on and off before I can experiment here.
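In the meantime, one way to compare cached vs. uncached reads without going through intake's own cache machinery would be fsspec's `simplecache::` protocol. This is only a sketch with placeholder paths, not how `intake_parquet` itself toggles caching:

```python
import fsspec
import pandas as pd

REMOTE = "gs://example-bucket/epacems/epacems-2020.parquet"  # placeholder path

# Uncached: stream straight from GCS every time (requires gcsfs).
with fsspec.open(REMOTE, "rb") as f:
    df_nocache = pd.read_parquet(f)

# Cached: simplecache keeps a local copy after the first read, so repeated
# experiments skip the network transfer.
with fsspec.open(
    f"simplecache::{REMOTE}", "rb",
    simplecache={"cache_storage": "/tmp/parquet-cache"},
) as f:
    df_cached = pd.read_parquet(f)
```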