Use pyarrow.Table for handling of dependencies #356

Closed · wants to merge 29 commits
Conversation

@hagenw (Member) commented Feb 1, 2024

The main goal of this pull request is to speed up loading, saving, and parsing of the dependency table.

To achieve this, it switches to pyarrow.Table to represent the dependencies.

Benchmark loading and saving dependency files

Reading a dependency file with 1,000,000 entries from CSV, pickle, or parquet

| Destination                         | CSV    | pickle | parquet |
| ----------------------------------- | ------ | ------ | ------- |
| `pandas.DataFrame`                  | 1.15 s | 0.19 s | 0.37 s  |
| `pyarrow.Table -> pandas.DataFrame` | 0.41 s |        | 0.39 s  |
| `pyarrow.Table`                     | 0.05 s |        | 0.05 s  |
| `pandas.DataFrame -> pyarrow.Table` |        |        | 0.47 s  |

Writing a dependency file with 1,000,000 entries to CSV, pickle, or parquet

| Origin                              | CSV    | pickle | parquet |
| ----------------------------------- | ------ | ------ | ------- |
| `pandas.DataFrame`                  | 1.96 s | 0.70 s | 0.47 s  |
| `pandas.DataFrame -> pyarrow.Table` | 0.47 s |        | 0.77 s  |
| `pyarrow.Table`                     | 0.25 s |        | 0.24 s  |

Conclusions

  • pyarrow.Table should be used when reading/writing CSV files
  • the fastest solution would be to represent dependencies as pyarrow.Table instead of pandas.DataFrame

Benchmarking single methods

| Method                                        | pyarrow.Table | pandas.DataFrame |
| --------------------------------------------- | ------------- | ---------------- |
| `Dependency.__call__()`                       | 0.315 s       | 0.000 s          |
| `Dependency.__contains__()`                   | 0.001 s       | 0.000 s          |
| `Dependency.__get_item__()`                   | 0.001 s       | 0.000 s          |
| `Dependency.__len__()`                        | 0.000 s       | 0.000 s          |
| `Dependency.__str__()`                        | 0.006 s       | 0.006 s          |
| `Dependency.archives`                         | 0.124 s       | 0.413 s          |
| `Dependency.attachments`                      | 0.019 s       | 0.021 s          |
| `Dependency.attachment_ids`                   | 0.022 s       | 0.022 s          |
| `Dependency.files`                            | 0.039 s       | 0.029 s          |
| `Dependency.media`                            | 0.090 s       | 0.094 s          |
| `Dependency.removed_media`                    | 0.097 s       | 0.092 s          |
| `Dependency.table_ids`                        | 0.022 s       | 0.030 s          |
| `Dependency.tables`                           | 0.018 s       | 0.021 s          |
| `Dependency.archive(1000 files)`              | 0.884 s       | 0.005 s          |
| `Dependency.bit_depth(1000 files)`            | 1.044 s       | 0.004 s          |
| `Dependency.channels(1000 files)`             | 1.018 s       | 0.004 s          |
| `Dependency.checksum(1000 files)`             | 0.963 s       | 0.004 s          |
| `Dependency.duration(1000 files)`             | 1.299 s       | 0.004 s          |
| `Dependency.format(1000 files)`               | 1.037 s       | 0.004 s          |
| `Dependency.removed(1000 files)`              | 1.507 s       | 0.004 s          |
| `Dependency.sampling_rate(1000 files)`        | 1.116 s       | 0.004 s          |
| `Dependency.type(1000 files)`                 | 1.271 s       | 0.004 s          |
| `Dependency.version(1000 files)`              | 0.886 s       | 0.004 s          |
| `Dependency._add_attachment()`                | 0.090 s       | 0.073 s          |
| `Dependency._add_media(1000 files)`           | 0.044 s       | 0.068 s          |
| `Dependency._add_meta()`                      | 0.112 s       | 0.118 s          |
| `Dependency._drop()`                          | 0.026 s       | 0.209 s          |
| `Dependency._remove()`                        | 0.057 s       | 0.062 s          |
| `Dependency._update_media()`                  | 0.103 s       | 0.064 s          |
| `Dependency._update_media_version(1000 files)`| 1.043 s       | 0.008 s          |

Conclusion

Using pyarrow.Table (or a polars.DataFrame) is faster for certain column-based operations, but it is far too slow when addressing single rows. So we should not use it, but stay with pandas.DataFrame to represent the dependency table.

@hagenw (Member, Author) commented May 3, 2024

We decided against storing the dependency table internally as pyarrow.Table and opted instead for pandas.DataFrame, using pyarrow.Table only as an intermediate representation when reading/writing a file, see #372.

@hagenw hagenw closed this May 3, 2024
@hagenw hagenw mentioned this pull request May 30, 2024