Use pyarrow.Table for handling of dependencies #356

Closed · wants to merge 29 commits
Conversation

@hagenw (Member) commented Feb 1, 2024

The main goal of this pull request is to speed up loading, saving, and parsing of the dependency table.

To achieve this, it switches to pyarrow.Table to represent the dependencies.

Benchmark loading and saving dependency files

Reading a dependency file with 1,000,000 entries from CSV, pickle, or parquet

| Destination                         | CSV    | pickle | parquet |
| ----------------------------------- | ------ | ------ | ------- |
| `pandas.DataFrame`                  | 1.15 s | 0.19 s | 0.37 s  |
| `pyarrow.Table -> pandas.DataFrame` | 0.41 s |        | 0.39 s  |
| `pyarrow.Table`                     | 0.05 s |        | 0.05 s  |
| `pandas.DataFrame -> pyarrow.Table` |        |        | 0.47 s  |

Writing a dependency file with 1,000,000 entries to CSV, pickle, or parquet

| Origin                              | CSV    | pickle | parquet |
| ----------------------------------- | ------ | ------ | ------- |
| `pandas.DataFrame`                  | 1.96 s | 0.70 s | 0.47 s  |
| `pandas.DataFrame -> pyarrow.Table` | 0.47 s |        | 0.77 s  |
| `pyarrow.Table`                     | 0.25 s |        | 0.24 s  |

Conclusions

  • pyarrow.Table should be used when reading/writing CSV files
  • the fastest solution would be to represent dependencies as pyarrow.Table instead of pandas.DataFrame

Benchmarking single methods

| Method                                        | pyarrow.Table | pandas.DataFrame |
| --------------------------------------------- | ------------- | ---------------- |
| `Dependency.__call__()`                       | 0.315 s       | 0.000 s          |
| `Dependency.__contains__()`                   | 0.001 s       | 0.000 s          |
| `Dependency.__get_item__()`                   | 0.001 s       | 0.000 s          |
| `Dependency.__len__()`                        | 0.000 s       | 0.000 s          |
| `Dependency.__str__()`                        | 0.006 s       | 0.006 s          |
| `Dependency.archives`                         | 0.124 s       | 0.413 s          |
| `Dependency.attachments`                      | 0.019 s       | 0.021 s          |
| `Dependency.attachment_ids`                   | 0.022 s       | 0.022 s          |
| `Dependency.files`                            | 0.039 s       | 0.029 s          |
| `Dependency.media`                            | 0.090 s       | 0.094 s          |
| `Dependency.removed_media`                    | 0.097 s       | 0.092 s          |
| `Dependency.table_ids`                        | 0.022 s       | 0.030 s          |
| `Dependency.tables`                           | 0.018 s       | 0.021 s          |
| `Dependency.archive(1000 files)`              | 0.884 s       | 0.005 s          |
| `Dependency.bit_depth(1000 files)`            | 1.044 s       | 0.004 s          |
| `Dependency.channels(1000 files)`             | 1.018 s       | 0.004 s          |
| `Dependency.checksum(1000 files)`             | 0.963 s       | 0.004 s          |
| `Dependency.duration(1000 files)`             | 1.299 s       | 0.004 s          |
| `Dependency.format(1000 files)`               | 1.037 s       | 0.004 s          |
| `Dependency.removed(1000 files)`              | 1.507 s       | 0.004 s          |
| `Dependency.sampling_rate(1000 files)`        | 1.116 s       | 0.004 s          |
| `Dependency.type(1000 files)`                 | 1.271 s       | 0.004 s          |
| `Dependency.version(1000 files)`              | 0.886 s       | 0.004 s          |
| `Dependency._add_attachment()`                | 0.090 s       | 0.073 s          |
| `Dependency._add_media(1000 files)`           | 0.044 s       | 0.068 s          |
| `Dependency._add_meta()`                      | 0.112 s       | 0.118 s          |
| `Dependency._drop()`                          | 0.026 s       | 0.209 s          |
| `Dependency._remove()`                        | 0.057 s       | 0.062 s          |
| `Dependency._update_media()`                  | 0.103 s       | 0.064 s          |
| `Dependency._update_media_version(1000 files)`| 1.043 s       | 0.008 s          |

Conclusion

Using pyarrow.Table (or a polars.DataFrame) is faster for certain column-based operations, but it is far too slow when addressing single rows. So we should not use it, but stay with pandas.DataFrame to represent the dependency table.

@hagenw (Member, Author) commented May 3, 2024

We decided against storing the dependency table internally as pyarrow.Table and opted instead for pandas.DataFrame, using pyarrow.Table only as an intermediate representation when reading/writing a file, see #372.

@hagenw hagenw closed this May 3, 2024
@hagenw hagenw mentioned this pull request May 30, 2024