
Support an additional file format for tables #376

Closed
hagenw opened this issue Apr 28, 2023 · 15 comments · Fixed by #419
Labels
enhancement New feature or request

Comments

@hagenw
Member

hagenw commented Apr 28, 2023

One of the biggest limitations we are currently facing stems from our use of the CSV format to store tables.

There are strong arguments in favour of CSV, as it is widely supported and human readable. I would not propose to replace it, but to additionally allow tables to be stored in a different format.

The additional format should support storing array data like embeddings in its columns, and it should support reading only parts of a table into memory, so that database size is not limited by memory.

I don't see too much overhead in implementing it. We could add an argument to audformat.Table to indicate when it should not be stored as CSV. In addition, we could make it dependent on the used scheme, e.g. if somebody assigns an array scheme (which we will need to add) to a column, the table will then automatically not be stored as CSV.

This way we could for example add an embeddings table to a database containing embeddings from different models in its columns.
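
For illustration, PARQUET is one format that would support such partial reads. A minimal sketch with pyarrow (file and column names are made up):

import pandas as pd
import pyarrow.parquet as parquet

# Write a small example table
df = pd.DataFrame({"file": ["a.wav", "b.wav"], "emotion": ["happy", "sad"]})
df.to_parquet("table.parquet")

# Read only a single column into memory
table = parquet.read_table("table.parquet", columns=["emotion"])
print(table.to_pandas())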

@hagenw
Member Author

hagenw commented Apr 28, 2023

Another approach to providing an argument to audformat.Table might be to introduce another table class, e.g. audformat.ArrayTable.

@frankenjoe
Collaborator

frankenjoe commented Apr 28, 2023

This way we could for example add an embeddings table to a database containing embeddings from different models in its columns.

In my opinion storing embeddings / features in tables has a big downside: whenever we add files to a database, we will need to update the tables and upload them again. So I am still more in favor of a solution as described in #321, where we store embeddings / features on media level. I.e., with a new database version we only have to upload the embeddings / features of newly added files.

@hagenw
Member Author

hagenw commented May 2, 2023

It might be that we even need both (if we cannot solve this with a single solution). For databases that no longer have (or are not allowed from the beginning to have) WAV files it would indeed feel more native to store the features instead of the media files. But there still remain arguments to also support features inside a table:

  • You may want to store embeddings in databases that have media files, which means the argument that you only have to upload newly added embeddings is no longer valid
  • You may want to store array labels that are not embeddings/features
  • We want to support tables that do not fit into memory
  • You may want to use the table as a replacement for a call to audinterface that extracted the features, which means they need to be stored as labels

@frankenjoe
Collaborator

For databases that no longer have (or are not allowed from the beginning to have) WAV files it would indeed feel more native to store the features instead of the media files.

In #321 this was the motivation to introduce such a feature. But actually, I don't see why we should not also store embeddings in addition to WAV files. Probably that's the more likely use-case in the end.

You may want to store embeddings in databases that have media files, which means the argument that you only have to upload newly added embeddings is no longer valid

Mhh, can't follow you :)

You may want to store array labels that are not embeddings/features

This is supported by Pandas DataFrame.

We want to support tables that do not fit into memory

Yes. However, there is a chance that pandas will soon offer a read-on-demand feature, now that support for Arrow was added in 2.0.0.
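
For reference, a minimal example of the Arrow support in pandas 2.0 (the file name is made up); this is not yet read-on-demand, but the columns are backed by Arrow arrays instead of NumPy:

import pandas as pd

# Requires pandas>=2.0 and pyarrow
df = pd.read_csv("db.files.csv", engine="pyarrow", dtype_backend="pyarrow")
print(df.dtypes)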

You may want to use the table as a replacement for a call to audinterface that extracted the features, which means they need to be stored as labels

Not necessarily; we could also add a function like Table.get_data() that does not return the labels, but collects the audio / embeddings / features from the requested files and returns them.
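
A minimal sketch of such a (hypothetical, not implemented) helper, assuming audiofile for reading media:

import audiofile

def get_data(files):
    # Hypothetical helper: instead of storing signals/features as labels,
    # collect the audio for the requested files on demand
    return {file: audiofile.read(file) for file in files}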


But in principle I agree, it would be nice to support other table formats of course. However, I see the main advantage of audformat + audb in the ability to properly version binary data without uploading all data with every version again. And by simply putting embeddings into tables, we would lose this feature.

@hagenw
Member Author

hagenw commented May 2, 2023

properly version binary data without uploading all data with every version again

Good point, this would indeed be lost as we would need to upload the whole table when just adding a single new entry to it.

This is supported by Pandas DataFrame.

Yes, but we don't want to store those in CSV files, as loading them becomes extremely slow. We experience this already with databases that store transcriptions.

So my main goal for this issue is to support storing tables in another format besides CSV, and afterwards maybe to add an array scheme.

@hagenw
Member Author

hagenw commented Sep 28, 2023

For me, the biggest downside of using audb at the moment is that loading tables of large databases takes too long. As we constantly release new versions of databases, the tables are also very often not in the cache folder, or not in the cache folder of the machine I'm on.

In addition, CSV files do not scale to very large databases.

So I think supporting a binary format for storing tables would be my highest priority for a new feature in audformat.
If we go with PKL, it might be possible without much additional work. But maybe PARQUET would be a better choice, as the tables could still be loaded by other programs?
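
For comparison, writing a table in either format is a one-liner with pandas (file names made up):

import pandas as pd

df = pd.DataFrame({"file": ["a.wav"], "speaker": ["spk1"]})

# PKL: fast, but only readable from Python
df.to_pickle("db.files.pkl")

# PARQUET: binary as well, but readable by other programs,
# e.g. R, DuckDB, Spark
df.to_parquet("db.files.parquet")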

@hagenw
Member Author

hagenw commented Apr 4, 2024

When storing tables as PARQUET, we might run into the problem that the MD5 sums are not reproducible, see the discussion at audeering/audb#372 (comment).

My hope is that we find a setting for storing PARQUET files that always yields a reproducible MD5 sum.
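
A minimal check for this (whether the sums actually match can depend on the pyarrow version and writer settings):

import hashlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as parquet


def md5(path):
    with open(path, "rb") as file:
        return hashlib.md5(file.read()).hexdigest()


# Write the same table twice and compare checksums
df = pd.DataFrame({"file": ["a.wav"], "duration": [1.0]})
table = pa.Table.from_pandas(df, preserve_index=False)
parquet.write_table(table, "t1.parquet")
parquet.write_table(table, "t2.parquet")
print(md5("t1.parquet") == md5("t2.parquet"))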

@hagenw
Member Author

hagenw commented Apr 4, 2024

Another interesting question is whether we can find a format that allows downloading only part of a table, e.g. for previewing it. I guess this would also require that we no longer store the tables inside a ZIP file?

@hagenw
Member Author

hagenw commented Apr 24, 2024

@ChristianGeng you mentioned that there are potential alternatives to PARQUET, e.g. having a SQL-like database per table. What would be of interest here is whether any of those formats allows us to download only part of a table, e.g. for preview.
Could you list potential alternatives here, please?

@hagenw
Member Author

hagenw commented Apr 24, 2024

@maxschmitt could you please elaborate on what kind of text data you might want to store that might not be well suited for a misc table? We could then see whether this might be better supported by any of the formats we are considering as an alternative to CSV.

@maxschmitt

@maxschmitt could you please elaborate on what kind of text data you might want to store that might not be well suited for a misc table? We could then see whether this might be better supported by any of the formats we are considering as an alternative to CSV.

Datasets would look like this one, e.g.:
https://huggingface.co/datasets/wikipedia#data-instances
i.e., some metadata and text.

In theory, misc_table would be fine, but if a dataset has a size of many GB, there will be some limitations, especially as only_metadata=True won't be possible then.

Another option would be to store the text files as media files (not supported atm), with the table containing only the metadata and the files linked in the index. However, there as well we might run into issues if databases exceed 1 TB or >10M files. Nevertheless, supporting text files as media files would solve many problems for the beginning.

@hagenw
Member Author

hagenw commented Apr 25, 2024

Thanks for the feedback. So the biggest challenge seems to be size.

If we find a format for storing tables that supports downloading only a selected part of them, we might still be able to store text in misc tables.

Otherwise, I agree with your proposal to store text files as single media files. This would also not be too complicated.
In audformat, we can already create a dataset with text files, e.g.

import audeer
import audformat


build_dir = audeer.path("./build")
audeer.rmdir(build_dir)
audeer.mkdir(build_dir)

# Create text file as media file
data_dir = audeer.mkdir(build_dir, "data")
with open(audeer.path(data_dir, "file1.txt"), "w") as file:
    file.write("Text written by a person.\n")

db = audformat.Database("text-db")
db.schemes["speaker"] = audformat.Scheme("str")
index = audformat.filewise_index(["data/file1.txt"])
db["files"] = audformat.Table(index)
db["files"]["speaker"] = audformat.Column(scheme_id="speaker")
db["files"]["speaker"].set(["speaker-a"])

db.save(build_dir)

It only fails at the publication stage at the moment, as audb.publish() stores information about the media files like bit depth, sampling rate, ... in the dependency table when publishing the data, see https://github.com/audeering/audb/blob/069cc042341ef244df6068651b518c439680b7b8/audb/core/publish.py#L296-L320.

We can easily avoid this by having a list of file extensions that shouldn't be treated as audio/video files, for which we then don't gather metadata with audiofile. In a similar fashion, we could extend audinterface to read in text files not with audiofile.read(), but with another function.
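
A minimal sketch of such a check (the extension list and helper name are made up, nothing of this exists in audb yet):

import audeer

# Hypothetical list of extensions not treated as audio/video
SKIP_EXTENSIONS = {"txt", "json"}


def needs_media_metadata(file: str) -> bool:
    # Only gather bit depth, sampling rate, ... for audio/video files
    return audeer.file_extension(file).lower() not in SKIP_EXTENSIONS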


Regarding the large number of files, e.g. >10M files, we have already started to work on better support for very large dependency tables in audb (audeering/audb#372), and in audformat we are also looking for alternatives to CSV files for storing large tables (this issue).

@hagenw
Member Author

hagenw commented Jun 14, 2024

When storing tables directly as PARQUET files on a server, we might be able to stream parts of them, which would be a nice feature for table preview, or if the tables are really huge.

@hagenw
Member Author

hagenw commented Jun 17, 2024

Streaming from a local PARQUET file is very simple. E.g. to preview the first 10 lines of an audb dependency table:

import pyarrow.parquet as parquet

# Path to a local dependency table (example path)
parquet_file = "db.parquet"

file = parquet.ParquetFile(parquet_file)
first_ten_rows = next(file.iter_batches(batch_size=10))
print(first_ten_rows.to_pandas())

In principle, this should also be possible with files stored on Artifactory, but we would need to implement an fsspec class, or an object that is compatible with the pyarrow.fs.FileSystem class, e.g. https://arrow.apache.org/docs/python/generated/pyarrow.fs.PyFileSystem.html#pyarrow.fs.PyFileSystem, which might be slightly easier.
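
For completeness, pyarrow already ships a wrapper for fsspec file-systems, so a full custom implementation might not be needed:

import fsspec
from pyarrow.fs import FSSpecHandler, PyFileSystem

# Wrap an fsspec file-system so it can be passed
# wherever pyarrow expects a pyarrow.fs.FileSystem
fs = fsspec.filesystem("https")
pa_fs = PyFileSystem(FSSpecHandler(fs))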

@hagenw
Member Author

hagenw commented Jun 21, 2024

Even better, we don't need any extra implementation, but can directly preview PARQUET files from Artifactory (example uses our internal server for now):

import aiohttp
import fsspec
import pyarrow.parquet as parquet

import audbackend


host = "https://artifactory.audeering.com/artifactory"
auth = audbackend.backend.Artifactory.get_authentication(host)
repository = "data-public-local"

# Prepare fsspec https file-system to communicate with Artifactory
fs = fsspec.filesystem("https", auth=aiohttp.BasicAuth(auth[0], auth[1]))

# Preview dependency table of casual-conversations-v2 dataset
dataset = "casual-conversations-v2"
version = "1.0.0"
url = f"{host}/{repository}/{dataset}/db/{version}/db-{version}.parquet"
file = parquet.ParquetFile(url, filesystem=fs)
first_ten_rows = next(file.iter_batches(batch_size=10))
print(first_ten_rows.to_pandas())

This returns:

                                      file                               archive  bit_depth  channels  ... removed  sampling_rate type  version
0                      db.disabilities.csv                          disabilities          0         0  ...       0              0    0    1.0.0
1                             db.files.csv                                 files          0         0  ...       0              0    0    1.0.0
2               db.physical-adornments.csv                   physical-adornments          0         0  ...       0              0    0    1.0.0
3               db.physical-attributes.csv                   physical-attributes          0         0  ...       0              0    0    1.0.0
4                         db.recording.csv                             recording          0         0  ...       0              0    0    1.0.0
5                         db.skin-tone.csv                             skin-tone          0         0  ...       0              0    0    1.0.0
6                           db.speaker.csv                               speaker          0         0  ...       0              0    0    1.0.0
7  audio/0000_portuguese_nonscripted_1.wav  f76b3d4a-a172-63ee-22f2-fb2255d692ee         16         1  ...       0          48000    1    1.0.0
8  audio/0000_portuguese_nonscripted_2.wav  81db070f-69a1-ab92-a365-ca95ac36c893         16         1  ...       0          48000    1    1.0.0
9  audio/0000_portuguese_nonscripted_3.wav  d4572eb1-d458-7717-2145-a7861208b8da         16         1  ...       0          48000    1    1.0.0

[10 rows x 11 columns]
