
Support an additional file format for tables #376

Closed
hagenw opened this issue Apr 28, 2023 · 15 comments · Fixed by #419
Labels
enhancement New feature or request

Comments

@hagenw
Member

hagenw commented Apr 28, 2023

One of the biggest limitations we are currently facing stems from our use of the CSV format to store tables.

There are strong arguments in favour of CSV, as it is widely supported and human readable. I would not propose to replace it, but to additionally allow tables to be stored in a different format.

The additional format should support storing array data like embeddings in its columns, and it should support reading only parts of a table into memory, so that database size is not limited by memory.

I don't see too much overhead in implementing it. We could add an argument to audformat.Table to indicate when it should not be stored as CSV. In addition, we could make it dependent on the used scheme, e.g. if somebody assigns an array scheme (which we will need to add) to a column, the table will then automatically not be stored as CSV.

This way we could for example add an embeddings table to a database containing embeddings from different models in its columns.
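
For illustration, PARQUET is one format that would support such partial reads. A minimal sketch with pyarrow (file and column names are made up):

import pandas as pd
import pyarrow.parquet as parquet

# Write a small example table
df = pd.DataFrame({"file": ["a.wav", "b.wav"], "emotion": ["happy", "sad"]})
df.to_parquet("table.parquet")

# Read only a single column into memory
table = parquet.read_table("table.parquet", columns=["emotion"])
print(table.to_pandas())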

@hagenw
Member Author

hagenw commented Apr 28, 2023

Another approach to providing an argument to audformat.Table might be to introduce another table class, e.g. audformat.ArrayTable.

@frankenjoe
Collaborator

frankenjoe commented Apr 28, 2023

This way we could for example add an embeddings table to a database containing embeddings from different models in its columns.

In my opinion storing embeddings / features in tables has a big downside: whenever we add files to a database, we will need to update the tables and upload them again. So I am still more in favor of a solution as described in #321, where we store embeddings / features on media level. I.e., with a new database version we only have to upload the embeddings / features of newly added files.

@hagenw
Member Author

hagenw commented May 2, 2023

It might be that we even need both (if we cannot solve this with a single solution). For databases that no longer have (or are not allowed from the beginning to have) WAV files it would indeed feel more native to store the features instead of the media files. But there still remain arguments to also support features inside a table:

  • You may want to store embeddings in databases that have media files, which means the argument that you only have to upload newly added embeddings is no longer valid
  • You may want to store array labels that are not embeddings/features
  • We want to support tables that do not fit into memory
  • You may want to use the table as a replacement for a call to audinterface that extracted the features, which means they need to be stored as labels

@frankenjoe
Collaborator

For databases that no longer have (or are not allowed from the beginning to have) WAV files it would indeed feel more native to store the features instead of the media files.

In #321 this was the motivation to introduce such a feature. But actually, I don't see why we should not also store embeddings in addition to WAV files. Probably that's the more likely use-case in the end.

You may want to store embeddings in databases that have media files, which means the argument that you only have to upload newly added embeddings is no longer valid

Mhh, can't follow you :)

You may want to store array labels that are not embeddings/features

This is supported by Pandas DataFrame.

We want to support tables that do not fit into memory

Yes. However, there is a chance that pandas will soon offer a read-on-demand feature, now that support for Arrow was added in 2.0.0.
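
For reference, a minimal example of the Arrow support in pandas 2.0 (the file name is made up); this is not yet read-on-demand, but the columns are backed by Arrow arrays instead of NumPy:

import pandas as pd

# Requires pandas>=2.0 and pyarrow
df = pd.read_csv("db.files.csv", engine="pyarrow", dtype_backend="pyarrow")
print(df.dtypes)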

You may want to use the table as a replacement for a call to audinterface that extracted the features, which means they need to be stored as labels

Not necessarily; we could also add a function like Table.get_data() that does not return the labels, but collects the audio / embeddings / features from the requested files and returns them.
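
A minimal sketch of such a (hypothetical, not implemented) helper, assuming audiofile for reading media:

import audiofile

def get_data(files):
    # Hypothetical helper: instead of storing signals/features as labels,
    # collect the audio for the requested files on demand
    return {file: audiofile.read(file) for file in files}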


But in principle I agree, it would be nice to support other table formats of course. However, I see the main advantage of audformat + audb in the ability to properly version binary data without uploading all data with every version again. And by simply putting embeddings into tables, we would lose this feature.

@hagenw
Member Author

hagenw commented May 2, 2023

properly version binary data without uploading all data with every version again

Good point, this would indeed be lost as we would need to upload the whole table when just adding a single new entry to it.

This is supported by Pandas DataFrame.

Yes, but we don't want to store those in CSV files, as loading them becomes extremely slow. We experience this already with databases that store transcriptions.

So my main goal for this issue is to support storing tables in another format besides CSV, and afterwards maybe to add an array scheme.

@hagenw
Member Author

hagenw commented Sep 28, 2023

For me, the biggest downside of using audb at the moment is that loading tables of large databases takes too long. As we constantly release new versions of databases, the tables are also very often not in the cache folder, or not in the cache folder of the machine I'm on.

In addition, CSV files do not scale to very large databases.

So I think supporting a binary format for storing tables would be my highest priority for a new feature in audformat.
If we go with PKL, it might be possible without much additional work. But maybe PARQUET would be a better choice, as the tables could still be loaded by other programs?
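
For comparison, writing a table in either format is a one-liner with pandas (file names made up):

import pandas as pd

df = pd.DataFrame({"file": ["a.wav"], "speaker": ["spk1"]})

# PKL: fast, but only readable from Python
df.to_pickle("db.files.pkl")

# PARQUET: binary as well, but readable by other programs,
# e.g. R, DuckDB, Spark
df.to_parquet("db.files.parquet")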

@hagenw
Member Author

hagenw commented Apr 4, 2024

When storing tables as PARQUET, we might run into the problem that the MD5 sums are not reproducible, see the discussion at audeering/audb#372 (comment).

My hope is that we find a setting for storing PARQUET files that always yields a reproducible MD5 sum.
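
A minimal check for this (whether the sums actually match can depend on the pyarrow version and writer settings):

import hashlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as parquet


def md5(path):
    with open(path, "rb") as file:
        return hashlib.md5(file.read()).hexdigest()


# Write the same table twice and compare checksums
df = pd.DataFrame({"file": ["a.wav"], "duration": [1.0]})
table = pa.Table.from_pandas(df, preserve_index=False)
parquet.write_table(table, "t1.parquet")
parquet.write_table(table, "t2.parquet")
print(md5("t1.parquet") == md5("t2.parquet"))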

@hagenw
Member Author

hagenw commented Apr 4, 2024

Another interesting question is whether we can find a format that allows downloading only part of a table, e.g. for previewing it. I guess this would also require that we no longer store the tables inside a ZIP file?

@hagenw
Member Author

hagenw commented Apr 24, 2024

@ChristianGeng you mentioned that there are potential alternatives to PARQUET, e.g. having a SQL-like database per table. What would be of interest here is whether any of those formats allows us to download only part of a table, e.g. for preview.
Could you list potential alternatives here, please?

@hagenw
Member Author

hagenw commented Apr 24, 2024

@maxschmitt could you please elaborate on what kind of text data you might want to store that might not be well suited for a misc table? We could then see whether this might be better supported by any of the formats we are considering as an alternative to CSV.

@maxschmitt

@maxschmitt could you please elaborate on what kind of text data you might want to store that might not be well suited for a misc table? We could then see whether this might be better supported by any of the formats we are considering as an alternative to CSV.

Datasets would look like this one, e.g.:
https://huggingface.co/datasets/wikipedia#data-instances
i.e., some metadata and text.

In theory, misc_table would be fine, but if a dataset has a size of many GB, there will be some limitations, especially as only_metadata=True won't be possible then.

Another option would be to store the text files as media files (not supported atm), with the table containing only the metadata and the files linked in the index. However, there as well we might run into issues if databases exceed 1 TB or >10M files. Nevertheless, supporting text files as media files would solve many problems for the beginning.

@hagenw
Member Author

hagenw commented Apr 25, 2024

Thanks for the feedback. So the biggest challenge seems to be size.

If we find a format for storing tables that supports downloading only a selected part of them, we might still be able to store text in misc tables.

Otherwise, I agree with your proposal to store text files as single media files. This would also not be too complicated.
In audformat, we can already create a dataset with text files, e.g.

import audeer
import audformat


build_dir = audeer.path("./build")
audeer.rmdir(build_dir)
audeer.mkdir(build_dir)

# Create text file as media file
data_dir = audeer.mkdir(build_dir, "data")
with open(audeer.path(data_dir, "file1.txt"), "w") as file:
    file.write("Text written by a person.\n")

db = audformat.Database("text-db")
db.schemes["speaker"] = audformat.Scheme("str")
index = audformat.filewise_index(["data/file1.txt"])
db["files"] = audformat.Table(index)
db["files"]["speaker"] = audformat.Column(scheme_id="speaker")
db["files"]["speaker"].set(["speaker-a"])

db.save(build_dir)

It only fails at the publication stage at the moment, as audb.publish() stores information about the media files like bit depth, sampling rate, ... in the dependency table when publishing the data, see https://github.com/audeering/audb/blob/069cc042341ef244df6068651b518c439680b7b8/audb/core/publish.py#L296-L320.

We can easily avoid this by having a list of file extensions that shouldn't be treated as audio/video files, for which we then don't gather metadata with audiofile. In a similar fashion, we could extend audinterface to read in text files not with audiofile.read(), but with another function.
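
A minimal sketch of such a check (the extension list and helper name are made up, nothing of this exists in audb yet):

import audeer

# Hypothetical list of extensions not treated as audio/video
SKIP_EXTENSIONS = {"txt", "json"}


def needs_media_metadata(file: str) -> bool:
    # Only gather bit depth, sampling rate, ... for audio/video files
    return audeer.file_extension(file).lower() not in SKIP_EXTENSIONS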


Regarding the large number of files, e.g. >10M files, we have already started to work on better support for very large dependency tables in audb (audeering/audb#372), and in audformat we are also looking for alternatives to CSV files for storing large tables (this issue).

@hagenw
Member Author

hagenw commented Jun 14, 2024

When storing tables directly as PARQUET files on a server, we might be able to stream parts of them, which would be a nice feature for table preview, or if the tables are really huge.

@hagenw
Member Author

hagenw commented Jun 17, 2024

Streaming from a local PARQUET file is very simple. E.g. to preview the first 10 lines of an audb dependency table:

import pyarrow.parquet as parquet

# Path to a local dependency table (example path)
parquet_file = "db.parquet"

file = parquet.ParquetFile(parquet_file)
first_ten_rows = next(file.iter_batches(batch_size=10))
print(first_ten_rows.to_pandas())

In principle, this should also be possible with files stored on Artifactory, but we would need to implement an fsspec class, or an object that is compatible with the pyarrow.fs.FileSystem class, e.g. https://arrow.apache.org/docs/python/generated/pyarrow.fs.PyFileSystem.html#pyarrow.fs.PyFileSystem, which might be slightly easier.
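
For completeness, pyarrow already ships a wrapper for fsspec file-systems, so a full custom implementation might not be needed:

import fsspec
from pyarrow.fs import FSSpecHandler, PyFileSystem

# Wrap an fsspec file-system so it can be passed
# wherever pyarrow expects a pyarrow.fs.FileSystem
fs = fsspec.filesystem("https")
pa_fs = PyFileSystem(FSSpecHandler(fs))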

@hagenw
Member Author

hagenw commented Jun 21, 2024

Even better, we don't need any extra implementation, but can directly preview PARQUET files from Artifactory (example uses our internal server for now):

import aiohttp
import fsspec
import pyarrow.parquet as parquet

import audbackend


host = "https://artifactory.audeering.com/artifactory"
auth = audbackend.backend.Artifactory.get_authentication(host)
repository = "data-public-local"

# Prepare fsspec https file-system to communicate with Artifactory
fs = fsspec.filesystem("https", auth=aiohttp.BasicAuth(auth[0], auth[1]))

# Preview dependency table of casual-conversations-v2 dataset
dataset = "casual-conversations-v2"
version = "1.0.0"
url = f"{host}/{repository}/{dataset}/db/{version}/db-{version}.parquet"
file = parquet.ParquetFile(url, filesystem=fs)
first_ten_rows = next(file.iter_batches(batch_size=10))
print(first_ten_rows.to_pandas())

This returns:

                                      file                               archive  bit_depth  channels  ... removed  sampling_rate type  version
0                      db.disabilities.csv                          disabilities          0         0  ...       0              0    0    1.0.0
1                             db.files.csv                                 files          0         0  ...       0              0    0    1.0.0
2               db.physical-adornments.csv                   physical-adornments          0         0  ...       0              0    0    1.0.0
3               db.physical-attributes.csv                   physical-attributes          0         0  ...       0              0    0    1.0.0
4                         db.recording.csv                             recording          0         0  ...       0              0    0    1.0.0
5                         db.skin-tone.csv                             skin-tone          0         0  ...       0              0    0    1.0.0
6                           db.speaker.csv                               speaker          0         0  ...       0              0    0    1.0.0
7  audio/0000_portuguese_nonscripted_1.wav  f76b3d4a-a172-63ee-22f2-fb2255d692ee         16         1  ...       0          48000    1    1.0.0
8  audio/0000_portuguese_nonscripted_2.wav  81db070f-69a1-ab92-a365-ca95ac36c893         16         1  ...       0          48000    1    1.0.0
9  audio/0000_portuguese_nonscripted_3.wav  d4572eb1-d458-7717-2145-a7861208b8da         16         1  ...       0          48000    1    1.0.0

[10 rows x 11 columns]
