Support an additional file format for tables #376
Another approach to providing an argument to …
In my opinion storing embeddings / features in tables has a big downside: whenever we add files to a database, we will need to update the tables and upload them again. So I am still more in favor of a solution as described in #321, where we store embeddings / features on media level, i.e. with a new database version we only have to upload the embeddings / features of newly added files.
It might be that we will even need both (if we cannot solve this with a single solution). For databases that no longer have (or are not allowed from the beginning to have) WAV files, it would indeed feel more native to store the features instead of the media files. But there still remain arguments to also support features inside a table:
In #321 this was the motivation to introduce such a feature. But actually, I don't see why we should not also store embeddings in addition to WAV files. Probably that's the more likely use case in the end.
Mhh, can't follow you :)
This is supported by Pandas
Yes. However, there might be a chance that …
Not necessarily, we can also add a function like … But in principle I agree, it would of course be nice to support other table formats. However, I see the main advantage of …
Good point, this would indeed be lost as we would need to upload the whole table when just adding a single new entry to it.
Yes, but we don't want to store those in CSV files, as loading them becomes extremely slow. We already experience this with databases that store transcriptions. So my main goal for this issue would be to support storing tables in another format besides CSV, and afterwards maybe add an array scheme.
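To illustrate the loading-speed argument, here is a rough, self-contained sketch (plain pandas, file names invented; `to_parquet()` requires pyarrow) comparing CSV and Parquet for a table with one million rows:

```python
import time

import numpy as np
import pandas as pd

# One million rows as a stand-in for a large annotation table
df = pd.DataFrame({"value": np.random.rand(1_000_000)})
df.to_csv("big.csv", index=False)
df.to_parquet("big.parquet")

t0 = time.time()
pd.read_csv("big.csv")
print(f"CSV:     {time.time() - t0:.2f} s")

t0 = time.time()
pd.read_parquet("big.parquet")
print(f"Parquet: {time.time() - t0:.2f} s")
```

On typical hardware the Parquet file loads noticeably faster, and for string-heavy tables like transcriptions the gap tends to be larger.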
For me the biggest downside when using … In addition, CSV files do not scale to very large databases. So I think supporting a binary format to store tables would be my highest priority for a new feature in …
When storing tables as PARQUET we might run into the problem that the MD5 sums are not reproducible, see the discussion at audeering/audb#372 (comment). My hope would be that we find a setting to store PARQUET files in a way that always yields a reproducible MD5 sum.
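If byte-identical files turn out to be impossible to guarantee, one conceivable workaround (just a sketch, not a decided design) is to compute a hash over the table content and store it in the Parquet schema metadata, so that equality can be checked on the content hash instead of the file's MD5 sum:

```python
import hashlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as parquet

df = pd.DataFrame({"file": ["f1.wav"], "speaker": ["speaker-a"]})
table = pa.Table.from_pandas(df, preserve_index=False)

# Hash the content, not the bytes on disk:
# two files with identical content then carry an identical hash,
# even if their serialized bytes differ
content_hash = hashlib.md5(
    "".join(str(column) for column in table.columns).encode()
).hexdigest()
table = table.replace_schema_metadata({"hash": content_hash})
parquet.write_table(table, "table.parquet")

# The hash can be read back from the file footer
# without loading the actual data
schema = parquet.read_schema("table.parquet")
print(schema.metadata[b"hash"].decode())
```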
Another interesting question is whether we can find a format that allows downloading only a part of a table, e.g. for previewing it. I guess this would also require that we no longer store the tables inside a ZIP file?
@ChristianGeng you mentioned that there are potential alternatives to PARQUET, e.g. having an SQL-like database per table. What would be of interest here is whether any of those formats allows us to download only a part of the table, e.g. for a preview.
@maxschmitt could you please elaborate on what kind of text data you might want to store that might not be well suited for a misc table? We could then see if this might be better supported by any of the formats we are trying out as an alternative to CSV.
Datasets would look like this one, e.g.: … In theory, … Another option would be to store the text files as media files (not supported atm), with the table containing only the metadata and the files linked in the index. However, there as well we might run into issues if databases exceed 1 TB or more than 10 M files. Nevertheless, supporting text files as media files would solve many problems for the beginning.
Thanks for the feedback. So the biggest challenge seems to be size. If we find a format for the storage of tables that supports downloading only a selected part of it, we might still be able to store text as misc tables. Otherwise, I agree with your proposal to store text files as single media files. This would also not be too complicated:

```python
import audeer
import audformat

build_dir = audeer.path("./build")
audeer.rmdir(build_dir)
audeer.mkdir(build_dir)

# Create text file as media file
data_dir = audeer.mkdir(build_dir, "data")
with open(audeer.path(data_dir, "file1.txt"), "w") as file:
    file.write("Text written by a person.\n")

db = audformat.Database("text-db")
db.schemes["speaker"] = audformat.Scheme("str")
index = audformat.filewise_index(["data/file1.txt"])
db["files"] = audformat.Table(index)
db["files"]["speaker"] = audformat.Column(scheme_id="speaker")
db["files"]["speaker"].set(["speaker-a"])
db.save(build_dir)
```

It only fails at the publication stage at the moment, as … We can easily avoid this by having a list of file extensions that shouldn't be treated as audio/video files, for which we then don't gather metadata with … Regarding the large number of files, e.g. >10 M files, we have already started to work on better support for very large dependency tables in …
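A minimal sketch of how such an extension list could look (names are invented for illustration, this is not an actual audb API):

```python
import os

# Hypothetical list of extensions that should not be treated
# as audio/video files, and for which no audio metadata
# (duration, sampling rate, ...) would be gathered
NON_AV_EXTENSIONS = {".txt", ".json", ".csv"}


def gather_audio_metadata(path: str) -> bool:
    r"""Decide if audio/video metadata should be gathered for a file."""
    return os.path.splitext(path)[1].lower() not in NON_AV_EXTENSIONS


print(gather_audio_metadata("data/file1.txt"))  # False
print(gather_audio_metadata("data/file1.wav"))  # True
```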
When storing tables directly as PARQUET files on servers, we might be able to stream parts of them, which would be a nice feature for table preview, or if the tables are really huge.
Streaming from a local PARQUET file is very simple. E.g., to preview the first 10 rows of a table (here `parquet_file` points to a local file):

```python
import pyarrow.parquet as parquet

parquet_file = "table.parquet"  # path to a local PARQUET file
file = parquet.ParquetFile(parquet_file)
first_ten_rows = next(file.iter_batches(batch_size=10))
print(first_ten_rows.to_pandas())
```

In principle, this should also be possible with files stored on Artifactory, but we need to implement a …
Even better, we don't need any extra implementation, but can directly preview PARQUET files from Artifactory (the example uses our internal server for now):

```python
import aiohttp
import fsspec
import pyarrow.parquet as parquet

import audbackend

host = "https://artifactory.audeering.com/artifactory"
auth = audbackend.backend.Artifactory.get_authentication(host)
repository = "data-public-local"

# Prepare fsspec HTTPS file system to communicate with Artifactory
fs = fsspec.filesystem("https", auth=aiohttp.BasicAuth(auth[0], auth[1]))

# Preview dependency table of the casual-conversations-v2 dataset
dataset = "casual-conversations-v2"
version = "1.0.0"
url = f"{host}/{repository}/{dataset}/db/{version}/db-{version}.parquet"
file = parquet.ParquetFile(url, filesystem=fs)
first_ten_rows = next(file.iter_batches(batch_size=10))
print(first_ten_rows.to_pandas())
```

This prints the first ten rows of the dependency table.
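As a side note, the same `ParquetFile` handle from the example above also exposes the file footer, so schema and row count can be inspected without fetching the row data (these are standard pyarrow attributes):

```python
# Only the PARQUET footer is needed for this,
# no row data has to be downloaded
print(file.schema_arrow)       # column names and types
print(file.metadata.num_rows)  # total number of rows
```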
One of the biggest limitations we are currently facing is due to our usage of the CSV format to store tables.
There are strong arguments in favour of CSV, as it is widely supported and human readable. I would not propose to replace it, but to provide the possibility to store tables in a different format as well.
The additional format should support storing array data like embeddings in its columns, and it should support reading only parts of it into memory, so that there is no limit on database size.
I don't see too much overhead in implementing it. We could add an argument to `audformat.Table` to indicate when it should not be stored as CSV. In addition, we can also make it dependent on the used scheme, e.g. if somebody assigns an array scheme (which we will need to add) to a column, the table will then automatically not be stored as CSV. This way we could, for example, add an embeddings table to a database, containing embeddings from different models in its columns.
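To make the array-scheme argument concrete, here is a quick illustration (plain pandas, file names invented; `to_parquet()` requires pyarrow) of why CSV falls short for such columns, while a binary format like Parquet keeps them intact:

```python
import pandas as pd

# One embedding vector per row
df = pd.DataFrame({"embedding": [[0.1, 0.2], [0.3, 0.4]]})

# CSV stringifies the arrays:
# reading back yields strings, not arrays
df.to_csv("table.csv", index=False)
print(type(pd.read_csv("table.csv")["embedding"][0]))  # <class 'str'>

# Parquet stores the column as a nested list type
# and returns proper array values on reading
df.to_parquet("table.parquet")
print(type(pd.read_parquet("table.parquet")["embedding"][0]))
```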