Interface for metadata (including per-column) #176

Closed
nalimilan opened this issue Jun 3, 2020 · 2 comments
Comments

@nalimilan
Member

nalimilan commented Jun 3, 2020

Assuming we agree at JuliaData/DataAPI.jl#22 on a common framework for general metadata that can be attached to any object, this issue is about defining an interface within that framework so that Table objects can provide and retrieve metadata. This needs some design in particular for per-column metadata, as implementations need to be able to know which value corresponds to which column, for example to be able to do hcat(df1, df2) preserving column metadata.

Broadly speaking two approaches can be considered:

  1. Store all metadata at the table level. By default metadata fields refer to the whole table (e.g. year of data collection). Per-column metadata fields can be identified with a special prefix like #Tables#, and they are required to be AbstractDict{Symbol}-like objects with keys referring to column names. Convenience functions can be provided (in Tables.jl or by implementations) to avoid the need for users to see this prefix.
  2. Store metadata referring to the whole table on the table, and metadata referring to specific columns on the column object. This approach has the advantage that metadata moves automatically with the column, but it doesn't work for row-oriented tables.
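
To make approach 1 concrete, here is a minimal sketch of what a table-level store with prefixed per-column entries could look like. Everything in it is an assumption for illustration: the `#Tables#` prefix is the tentative one mentioned above, and `getcolmeta` / `setcolmeta!` are hypothetical helper names, not part of Tables.jl or DataAPI.jl.

```julia
# Tentative prefix from the proposal; the exact spelling is not settled.
const COLUMN_PREFIX = "#Tables#"

# Table-level metadata store. Plain keys describe the whole table;
# prefixed keys hold a Dict{Symbol} keyed by column name.
meta = Dict{String,Any}(
    "year" => 2020,
    COLUMN_PREFIX * "label" => Dict{Symbol,Any}(
        :age => "Age in years",
        :income => "Household income",
    ),
)

# Hypothetical accessor that hides the prefix from the user.
function getcolmeta(meta::AbstractDict, col::Symbol, key::AbstractString, default=nothing)
    percol = get(meta, COLUMN_PREFIX * key, nothing)
    percol === nothing && return default
    return get(percol, col, default)
end

# Hypothetical setter; creates the per-column dict on first use.
function setcolmeta!(meta::AbstractDict, col::Symbol, key::AbstractString, value)
    percol = get!(() -> Dict{Symbol,Any}(), meta, COLUMN_PREFIX * key)
    percol[col] = value
    return meta
end

getcolmeta(meta, :age, "label")        # look up an existing per-column field
setcolmeta!(meta, :age, "unit", "years")  # add a new per-column field
```

Because everything lives in one dictionary on the table, this also works for row-oriented tables, which have no column object to attach metadata to.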

See also discussion for the DataFrames implementation at JuliaData/DataFrames.jl#2276.

Cc: @bkamins @pdeffebach @quinnj

@bkamins
Member

bkamins commented Jun 3, 2020

So my preference is for option 1, exactly because we are not guaranteed that a table even has an object representing columns.

Now, for metadata that refers to columns but is stored at the table level, I think there should just be an agreed convention for the naming scheme, but we still would not, e.g., throw errors when these things go out of sync. It is up to the user to manage this if they want it handled.

The only way in which these "column" related metadata would be special is how they are handled when mixing tables e.g. via hcat, vcat or joins (as this is the only place where this is relevant I think). There are probably several merging rules but the distinction is that:

  • normal metadata when conflicting would be discarded, or last one taken (or whatever rule we decide on, by default merge in Base takes the last one)
  • column level "special" metadata would on the other hand be merged recursively, so e.g. if we get #Tables#label (#Tables# prefix is tentative) key in both tables then we merge the dict-like values that they point to (instead of dropping them or taking the last one - as we would do with normal metadata).
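
The two rules above can be sketched as follows. This is only an illustration under stated assumptions: `mergemeta` is a hypothetical name, keys are assumed to be strings, the `#Tables#` prefix is the tentative one, and "last one wins" is picked for normal metadata simply because it matches what `merge` in Base does.

```julia
# Tentative prefix; see the discussion above.
const COLUMN_PREFIX = "#Tables#"

# Hypothetical merge of two table-level metadata dictionaries.
function mergemeta(a::AbstractDict, b::AbstractDict)
    out = Dict{String,Any}(a)
    for (k, v) in b
        if startswith(k, COLUMN_PREFIX) && haskey(out, k)
            # Column-level entry present in both tables:
            # merge the per-column dicts instead of replacing them.
            out[k] = merge(out[k], v)
        else
            # Normal metadata (or a key only in `b`): last one wins.
            out[k] = v
        end
    end
    return out
end

a = Dict{String,Any}("year" => 2019, "#Tables#label" => Dict(:x => "X"))
b = Dict{String,Any}("year" => 2020, "#Tables#label" => Dict(:y => "Y"))
m = mergemeta(a, b)
# m["year"] is 2020; m["#Tables#label"] has entries for both :x and :y
```

The merge goes only one level deep, which is enough for the nested layout described here: one prefixed key per kind of column metadata, each pointing to a dict keyed by column name.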

The reason for proposing this nested structure is that a user might want to attach many types of metadata to columns.

@bkamins
Member

bkamins commented Aug 3, 2023

Closing, as this is done.

bkamins closed this as completed Aug 3, 2023