
Add metadata support to DataFrames #1413

Closed · wants to merge 4 commits

Conversation

gcalderone

This PR adds metadata support to DataFrames. The idea for this PR comes from this discussion

Metadata are internally stored as a Dict{Any,Any}, one for each column and one for the whole table.

Metadata access is performed through the following methods:

  • metaget(df::DataFrame, key; default=nothing): returns the metadata entry with key key from the table dictionary. If the key is not present, the value of the default keyword is returned;

  • metaget(df::DataFrame, column::Symbol, key; default=nothing): returns the metadata entry with key key from the dictionary of column column. If the key is not present, the value of the default keyword is returned;

  • metaset!(df::DataFrame, key, value): sets an entry in the table metadata dictionary with key key and value value;

  • metaset!(df::DataFrame, column::Symbol, key, value): sets an entry in the metadata dictionary of column column with key key and value value;

  • metadict(df::DataFrame): returns the table dictionary;

  • metadict(df::DataFrame, column::Symbol): returns the dictionary of column column.

Example:

using DataFrames
df = DataFrame(:col1=>1, :col2=>[1,2])
showcols(df)

# Request a non-present key in the table metadata dictionary
println("Table source: ",  metaget(df, :source, default="Unknown"))

# Set an entry in the table metadata dictionary and read it back
metaset!(df, :source, "www.some.site")
println("Table source: ",  metaget(df, :source))

# Set an entry using a `String` as key
metaset!(df, "query", "The query used to retrieve the data...")

# Display the table metadata dictionary
display(metadict(df))

# Request non-present keys in the column metadata dictionary
println("Column descr.: ",  metaget(df, :col1, :descr, default="Unspecified"))
println("Column unit  : ",  metaget(df, :col1, :unit,  default="Unspecified"))

# Set entries in the column dictionary and read them back
metaset!(df, :col1, :descr, "First column")
metaset!(df, :col1, :unit , "km / s")
println("Column descr.: ",  metaget(df, :col1, :descr))
println("Column unit  : ",  metaget(df, :col1, :unit ))

# `showcols` now searches for the `:descr` and `:unit` entries in the column
# dictionaries.  If these are available and the values can be
# converted to a `String`, they are also printed
showcols(df)

# Display the column metadata dictionary
display(metadict(df, :col1))

# Explore the column metadata dictionary
for (key, val) in metadict(df, :col1)
    println("$key = $val")
end

This PR adds no package dependencies and is backward compatible. All new methods have their own docstrings, and a new test module has been added for the new facility (in test/meta.jl).

@kescobo
Contributor

kescobo commented May 26, 2018

This is awesome. 🎉

I don't love the names metaset and metaget. ¯\_(ツ)_/¯

@gcalderone
Author

Well, I hope the method names will be the only thing to change ... ;-)
Do getmeta, setmeta! and dictmeta sound better?

@nalimilan
Member

Thanks for taking the initiative!

Here are a few general remarks:

  • I agree metaget and metaset don't look very Julian. I'd suggest meta and setmeta!/meta! (Document preferred naming convention for getters/setters in style guide JuliaLang/julia#16770).
  • I'm not sure we should expose the implementation to users: meta and setmeta! should be enough, no need for metadict. We can always add it later if it's really useful, but better start with a minimal API.
  • I'd also rather restrict the type of the keys to Symbol (or String?) for now, as I don't think we have a strong use case for other types. In particular this will limit inconsistencies, with some packages using symbols and other strings.
  • Regarding the implementation, I think it would be more efficient to have only two dicts, one for the global meta-data and one for column-specific meta-data. The second one would store vectors of values in the same order as columns, so you don't need to modify the Index type (which is just a mechanism to look up the index of columns from their names). Columns for which a key isn't available can have a nothing entry, and valid values can be wrapped inside Some to allow the user to explicitly store nothing (as Some(nothing)). That way we don't have to create a Dict object (which is relatively expensive) for each column, and it will be faster to retrieve all column-specific values for a given key. (See the sketch after this list.)
  • You'll have to adapt all setindex! and getindex methods to handle the column meta-data. Tests don't cover this currently.
  • It would be useful to have a look at whether/how other software does this. AFAICT neither dplyr, data.table nor Pandas (Allow custom metadata to be attached to panel/df/series? pandas-dev/pandas#2485) support meta-data, but maybe there are other apps besides Stata (which is quite restrictive)? In particular it would make sense to identify the most common kinds of meta-data, and document standard key names for them.
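
A minimal sketch of the two-dict layout proposed above, using hypothetical names (DataFrameMeta, table, cols and the colmeta helper are illustrations, not part of this PR):

# Hypothetical layout: one dict for table-level metadata, and one dict mapping
# each key to a vector with one slot per column, in column order.
mutable struct DataFrameMeta
    table::Dict{Symbol, Any}                          # whole-table metadata
    cols::Dict{Symbol, Vector{Union{Some, Nothing}}}  # per-column metadata, by position
end

# `nothing` means "no entry for this column"; stored values are wrapped in `Some`,
# so an explicitly stored `nothing` is representable as `Some(nothing)`.
function colmeta(m::DataFrameMeta, key::Symbol, colind::Int; default=nothing)
    haskey(m.cols, key) || return default
    v = m.cols[key][colind]
    v === nothing ? default : something(v)
end

With this layout, retrieving a given key for all columns is a single vector lookup (m.cols[key]), which is the efficiency argument made above.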

@gcalderone
Author

I'd suggest meta and setmeta!/meta!

Done, new method names are: meta, setmeta! and metakeys (to retrieve metadata keys);

no need for metadict

metadict is still present but no longer exported;

I'd also rather restrict the type of the keys to Symbol (or String?)

I agree we have no strong use cases to use Any for keys, still I think that both Symbol and String may be useful.

For instance, we may suggest users to adopt Symbol keys for quantities supposed to be read/interpreted by other programs, and String keys for quantities to be displayed (e.g. plot labels).

I think it would be more efficient to have only two dicts, one
for the global meta-data and one for column-specific
meta-data. The second one would store vectors of values in the
same order as columns.

I'm not sure I understood. Are you proposing:

meta::Dict{Symbol, Any} # global
colmeta::Dict{Symbol, Vector{Any}} # column specific

?

This approach is not flexible enough, since you would not have column-specific keys (e.g. a :unit entry for :col1). Rather I would prefer:

meta::Dict{Symbol, Any} # global
colmeta::Dict{Symbol, Dict{Symbol,Any}} # column specific

but this approach is also not convenient, since colmeta would have to be updated each time the colindex is updated. Hence, I think it is better to modify the Index type (as I did).

Or maybe I'm missing something?

You'll have to adapt all setindex! and getindex methods

Sorry, I don't understand why. Data and metadata live in two separate objects, and setindex! and getindex only operate on data.

Moreover, when a column is added/deleted the corresponding Dict is added/deleted accordingly.

Could you please elaborate on this ?

@gcalderone
Author

Update of the first comment:
This PR adds metadata support to DataFrames. The idea for this PR comes from this discussion

Metadata are internally stored as a Dict{Union{Symbol,String},Any}, one for each column and one for the whole table.

Metadata access is performed through the following methods:

  • meta(df::DataFrame, key::Union{Symbol,String}; default=nothing): returns the metadata entry with key key from the table dictionary. If the key is not present, the value of the default keyword is returned;

  • meta(df::DataFrame, column::Symbol, key::Union{Symbol,String}; default=nothing): returns the metadata entry with key key from the dictionary of column column. If the key is not present, the value of the default keyword is returned;

  • metaset!(df::DataFrame, key::Union{Symbol,String}, value): sets an entry in the table metadata dictionary with key key and value value;

  • metaset!(df::DataFrame, column::Symbol, key::Union{Symbol,String}, value): sets an entry in the metadata dictionary of column column with key key and value value;

  • metakeys(df::DataFrame): returns the keys of the table dictionary;

  • metakeys(df::DataFrame, column::Symbol): returns the keys of the dictionary of column column.

Example:

using DataFrames
df = DataFrame(:col1=>1, :col2=>[1,2])
showcols(df)

# Request a non-present key
println("Table source: ",  meta(df, :source, default="Unknown"))

# Set an entry in the dictionary and read it back
metaset!(df, :source, "www.some.site")
println("Table source: ",  meta(df, :source))

# Set an entry using a string as key
metaset!(df, "query", "The query used to retrieve the data...")

# Request non-present keys in the column dictionaries
println("Column descr.: ",  meta(df, :col1, :descr, default="Unspecified"))
println("Column unit  : ",  meta(df, :col1, :unit,  default="Unspecified"))

# Set entries in the column dictionaries and read them back
metaset!(df, :col1, :descr, "First column")
metaset!(df, :col1, :unit , "km / s")
println("Column descr.: ",  meta(df, :col1, :descr))
println("Column unit  : ",  meta(df, :col1, :unit ))

# `showcols` now searches for the `:descr` and `:unit` entries in the column
# dictionaries.  If these are available and the values can be
# converted to a `String`, they are also printed
showcols(df)

# Explore the column metadata dictionary
for key in metakeys(df, :col1)
    println("$key = ", meta(df, :col1, key))
end

This PR adds no package dependencies and is backward compatible. All new methods have their own docstrings, and a new test module has been added for the new facility (in test/meta.jl).

@nalimilan
Member

Done, new method names are: meta, setmeta! and metakeys (to retrieve metadata keys);

Ah, indeed now that you mention metakeys I realize we need a way to get the names of available keys.

I wonder whether it wouldn't be better to provide a single meta function which would return an object of the DataFrameMetadata <: AbstractDict type. Then you'd do m = meta(df), keys(m), m[:key, :col], and m[:key, :col] = .... And just m[:key] to set global meta-data. Not sure which approach is the best one.
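
For illustration only, here is a rough sketch of what such a wrapper could look like; the fields and methods below are assumptions rather than the PR's implementation (in the actual proposal meta(df) would presumably return a view backed by the DataFrame rather than a standalone object):

# Illustrative sketch: an AbstractDict-like object exposing table-level entries
# via m[:key] and column-specific entries via m[:key, :col].
struct DataFrameMetadata <: AbstractDict{Symbol, Any}
    table::Dict{Symbol, Any}                 # table-level entries
    cols::Dict{Symbol, Dict{Symbol, Any}}    # column name => its entries
end

DataFrameMetadata() =
    DataFrameMetadata(Dict{Symbol, Any}(), Dict{Symbol, Dict{Symbol, Any}}())

# m[:key] reads/writes a table-level entry
Base.getindex(m::DataFrameMetadata, key::Symbol) = m.table[key]
Base.setindex!(m::DataFrameMetadata, value, key::Symbol) = (m.table[key] = value)

# m[:key, :col] reads/writes a column-specific entry
Base.getindex(m::DataFrameMetadata, key::Symbol, col::Symbol) = m.cols[col][key]
Base.setindex!(m::DataFrameMetadata, value, key::Symbol, col::Symbol) =
    (get!(m.cols, col, Dict{Symbol, Any}())[key] = value)

# Minimal AbstractDict interface, iterating the table-level entries
Base.keys(m::DataFrameMetadata) = keys(m.table)
Base.length(m::DataFrameMetadata) = length(m.table)
Base.iterate(m::DataFrameMetadata, state...) = iterate(m.table, state...)

# Usage:
# m = DataFrameMetadata()
# m[:source] = "www.some.site"      # table-level entry
# m[:unit, :col1] = "km / s"        # column-specific entry
# keys(m)                           # table-level keys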

I agree we have no strong use cases to use Any for keys, still I think that both Symbol and String may be useful.

For instance, we may suggest users to adopt Symbol keys for quantities supposed to be read/interpreted by other programs, and String keys for quantities to be displayed (e.g. plot labels).

It would be confusing to allow for both strings and symbols. Anyway they are displayed the same, so I'm not sure why you say strings are better in that case?

I'm not sure I understood. Are you proposing:

meta::Dict{Symbol, Any} # global
colmeta::Dict{Symbol, Vector{Any}} # column specific

?

This approach is not flexible enough, since you would not have column-specific keys (e.g. a :unit entry for :col1). Rather I would prefer:

meta::Dict{Symbol, Any} # global
colmeta::Dict{Symbol, Dict{Symbol,Any}} # column specific

but this approach is also not convenient, since colmeta would have to be updated each time the colindex is updated. Hence, I think it is better to modify the Index type (as I did).

Or maybe I'm missing something?

I mean the former (but with colmeta::Dict{Symbol, Vector}). It allows for column-specific keys; it just requires storing a nothing entry for columns where the property isn't set.

Both approaches are equivalent from the user's POV, it's just a matter of efficiency in typical use cases. Updating the meta-data when the index is modified isn't an issue; it just requires a few additional function calls when adding or removing columns.

Sorry, I don't understand why. Data and metadata live in two separate objects, and setindex! and getindex only operate on data.

Moreover, when a column is added/deleted the corresponding Dict is added/deleted accordingly.

Could you please elaborate on this ?

I mean that things like df[1:3], df[:, 1:3], df[1:10, :] and df[1:0, 1:3] should return a DataFrame with the meta-data from columns 1 to 3. There are a few getindex variants which need to handle this. Also, it's probably worth thinking about whether we should keep column meta-data or drop it when replacing columns, e.g. via df[1] = v.
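
For illustration, a hedged sketch of the kind of change meant here, reusing the internal accessor names mentioned later in this thread (columns, index, metadata); the exact constructor and signatures are assumptions, not the PR's code:

# Hypothetical: a column-subsetting getindex that also carries along the
# per-column metadata of the selected columns (table-level metadata is kept as-is).
function Base.getindex(df::DataFrame, col_inds::AbstractVector{<:Integer})
    newcols = columns(df)[col_inds]
    newindex = Index(names(df)[col_inds])
    newmeta = metadata(df)[col_inds]   # subset column metadata, keep table metadata
    DataFrame(newcols, newindex, newmeta)
end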

@kescobo
Contributor

kescobo commented May 27, 2018

Piggybacking on the point about indexing - how is the view() family of functions implemented? Will it just work to use these methods on subdataframes?

@nalimilan
Member

Piggybacking on the point about indexing - how is the view() family of functions implemented? Will it just work to use these methods on subdataframes?

Good point. view just creates a SubDataFrame, so we will need to delegate methods to the parent DataFrame.
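
A possible shape for that delegation, assuming the meta accessor discussed above and a parent accessor for SubDataFrame (sketch only):

# Hypothetical: a SubDataFrame stores no metadata of its own; queries are
# forwarded to the parent DataFrame.
meta(sdf::SubDataFrame, key; default=nothing) =
    meta(parent(sdf), key; default=default)
meta(sdf::SubDataFrame, column::Symbol, key; default=nothing) =
    meta(parent(sdf), column, key; default=default)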

@pdeffebach
Contributor

It would be useful to have a look at whether/how other software does this. AFAICT neither dplyr, data.table nor Pandas (pandas-dev/pandas#2485) support meta-data,

R supports metadata via attributes. It's just an array of Strings. Granted, in R you can add an attribute to any object, but it is really only useful for dataframes. I do use attributes in R, however: it's easy to write a simple plotting function that calls attributes(df$x)[1].

@dmbates
Contributor

dmbates commented May 27, 2018

@pdeffebach I would take issue with your statement that R attributes are only useful for data frames. Any complexity of R (and before R, S) objects is determined by attributes. Dimensions of matrices and higher-order arrays, for example, are stored as attributes. The only primitive R data objects, in the sense of SEXPRECs, are fixed-length vectors of 32-bit integers, fixed-length vectors of 64-bit floats, fixed-length vectors of interned character strings, and fixed-length vectors of pointers to SEXPRECs. Everything else is coded in the attributes.

I don't think that R attributes are a good model for this facility.

@dmbates
Contributor

dmbates commented May 27, 2018

I think that @nalimilan's suggestion of having an extractor that returns a Dict is the best way to go. Could I make a plea for it to be named metadata instead of meta? meta could be about metaprogramming, etc. Especially if the use is to be a pattern like m = metadata(df); m[:key] etc. I think the clarity of the name outweighs the cost of typing 4 more characters.

@nalimilan
Member

@gcalderone Maybe wait until the resolution of the discussion in the Discourse thread to avoid wasting your time if we choose the alternative approach (storing meta-data in vectors).

Reference to previous discussion: #35.

@gcalderone
Author

Implemented metadata copying while copying/slicing/creating a view of the DataFrame. I also added a showmeta method to pretty print metadata contents.

Examples:

# Create a main DataFrame
df = DataFrame(:col1=>1, :col2=>1:10, :col3=>"dummy")
metaset!(df, :key1, "val1")
metaset!(df, :col1, :key1, "val1")
showmeta(df) # pretty print metadata

# Copy
c = copy(df)  # both data and metadata are copied
c[:col1] *= 2
metaset!(c, :key1, "UPDATED")
metaset!(c, :col1, :key1, "UPDATED")
showmeta(c)  # updated
showmeta(df) # unchanged

# Slice
sub = df[2:5, [:col1]]
metaset!(sub, :key1, "UPDATED")
metaset!(sub, :col1, :key1, "UPDATED")
showmeta(sub)  # updated
showmeta(df) # unchanged

# View
vv = view(df, 2:5, [:col1])
metaset!(df, :key1, "UPDATED")
metaset!(df, :col1, :key1, "UPDATED")
showmeta(vv)  # updated
showmeta(df)  # updated

# Insert a DataFrame
add = DataFrame(:col4=>rand(size(df)[1]))
metaset!(add, :key1, "ADDITIONAL")
metaset!(add, :col4, :key1, "ADDITIONAL")
df[[:col1]] = add
showmeta(df)

# Merge two DataFrame objects
merge!(df, add)
showmeta(df)

# Empty metadata dictionaries
emptymeta!(df)
showmeta(df)

@nalimilan
Member

Sorry for the delay. After thinking a bit more about this, I think it would be cleaner to define a DataFrameMetadata type which would be handled a little like Index in most functions: for example getindex(::DataFrame, ::Vector{Symbol}) would call getindex on it and pass the resulting DataFrameMetadata object to the DataFrame constructor. Global meta-data would always be preserved when indexing that object.

Then, as noted above, you don't need to add all these new functions: people can get and set meta-data using getindex, setindex! and keys, and it can be printed via the standard show function.

The definition of the type and of its methods should go to a separate src/dataframemeta/dataframemeta.jl file. Please also drop MetaKey in favor of Symbol. Better keep things simple.

@pdeffebach
Contributor

pdeffebach commented Jun 8, 2018

Okay as far as I can tell this means

  1. Adding a new DataFrame constructor method in the type definition to allow a constructor with a metadata type, but (thanks to multiple dispatch) leave the other constructor alone
  2. Update copy and deepcopy to use this new constructor
  3. Update getindex

Then we can do this with just function metadata(df) = getfield(df, :metadata) and

Base.copy(df::DataFrame) = DataFrame(copy(columns(df)), copy(index(df)), copy(metadata(df)))

Without the introduction of copymeta! etc.

@pdeffebach
Contributor

pdeffebach commented Jun 8, 2018

I also see what you mean with regard to MetaData behaving like an Index. We need to define names! etc. to act on a MetaData type and have that be called in abstractdataframe.jl. However, having a call to names!(::MetaData) means that any AbstractDataFrame (should someone define their own type <: AbstractDataFrame) would have to override that method.

The way that this PR would get around this is by having colindex know a decent amount about the metadata of a dataframe. It seems like a more streamlined approach would be to keep them entirely separate, but edit metadata when you edit colindex, where relevant (I'm sure there are other examples outside of renaming).

@gcalderone @nalimilan should we finalize the changes to dataframes.jl before moving forward with how MetaData might actually behave?

@nalimilan
Member

The way that this PR would get around this is by having colindex know a decent amount about the metadata of a dataframe. It seems like a more streamlined approach would be to keep them entirely separate, but edit metadata when you edit colindex, where relevant (I'm sure there are other examples outside of renaming).

Yes, better keep them separate. I'm not sure why renaming should affect meta-data: column-specific meta-data should be stored by position rather than by name, so that it only needs to be adjusted when reordering columns (which is less frequent).
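
For example, with position-based storage a column permutation only requires permuting each stored vector, and renaming requires no metadata change at all (hypothetical names, assuming the key-to-vector layout discussed above):

# Hypothetical: adjust per-column metadata stored by position when the columns
# of the DataFrame are permuted.
function permutecolmeta!(colmeta::Dict{Symbol, Vector{Any}}, perm::AbstractVector{Int})
    for (key, vals) in colmeta
        colmeta[key] = vals[perm]   # reorder the per-column slots for this key
    end
    return colmeta
end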

@gcalderone @nalimilan should we finalize the changes to dataframes.jl before moving forward with how MetaData might actually behave?

Yeah, I guess it makes sense to start with a minimal implementation. However it should probably support at least a few operations so that it's testable and at least minimally usable.

@pdeffebach
Contributor

pdeffebach commented Jun 9, 2018

column-specific meta-data should be stored by position rather than by name, so that it only needs to be adjusted when reordering columns (which is less frequent).

I was trying to implement a lot of the Index functions on a MetaData type that was just two Dicts, and ran into a problem with names!. There are a lot of functions in DataFrames that are agnostic to the current names of the dataframe, and names! replaces all the column names at once, so there is no loop of individual renames to hook into.

Then I realized that's the whole point of the Index, to keep track of all this. Consequently, I'm going to try an implementation where MetaData is just an array of Dict{Symbol, String}. Then I can use the exact same architecture that relates colindex with columns to relate colindex with metadata.

I will worry about global metadata later since that seems like it will be easier to add after the indexing and renaming functions are taken care of.

edit: It should probably be a Dict{Int, Dict{Symbol, String}} eventually so we don't automatically create a bunch of Dicts when a DataFrame is created.

@nalimilan
Member

I was trying to implement a lot of the Index functions on a MetaData type that was just two Dicts, and ran into a problem with names!. There are a lot of functions in DataFrames that are agnostic to the current names of the dataframe, and names! replaces all the column names at once, so there is no loop of individual renames to hook into.

Then I realized that's the whole point of the Index, to keep track of all this. Consequently, I'm going to try an implementation where MetaData is just an array of Dict{Symbol, String}. Then I can use the exact same architecture that relates colindex with columns to relate colindex with metadata.

Sorry, I don't understand. Wouldn't storing column-specific meta-data in a vector allow precisely to be agnostic to column names and only use their integer indices instead? Then you can use the index to get the integer index from the name, which the code already does anyway.

@pdeffebach
Contributor

pdeffebach commented Jun 9, 2018

Then you can use the index to get the integer index from the name, which the code already does anyway.

Yes, that is exactly what I am trying to implement. A first step will be to have MetaData be an array of dicts, one for every column, and work from there. That way we can use Base getindex functions etc. exactly like the columns(df).

In the future, we might not want to initialize an array of empty Dicts every time, because that might get wasteful for DataFrames with large numbers of columns, and instead have a cleverer approach that only creates Dicts for the variables the user actually wants to annotate. Every column in a DataFrame has a column of data, but we don't necessarily want every column in a DataFrame to have metadata.

However this just means clever versions of getindex, permute etc. for the metadata type. We should write code in dataframe.jl that works assuming MetaData contains a vector of Dicts, then change the way MetaData actually works later.

@nalimilan
Member

OK. But then why use dicts when vectors would be simpler and more efficient?

@pdeffebach
Contributor

pdeffebach commented Jun 9, 2018

The purpose of using Dicts as a whole was so that people could add particular information like :unit, maybe :source etc. So metadata for each column is a dictionary.

I am imagining a "default" :label key in these Dicts so that other packages can get printable labels easily.

I also changed my mind about a vector of Dicts anyway, because then whenever we add a new column, we would have to push an empty Dict onto the array of Dicts. Basically we would have to touch all the setindex! code. It's easier to have the user make a new dictionary when they want to.

@nalimilan
Member

We clearly need a dictionary to map the user-defined meta-data types to their values. But it's better to store the values for each type of meta-data in a vector with one entry per column. Yes, you need to resize the vector each time you add a column, but calling getindex will be much less expensive than with one dict per column (copying lots of dicts is going to be slow, and the cost increases with the number of columns), and it's a much more frequent operation.

@pdeffebach
Contributor

pdeffebach commented Jun 9, 2018

Are you saying a vector for each metadata field, like

units = array of strings
sources = array of strings

If that's the case, I'm not sure that's a great idea, because unit might not be applicable to many columns. Rather, the user will add unit metadata to columns on an as-needed basis. If a user wants to add a metadata entry, say :transformation, to just one column, that would involve creating a whole new array of strings with only one non-empty value.

Here is what I have written. I just finished the getindex implementation and need to add a few more metadata entry functions and tests before I can make a PR.

# Defining behavior for DataFrames metadata
abstract type AbstractMetaData end

mutable struct MetaData <: AbstractMetaData
    # column index => metadata entries for that column
    columndata::Dict{Int, Dict{Symbol, String}}
end

MetaData() = MetaData(Dict{Int, Dict{Symbol, String}}())

# Build a MetaData from an array of per-column Dicts (position i => x[i])
function MetaData(x::Array{Dict{Symbol, String}})
    columndata = Dict{Int, Dict{Symbol, String}}()
    for i in eachindex(x)
        columndata[i] = x[i]
    end
    MetaData(columndata)
end

# Subset the metadata for the selected columns; columns without entries are
# skipped, and the kept Dicts are re-indexed 1, 2, ... in selection order
function Base.getindex(x::MetaData, col_inds::AbstractVector)
    dictarray = [x.columndata[i] for i in col_inds if haskey(x.columndata, i)]
    MetaData(dictarray)
end

So if we have a dataframe

│ Row │ x1       │ x2        │ x3       │ x4       │ x5       │ y       │
├─────┼──────────┼───────────┼──────────┼──────────┼──────────┼─────────┤
│ 1   │ 0.384981 │ 0.760864  │ 0.432747 │ 0.277874 │ 0.768403 │ 2.5135  │
│ 2   │ 0.247484 │ 0.543756  │ 0.30999  │ 0.623039 │ 0.181284 │ 69.9532 │
│ 3   │ 0.937264 │ 0.392081  │ 0.803099 │ 0.855908 │ 0.164826 │ 76.241  │
│ 4   │ 0.510181 │ 0.818951  │ 0.464661 │ 0.680837 │ 0.575248 │ 24.2244 │
│ 5   │ 0.936744 │ 0.0300326 │ 0.568476 │ 0.188381 │ 0.135375 │ 59.0194 │

Our metadata.columndata looks like this:

Dict(
    1 => Dict(:label => "label for x1"),
    2 => Dict(:label => "label for x2"),
    ...
)

and we want df[[:x3, :x4, :x5]], then we go through the above columndata Dict and make an array of the Dicts for each column, but only if their column index is 3, 4, or 5.

Then there is a constructor that makes a new MetaData instance based on that array of Dicts. This allows for the re-indexing to happen just as though we had a vector of Dicts, but without the overhead of having a vector of Dicts. I was under the impression this would be more performant, since only the objects that are needed are created.

If I understand correctly, you are saying that it is this copying of Dicts into an array that is expensive and undesirable. But can it really be less performant than a new array for each metadata field (units, etc.)? Perhaps the answer hinges on our expectations for the ratio of unlabeled to labeled variables and the ratio of common to unique metadata fields.

@nalimilan
Member

If that's the case, I'm not sure that's a great idea, because unit might not be applicable to many columns. Rather, the user will add unit metadata to columns on an as-needed basis. If a user wants to add a metadata entry, say :transformation, to just one column, that would involve creating a whole new array of strings with only one non-empty value.

That's not a big deal. Typically that will use 64 bits per column, which is nothing compared to the size of the columns themselves. And dicts consist of three arrays, using 336 bits by default even when empty for Dict{Symbol,String}.

If I understand correctly, you are saying that it is this copying of Dicts into an array that is expensive and undesirable. But can it really be less performant than a new array for each metadata field (units, etc.)? Perhaps the answer hinges on our expectations for the ratio of unlabeled to labeled variables and the ratio of common to unique metadata fields.

Exactly. I assume it's unlikely you will use lots of meta-data fields that will differ from one column to another. As I noted, copying a dict involves copying three vectors plus five Ints. Copying one vector for each meta-data field should be cheap compared to that, unless you have many more fields than columns.

@pdeffebach
Contributor

pdeffebach commented Jun 10, 2018

Thank you for the guidance!

So my impression is that your vision of MetaData would be something like:

mutable struct MetaData
    columndata::Dict{Symbol, Vector{String}}  # maybe Vector{Union{Void, String}}
end

# addmeta(df, :var, :unit, "km/hr") would then do roughly:
colindex = index(df)[:var]                 # column index of :var in df
if !haskey(columndata, :unit)              # adding `unit` for the first time
    columndata[:unit] = ["" for i in 1:ncol(df)]  # maybe `nothing` or something
    # or even a sparse vector if we were really worried about memory
end
columndata[:unit][colindex] = "km/hr"

Then getindex etc. will essentially just be broadcast or mapped over the columndata dict.

Unless you wanted a more DataFrames-like scenario where there is a Dict that just maps symbols to indices, and then another array of arrays (or a matrix) that actually contains the information. I'm not 100% sure if it's just having many Dicts that is undesirable, or if having one larger Dict is also a problem and it's better to store the info in a better format as well.


@nalimilan
Member

Yes, more or less something like that. It's fine to have a single dict to map meta-data fields to the vectors that hold the values.
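
Putting the pieces together, the agreed-on layout could look roughly like the sketch below; all names are placeholders rather than a final API, and values are restricted to String only for simplicity:

# Placeholder sketch: one dict maps each metadata field to a vector holding one
# value per column (in column order); `nothing` marks "not set", and the vector
# for a field is only created the first time that field is used.
mutable struct MetaData
    columndata::Dict{Symbol, Vector{Union{Nothing, String}}}
    ncols::Int
end

MetaData(ncols::Int) =
    MetaData(Dict{Symbol, Vector{Union{Nothing, String}}}(), ncols)

function setcolmeta!(md::MetaData, field::Symbol, colind::Int, value::String)
    vals = get!(md.columndata, field) do
        Vector{Union{Nothing, String}}(nothing, md.ncols)   # created lazily per field
    end
    vals[colind] = value
end

getcolmeta(md::MetaData, field::Symbol, colind::Int) =
    haskey(md.columndata, field) ? md.columndata[field][colind] : nothing

# Usage:
# md = MetaData(3)
# setcolmeta!(md, :unit, 2, "km/hr")
# getcolmeta(md, :unit, 2)    # => "km/hr"
# getcolmeta(md, :unit, 1)    # => nothing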

@khughitt

khughitt commented Oct 6, 2018

This may be outside the desired scope of this effort, but have you considered extending support to include both column and row metadata?

An example of where this would be useful would be something like genomic data where each row might be a gene, and each column a sample or patient. Having associated metadata for both would be useful.

@nalimilan
Member

Per-row metadata is just... a column? Am I missing something? Do you know of other software which supports this?

@khughitt

khughitt commented Oct 6, 2018

@nalimilan I guess I am thinking of cases where you have something more like homogeneous numeric data in most columns, and you wouldn't necessarily want to mix in "metadata" columns that represent something else. In this case, though, I suppose an annotated multidimensional array is really what I am looking for...

As far as something already implementing this goes, I've started work on something like this in R, but it is still very immature and far from perfect, which is why I'm exploring what's been done in other communities ;)

@nalimilan
Member

I guess you could use per-column meta-data to indicate which columns are "real data" and which are "metadata". :-)

We haven't implemented anything to select columns based on criteria (regexes, name ranges, types...) yet, but we should certainly investigate this area (like dplyr and JuliaDB).

@bkamins
Member

bkamins commented Jul 24, 2019

@gcalderone + @pdeffebach

I am not entirely clear on the relation between this PR and #1458.
If I understand things correctly, they overlap (but I might have missed something - if so, please correct me).

So my question is: should one of them be left open, or should both be closed and a new PR opened, rebased onto v0.19 and implementing the target functionality (in my experience that is sometimes simpler than trying to update old PRs)?

@pdeffebach
Contributor

This PR should be closed, as #1458 supersedes it.

Yes, I think we should open a new PR attempting this again, with the progress of #1458 added on top.

@bkamins
Member

bkamins commented Jul 24, 2019

OK - so I am closing this and leaving #1458 open as a "placeholder" until a new PR is opened (at which point #1458 should be closed).

@bkamins closed this Jul 24, 2019