-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A second attempt at DataFrames Metadata #1429
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! It's really nice to have metadata
behave so close to index
, this makes the code easier to follow. My main remark is that we'd better avoid exporting any new convenience functions for now, and instead rely on calling getindex
and setindex!
on the MetaData
object itself (see inline comment).
src/dataframe/dataframe.jl
Outdated
@@ -292,7 +302,8 @@ end | |||
function Base.getindex(df::DataFrame, row_inds::AbstractVector, col_inds::AbstractVector) | |||
selected_columns = index(df)[col_inds] | |||
new_columns = Any[dv[row_inds] for dv in columns(df)[selected_columns]] | |||
return DataFrame(new_columns, Index(_names(df)[selected_columns])) | |||
new_metadata = metadata(df)[selected_columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new_metadata
isn't used. BTW, you could do the same in the getindex
method above, that's more similar to what we do for the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I fixed this by calling a constructor with new_metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand: new_metadata
still isn't used (and that's fine, just remove it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
src/dataframe/dataframe.jl
Outdated
@@ -709,6 +724,7 @@ function Base.insert!(df::DataFrame, col_ind::Int, item::AbstractVector, name::S | |||
end | |||
insert!(index(df), col_ind, name) | |||
insert!(columns(df), col_ind, item) | |||
insert!(metadata(df), col_ind) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better insert nothing
for clarity. Then you can remove that insert!
method for MetaData
and always require a value to be passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. And I think I removed all references to making things "String".
Though thinking about the broader uses of labels, like with @df
in statsplots, there is value in at least a more tightly controlled :label
field, which must be a string. This was other packages can interface with :label
and know what they are getting.
src/dataframe/dataframe.jl
Outdated
############################################################################## | ||
|
||
""" | ||
addlabel!(df::DataFrame, var::Symbol, label::String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not add special functions like that. As noted on the other issue, better just export metadata(::DataFrame)
, and implement methods like getindex(::MetaData, field::Symbol[, column::ColumnIndex])
and setindex!(::MetaData, value::Any, field::Symbol[, column::ColumnIndex])
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just for now, or do you envision this in the future? Because an idiomatic way to set labels, particularly with chaining in mind, first argument is a dataframe, returns a dataframe, seems important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing with this approach is that metadata
doesn't know anything about the symbols and the colindex
of the dataframe. So any function would have to include the dataframe
in it, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just for now, or do you envision this in the future? Because an idiomatic way to set labels, particularly with chaining in mind, first argument is a dataframe, returns a dataframe, seems important.
I don't know yet, that's why I'd rather start with the strictly minimal API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm changed the addlabel
to something more generic. tbh I just had it there because I was lazy when testing it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also changed getmeta
and showmeta
to be more generic. showmeta
now returns a dataframe, but its probably a bad idea cause dataframes aren't that readable for long strings. But it was easy to write and not a horrible idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said, let's start with a minimal implementation. We can always add convenience method later if that's useful.
I'd also rather rename getmeta
to metadata
and addmeta!
to setmetadata!
or metadata!
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what you are saying more about setindex
. Might be nice to write
metadata(df, :x1, :label) = "A variable label"
work.
src/other/metadata.jl
Outdated
abstract type AbstractMetaData end | ||
|
||
mutable struct MetaData <: AbstractMetaData | ||
columndata::Dict{Symbol, Vector{String}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use four-space indent (here and elsewhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed my sublime settings.
src/other/metadata.jl
Outdated
end | ||
|
||
function newfield!(x::MetaData, ncol::Int, field::Symbol,) | ||
x.columndata[field] = ["" for i in 1:ncol] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better use nothing
than the empty string. It would also be nice to support any type, not just String
. That shouldn't make the code really more complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added. Vector{Any}([nothing for i in 1:N])
this way the user can include arbitrary things. However I am kind of worried that the ability to include arbitrary objects in metadata will cause people to abuse metadata to make it hold actual (non-meta) data.
src/dataframe/dataframe.jl
Outdated
@@ -749,6 +765,10 @@ merge!(df, df2) # column z is added, column id is overwritten | |||
""" | |||
function Base.merge!(df::DataFrame, others::AbstractDataFrame...) | |||
for other in others | |||
notinother = setdiff(names(other), names(df)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would probably be cleaner to define merge!(::MetaData, ::MetaData...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to make all the code in metadata.jl
not know anything about the dataframe attached to it. Since this code is really about working with the distinct names of both dataframes, i put it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'm a bit concerned about the fact that this allocates, even when one isn't using meta-data at all. We should really ensure there's a minimal or zero overhead in that case. Maybe we can handle that by checking whether metadata
is empty first?
In terms of code organization, maybe this could be moved to a function taking Index
objects for two data frames, and it could also be used for join
operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a function diff_indices
in index.jl that returns just the indices of the columns in one dataframe that aren't in the other. This might be useful in joins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I meant to start a review.
I have addressed your comments, but unfortunately couldn't change everything. Mostly because it's hard to think of adding metadata to select variables without using a wrapper function due to finding the right index to use for the column name.
A large issue to start thinking about is operations that call constructors. join
should probably have metadata persist, but it doesn't currently because it calls a new constructor. Saving the metadata from both dataframes and tacking it on after the constructor is called seems inelegant.
To clarify my above comments, if we had the user use
We would run into the problem where Perhaps we should have a setup similar to a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would run into the problem where metadata objects don't know about the names of the columns in the dataframe. getfield(metadata(df), :columndata) will only return a Dict that is a bunch of arrays. So info in the above argument would have to be a vector of the right length, and with all the existing metadata just right.
Perhaps we should have a setup similar to a dataframerow, so that metadata can see the dataframe it is attached to? But this hurts us because it means metadata(df) behaves less like columns(df).
Good points. Let's take the other approach then: make the MetaData
type invisible to the user, and provide metadata
and metadata!
(or setmetadata!
?) methods to set it (the internal metadata
function can be renamed to e.g. meta
).
src/dataframe/dataframe.jl
Outdated
@@ -292,7 +302,8 @@ end | |||
function Base.getindex(df::DataFrame, row_inds::AbstractVector, col_inds::AbstractVector) | |||
selected_columns = index(df)[col_inds] | |||
new_columns = Any[dv[row_inds] for dv in columns(df)[selected_columns]] | |||
return DataFrame(new_columns, Index(_names(df)[selected_columns])) | |||
new_metadata = metadata(df)[selected_columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand: new_metadata
still isn't used (and that's fine, just remove it).
src/dataframe/dataframe.jl
Outdated
end | ||
|
||
function DataFrame(columns::Vector{Any}, colindex::Index, metadata::MetaData) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace this with metadata::MetaData=MetaData()
in the constructor above. Here you're bypassing all consistency checks done by the existing constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
Let me know if we should add metadata as an optional argument for all existing constructors. This would require a good deal of consistency checks though.
src/other/metadata.jl
Outdated
return "Field does not exist" | ||
end | ||
end | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove empty lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/other/metadata.jl
Outdated
if haskey(x.columndata, field) | ||
return x.columndata[field][col_ind] | ||
else | ||
return "Field does not exist" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throw an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I throw an error. I'll need help on error type though.
src/dataframe/dataframe.jl
Outdated
############################################################################## | ||
|
||
""" | ||
addlabel!(df::DataFrame, var::Symbol, label::String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just for now, or do you envision this in the future? Because an idiomatic way to set labels, particularly with chaining in mind, first argument is a dataframe, returns a dataframe, seems important.
I don't know yet, that's why I'd rather start with the strictly minimal API.
src/other/metadata.jl
Outdated
end | ||
|
||
# For creating a new column | ||
function addcolumn!(x::MetaData) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated method. Also as noted above it should push nothing
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
src/other/metadata.jl
Outdated
end | ||
|
||
function newfield!(x::MetaData, ncol::Int, field::Symbol,) | ||
x.columndata[field] = Vector{Any}([nothing for i in 1:ncol]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than using Any
, I think Union{eltype(info), Nothing}
would be more appropriate. That would be more efficient, and it's probably flexible enough for meta-data. We can always add API to choose a different type later if it turns out to be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
src/other/metadata.jl
Outdated
Base.:(==)(x::MetaData, y::MetaData) = isequal(x, y) | ||
|
||
Base.copy(x::MetaData) = MetaData(copy(x.columndata)) | ||
Base.deepcopy(x::MetaData) = MetaData(copy(x.columndata)) # field is immutable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's immutable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just copying index.jl
. I'll delete it.
src/other/metadata.jl
Outdated
|
||
MetaData() = MetaData(Dict{Symbol,Vector}()) | ||
|
||
Base.isequal(x::MetaData, y::MetaData) = isequal(x.columndata, y.columndata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need these definitions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought they were standard boilerplate for new structs. they are deleted now.
src/dataframe/dataframe.jl
Outdated
@@ -749,6 +765,10 @@ merge!(df, df2) # column z is added, column id is overwritten | |||
""" | |||
function Base.merge!(df::DataFrame, others::AbstractDataFrame...) | |||
for other in others | |||
notinother = setdiff(names(other), names(df)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'm a bit concerned about the fact that this allocates, even when one isn't using meta-data at all. We should really ensure there's a minimal or zero overhead in that case. Maybe we can handle that by checking whether metadata
is empty first?
In terms of code organization, maybe this could be moved to a function taking Index
objects for two data frames, and it could also be used for join
operations?
src/dataframe/dataframe.jl
Outdated
@@ -279,7 +284,8 @@ end | |||
function Base.getindex(df::DataFrame, row_ind::Real, col_inds::AbstractVector) | |||
selected_columns = index(df)[col_inds] | |||
new_columns = Any[dv[[row_ind]] for dv in columns(df)[selected_columns]] | |||
return DataFrame(new_columns, Index(_names(df)[selected_columns])) | |||
# no subsetting required for metadata cause rows dont matter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment doesn't sound very useful, meta-data just works as the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/dataframe/dataframe.jl
Outdated
@@ -749,6 +759,20 @@ merge!(df, df2) # column z is added, column id is overwritten | |||
""" | |||
function Base.merge!(df::DataFrame, others::AbstractDataFrame...) | |||
for other in others | |||
d = diff_indices(index(other), index(df)) | |||
#= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go to the docstring for append!
. Also, better find another name since it doesn't follow the signature of the generic append!
.
src/other/index.jl
Outdated
""" | ||
Returns returns the indices of the columns in x that are not in y. | ||
""" | ||
function diff_indices(x::Index, y::Index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, when I suggested having a separate function for this, I was thinking about a MetaData
-aware function which would avoid calling setdiff
if the meta-data is empty. But it can't live in index.jl, which is only about Index
. Since the function is just x[setdiff(names(x), names(y))]
, it's not very useful. Maybe just move all that stuff into append!
and pass it the two Index
objects. That way it can be a no-op when meta-data is empty. I guess it depends on what can be shared with the join
code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a new function in metadata
called merge!
that takes in indices with DataFrames.
src/dataframe/dataframe.jl
Outdated
############################################################################## | ||
|
||
""" | ||
addlabel!(df::DataFrame, var::Symbol, label::String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said, let's start with a minimal implementation. We can always add convenience method later if that's useful.
I'd also rather rename getmeta
to metadata
and addmeta!
to setmetadata!
or metadata!
.
src/other/metadata.jl
Outdated
newfield!(x, ncol, field, info) | ||
end | ||
x.columndata[field][col_ind] = info | ||
return nothing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could as well remove this and return info
, that can be useful for chaining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/dataframe/dataframe.jl
Outdated
""" | ||
function addmeta!(df::DataFrame, var::Symbol, field::Symbol, info) | ||
ind = index(df)[var] | ||
# pass the number of columns to the function so that it can create a new array of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds OK to me (and no need for a comment).
BTW, these functions would be turned into one-liners if you skip creating the ind
variable.
src/dataframe/dataframe.jl
Outdated
@@ -147,7 +149,8 @@ function DataFrame(; kwargs...) | |||
end | |||
|
|||
function DataFrame(columns::AbstractVector, | |||
cnames::AbstractVector{Symbol}=gennames(length(columns)); | |||
cnames::AbstractVector{Symbol}=gennames(length(columns)), | |||
metadata = MetaData(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused argument. Anyway for now MetaData
is purely internal, like Index
, so it shouldn't appear here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay this part is still a bit confusing for me. but i think this makes sense.
src/other/metadata.jl
Outdated
|
||
|
||
function newfield!(x::MetaData, ncol::Int, field::Symbol, info) | ||
x.columndata[field] = Vector{Union{typeof(info), Nothing}}([nothing for i in 1:ncol]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Union{typeof(info), Nothing}[nothing for i in 1:ncol]
avoids a copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
src/other/metadata.jl
Outdated
abstract type AbstractMetaData end | ||
|
||
mutable struct MetaData <: AbstractMetaData | ||
columndata::Dict{Symbol, Vector} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct
would be enough, right?
Also, columndata
is a bit of a weird name for this, since it sounds like there are other fields with non-column data. Maybe just dict
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
src/dataframe/dataframe.jl
Outdated
@@ -263,7 +267,8 @@ end | |||
function Base.getindex(df::DataFrame, col_inds::AbstractVector) | |||
selected_columns = index(df)[col_inds] | |||
new_columns = columns(df)[selected_columns] | |||
return DataFrame(new_columns, Index(_names(df)[selected_columns])) | |||
new_metadata = metadata(df)[selected_columns] | |||
return DataFrame(new_columns, Index(_names(df)[selected_columns]), new_metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better use the same pattern as elsewhere and drop the new_metadata
variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
I think I got my push and pulls confused for this, somehow making me add the new changes to master... let me know what to do to sort it out, because I'm not sure what to do in this situation. I think you reject these and I re-submit? I responded to all the changes, but there are a few issues to work on.
Major issue still to come is |
Better continue the conversation in this PR. You should be able to fix this with |
i did a fetch, a rebase, and then i manually resolved changes and conflicts. I hope this works. Last week i realized i was doing too much on In the future, I have my fork with a |
@nalimilan i re-organized everthing with git to try and fix this. In the process I guess I closed this branch. If it's okay, can I submit another PR? I have all the code still, and it is now up to date with current master. I can go through and add comments where you left off too, to make the transition easier. |
You should have been able to push your branch to this PR even if locally it has a different name, but now that the PR has been closed GitHub won't leave us reopening it anyway for strange reasons, so you'll have to file another one. |
This is another attempt at adding metadata support to DataFrames, modeled after #1413. The ultimate api for metadata between this approach and that one is the same. Users will be able to create arbitrary metadata fields, with a special case for a
label
that hopefully other packages, like plotting, will be exposed to.However the implementation of metadata differs substantially. A
MetaData
type is just a wrapper for aDict
fromSymbol
toVector{String}
. The goal is fordf.metadata
to behave exactly likedf.columns
. This means that theindex.jl
code does not need to be changed at all.getindex
andsetindex
will work the same fordf.metadata
as fordf.columns
, enabling code that looks like this:I have deliberately touched only a small amount of code for this PR.
src/other/metadata/jl
which defines basic operations likegetindex
,setindex
andappend
for metadata.dataframes
, such that a new dataframe always creates a dataframe with empty metadata.getindex
,setindex
etc. such that any subsetting, adding, and merging will work and preserve metadata.addlabel!
,showlabel
andshowlabels
. While adding arbitrary metadata fields (:unit
etc) is feasible under the current system, I didn't want to complicate the api while we sort out how the interface in general might work.""
. When a new columns gets added to the dataframe, I just push""
onto the end of each vector in themetadata
dictionary.This just a stab at one implementation, and if people decide metadata should be implemented differently, that's fine and there can be another PR for another method.
Appreciate any feedback, thanks.