A second attempt at DataFrames Metadata #1429

pdeffebach · 2018-06-15T17:16:30Z

This is another attempt at adding metadata support to DataFrames, modeled after #1413. The user-facing API for metadata is the same in this approach as in that one. Users will be able to create arbitrary metadata fields, with a special case for a label that other packages, such as plotting packages, will hopefully be able to use.

However, the implementation of metadata differs substantially. A MetaData type is just a wrapper for a Dict from Symbol to Vector{String}. The goal is for df.metadata to behave exactly like df.columns. This means that the index.jl code does not need to be changed at all. getindex and setindex! will work the same for df.metadata as for df.columns, enabling code that looks like this:

function insert!(df, col_ind, item, name)
    insert!(index(df), col_ind, name)
    insert!(columns(df), col_ind, item)
    insert!(metadata(df), col_ind) # the only change that had to be made to the insert! method
end
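
To make the idea concrete, here is a minimal, self-contained sketch of a metadata container that is indexed by column position in the same way as the vector of columns. The ToyMetaData type and its methods are hypothetical stand-ins written for this illustration; they are not the code in this PR.

```julia
# Hypothetical stand-in for the MetaData wrapper described above:
# each field (e.g. :label) maps to one entry per column.
struct ToyMetaData
    columndata::Dict{Symbol, Vector{String}}
end

ToyMetaData() = ToyMetaData(Dict{Symbol, Vector{String}}())

# Subsetting by column positions mirrors subsetting the columns vector.
function Base.getindex(x::ToyMetaData, col_inds::AbstractVector{Int})
    ToyMetaData(Dict{Symbol, Vector{String}}(
        field => vals[col_inds] for (field, vals) in x.columndata))
end

# Inserting a column inserts an empty entry at the same position in every field.
function Base.insert!(x::ToyMetaData, col_ind::Int)
    for vals in values(x.columndata)
        insert!(vals, col_ind, "")
    end
    return x
end

meta = ToyMetaData(Dict(:label => ["Age in years", "Household income"]))
insert!(meta, 2)     # a new, unlabeled column lands in position 2
sub = meta[[1, 3]]   # keep the metadata of the first and third columns
```

Because the container only ever sees column positions, it needs no knowledge of column names, which is what lets it follow the same code paths as the columns.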

I have deliberately touched only a small amount of code for this PR.

- Added a new file src/other/metadata.jl, which defines basic operations like getindex, setindex and append for metadata.
- Added only a single new constructor to DataFrame, so that creating a new dataframe always creates one with empty metadata.
  - This is probably not desirable, but new constructors can be added later.
- Implemented getindex, setindex etc. such that any subsetting, adding, and merging will work and preserve metadata.
- Added only three functions exposed to the user: addlabel!, showlabel and showlabels. While adding arbitrary metadata fields (:unit etc.) is feasible under the current system, I didn't want to complicate the API while we sort out how the interface in general might work.
- Metadata is only strings, and an empty metadata entry is just an empty string "". When a new column gets added to the dataframe, I just push "" onto the end of each vector in the metadata dictionary.
- Global metadata, i.e. metadata that is tied to a dataframe as a whole and not just a column, is not supported, because this will presumably be easy to implement in the future.

This is just a stab at one implementation, and if people decide metadata should be implemented differently, that's fine and there can be another PR for another method.

Appreciate any feedback, thanks.

nalimilan

Thanks! It's really nice to have metadata behave so close to index, this makes the code easier to follow. My main remark is that we'd better avoid exporting any new convenience functions for now, and instead rely on calling getindex and setindex! on the MetaData object itself (see inline comment).

nalimilan · 2018-06-17T15:31:14Z

src/dataframe/dataframe.jl

@@ -292,7 +302,8 @@ end
 function Base.getindex(df::DataFrame, row_inds::AbstractVector, col_inds::AbstractVector)
    selected_columns = index(df)[col_inds]
    new_columns = Any[dv[row_inds] for dv in columns(df)[selected_columns]]
-    return DataFrame(new_columns, Index(_names(df)[selected_columns]))
+    new_metadata = metadata(df)[selected_columns]


new_metadata isn't used. BTW, you could do the same in the getindex method above, that's more similar to what we do for the index.

I think I fixed this by calling a constructor with new_metadata

I don't understand: new_metadata still isn't used (and that's fine, just remove it).

nalimilan · 2018-06-17T15:32:31Z

src/dataframe/dataframe.jl

@@ -709,6 +724,7 @@ function Base.insert!(df::DataFrame, col_ind::Int, item::AbstractVector, name::S
    end
    insert!(index(df), col_ind, name)
    insert!(columns(df), col_ind, item)
+    insert!(metadata(df), col_ind)


Better insert nothing for clarity. Then you can remove that insert! method for MetaData and always require a value to be passed.

Done. And I think I removed all references to making things "String".

Though thinking about the broader uses of labels, like with @df in statsplots, there is value in at least a more tightly controlled :label field, which must be a string. This way other packages can interface with :label and know what they are getting.

nalimilan · 2018-06-17T15:37:26Z

src/dataframe/dataframe.jl

+##############################################################################
+
+"""
+    addlabel!(df::DataFrame, var::Symbol, label::String)


I'd rather not add special functions like that. As noted on the other issue, better just export metadata(::DataFrame), and implement methods like getindex(::MetaData, field::Symbol[, column::ColumnIndex]) and setindex!(::MetaData, value::Any, field::Symbol[, column::ColumnIndex]).

Is this just for now, or do you envision this in the future? An idiomatic way to set labels, particularly with chaining in mind (first argument is a dataframe, returns a dataframe), seems important.

Another thing with this approach is that metadata doesn't know anything about the symbols and the colindex of the dataframe. So any function would have to include the dataframe in it, right?

Is this just for now, or do you envision this in the future? An idiomatic way to set labels, particularly with chaining in mind (first argument is a dataframe, returns a dataframe), seems important.

I don't know yet, that's why I'd rather start with the strictly minimal API.

I've changed addlabel! to something more generic. tbh I just had it there because I was lazy when testing it out.

I also changed getmeta and showmeta to be more generic. showmeta now returns a dataframe, but it's probably a bad idea because dataframes aren't that readable for long strings. But it was easy to write and not a horrible idea.

As I said, let's start with a minimal implementation. We can always add convenience methods later if that's useful.

I'd also rather rename getmeta to metadata and addmeta! to setmetadata! or metadata!.

I understand better now what you are saying about setindex!. It might be nice to have

metadata(df, :x1, :label) = "A variable label"

work.
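
A note on syntax that may help here: Julia reads `f(args...) = value` as a short-form method definition, so that exact spelling cannot be made to store a value. The bracket form `obj[keys...] = value` is the one that lowers to a `setindex!` call. The sketch below shows the bracket form on a hypothetical LabelStore type invented for this illustration.

```julia
# Hypothetical container used only to illustrate indexing syntax.
struct LabelStore
    labels::Dict{Tuple{Symbol, Symbol}, Any}
end

LabelStore() = LabelStore(Dict{Tuple{Symbol, Symbol}, Any}())

# `store[col, field] = value` lowers to `setindex!(store, value, col, field)`.
function Base.setindex!(s::LabelStore, value, col::Symbol, field::Symbol)
    s.labels[(col, field)] = value
    return s
end

# `store[col, field]` lowers to `getindex(store, col, field)`.
Base.getindex(s::LabelStore, col::Symbol, field::Symbol) = s.labels[(col, field)]

store = LabelStore()
store[:x1, :label] = "A variable label"
label = store[:x1, :label]
```

Under that reading, a spelling such as `metadata(df)[:x1, :label] = "A variable label"` would be the closest indexing equivalent, assuming `metadata(df)` returned an object with these methods.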

nalimilan · 2018-06-17T15:37:55Z

src/other/metadata.jl

+abstract type AbstractMetaData end
+
+mutable struct MetaData <: AbstractMetaData
+	columndata::Dict{Symbol, Vector{String}}


Use four-space indent (here and elsewhere).

fixed my sublime settings.

nalimilan · 2018-06-17T15:39:28Z

src/other/metadata.jl

+end
+
+function newfield!(x::MetaData, ncol::Int, field::Symbol,)
+	x.columndata[field] = ["" for i in 1:ncol] 


Better use nothing than the empty string. It would also be nice to support any type, not just String. That shouldn't make the code really more complex.

Added: Vector{Any}([nothing for i in 1:N]). This way the user can include arbitrary things. However, I am kind of worried that the ability to include arbitrary objects in metadata will cause people to abuse metadata to make it hold actual (non-meta) data.

nalimilan · 2018-06-18T19:44:48Z

src/dataframe/dataframe.jl

@@ -749,6 +765,10 @@ merge!(df, df2)  # column z is added, column id is overwritten
 """
 function Base.merge!(df::DataFrame, others::AbstractDataFrame...)
    for other in others
+        notinother = setdiff(names(other), names(df))


It would probably be cleaner to define merge!(::MetaData, ::MetaData...).

I was trying to make all the code in metadata.jl not know anything about the dataframe attached to it. Since this code is really about working with the distinct names of both dataframes, I put it here.

OK. I'm a bit concerned about the fact that this allocates, even when one isn't using meta-data at all. We should really ensure there's a minimal or zero overhead in that case. Maybe we can handle that by checking whether metadata is empty first?

In terms of code organization, maybe this could be moved to a function taking Index objects for two data frames, and it could also be used for join operations?
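
One way to read this suggestion is an early return that skips setdiff entirely when neither side carries metadata. The sketch below is a simplified illustration under that assumption: ToyMeta and mergemeta! are hypothetical names, plain vectors of names stand in for Index objects, and entries of columns that already exist are left untouched.

```julia
# Hypothetical simplified metadata: field name => one entry per column.
const ToyMeta = Dict{Symbol, Vector{Any}}

# Extend `meta` (describing columns `names`) with entries for the columns of
# `othernames` that are new, taking values from `othermeta` where available.
function mergemeta!(meta::ToyMeta, names::Vector{Symbol},
                    othermeta::ToyMeta, othernames::Vector{Symbol})
    # Fast path: nothing to track, so skip setdiff and its allocation.
    isempty(meta) && isempty(othermeta) && return meta
    newnames = setdiff(othernames, names)
    ncol = length(names)
    for field in union(collect(keys(meta)), collect(keys(othermeta)))
        # Fields known only to `othermeta` start as `nothing` for existing columns.
        vals = get!(() -> Any[nothing for _ in 1:ncol], meta, field)
        for name in newnames
            if haskey(othermeta, field)
                j = findfirst(isequal(name), othernames)
                push!(vals, othermeta[field][j])
            else
                push!(vals, nothing)
            end
        end
    end
    return meta
end

meta = ToyMeta(:label => Any["id label", "y label"])
other = ToyMeta(:label => Any["other id", "z label"])
mergemeta!(meta, [:id, :y], other, [:id, :z])
```

With both dictionaries empty, the function returns before computing anything, which is the zero-overhead case the comment asks for.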

I added a function diff_indices in index.jl that returns just the indices of the columns in one dataframe that aren't in the other. This might be useful in joins.

pdeffebach

I'm not sure I meant to start a review.

I have addressed your comments, but unfortunately couldn't change everything. Mostly because it's hard to think of adding metadata to select variables without using a wrapper function due to finding the right index to use for the column name.

A large issue to start thinking about is operations that call constructors. join should probably have metadata persist, but it doesn't currently because it calls a new constructor. Saving the metadata from both dataframes and tacking it on after the constructor is called seems inelegant.

pdeffebach · 2018-06-26T15:49:36Z

To clarify my above comments, if we had the user use

setfield!(metadata(df), :columndata, info...)

We would run into the problem where metadata objects don't know about the names of the columns in the dataframe. getfield(metadata(df), :columndata) will only return a Dict that is a bunch of arrays. So info in the above argument would have to be a vector of the right length, and with all the existing metadata just right.

Perhaps we should have a setup similar to a dataframerow, so that metadata can see the dataframe it is attached to? But this hurts us because it means metadata(df) behaves less like columns(df).

nalimilan

We would run into the problem where metadata objects don't know about the names of the columns in the dataframe. getfield(metadata(df), :columndata) will only return a Dict that is a bunch of arrays. So info in the above argument would have to be a vector of the right length, and with all the existing metadata just right.

Perhaps we should have a setup similar to a dataframerow, so that metadata can see the dataframe it is attached to? But this hurts us because it means metadata(df) behaves less like columns(df).

Good points. Let's take the other approach then: make the MetaData type invisible to the user, and provide metadata and metadata! (or setmetadata!?) methods to set it (the internal metadata function can be renamed to e.g. meta).
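
A rough sketch of what that could look like, on a hypothetical miniature ToyFrame type rather than the real DataFrame: the frame resolves the column name itself, so the internal storage never needs to know about names and never reaches the user. The field layout and the choice to create missing fields on demand are assumptions made for this illustration.

```julia
# Hypothetical miniature data frame: only the pieces needed for the sketch.
struct ToyFrame
    lookup::Dict{Symbol, Int}          # column name => position (stand-in for Index)
    meta::Dict{Symbol, Vector{Any}}    # internal: field => one entry per column
end

ToyFrame(names::Vector{Symbol}) =
    ToyFrame(Dict(n => i for (i, n) in enumerate(names)), Dict{Symbol, Vector{Any}}())

# Setter: looks up the column position, creating the field if needed.
function metadata!(df::ToyFrame, col::Symbol, field::Symbol, info)
    vals = get!(() -> Any[nothing for _ in 1:length(df.lookup)], df.meta, field)
    vals[df.lookup[col]] = info
    return df    # returning the frame allows chaining
end

# Getter: reads one entry by column name and field.
metadata(df::ToyFrame, col::Symbol, field::Symbol) = df.meta[field][df.lookup[col]]

df = ToyFrame([:x1, :x2])
metadata!(df, :x1, :label, "A variable label")
```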

nalimilan · 2018-06-26T16:14:47Z

src/dataframe/dataframe.jl

@@ -292,7 +302,8 @@ end
 function Base.getindex(df::DataFrame, row_inds::AbstractVector, col_inds::AbstractVector)
    selected_columns = index(df)[col_inds]
    new_columns = Any[dv[row_inds] for dv in columns(df)[selected_columns]]
-    return DataFrame(new_columns, Index(_names(df)[selected_columns]))
+    new_metadata = metadata(df)[selected_columns]


I don't understand: new_metadata still isn't used (and that's fine, just remove it).

nalimilan · 2018-06-26T16:23:10Z

src/dataframe/dataframe.jl

    end
+
+    function DataFrame(columns::Vector{Any}, colindex::Index, metadata::MetaData)


Replace this with metadata::MetaData=MetaData() in the constructor above. Here you're bypassing all consistency checks done by the existing constructor.

fixed.

Let me know if we should add metadata as an optional argument for all existing constructors. This would require a good deal of consistency checks though.

nalimilan · 2018-06-26T16:24:47Z

src/other/metadata.jl

+        return "Field does not exist"
+    end
+end
+


Remove empty lines.

nalimilan · 2018-06-26T16:25:01Z

src/other/metadata.jl

+    if haskey(x.columndata, field)
+        return x.columndata[field][col_ind]
+    else
+        return "Field does not exist"


Throw an error?

Now I throw an error. I'll need help on error type though.
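
On the choice of error type, two candidates that exist in Base are ArgumentError and KeyError. The sketch below uses ArgumentError on a hypothetical helper, getmetaentry, written for this illustration; which type suits the package best is left open here.

```julia
# Hypothetical lookup helper; `dict` maps a field name to one entry per column.
function getmetaentry(dict::Dict{Symbol, <:AbstractVector}, field::Symbol, col_ind::Int)
    # Fail loudly instead of returning a message string the caller could
    # mistake for real metadata.
    haskey(dict, field) ||
        throw(ArgumentError("metadata field :$field does not exist"))
    return dict[field][col_ind]
end

dict = Dict(:label => ["first label", "second label"])
entry = getmetaentry(dict, :label, 2)
```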

nalimilan · 2018-06-26T16:27:13Z

src/dataframe/dataframe.jl

+##############################################################################
+
+"""
+    addlabel!(df::DataFrame, var::Symbol, label::String)


Is this just for now, or do you envision this in the future? An idiomatic way to set labels, particularly with chaining in mind (first argument is a dataframe, returns a dataframe), seems important.

I don't know yet, that's why I'd rather start with the strictly minimal API.

nalimilan · 2018-06-26T16:29:17Z

src/other/metadata.jl

+end
+
+# For creating a new column
+function addcolumn!(x::MetaData)


Duplicated method. Also as noted above it should push nothing.

nalimilan · 2018-06-26T16:33:44Z

src/other/metadata.jl

+end
+
+function newfield!(x::MetaData, ncol::Int, field::Symbol,)
+    x.columndata[field] = Vector{Any}([nothing for i in 1:ncol]) 


Rather than using Any, I think Union{eltype(info), Nothing} would be more appropriate. That would be more efficient, and it's probably flexible enough for meta-data. We can always add API to choose a different type later if it turns out to be useful.
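
One detail worth checking when applying this: for a String value, eltype returns Char (the type of the string's elements), while typeof returns String. The short sketch below shows the difference; reading the suggestion as typeof(info) is an interpretation, not something stated in the comment.

```julia
info = "A variable label"

# eltype of a String is the type of its elements, i.e. Char ...
T_elem = Union{eltype(info), Nothing}

# ... while typeof is the type of the value itself, i.e. String.
T_value = Union{typeof(info), Nothing}

# Storage for three columns, all unset to begin with.
labels = T_value[nothing for _ in 1:3]
labels[2] = info    # accepted, because a String fits T_value
```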

nalimilan · 2018-06-26T16:35:43Z

src/other/metadata.jl

+Base.:(==)(x::MetaData, y::MetaData) = isequal(x, y)
+
+Base.copy(x::MetaData) = MetaData(copy(x.columndata))
+Base.deepcopy(x::MetaData) = MetaData(copy(x.columndata)) # field is immutable


What's immutable?

I was just copying index.jl. I'll delete it.

nalimilan · 2018-06-26T16:36:21Z

src/other/metadata.jl

+
+MetaData() = MetaData(Dict{Symbol,Vector}())
+
+Base.isequal(x::MetaData, y::MetaData) = isequal(x.columndata, y.columndata)


Do we really need these definitions?

I thought they were standard boilerplate for new structs. they are deleted now.

nalimilan · 2018-06-26T16:46:31Z

src/dataframe/dataframe.jl

@@ -749,6 +765,10 @@ merge!(df, df2)  # column z is added, column id is overwritten
 """
 function Base.merge!(df::DataFrame, others::AbstractDataFrame...)
    for other in others
+        notinother = setdiff(names(other), names(df))


OK. I'm a bit concerned about the fact that this allocates, even when one isn't using meta-data at all. We should really ensure there's a minimal or zero overhead in that case. Maybe we can handle that by checking whether metadata is empty first?

In terms of code organization, maybe this could be moved to a function taking Index objects for two data frames, and it could also be used for join operations?

nalimilan · 2018-07-09T08:10:40Z

src/dataframe/dataframe.jl

@@ -279,7 +284,8 @@ end
 function Base.getindex(df::DataFrame, row_ind::Real, col_inds::AbstractVector)
    selected_columns = index(df)[col_inds]
    new_columns = Any[dv[[row_ind]] for dv in columns(df)[selected_columns]]
-    return DataFrame(new_columns, Index(_names(df)[selected_columns]))
+    # no subsetting required for metadata cause rows dont matter


This comment doesn't sound very useful, meta-data just works as the index.

nalimilan · 2018-07-09T08:12:03Z

src/dataframe/dataframe.jl

@@ -749,6 +759,20 @@ merge!(df, df2)  # column z is added, column id is overwritten
 """
 function Base.merge!(df::DataFrame, others::AbstractDataFrame...)
    for other in others
+        d = diff_indices(index(other), index(df))
+        #=


This should go to the docstring for append!. Also, better find another name since it doesn't follow the signature of the generic append!.

nalimilan · 2018-07-09T08:14:40Z

src/other/index.jl

+"""
+Returns the indices of the columns in x that are not in y.
+"""
+function diff_indices(x::Index, y::Index)


Sorry, when I suggested having a separate function for this, I was thinking about a MetaData-aware function which would avoid calling setdiff if the meta-data is empty. But it can't live in index.jl, which is only about Index. Since the function is just x[setdiff(names(x), names(y))], it's not very useful. Maybe just move all that stuff into append! and pass it the two Index objects. That way it can be a no-op when meta-data is empty. I guess it depends on what can be shared with the join code.

I made a new function in metadata called merge! that takes in indices with DataFrames.

nalimilan · 2018-07-09T08:18:39Z

src/dataframe/dataframe.jl

+##############################################################################
+
+"""
+    addlabel!(df::DataFrame, var::Symbol, label::String)


As I said, let's start with a minimal implementation. We can always add convenience methods later if that's useful.

I'd also rather rename getmeta to metadata and addmeta! to setmetadata! or metadata!.

nalimilan · 2018-07-09T08:24:28Z

src/other/metadata.jl

+        newfield!(x, ncol, field, info)
+    end
+    x.columndata[field][col_ind] = info
+    return nothing


Could as well remove this and return info, that can be useful for chaining.

nalimilan · 2018-07-09T08:37:27Z

src/dataframe/dataframe.jl

+"""
+function addmeta!(df::DataFrame, var::Symbol, field::Symbol, info)
+    ind = index(df)[var]
+    # pass the number of columns to the function so that it can create a new array of 


That sounds OK to me (and no need for a comment).

BTW, these functions would be turned into one-liners if you skip creating the ind variable.

nalimilan · 2018-07-09T08:39:05Z

src/dataframe/dataframe.jl

@@ -147,7 +149,8 @@ function DataFrame(; kwargs...)
 end

 function DataFrame(columns::AbstractVector,
-                   cnames::AbstractVector{Symbol}=gennames(length(columns));
+                   cnames::AbstractVector{Symbol}=gennames(length(columns)),
+                   metadata = MetaData();


Unused argument. Anyway for now MetaData is purely internal, like Index, so it shouldn't appear here.

okay this part is still a bit confusing for me. but i think this makes sense.

nalimilan · 2018-07-09T08:41:45Z

src/other/metadata.jl

+
+
+function newfield!(x::MetaData, ncol::Int, field::Symbol, info)
+    x.columndata[field] = Vector{Union{typeof(info), Nothing}}([nothing for i in 1:ncol])


Union{typeof(info), Nothing}[nothing for i in 1:ncol] avoids a copy.
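
For comparison, here are the two spellings side by side, plus a third that also builds the array directly. The element type is fixed to Union{String, Nothing} purely for the example.

```julia
ncol = 4
T = Union{String, Nothing}

# Builds a Vector{Nothing} first, then copies it into a new Vector{T}.
a = Vector{T}([nothing for i in 1:ncol])

# Typed comprehension: allocates the Vector{T} directly.
b = T[nothing for i in 1:ncol]

# Array constructor taking `nothing` as the initial value, also direct.
c = Vector{T}(nothing, ncol)
```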

nalimilan · 2018-07-09T08:51:17Z

src/other/metadata.jl

+abstract type AbstractMetaData end
+
+mutable struct MetaData <: AbstractMetaData
+    columndata::Dict{Symbol, Vector}


struct would be enough, right?

Also, columndata is a bit of a weird name for this, since it sounds like there are other fields with non-column data. Maybe just dict?
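
The reason an immutable struct suffices is that immutability only prevents rebinding the field; the Dict the field points to remains mutable. A small sketch with a hypothetical FrozenMeta type:

```julia
# Hypothetical immutable wrapper: the `dict` field cannot be reassigned ...
struct FrozenMeta
    dict::Dict{Symbol, Vector}
end

m = FrozenMeta(Dict{Symbol, Vector}())

# ... but the Dict it points to is itself mutable, so these still work.
m.dict[:label] = ["first", "second"]
push!(m.dict[:label], "third")
```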

nalimilan · 2018-07-09T08:51:49Z

src/dataframe/dataframe.jl

@@ -263,7 +267,8 @@ end
 function Base.getindex(df::DataFrame, col_inds::AbstractVector)
    selected_columns = index(df)[col_inds]
    new_columns = columns(df)[selected_columns]
-    return DataFrame(new_columns, Index(_names(df)[selected_columns]))
+    new_metadata = metadata(df)[selected_columns]
+    return DataFrame(new_columns, Index(_names(df)[selected_columns]), new_metadata)


Better use the same pattern as elsewhere and drop the new_metadata variable.

pdeffebach · 2018-07-14T05:06:54Z

I think I got my push and pulls confused for this, somehow making me add the new changes to master... let me know what to do to sort it out, because I'm not sure what to do in this situation. I think you reject these and I re-submit?

I responded to all the changes, but there are a few issues to work on.

- Allocating for merge!. It seems intuitive that the new metadata should be a combination of the two in the merge. I'm not sure what a non-allocating version would look like, since the current implementation needs full vectors in the columndata Dict (now the dict Dict).
- I like the idea of having metadata(df, :x1, :field) = "my information". This requires overloading setindex! and getindex, right?
- I think that automatically making a new vector if you add metadata to a field that doesn't exist yet is dangerous, because then typos can allocate new fields silently. But this is small and can be addressed later.
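
On the last point, one possible guard against silent typos is to separate creating a field from writing to it, so that writing to an unknown field fails. The names addfield! and setentry! below are hypothetical and the sketch works on a bare Dict; it shows one option, not a proposal for the final API.

```julia
# Explicit creation step: one `nothing` per column.
function addfield!(dict::Dict{Symbol, Vector{Any}}, field::Symbol, ncol::Int)
    dict[field] = Any[nothing for _ in 1:ncol]
    return dict
end

# Strict setter: refuses to create a field implicitly.
function setentry!(dict::Dict{Symbol, Vector{Any}}, field::Symbol, col_ind::Int, info)
    haskey(dict, field) ||
        throw(ArgumentError("unknown metadata field :$field; create it first"))
    dict[field][col_ind] = info
    return info
end

fields = Dict{Symbol, Vector{Any}}()
addfield!(fields, :label, 2)
setentry!(fields, :label, 1, "A variable label")
```

With this split, a misspelled field name raises an error instead of allocating a new vector.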

Major issue still to come is join.

nalimilan · 2018-07-14T13:04:18Z

I think I got my push and pulls confused for this, somehow making me add the new changes to master... let me know what to do to sort it out, because I'm not sure what to do in this situation. I think you reject these and I re-submit?

Better continue the conversation in this PR. You should be able to fix this with git fetch && git rebase -i origin/master, and removing lines which correspond to unrelated commits. Then if everything is OK, do a git push --force. (In general, better work in special branches and keep master in sync with origin.)

pdeffebach · 2018-07-14T22:25:50Z

I did a fetch, a rebase, and then I manually resolved changes and conflicts.

I hope this works. Last week I realized I was doing too much on master, but when I made new branches I didn't set their upstream right, I think.

In the future, I'll have my fork with a master I keep up to date with commits, then make a branch and push and pull from that branch of the fork exclusively.

pdeffebach · 2018-07-17T14:50:57Z

@nalimilan I re-organized everything with git to try and fix this. In the process I guess I closed this branch. If it's okay, can I submit another PR? I have all the code still, and it is now up to date with current master. I can go through and add comments where you left off too, to make the transition easier.

nalimilan · 2018-07-17T19:36:04Z

You should have been able to push your branch to this PR even if it has a different name locally, but now that the PR has been closed, GitHub won't let us reopen it (for strange reasons), so you'll have to file another one.

nalimilan reviewed Jun 18, 2018

pdeffebach commented Jun 25, 2018

nalimilan reviewed Jun 26, 2018

nalimilan reviewed Jul 9, 2018

pdeffebach force-pushed the master branch from 94445ba to 174b924 Compare July 14, 2018 22:09

pdeffebach closed this Jul 16, 2018

pdeffebach force-pushed the master branch from c4265a0 to e731982 Compare July 16, 2018 03:38

pdeffebach mentioned this pull request Jul 18, 2018

Continue adding Metadata to dataframes #1458

Closed
