Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Continue adding Metadata to dataframes #1458

Closed
wants to merge 88 commits into from

Conversation

pdeffebach
Copy link
Contributor

This PR picks up where #1429 left off, before I got messed up and deleted the branch.

Here are some outstanding issues with the current code:

Allocation for merge!

If df_left' has a :label field and df_right does not, then merge! creates a metadata for the unique columns on df_right and adds a :label field populated with nothings. This is undesirable because

  1. It allocates a new Dict, and a new metadata in general
  2. It adds a bunch of nothings to an array.

My intuition is that (1) is bad but (2) is consistent with what merge! does for the columns of dataframes. So I should change the behavior to just append nothings to all the fields in the metadata for df_left. This is particularly important when df_right is empty. But the concept still applies for if df_left and df_right have different fields.

Getindex and setindex

I like the idea of having metadata(df, :x1, :field) = "my information assign metadata and just metadata(df, :x1, :field retrieve metadata. This means overriding getindex and setindex but only allowing them to be accessed via metadata()?

Mis-typing

I think that automatically making a new vector if you add metadata to a field that doesn't exist yet is dangerous, because then typos can allocate new fields silently. But this is small and can be addressed later.

Future work

joining dataframes calls the new constructor, so to make joins work we might have to touch some of the constructor code to allow metadata.

We will also have to profile this extensively to make sure we aren't losing performance for people who don't wish to use metadata.

@pdeffebach pdeffebach changed the title Continue addint Metadata to dataframes Continue adding Metadata to dataframes Jul 18, 2018
@pdeffebach
Copy link
Contributor Author

pdeffebach commented Aug 16, 2018

Resurrecting this a little bit.

What is the API of metadata?
I was thinking about this recently and I thought of the an API centered around one function, metadata that functions like getindex and setindex

metadata(df, :x, :label) = "A label here"
metadata(df, :x, :label) 
> "A label here"

This works very nicely with broadcasting.

metadata.(Ref(df), names(df), :label) 
> an array of labels

There are only a few helper functions I can think of:
What if we want to add a new field to our metadata? Rather than silently create a new field with a call to metadata(df, :x, :newfield) we should have helper functions that

  • Allow the user to allocate a new field to metadata
    • This is because the cost of a new field might be somewhat high for large dataframes.
  • Allow the user to delete a field in metadata
  • Return the fields currently allowed in the metadata.
    • This could probably be done with keys since the metadata is ultimately a Dict. Or we could write a new function for this.

With the third of these bullets, you could use broadcasting again to do

metadata(Ref(df), :x,  getmetadatafields(df))
> Vector{Any} of various fields for variable :x 

That is 4 functions added to the namespace of DataFrames.

@pdeffebach
Copy link
Contributor Author

I'm gonna put together a larger document about how metadata is handled in various systems to try and work out an API.

here is an issue on pandas which is eerily similar to discussions on discourse about this.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay. Below are a few reactions to your points.

I like the idea of having metadata(df, :x1, :field) = "my information" assign metadata and just metadata(df, :x1, :field) retrieve metadata. This means overriding getindex and setindex but only allowing them to be accessed via metadata()?

No, that's not possible. The closest syntax which could work with setindex! is metadata(df)[:x1, :field] = "my information" or metadata(df, :field)[:x1] = "my information", assuming metadata would return an object with a custom type. Else, we have to do metadata!(df, :x1, :field, "my information").

I guess the main advantage of having metadata(df) return a special object is that we could use keys(metdata(df)) to get the list of keys. Then metadata(df)[:somekey] would return a vector or a named tuple, metadata(df)[:, :somecol] would return a named tuple of keys.

Here's a comparison of the syntaxes for common operations:

setindex!/getindex function
Get key for all columns metadata(df)[:somekey] metadata(df, :somekey)
Get key for one column metadata(df)[:somekey, :somecol] metadata(df, :somekey, :somecol)
Get all keys for one column metadata(df)[:, :somecol] metadata(df, :, :somecol)
Set key for one column metadata(df)[:somekey, :somecol] = "x" metadata!(df, :somekey, :somecol, "x")
Get list of keys keys(metadata(df)) metadatakeys(df)

More generally, returning a special object would allow supporting more operations in the future without exporting new functions. For example, rename!(metadata(df), :key1 => :key2) could be used to rename a key.

Mis-typing

I think that automatically making a new vector if you add metadata to a field that doesn't exist yet is dangerous, because then typos can allocate new fields silently. But this is small and can be addressed later.

Well, that's what happens if you mistype df[:mycol] = 1 already. To prevent that, we would have to require people to call addfield! manually, which would be quite strict. Also that would make the API more complex.

joining dataframes calls the new constructor, so to make joins work we might have to touch some of the constructor code to allow metadata.
We will also have to profile this extensively to make sure we aren't losing performance for people who don't wish to use metadata.

Honestly I'm not too concerned about that, as the amount of work to do is really small compared with the join itself. However, in general it would be a good idea to check whether the meta-data is empty, and take a fast path in that case to eliminate the overhead.

notonright = setdiff(keys(leftmeta.dict), keys(rightmeta.dict))

for field in notonleft
newfield!(leftmeta, length(leftindex), field, nothing)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you noted in the issue description, this approach is not very efficient, and it doesn't work for rightmeta since it shouldn't be modified. Another way of doing this is in for key in keys(leftmeta.dict), to check whether the key exists in the left and right data frames. If it exists in both, call vcat as you currently do. If it exists only in one of the data frames, allocate a Vector{Union{Nothing, eltype(key_vec)}} and call copyto! to fill the corresponding entries.

@bkamins
Copy link
Member

bkamins commented Aug 31, 2018

I have not been in this discussion earlier, and as I have mentioned to @pdeffebach ealier it would be great to have a write-up specifying what we want to achieve here.

Looking at the code I understand that we want to be able to attach arbitrary metadata to columns of a DataFrame (and not to a DataFrame as a whole). Right?

I can see that the design is to have a dictionary mapping keys to vectors of values for all columns of a DataFrame. So my first instinct is that we will have to be careful with maintenance and testing because there are many ways you can modify set of columns in a DataFrame and it is easy to loose integrity between metadata and the rest of the DataFrame structure. I can test this when some working prototype is passing CI and some docstrings allow me to be sure about the assumptions.

@nalimilan
Copy link
Member

Yes, that's basically the goal. See #1413 and https://discourse.julialang.org/t/how-to-add-metadata-info-to-a-dataframe/11168.

@bkamins
Copy link
Member

bkamins commented Aug 31, 2018

Thanks! Seeing the volume of the discussion I would recommend to document how it should work (even something like minimal dev docs would be good as we are getting to the point that the design of the whole package might become so complex that we are not sure when we change it one place what other parts might break).

@pdeffebach
Copy link
Contributor Author

@bkamins thanks for the feedback. I have a plan to write a Stata script and some R package and do a battery of transforms and merges and see how they behave, but I haven't had time for it recently. However as far as I can tell there aren't that many systems out there that do this like Stata. So there will likely still be some behaviors unsettled.

I will try and get a non-joining API working based on Milan's comments. Then as you suggested a markdown document will be very helpful. The whole set of changes is simple for now, or at least isolated to a few files (a lot of the conversation is based on my lack of git understanding), but will get far more complicated once joins and constructors get involved. Hopefully I can work on this next week.

@bkamins
Copy link
Member

bkamins commented Aug 31, 2018

Great. So just a few issues that came to my mind and should be decided upon (maybe they are):

  • do we handle makeunique;
  • do we handle column in insertion/deletion;
  • do we handle hcat and vcat.
  • do we handle getting metadata using column name and column number;
  • what do we do with groupby and stack/unstack.

(of course maybe most of those in the first round can be ignored and an appropriate mechanisms added later - the key is to be sure that we do not loose sync between metadata and data)

Also (referring to the discussion above) - if someone wants to assign metadata to the column that does not exist I think an error should be thrown.

@nalimilan
Copy link
Member

I don't think these issues are hard, really. It's fine to drop meta-data in some operations like joins or grouping for now. We just have to handle column insertion and deletion, or the feature would really be confusing (as we would have to drop all meta-data). makeunique is completely orthogonal AFAICT: the column names are not even stored in the meta-data, it's just based on the ordering of columns. So we can automatically support retrieving columns by name or index.

@bkamins
Copy link
Member

bkamins commented Aug 31, 2018

Agreed they do not have to be hard 😄, I just learned the hard way that sometimes simple things are not so simple as they seem.

But I agree that dropping metadata, at least for now, is OK. I would say that most of the time when you have metadata attached to a DataFrame you do not mutate it that much but rather use it as a reference.

@pdeffebach
Copy link
Contributor Author

I would like to continue working on this now that the vcat has been merged and #1620 is about to be merged.

Are we still on board with this general approach? Do either of you trust me to rebase this myself?

@bkamins
Copy link
Member

bkamins commented May 8, 2019

Rebasing src/DataFrames.jl and src/abstractdataframe/join.jl should be easy.
As for src/dataframe/dataframe.jl given the amount of conflicts I would consider (but choose what you prefer :)) to actually use the file from master and update it with the changes you need - this might be simpler and less error prone (especially for constructors).

From my side I would greatly appreciate if you added a section in docs/src/lib/types.md on medatata laying out the design assumptions - this would help me review the implementation. Also please remember to update docs/src/lib/functions.md.

Also adding some info to docs/src/man section would be welcome (but is not crucial for the review).

Thank you!

@asinghvi17
Copy link
Contributor

Bump @pdeffebach

@bkamins
Copy link
Member

bkamins commented Jul 13, 2019

@pdeffebach If you plan to be at JuliaCon2019 this functionality is perfect to work on and discuss during the conference. We could move things forward fast there.

@pdeffebach
Copy link
Contributor Author

I will unfortunately not be at JuliaCon. I have been at home for the summer before grad school so I haven't been thinking much about Julia.

I will try rebasing today at the very least so this PR isn't dead. After that I will work on documentation to write a contract for all this behavior.

@bkamins
Copy link
Member

bkamins commented Jul 16, 2019

Thank you. A "contract" is exactly what we need. As I have said earlier - having thing like https://juliadata.github.io/DataFrames.jl/stable/lib/indexing.html is a must, in my opinion, for a complex change like this because:

  • we are sure that what we design is consistent
  • we can use test driven development (you can write test cases against a contract, even before implementing stuff)
  • it is easier, when reading the actual code, to check if it does what it should

Adding metadata will be a fantastic enchantment and I know that many people would love to have it.

@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@bkamins bkamins added this to the 2.0 milestone Feb 12, 2020
@bkamins
Copy link
Member

bkamins commented Jan 31, 2022

@pdeffebach - probably we should close this and keep the discussion on what we do in #2961. OK?

@pdeffebach
Copy link
Contributor Author

Sounds good!

@pdeffebach pdeffebach closed this Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants