Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decision: remove _names and make names not perform a copy #1655

Closed
bkamins opened this issue Dec 26, 2018 · 16 comments
Closed

Decision: remove _names and make names not perform a copy #1655

bkamins opened this issue Dec 26, 2018 · 16 comments

Comments

@bkamins
Copy link
Member

bkamins commented Dec 26, 2018

We could decide to rename _names to names and avoid making a copy of names held in index (see #1648 (comment)).

@pdeffebach
Copy link
Contributor

pdeffebach commented Dec 26, 2018

names(df) = [:a, :b, :c] always seems intuitive. Would be cool if it were just a getindex / setindex call with some error checking.

@bkamins
Copy link
Member Author

bkamins commented Dec 26, 2018

The idea is that names are stored in the index either as a Vector{Symbol} or as a view (in case of SubDataFrame or DataFrameRow) and we could give a direct access to the user to these objects (this is a bit risky, as they should really be read-only).

@nalimilan maybe names should return a read-only custom version of an AbstractVector? (similar to what we have with codeunits and String)

@nalimilan
Copy link
Member

Yes, the point isn't to support names(df) = ..., which isn't idiomatic Julia AFAIK.

We could indeed return a custom AbstractVector to ensure it cannot be modified, but may be a bit overkill. Or maybe a package should provide a simple ReadOnlyVector wrapper that we could use here and in CategoricalArrays? I think we should ask for opinions among Julia developers, as this issue arises in many other situations.

@bkamins
Copy link
Member Author

bkamins commented Dec 26, 2018

@nalimilan a general purpose package would be good (or a feature in some standard package e.g. from https://github.com/JuliaCollections).
Could you please ask as I will be out for several days and let us know (please mention me in the question so that I can track it)? Thank you.

@bkamins
Copy link
Member Author

bkamins commented Jan 5, 2019

@nalimilan Have you had a chance to discuss this with Base devs? If no I can ask a question on Discourse.

@nalimilan
Copy link
Member

@bkamins bkamins mentioned this issue Jan 15, 2019
31 tasks
@bkamins
Copy link
Member Author

bkamins commented Jan 20, 2019

@nalimilan I have created a proposal of a read-only array type here https://github.com/bkamins/ReadOnlyArrays.jl.

@bkamins
Copy link
Member Author

bkamins commented Jan 24, 2019

Any opinion on https://github.com/bkamins/ReadOnlyArrays.jl?

  1. is it OK?
  2. should we use it (then probably I should register this package)?
  3. or should we move this code somewhere else?

@nalimilan
Copy link
Member

Sounds good, but I'd like a broad agreement that ReadOnlyArray should be the standard way to do this. It would be weird if this particular function in DataFrames was the only user of that feature.

@bkamins
Copy link
Member Author

bkamins commented Feb 6, 2019

Given the response and discussion I would propose that we remove _names and expose internal vector names (like levels in CategoricalArrays.jl). We should also document it.

Probably we should throw a depwarn here then giving a hint to use copy(names(df)) if one wants a copy.

Do we go ahead this way?

@nalimilan
Copy link
Member

Fine we me. The advantage with this solution is that we can change it later to return a ReadOnlyArray if we want, it won't break valid code. Adding a deprecation would be annoying since there's no way to avoid it (except for adding a temporary argument to names, which then will have to be deprecated later...).

An alternative solution would be to have names return the Index object and make it <:AbstractVector{Symbol} as evoked at #1712. That way it couldn't be modified as a plain Vector.

@bkamins
Copy link
Member Author

bkamins commented Feb 6, 2019

The problem with:

An alternative solution would be to have names return the Index object

is that getindex on AbstractIndex does not return Symbol but Int (or vector of them). That is why in #1712 I considered returning AbstractIndex but not when names is called, but when keyindex/colindex is called.

@bkamins
Copy link
Member Author

bkamins commented Sep 2, 2019

@nalimilan - I am not sure what to do with this. But for 1.0 I think we can leave names as is as I would assume everyone is used to it and it is almost surely not called in performance critical code (if you agree please close - or comment).

@nalimilan
Copy link
Member

How about returning the internal vector without copying, as you suggested above? It probably won't break existing code, and that way we can do whatever we want after 1.0 (including making a copy if it turns out it creates bugs in user code).

@bkamins
Copy link
Member Author

bkamins commented Sep 3, 2019

This goes against "safety" approach we have taken recently. Who knows where people used names(df) in their code. Maybe it gets modified.

I would go for this if the performance gain would be really important, but in this case I think it does not really matter if we are fast with this operation or not, so it is safer to perform copying.

@nalimilan
Copy link
Member

OK, as you prefer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants