Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split syntax parsing into separate modules (eventually packages) #3219

Closed
wants to merge 8 commits into from

Conversation

krynju
Copy link

@krynju krynju commented Nov 8, 2022

Still WIP, but I got to the first passing using DataFrames after splitting into separate modules

I need to spend some more time on this, so early draft stages for now

Aiming to get rid of all DataFrames references in SeparateModule, so that I can use these two modules in DTables.jl without the DataFrames dependency

Changelog:

  1. create SeparateModuleIndex (temp name) module that stores everything moved out
  2. selection.jl
    1. move out broadcast_pair
    2. move out funname
    3. move out make_pair_concrete
    4. move out normalize_selection (except proprow and groupindices related)
  3. index.jl
    1. move out complete file and remove original
    2. except: rename!, which was moved back to utils.jl due to heavy reliance on DataFrame
  4. utils.jl
    1. move out AsTable & related methods
    2. move out make_unique!
    3. move out make_unique
    4. move out funname
    5. move out _findall
    6. move out _blsr (only used in _findall)

@bkamins bkamins added the ecosystem Issues in DataFrames.jl ecosystem label Nov 8, 2022
@bkamins bkamins added this to the 1.5 milestone Nov 8, 2022
@krynju
Copy link
Author

krynju commented Nov 16, 2022

To my surprise this is going pretty well after spending a few hours digging at this
Tests pass locally with some unrelated errors I think

I've separated two modules, prepared rough export/import lists
The SeparateModule (parts of selection.jl) depends on SeparateModuleIndex (index.jl & parts of utils.jl)

I moved the modules straight to DTables and replaced the DataFrames dependency with them and all tests pass

The only pain point is the DataFrames ownership of AsTable, but it's now owned by SeparateModuleIndex, which would just need to become a package to sort this out

@bkamins
Copy link
Member

bkamins commented Nov 16, 2022

AsTable should be owned by the module parsing "operation specification syntax" - but this is what I understand you do.

Two small comments:

  1. please do not make any implementation changes - just move the code (I assume this is what you do, but I want to confirm; the point is that with diff it is hard to visually confirm this with 100% certainty). For DataFrames.jl these changes should be transparent (i.e. they should be made in a way that no code that is left in DataFrames.jl needs changing - again I see this is exactly what you do, but I want to confirm)
  2. please do not change code layout (e.g. I see you change indentation): first - your changes are incorrect, second - it will be much easier to automatically check point 1. above if code is not changed in any way.

Thank you very much for working on this.

@krynju
Copy link
Author

krynju commented Nov 17, 2022

Yes, AsTable landed in the "SeparateModuleIndex (index.jl & parts of utils.jl) module" and stayed in utils.jl

I've been only moving code and trying to figure out where does the code belong to.
I think I made 1-2 edits, which I have to revert anyway as they're not necessary (just some side effects of messing around)

When it's ready I'll make a changelog where everything has moved.
Generally full files or big chunks of them were moved to the separate modules and then I moved stuff back to DataFrames if I had to, so it should be pretty simple to review once I write it all down.

@bkamins
Copy link
Member

bkamins commented Nov 17, 2022

OK, great. Why do you want to create two separate packages for this? It would be probably easier to maintain if it were one package (but I have not thought a lot about it).

@krynju
Copy link
Author

krynju commented Nov 19, 2022

I merged it into one - had to split it early in the process, but it's not necessary now
Also reverted any formatting/weird changes, so the git diff is somewhat readable now

I need to prepare a changelog later and probably have a look at imports/exports again

end
return x
end

function rename!(x::Index, nms::AbstractVector{Pair{Symbol, Symbol}})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not move this one too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, didn't look at the one below that closely - the first one immediately caught my attention due to explicit usage of DataFrame

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved rename! out of the module completely 1b6de20

normalize_selection(idx, first(sel) => Symbol(last(sel)), renamecols)
normalize_selection(idx::AbstractIndex, sel::typeof(eachindex), renamecols::Bool) =
normalize_selection(idx, eachindex => :eachindex, renamecols)

normalize_selection(idx::AbstractIndex, sel::Pair{typeof(groupindices), Symbol},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these methods be moved to the same place as others? Code will become quite complex if methods are split across files/packages.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The normalize_selections that were leftover have a dependency on groupindices and proprow, which I didn't move out of DataFrames.
It's not difficult to move them out, so I suppose we should do it for the sake of completeness?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move everything for completeness, i.e. we should make sure that any operation specification syntax works the same on DataFrames.jl and DTables.jl. Even if in DTables.jl you do not provide these functionalities it should throw an informative error (i.e. that operation is not supported yet instead of producing some garbage or incorrect result).

Having said that I want to ask the following question that I wanted to ask for some time now (but hesitated as I know you spent a lot of time working on this PR):

Let us summarize the pros and cons of splitting the interface from DataFrames.jl

I understand there are two options (I list their pros):

Split the module and make it DTables.jl dependency:

  • faster loading time (DataFrames.jl load time is 1 second); the question is if this is indeed an important factor? (i.e. most likely e.g. starting Julia cluster anyway takes much longer)
  • less indirect dependencies (this could be relevant, but we make sure to keep the number of dependencies of DataFrames.jl low and only allow well maintained ones)
  • less frequent updates of dependency (assuming that operation specification syntax is updated less frequently than the whole package; but I think that the reality is that "something" will change here with every release minor release of DataFrames.jl anyway)

Make DTables.jl depend on DataFrames.jl:

  • you do not have to export DataFrames.jl if you do not want to (you can just import it internally)
  • you will be in sync with DataFrames.jl automatically (i.e. there will be no issue - like the issue that prompted me to write this comment that API is only partially supported by the interface package)
  • you can use DataFrames.jl fallback methods without a problem (we will not have to go into a long discussion what methods should go into DataAPI.jl interface and what interface they should have - if you see that the user uses DataFrames.jl as a backend for computations you can easily use specialized methods from DataFrames.jl to do partial steps in DTables.jl)
  • most of your users will use DataFrames.jl most likely anyway, so it is better to make sure we are in sync by design.
  • whatever dependencies DataFrames.jl takes you can be sure that we track them with our tests (and I assure you that we have caught dozens of errors in external packages this way; the same is with user base - they constantly check if what we deploy actually works) - so, this puts a less burden on DTables.jl with making sure that all packages you depend on keep working correctly (except packages that you want to depend on and DataFrames.jl does not depend on)

In summary:

  • taking whole DataFrames.jl does not seem to me as a huge limitation for DTables.jl development (you do not have to export anything; and can decide to leverage)
  • you will be able to call methods from DataFrames.jl without a problem if you detect that underlying table type that the user uses is DataFrames.jl type (this does not restrict you from handling a general case using other code path).
  • users will get a clear message that "newbie friendly" combination is DTables.jl + DataFrames.jl (so they do not have to think what to choose); while experts still are free to choose whatever they wish.

What would be needed to be done if we decided not to split the package from DataFrames.jl is to document the API you want to use in DTables.jl as public, so that you can safely rely on it (and that is why the work you do now is anyway useful, because we learn what actually should be a part of this API).

Please let me know what you think.

@bkamins
Copy link
Member

bkamins commented Nov 25, 2022

This patch #3231 will minimally affect this PR.

1 similar comment
@bkamins
Copy link
Member

bkamins commented Nov 25, 2022

This patch #3231 will minimally affect this PR.

@bkamins bkamins removed this from the 1.5 milestone Feb 5, 2023
@bkamins
Copy link
Member

bkamins commented Feb 5, 2023

@krynju - I am removing a milestone from this. This PR will not likely be needed if I understand things correctly. Right?

@krynju
Copy link
Author

krynju commented Feb 8, 2023

Yes, it's not needed at all
I'll close it and we can get back to it if we need to revisit the idea of splitting this part of dataframes out into a separate package

@krynju krynju closed this Feb 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ecosystem Issues in DataFrames.jl ecosystem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants