Split syntax parsing into separate modules (eventually packages) #3219
To my surprise, this is going pretty well. After spending a few hours digging at this, I've separated two modules and prepared rough export/import lists. I moved the modules straight to DTables and replaced the DataFrames dependency with them, and all tests pass. The only pain point is the DataFrames ownership of
Two small comments:
Thank you very much for working on this.
Yes, I've only been moving code and trying to figure out where the code belongs. When it's ready, I'll make a changelog of where everything has moved.
OK, great. Why do you want to create two separate packages for this? It would probably be easier to maintain if it were one package (but I have not thought a lot about it).
I merged it into one. I had to split it early in the process, but that's not necessary now. I need to prepare a changelog later and probably have another look at the imports/exports.
src/new_module_index/index.jl
end
return x
end

function rename!(x::Index, nms::AbstractVector{Pair{Symbol, Symbol}})
Why not move this one too?
Alright, I didn't look at the one below that closely; the first one immediately caught my attention because it uses DataFrame explicitly.
Moved `rename!` out of the module completely in 1b6de20.
normalize_selection(idx, first(sel) => Symbol(last(sel)), renamecols)
normalize_selection(idx::AbstractIndex, sel::typeof(eachindex), renamecols::Bool) =
normalize_selection(idx, eachindex => :eachindex, renamecols)

normalize_selection(idx::AbstractIndex, sel::Pair{typeof(groupindices), Symbol},
Shouldn't these methods be moved to the same place as others? Code will become quite complex if methods are split across files/packages.
The `normalize_selection` methods that were left over depend on `groupindices` and `proprow`, which I didn't move out of DataFrames.
It's not difficult to move them out, so I suppose we should do it for the sake of completeness?
I think we should move everything for completeness, i.e. we should make sure that any operation specification syntax works the same on DataFrames.jl and DTables.jl. Even if DTables.jl does not provide some of these functionalities, it should throw an informative error (i.e. say that the operation is not supported yet) instead of producing garbage or an incorrect result.
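The "informative error" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `proprow` and `groupindices` are redefined here as local dummy functions so the snippet runs without DataFrames.jl, and `check_supported` is a hypothetical helper, not part of either package.

```julia
# Local dummy stand-ins (assumption): the real functions live in DataFrames.jl.
proprow() = nothing
groupindices() = nothing

# Operations that a hypothetical backend has not implemented yet.
const UnsupportedFun = Union{typeof(proprow), typeof(groupindices)}

# Generic fallback: any other selection is passed through unchanged.
check_supported(sel) = sel

# Hypothetical helper: fail early with a clear message instead of
# silently producing an incorrect result.
function check_supported(sel::Pair{<:UnsupportedFun, Symbol})
    throw(ArgumentError("`$(nameof(first(sel)))` is not supported yet by this backend"))
end
```

A call such as `check_supported(proprow => :prop)` would then raise an `ArgumentError`, while `check_supported(:a => :b)` simply returns the pair.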
Having said that I want to ask the following question that I wanted to ask for some time now (but hesitated as I know you spent a lot of time working on this PR):
Let us summarize the pros and cons of splitting the interface from DataFrames.jl.
I understand there are two options (I list their pros):
Split the module and make it DTables.jl dependency:
- faster loading time (DataFrames.jl load time is 1 second); the question is whether this is indeed an important factor (most likely, e.g., starting a Julia cluster takes much longer anyway)
- fewer indirect dependencies (this could be relevant, but we make sure to keep the number of dependencies of DataFrames.jl low and only allow well-maintained ones)
- less frequent updates of the dependency (assuming that the operation specification syntax is updated less frequently than the whole package; but I think the reality is that "something" will change here with every minor release of DataFrames.jl anyway)
Make DTables.jl depend on DataFrames.jl:
- you do not have to export DataFrames.jl if you do not want to (you can just `import` it internally)
- you will be in sync with DataFrames.jl automatically (i.e. there will be no issue like the one that prompted me to write this comment, where the API is only partially supported by the interface package)
- you can use DataFrames.jl fallback methods without a problem (we will not have to go into a long discussion about which methods should go into the DataAPI.jl interface and what interface they should have; if you see that the user uses DataFrames.jl as a backend for computations, you can easily use specialized methods from DataFrames.jl to do partial steps in DTables.jl)
- most of your users will most likely use DataFrames.jl anyway, so it is better to make sure we are in sync by design.
- whatever dependencies DataFrames.jl takes, you can be sure that we track them with our tests (and I assure you that we have caught dozens of errors in external packages this way; the same goes for the user base, who constantly check whether what we deploy actually works); so this puts less of a burden on DTables.jl to make sure that all packages you depend on keep working correctly (except packages that you want to depend on and DataFrames.jl does not depend on)
In summary:
- taking the whole of DataFrames.jl does not seem to me a huge limitation for DTables.jl development (you do not have to export anything; and can decide to leverage)
- you will be able to call methods from DataFrames.jl without a problem if you detect that the underlying table type the user uses is a DataFrames.jl type (this does not restrict you from handling the general case using another code path).
- users will get a clear message that the "newbie friendly" combination is DTables.jl + DataFrames.jl (so they do not have to think about what to choose), while experts are still free to choose whatever they wish.
If we decided not to split the package from DataFrames.jl, what would need to be done is to document the API you want to use in DTables.jl as public, so that you can safely rely on it (and that is why the work you are doing now is useful anyway, because we learn what should actually be part of this API).
Please let me know what you think.
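The point about keeping a general code path while using specialized backend methods when the table type is recognized can be sketched with plain multiple dispatch. Everything here is hypothetical: `BackendTable` stands in for a DataFrames.jl table type, `backend_colsum` for a specialized method it might offer, and `colsum` for an operation on the DTables.jl side.

```julia
# Hypothetical stand-in for a table type owned by the backend package.
struct BackendTable
    columns::Dict{Symbol, Vector{Int}}
end

# Hypothetical specialized routine provided by the backend package.
backend_colsum(t::BackendTable, col::Symbol) = sum(t.columns[col])

# General code path: works for any iterable of rows with named fields.
colsum(rows, col::Symbol) = sum(row -> getproperty(row, col), rows)

# Specialized code path: chosen by dispatch when the backend type is detected.
colsum(t::BackendTable, col::Symbol) = backend_colsum(t, col)
```

With this layout, `colsum([(a = 1,), (a = 2,)], :a)` goes through the general path and `colsum(BackendTable(Dict(:a => [1, 2, 3])), :a)` through the specialized one, without either caller needing to know which was used.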
This patch #3231 will minimally affect this PR.
@krynju - I am removing the milestone from this. This PR will likely not be needed, if I understand things correctly. Right?
Yes, it's not needed at all.
Still WIP, but I got to the first passing `using DataFrames` after splitting into separate modules.
I need to spend some more time on this, so early draft stages for now.
Aiming to get rid of all DataFrames references in `SeparateModule`, so that I can use these two modules in DTables.jl without the DataFrames dependency.
Changelog:
`SeparateModuleIndex` (temp name) module that stores everything moved out
selection.jl
- `broadcast_pair`
- `funname`
- `make_pair_concrete`
- `normalize_selection` (except `proprow` and `groupindices` related)
index.jl
- `rename!`, which was moved back to `utils.jl` due to heavy reliance on `DataFrame`
utils.jl
- `AsTable` & related methods
- `make_unique!`
- `make_unique`
- `funname`
- `_findall`
- `_blsr` (only used in `_findall`)