Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move DataFrame type definition to lightweight package #1764

Open
piever opened this issue Apr 1, 2019 · 46 comments
Open

move DataFrame type definition to lightweight package #1764

piever opened this issue Apr 1, 2019 · 46 comments

Comments

@piever
Copy link

piever commented Apr 1, 2019

See discussion starting from this comment. The key issue is that some CSV reading packages like to have a DataFrame as a default output: it is probably the most popular tabular data format in Julia and is preferrable to the "typed alternatives" when the data is wide (NamedTuple based alternatives have poor performance when the number of columns is in the thousands). In practice, both CSV.jl and TableReader.jl have a dependency on DataFrames which they only use so that they can output a DataFrame. Unfortunately, this causes a slowdown when loading those packages due to the complexity of the code of DataFrames. It also makes those packages less appealing as dependencies for tabular data packages that rely on alternative formats (like IndexedTables, StructArrays or TypedTables).

A proposed solution is to split the DataFrame definition and constructors to a DataFramesBase packages which would have basically zero dependencies and would be very fast to load, so that CSV and TableReader can just depend on DataFramesBase. DataFrames would then reexport DataFramesBase.DataFrame so this change would not affect users.

As an added bonus, I suspect this may make it more appealing to port IndexedTables from their current "mutable table" structure (a dictionary of columns) to just using a DataFrame: converting IndexedTable to DataFrame, replacing a few columns in various ways and converting back to IndexedTable is a useful workflow.

@KristofferC
Copy link
Contributor

Methods that overload Base functions and other extend functions from other packages would probably need to move to the new lightweight package as well (type piracy).

@bkamins
Copy link
Member

bkamins commented Apr 1, 2019

I am OK to go this way, but could you be more specific what you expect that can be stripped?

I am asking, because I expect that most of the things will have to stay anyway. The majority of dependencies are needed by various DataFrame constructors.

@quinnj
Copy link
Member

quinnj commented Apr 1, 2019

Yes, I'm in favor of both cleaning up DataFrames dependency (I think the StatsBase could be inside a Requires.jl block, for example) and moving the core type definition + constructor to a separate package (DataFrameTypes.jl?), I guess we'd need AbstractDataFrame and Index as well, but that doesn't seem too bad.

@JeffBezanson
Copy link
Contributor

The majority of dependencies are needed by various DataFrame constructors.

I think it's fine to have some constructors defined in DataFramesBase and some in DataFrames, if that helps.

@nalimilan
Copy link
Member

I see the reasoning, but wouldn't it be weird if CSV parsers returned a DataFrame object without providing most functions that you can call on it? That sounds confusing. I'd rather have CSV parsers use the Tables.jl interface to return a custom iterable like CSV.File, so that they no longer depend on DataFrames (they don't even need to depend on Tables.jl as they just need to return an iterable). Users would have to call DataFrame(CSV.File(path)) instead of CSV(path), which I kind of regret, but at least it would be clear and fully generic.

We could probably get rid of a few dependencies like CategoricalArrays and PooledArrays by using Requires.jl, but apparently it currently incurs a significant slowdown at load time.

@bkamins
Copy link
Member

bkamins commented Apr 2, 2019

We could probably get rid of a few dependencies like CategoricalArrays and PooledArrays

Actually most of the dependencies can be lazily loaded and most of them will be never loaded in a normal workflow. Maybe this is the way to go?.

@nalimilan
Copy link
Member

One issue is that AFAIK we cannot load CategoricalArrays automatically when calling categorical!.

@JeffBezanson
Copy link
Contributor

wouldn't it be weird if CSV parsers returned a DataFrame object without providing most functions that you can call on it?

Depends how we define "most" :) I think it is highly valuable to have a standard data structure like Array for this. Huge numbers of functions over the julia ecosystem can work on Arrays, but you only need a handful to get started, and DataFrame should be similar. Perhaps it needs a few more functions. But not a huge number.

I hate to say this, but categorical! is not a crucial tabular data operation. You can analyze data without it. A DataFramesBase that lacks it and some other things could still be perfectly useful.

@nalimilan
Copy link
Member

nalimilan commented Apr 2, 2019

categorical! and other data management functions could live in a different package. But you also evoked constructors, and I was thinking about indexing, views, printing... where do you put the limit?

EDIT: Also we rely on Tables.allocatecolumn for essential things such as vcat. Since Tables.jl is slow to load due to Requires.jl, we would also have to drop vcat from the basic package. At that point it would be quite limited...

@JeffBezanson
Copy link
Contributor

I'd be happy to pick a place to draw the line. We do that in API design all the time.

I also humbly submit that if something as byzantine as Requires.jl is needed to concatenate tables, we are doing something wrong.

@bkamins
Copy link
Member

bkamins commented Apr 2, 2019

Let me write down the DataFrames.jl dependency list and where they are used (if I have missed something please add here):

  • CategoricalArrays: constructor, categorical array handling, split-apply-combine
  • DataStreams: dependency will be removed soon by @quinnj
  • IteratorInterfaceExtensions: constructor
  • Missings: used in functions handling missing values
  • PooledArrays: split-apply-combine
  • Reexport, Compat - standard utility usage
  • SortingAlgorithms - sorting
  • StatsBase: describe method extends function defined there
  • Tables: constructor
  • TableTraits: constructor
  • WeakRefStrings: dependency will be removed soon by @quinnj

@piever
Copy link
Author

piever commented Apr 3, 2019

but wouldn't it be weird if CSV parsers returned a DataFrame object without providing most functions that you can call on it?

I am a bit confused by this point. I imagine most users do using DataFrames, CSV anyway as CSV does not reexport DataFrames functionality (RDatasets seems to be the odd one out here in that it also reexports DataFrames). Returning a DataFrame without having to load a lot of code is to me a good compromise as:

  • Users of DataFrames keep doing using DataFrames, CSV or using DataFrames, TableReader and are completely unaffacted by this
  • Users of TypedTables or IndexedTables can use TableReader just by doing TypedTable(TableReader.readcsv(df)) or table(TableReader.readcsv(df)) and no longer have to wait the extra load time (or dependencies, or potential, even though very unlikely, breakage) of the largish DataFrames codebase.

OTOH I appreciate that splitting the package without type piracy may be tricky. I imagine that the things to take out of DataFramesBase for sure are the functions defined in DataFrames that are not imported from elsewhere (groupby, stack, unstack, categorical!) which hopefully gets rid of the PooledArrays extension.

I'm not sure that IteratorInterfaceExtensions and TableTraits constructors are needed as direct dependencies now that Tables "incorporates them". Sorting is a bit trickier though: I've had similar difficulties with StructArrays: it was difficult to define efficient sorting by multiple columns without extra dependencies (and in the end I needed PooledArrays).

@nalimilan
Copy link
Member

@piever even if functions are usually not reexported, methods are implicitly (e.g. getindex). So to avoid weird MethodErrors about getindex we would have to move all definitions of methods defined in packages other than DataFrames to the new lightweight package.

Do we actually know what makes loading slow? Is it DataFrames per se, one of more dependencies, or the fact that Tables.jl uses Requires? I think we should experiment with this. For example, Tables.allocatecolumn is arguably so generic that it could leave out of Tables.jl (it's actually a variant of similar but taking an entry instead of an array), which would allow getting rid of its use of Requires.jl.

Users of TypedTables or IndexedTables can use TableReader just by doing TypedTable(TableReader.readcsv(df)) or table(TableReader.readcsv(df)) and no longer have to wait the extra load time (or dependencies, or potential, even though very unlikely, breakage) of the largish DataFrames codebase.

This was precisely the goal of defining a generic interface in Tables.jl. In an ideal world I would like CSV and TableReaders to return a DataFrame by default, but also provide a way to create any table without creating a DataFrame first. The latter is possible using CSV.File, but unfortunately we don't have a mechanism currently to have CSV.read return a DataFrame but only load DataFrames when that function is called.

Therefore an argument for not splitting DataFrames is that it will annoy @JeffBezanson so much that he will implement a way to avoid that (e.g. that's possible in R, though I guess it may be more difficult in Julia). :-) In the meantime, as I suggested above, I would just require people to write DataFrame(CSV.File(...)) and leave DataFrames alone.

@JeffBezanson
Copy link
Contributor

This proposal would leave DataFrames alone. using DataFrames would be exactly the same as it is now. It just gives other package authors the option of depending on part of it instead of all of it. That seems much simpler to me than coming up with various convoluted approaches to avoid returning a DataFrame.

I agree it looks possible to stop using Requires.jl in Tables.

I think there are a few reasons DataFrames is bulky:

  1. It's pretty big on its own, >9000 LOC. (Including 1300 lines of deprecations --- when can they be removed?)
  2. StatsBase is a pretty big package, >6000 LOC, and it also brings in DataStructures (~5700 LOC)
  3. CategoricalArrays defines a large number of invasive methods. It's also big-ish (2000 LOC) compared to e.g. PooledArrays (440 LOC).

@nalimilan
Copy link
Member

This proposal would leave DataFrames alone. using DataFrames would be exactly the same as it is now. It just gives other package authors the option of depending on part of it instead of all of it. That seems much simpler to me than coming up with various convoluted approaches to avoid returning a DataFrame.

Yet that requires keeping the two packages in sync, which is painful for development. That's not too terrible if we just move out the type definitions, but as I said I would find it confusing that things like getindex would then throw MethodError. BTW, I'm not aware of any other package which does that currently, right?

I think there are a few reasons DataFrames is bulky:
[...]

What determines the time to load a package? I guess it's more the number of methods (and which methods) than the number of lines? I'd like to see a few experiments to see which dependencies are the most costly.

Deprecations can go away at any time, but we've had complaints that we changed too many things quickly and that users may have code that only runs every few months. Again, this can be benchmarked.

@bkamins
Copy link
Member

bkamins commented Apr 3, 2019

A minimal DataFramesBase.jl specification would be:

  • AbstractDataFrame definition
  • DataFrame definition
  • Base.getproperty
  • Base.setproperty!
  • Base.propertynames
  • Base.names (needed by Base.propertymanes)
  • Base.dump
  • Base.hash
  • Base.show (this is pretty heavy)
  • DataFrame(::AbstractVector{<:AbstractVector}, ::AbstractVector{Symbol}) outer constructor
  • DataFrame(::NTuple{N, AbstractVector}, ::NTuple{N, Symbol}) outer constructor
  • Base.copy
  • AbstractIndex
  • Index
  • additionally all internally defined functions that are called by these functions (or the functions above should be rewritten to reduce their dependencies)

This would have no dependencies. This set would allow to create a DataFrame + we would define all functions that otherwise would have some default unwanted behavior (i.e. when DataFrames.jl changes the behavior from Base, so that loading DataFrames.jl later would not change how DataFrame behaves).

Conclusion:

  • pros: this is doable and certainly would cut down loading time
  • cons: we will have massive fragmentation of code between two packages

Now regarding dependency costs (each call on clean environment):

julia> @time using Missings
  0.072074 seconds (48.59 k allocations: 2.598 MiB)

julia> @time using CategoricalArrays
  0.794799 seconds (2.09 M allocations: 134.363 MiB, 4.85% gc time)

julia> @time using StatsBase
  0.471554 seconds (676.41 k allocations: 41.223 MiB, 4.70% gc time)

julia> @time using SortingAlgorithms
  0.323994 seconds (460.25 k allocations: 29.060 MiB, 2.31% gc time)

julia> @time using Reexport
  0.033949 seconds (36.56 k allocations: 1.849 MiB)

julia> @time using WeakRefStrings
  0.137408 seconds (143.37 k allocations: 8.611 MiB)

julia> @time using DataStreams
  0.271019 seconds (240.39 k allocations: 14.873 MiB, 2.35% gc time)

julia> @time using Compat
  0.136365 seconds (199.66 k allocations: 10.929 MiB, 4.67% gc time)

julia> @time using Tables
  0.383070 seconds (599.92 k allocations: 32.199 MiB, 1.86% gc time)

julia> @time using IteratorInterfaceExtensions
  0.030523 seconds (33.91 k allocations: 1.735 MiB)

julia> @time using TableTraits
  0.054261 seconds (57.07 k allocations: 2.884 MiB)

julia> @time using PooledArrays
  0.069829 seconds (108.97 k allocations: 7.699 MiB)

@JeffBezanson
Copy link
Contributor

Load time is one thing, but I think the real problem is more conceptual (e.g. I'd be perfectly happy if we can just speed up julia to reduce the load time for the same code). It should be possible to read a csv file or make some simple tabular data without needing StatsBase, DataStructures, and CategoricalArrays. I see it as a design and code factoring issue, which currently happens to have load time (and code size) consequences as well.

Yes, the number and nature of methods is important. In fact, the real cost is often not measured as part of the load time, since adding methods can cause us to recompile code later (e.g the repl gets laggy again for a while). It can also cause issues like JuliaLang/julia#31535 (though fortunately that has been fixed). Having lots of unnecessary methods loaded has other minor consequences too, like polluting tab-completion results.

@JeffBezanson
Copy link
Contributor

  • cons: we will have massive fragmentation of code between two packages

I can see it would be a problem to constantly need to commit to both packages, but my impression is that DataFramesBase would not need to change very much if at all. Is that true? Or is there some other problem?

@pdeffebach
Copy link
Contributor

If we decide on a minimum set of operations in TableOperations.jl that all tabular formats need to support, then we can create an abstract type AbstractTable and an implementation of AbstractTable that contains only the bare minimum needed for the interface.

That seems like ultimately what we want, but it would be a long way off and users would have to put up with compile times for a while.

Also this makes something like #1458 adding meta-data to dataframes very hard, since now it's the defacto "minimum" table implementation that we cannot add features to.

@bkamins
Copy link
Member

bkamins commented Apr 3, 2019

Maybe let us try to discuss the different heavy dependencies:

  • CategoricalArrays: this is difficult as it is baked-into many methods in the package; @nalimilan is the best person to comment on it
  • StatsBase: if we change describe to some other sensible name we can remove it and save 0.5 second
  • SortingAlgorithms: probably should stay (at least for now)
  • WeakRefStrings, DataStreams: will go away soon, so 0.4 second saved
  • Tables: here probably @quinnj is the best to comment
  • deprecated code: will go away before JuliaCon (this is my target 😄), so I would not worry about it

So in short - now DataFrames.jl loads in 3 seconds - we can speed up loading time by 1 second easily (if we decide to rename describe) and another 1 second depends mostly on CategoricalArrays.jl and Tables.jl.

@KristofferC
Copy link
Contributor

It seems odd to me that Tables.jl, which is supposed to be the ultra light interface package is one of the packages with the highest impact on loading time.

@piever
Copy link
Author

piever commented Apr 3, 2019

I have the impression that there are some basic methods that are needed from "categorical array" type of packages and WeakRefStrings that complicate the scenario. Tables provides a method to allocate a vector (to collect an iterable into a column based table) based on the element type so that some special types get special vectors (for example CategoricalValue gets categorical vector and WeakRefString gets StringArray). To me this type of functionality should live in Base as defaultarray(T::Type, sz) = Array{T}(undef, sz) and then CategoricalArrays and WeakRefStrings would overload it. This would already remove the need to require CategoricalArrays in Tables and in my view simplify the design.

I wonder whether there is a good solution for the unweakref here. StructArrays also has WeakRefStrings in their REQUIRE (which I'd rather it didn't) but so far it has not been obvious what is the right abstraction to avoid this.

@nalimilan
Copy link
Member

nalimilan commented Apr 3, 2019

Load time is one thing, but I think the real problem is more conceptual (e.g. I'd be perfectly happy if we can just speed up julia to reduce the load time for the same code). It should be possible to read a csv file or make some simple tabular data without needing StatsBase, DataStructures, and CategoricalArrays. I see it as a design and code factoring issue, which currently happens to have load time (and code size) consequences as well.

@JeffBezanson StatsBase is supposed to go away anyway. We could move describe to Statistics (and Compat for earlier releases) and that dependency would be gone (we just need to override the function), as well as the DataStructures dependency.

Regarding CategoricalArrays and PooledArrays, we notably depend on them to implement optimized functions (grouping, hashing). It would sound natural to me to use Requires for that. That doesn't sound like a conceptual issue at all. @piever noted StructArrays has the same issue. Do you suggest we also create StructArraysBase? I rather think the performance problem should be fixed.

I can see it would be a problem to constantly need to commit to both packages, but my impression is that DataFramesBase would not need to change very much if at all. Is that true? Or is there some other problem?

@JeffBezanson It depends on what you put there, but even the list @bkamins gave above covers a lot of code. Note that we're currently redesigning quite a few fundamental things, which is why I don't see splitting the package into pieces as a high priority thing. Hopefully things will stabilize later, but it's almost certain is that many changes will require adding version bounds, which is painful (as we already experience with StatsBase-StatsModels-GLM).

It seems odd to me that Tables.jl, which is supposed to be the ultra light interface package is one of the packages with the highest impact on loading time.

@KristofferC AFAIK that's due to Requires (as we discussed the other day).

@JeffBezanson
Copy link
Contributor

Also this makes something like #1458 adding meta-data to dataframes very hard, since now it's the defacto "minimum" table implementation that we cannot add features to.

Reminds me of the situation in Images.jl, where they also have a metadata-carrying type. The solution adopted there was to allow any array to be used as an image, and make the image-with-metadata type an optional wrapper. Hopefully something similar could be done here.

But ok --- it sounds like things are already moving in the right direction, and we will get pretty far by de-coupling from StatsBase and adding defaultarray somewhere. +1 to that.

Do you suggest we also create StructArraysBase?

Fair question. But StructArrays has exactly 2 dependencies (Requires, PooledArrays) and ~700 LOC, compared to DataFrames with 11 dependencies and 9000 LOC. However I agree that the dependency on PooledArrays is ugly. Ideally we could have some abstractions allowing both PooledArrays and CategoricalArrays to provide optimized functionality when they are used as columns.

@KristofferC
Copy link
Contributor

Do you suggest we also create StructArraysBase?

https://github.com/KristofferC/LightStructArrays.jl ;)

@davidanthoff
Copy link
Contributor

TableTraits.jl provides an ultralight table interop story: Less than 50 LOC, and no business with Requires.jl (which I think packages that are low in the stack should avoid at all costs, not just for the performance impact but especially because it significantly complicates the dependency versioning story, IMHO).

@piever
Copy link
Author

piever commented Apr 3, 2019

I've opened JuliaLang/julia#31601 with the above proposal on how to solve the need for allocatecolumns in the Requires block of Tables.

Do you suggest we also create StructArraysBase?

I'll just add my thoughts here (sorry for off topic) as I think it's a relevant comparison.

I still aim to keep the package lightweight. To get rid of WeakRefStrings in the Requires block and PooledArrays I would need the following two things:

  • some way to know for each array type which is the fastest between permute!(v, p) and copyto!(v, v[p]) (at the moment it depends on the type which is most efficient, as there are types with optimized permute! like StringVector and PooledVector but for normal arrays copyto!(v, v[p]) tends to be more performant)
  • some optimized roweq(v, i, j) that checks whether the i-th and j-th element of a vector are equal (for PooledArrays comparing the values of i-th and j-th element directly is not the optimal way to do this by far, one should compare the refs)

I like supporting the Tables interface, but when I added it there was some pushback so I put it in a Requires block, but this doesn't feel optimal either: I'm just not sure what the optimal solution is.

@ararslan
Copy link
Member

ararslan commented Apr 3, 2019

no business with Requires.jl (which I think packages that are low in the stack should avoid at all costs, not just for the performance impact but especially because it significantly complicates the dependency versioning story, IMHO)

💯 As an abstraction that's intended for general "common denominator" use, Tables should not be wilfully bypassing dependency resolution.

@KristofferC
Copy link
Contributor

KristofferC commented Apr 3, 2019

Tables could still have the packages as dependencies in the Project.toml file, just chose to conditionally load them. Installing packages is cheap.

@piever
Copy link
Author

piever commented Apr 3, 2019

Ideally we could have some abstractions allowing both PooledArrays and CategoricalArrays to provide optimized functionality when they are used as columns.

I've added a proposal to address this in JuliaLang/julia#31606.

@xgdgsc
Copy link
Contributor

xgdgsc commented Jun 24, 2019

https://discourse.julialang.org/t/sluggish-performance-when-loading-in-modules-dataframes-csv/17857 Slow loading time on a startup package really make people think julia is slow and would decrease potential users. Hope this gets fixed soon.

@nalimilan
Copy link
Member

Surprisingly, most of that is due to CSV.jl AFAICT.

Anyway, there's been some progress on that front via https://github.com/JuliaData/DataAPI.jl.

@dgkf
Copy link

dgkf commented Jan 18, 2020

Just to connect some pieces here, the idea of splitting up DataFrames.jl was also raised in #2080 from the perspective of decoupling the underlying DataTypes from the user-facing API.

Since data manipulation syntax is something that people can be quite opinionated about (myself certainly among them), and because the nomenclature that different paradigms use often overlaps (namely select, which was the root of the discussion), it seems appropriate to break apart the foundational type definitions from the user-facing conveniences so that a user can choose among available APIs without it being coupled to such a ubiquitous DataType as a DataFrame.

@bkamins
Copy link
Member

bkamins commented Jan 18, 2020

As noted earlier:

  1. I support the effort to decouple the type definitions from high-level API. If someone is willing to undertake this task in a non-breaking way I will gladly review such a proposal (probably it is best to fist prepare a detailed design document before it is being started implemented).
  2. I believe that we should create a new package (e.g. DataFramesBase.jl) for this decoupled "definitions" code. Although I agree that it would be linguistically more clean to keep this package for definitions only the problem we have is legacy code.
  3. Finally this is a very good point that people will be opinionated about paradigms we have to provide "some" default paradigm - and this is the goal of DataFrames.jl. If someone does not like it - it is not a problem. One would then load DataFramesBase.jl + a favorite high-level API layer package. However, vast majority of DataFrames.jl users are not opinionated, as they are not experts, who have an opinion but novice/casual users who just write several lines of code (often even not reading the manual).

@bkamins
Copy link
Member

bkamins commented Jan 18, 2020

One more thought.

@dgkf - actually it might be easier for you to get going by using a bit different approach. You can create a package of whatever name you like and reexport from it only those functions from DataFrames.jl that you want while you can define your own functions even with the same names that DataFrames.jl defines in whatever way you like without any problems.

The only restriction here is that functions that are defined in Base cannot be handled this way, but I guess you would treat them as low-level anyway (and we tried to keep them consistent with Base as much as possible with these functions).

It seems to me that taking this approach would be much less work than decoupling "internals" of DataFrames.jl from high-level API and in this way you could field-test your ideas more quickly and easily.

@dgkf
Copy link

dgkf commented Jan 18, 2020

Thanks, @bkamins for hearing me out on this one. I feel like I either entered the conversation at exactly the right time before a v1.0 release or exactly the wrong time as to throw a huge wrench into everyone's plans - maybe both! I'm very sorry for that!

I think that as far as API experimentation goes, it's certainly a manageable space currently.

My priority would be to separate out the underlying DataFrame type declarations. I'd be happy to help in that effort, and can do some of the grunt work of combing through and splitting out code, but I'd definitely need a lot of input about what should and shouldn't make its way into each package. I don't think I have nearly enough familiarity with the direction of the DataFrames package or of Julia package development etiquette to be able to draw those lines.

@bkamins
Copy link
Member

bkamins commented Jan 18, 2020

I feel like I either entered the conversation at exactly the right time before a v1.0 release or exactly the wrong time as to throw a huge wrench into everyone's plans, so I'm very sorry for that!

Discussion is highly appreciated. The issue is that you are touching points that are difficult to get right and different options have pros and cons (and people with different backgrounds tend to significantly differ in their opinions), which means we cannot expect to reach definitive conclusions quickly and easily (see e.g. discussions around Nullable type or string-handling ecosystem in the past in Julia Base).

E.g. having DataFrames.jl split into two packages is appealing but we need to understand the benefits of it, as there are clear and significant costs, as having two packages makes development of the ecosystem much harder (you have to synchronize releases of both packages, CI is harder etc.). And then the benefits and costs should be compared and decision what to do made.

what should and shouldn't make its way into each package.

This is exactly the problem currently no one came with a good solution.
The only practical way to do it is to put a PR or a specification on a table and get the feedback.

That is why I have proposed to start with a simpler approach - if you would consider doing it.
Namely you can wrap DataFrames.jl within another package. Then you can check if what you expose is usable. Here is a minimal example:

julia> module MyDataFrames
       import DataFrames: DataFrame
       export DataFrame
       end
Main.MyDataFrames

julia> using .MyDataFrames

julia> df = DataFrame(a=1,b=2)
1×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │

julia> df[1, 1]
1

julia> select(df, :a) # this fails, as select is not exposed by MyDataFrames
ERROR: UndefVarError: select not defined

In this way you can easily experiment what is the minimal set of functions that you feel should go into each of the packages + you can see what else (instead of removed functionality) can be implemented as an alternative.

The key thing in this approach is that even if someone does not like e.g. select but we provide it in DataFrames.jl one can simply make it effectively non-existent (so your package behaves as if it was never defined - it becomes private). You can either not define it at all or define it in a different way that you prefer.

@bkamins
Copy link
Member

bkamins commented Jan 18, 2020

Two additional thoughts:

  1. if you investigate the issue and PR history in DataFrames.jl you will notice that the vast majority of requests is for adding functionality (mostly from regular users that just want to get things done), and only a relatively small fraction regards low-level design things. I feel that such a situation has two reasons:
    • firstly, as noted earlier regular users do not spend time investigating the package ecosystem - they want to have "Pandas" for Julia that will just work as a single package
    • secondly, having one such package like DataFrames.jl means that the maintenance and development process is much easier (as it is not going to be abandoned and someone always will be there to respond to the issue/pr; and we do bug-fix releases in a range days from having bugs reported); this is important as having a single package (maybe not ideal - I also have things that I would prefer to change - but the majority of users prefers something else so they stay) allows the community to focus on it - and I feel that in open source this is also a crucial factor
  2. you might consider starting with a small/simple/easy to decide PRs - just to get a better feeling of the package - this might make it easier later to move to "grand challenge" issues.

Thank you for your support and thoughts!

@dgkf
Copy link

dgkf commented Jan 18, 2020

Thanks @bkamins, you're right. This probably isn't my place to contribute so early on in getting a feel for the package trajectory. I'd be happy to contribute to something smaller - please keep me posted on where I can help. I only offered my support for this particular issue so that the limiting factor isn't someone's time as I see it as a pretty critical step for the package ecosystem.

I completely agree with your vision for the package. As a new user myself, the "not spending time investigating the package ecosystem" and wanting something that "will just work" really speak to my experience. I wouldn't say I expected all the bells and whistles to come from DataFrames.jl, though I can totally see why some new users would expect that.

@davidanthoff
Copy link
Contributor

Here is my view on this discussion (as a large scale user of DataFrames.jl in my lab group, not as the queryverse author): I think the number 1 priority should be to get to a 1.0 and stability, ideally with as little breaking from the status quo as possible. The API that we have at this moment in DataFrames.jl is literally used by thousands, if not tens of thousands pieces of code, so I think the benefits of not breaking those pieces of code far outweigh pretty much all other considerations. I think there will always be disagreement on what the ideal approach in terms of surface API for a package like DataFrames.jl should be, but I really hope that we would consider that debate as done and over for DataFrames.jl at this point, and just consider that it is what it is at this point (namely a package that is not a minimal data structure, but a one-stop package that includes a lot of manipulation functionality), simply because anything else is just too disruptive for the end-users of this package, and there are too many of them.

That does not mean that we shouldn't experiment with lots and lots of other models, designs etc. in this space! I'm a huge fan of trying to come up with new designs etc. BUT, I think that should happen in other packages, not in DataFrames.jl. At this point we have really excellent interop of table types in the ecosystem, so if someone wants to introduce a lightweight table data structure only package that doesn't have much beyond that, then it is actually super easy to pull that off, and then implement a few basic interfaces and interop with pretty much everything that is out there in terms of table support. Integration with file IO, plotting, viewing etc. all pretty much comes at very, very minimal code cost at this point. And then we can see how that catches on or not, without disrupting the huge existing user base of DataFrames.jl.

Here is my view as the queryverse author: from that end it really doesn't matter :) Query.jl and all the other packages in queryverse have worked with all tabular data structures that exist in Julia land for more than two years now, so I think that is a pretty clear demonstration that one can create alternate APIs that are not tied to a particular tabular data structure in the system we have right now. Query.jl works with DataFrames.jl, but it equally well works with TypedTables, IndexedTables, JuliaDB.jl, whatever. If a user wants to have a tabular data type that is more light weight than DataFrames.jl and wants to use it with Query.jl, they can. If they prefer the batteries included approach of DataFrames.jl and want to use it with Query.jl, they also can. I see no reason why other manipulation frameworks that take a different API approach than Query.jl couldn't work exactly the same way. So I think the status quo is already such that it allows really easy experimentation with N table types and M manipulation frameworks, and have all N to M combinations just work.

@bkamins
Copy link
Member

bkamins commented Jan 18, 2020

@davidanthoff - thank you for the comment. I must say that it really clarifies for me some final design concerns we recently have trying to ship out 1.0 release. Essentially it says that when in doubt we should try not to break things.

@dgkf
Copy link

dgkf commented Jan 26, 2020

Probably it is best to fist prepare a detailed design document before it is being started implemented

The only practical way to do it is to put a PR or a specification on a table and get the feedback.

I don't have enough experience with Julia to make decisions regarding the breakdown of the package(s), but I figured I could help by trying to make the decision-making process a bit easier. I made this script to create a table of exported DataFrames.jl functions, and for methods that extend base functions, itemize the signature and keyword arguments. If there's other data people would like to see in here or find an issue, please let me know - I like doing this type of code analysis work.

You get a table that looks something like this (showing 5 of ~350 rows):

name type is_base file line signature kwargs
get Function 1 src/other/index.jl 143 [Index] missing
getindex Function 1 src/dataframerow/dataframerow.jl 203 [DataFrameRow, Union{Signed, Symbol, Unsigned}, Any] missing
getindex Function 1 src/dataframerow/dataframerow.jl 205 [Union{Function, Type}, DataFrameRow, Union{Signed, Symbol, Unsigned}] missing
getindex Function 1 src/groupeddataframe/groupeddataframe.jl 407 [GroupedDataFrame, Union{Tuple, NamedTuple}, Any] missing
getindex Function 1 src/other/index.jl 265 [Index, Symbol] missing

@quinnj
Copy link
Member

quinnj commented Feb 1, 2020

I'll just comment here that I largely agree with @davidanthoff's comments that we definitely want to avoid breakage in DataFrames at this point, and that we have pretty powerful interfaces that mean things are decently abstracted anyway.

In my mind, I love efforts like the Selections.jl package from @Drvi , which aim to take functionality, currently specific to DataFrames, and make it more general to all table types. There's power there because you have lightweight "table types" like CSV.File, or a potential SQLite.Table, that don't necessarily want to recreate the wheel when it comes to all the kinds of table operations you can do, yet it'd be convenient in a lot of cases if things "just worked" on them like they do for DataFrames. I think DataFrames.jl is a crucial place to hammer out apis, design, and implementations, but I'd love to see more and more things abstracted out into separate packages as possible, both to make the DataFrames.jl package more lightweight itself, and allow other table types to benefit from the thought, design, and functionality.

@bkamins
Copy link
Member

bkamins commented Apr 26, 2020

My experience from last several months of trying to push out DataFrames.jl to 1.0 release (like #2206 discussion currently) lead me to thinking that actually we should split "core" of DataFrames.jl and "utility functions". Tentatively I would keep DataFrames.jl name for the package with utility functions and DataFramesBase.jl for the core (but this can be discussed).

The issue is (and I guess most of people who commented in this thread were clear about it from the start 😄):

  1. packages depending on DataFrames.jl currently probably mostly need only DataFramesBase.jl, as they need types and not fancy functions like stack
  2. DataFramesBase.jl will be very low on dependencies and much faster to load (in particular CategoricalArrays.jl can be dropped then I think)
  3. DataFramesBase.jl will change much less frequently, which again is very important for other package development
  4. for the "utility functions" part it is very hard to get the design right, so this takes time and might require breaking things in the future, so it is better not to mix it with DataFramesBase.jl
  5. "utility functions" part is, well - just utility, so actually users might prefer other ecosystem than the default one (e.g. Query.jl, Selectors.jl or whaterver other). Of course DataFrames.jl with "utility functions" should be maintained and developed so that there is a "default set of utility functions" that users have access to.

What I would keep in DataFramesBase.jl:

  • types: AbstractDataFrame, DataFrame, SubDataFrame, DataFrameRow, DataFrameRows, DataFrameColumns
  • the list of functions that should be kept should be discussed (even this is not an easy list - as some functions are on the border between "core" and "utility" - but I think it would be doable)
  • I would add unsafe_propertynames that would expose column names as Symbols without copying (sometimes this is useful to have access to)

What notably I would exclude from DataFramesBase.jl and keep in DataFrames.jl:

  • GroupedDataFrame type (it is not needed for the "core" to work) and related functions
  • stack/unstack and related types
  • select and related types and functions

If we wanted to do this this would be done post 0.21 release. Then we could quickly have DataFramesBase.jl 1.0 release (needed for package developers to have a firm base) and DataFrames.jl would not have a pressure to stabilize as there are constant discussions what would be best in the "utility functions" layer.

EDIT

Also then it would be much easier to maintain DataFramesBase.jl with a limited number of contributors and I would try to encourage more people to contribute to DataFrames.jl "utility part".

@quinnj
Copy link
Member

quinnj commented Apr 28, 2020

I would very much support this plan.

@bkamins
Copy link
Member

bkamins commented Jul 28, 2020

Some more thoughts before I move with the implementation. I think we have two options:

  1. (easy) - in the lightweight DataFramesBase.jl package just define types (probably but only core, i.e. AbstractDataFrame, DataFrame nd Index). This move has the benefits that: 1) it is easy to implement, 2) loading DataFramesBase.jl will be super fast. The problem is that this approach is usable only for packages that only need to have a type name defined, but things will not work without loading DataFrames.jl on users side (and DataFrames.jl would do type piracy). Actually this could be defined then in DataAPI.jl without introducing a new package.
  2. (harder) - in the lightweight DataFramesBase.jl package we define all constructors for the type, all methods that are extending methods from external packages and all types that are needed to support these methods. This move has benefits that loading DataFramesBase.jl will make all things work at the expense that this will be a bit more heavy dependency and later it will be harder to maintain (some really complex code like for push!, append!, show, getindex, setindex!, broadcasting ...) would then go into the package. This would be almost the whole package save for split/apply/combine, joins and stack/unstack. In fact if we resolve the CategoricalArrays.jl issue probably loading time of DataFrames.jl as it is now and of such package would be very similar.

Do you have an opinion which one of them would be better? In general the (easy) option (only type definitions) should be enough for package development, but user would need to load DataFrames.jl later anyway for things to work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests