Can MLUtils play nicely with Tables.jl? #61
JuliaML/MLDatasets.jl#96 adds data containers that are MLUtils compatible including Perhaps |
@darsnack Thanks! That's good to know.
Makes sense to me. |
What are the downsides of using trait-based dispatch to avoid the wrapping, apart from meaning Tables.jl has to be a dependency of BaseLearn.jl (or whatever replaces that)? That is, assuming the tuples issue is resolved another way. |
You mean using a traits package that turns |
Mmm, on further reflection, I guess there's no way to do this without breaking the API for current implementations, right? |
Depends on what you're thinking. With SimpleTraits.jl that's not the case though I haven't checked whether Breaking changes to the API is not an issue, but I'd like to avoid needing to call |
I guess the other way of thinking about this is what is an example where something is Tables.jl compatible, but wrapping it in something like |
I think these are fair arguments. And, by the way, I do think the TableDataset work is a cool and valuable contribution (which should probably not be hidden away in MLDatasets). However, I believe the run-of-the-mill statistics user will find wrapping tables off-putting, to say the least.
I'm not sure why |
Good point, I hadn't thought of that. By always wrapping, you hide the original table type, which might offer efficient random access that you could otherwise specialise on. Instead you always have to go through the Tables.jl API for random access, which is generally bad.
Yeah that's a good point. I will see if there's an overhead to a trait-based approach and file a PR if not. Note that tuples would still be excluded from this path. |
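For readers unfamiliar with the trait-based approach being discussed, here is a minimal sketch in plain Julia. Everything in it is hypothetical: `looks_like_table` is only a stand-in for a real check such as `Tables.istable`, and `my_numobs` / `my_getobs` are illustration names, not the MLUtils.jl API. It only shows how a trait lets an unwrapped table take its own code path.

```julia
# Hypothetical trait types; none of these names come from MLUtils.jl or Tables.jl.
abstract type ContainerStyle end
struct TableStyle <: ContainerStyle end
struct IndexableStyle <: ContainerStyle end

# Stand-in for a real table check: a non-empty named tuple of vectors (a column table).
looks_like_table(x) =
    x isa NamedTuple && !isempty(x) && all(c -> c isa AbstractVector, values(x))

container_style(x) = looks_like_table(x) ? TableStyle() : IndexableStyle()

# The public functions compute the trait, then dispatch on it.
my_numobs(x) = my_numobs(container_style(x), x)
my_numobs(::TableStyle, x) = length(first(values(x)))   # number of rows
my_numobs(::IndexableStyle, x) = length(x)

my_getobs(x, i) = my_getobs(container_style(x), x, i)
my_getobs(::TableStyle, x, i) = map(c -> c[i], x)        # index every column
my_getobs(::IndexableStyle, x, i) = x[i]

tbl = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])
my_numobs(tbl)        # counts rows, not columns
my_getobs(tbl, 2:3)   # a named tuple holding rows 2 and 3 of each column
```

No wrapper type is involved, so a method specialised on a concrete table type could still be added alongside the generic trait path.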
Thanks indeed for considering this. Will this address @ToucheSir's important point? That is, would I still feel the current interpretation of tuples in MLUtils, as a way to dispatch special behaviour (and inherited from MLDataPattern), is intrinsically flawed. This isn't just about tables. Why should we exclude implementation of the interface to objects that happen to be tuples (or any other data type)? What about @ToucheSir's suggestion to use varargs, which "only" breaks use of |
I will double check, but if
Unfortunately, Julia forces a one-or-none situation here. Let's say MLUtils had no special path for tuples, but we did have a trait-based path for Tables.jl tables. Then tuples will automatically take that path. If some other Foo.jl interface that's not Tables.jl is also compatible with tuples, then there is no way for tuples to selectively participate as a Tables.jl table or a Foo.jl object. My point is that treating tuples as tables is the same as treating them the way we do now. The only difference is what interpretation is given favoritism. So the alternative to what we have now is that tuples have no meaning attached to them, or we decide which interpretation is the "golden" one.
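To make the conflict concrete, here is a small sketch with two hypothetical functions (neither is part of MLUtils.jl or Tables.jl). Each encodes one reading of the same tuple, and a single `numobs(::Tuple)` method could only implement one of them.

```julia
# Hypothetical names, for illustration only.

# Multi-container reading: the tuple groups several containers observed in parallel,
# so the observation count is the length of each member.
nobs_multi(data::Tuple) = length(first(data))

# Table reading: the tuple is one table and each element is a row.
nobs_table(data::Tuple) = length(data)

rows = ((a = 1, b = 2.0), (a = 3, b = 4.0), (a = 5, b = 6.0))

nobs_table(rows)   # 3: three rows
nobs_multi(rows)   # 2: the length of the first element, i.e. its number of fields
```

Both answers are defensible for the same value, which is why one interpretation has to be favoured or tuples have to be left without any meaning.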
At face value, it looks like the tuple decision is about dispatching the inputs to a function, but it is really a consequence of the outputs. In Julia, there is a single, canonical way to return multiple values, and that is a tuple. If I do The alternative for MLUtils is to create some So, the decision that a group of "things" is a tuple is really the interpretation given by Julia's return behavior. Maybe no package should take ownership of |
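A quick sketch of the return-value point, using a made-up `split_half` function rather than anything from MLUtils.jl: whatever a function returns as "several values" arrives as one tuple, so the next stage in a pipeline receives a tuple whether or not it asked for one.

```julia
# Hypothetical function that splits a vector into two halves.
split_half(v) = (v[1:end÷2], v[end÷2+1:end])

parts = split_half(collect(1:6))
parts isa Tuple            # true: the two halves come back as a single tuple

# Destructuring is just syntax over that tuple.
first_half, second_half = parts
```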
^long explanation above being said, I don't want to imply that I feel so strongly about this that MLUtils can't budge...the solution here is going to have to be a compromise somewhere and that can certainly be this package. |
Another option is to enforce explicit splatting in the pipeline |
Aha. Yes, I get this now. You want tuples for the pipelining.
For the user, you mean? Sorry, I'm not actually that familiar with the pipelining here - is there special syntax or are you just composing functions?
Thanks for being open about this. I'm kind of playing devil's advocate here, and this is more subtle than I realised at first. It may be that wrapping tables that don't have native implementations of Mmm. What if we put logic in the wrapping operation that detects whether |
Yeah we just use
You mean that if someone tries to wrap a |
I think it would be good to support Tables.jl's tables without any wrappers. For
function numobs(data)
    if Tables.istable(data)
        # count rows through the generic Tables.jl row iterator
        return length(Tables.rows(data))
    else
        return length(data)
    end
end

function getobs(data, idx::Union{Integer, AbstractVector{<:Integer}})
    if Tables.istable(data)
        # assumes the table type supports two-dimensional indexing (e.g. a DataFrame)
        return data[idx, :]
    else
        return data[idx]
    end
end
At least this leaves us in a better position than where we currently stand. The current specializations for Tuples and NamedTuples should stay as they are.
@CarloLucibello (cc edited) Thanks for that. I agree it would be good to avoid wrapping. Summarizing discussions at JuliaData/Tables.jl#276 and on Slack, I think there will be no change at Tables.jl to rule out tuple-tables, and I suggest MLUtils.jl continues to treat tuples as multi-containers. On further investigation, ruling out tuple tables for ML is probably fine, at least for now. I did not realize that MLUtils treats named tuples in a special way. Can you say more about this? That could be a deal-breaker, as named tuples of equi-length vectors (called column tables) are a fundamental native Julia table type used all over the place. As for your |
It's treated the same as tuple: a multi-container like |
Oh no 😭 . |
AFAICT MLUtils and Tables.jl are consistent in treating a named tuple of vectors or tuples as a column table. Can you elaborate on your concern?
Oh, I see. I guess that means column tables can only be indexed as if they were multi-containers of vectors (or tuples) (in the MLUtils sense). Nothing that Tables.jl does about column tables (eg, some change to the implementation of |
It will be possible to solve this issue using |
I think one could greatly increase buy-in for MLUtils.jl if every Tables.jl compatible table would automatically implement the "data container" API. To get performance, one would still want to implement the concrete table types as well, but having it "just work" for all tables would be nice. I guess, since "table" is itself just an interface, rather than an abstract type, this would need to be implemented as part of the data container API, right? As Tables.jl is very lightweight, I don't see that as a big issue (and I could probably find someone to help with the integration).
Even so, there seems to be a problem implementing the interface for certain tables. MLUtils.jl interprets tuples in a very specific way. For example, shuffleobs((x1, x2)) treats x1 and x2 as separate data containers, which are to be shuffled simultaneously, using the same shuffle of the base observation indices. But some tables are tuples. The following example is even a tuple-table whose elements are themselves tables (of a different type):

So is such a tuple a pair of data containers or a single data container? The current API cannot distinguish them.
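The tuple behaviour described above can be sketched in a few lines. This is not the implementation of `shuffleobs`. `shuffle_together` is a hypothetical helper that assumes every member of the tuple supports `length` and indexing with a vector of integers. It only illustrates "one permutation, applied to every container".

```julia
using Random

# Hypothetical helper: draw one permutation and apply it to each container in the tuple.
function shuffle_together(rng::AbstractRNG, data::Tuple)
    n = length(first(data))
    all(d -> length(d) == n, data) ||
        throw(DimensionMismatch("all containers must have the same number of observations"))
    perm = randperm(rng, n)
    return map(d -> d[perm], data)
end

x1 = [10, 20, 30, 40]
x2 = ["a", "b", "c", "d"]

s1, s2 = shuffle_together(MersenneTwister(0), (x1, x2))
# Observation pairs stay aligned: wherever 30 ends up in s1, "c" sits at the same position in s2.
```

Under this reading the tuple is never itself an observation source, which is exactly what clashes with a table that happens to be a tuple.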
I wonder:
Possibly this discussion is related.
Tables that are tuples are problematic elsewhere.
@oxinabox @rikhuijzer @darsnack