Can MLUtils play nicely with Tables.jl? #61

ablaom · 2022-02-23T21:42:28Z

I think one could get greatly increase buy-in for MLUtil.jl if every Tables.jl compatible table would automatically implement the "data container" API. To get performance, one would still want to implement the concrete table types as well, but having it "just work" for all tables would be nice. I guess, since "table" is itself just an interface, rather than an abstract type, this would need to be implemented as part of the data container API, right? As Tables.jl is very lightweight, I don't see that as a big issue (and I could probably find someone to help with the integration).

Even so, there seems to be a problem implementing the interface for certain tables. MLUtils.jl interprets tuples in a very specific way. For example shuffleobs((x1, x2)) treats x1 and x2 as separate data containers, which are to be shuffled simultaneously, with the same base observation index shuffle. But some tables are tuples. The following example is even a tuple-table whose elements are themselves tables (of a different type):

julia> X
((a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7]))

julia> Tables.istable(X)
true

So is such a tuple a pair of data containers or a single data container? The current API cannot distinguish them.

I wonder:

How attached are people to current tuple-based dispatch for coupled multi-container processing?
Is there a big use-case for tables that are also tuples? @quinnj

Possibly this discussion is related.

Tables that are tuples are problematic elsewhere.

@oxinabox @rikhuijzer @darsnack

The text was updated successfully, but these errors were encountered:

ablaom · 2022-02-23T21:44:08Z

@OkonSamuel

darsnack · 2022-02-23T22:21:19Z

JuliaML/MLDatasets.jl#96 adds data containers that are MLUtils compatible including TableDataset which wraps any Tables.jl compatible table. Tables.jl is just an interface, so we need a wrapper type to dispatch correctly. This way even a tuple that's intended to be a table can be treated as such.

Perhaps TableDataset is more fundamental and should be moved to MLUtils directly.

ablaom · 2022-02-23T22:29:54Z

@darsnack Thanks! That's good to know.

Perhaps TableDataset is more fundamental and should be moved to MLUtils directly.

Makes sense to me.

ablaom · 2022-02-24T00:38:46Z

What are the downsides of using trait-based dispatch to avoid the wrapping, apart from meaning Tables.jl has to be a dependency of BaseLearn.jl (or whatever replaces that)? That is, assuming the tuples issue is resolved another way.

darsnack · 2022-02-24T00:40:10Z

You mean using a traits package that turns istable into a dispatch instead of a run time check?

ablaom · 2022-02-24T00:48:38Z

Mmm, on further reflection, I guess there's no way to do this without breaking the API for current implementations, right?

darsnack · 2022-02-24T01:10:29Z

Depends on what you're thinking. With SimpleTraits.jl that's not the case though I haven't checked whether @traitimpl IsTable{X} <- istable(X) will incur no overhead.

Breaking changes to the API is not an issue, but I'd like to avoid needing to call istable on everything by default (also a bit arbitrary to special case Tables.jl). And I think having tuples/named tuples follow the current behavior is more desirable. A table might be a named tuple but not all named tuples are tables. I think for a Base type, it makes more sense for the user to specify a special meaning.

darsnack · 2022-02-24T01:13:15Z

I guess the other way of thinking about this is what is an example where something is Tables.jl compatible, but wrapping it in something like TableDataset is not desirable/possible? For TableDataset, we can make all the standard Tables.jl operations pass through to the underlying table so that the wrapper is nearly invisible beyond construction.

ablaom · 2022-02-24T03:24:03Z

I think these are fair arguments. And, by the way, I do think the TableDataset work is a cool and valuable contribution (which should probably not be hidden away in MLDatasets).

However, I believe the run-of-the mill statistics user will find wrapping tables off-putting, to say the least.
Therefore, in the "tables-must-be-wrapped" proposal, it becomes the responsibility of every model providing package or ML tooling package to wrap and unwrap tables, yes? But then such a package needs to test whether or not to do so (unless it only deals with tables) essentially duplicating the trait-based dispatch which I am proposing goes into this base ML API interface in the first place, no?

ToucheSir · 2022-02-24T04:10:20Z

I'm not sure why shuffleobs isn't a varargs function (historical reasons?), but the big issue with auto-wrapping Tables.jl-compatible sources is this. In an absence of an efficient random access API to rely on, our best option is to rely on users implementing getobs on their own dataset types. Always wrapping would mean consistently poor performance.

ablaom · 2022-02-24T05:46:07Z

Good point I hadn't thought of. By always wrapping, you hide the original table type, for which there might be efficient random access for which you could otherwise specialise on . Instead you always have to go with the random access through the Tables.jl API which is generally bad.

darsnack · 2022-02-24T17:32:33Z

Therefore, in the "tables-must-be-wrapped" proposal, it becomes the responsibility of every model providing package or ML tooling package to wrap and unwrap tables, yes? But then such a package needs to test whether or not to do so (unless it only deals with tables) essentially duplicating the trait-based dispatch which I am proposing goes into this base ML API interface in the first place, no?

Yeah that's a good point. I will see if there's an overhead to a trait-based approach and file a PR if not. Note that tuples would still be excluded from this path.

ablaom · 2022-02-24T21:10:43Z

I will see if there's an overhead to a trait-based approach and file a PR if not.

Thanks indeed for consdering this. Will this addresses @ToucheSir 's important point? That is, would getobs(my_table) always dispatch the inefficient tables fallback, even if an efficient getobs(::MyTable) is implemented? Perhaps we also we need a isdatacontainer trait (analogous to istable) to resolve this issue. Is there something like this?

I still feel the current interpretation of tuples in MLUtils, as way to dispatch special behaviour (and inherited from MLDataPattern) is intrinsically flawed. This isn't just about tables. Why should we exclude implementation of the interface to objects that happen to be tuples (or any other data type)? What about @ToucheSir 's suggestion to use varargs, which "only" breaks use of shuffle and its cousins in the multi-container case, as far as I can tell. Are there other reasons for not using varargs?

darsnack · 2022-02-24T22:46:13Z

That is, would getobs(my_table) always dispatch the inefficient tables fallback, even if an efficient getobs(::MyTable) is implemented?

I will double check, but if getobs(::MyTable) exists, then it would take precedence over getobs(::X) where {X, IsTable{X}}. There is still the question of where these fast implementations live. Hopefully, DataFrames, etc. will be happy to accept a PR adding these definitions.

Why should we exclude implementation of the interface to objects that happen to be tuples (or any other data type)?

Unfortunately, Julia forces a one-or-none situation here. Let's say MLUtils had no special path for tuples, but we did have a trait-based path for Tables.jl tables. Then tuples will automatically take that path. If some other Foo.jl interface that's not Tables.jl is also compatible with tuples, then there is no way for tuples to selectively participate as a Tables.jl table or a Foo.jl object.

My point is that treating tuples as tables is the same as treating them the way we do now. The only difference is what interpretation is given favoritism. So the alternative to what we have now is that tuples have no meaning attached to them, or we decide which interpretation is the "golden" one.

as way to dispatch special behaviour

At face value, it looks like the tuple decision is about dispatching the inputs to a function, but it is really a consequence of the outputs. In Julia, there is a single, canonical way to return multiple values, and that is a tuple.

If I do sample = getobs(A, i) for some array A, then I can apply the whole MLUtils pipeline (shuffleobs, etc.) on sample. This is a feature of the package—that I can just write out the pipeline interleaving data processing, transformation, and iteration utilities. For a container that returns multiple parts in each sample, sample will be a tuple (e.g. HDF5 datasets where various inputs and labels are stored under different HDF5 directories).

The alternative for MLUtils is to create some GroupedData which would be fine to use for data sources, but cumbersome to enforce at intermediate stages in the pipeline (every type should need to wrap every multi-return value as a GroupedData?). This would basically be MLUtils creating its own version of tuple.

So, the decision that a group of "things" is a tuple is really the interpretation given by Julia's return behavior. Maybe no package should take ownership of tuple in its interface, and MLUtils should do some GroupedData thing. But I really do think we are adhering quite closely to the interpretation implied by Julia. IMO, it's Tables.jl that is misusing the notion of a tuple by turning it into a table (of course, I lack experience...maybe there is a really good reason that I'm missing).

darsnack · 2022-02-24T22:49:58Z

^long explanation above being said, I don't want to imply that I feel so strongly about this that MLUtils can't budge...the solution here is going to have to be a compromise somewhere and that can certainly be this package.

darsnack · 2022-02-24T22:52:55Z

Another option is to enforce explicit splatting in the pipeline

ablaom · 2022-02-27T20:37:49Z

So, the decision that a group of "things" is a tuple is really the interpretation given by Julia's return behavior.

Aha. Yes, I get this now. You want tuples for the pipelining.

Another option is to enforce explicit splatting in the pipeline

For the user, you mean? Sorry, I'm not actually that familiar with the pipelining here - is there special syntax or are you just composing functions?

^long explanation above being said, I don't want to imply that I feel so strongly about this that MLUtils can't budge...the solution here is going to have to be a compromise somewhere and that can certainly be this package.

Thanks for being open about this. I'm kind of playing devil's advocate here, and this more subtle than I realised at first. It may may be that wrapping tables that don't have native implementations of getobs is the safer path.

Mmm. What if we put logic in the wrapping operation that detects whether getobs is implemented, and if it is, we just return the original (unwrapped) object. This addresses @ToucheSir 's concern, right?

darsnack · 2022-02-28T17:31:41Z

For the user, you mean? Sorry, I'm not actually that familiar with the pipelining here - is there special syntax or are you just composing functions?

Yeah we just use |> so this would mean asking the user to turn a tuple into a grouped container to pass to the next stage in the pipeline.

Mmm. What if we put logic in the wrapping operation that detects whether getobs is implemented, and if it is, we just return the original (unwrapped) object.

You mean that if someone tries to wrap a DataFrame as a TableDataset then we just pass through the DataFrame instead of wrapping it (similarly for anything that implements getobs directly)? We can certainly make that happen, but this would still require the user or downstream package to perform the wrapping operation? Just trying to understand where auto-wrapping would come into play here.

ToucheSir · 2022-03-13T20:14:48Z

FYI: JuliaData/Tables.jl#278

CarloLucibello · 2022-04-03T18:18:45Z

I think it would be good to support Tables.jl's tables without any wrappers. For numobs we can change the generic fallback to

function numobs(data)
    if Tables.istable(data)
        return length(Tables.rows(data))
    else
        return length(data)
    end
end

getobs is more problematic since Tables.jl doesn't expose a getrow(table, i) interface. We could just implement the following, that works for DataFrames

function getobs(data, idx::Union{Integer, AbstractVector{<:Integer}})
    if Tables.istable(data)
        return data[idx, :]
    else
        return data[idx]
    end
end

At least this leaves us in a better position than where we currently stand.

Tuples and NameTuples current specializations should stay as they are instead.

ablaom · 2022-04-03T22:14:00Z

@CarloLucibello (cc edited) Thanks for that. I agree it would be good to avoid wrapping. Summarizing discussions at JuliaData/Tables.jl#276 and on slack, I think there will be no change at Tables.jl to rule out tuple-tables, and I suggest MLUtils.jl continues to treat tuples as multi-containers. On further investigation, ruling out tuple tables for ML is probably fine, at least for now.

I did not realize that MLUtils treats named tuples in a special way. Can you say more about this? That could be a deal-breaker as named tuples of equi-length vectors (called column tables) is a fundamental native Julia table type used all over the place.

As for the your getobs proposal. I think we can improve on this and I suggest we wait and see how JuliaData/Tables.jl#278 resolves. I think there is momentum for progress there. Is there some urgency at your end? If so, I could have a bash at suggesting something better.

darsnack · 2022-04-03T23:02:37Z

It's treated the same as tuple: a multi-container like (features = ..., targets = ...) will call getobs, etc. on each entry in the named tuple as if it were a container.

ablaom · 2022-04-04T00:15:46Z

Oh no 😭 .

ToucheSir · 2022-04-04T03:14:01Z

AFAICT MLUtils and Tables jl are consistent in treating a namedtuple of vectors or tuples as column tables, can you elaborate on your concern?

ablaom · 2022-04-04T04:29:30Z

Oh, I see. I guess that means column tables can only be indexed as if they were multi-containers of vectors (or tuples) (in the MLUtils sense). Nothing that Tables.jl does about column tables (eg, some change to the implementation of Tables.rows) will have any effect on behaviour, no matter what we implement here for generic tables, but I guess that doesn't matter too much??

CarloLucibello · 2022-06-28T07:09:03Z

It will be possible to solve this issue using Tables.getrows from JuliaData/Tables.jl#284 and DataAPI.nrow.

ablaom mentioned this issue Mar 3, 2022

Tables that are tuples JuliaData/Tables.jl#276

Open

ablaom mentioned this issue Mar 13, 2022

Start an idea of what an "in memory" requirement would look like JuliaData/Tables.jl#278

Open

This was referenced Mar 30, 2022

awkward use case: features that don't naturally combine into a table JuliaAI/MLJ.jl#915

Open

[Discussion] Support the MLDataPattern API for data containers JuliaAI/MLJ.jl#918

Open

darsnack mentioned this issue Apr 3, 2022

redesign the package JuliaML/MLDatasets.jl#73

Closed

This was referenced Jun 16, 2022

Possibility of TableOperations.jl for column access JuliaAI/MLJBase.jl#790

Open

Use new Tables API in implementation of selectrows JuliaAI/MLJBase.jl#791

Open

CarloLucibello mentioned this issue Jun 28, 2022

Tables.jl and DataAPI.jl interoperation #67

Closed

ablaom mentioned this issue Oct 12, 2022

Exclude certain kinds of dictionary tables from being interpreted as tables JuliaAI/ScientificTypes.jl#189

Merged

CarloLucibello mentioned this issue Oct 15, 2022

numobs and getobs support for Tables.jl's tables #124

Merged

CarloLucibello closed this as completed in #124 Oct 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can MLUtils play nicely with Tables.jl? #61

Can MLUtils play nicely with Tables.jl? #61

ablaom commented Feb 23, 2022 •

edited

Loading

ablaom commented Feb 23, 2022

darsnack commented Feb 23, 2022

ablaom commented Feb 23, 2022

ablaom commented Feb 24, 2022

darsnack commented Feb 24, 2022

ablaom commented Feb 24, 2022 •

edited

Loading

darsnack commented Feb 24, 2022

darsnack commented Feb 24, 2022

ablaom commented Feb 24, 2022

ToucheSir commented Feb 24, 2022

ablaom commented Feb 24, 2022

darsnack commented Feb 24, 2022

ablaom commented Feb 24, 2022 •

edited

Loading

darsnack commented Feb 24, 2022

darsnack commented Feb 24, 2022

darsnack commented Feb 24, 2022

ablaom commented Feb 27, 2022

darsnack commented Feb 28, 2022

ToucheSir commented Mar 13, 2022

CarloLucibello commented Apr 3, 2022 •

edited

Loading

ablaom commented Apr 3, 2022 •

edited

Loading

darsnack commented Apr 3, 2022

ablaom commented Apr 4, 2022

ToucheSir commented Apr 4, 2022

ablaom commented Apr 4, 2022 •

edited

Loading

CarloLucibello commented Jun 28, 2022

Can MLUtils play nicely with Tables.jl? #61

Can MLUtils play nicely with Tables.jl? #61

Comments

ablaom commented Feb 23, 2022 • edited Loading

ablaom commented Feb 23, 2022

darsnack commented Feb 23, 2022

ablaom commented Feb 23, 2022

ablaom commented Feb 24, 2022

darsnack commented Feb 24, 2022

ablaom commented Feb 24, 2022 • edited Loading

darsnack commented Feb 24, 2022

darsnack commented Feb 24, 2022

ablaom commented Feb 24, 2022

ToucheSir commented Feb 24, 2022

ablaom commented Feb 24, 2022

darsnack commented Feb 24, 2022

ablaom commented Feb 24, 2022 • edited Loading

darsnack commented Feb 24, 2022

darsnack commented Feb 24, 2022

darsnack commented Feb 24, 2022

ablaom commented Feb 27, 2022

darsnack commented Feb 28, 2022

ToucheSir commented Mar 13, 2022

CarloLucibello commented Apr 3, 2022 • edited Loading

ablaom commented Apr 3, 2022 • edited Loading

darsnack commented Apr 3, 2022

ablaom commented Apr 4, 2022

ToucheSir commented Apr 4, 2022

ablaom commented Apr 4, 2022 • edited Loading

CarloLucibello commented Jun 28, 2022

ablaom commented Feb 23, 2022 •

edited

Loading

ablaom commented Feb 24, 2022 •

edited

Loading

ablaom commented Feb 24, 2022 •

edited

Loading

CarloLucibello commented Apr 3, 2022 •

edited

Loading

ablaom commented Apr 3, 2022 •

edited

Loading

ablaom commented Apr 4, 2022 •

edited

Loading