Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

creating a DTable eagerly copy all entries? #59

Open
Moelf opened this issue Nov 8, 2023 · 6 comments
Open

creating a DTable eagerly copy all entries? #59

Moelf opened this issue Nov 8, 2023 · 6 comments

Comments

@Moelf
Copy link

Moelf commented Nov 8, 2023

julia> struct A <: AbstractVector{Int}
           v::Int
       end
julia> Base.size(::A) = (100, 1)

julia> Base.getindex(a::A, ::Int) = (println("??"); a.v)

julia> using DTables

julia> dt = DTable((a=A(3),), 10);
??
??
??
??
@carstenbauer
Copy link
Member

Came up during the Julia HEP 2023 workshop.

(cc @grasph)

@Moelf
Copy link
Author

Moelf commented Nov 8, 2023

but this seems to be fine?

julia> dt = DTable((a=A(3),));

julia>

@carstenbauer
Copy link
Member

Naive debugging during a talk :)

I think the critical line in DTables.jl is https://github.com/JuliaParallel/DTables.jl/blob/main/src/table/dtable.jl#L109. If I didn't make a debugging mistake, it jumps to https://github.com/JuliaData/Tables.jl/blob/175e431eadaae9e439b5a8e020afce06dc6cf4f4/src/namedtuples.jl#L187 and then [...] to https://github.com/JuliaData/Tables.jl/blob/175e431eadaae9e439b5a8e020afce06dc6cf4f4/src/fallbacks.jl#L265 which is

CopiedColumns(buildcolumns(schema(r), r))

@carstenbauer
Copy link
Member

carstenbauer commented Nov 8, 2023

When we do DTable((a=A(3),)); we have

typeof(partition) == NamedTuple{(:a,), Tuple{A}}

go into the sink call, which leads to https://github.com/JuliaData/Tables.jl/blob/main/src/namedtuples.jl#L192.

For DTable((a=A(3),), 10); OTOH we have

typeof(inner_partition) = TableOperations.ColumnPartitionRows{NamedTuple{(:a,), Tuple{A}}}

which leads to the fallback that I linked the post above.

@Moelf
Copy link
Author

Moelf commented Nov 8, 2023

I think it's reasonable for partition to allocate, closing as not actionable.

so DTable calls table operation, maybe it should not materialize like this?

@Moelf Moelf closed this as completed Nov 8, 2023
@Moelf Moelf reopened this Nov 8, 2023
@krynju
Copy link
Member

krynju commented Nov 8, 2023

The simple answer without going into details is that if you use the constructors that assume the partitioning of the input then it will not copy the input (at least it shouldn't)
When you use the constructors that take the chunksize arg then the constructor will be partitioning the data for you and ignore the partitioning of the input (fyi it will still use the partitions interface).
With these you can also control whether you want to merge rows between the partitions of the input, which might be helpful when your partitions are huge and you don't want two of them loaded at the same time

The above work on Tables/TableOperations, so they fallback to copies eventually even if the code doesn't really copy it explicitly (not aware if this can be improved)

I'm not super happy with the constructors for DTable. There's a lot of cool things we could do to make it more efficient when using different input sources. As of right now it's made to be universal and take any Tables.jl input
Good news is that they're not very tightly coupled with the rest of the code and if you really need to you can write your own constructor pretty easily, so we can extend/replace this without any concerns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants