-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for Tables.jl interface #63
Conversation
Quick tests: As row store:
As column store:
Schema:
Integration with DataFrames.jl:
Integration with CSV.jl:
|
Codecov Report
@@ Coverage Diff @@
## master #63 +/- ##
=========================================
+ Coverage 92.46% 93.1% +0.63%
=========================================
Files 9 9
Lines 783 783
=========================================
+ Hits 724 729 +5
+ Misses 59 54 -5
Continue to review full report at Codecov.
|
* Include v1.0.0 perf test vs Pandas & ReadStat * Updated README with perf test summary
- Tables.jl support while maintaining backward compatibility (PR #63) - Updated performance benchmark vs. python/pandas and ReadStat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I'm a little slow in responding here; I was on holiday w/ limited internet access. This looks pretty good IMO! I added a few comments of things to think about, but overall it looks great to me. Feel free to ping me on the slack if you have any more questions or want to chat about something; I'm back to civilization now, so I'll be more responsive.
@@ -6,6 +6,7 @@ version = "1.0.0" | |||
[deps] | |||
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a" | |||
StringEncodings = "69024149-9ee7-55f6-a4c4-859efe599b68" | |||
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c" | |||
TabularDisplay = "3eeacb1d-13c2-54cc-9b18-30c86af3cadb" | |||
|
|||
[compat] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For compat, I'd suggest at least Tables 0.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, I've set it to Tables = "0.2.3"
(copied from DataFrame). Do you suggest "downgrading"?
Base.names(rs::ResultSet) = getfield(rs, :names) | ||
|
||
Base.size(rs::ResultSet) = getfield(rs, :size) | ||
Base.size(rs::ResultSet, i::Integer) = getfield(rs, :size)[i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this 2nd size
method and the length
methods aren't needed if you implement the first size
method, but that also might require ResultSet
to be a subtype of AbstractArray
. Note that I recently switched CSV.File
to be CSV.File <: AbstractVector{CSV.Row}
and it's made things a little more convenient in a couple of ways.
# Return a single row as a tuple | ||
Base.getindex(rs::ResultSet, i::Integer) = Tuple([c[i] for c in rs.columns]) | ||
# Return a single row as a named tuple | ||
Base.getindex(rs::ResultSet, i::Integer) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One pattern a lot of table types have moved to is having a "lazy row" struct instead of materializing full NamedTuples (which can be extremely costly for really wide datasets, like >1000 columns). It would look something like:
struct ResultSetRow <: AbstractVector{Any}
r::ResultSet
row::Int
end
and then you'd define getindex
, getproperty
, size
, and propertynames
on ResultSetRow
.
Just something to consider.
end | ||
println(io) | ||
end | ||
n < size(rs, 1) && println(io, "⋮") | ||
end | ||
|
||
# IteratableTables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note if you'd still like to keep explicit IterableTables compatibility, you can use some of the convenience functions provided by Tables. DataFrames, for example, defines:
IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) = Tables.datavaluerows(columntable(df))
IteratorInterfaceExtensions.isiterable(x::AbstractDataFrame) = true
TableTraits.isiterabletable(x::AbstractDataFrame) = true
You'd have to add IteratorInterfaceExtensions
and TableTraits
as explicit dependencies, but just replace AbstractDataFrame
with ResultSet
and it should work.
@test Tables.rowaccess(typeof(rs)) === true | ||
@test Tables.columnaccess(typeof(rs)) === true | ||
@test Tables.rows(rs) |> first |> propertynames |> Tuple == Tuple(names(rs)) | ||
@test Tables.columns(rs) |> propertynames |> Tuple == Tuple(names(rs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest also using the Tables.jl-provided rowtable
and columntable
functions to test things. Like:
@test Tables.rowtable(rs) ==
@test Tables.columntable(rs) ==
Hey. No worries. I know it's a funny time of the year to ping anyone although this is also the time that I can actually focus and do some real work 😛 Thanks very much for your valuable comments. I'll certainly go through them and make it better. |
This PR fixes issue #54
Main changes are:
getindex(rs::ResultSet, i::Integer)
now returns a named tuple instead of plain tupleBase.propertynames
andBase.getproperty
methods are implemented forResultSet
.Other notes:
rs[:columnname]
continues to return the column array and the behavior is replicated as inrs.columnname
.So the only noticeable change should be the return of named tuples when used as a row store. Since named tuples can be used like regular tuples, this PR should be backward compatible. Hence a minor release is warranted.