Support for Tables.jl interface #63

Merged: 8 commits merged on Jan 2, 2020

@tk3369 (Owner) commented Jan 1, 2020

This PR fixes issue #54

Main changes are:

  • getindex(rs::ResultSet, i::Integer) now returns a named tuple instead of a plain tuple.
  • Base.propertynames and Base.getproperty methods are implemented for ResultSet.
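
A minimal sketch of how these two changes might fit together, using a hypothetical stand-in type `DemoSet` rather than the real `ResultSet` (the field names and error handling here are assumptions, not the package's implementation). Because `getproperty` is overloaded, internal fields are read with `getfield`, as the `Base.names` and `Base.size` methods quoted in the review further down do.

```julia
# Hypothetical stand-in for a result set: column names plus one vector per column.
struct DemoSet
    names::Vector{Symbol}
    columns::Vector{AbstractVector}
end

# Expose column names as properties.
Base.propertynames(ds::DemoSet) = Tuple(getfield(ds, :names))

# ds.COLUMN returns the column vector; fields are reached through getfield only.
function Base.getproperty(ds::DemoSet, name::Symbol)
    i = findfirst(isequal(name), getfield(ds, :names))
    i === nothing && throw(ArgumentError("column $name not found"))
    return getfield(ds, :columns)[i]
end

# ds[:COLUMN] keeps working and returns the same vector as ds.COLUMN.
Base.getindex(ds::DemoSet, name::Symbol) = getproperty(ds, name)

# ds[i] returns row i as a named tuple.
Base.getindex(ds::DemoSet, i::Integer) =
    NamedTuple{Tuple(getfield(ds, :names))}(Tuple(c[i] for c in getfield(ds, :columns)))

ds = DemoSet([:ACTUAL, :COUNTRY], AbstractVector[[925.0, 999.0], ["CANADA", "CANADA"]])

@assert ds.ACTUAL == [925.0, 999.0]
@assert ds[:ACTUAL] === ds.ACTUAL
@assert ds[1] == (ACTUAL = 925.0, COUNTRY = "CANADA")
@assert propertynames(ds) == (:ACTUAL, :COUNTRY)
```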

Other notes:

  • Both row & column access are supported.
  • Direct ResultSet access methods are unchanged, i.e. backward compatible. For example, rs[:columnname] still returns the column array, and rs.columnname now returns the same array.

So the only noticeable change should be that rows are returned as named tuples when a ResultSet is used as a row store. Since named tuples can be indexed and iterated like regular tuples, this PR should be backward compatible, so a minor release is warranted.
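
To illustrate the compatibility claim, here is a small sketch with a hand-written row (the values are taken from the first record shown in the tests below, shortened to three fields). It shows the tuple-style operations that keep working, and one case worth checking in downstream code: a named tuple is not a subtype of `Tuple`, so methods that dispatch on `::Tuple` would not accept it.

```julia
# A row as it might be returned after this change (shortened for illustration).
row = (ACTUAL = 925.0, PREDICT = 850.0, COUNTRY = "CANADA")

# Positional indexing, destructuring and iteration behave like a plain tuple.
@assert row[1] == 925.0
actual, predict, country = row
@assert country == "CANADA"
@assert collect(row) == Any[925.0, 850.0, "CANADA"]

# Field access by name is the new convenience.
@assert row.ACTUAL == 925.0

# Caveat: the type is NamedTuple, not Tuple; convert explicitly if a Tuple is required.
@assert !(row isa Tuple)
@assert Tuple(row) == (925.0, 850.0, "CANADA")
```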

@tk3369 (Owner, Author) commented Jan 1, 2020

Quick tests:

As row store:

julia> rs = readsas("test/data_pandas/productsales.sas7bdat")
Read test/data_pandas/productsales.sas7bdat with size 1440 x 10 in 0.2239 seconds
SASLib.ResultSet (1440 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
3: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
4: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
5: 656.0, 646.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-05-01
⋮

julia> rs[1]
(ACTUAL = 925.0, PREDICT = 850.0, COUNTRY = "CANADA", REGION = "EAST", DIVISION = "EDUCATION", PRODTYPE = "FURNITURE", PRODUCT = "SOFA", QUARTER = 1.0, YEAR = 1993.0, MONTH = 1993-01-01)

julia> sum(r.ACTUAL for r in rs)
730337.0

As column store:

julia> rs.ACTUAL
1440-element Array{Float64,1}:
 925.0
 999.0
 608.0
   ⋮  
 526.0
 652.0
 573.0

Schema:

julia> Tables.schema(rs)
Tables.Schema:
 :ACTUAL    Float64                   
 :PREDICT   Float64                   
 :COUNTRY   String                    
 :REGION    String                    
 :DIVISION  String                    
 :PRODTYPE  String                    
 :PRODUCT   String                    
 :QUARTER   Float64                   
 :YEAR      Float64                   
 :MONTH     Union{Missing, Dates.Date}

Integration with DataFrames.jl:

julia> DataFrame(rs)
1440×10 DataFrame
│ Row  │ ACTUAL  │ PREDICT │ COUNTRY │ REGION │ DIVISION  │ PRODTYPE  │ PRODUCT │ QUARTER │ YEAR    │ MONTH      │
│      │ Float64 │ Float64 │ String  │ String │ String    │ String    │ String  │ Float64 │ Float64 │ Dates…⍰    │
├──────┼─────────┼─────────┼─────────┼────────┼───────────┼───────────┼─────────┼─────────┼─────────┼────────────┤
│ 1    │ 925.0   │ 850.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0  │ 1993-01-01 │
│ 2    │ 999.0   │ 297.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0  │ 1993-02-01 │
│ 3    │ 608.0   │ 846.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0  │ 1993-03-01 │
│ 4    │ 642.0   │ 533.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 2.0     │ 1993.0  │ 1993-04-01 │
│ 5    │ 656.0   │ 646.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 2.0     │ 1993.0  │ 1993-05-01 │
│ 6    │ 948.0   │ 486.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 2.0     │ 1993.0  │ 1993-06-01 │

Integration with CSV.jl:

julia> CSV.write("/tmp/test.csv", rs)
"/tmp/test.csv"

shell> head /tmp/test.csv
ACTUAL,PREDICT,COUNTRY,REGION,DIVISION,PRODTYPE,PRODUCT,QUARTER,YEAR,MONTH
925.0,850.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,1.0,1993.0,1993-01-01
999.0,297.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,1.0,1993.0,1993-02-01
608.0,846.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,1.0,1993.0,1993-03-01
642.0,533.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,2.0,1993.0,1993-04-01
656.0,646.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,2.0,1993.0,1993-05-01
948.0,486.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,2.0,1993.0,1993-06-01
612.0,717.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,3.0,1993.0,1993-07-01
114.0,564.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,3.0,1993.0,1993-08-01
685.0,230.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,3.0,1993.0,1993-09-01

@codecov bot commented Jan 2, 2020

Codecov Report

Merging #63 into master will increase coverage by 0.63%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master     #63      +/-   ##
=========================================
+ Coverage   92.46%   93.1%   +0.63%     
=========================================
  Files           9       9              
  Lines         783     783              
=========================================
+ Hits          724     729       +5     
+ Misses         59      54       -5

Impacted Files     Coverage Δ
src/ResultSet.jl   95.34% <ø> (ø) ⬆️
src/tables.jl      100% <0%> (+83.33%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7e9ea75...2d5f657.

@coveralls commented Jan 2, 2020

Coverage increased (+0.2%) to 93.25% when pulling 2d5f657 on tables-interface into 1fbb143 on master.

@tk3369 tk3369 merged commit ebb35b2 into master Jan 2, 2020
tk3369 added a commit that referenced this pull request Jan 2, 2020
- Tables.jl support while maintaining backward compatibility (PR #63)
- Updated performance benchmark vs. python/pandas and ReadStat

@quinnj left a comment

Sorry I'm a little slow in responding here; I was on holiday w/ limited internet access. This looks pretty good IMO! I added a few comments of things to think about, but overall it looks great to me. Feel free to ping me on the slack if you have any more questions or want to chat about something; I'm back to civilization now, so I'll be more responsive.

@@ -6,6 +6,7 @@ version = "1.0.0"
[deps]
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
StringEncodings = "69024149-9ee7-55f6-a4c4-859efe599b68"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
TabularDisplay = "3eeacb1d-13c2-54cc-9b18-30c86af3cadb"

[compat]

For compat, I'd suggest at least Tables 0.2

@tk3369 (Owner, Author) replied:

Right now, I've set it to Tables = "0.2.3" (copied from DataFrames.jl). Do you suggest "downgrading"?

Base.names(rs::ResultSet) = getfield(rs, :names)

Base.size(rs::ResultSet) = getfield(rs, :size)
Base.size(rs::ResultSet, i::Integer) = getfield(rs, :size)[i]

I think this 2nd size method and the length methods aren't needed if you implement the first size method, but that also might require ResultSet to be a subtype of AbstractArray. Note that I recently switched CSV.File to be CSV.File <: AbstractVector{CSV.Row} and it's made things a little more convenient in a couple of ways.
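
A rough sketch of the point about `AbstractArray`, using a hypothetical `MiniTable` type (not the real `ResultSet`). Once a type subtypes `AbstractVector` and defines `size` and `getindex`, Julia's generic fallbacks supply `length`, `size(x, i)`, `first`, `last` and iteration. One assumption to note: as a vector of rows, `size` has to return a 1-tuple, whereas the `size` method quoted above appears to return a stored rows-and-columns value, so the two designs would need reconciling.

```julia
# Hypothetical row-oriented table that is a vector of named tuples.
struct MiniTable <: AbstractVector{NamedTuple}
    names::Vector{Symbol}
    columns::Vector{AbstractVector}
end

# The only two methods defined by hand: size (number of rows) and getindex (one row).
Base.size(t::MiniTable) = (length(first(t.columns)),)
Base.getindex(t::MiniTable, i::Int) =
    NamedTuple{Tuple(t.names)}(Tuple(c[i] for c in t.columns))

t = MiniTable([:ACTUAL, :COUNTRY],
              AbstractVector[[925.0, 999.0, 608.0], ["CANADA", "CANADA", "CANADA"]])

# Everything below comes from AbstractArray fallbacks, not from extra methods.
@assert length(t) == 3
@assert size(t, 1) == 3
@assert first(t).ACTUAL == 925.0
@assert last(t).ACTUAL == 608.0
@assert sum(r.ACTUAL for r in t) == 2532.0
```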

# Return a single row as a tuple
Base.getindex(rs::ResultSet, i::Integer) = Tuple([c[i] for c in rs.columns])
# Return a single row as a named tuple
Base.getindex(rs::ResultSet, i::Integer) =

One pattern a lot of table types have moved to is having a "lazy row" struct instead of materializing full NamedTuples (which can be extremely costly for really wide datasets, like >1000 columns). It would look something like:

struct ResultSetRow <: AbstractVector{Any}
    r::ResultSet
    row::Int
end

and then you'd define getindex, getproperty, size, and propertynames on ResultSetRow.

Just something to consider.
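
A sketch of what the lazy-row pattern could look like end to end. The types `DemoResultSet` and `DemoRow` are hypothetical stand-ins (the real `ResultSet` fields may differ); the row stores only a reference to its parent and a row number, and each value is looked up on demand instead of being copied into a named tuple.

```julia
# Hypothetical parent table: column names plus one vector per column.
struct DemoResultSet
    names::Vector{Symbol}
    columns::Vector{AbstractVector}
end

# Lazy row: a reference to the parent and a row number, nothing is materialized.
struct DemoRow <: AbstractVector{Any}
    rs::DemoResultSet
    row::Int
end

# getproperty is overloaded below, so the row's own fields are read with getfield.
Base.size(r::DemoRow) = (length(getfield(r, :rs).columns),)
Base.getindex(r::DemoRow, i::Int) = getfield(r, :rs).columns[i][getfield(r, :row)]
Base.propertynames(r::DemoRow) = Tuple(getfield(r, :rs).names)

function Base.getproperty(r::DemoRow, name::Symbol)
    rs = getfield(r, :rs)
    i = findfirst(isequal(name), rs.names)
    i === nothing && throw(ArgumentError("column $name not found"))
    return rs.columns[i][getfield(r, :row)]
end

rs = DemoResultSet([:ACTUAL, :COUNTRY], AbstractVector[[925.0, 999.0], ["CANADA", "CANADA"]])
row = DemoRow(rs, 2)

@assert row.ACTUAL == 999.0          # access by name
@assert row[2] == "CANADA"           # access by position
@assert length(row) == 2             # from the AbstractVector fallback
@assert propertynames(row) == (:ACTUAL, :COUNTRY)
```

The trade-off is that each lookup by name does a linear search over the column names here; a real implementation might cache a name-to-index mapping.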

end
println(io)
end
n < size(rs, 1) && println(io, "⋮")
end

# IteratableTables

Note if you'd still like to keep explicit IterableTables compatibility, you can use some of the convenience functions provided by Tables. DataFrames, for example, defines:

IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) = Tables.datavaluerows(columntable(df))
IteratorInterfaceExtensions.isiterable(x::AbstractDataFrame) = true
TableTraits.isiterabletable(x::AbstractDataFrame) = true

You'd have to add IteratorInterfaceExtensions and TableTraits as explicit dependencies, but just replace AbstractDataFrame with ResultSet and it should work.

@test Tables.rowaccess(typeof(rs)) === true
@test Tables.columnaccess(typeof(rs)) === true
@test Tables.rows(rs) |> first |> propertynames |> Tuple == Tuple(names(rs))
@test Tables.columns(rs) |> propertynames |> Tuple == Tuple(names(rs))

I'd suggest also using the Tables.jl-provided rowtable and columntable functions to test things. Like:

@test Tables.rowtable(rs) ==
@test Tables.columntable(rs) == 
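
As an illustration of the shape such tests could take, here is a sketch that uses a small hand-written column table in place of `rs` (the data is a stand-in, not the expected values for the SAS test file). `Tables.rowtable` returns a vector of named tuples and `Tables.columntable` returns a named tuple of vectors, so both can be compared against literals.

```julia
using Tables, Test

# Stand-in source: a named tuple of vectors is itself a valid Tables.jl column table.
cols = (ACTUAL = [925.0, 999.0], COUNTRY = ["CANADA", "CANADA"])

# Row view: one named tuple per row.
@test Tables.rowtable(cols) == [
    (ACTUAL = 925.0, COUNTRY = "CANADA"),
    (ACTUAL = 999.0, COUNTRY = "CANADA"),
]

# Column view: round-tripping through rows gives back the same columns.
@test Tables.columntable(Tables.rowtable(cols)) == cols
```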

@tk3369 (Owner, Author) commented Jan 4, 2020

> Sorry I'm a little slow in responding here; I was on holiday w/ limited internet access. This looks pretty good IMO! I added a few comments of things to think about, but overall it looks great to me. Feel free to ping me on the slack if you have any more questions or want to chat about something; I'm back to civilization now, so I'll be more responsive.

Hey. No worries. I know it's a funny time of the year to ping anyone although this is also the time that I can actually focus and do some real work 😛 Thanks very much for your valuable comments. I'll certainly go through them and make it better.
