Support for Tables.jl interface #63

tk3369 · 2020-01-01T23:46:14Z

This PR fixes issue #54.

Main changes are:

- getindex(rs::ResultSet, i::Integer) now returns a named tuple instead of a plain tuple.
- Base.propertynames and Base.getproperty methods are implemented for ResultSet.

Other notes:

- Both row and column access are supported.
- Direct ResultSet access methods are unchanged, i.e. backward compatible. For example, rs[:columnname] continues to return the column array, and rs.columnname now returns the same thing.

So the only noticeable change should be that rows come back as named tuples when the ResultSet is used as a row store. Since named tuples can be used like regular tuples, this PR should be backward compatible. Hence a minor release is warranted.
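
To make the two changes concrete, here is a minimal sketch on a toy type. ToyResultSet and its field layout are assumptions for illustration only, not the actual SASLib.ResultSet definition. It shows integer indexing returning a named tuple while symbol indexing and property access both return the column.

```julia
# Hypothetical stand-in for SASLib.ResultSet; the field layout is an assumption.
struct ToyResultSet
    columns::Vector{Any}     # one vector per column
    names::Vector{Symbol}    # column names, in the same order as `columns`
end

# Row access: build a named tuple for row `i`.
# getfield is used because getproperty is overloaded below.
function Base.getindex(rs::ToyResultSet, i::Integer)
    cols = getfield(rs, :columns)
    nms = Tuple(getfield(rs, :names))
    return NamedTuple{nms}(Tuple(c[i] for c in cols))
end

# Column access by symbol.
function Base.getindex(rs::ToyResultSet, s::Symbol)
    idx = findfirst(==(s), getfield(rs, :names))
    idx === nothing && throw(KeyError(s))
    return getfield(rs, :columns)[idx]
end

# Property access mirrors symbol indexing: rs.ACTUAL is rs[:ACTUAL].
Base.propertynames(rs::ToyResultSet) = Tuple(getfield(rs, :names))
Base.getproperty(rs::ToyResultSet, s::Symbol) = rs[s]

rs = ToyResultSet(Any[[925.0, 999.0], ["CANADA", "CANADA"]], [:ACTUAL, :COUNTRY])
row = rs[1]              # named tuple for the first row
actual, country = row    # positional destructuring still works
```

Positional indexing (row[1]) and destructuring behave as they did with plain tuples; only code that checks the exact tuple type would notice the difference.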

tk3369 · 2020-01-01T23:54:27Z

Quick tests:

As row store:

julia> rs = readsas("test/data_pandas/productsales.sas7bdat")
Read test/data_pandas/productsales.sas7bdat with size 1440 x 10 in 0.2239 seconds
SASLib.ResultSet (1440 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
3: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
4: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
5: 656.0, 646.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-05-01
⋮

julia> rs[1]
(ACTUAL = 925.0, PREDICT = 850.0, COUNTRY = "CANADA", REGION = "EAST", DIVISION = "EDUCATION", PRODTYPE = "FURNITURE", PRODUCT = "SOFA", QUARTER = 1.0, YEAR = 1993.0, MONTH = 1993-01-01)

julia> sum(r.ACTUAL for r in rs)
730337.0

As column store:

julia> rs.ACTUAL
1440-element Array{Float64,1}:
 925.0
 999.0
 608.0
   ⋮  
 526.0
 652.0
 573.0

Schema:

julia> Tables.schema(rs)
Tables.Schema:
 :ACTUAL    Float64                   
 :PREDICT   Float64                   
 :COUNTRY   String                    
 :REGION    String                    
 :DIVISION  String                    
 :PRODTYPE  String                    
 :PRODUCT   String                    
 :QUARTER   Float64                   
 :YEAR      Float64                   
 :MONTH     Union{Missing, Dates.Date}

Integration with DataFrames.jl:

julia> DataFrame(rs)
1440×10 DataFrame
│ Row  │ ACTUAL  │ PREDICT │ COUNTRY │ REGION │ DIVISION  │ PRODTYPE  │ PRODUCT │ QUARTER │ YEAR    │ MONTH      │
│      │ Float64 │ Float64 │ String  │ String │ String    │ String    │ String  │ Float64 │ Float64 │ Dates…⍰    │
├──────┼─────────┼─────────┼─────────┼────────┼───────────┼───────────┼─────────┼─────────┼─────────┼────────────┤
│ 1    │ 925.0   │ 850.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0  │ 1993-01-01 │
│ 2    │ 999.0   │ 297.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0  │ 1993-02-01 │
│ 3    │ 608.0   │ 846.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 1.0     │ 1993.0  │ 1993-03-01 │
│ 4    │ 642.0   │ 533.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 2.0     │ 1993.0  │ 1993-04-01 │
│ 5    │ 656.0   │ 646.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 2.0     │ 1993.0  │ 1993-05-01 │
│ 6    │ 948.0   │ 486.0   │ CANADA  │ EAST   │ EDUCATION │ FURNITURE │ SOFA    │ 2.0     │ 1993.0  │ 1993-06-01 │

Integration with CSV.jl:

julia> CSV.write("/tmp/test.csv", rs)
"/tmp/test.csv"

shell> head /tmp/test.csv
ACTUAL,PREDICT,COUNTRY,REGION,DIVISION,PRODTYPE,PRODUCT,QUARTER,YEAR,MONTH
925.0,850.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,1.0,1993.0,1993-01-01
999.0,297.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,1.0,1993.0,1993-02-01
608.0,846.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,1.0,1993.0,1993-03-01
642.0,533.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,2.0,1993.0,1993-04-01
656.0,646.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,2.0,1993.0,1993-05-01
948.0,486.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,2.0,1993.0,1993-06-01
612.0,717.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,3.0,1993.0,1993-07-01
114.0,564.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,3.0,1993.0,1993-08-01
685.0,230.0,CANADA,EAST,EDUCATION,FURNITURE,SOFA,3.0,1993.0,1993-09-01

codecov · 2020-01-02T00:11:14Z

Codecov Report

Merging #63 into master will increase coverage by 0.63%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master     #63      +/-   ##
=========================================
+ Coverage   92.46%   93.1%   +0.63%     
=========================================
  Files           9       9              
  Lines         783     783              
=========================================
+ Hits          724     729       +5     
+ Misses         59      54       -5

Impacted Files	Coverage Δ
src/ResultSet.jl	`95.34% <ø> (ø)`	⬆️
src/tables.jl	`100% <0%> (+83.33%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7e9ea75...2d5f657.

coveralls · 2020-01-02T00:28:30Z

Coverage increased (+0.2%) to 93.25% when pulling 2d5f657 on tables-interface into 1fbb143 on master.

quinnj

Sorry I'm a little slow in responding here; I was on holiday w/ limited internet access. This looks pretty good IMO! I added a few comments of things to think about, but overall it looks great to me. Feel free to ping me on the slack if you have any more questions or want to chat about something; I'm back to civilization now, so I'll be more responsive.

quinnj · 2020-01-04T18:24:34Z

Project.toml

@@ -6,6 +6,7 @@ version = "1.0.0"
 [deps]
 Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
 StringEncodings = "69024149-9ee7-55f6-a4c4-859efe599b68"
+Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
 TabularDisplay = "3eeacb1d-13c2-54cc-9b18-30c86af3cadb"

 [compat]


For compat, I'd suggest at least Tables 0.2

tk3369 · 2020-01-04

Right now, I've set it to Tables = "0.2.3" (copied from DataFrames.jl). Do you suggest "downgrading"?
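
For reference, Pkg's default (caret) compat semantics treat a 0.x entry as a range that stops at the next minor version: "0.2.3" admits [0.2.3, 0.3.0) and "0.2" admits [0.2.0, 0.3.0). The sketch below only mimics those two ranges with plain VersionNumber comparisons; in_range, allowed_by_023 and allowed_by_02 are hypothetical helpers, not Pkg functions.

```julia
# Hypothetical helper: true when `v` lies in the half-open range [lo, hi).
in_range(v::VersionNumber, lo::VersionNumber, hi::VersionNumber) = lo <= v < hi

# Tables = "0.2.3" behaves like the range [0.2.3, 0.3.0).
allowed_by_023(v) = in_range(v, v"0.2.3", v"0.3.0")

# Tables = "0.2" behaves like the range [0.2.0, 0.3.0).
allowed_by_02(v) = in_range(v, v"0.2.0", v"0.3.0")
```

So the two entries differ only in whether Tables 0.2.0 through 0.2.2 are accepted; neither one admits 0.3.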

quinnj · 2020-01-04T18:37:07Z

src/ResultSet.jl

+Base.names(rs::ResultSet) = getfield(rs, :names)
+
+Base.size(rs::ResultSet) = getfield(rs, :size)
+Base.size(rs::ResultSet, i::Integer) = getfield(rs, :size)[i]


I think this 2nd size method and the length methods aren't needed if you implement the first size method, but that also might require ResultSet to be a subtype of AbstractArray. Note that I recently switched CSV.File to be CSV.File <: AbstractVector{CSV.Row} and it's made things a little more convenient in a couple of ways.
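
A rough sketch of what the AbstractVector route buys, on a toy type rather than the real ResultSet. ToyTable and its fields are assumptions, and the table is modelled as a vector of rows, so size is a 1-tuple here rather than the (rows, columns) pair the current code stores.

```julia
# Hypothetical table type modelled as a vector of rows (an assumption for this sketch).
struct ToyTable <: AbstractVector{NamedTuple}
    columns::Vector{Any}     # one vector per column
    names::Vector{Symbol}    # column names
end

# The only two methods the AbstractArray interface requires for a read-only vector.
Base.size(t::ToyTable) = (length(first(t.columns)),)   # number of rows
Base.getindex(t::ToyTable, i::Int) =
    NamedTuple{Tuple(t.names)}(Tuple(c[i] for c in t.columns))
Base.IndexStyle(::Type{ToyTable}) = IndexLinear()

t = ToyTable(Any[[925.0, 999.0], ["CANADA", "CANADA"]], [:ACTUAL, :COUNTRY])

# None of these are defined above; they come from the AbstractArray fallbacks.
nrows = length(t)
nrows_dim1 = size(t, 1)
total = sum(r.ACTUAL for r in t)
last_row = last(t)
```

length, size(t, i), iteration, first and last all fall out of the generic fallbacks, which is the convenience mentioned for CSV.File.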

quinnj · 2020-01-04T18:41:04Z

src/ResultSet.jl

-# Return a single row as a tuple
-Base.getindex(rs::ResultSet, i::Integer) = Tuple([c[i] for c in rs.columns])
+# Return a single row as a named tuple
+Base.getindex(rs::ResultSet, i::Integer) = 


One pattern a lot of table types have moved to is having a "lazy row" struct instead of materializing full NamedTuples (which can be extremely costly for really wide datasets, like >1000 columns). It would look something like:

struct ResultSetRow <: AbstractVector{Any}
    r::ResultSet
    row::Int
end

and then you'd define getindex, getproperty, size, and propertynames on ResultSetRow.

Just something to consider.
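
A minimal sketch of that lazy-row pattern, using toy stand-ins. WideTable and LazyRow are hypothetical names, and the field layout is an assumption, not SASLib's. The row holds only a reference to the table plus a row number, and values are looked up on access.

```julia
# Hypothetical column store standing in for ResultSet.
struct WideTable
    columns::Vector{Any}     # one vector per column
    names::Vector{Symbol}    # column names
end

# Lazy row: no values are copied when the row is created.
struct LazyRow <: AbstractVector{Any}
    tbl::WideTable
    row::Int
end

# getfield is used throughout because getproperty is overloaded for LazyRow.
Base.size(r::LazyRow) = (length(getfield(r, :tbl).columns),)
Base.getindex(r::LazyRow, j::Int) = getfield(r, :tbl).columns[j][getfield(r, :row)]
Base.propertynames(r::LazyRow) = Tuple(getfield(r, :tbl).names)

function Base.getproperty(r::LazyRow, s::Symbol)
    tbl = getfield(r, :tbl)
    j = findfirst(==(s), tbl.names)
    j === nothing && throw(KeyError(s))
    return tbl.columns[j][getfield(r, :row)]
end

tbl = WideTable(Any[[925.0, 999.0], ["CANADA", "CANADA"]], [:ACTUAL, :COUNTRY])
r = LazyRow(tbl, 2)
```

Creating a LazyRow costs the same regardless of how many columns the table has; the per-column work happens only for the fields that are actually read.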

quinnj · 2020-01-04T18:46:40Z

src/ResultSet.jl

        end
        println(io)
    end
    n < size(rs, 1) && println(io, "⋮")
 end

-# IteratableTables


Note if you'd still like to keep explicit IterableTables compatibility, you can use some of the convenience functions provided by Tables. DataFrames, for example, defines:

IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) = Tables.datavaluerows(columntable(df))
IteratorInterfaceExtensions.isiterable(x::AbstractDataFrame) = true
TableTraits.isiterabletable(x::AbstractDataFrame) = true

You'd have to add IteratorInterfaceExtensions and TableTraits as explicit dependencies, but just replace AbstractDataFrame with ResultSet and it should work.

quinnj · 2020-01-04T18:48:40Z

test/runtests.jl

+        @test Tables.rowaccess(typeof(rs)) === true
+        @test Tables.columnaccess(typeof(rs)) === true
+        @test Tables.rows(rs) |> first |> propertynames |> Tuple == Tuple(names(rs))
+        @test Tables.columns(rs) |> propertynames |> Tuple == Tuple(names(rs))


I'd suggest also using the Tables.jl-provided rowtable and columntable functions to test things. Like:

@test Tables.rowtable(rs) ==
@test Tables.columntable(rs) ==
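
As a rough illustration of what such tests compare against, the sketch below runs rowtable and columntable on a small hand-built table. It assumes the Tables.jl package is installed; the named tuple of vectors stands in for a ResultSet, and the values are made up for the example.

```julia
using Tables

# A named tuple of vectors is itself a valid Tables.jl column table.
cols = (ACTUAL = [925.0, 999.0], COUNTRY = ["CANADA", "CANADA"])

# rowtable materializes a vector of named tuples, one per row.
rows = Tables.rowtable(cols)

# columntable goes the other way, back to a named tuple of vectors.
roundtrip = Tables.columntable(rows)
```

In a real test the right-hand sides would be the expected rows and columns for a known data file.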

tk3369 · 2020-01-04T19:35:38Z

> Sorry I'm a little slow in responding here; I was on holiday w/ limited internet access. This looks pretty good IMO! I added a few comments of things to think about, but overall it looks great to me. Feel free to ping me on the slack if you have any more questions or want to chat about something; I'm back to civilization now, so I'll be more responsive.

Hey. No worries. I know it's a funny time of the year to ping anyone although this is also the time that I can actually focus and do some real work 😛 Thanks very much for your valuable comments. I'll certainly go through them and make it better.

support for Tables.jl interface

bd32271

fixed named tuple type test & added propertynames/getproperty tests

eb1d5ba

tk3369 added 6 commits January 1, 2020 21:45

Added Tables-specific test cases

7191e98

No more pre-0.7 code

8a6a253

Perf test v1.0.0 (#62)

7e9ea75

* Include v1.0.0 perf test vs Pandas & ReadStat
* Updated README with perf test summary

Merge branch 'master' into tables-interface

ec51ca5

Tables.jl coverage (direct calls)

4075e2e

Updated examples for Tables.jl

2d5f657

tk3369 merged commit ebb35b2 into master Jan 2, 2020

tk3369 added a commit that referenced this pull request Jan 2, 2020

Minor release 1.1.0

9da40e6

- Tables.jl support while maintaining backward compatibility (PR #63)
- Updated performance benchmark vs. python/pandas and ReadStat

quinnj reviewed Jan 4, 2020

davidanthoff mentioned this pull request Jan 8, 2020

Fix queryverse integration #64

Merged
