Skip to content

Commit

Permalink
termnames (#299)
Browse files Browse the repository at this point in the history
* termnames

* use StatsAPI directly, deprecate old termnames

* deprecations

---------

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Dave Kleinschmidt <dave.f.kleinschmidt@gmail.com>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
  • Loading branch information
4 people authored Sep 6, 2023
1 parent 4a7d159 commit 1632bff
Show file tree
Hide file tree
Showing 14 changed files with 229 additions and 104 deletions.
4 changes: 3 additions & 1 deletion Project.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
name = "StatsModels"
uuid = "3eaba693-59b7-5ba5-a881-562e759f1c8d"
version = "0.7.2"
version = "0.7.3"

[deps]
DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"
Expand All @@ -10,6 +10,7 @@ Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
REPL = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"
ShiftedArrays = "1277b4bf-5013-50f5-be3d-901d8477a67a"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
StatsAPI = "82ae8749-77ed-4fe6-ae5f-f523153014b0"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
StatsFuns = "4c63d2b9-4356-54db-8cca-17b64c39e42c"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
Expand All @@ -21,6 +22,7 @@ DataAPI = "1.1"
DataFrames = "1"
DataStructures = "0.17, 0.18"
ShiftedArrays = "1, 2"
StatsAPI = "1"
StatsBase = "0.33.5, 0.34"
StatsFuns = "0.9, 1.0"
Tables = "0.2, 1"
Expand Down
1 change: 1 addition & 0 deletions docs/Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
GLM = "38e38edf-8417-5370-95a0-9cbb8c7f171a"
StableRNGs = "860ef19b-820b-49d6-a774-d7a799459cd3"
StatsAPI = "82ae8749-77ed-4fe6-ae5f-f523153014b0"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
StatsModels = "3eaba693-59b7-5ba5-a881-562e759f1c8d"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
Expand Down
1 change: 1 addition & 0 deletions docs/src/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ end
term
coefnames
modelcols
termnames
```

### Higher-order terms
Expand Down
10 changes: 5 additions & 5 deletions docs/src/internals.md
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ FormulaTerm{Term, Term}
```

!!! note

As always, you can introspect which method is called with

```julia
Expand Down Expand Up @@ -395,7 +395,7 @@ possible to use an existing function, the best practice is to define a new
function to make dispatch less ambiguous.

```jldoctest 1
using StatsBase
using StatsAPI
# syntax: best practice to define a _new_ function
poly(x, n) = x^n
Expand Down Expand Up @@ -444,7 +444,7 @@ StatsModels.termvars(p::PolyTerm) = StatsModels.termvars(p.term)
# number of columns in the matrix this term produces
StatsModels.width(p::PolyTerm) = p.deg
StatsBase.coefnames(p::PolyTerm) = coefnames(p.term) .* "^" .* string.(1:p.deg)
StatsAPI.coefnames(p::PolyTerm) = coefnames(p.term) .* "^" .* string.(1:p.deg)
# output
Expand Down Expand Up @@ -558,9 +558,9 @@ PolyTerm{Term, ConstantTerm{Int64}}
```

!!! note

The functions like `poly` should be exported by the package that provides
the special syntax for two reasons. First, it makes run-time term
the special syntax for two reasons. First, it makes run-time term
construction more convenient. Second, because of how the `@formula` macro
generates code, the function that represents special syntax must be
available in the namespace where `@formula` is _called_. This is because
Expand Down
6 changes: 5 additions & 1 deletion src/StatsModels.jl
Original file line number Diff line number Diff line change
@@ -1,13 +1,15 @@
module StatsModels

using Tables
using StatsAPI
using StatsBase
using ShiftedArrays
using ShiftedArrays: lag, lead
using DataStructures
using DataAPI
using DataAPI: levels
using Printf: @sprintf
using StatsAPI: coefnames, fit, predict, dof
using StatsFuns: chisqccdf

using SparseArrays
Expand All @@ -32,10 +34,11 @@ export
HelmertCoding,
SeqDiffCoding,
HypothesisCoding,

coefnames,
setcontrasts!,
formula,
termnames,

AbstractTerm,
ConstantTerm,
Expand Down Expand Up @@ -81,5 +84,6 @@ include("formula.jl")
include("modelframe.jl")
include("statsmodel.jl")
include("lrtest.jl")
include("deprecated.jl")

end # module StatsModels
50 changes: 26 additions & 24 deletions src/contrasts.jl
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ C(levels = ::Vector{Any}, base = ::Any) # specify levels and base
mean of the lower levels
* [`SeqDiffCoding`](@ref) - Code for differences between sequential levels of
the variable.
* [`HypothesisCoding`](@ref) - Manually specify contrasts via a hypothesis
* [`HypothesisCoding`](@ref) - Manually specify contrasts via a hypothesis
matrix, which gives the weighting for the average response for each level
* [`StatsModels.ContrastsCoding`](@ref) - Manually specify contrasts matrix,
which is directly copied into the model matrix.
Expand All @@ -79,15 +79,15 @@ The easiest way to specify custom contrasts is with `HypothesisCoding` or
contrast coding system, you can subtype `AbstractContrasts`. This requires a
constructor, a `contrasts_matrix` method for constructing the actual contrasts
matrix that maps from levels to `ModelMatrix` column values, and (optionally) a
`termnames` method:
`coefnames` method:
```julia
mutable struct MyCoding <: AbstractContrasts
...
end
contrasts_matrix(C::MyCoding, baseind, n) = ...
termnames(C::MyCoding, levels, baseind) = ...
coefnames(C::MyCoding, levels, baseind) = ...
```
# References
Expand All @@ -103,30 +103,32 @@ abstract type AbstractContrasts end
# Contrasts + Levels (usually from data) = ContrastsMatrix
struct ContrastsMatrix{C <: AbstractContrasts, M <: AbstractMatrix, T, U}
matrix::M
termnames::Vector{U}
coefnames::Vector{U}
levels::Vector{T}
contrasts::C
invindex::Dict{T,Int}
function ContrastsMatrix(matrix::M,
termnames::Vector{U},
coefnames::Vector{U},
levels::Vector{T},
contrasts::C) where {U, T, C <: AbstractContrasts, M <: AbstractMatrix}
allunique(levels) || throw(ArgumentError("levels must be all unique, got $(levels)"))
invindex = Dict{T,Int}(x=>i for (i,x) in enumerate(levels))
new{C,M,T,U}(matrix, termnames, levels, contrasts, invindex)
new{C,M,T,U}(matrix, coefnames, levels, contrasts, invindex)
end
end

# only check equality of matrix, termnames, and levels, and that the type is the
StatsAPI.coefnames(cm::ContrastsMatrix) = cm.coefnames

# only check equality of matrix, coefnames, and levels, and that the type is the
# same for the contrasts (values are irrelevant). This ensures that the two
# will behave identically in creating modelmatrix columns
Base.:(==)(a::ContrastsMatrix{C}, b::ContrastsMatrix{C}) where {C<:AbstractContrasts} =
a.matrix == b.matrix &&
a.termnames == b.termnames &&
a.coefnames == b.coefnames &&
a.levels == b.levels

Base.hash(a::ContrastsMatrix{C}, h::UInt) where {C} =
hash(C, hash(a.matrix, hash(a.termnames, hash(a.levels, h))))
hash(C, hash(a.matrix, hash(a.coefnames, hash(a.levels, h))))

"""
An instantiation of a contrast coding system for particular levels
Expand Down Expand Up @@ -166,7 +168,7 @@ function ContrastsMatrix(contrasts::C, levels::AbstractVector{T}) where {C<:Abst
# 3. contrast levels missing from data: would have empty columns, generate a
# rank-deficient model matrix.
c_levels = something(DataAPI.levels(contrasts), levels)

mismatched_levels = symdiff(c_levels, levels)
if !isempty(mismatched_levels)
throw(ArgumentError("contrasts levels not found in data or vice-versa: " *
Expand Down Expand Up @@ -198,7 +200,7 @@ function ContrastsMatrix(contrasts::C, levels::AbstractVector{T}) where {C<:Abst
"$c_levels."))
end

tnames = termnames(contrasts, c_levels, baseind)
tnames = coefnames(contrasts, c_levels, baseind)

mat = contrasts_matrix(contrasts, baseind, n)

Expand All @@ -224,7 +226,7 @@ function ContrastsMatrix(c::ContrastsMatrix, levels::AbstractVector)
return c
end

function termnames(C::AbstractContrasts, levels::AbstractVector, baseind::Integer)
function StatsAPI.coefnames(C::AbstractContrasts, levels::AbstractVector, baseind::Integer)
not_base = [1:(baseind-1); (baseind+1):length(levels)]
levels[not_base]
end
Expand All @@ -233,7 +235,7 @@ Base.getindex(contrasts::ContrastsMatrix, rowinds, colinds) =
getindex(contrasts.matrix, getindex.(Ref(contrasts.invindex), rowinds), colinds)

# Making a contrast type T only requires that there be a method for
# contrasts_matrix(T, baseind, n) and optionally termnames(T, levels, baseind)
# contrasts_matrix(T, baseind, n) and optionally coefnames(T, levels, baseind)
# The rest is boilerplate.
for contrastType in [:DummyCoding, :EffectsCoding, :HelmertCoding]
@eval begin
Expand All @@ -254,7 +256,7 @@ DataAPI.levels(c::AbstractContrasts) = nothing
FullDummyCoding()
Full-rank dummy coding generates one indicator (1 or 0) column for each level,
**including** the base level. This is sometimes known as
**including** the base level. This is sometimes known as
[one-hot encoding](https://en.wikipedia.org/wiki/One-hot).
Not exported but included here for the sake of completeness.
Expand Down Expand Up @@ -331,7 +333,7 @@ column is generated with 1 where `variable .== x` and -1 where `variable .== bas
of 0.
If `levels` are omitted or `nothing`, they are determined from the data
by calling the `levels` function when constructing `ContrastsMatrix`.
by calling the `levels` function when constructing `ContrastsMatrix`.
If `base` is omitted or `nothing`, the first level is used as the base.
When all levels are equally frequent, effects coding generates model matrix
Expand Down Expand Up @@ -373,7 +375,7 @@ Helmert coding codes each level as the difference from the average of the lower
levels.
If `levels` are omitted or `nothing`, they are determined from the data
by calling the `levels` function when constructing `Contrastsmatrix`.
by calling the `levels` function when constructing `Contrastsmatrix`.
If `base` is omitted or `nothing`, the first level is used as the base.
For each non-base level, Helmert coding generates a columns with -1 for each of
n levels below, n for that level, and 0 above.
Expand Down Expand Up @@ -462,7 +464,7 @@ function contrasts_matrix(C::SeqDiffCoding, _, n)
end

# TODO: consider customizing term names:
# termnames(C::SeqDiffCoding, levels::AbstractVector, baseind::Integer) =
# StatsAPI.coefnames(C::SeqDiffCoding, levels::AbstractVector, baseind::Integer) =
# ["$(levels[i])-$(levels[i-1])" for i in 2:length(levels)]

"""
Expand Down Expand Up @@ -591,7 +593,7 @@ function contrasts_matrix(C::HypothesisCoding, baseind, n)
C.contrasts
end

termnames(C::HypothesisCoding, levels::AbstractVector, baseind::Int) =
StatsAPI.coefnames(C::HypothesisCoding, levels::AbstractVector, baseind::Int) =
something(C.labels, levels[1:length(levels) .!= baseind])

DataAPI.levels(c::HypothesisCoding) = c.levels
Expand All @@ -602,8 +604,8 @@ DataAPI.levels(c::HypothesisCoding) = c.levels
Coding by manual specification of contrasts matrix. For k levels, the contrasts
must be a k by k-1 Matrix. The contrasts in this matrix will be copied directly
into the model matrix; if you want to specify your contrasts as hypotheses (i.e.,
weights assigned to each level's cell mean), you should use
into the model matrix; if you want to specify your contrasts as hypotheses (i.e.,
weights assigned to each level's cell mean), you should use
[`HypothesisCoding`](@ref) instead.
"""
mutable struct ContrastsCoding{T<:AbstractMatrix} <: AbstractContrasts
Expand Down Expand Up @@ -687,9 +689,9 @@ julia> StatsModels.hypothesis_matrix(cmat)
-1 0 0 1
```
For non-centered contrasts like `DummyCoding`, without including the intercept
the hypothesis matrix is incorrect. So while `intercept=true` is the default for
non-centered contrasts, you can see the (wrong) hypothesis matrix when ignoring
For non-centered contrasts like `DummyCoding`, without including the intercept
the hypothesis matrix is incorrect. So while `intercept=true` is the default for
non-centered contrasts, you can see the (wrong) hypothesis matrix when ignoring
it by forcing `intercept=false`:
```jldoctest hypmat
Expand All @@ -710,7 +712,7 @@ julia> StatsModels.hypothesis_matrix(cmat, tolerance=0) # ugly
1.0 -2.23753e-16 6.91749e-18 -1.31485e-16
-1.0 1.0 -2.42066e-16 9.93754e-17
-1.0 4.94472e-17 1.0 9.93754e-17
-1.0 1.04958e-16 -1.31044e-16 1.0
-1.0 1.04958e-16 -1.31044e-16 1.0
```
Finally, the hypothesis matrix for a constructed `ContrastsMatrix` (as stored by
Expand Down
13 changes: 13 additions & 0 deletions src/deprecated.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
@deprecate(termnames(C::AbstractContrasts, levels::AbstractVector, baseind::Integer),
coefnames(C::AbstractContrasts, levels::AbstractVector, baseind::Integer),
false)

function Base.getproperty(cm::ContrastsMatrix, x::Symbol)
if x === :termnames
Base.depwarn("The `termnames` field of `ConstrastsMatrix` is deprecated; use `coefnames(cm)` instead.",
:ContrastsMatrix)
return coefnames(cm)
else
return getfield(cm, x)
end
end
Loading

0 comments on commit 1632bff

Please sign in to comment.