
remove mean/var/min/max #223

Open
wants to merge 3 commits into base: master
Conversation

matthieugomez
Contributor

@matthieugomez commented Apr 13, 2021

solves #222

@ararslan
Member

Seems potentially quite breaking, as I imagine that some packages rely on being able to access the precomputed summary statistics for continuous terms.

@matthieugomez
Contributor Author

Right, it should go into the 0.7 release.
That being said, do you have an example of an external package relying on these summary stats?

@ararslan
Member

Nope. GLM or MixedModels might, but I don't know for certain, which is why I said "potentially." Perhaps nothing is using it and it would be safe to remove, but we should just be aware of the breakage potential.

@kleinschmidt
Member

I have some internal code that uses this. It might not be strictly necessary, but ideally I'd like to make sure there's some kind of mechanism for custom term types to request summary stats to be extracted from the data table at schema creation time. That's currently only possible via the hints Dict, which is fine for things where you've manually specified the special handling, but it wouldn't work for things like splines implemented as functions in the formula (although those don't work very well with the current system either, since ideally you need more than just mean/var/min/max for that).
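
For reference, a minimal sketch of the hints-Dict route as it stands today (the column names and data are made up for illustration); a hint manually specifies the special handling for a column at schema time:

using StatsModels

df = (y = rand(100), x = rand(100), g = repeat(["a", "b"], 50))
f = @formula(y ~ x + g)

# The hints Dict lets you manually request special handling per column,
# e.g. a contrasts hint for :g. There is currently no analogous hint that
# tells schema() to skip (or extend) the mean/var/min/max pass over :x.
sch = schema(f, df, Dict(:g => EffectsCoding()))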

It's good to know that the schema is a performance bottleneck though.

@kleinschmidt
Member

Another strategy, which is actually used in #183, is to replace ContinuousTerm in some cases with plain Terms, which in that PR simply pass through the underlying values unchanged. In the vast majority of cases this means that ContinuousTerm can just be eliminated.

@kleinschmidt added this to the 0.7 milestone Apr 19, 2021
@matthieugomez
Contributor Author

matthieugomez commented Feb 7, 2023

Would it be possible to consider this for the next release? Even if this is not the default, there should be some way to avoid the computation of these summary statistics (esp. min/max), as computing them takes as much time as doing a regression.

using StatsModels
N = 10_000_000
df = (y = rand(N), x1 = rand(N), x2 = rand(N))
f = @formula(y~x1+x2)
@time StatsModels.schema(f, df)
#   0.164134 seconds (52 allocations: 3.297 KiB)
@time [df.x1 df.x2] \ df.y
#   0.128481 seconds (36 allocations: 381.536 MiB, 6.13% gc time)

@kleinschmidt
Member

Would it be possible to consider this for the next release?

Unfortunately not; it would create breakage for users of at least one downstream package (StandardizedPredictors, I think) without an alternative mechanism for "requesting" that data invariants be computed (which is not going to happen on the timeline we've set out for 1.0). I'd really like 0.7/1.0 to be a "snapshot" that users can rely on without breakage relative to 0.6 as much as possible (package maintainers will have to deal with some breakage, but we know what we're in for, more or less 😂).

Even if this is not the default, there should be some way to avoid the computation of these summary statistics (esp. min/max), as computing them takes as much time as doing a regression.

One way to avoid the overhead, which #183 quietly added support for, is that modelcols(::Term, data) is now supported and just copies the referenced columns into the model matrix.
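
A minimal sketch of that route, assuming modelcols(::Term, data) works as described above (variable names mirror the earlier benchmark):

using StatsModels

data = (y = rand(10), x1 = rand(10))

t = term(:x1)            # a plain Term: no schema pass, so no mean/var/min/max
x = modelcols(t, data)   # per the above, this just passes the x1 column through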

Another thing that MIGHT be possible is to provide a ContinuousTerm with nonsense values (all zeros for instance) as a hint when computing the schema.
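
If that pans out, a sketch might look like the following (assuming ContinuousTerm's fields are sym, mean, var, min, max; the zeros are deliberately nonsense):

using StatsModels

df = (y = rand(1000), x1 = rand(1000), x2 = rand(1000))
f = @formula(y ~ x1 + x2)

# Pre-built terms as hints, so schema() never has to summarize the columns.
hints = Dict(s => StatsModels.ContinuousTerm(s, 0.0, 0.0, 0.0, 0.0)
             for s in (:y, :x1, :x2))
sch = schema(f, df, hints)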

@matthieugomez
Contributor Author

Sure, I understand.

@kleinschmidt
Member

Hm, actually, when I look at StandardizedPredictors, I think it's basically only the run-time constructors that use that information:

https://github.com/beacon-biosignals/StandardizedPredictors.jl/blob/81dbba3b150627ce4df0e0a41c1b8c0335e55612/src/zscoring.jl#L118-L121

The concrete_term methods are the ones that get called during schema creation, and they get the information they need directly from the data...
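
In that spirit, here's a rough sketch of how a package could compute what it needs via concrete_term dispatch on its own hint type, instead of reading ContinuousTerm's fields afterwards (MeanHint and MeanTerm are hypothetical):

using StatsModels, Statistics

struct MeanHint end                # hypothetical hint: "I only need the mean"

struct MeanTerm <: AbstractTerm    # hypothetical term storing just that
    sym::Symbol
    mean::Float64
end

# concrete_term is called once per column during schema creation, with the
# raw data vector in hand, so the needed statistic is computed right here.
StatsModels.concrete_term(t::Term, xs::AbstractVector{<:Number}, ::MeanHint) =
    MeanTerm(t.sym, mean(xs))

# Illustrative usage: schema(@formula(y ~ x), df, Dict(:x => MeanHint()))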

@palday modified the milestones: 0.7, 1.0 Apr 5, 2023
@matthieugomez
Copy link
Contributor Author

Bump. This slows down simple regressions, and it would be better to change it before other packages start relying on it.
