
remove mean/var/min/max #223

Open
wants to merge 3 commits into base: master
Conversation

matthieugomez
Contributor

@matthieugomez commented Apr 13, 2021

solves #222

@ararslan
Member

Seems potentially quite breaking, as I imagine that some packages rely on being able to access the precomputed summary statistics for continuous terms.

@matthieugomez
Contributor Author

Right, it should go into the 0.7 release.
That being said, do you have an example of an external package relying on these summary stats?

@ararslan
Member

Nope. GLM or MixedModels might, but I don't know for certain, which is why I said "potentially." Perhaps nothing is using it and it would be safe to remove, but we should just be aware of the breakage potential.

@kleinschmidt
Member

I have some internal code that uses this. It might not be strictly necessary, but ideally I'd like to make sure there's some kind of mechanism for custom term types to request summary stats to be extracted from the data table at schema creation time. That's currently only possible via the hints Dict, which is fine for things where you've manually specified the special handling, but it wouldn't work for things like splines implemented as functions in the formula (although those don't work very well with the current system either, since ideally you need more than just mean/var/min/max for that).
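
For reference, a minimal sketch of the hints-Dict route as it stands today (the column names and data are made up for illustration); a hint manually specifies the special handling for a column at schema time:

using StatsModels

df = (y = rand(100), x = rand(100), g = repeat(["a", "b"], 50))
f = @formula(y ~ x + g)

# The hints Dict lets you manually request special handling per column,
# e.g. a contrasts hint for :g. There is currently no analogous hint that
# tells schema() to skip (or extend) the mean/var/min/max pass over :x.
sch = schema(f, df, Dict(:g => EffectsCoding()))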

It's good to know that the schema is a performance bottleneck though.

@kleinschmidt
Member

Another strategy, which is actually used in #183, is to replace ContinuousTerm in some cases with plain Terms, which in that PR simply pass through the underlying values unchanged. In the vast majority of cases this means that ContinuousTerm can just be eliminated.

@kleinschmidt added this to the 0.7 milestone Apr 19, 2021
@matthieugomez
Contributor Author

matthieugomez commented Feb 7, 2023

Would it be possible to consider this for the next release? Even if this is not the default, there should be some way to avoid the computation of these summary statistics (esp. min/max), as computing them takes as much time as doing a regression.

using StatsModels
N = 10_000_000
df = (y = rand(N), x1 = rand(N), x2 = rand(N))
f = @formula(y~x1+x2)
@time StatsModels.schema(f, df)
#   0.164134 seconds (52 allocations: 3.297 KiB)
@time [df.x1 df.x2] \ df.y
#   0.128481 seconds (36 allocations: 381.536 MiB, 6.13% gc time)

@kleinschmidt
Member

Would it be possible to consider this for the next release?

Unfortunately not; it would create breakage for users of at least one downstream package (StandardizedPredictors, I think) without an alternative mechanism for "requesting" that data invariants be computed (which is not going to happen on the timeline we've set out for 1.0). I'd really like 0.7/1.0 to be a "snapshot" that users can rely on without breakage relative to 0.6 as much as possible (package maintainers will have to deal with some breakage, but we know what we're in for, more or less 😂).

Even if this is not the default, there should be some way to avoid the computation of these summary statistics (esp. min/max), as computing them takes as much time as doing a regression.

One way to avoid the overhead, which #183 quietly added support for, is that modelcols(::Term, data) is now supported and just copies the referenced columns into the model matrix.
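
A minimal sketch of that route, assuming modelcols(::Term, data) works as described above (variable names mirror the earlier benchmark):

using StatsModels

data = (y = rand(10), x1 = rand(10))

t = term(:x1)            # a plain Term: no schema pass, so no mean/var/min/max
x = modelcols(t, data)   # per the above, this just passes the x1 column through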

Another thing that MIGHT be possible is to provide a ContinuousTerm with nonsense values (all zeros for instance) as a hint when computing the schema.
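
If that pans out, a sketch might look like the following (assuming ContinuousTerm's fields are sym, mean, var, min, max; the zeros are deliberately nonsense):

using StatsModels

df = (y = rand(1000), x1 = rand(1000), x2 = rand(1000))
f = @formula(y ~ x1 + x2)

# Pre-built terms as hints, so schema() never has to summarize the columns.
hints = Dict(s => StatsModels.ContinuousTerm(s, 0.0, 0.0, 0.0, 0.0)
             for s in (:y, :x1, :x2))
sch = schema(f, df, hints)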

@matthieugomez
Contributor Author

Sure, I understand.

@kleinschmidt
Member

Hm, actually, when I look at StandardizedPredictors, I think it's basically only the run-time constructors that use that information:

https://github.com/beacon-biosignals/StandardizedPredictors.jl/blob/81dbba3b150627ce4df0e0a41c1b8c0335e55612/src/zscoring.jl#L118-L121

The concrete_term methods are the ones that get called during schema creation, and they get the information they need directly from the data...
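
In that spirit, here's a rough sketch of how a package could compute what it needs via concrete_term dispatch on its own hint type, instead of reading ContinuousTerm's fields afterwards (MeanHint and MeanTerm are hypothetical):

using StatsModels, Statistics

struct MeanHint end                # hypothetical hint: "I only need the mean"

struct MeanTerm <: AbstractTerm    # hypothetical term storing just that
    sym::Symbol
    mean::Float64
end

# concrete_term is called once per column during schema creation, with the
# raw data vector in hand, so the needed statistic is computed right here.
StatsModels.concrete_term(t::Term, xs::AbstractVector{<:Number}, ::MeanHint) =
    MeanTerm(t.sym, mean(xs))

# Illustrative usage: schema(@formula(y ~ x), df, Dict(:x => MeanHint()))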

@palday modified the milestones: 0.7, 1.0 Apr 5, 2023
@matthieugomez
Copy link
Contributor Author

Bump. This slows down simple regressions, and it would be better to change it before other packages start relying on it.
