Speeding up the Highly Adaptive Lasso with recursive screening via MARS #105

Larsvanderlaan · 2023-01-03T16:46:42Z

Implements a variant of the selectively adaptive lasso (SAL) that uses MARS (earth) to learn important variables and important variable interactions. While MARS is used by SAL to select important variables and interaction variable subgroups, MARS is not used by SAL for selecting specific spline basis functions. In particular, all basis functions, as specified by the params max_degree and num_knots, are generated for the variables and variable subgroups found by MARS. So SAL tends to be much more expressive than MARS.

The earth model is pruned with earth's own internal cross-validation method rather than the default GCV approach.
Also, the earth parameters that control the number of basis functions generated and searched are set to high or maximal values.

Nested cross-validation is implemented to take into account the outcome-dependent variable selection. SAL is able to handle both large n and large p very well and can lead to both substantial speedups and better performance than HAL when there are many noisy variables.
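As a rough illustration of why the screening step has to be repeated inside each cross-validation fold, here is a minimal, language-agnostic sketch written in Python. It is not the hal9001 implementation: `top_correlation_screener` is a hypothetical stand-in for the MARS screener, `fit_univariate_ols` is a hypothetical stand-in for the HAL fit, and the toy data are made up. The only point is that the screener sees the training rows of each fold and never the held-out rows.

```python
from math import sqrt
from typing import Callable, List, Sequence

Matrix = List[List[float]]
Vector = List[float]


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def top_correlation_screener(X: Matrix, y: Vector, keep: int = 1) -> List[int]:
    """Hypothetical stand-in for the MARS screener: rank columns by |correlation| with y."""
    y_bar = mean(y)
    var_y = sum((v - y_bar) ** 2 for v in y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        x_bar = mean(col)
        var_x = sum((v - x_bar) ** 2 for v in col)
        cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(col, y))
        denom = sqrt(var_x * var_y)
        scores.append(abs(cov) / denom if denom > 0 else 0.0)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


def fit_univariate_ols(X: Matrix, y: Vector) -> Callable[[Sequence[float]], float]:
    """Hypothetical stand-in for the HAL fit: least squares on the first kept column."""
    col = [row[0] for row in X]
    x_bar, y_bar = mean(col), mean(y)
    var_x = sum((v - x_bar) ** 2 for v in col)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(col, y))
    slope = cov / var_x if var_x > 0 else 0.0
    intercept = y_bar - slope * x_bar
    return lambda row: intercept + slope * row[0]


def cv_with_screening(X, y, screener, fitter, n_folds: int = 5) -> List[float]:
    """Return one held-out mean squared error per fold, re-screening on each training split."""
    n = len(X)
    fold_errors = []
    for k in range(n_folds):
        train = [i for i in range(n) if i % n_folds != k]
        valid = [i for i in range(n) if i % n_folds == k]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        # Screening uses training rows only, so the selection cannot peek at held-out outcomes.
        keep = screener(X_train, y_train)
        predict = fitter([[row[j] for j in keep] for row in X_train], y_train)
        squared = [(predict([X[i][j] for j in keep]) - y[i]) ** 2 for i in valid]
        fold_errors.append(mean(squared))
    return fold_errors


# Toy data (made up): column 0 drives the outcome, column 1 is unrelated.
X = [[float(i), float((7 * i) % 5)] for i in range(20)]
y = [2.0 * i for i in range(20)]
fold_mse = cv_with_screening(X, y, top_correlation_screener, fit_univariate_ols)
```

If the screener were instead run once on the full data before splitting, the held-out outcomes would already have influenced which variables were kept, and the cross-validated error would tend to be optimistic.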

-- I changed the default of the num_knots argument so that it varies as a function of sample size.
-- I made a minor change to how basis functions are generated with the num_knots argument. Before, an edge basis function was being generated that could sometimes lead to instability in the CV.
-- The formula bug fixes of #101 are also incorporated here.

Things still to do:
-- Write some tests.
-- Add a bit more documentation.
-- Make minor changes to ensure fit_hal and fit_sal have near-identical functionality.

Larsvanderlaan · 2023-01-03T16:53:43Z

I'm getting the error "Error: Unable to resolve action r-lib/actions@master, unable to find version `master`" in the code check. Does anyone have any idea what this is?

nhejazi · 2023-01-03T19:05:09Z

Looks interesting, Lars --- I'd be curious to learn more about this. Is your intention to create a separate fit_sal() function, paralleling fit_hal(), or to have this alternative algorithm available via the same constructor?

For the question about the failing builds, this is because the GitHub Actions file is out of date, as r-lib/actions@master doesn't point to a valid branch anymore. Please change the relevant line in .github/workflows/R-CMD-check.yml to reference a valid release tag; see https://github.com/r-lib/actions#releases-and-tags
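A hedged sketch of what such a change might look like. The actual workflow file is not shown in this thread, so the step layout, the `setup-r` sub-action path, and the choice of the `v2` tag are assumptions for illustration; the linked README lists the tags that are currently valid.

```yaml
# .github/workflows/R-CMD-check.yml (illustrative excerpt, not the real file)
steps:
  - uses: actions/checkout@v3
  # before (assumed): uses: r-lib/actions/setup-r@master   # `master` no longer resolves
  - uses: r-lib/actions/setup-r@v2                          # pin to a published release tag
```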

Larsvanderlaan · 2023-01-03T19:17:08Z

Hi @nhejazi, thank you for the help! The current implementation has separate functions, mainly because fit_sal() calls fit_hal() internally. I am fine with having a single constructor function, and I think this would be easier for users as well. Though, it might be cleanest to have the fitting parts of fit_hal() and fit_sal() be separate internal functions. What do you think?

If the functions are merged, we could add options screen_variables = TRUE and screen_interactions = TRUE.
We could also make a screening function an argument to fit_hal so that users can specify arbitrary screening algorithms and the internal CV handles it appropriately. glmnet has a similar feature with the exclude argument. However, since we want to exclude variables and not specific basis functions we can't use the glmnet implementation.
The fit_sal() implementation can be adjusted to work with any screening function that outputs either a hal_formula or a vector of column indices.

Larsvanderlaan · 2023-01-04T03:38:53Z

I have incorporated fit_sal into fit_hal through the arguments: screen_variables, screen_interactions, and screener_max_degree. At the moment, all the original tests pass with screen_variables = TRUE and screen_interactions = TRUE as default.

nhejazi · 2023-01-05T22:18:18Z

thanks, @Larsvanderlaan, for these changes. thinking through it some more, i think it may be a good idea to keep the constructors for the two separate, as in retaining both fit_sal() and fit_hal() as user-facing functions. this way, there could be a distinct vignette for SAL, which could reference the pre-print on that algorithm. the advantage here is that there would be greater clarity for the user as to where to look for guidance on the given choice of algorithm (especially since SAL's construction differs somewhat from the more familiar LASSO construction of HAL). beyond that, i think it's important to avoid "argument creep" in the constructor, which we worked against in one of the major version changes. does this sound like a workable path forward (and sorry for the extra work in separating the two functions)?

also, a few comments on the construction of the PR:

in future, please try to use at least somewhat descriptive commit messages, instead of the repetitive messages seen when glancing above. this is important for keeping commits modular (by thinking of the message before you submit the change), so that bugs can be identified and so that reviews are possible to conduct. in this instance, we'll squash the commits at the merge phase, so that they disappear in the history, but that's generally a practice we should avoid.
as you get to the stage of wrapping up the PR, please bump the version of the package in DESCRIPTION and add information about the various changes (which you already very helpfully summarize in your first comment) in the NEWS file, taking care to follow the style already used in that document.

Larsvanderlaan · 2023-01-05T23:40:11Z

Hi @nhejazi, I should note that the SAL implementation here is totally different from the algorithm in the preprint. Essentially, I use MARS to learn a formula_hal object of the form ~ h(X1) + h(X2, X1) + h(X3, X2, X1), and then the standard HAL implementation provided by fit_hal is used with this formula. The basis functions are still generated using the original version of fit_hal and the arguments max_degree and num_knots. So MARS is really just used as a variable screener. I called it SAL because it has the same goal as the preprint (selectively choosing basis functions/variables) but accomplishes its goal without fundamentally changing HAL. For clarity, I will go ahead and remove the name SAL. Also, MARS uses the same basis functions as first-order HAL but fits them in a greedy manner. So even as a screening algorithm, MARS stays close to what HAL does.
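To make the shape of that formula concrete, here is a small sketch in Python of how screened variable groups could be turned into a formula string of the form quoted above. The helper `formula_from_groups` and the example groups are hypothetical and only mirror the notation in this comment; they say nothing about how hal9001 builds or parses its formula objects internally.

```python
from typing import Iterable, Sequence


def formula_from_groups(groups: Iterable[Sequence[str]]) -> str:
    """Build one h(...) term per screened variable group and join the terms with '+'."""
    terms = ["h(" + ", ".join(group) + ")" for group in groups]
    return "~ " + " + ".join(terms)


# Hypothetical screener output: one main effect and two interaction groups.
screened_groups = [("X1",), ("X2", "X1"), ("X3", "X2", "X1")]
formula = formula_from_groups(screened_groups)
```

Each group contributes a single term, so variables that the screener never selects simply do not appear in the formula.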

I now personally think it should be a part of the fit_hal method due to the substantial speedup and performance gains I am seeing in simulations (including the datasets from ACIC2019). Regarding argument creep, that is a good point, and to address this I can add a single "screener_control" list argument that contains all the relevant arguments.

Also, thank you very much for the other comments. I agree that more informative commit names are a better route to go.

nhejazi · 2023-01-06T00:14:03Z

That's very helpful clarification, @Larsvanderlaan. I had taken the mention of SAL to reference the pre-print, and I see from your comment that it's similar, but the fact that it is a screening-based variant of the already well-studied HAL (and uses the same underlying functionality implemented in the package) makes a convincing case for it being included within the existing fit_hal() constructor. So, I think we agree on the design choice you've made, and I think removing the SAL name is helpful since it would be confusing to reference an algorithm other than what's implemented (maybe you could call it "filtered HAL" or "MARS HAL" or similar).

Also, I think following the style set in the constructor should be good enough to avoid argument creep, since by adding these arguments as a list-argument in screener_control, you'd only be adding a single argument, the components of which would be documented in the internal function to which they get passed. And that's very exciting about the speedup! Personally, I'll be curious to try this out in the settings where I've run into significant performance issues (e.g., hazard-based density estimation).

rachaelvp · 2023-10-12T20:08:32Z

Hi @Larsvanderlaan I just merged your other PR (fix to quantile binning) into devel, leading to a merge conflict with this PR. Could you please address this so we can also merge this PR into devel?

Also, @nhejazi do you agree this PR is ready for merging once the conflict is addressed?

Larsvanderlaan and others added 5 commits January 2, 2023 14:02

hey

c079dcd

screener

bf8f112

screener

e32e21d

screener

cc46c39

Merge branch 'devel' into screeningHAL

0b16f14

Update hal.R

f5d5350

Larsvanderlaan requested review from nhejazi and jeremyrcoyle January 3, 2023 16:57

Larsvanderlaan added 9 commits January 3, 2023 09:35

screener

2327db1

screener

d6336b3

screener

9601b22

screener

defd0a9

screener

0f9d9d4

screener

0e67037

fix

7a39cfb

fix

3a7078a

fix

3285c75

Larsvanderlaan and others added 7 commits January 3, 2023 11:33

Update R-CMD-check.yml

4563fb8

screening

722e3e8

Merge branch 'screeningHAL' of https://github.com/tlverse/hal9001 into screeningHAL

ce887fb

screening

c96f56a

testspassed

3813b38

testspassed

154613a

testspassed

52c0c33

Larsvanderlaan added 2 commits January 3, 2023 19:58

testspassed

354268e

testspassed

ba95980

Larsvanderlaan and others added 11 commits January 4, 2023 08:45

testspassed

d85f1b2

testspassed

0717fc0

warningfix

79aa099

screenfamily

f4e360c

screenfamily

345dde3

screenfamily

06478cb

screenfamily

882640f

screenfamily

ab63e65

screenfamily

e206e1a

Merge branch 'devel' into screeningHAL

178462d

screenfamily

ba67e48

Larsvanderlaan added 13 commits January 5, 2023 16:27

Added screener_control argument as well as renamed earth-based screening methods

a3c2ea9

file renaming and adding screening.R

c067308

Merge branch 'screeningHAL' of https://github.com/tlverse/hal9001 into screeningHAL

b5133ad

renaming screening parameters and making them a direct argument to fit_hal_with_screening

b538336

rename fit_hal_with_screening to fit_earth_hal

9bfba0d

fixed vignette error

29fab13

remove some num_knots documentation that isn't applicable as much anymore

38e0754

minor formating of docs

e6e34e2

fixes to docs

267aa49

fix defaults screener bug

d9a3123

minor changes

cfe4f32

allow for custom penalization in mars

85b87fa

10x speedup because earth does cv for no reason

afc4d66

Larsvanderlaan changed the title ~~Selectively adaptive lasso with MARS~~ Speeding up the Highly Adaptive Lasso with recursive screening via MARS Mar 22, 2023

Larsvanderlaan force-pushed the screeningHAL branch from 58c6fff to afc4d66 Compare August 11, 2023 14:13
