k-means clustering based knot-point screening #113

Open · wants to merge 3 commits into base: devel
Conversation

@tq21 commented Apr 19, 2024

Implementation of a k-means clustering based knot-point screening algorithm. At each basis function level, we can run k-means clustering to obtain k knots. The set of knot points obtained this way may be better than the previous quantile-based discretization. The R implementation of k-means is very fast, so the overhead of running it is minimal, while the reduced number of knots cuts memory use significantly. This method, coupled with variable-level screening (MARS-based), would make for an ideal scalable version of HAL. The performance of k-means screening relative to quantile-based screening is demonstrated in the simulations below:
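The per-basis screening step can be sketched as follows. This is a hypothetical Python/NumPy illustration of the idea only; the actual implementation is in R, and `kmeans_knots` and its arguments are assumed names, not the PR's API:

```python
import numpy as np

def kmeans_knots(X, basis_cols, k, n_iter=50, seed=0):
    """Screen knot points for one basis function level (illustrative sketch).

    X          : (n, d) data matrix
    basis_cols : indices of the variables in this basis, e.g. [0, 2]
    k          : number of knot points to keep
    Returns k cluster centers, used as knots in place of the much larger
    set of knots from quantile-based discretization.
    """
    rng = np.random.default_rng(seed)
    pts = X[:, basis_cols]
    # Initialize centers as k distinct observed points.
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest center.
        d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        new = np.array([pts[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

rng = np.random.default_rng(1)
# Skewed marginals: most mass near 0, long right tail.
X = rng.exponential(scale=1.0, size=(500, 3))

knots = kmeans_knots(X, basis_cols=[0, 1], k=10)
print(knots.shape)  # (10, 2)
```

Because the centers live in the joint space of the basis's variables, the screened knots track where the data actually sit rather than where each marginal is dense.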

  1. When the marginal distributions of the covariates are skewed (bottom three plots), full HAL (using all bases) overfit the data in the dense region (potentially because of the large number of knots created from observations in that region; the little blue ticks at the bottom are all the knots created), while the quantile-based HAL fit failed to capture variation in the sparse data regions. The k-means screened algorithm produces a more balanced fit:
  *(Screenshot: 2024-04-19 at 3:05:51 PM)*
  2. K-means has better prediction accuracy on test sets as the number of covariates increases and the covariate distributions are skewed. This simulation is purposely designed to be biased against k-means: the total number of knots created by the quantile-based method cannot be predetermined, so it is hard to ensure that the number of knots from k-means matches it exactly. Even with this disadvantage, i.e., a thinner HAL design matrix, k-means still beats the quantile-based method, so it is more memory efficient. This may be because a quantile is a univariate measure, whereas we run k-means at each basis function level, so for a basis of three variables it selects k points in the three-dimensional space:
  *(Screenshot: 2024-04-19 at 3:06:45 PM)*
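The univariate-versus-multivariate contrast in point 2 can be made concrete with a rough sketch (hypothetical Python illustration, not the PR's code; the counts are only indicative): per-variable quantile cut points for a three-way interaction basis span up to k³ combinations, while joint k-means keeps only k points in the joint space.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(size=(1000, 3))  # skewed 3-D covariates
k = 8

# Univariate quantile screening: k cut points per variable; a 3-way
# interaction basis then ranges over all k**3 combinations of cut points.
q = np.quantile(X, (np.arange(k) + 0.5) / k, axis=0)   # shape (k, 3)
n_quantile_knots = k ** X.shape[1]

# Joint screening keeps only k points in the 3-D space (here a crude
# stand-in: one Lloyd pass of k-means seeded from random observations).
centers = X[rng.choice(len(X), k, replace=False)]
labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(1)
centers = np.array([X[labels == j].mean(0) for j in range(k)])
n_kmeans_knots = k

print(n_quantile_knots, n_kmeans_knots)  # prints: 512 8
```

The gap between the two counts is what drives the memory savings: the k-means design matrix has far fewer columns for the same (or better, per the simulation) predictive accuracy.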
