Random Neighbors Clustering
performs "random forest" style clustering on high-dimensional data. Provides sampling strategies
to bootstrap input rows and columns. Method iteratively fits clustering models to the
samples and optimizes for the features that provide the best distinct cluster separation.
The Random Neighbors algorithm is straightforward:
- Bootstrap rows and columns according to the chosen sampling strategy for a predefined number of iterations
- For each iteration, fit a clustering model to the sampled data and extract performance metrics
- Output the features that provide the most distinct clusters across all iterations
Just as a random forest iteratively builds decision trees on bootstrapped data, Random Neighbors iteratively builds clustering models on bootstrapped data. Its iterations can be used to extract the single set of randomly selected columns that provides the largest cluster separation, or features can be averaged across iterations in a pseudo-feature-importance metric. Dropout can optionally be applied to the history of iterations to avoid bias in the sampled fitting procedure.
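To make the loop concrete, here is an illustrative sketch rather than the library's implementation: it assumes numpy and scikit-learn, fixes DBSCAN as the model, samples without replacement, and hard-codes log2 column and sqrt row sampling to mirror the implementation details below.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def random_neighbors_sketch(x, n_iter=50, seed=0):
    # sample columns/rows, fit a clustering model, score the result,
    # and keep the column subset with the best silhouette score
    rng = np.random.default_rng(seed)
    n_rows, n_cols = x.shape
    best_score, best_cols = 0.0, None
    for _ in range(n_iter):
        cols = rng.choice(n_cols, size=max(2, int(np.log2(n_cols))), replace=False)
        rows = rng.choice(n_rows, size=int(np.sqrt(n_rows)), replace=False)
        sample = x[np.ix_(rows, cols)]
        labels = DBSCAN().fit_predict(sample)
        # require at least two real clusters (DBSCAN marks noise as -1)
        if len(set(labels) - {-1}) < 2:
            continue
        score = silhouette_score(sample, labels)
        # keep only positive scores, tracking the best column subset
        if score > best_score:
            best_score, best_cols = score, cols
    return best_score, best_cols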
Main benefits include:
- Faster fitting on high-dimensional data through bootstrapping
- Iterations search for a low-dimensional representation of the data that produces distinct clusters
- Random sampling guards against correlation in input features, which can skew clustering results
Product logs contain a wealth of information on every action taken within a business application. Naturally, businesses would like to segment users based on this activity in order to drive product strategy, including pricing, new features, churn reduction, and in-app upgrades.
Product usage data can be extremely high-dimensional for a feature-rich application. The size of the input data grows with the total number of actions possible in the app and the total number of users. The size and dimensionality of this data make fitting clustering methods on the entire dataset challenging: fitting on all rows is time-consuming, and a large number of possibly correlated columns can distort the results.
Product data like this drove the development of RandomNeighbors. Fitting this method produced a 5-feature, 5-cluster representation of the entire dataset, backed by a strong silhouette score. The resulting representation enabled real-time product clustering that was not possible when applying a model to the entire feature set.
Details of the implementation:
- DBSCAN kernel, the "decision tree" of RandomNeighbors
- 50 total model iterations in our neighborhood
- Column sampling of at most log2(features) for each iteration
- Row sampling of sqrt(rows) for each iteration
- Extract the silhouette score from each iteration, keeping only iterations with a positive score and at least two clusters
- Keep a history of the columns that provided the best score
- Re-fit the model on the full dataset using the best features selected by the method
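Translated into the constructor arguments used in the example below, and assuming they behave as described there, that deployment would look roughly like this (a sketch, not the exact production configuration):

from rnc.random_neighbors import RandomNeighbors

# sketch of the deployment above: log2 column sampling, sqrt row sampling,
# 50 iterations; argument names are taken from the example below
rnc_deploy = RandomNeighbors(
    select_columns='log2',
    select_rows='sqrt',
    sample_iter=50,
)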
from rnc.random_neighbors import RandomNeighbors
from sklearn.cluster import DBSCAN
import numpy as np
# high dimensional data
data = np.random.rand(100000, 10000)
# use vanilla DBSCAN as random neighbors kernel
# any sklearn.cluster estimator whose labels can be scored with sklearn.metrics silhouette_score is supported
cluster = DBSCAN()
# initialize with bootstrap strategy
rnc = RandomNeighbors(
select_columns='sqrt',
select_rows='log2',
sample_iter=200,
normalize_data=True,
scale_data=True
)
# fit method to our data
best_metric, best_iter, fit_history = rnc.fit_random_neighbors(x=data, cluster=cluster)
# see results
print(best_metric, best_iter)
# history shows scores and bootstrap indices by iteration
scores = [(k, v['score']) for (k, v) in fit_history.items()]
print(scores)
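Continuing from the example, the final re-fit step from the implementation details could be sketched as follows; the 'columns' key name is an assumption (only 'score' is confirmed above), so inspect fit_history to verify:

# hypothetical follow-up: re-fit on the full dataset using the winning columns
# NOTE: the 'columns' key name is assumed; check fit_history for the real keys
best_cols = fit_history[best_iter]['columns']
final_labels = DBSCAN().fit_predict(data[:, best_cols])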
Available parameters for RandomNeighbors():
- use_custom_axis_samples: supply your own index samples for columns and rows
- select_columns: sample policy for columns (sqrt, log2, percentile, random)
- select_rows: sample policy for rows (sqrt, log2, percentile, random)
- sample_iter: number of model fit iterations
- custom_feature_sample_list: list of custom sample indices
- random_axis_max_pct: threshold for the percentile sampling strategy
- normalize_data: normalize the input data
- scale_data: min-max scale the input data
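For illustration, a percentile-based setup might look like the following; reading random_axis_max_pct as a cap on the fraction of each axis sampled per iteration is an assumption:

# hypothetical percentile configuration; random_axis_max_pct is assumed to cap
# the fraction of rows/columns drawn in each iteration
rnc_pct = RandomNeighbors(
    select_columns='percentile',
    select_rows='percentile',
    random_axis_max_pct=0.10,
    sample_iter=100,
)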
Method Returns:
- best_metric: best overall silhouette score across all iterations
- best_iter: iteration that provided the best overall score
- history_dict: dictionary of rows, columns, score, labels, and fit object by model iteration
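The history is also the raw material for the pseudo-feature-importance idea mentioned earlier: credit each column with the scores of the iterations it appeared in. A sketch, assuming each history entry stores its sampled column indices under a 'columns' key (only 'score' is confirmed by the example above):

from collections import defaultdict

# hypothetical pseudo-feature importance: sum iteration scores per sampled column
importance = defaultdict(float)
for entry in fit_history.values():
    for col in entry['columns']:
        importance[col] += entry['score']
# columns that most often appeared in high-scoring iterations
top_features = sorted(importance, key=importance.get, reverse=True)[:5]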
Future work:
- Pseudo-Feature Importance Metrics
- Expanded Testing
- History dropout
- Data Examples