Wrapper to convert arbitrary clusterer into a classifying one #768

Open
ablaom opened this issue Nov 18, 2021 · 0 comments
ablaom commented Nov 18, 2021

Opening this issue after a nice suggestion from @davnn.

Some clusterers (e.g., scikit-learn's DBSCAN) only deliver labels for the training data and cannot immediately label new, unseen data. In that case one can use any ordinary classifier (e.g., KNNClassifier from NearestNeighborModels.jl) to generate labels for new data.

If the classifier is a probabilistic predictor, we can even get "fuzzy" labels (as GMMClusterer from BetaML already provides), which could be useful even for clusterers that already generalise to new data.
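For concreteness (this snippet is illustrative only, and the cluster names are made up): in MLJ a probabilistic prediction is a `UnivariateFinite` distribution, so the "fuzzy" labels are just per-cluster probabilities, with `mode` recovering a hard label:

```julia
using MLJBase  # exports UnivariateFinite at time of writing

# A hypothetical fuzzy label over two clusters, as a probabilistic
# classifier would return for a single observation:
d = UnivariateFinite(["cluster_1", "cluster_2"], [0.2, 0.8], pool=missing)

pdf(d, "cluster_2")  # probability the point belongs to "cluster_2"
mode(d)              # the most probable cluster, i.e. a hard label
```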

Any design depends on firming up the API for clusterers: JuliaAI/MLJ.jl#852

One possible implementation (requiring MLJBase as a dependency) is to use a learning network (wrapped in a fit definition) to define the new model (see, e.g., TransformedTargetModel). One advantage would be that changes to the classifier hyper-parameters would not trigger re-training of the base clusterer. (A "hard-wired" implementation could arrange that too, but it would duplicate logic we already have, require extra testing, etc.)
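To make the "learning network wrapped in a fit definition" idea concrete, here is a rough sketch modelled on the TransformedTargetModel pattern. The struct and type names are placeholders, and the export API shown (`ProbabilisticComposite`, surrogate machine, `return!`) is the one current in MLJBase at the time of writing, so take this as a sketch rather than a final design:

```julia
using MLJBase

# Placeholder name. Wraps any clusterer that exposes training labels
# in its fitted_params, together with a probabilistic classifier:
mutable struct ClustererAsClassifier <: ProbabilisticComposite
    clusterer
    classifier
end

function MLJBase.fit(model::ClustererAsClassifier, verbosity, X)
    Xs = source(X)

    # the clusterer stores the training labels in its fitted_params:
    mach1 = machine(model.clusterer, Xs)
    Θ = node(fitted_params, mach1)
    y = node(θ -> θ.labels, Θ)

    # the classifier trains on the clusterer-generated labels:
    mach2 = machine(model.classifier, Xs, y)
    ŷ = predict(mach2, Xs)

    # export the network as a stand-alone probabilistic model:
    mach = machine(Probabilistic(), Xs; predict=ŷ)
    return!(mach, model, verbosity)
end
```

With this arrangement, mutating only `model.classifier` hyper-parameters and refitting should retrain `mach2` but leave `mach1` (the clusterer) untouched, which is the advantage mentioned above.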

See below for a proof-of-concept.

Thoughts anyone?

@juliohm @jbrea @OkonSamuel @alyst

using MLJBase
using MLJModels

pure_clusterer = (@load DBSCAN pkg=ScikitLearn)()
classifier = (@load KNNClassifier)()

Xraw, yraw  = make_blobs(1000, rng=123)
X, Xtest = partition(Xraw, 0.5)
_, ytest = partition(yraw, 0.5)

# the learning network (with training data at the source node):

Xs = source(X)

# this clusterer stores the training labels in its fitted_params:
mach1 = machine(pure_clusterer, Xs)
Θ = node(fitted_params, mach1)
y = node(θ -> θ.labels, Θ) # the training labels

# classifier will train using the training_labels `y`:
mach2 = machine(classifier, Xs, y)
ŷ = predict(mach2, Xs) # returns probability distributions

# train the network:
fit!(ŷ)

# getting "probabilistic" labels for new data:
ŷ(Xtest);

# getting labels for new data:
y = mode.(ŷ(Xtest));

# good agreement up to relabelling:
julia> zip(ytest, y) |> collect
 (1, 3)
 (2, 2)
 (2, 2)
 (1, 3)
 (3, 1)
 (1, 3)
 (2, 2)
 (1, 3)
 (1, 3)
 (2, 2)
 (2, 2)
 (3, 1)
 (1, 3)
 (3, 1)
 (3, -1)
...
ablaom transferred this issue from JuliaAI/MLJClusteringInterface.jl on May 16, 2022