
Discussion of the API for Clustering models #852

Open · ablaom opened this issue Oct 11, 2021 · 2 comments
Labels: api design discussion

Comments

ablaom (Member) commented Oct 11, 2021

Clustering models in MLJ are implemented as Unsupervised
models. While they share common functionality, this has not been
properly documented, and there is no model subtype or trait that tags
a model as implementing that common interface.

I am opening this issue to summarize the existing interface for
purposes of discussion of possible enhancements/modifications, and how
we might go about formalizing the interface.

Existing interface

Existing "clustering" models in the MLJ registry share the following interface

  • fit: fit(model, verbosity, X) sees the training data X and learns
    parameters, output as fitresult, required to label new data. As
    for general models, training-related outcomes that are not part of
    fitresult, but which the user may want to access, are returned in the
    report (a named tuple with informative keys).

  • predict: predict(model, fitresult, Xnew), if implemented, returns either
    (i) the clustering labels (assignments) for new data Xnew, as a
    CategoricalVector (unordered) (scitype
    AbstractVector{<:Multiclass}); or (ii) probabilistic predictions
    for the clustering labels (a vector of UnivariateFinite
    distributions). Important: The categorical vector (or UnivariateFinite
    vector) includes all cluster labels in its pool, not just the
    predicted ones. So, for example, in the deterministic case,
    levels(predict(model, fitresult, Xnew)) is the same for all
    Xnew, a vector with one element per cluster.

  • transform: transform(model, fitresult, Xnew), if implemented, performs
    dimension reduction, returning a table with Continuous columns,
    one for each cluster.

  • models that do not generalize to new data (e.g., ScikitLearn's
    DBSCAN, AgglomerativeClustering) implement neither predict nor
    transform (because all MLJ operations are understood to generalize
    to new data). The cluster labels (we could call them training
    labels to distinguish them from new predictions in other models)
    appear in the fitresult, which the user accesses using
    fitted_params to get a user-friendly version.

  • The trait input_scitype(::Type{MyClusterer}) returns the required
    scitype of X (always Table(Continuous)).

  • The trait output_scitype(::Type{MyClusterer}) returns the scitype
    of the output of transform (also Table(Continuous)).
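
For concreteness, here is a minimal sketch of what an implementation of this interface might look like. The model name NaiveKMeans, its placeholder "training" logic, and the report key are made up for illustration; the MMI signatures follow the published MLJ model API, and running the snippet assumes MLJBase (or MLJ) is loaded so that MMI.categorical and MMI.table are active.

```julia
# Illustrative skeleton only -- not a registered MLJ model.
import MLJModelInterface as MMI
using LinearAlgebra: norm

mutable struct NaiveKMeans <: MMI.Unsupervised
    k::Int
end
NaiveKMeans(; k=3) = NaiveKMeans(k)

function MMI.fit(model::NaiveKMeans, verbosity, X)
    Xmat = MMI.matrix(X)                   # any Table(Continuous) table -> n x p matrix
    centroids = Xmat[1:model.k, :]         # placeholder "training": first k rows as centroids
    labels = MMI.categorical(1:model.k)    # pool contains *all* cluster labels
    fitresult = (centroids, labels)
    report = (nrows_seen = size(Xmat, 1),) # training outcomes not needed for prediction
    return fitresult, nothing, report
end

# deterministic predict: nearest-centroid assignments as a CategoricalVector;
# a probabilistic clusterer would instead return MMI.UnivariateFinite(labels, probs)
function MMI.predict(::NaiveKMeans, fitresult, Xnew)
    centroids, labels = fitresult
    Xmat = MMI.matrix(Xnew)
    assignments = map(eachrow(Xmat)) do x
        argmin([norm(x - c) for c in eachrow(centroids)])
    end
    return labels[assignments]             # levels(...) has one entry per cluster, always
end

# transform: dimension reduction to one Continuous column per cluster (distances)
function MMI.transform(::NaiveKMeans, fitresult, Xnew)
    centroids, _ = fitresult
    Xmat = MMI.matrix(Xnew)
    dists = [norm(x - c) for x in eachrow(Xmat), c in eachrow(centroids)]
    return MMI.table(dists)
end

MMI.input_scitype(::Type{<:NaiveKMeans}) = MMI.Table(MMI.Continuous)
MMI.output_scitype(::Type{<:NaiveKMeans}) = MMI.Table(MMI.Continuous)
```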

I notice that the ScikitLearn clusterers just bundle all training outcomes into the fitresult
(and nothing in the report), which does not strictly comply with the published API, e.g., here. Also, the same API would imply that "non-generalizing" models should place all training outcomes in the report, instead of the fitresult, but they do the opposite.

I also notice that when training labels are added to the report, they are often just integer vectors,
while for consistency they should be categorical vectors, as returned by predict.
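
For example, raw integer assignments could be wrapped so that the full pool is preserved; the variables below are made up for illustration:

```julia
using CategoricalArrays

k = 3
raw_assignments = [1, 1, 3, 3]                        # hypothetical training labels; cluster 2 unused
labels = categorical(raw_assignments, levels=1:k)     # CategoricalVector with the full pool
levels(labels)                                        # [1, 2, 3] -- one entry per cluster
```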

I believe GMMClusterer is the only probabilistic clusterer.

Comment

  1. So, does this interface rule out some clustering models we have yet to encounter?
    Are there further requirements we should impose?

  2. I have thought that models that do not generalise could be conceptualised
    as Static transformers, but that imposes the requirement that transform
    returns everything of interest (there is no fit to generate a report or
    fitresult), which can be awkward.

  3. For consistency, I'd have thought the target_scitype trait should return
    AbstractVector{<:Multiclass}, as this is the scitype of what predict (or predict_mode) returns.
    But I see this has not been implemented consistently.

  4. Currently there is no way to distinguish which models predict
    probabilities, which predict actual labels, and which do not predict
    at all. The existing prediction_type trait could make this
    distinction (:probabilistic, :deterministic, :unknown); see the
    sketch after this list. At present the models I have checked all
    return :unknown (the fallback).

  5. Another question is whether we anchor the interface with a new
    subtype(s) or use traits.
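
To illustrate item 4, here is a sketch of how the distinction might be declared and queried. The model name is the hypothetical NaiveKMeans from the earlier sketch; these are not actual registry declarations.

```julia
import MLJModelInterface as MMI

# declared by the implementer:
MMI.prediction_type(::Type{<:NaiveKMeans}) = :deterministic
# a probabilistic clusterer would declare :probabilistic instead;
# non-predicting clusterers would keep the :unknown fallback

# queried by users or meta-algorithms:
MMI.prediction_type(NaiveKMeans) == :deterministic   # true
```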

davnn (Collaborator) commented Dec 7, 2021

  1. So, does this interface rule out some clustering models we have yet to encounter?
    Are there further requirements we should impose?

I think this is also a question of MLJ's vision for the future. Does MLJ want to standardize the different tasks (classification, regression, dimensionality reduction, outlier detection, clustering, association rules, ...), such that they are clearly defined, but restricted? In that case, might it make sense to focus on supervised learning, as caret does?

Personally, I would like to see the core API task-independent and flexible, such that it does not rule out a lot of use cases.
Packages that provide functionality for specific tasks should be built on top of MLJ in my opinion.

  2. I have thought that models that do not generalise could be conceptualised
    as Static transformers, but that imposes the requirement that transform
    returns everything of interest (there is no fit to generate a report or
    fitresult), which can be awkward.

I also think that's awkward. I think the resulting clustering assignments should live in the report, but if we want to enable evaluation, we might need to add something like fit_transform and fit_predict, because I don't think it makes sense to prescribe a specific report key.
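
A sketch of the kind of convenience being suggested; fit_predict is not an existing MLJ function, and the fallback below only covers generalizing models, with non-generalizing ones expected to supply their own method:

```julia
import MLJModelInterface as MMI

# generic fallback for clusterers that generalize to new data:
function fit_predict(model::MMI.Unsupervised, X; verbosity=0)
    fitresult, _, _ = MMI.fit(model, verbosity, X)
    return MMI.predict(model, fitresult, X)
end

# a non-generalizing clusterer (e.g. a DBSCAN wrapper) would add its own method
# returning the training assignments computed in fit, so no report key is prescribed:
# fit_predict(model::SomeDBSCAN, X; verbosity=0) = ...
```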

  3. For consistency, I'd have thought the target_scitype trait should return
    AbstractVector{<:Multiclass}, as this is the scitype of what predict (or predict_mode) returns.
    But I see this has not been implemented consistently.

Sounds good, although I would see that as a responsibility of the individual algorithm/package authors.

  4. Currently there is no way to distinguish which models predict
    probabilities, which predict actual labels, and which do not predict
    at all. The existing prediction_type trait could make this
    distinction (:probabilistic, :deterministic, :unknown). At
    present the models I have checked all return :unknown (the
    fallback).

It would make sense to rely on the trait for this, imho. By the way, what is the use of an :unknown prediction? Is that not a transform?

  5. Another question is whether we anchor the interface with a new
    subtype(s) or use traits.

Having worked with the new outlier detection subtypes I'm pretty sure subtypes are not the way to go. Additionally, I've implemented the type hierarchy in JuliaAI/MLJBase.jl#656 (comment) for MMI/Base and I did not like the API. I'm pretty sure a trait-based system is preferable. However, I learned that refactoring is not that bad and quite easily doable.

The reason why subtypes don't work is that they mix up different concepts, e.g. the supervised/unsupervised type defines characteristics of the input data while the probabilistic/deterministic trait defines characteristics of the output data. Each time you want to define something that works for all probabilistic models (classifiers, clusterers, outlier detectors, ...), you'd have to define/rely on some type union. Mixins would capture such relationships, but Julia does not have Mixins.

The most consistent solution would probably be to directly subtype from Model and move all model aspects to traits, even though removing Probabilistic et al. from the hierarchy will be more initial refactoring effort.
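
To make the union-versus-trait point concrete, a sketch (the union constant and helper function below are illustrative, not existing MLJ definitions):

```julia
import MLJModelInterface as MMI

# with subtypes, "all probabilistic models" needs a union across unrelated branches:
# const AnyProbabilistic = Union{MMI.Probabilistic, ProbabilisticUnsupervisedDetector, ...}

# with a trait, the same property cuts across the hierarchy without a union:
is_probabilistic(M::Type{<:MMI.Model}) = MMI.prediction_type(M) === :probabilistic
is_probabilistic(m::MMI.Model) = is_probabilistic(typeof(m))
```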

ablaom (Member, Author) commented Dec 17, 2021

Thanks for chiming in here with some detailed feedback. Very much appreciated. Will get back to this eventually.
