
add Hierarchical Clustering & some docstring fixes #9

Merged 12 commits into master on Sep 6, 2022

Conversation

@jbrea (Collaborator) commented Sep 8, 2021

@ablaom predict(mach, Xnew) doesn't make any sense for hierarchical clustering, but predict(mach) does. Do you think this is still useful like this?

(ps. Sorry for mixing in docstring fixes...)

@ablaom (Member) commented Sep 10, 2021

@jbrea Thanks indeed for this contribution.

At first glance this implementation makes sense, but on reflection certain aspects bother me. For example, the possibility that predict(mach) can give different answers without fit! being called in between sounds dangerous. (This should probably be ruled out in the API.) A second, admittedly weaker, objection is that type ambiguity prevents us from having operations with zero data arguments - you would have to call predict(mach, nothing) or something similar.

Instead, I would be inclined to conceptualise this model as a Static transformer. After all, you are not "learning" in the sense of "learning to generalise to unseen examples". Implemented this way, there is no fit to overload, just a transform method with the "training" data as input. There remains the question of what the output should be. Some options I can think of:

  1. Just return the full Hclust object. Pros: flexibility to determine the right cut-off h/k in a later process. Con: To use the object, the user needs to be familiar with its interface
  2. Return the cluster labels, as you do now, according to the h/k values in the struct. Pros: Simple. Interpretation of output is clear. Cons: Need to re-compute (recall transform) for every new h/k.
  3. Return a closure that maps (k, h) or (k,) or (h,) to the labels for the cutoff defined by k, h. Pros: simple, flexible. No need for the user to know the interface for Hclust. Cons: It is unusual to return callable objects. However, a nice show for the closure could mitigate the mystery. We did this for DecisionTrees here, where the closure is for pretty printing of the tree and the argument is the tree depth. (Aside: there is already a scitype for callable objects, CallableReturning{S}, but I need to move it from MLJBase into ScientificTypesBase, which I'm happy to do.) Edit: I guess we would need to write a plot recipe to overload plot for our callable object.

My vote is for 3. What do you think? And do you have objections to implementing this as Static?
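For concreteness, option 3 could look something like the following sketch, built directly on Clustering.jl's hclust/cutree (the function name dendrogram_cutter and the defaults are purely illustrative, not a proposed interface):

```julia
using Clustering, Distances

# Illustrative sketch of option 3: the "transform" step returns a closure
# over the fitted dendrogram, mapping a cutoff (k clusters or height h)
# to cluster labels.
function dendrogram_cutter(X::AbstractMatrix; linkage = :average)
    d = pairwise(Euclidean(), X, dims = 2)  # columns are observations
    tree = hclust(d, linkage = linkage)     # the expensive step, done once
    cut(; k = nothing, h = nothing) = cutree(tree; k = k, h = h)
    return cut
end

cut = dendrogram_cutter(rand(3, 20))
labels_k = cut(k = 3)    # labels for a 3-cluster cutoff
labels_h = cut(h = 0.5)  # labels for a cutoff at height 0.5
```

A custom show method for the returned closure could display a dendrogram summary, mitigating the "mystery callable" concern raised above.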

@jbrea (Collaborator, Author) commented Nov 17, 2021

@ablaom Thanks for your feedback and suggestions! I like version 3 and implemented it.
I think the plot recipe is not urgent with plot(hc.dendrogram) (see docstring of HierarchicalClustering).
The only thing I dislike about this approach is that it is not a model and is therefore not listed in models(); e.g. info("HierarchicalClustering") errors.

codecov-commenter commented Nov 17, 2021

Codecov Report

Merging #9 (9f08382) into master (6f60a35) will decrease coverage by 1.22%.
The diff coverage is 84.61%.

@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage   98.41%   97.18%   -1.23%     
==========================================
  Files           1        1              
  Lines          63       71       +8     
==========================================
+ Hits           62       69       +7     
- Misses          1        2       +1     
Impacted Files Coverage Δ
src/MLJClusteringInterface.jl 97.18% <84.61%> (-1.23%) ⬇️


@ablaom (Member) commented Nov 18, 2021

@jbrea Thanks for that. Unfortunately, I have been rethinking my Static suggestion since making it. After discussions at the other open PR on DBSCAN, I'm not sure it is best after all. And I see that all the sklearn static clusterers we wrap are implemented as Unsupervised transformers without transform or predict methods.

Can we sit on this until the API is firmed up?

@ablaom (Member) commented Jul 14, 2022

@jbrea Thanks for your patience. We may want to revisit this after JuliaAI/MLJBase.jl#806

@jbrea (Collaborator, Author) commented Jul 18, 2022

I'll have a look at this in early September, if you are not in a hurry.

@jbrea (Collaborator, Author) commented Aug 31, 2022

Hi @ablaom

I am struggling a bit with squeezing hierarchical clustering into the current design.
In hierarchical clustering the "heavy-lifting" is done when computing the dendrogram.
Once one has the dendrogram, it is cheap to make predictions by cutting the dendrogram at the desired height.
The current implementation uses a _cache field that holds the dendrogram together with a hash of the data used to compute it; a hash mismatch triggers recomputation of the dendrogram when the data changes.
Do you see a better approach?
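The caching scheme described above might be sketched like this (the struct and field names here are hypothetical, not the actual implementation):

```julia
using Clustering, Distances

# Hypothetical sketch of the _cache idea: recompute the dendrogram only
# when the hash of the data changes.
mutable struct DendrogramCache
    datahash::UInt
    tree::Union{Nothing, Hclust}
end
DendrogramCache() = DendrogramCache(zero(UInt), nothing)

function dendrogram!(cache::DendrogramCache, X; linkage = :average)
    h = hash(X)
    if cache.tree === nothing || cache.datahash != h
        d = pairwise(Euclidean(), X, dims = 2)
        cache.tree = hclust(d, linkage = linkage)  # the "heavy lifting"
        cache.datahash = h
    end
    return cache.tree  # cutting this with cutree is cheap
end
```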

@ablaom (Member) commented Sep 4, 2022

@jbrea Thanks again for looking into this. I agree, this is just not easy to fit into the current design (or any general ML API, really).
I really like your idea, and it's probably the best solution from the point of view of user-friendliness.

However, I'm pretty sure storing state in the model is going to lead to unexpected behaviour down the line. That is, I think predict needs to re-compute from scratch every time it is called.

I think the best we can do is include your earlier closure in the report (in addition to the raw Clustering.Hclust object). The predict call should return the assignments corresponding to the (k, h) parameters at the time of the call.

What do you say?
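Under this approach, predict and the report might fit together roughly as follows (a sketch with assumed names and defaults, not the merged code):

```julia
using Clustering, Distances

# Sketch: predict recomputes the dendrogram from scratch on each call,
# returns the assignments for the current (k, h), and the report carries
# both the raw Hclust object and a cutting closure for later re-cutting.
function predict_with_report(X; k = 3, h = nothing, linkage = :average)
    d = pairwise(Euclidean(), X, dims = 2)
    tree = hclust(d, linkage = linkage)
    cutter(; k = k, h = h) = cutree(tree; k = k, h = h)
    labels = cutter()                      # cutoff at time of call
    report = (dendrogram = tree, cutter = cutter)
    return labels, report
end

labels, report = predict_with_report(rand(3, 30))
relabelled = report.cutter(k = 5)          # re-cut without recomputation
```

The report's cutter closure preserves the cheap re-cutting that motivated the original design, while predict itself stays stateless.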

@jbrea (Collaborator, Author) commented Sep 5, 2022

Thanks, @ablaom. This sounds good to me. I pushed the new version.

@ablaom (Member) left a comment

Nice implementation. Perhaps we should add some tests?

At the least, we should add HierarchicalClusterer to the generic interface tests at the end of runtests.jl

We also need to update this line to include the new model:

jbrea and others added 6 commits September 6, 2022 08:46
Co-authored-by: Anthony Blaom, PhD <anthony.blaom@gmail.com>
@jbrea (Collaborator, Author) commented Sep 6, 2022

Thanks, @ablaom. Tests are added. I think this PR should be ready now.

@ablaom (Member) left a comment

Good to go. @jbrea Thanks for this valuable contribution, your patience, and careful design considerations.

@ablaom ablaom merged commit a9e19bf into JuliaAI:master Sep 6, 2022