
add Hierarchical Clustering & some docstring fixes #9

Merged 12 commits into master on Sep 6, 2022

Conversation

@jbrea (Collaborator) commented Sep 8, 2021

@ablaom predict(mach, Xnew) doesn't make any sense for hierarchical clustering, but predict(mach) does. Do you think this is still useful like this?

(ps. Sorry for mixing in docstring fixes...)

@ablaom (Member) commented Sep 10, 2021

@jbrea Thanks indeed for this contribution.

At first glance this implementation makes sense, but on reflection certain aspects bother me. For example, the possibility that predict(mach) can give different answers without fit! being called in between sounds dangerous. (This should probably be ruled out in the API.) A second, admittedly weaker, objection is that type ambiguity prevents us from having operations with zero data arguments - you would have to call predict(mach, nothing) or something similar.

Instead, I would be inclined to conceptualise this model as a Static transformer. After all, you are not "learning" in the sense of "learning to generalise to unseen examples". Implemented this way, there is no fit to overload, just a transform method with the "training" data as input. There remains the question of what the output should be. Some options I can think of:

  1. Just return the full Hclust object. Pros: flexibility to determine the right cut-off h/k in a later process. Con: To use the object, the user needs to be familiar with its interface
  2. Return the cluster labels, as you do now, according to the h/k values in the struct. Pros: Simple. Interpretation of output is clear. Cons: Need to re-compute (recall transform) for every new h/k.
  3. Return a closure that maps (k, h) or (k,) or (h,) to the labels for the cutoff defined by k, h. Pros: simple, flexible. No need for the user to know the interface for Hclust. Cons: It is unusual to return callable objects. However, a nice show for the closure could mitigate the mystery. We did this for DecisionTrees here, where the closure is for pretty printing of the tree and the argument is the tree depth. (Aside: there is already a scitype for callable objects, CallableReturning{S}, but I need to move it from MLJBase into ScientificTypesBase, which I'm happy to do.) Edit: I guess we would need to write a plot recipe to overload plot for our callable object.

My vote is for 3. What do you think? And do you have objections to implementing this as Static?
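For concreteness, option 3 could look something like the following sketch, built directly on Clustering.jl's hclust/cutree (the function name dendrogram_cutter and the defaults are purely illustrative, not a proposed interface):

```julia
using Clustering, Distances

# Illustrative sketch of option 3: the "transform" step returns a closure
# over the fitted dendrogram, mapping a cutoff (k clusters or height h)
# to cluster labels.
function dendrogram_cutter(X::AbstractMatrix; linkage = :average)
    d = pairwise(Euclidean(), X, dims = 2)  # columns are observations
    tree = hclust(d, linkage = linkage)     # the expensive step, done once
    cut(; k = nothing, h = nothing) = cutree(tree; k = k, h = h)
    return cut
end

cut = dendrogram_cutter(rand(3, 20))
labels_k = cut(k = 3)    # labels for a 3-cluster cutoff
labels_h = cut(h = 0.5)  # labels for a cutoff at height 0.5
```

A custom show method for the returned closure could display a dendrogram summary, mitigating the "mystery callable" concern raised above.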

@jbrea (Collaborator, Author) commented Nov 17, 2021

@ablaom Thanks for your feedback and suggestions! I like version 3 and implemented it.
I think the plot recipe is not urgent with plot(hc.dendrogram) (see docstring of HierarchicalClustering).
The only thing I dislike about this approach is that it is not a model and is therefore not listed in models(); e.g. info("HierarchicalClustering") errors.

codecov-commenter commented Nov 17, 2021

Codecov Report

Merging #9 (9f08382) into master (6f60a35) will decrease coverage by 1.22%.
The diff coverage is 84.61%.

@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage   98.41%   97.18%   -1.23%     
==========================================
  Files           1        1              
  Lines          63       71       +8     
==========================================
+ Hits           62       69       +7     
- Misses          1        2       +1     
Impacted Files Coverage Δ
src/MLJClusteringInterface.jl 97.18% <84.61%> (-1.23%) ⬇️


@ablaom (Member) commented Nov 18, 2021

@jbrea Thanks for that. Unfortunately, I have been rethinking my Static suggestion since making it. After discussions at the other open PR on DBSCAN, I'm not sure it is best after all. And I see that all the sklearn static clusterers we wrap are implemented as Unsupervised transformers without transform or predict methods.

Can we sit on this until the API is firmed up?

@ablaom (Member) commented Jul 14, 2022

@jbrea Thanks for your patience. We may want to revisit this after JuliaAI/MLJBase.jl#806

@jbrea (Collaborator, Author) commented Jul 18, 2022

I'll have a look at this in early September, if you are not in a hurry.

@jbrea (Collaborator, Author) commented Aug 31, 2022

Hi @ablaom

I am struggling a bit with squeezing hierarchical clustering into the current design.
In hierarchical clustering the "heavy-lifting" is done when computing the dendrogram.
Once one has the dendrogram, it is cheap to make predictions by cutting the dendrogram at the desired height.
The current implementation uses a _cache field that holds the dendrogram together with a hash of the data used to compute it; a hash mismatch triggers recomputation of the dendrogram when the data changes.
Do you see a better approach?
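The caching scheme described above might be sketched like this (the struct and field names here are hypothetical, not the actual implementation):

```julia
using Clustering, Distances

# Hypothetical sketch of the _cache idea: recompute the dendrogram only
# when the hash of the data changes.
mutable struct DendrogramCache
    datahash::UInt
    tree::Union{Nothing, Hclust}
end
DendrogramCache() = DendrogramCache(zero(UInt), nothing)

function dendrogram!(cache::DendrogramCache, X; linkage = :average)
    h = hash(X)
    if cache.tree === nothing || cache.datahash != h
        d = pairwise(Euclidean(), X, dims = 2)
        cache.tree = hclust(d, linkage = linkage)  # the "heavy lifting"
        cache.datahash = h
    end
    return cache.tree  # cutting this with cutree is cheap
end
```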

@ablaom (Member) commented Sep 4, 2022

@jbrea Thanks again for looking into this. I agree, this is just not easy to fit into the current design (or any general ML API, really).
I really like your idea, and it's probably the best solution from the point of view of user-friendliness.

However, I'm pretty sure storing state in the model is going to lead to unexpected behaviour down the line. That is, I think predict needs to re-compute from scratch every time it is called.

I think the best we can do is include your earlier closure in the report (in addition to the raw Clustering.Hclust object). The predict call should return the assignments corresponding to the (k, h) parameters at the time of the call.

What do you say?
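Under this approach, predict and the report might fit together roughly as follows (a sketch with assumed names and defaults, not the merged code):

```julia
using Clustering, Distances

# Sketch: predict recomputes the dendrogram from scratch on each call,
# returns the assignments for the current (k, h), and the report carries
# both the raw Hclust object and a cutting closure for later re-cutting.
function predict_with_report(X; k = 3, h = nothing, linkage = :average)
    d = pairwise(Euclidean(), X, dims = 2)
    tree = hclust(d, linkage = linkage)
    cutter(; k = k, h = h) = cutree(tree; k = k, h = h)
    labels = cutter()                      # cutoff at time of call
    report = (dendrogram = tree, cutter = cutter)
    return labels, report
end

labels, report = predict_with_report(rand(3, 30))
relabelled = report.cutter(k = 5)          # re-cut without recomputation
```

The report's cutter closure preserves the cheap re-cutting that motivated the original design, while predict itself stays stateless.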

@jbrea (Collaborator, Author) commented Sep 5, 2022

Thanks, @ablaom. This sounds good to me. I pushed the new version.

@ablaom (Member) left a comment

Nice implementation. Perhaps we should add some tests?

At the least, we should add HierarchicalClusterer to the generic interface tests at the end of runtests.jl

We also need to update this line to include the new model:

jbrea and others added 6 commits September 6, 2022 08:46
Co-authored-by: Anthony Blaom, PhD <anthony.blaom@gmail.com>
@jbrea (Collaborator, Author) commented Sep 6, 2022

Thanks, @ablaom. Tests are added. I think this PR should be ready now.

@ablaom (Member) left a comment

Good to go. @jbrea Thanks for this valuable contribution, your patience, and careful design considerations.

@ablaom ablaom merged commit a9e19bf into JuliaAI:master Sep 6, 2022