Implement auto early exaggeration #220
Conversation
openTSNE/tsne.py
Outdated
```diff
  gradient_descent_params = {
      "dof": self.dof,
      "negative_gradient_method": self.negative_gradient_method,
-     "learning_rate": self.learning_rate,
+     "learning_rate": learning_rate_now,
```
I'm not sure this is the correct place to put this. This would set the learning rate for the duration of the whole embedding, not just the early exaggeration phase, right? I think the best place to put this might actually be inside the `gradient_descent` function, so the correct rescaling will happen during any call to `.optimize`, not just the standard `TSNE().fit` call.

I wasn't aware that this was an issue during the early exaggeration phase at all. Is it actually necessary to rescale the learning rate if the exaggeration isn't 1? In that case, wouldn't it also make sense to rescale it for any exaggeration value, not just during the early exaggeration phase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current default openTSNE behaviour (which is the same in FIt-SNE and about to become the default in the soon-to-be-released sklearn 1.2) is to use learning rate N/12. Here 12 corresponds to the early exaggeration, the idea being that if learning_rate * exaggeration > N, then the optimization may diverge (as shown by Linderman & Steinerberger). One COULD use learning rate N, instead of N/12, for the 2nd optimization phase, once early exaggeration is turned off. But we do not do it. We could decide to do it, and I think it may even be a good idea, but this would make openTSNE's default behaviour different from other implementations. In any case, I think this requires a different PR and a separate issue, maybe comparing the current and suggested schemes on a range of datasets.

In this PR I assumed that we want to keep the learning rate constant throughout optimization. That is why I set it to N/early_exaggeration and keep the same learning rate for the second phase.
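As a minimal sketch of the divergence rule of thumb referenced here (the function and numbers are illustrative only, not openTSNE code):

```python
def may_diverge(learning_rate, exaggeration, n_samples):
    """Rule of thumb paraphrased above (Linderman & Steinerberger):
    optimization may diverge once learning_rate * exaggeration exceeds N."""
    return learning_rate * exaggeration > n_samples


n = 70_000                          # an MNIST-sized dataset, purely for illustration
print(may_diverge(n / 12, 12, n))   # False: N/12 with exaggeration 12 sits right at the boundary
print(may_diverge(n, 12, n))        # True: learning rate N during early exaggeration risks divergence
```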
Yes, that makes sense, I'd forgotten about that. However, I'm a bit unsure what the best way to implement this would be. We currently handle `learning_rate="auto"` here in the `_handle_nice_params` function. However, this function is called in every `.optimize` step, so that we can perform whatever optimization sequence we want.

With the way you've implemented it now, we'd be going around this function, because we'd change the `learning_rate` from `auto` to a number before we even got to this function. This would also cause inconsistent behaviour between

```python
TSNE().fit()
```

and

```python
embedding.optimize(...)
embedding.optimize(...)
```

since the first one would then use the "correct" learning rate, while the second would use N/12 in both cases. I am very much opposed to this inconsistency, but I'm not sure what the best way to handle this would be.
Perhaps the best course of action would be to investigate rescaling the learning rate based on exaggeration, so we'd end up with a different learning rate for the early exaggeration and the standard optimization phase. I think this would best fit into our framework, and it seems the most principled approach.
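For concreteness, here is a sketch of what that per-phase rescaling could look like when spelled out as explicit `.optimize()` calls. The call pattern follows openTSNE's advanced-usage style, but the data, perplexity, iteration counts, and variable names are all made up for illustration:

```python
import numpy as np
from openTSNE import TSNEEmbedding
from openTSNE.affinity import PerplexityBasedNN
from openTSNE.initialization import pca

X = np.random.randn(1000, 50)   # placeholder data
n = X.shape[0]

affinities = PerplexityBasedNN(X, perplexity=30)
embedding = TSNEEmbedding(pca(X), affinities)

# Early exaggeration phase: learning rate scaled down by that phase's exaggeration.
emb_ee = embedding.optimize(n_iter=250, exaggeration=12, learning_rate=n / 12, momentum=0.5)

# Standard phase: exaggeration is effectively 1, so the full N can be used.
emb_final = emb_ee.optimize(n_iter=500, learning_rate=n, momentum=0.8)
```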
I see your point. I will start a separate issue.
To build on top of #220 (comment), I've run a quick experiment to see convergence rates using lr/exaggeration, and the results indicate that N/exag may actually lead to better and faster convergence than simply using N/ee_rate throughout. Visually, all four embeddings look pretty similar, and I wouldn't say any are less converged than the others, but it seems like we can get the same KL divergence faster this way. What do you think?
Does the "regular" phase run with exaggeration=1?
Yes, the "regular" phase is run with exaggeration=1
No, this is my typical macosko example. This is one example, so we'd definitely have to check it on several more, but this does indicate "that at least it wouldn't hurt". And if we were to implement N/exag, this would fit much more cleanly into the entire openTSNE architecture. I wouldn't mind being slightly inconsistent with FIt-SNE or scikit-learn in this respect, since the visualizations seem visually indistinguishable from one another.
I agree. I ran the same test on MNIST, and observed faster convergence using the suggested approach. Incidentally, I first did it wrong because I did not specify the correct momentum terms for the two optimize() calls. This made me realize that I am not aware of any reason why momentum should be different between the two stages. I tried using momentum=0.8 for both stages, and it seems to be better than the current 0.5->0.8 scheme. Note that your criticism that
Yes, this is the same issue, and it has bothered me from the very start. So I'm very happy to see that using a single momentum seems to lead to faster convergence, as this would justify defaulting to 0.8 everywhere. I'm not aware of any justification for this choice either; it came from the original LvdM implementation and I never bothered to question it.
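To make the comparison explicit, a sketch of the two momentum schedules being discussed, reusing the hypothetical `embedding` from the snippet further up (the 250 + 500 iteration counts are only illustrative):

```python
# Current scheme: momentum 0.5 during early exaggeration, then 0.8 afterwards.
current = (
    embedding.optimize(n_iter=250, exaggeration=12, momentum=0.5)
             .optimize(n_iter=500, momentum=0.8)
)

# Scheme tried here: a single momentum of 0.8 in both stages.
proposed = (
    embedding.optimize(n_iter=250, exaggeration=12, momentum=0.8)
             .optimize(n_iter=500, momentum=0.8)
)
```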
It seems that 250 iterations may be too many with this momentum setting, but maybe let's not touch the n_iter defaults for now. It would be good to check this on a couple more datasets, perhaps also very small ones (Iris?), but overall I think it looks good.
Yes, I agree.
I'd also check a couple of big ones: the cao one, maybe the 10x mouse as well. It might also be interesting to see if we actually need lr=200 on iris. Maybe lr=N=150 would be better. The 200 now seems kind of arbitrary.
Iris:
Very good point. The red line shows that turning off the learning rate "clipping" (I mean clipping to 200) actually works very well.
That's great! I think we should test it for even smaller data sets, but this indicates that we can get rid of the 200 altogether.
Are you going to run it on something with a sample size over 1 mln? It sounds like you have everything set up for these experiments. But if you want, I can run something as well.
Yes, sure, I'll find a few more data sets and run them. If everything goes well, we'll change the defaults to momentum=0.8 and lr=N/exaggeration, and this will solve all the issues outlined above.
This may be worth bumping to 0.7!
Have you had a chance to run it on some other datasets? Otherwise I would give it a try on something large, I am curious :)
Hey Dmitry, no, unfortunately, I haven't had time yet. It's a busy semester for teaching :) If you have any benchmarks you'd like to run, I'd be happy to see the results.
Just ran it in an identical way for Iris, MNIST, and the n=1.3 mln dataset from 10x. I used uniform affinities with k=10 in all cases to speed things up. Old: current defaults. I think everything is very consistent:
I've also tried this on two other data sets: shekhar (20k) and cao (2 mln). In both cases, using
I don't understand this entirely. So, can we use
I now think that we should use
Yes, I agree. I tried it myself and there are big oscillations. E.g. subsampling iris to 50 data points also leads to less oscillation with

I think the best course of action is to add a
Do you want me to edit this PR, or do you prefer to make a new one?
Is it okay to make
I think adding it to this PR is completely fine.
I think it does. The exaggeration factor should come in through
But the exaggeration factor does not really feel like a "gradient descent parameter"... So I was reluctant to add it into the
Edit: sorry, misread your comment. If it is already passed in as you said, then there is of course no need to change it.

Edit 2: but actually, looking at how it's handled at Line 1399 in 46d65ae
My understanding is that it is already included in the
for the EE phase, and
for the standard phase. Importantly,

```python
learning_rate = optim_params["learning_rate"]
if learning_rate == "auto":
    exaggeration = optim_params.get("exaggeration", None)
    if exaggeration is None:
        exaggeration = 1
    learning_rate = n_samples / exaggeration
optim_params["learning_rate"] = learning_rate
```
I see. Still not quite sure where it gets added to the dictionary, but it does not matter now. I made the changes and added a test that checks if running
I think this is fine now. I'm glad we found a way to simplify the API and speed up convergence at the same time :)
Implements #218.
First, `early_exaggeration="auto"` is now set to `max(12, exaggeration)`.

Second, the learning rate. We have various functions that currently take `learning_rate="auto"` and set it to `max(200, N/12)`. I did not change this, because those functions usually do not know what the early exaggeration was. So I kept it as is. I only changed the behaviour of the base class: there `learning_rate="auto"` is now set to `max(200, N/early_exaggeration)`.

This works as intended:
(Note that the learning rate is currently not printed by `repr(self)` because it's kept as "auto" at construction time and only set later. That's also how we had it before.)
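To restate the resolution rules above in code form, a minimal sketch; the `max()` expressions are quoted from this description, while the helper function itself is purely illustrative and not part of the PR:

```python
def resolve_auto_defaults(n_samples, exaggeration=1,
                          early_exaggeration="auto", learning_rate="auto"):
    # early_exaggeration="auto" -> max(12, exaggeration)
    if early_exaggeration == "auto":
        early_exaggeration = max(12, exaggeration)
    # in the base class, learning_rate="auto" -> max(200, N / early_exaggeration)
    if learning_rate == "auto":
        learning_rate = max(200, n_samples / early_exaggeration)
    return early_exaggeration, learning_rate


print(resolve_auto_defaults(n_samples=70_000))  # (12, 5833.33...)
print(resolve_auto_defaults(n_samples=150))     # (12, 200) -- small data hits the 200 floor
```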