Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve PCA and spectral inits #225

Merged
merged 7 commits into from
Feb 2, 2023

Conversation

dkobak
Copy link
Contributor

@dkobak dkobak commented Jan 12, 2023

A tiny PR that would fix #224, #223, and #180. I am not insisting on any of these changes, but thought I make the PR to make it easier to discuss.

@dkobak
Copy link
Contributor Author

dkobak commented Jan 12, 2023

A lot of tests fail because we would need to fix random seeds over there. I'm not doing it for now -- let's discuss first.

@pavlin-policar
Copy link
Owner

pavlin-policar commented Feb 2, 2023

I think adding a tiny amount of noise to each initialization is fine if it solves all these issues. I'm used to calling this jitter, so that's what I'll call it. However, I think the user should be able to turn this off. For instance, we could add a general parameter TSNE(..., jitter_initialization=True) so that the user can still turn this off. I would also put this into its own separate function e.g.

def jitter(x, scale=0.01, random_state):
    std = np.std(x[:, 0])
    target_std = std * scale
    return x + random_state.normal(0, std, x.shape)

The we can call this pca, spectral, and any other initializations we have. I definitely wouldn't add this to the rescale function as that could be very confusing having rescale also actually change our initialization. We can turn this off by default for precomputed embeddings, but this might be a pain, because then we'd again need to implement something like jitter_initialization="auto" so we can figure out if we need to jitter or not. Maybe it would be easier to just enable jittering by default for all initializations, and if the user wants, they can turn this off manually. We'd need to document this behaviour in the initialization parameters.

What do you think?

@dkobak
Copy link
Contributor Author

dkobak commented Feb 2, 2023

Sure, let me move jitter out of rescale() into jitter().

I also agree that there should be a way to switch it off. What about adding add_jitter=True parameter to pca() and to spectral()? Then the user would be able to switch jitter off by manually calling e.g. pca(add_jitter=False).

I am not so sure about having the global parameter like TSNE(..., jitter_initialization=True), because indeed I think it would make more sense not to add jitter to a user-provided initialization.

@pavlin-policar
Copy link
Owner

What about adding add_jitter=True parameter to pca() and to spectral()? Then the user would be able to switch jitter off by manually calling e.g. pca(add_jitter=False).

Yes, this is exactly what I had in mind. We should add this to both pca and spectral. The other two -- weighted_mean and median can only be used when adding new points to an embedding, and they ignore one another, so it doesn't make any difference if we jitter those, so I wouldn't add the jitter parameter to those.

I am not so sure about having the global parameter like TSNE(..., jitter_initialization=True), because indeed I think it would make more sense not to add jitter to a user-provided initialization.

So you'd want to have jitter on by default for pca and spectral, but have it off for precomputed? The issue I see with that is that the user then can't use standard pca without jitter without having to precompute the initialization, then passing it through the constructor.

It might be fine though. The TSNE constructor has enough parameters as it is, and nobody will notice this in practice, since the added noise should be very small. Okay, I think this is fine for now then, and we'll add the parameter later, if we find any need for it.

@dkobak
Copy link
Contributor Author

dkobak commented Feb 2, 2023

Agree. I refactored the jitter code, and also added a test that checks that our spectral init coincides with sklearn spectral embedding. Luckily, it passes :-)

I also had to fix random seeds in some existing tests that used PCA/spectral initializations.

@pavlin-policar
Copy link
Owner

Great, thanks! I'll close the related issues.

@dkobak dkobak deleted the init-fixes branch February 2, 2023 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding tiny amount of noise to PCA/spectral init to prevent points from overlapping
2 participants