
[BUG] t-SNE is not deterministic even with random_state #2980

Open · Tracked by #4139
zbjornson opened this issue Oct 14, 2020 · 18 comments

Labels: bug (Something isn't working), CUDA / C++ (CUDA issue)
@zbjornson (Contributor)

Describe the bug
cuML's t-SNE outputs vary from run to run, even when random_state is used or initial embeddings are provided (and #2549 is fixed).

Steps/Code to reproduce bug

from cuml.manifold import TSNE
from sklearn.datasets import make_blobs  # samples_generator is deprecated/removed in newer scikit-learn
from matplotlib import pyplot as plt
import numpy as np

X, y = make_blobs(n_samples=10000, centers=8, n_features=4, random_state=0)
X = X.astype(np.float32)
tsne = TSNE(random_state=0)

Y1 = tsne.fit_transform(X)
plt.scatter(Y1[:, 0], Y1[:, 1])
plt.show()

Y2 = tsne.fit_transform(X)
plt.scatter(Y2[:, 0], Y2[:, 1])
plt.show()

[Screenshots: the embeddings from the two runs differ]

Expected behavior
Output should be the same between runs, as with scikit-learn.
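
(A minimal way to check this, continuing the repro above and assuming fit_transform returns a host array, as it does there where the result is indexed with Y1[:, 0]:)

    import numpy as np

    Y1 = np.asarray(tsne.fit_transform(X))
    Y2 = np.asarray(tsne.fit_transform(X))

    # With a fixed random_state these should match exactly, or at least very closely.
    print("identical:", np.array_equal(Y1, Y2))
    print("max abs difference:", float(np.abs(Y1 - Y2).max()))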

Environment details (please complete the following information):

  • Environment location: GCE
  • Linux Distro/Architecture: Ubuntu 20.04 amd64
  • GPU Model/Driver: V100 and driver 440.33.01
  • CUDA: 10.2
  • Method of cuDF & cuML install: source, cmake 3.16.3, gcc 8.4.0, fbe6272
@zbjornson added the "? - Needs Triage (Need team to review and classify)" and "bug (Something isn't working)" labels Oct 14, 2020
@viclafargue removed the "? - Needs Triage (Need team to review and classify)" label Oct 15, 2020
@viclafargue (Contributor)

Thank you for opening the issue. Maybe @danielhanchen can give a hint on this?

@danielhanchen (Contributor)

Not 100% sure, but I remember this being a known issue (for me at least).
The issue is that although the starting embedding is initialized the same with the same random seed, other components inside TSNE are not deterministic (well, not yet set deterministically).

E.g. FAISS has some randomness inside it, and I don't know whether a random seed is used there. Likewise, due to GPU parallelization, TSNE will output vastly different results, since the tree search methods terminate at random intervals for each thread. [I.e., say there are 3 threads A, B, C: A terminates in trial 1, but B in trial 2, etc.]

Further discussion: CannyLab/tsne-cuda#44

One way I found to alleviate this issue was to force the use of a random embedding close to the top 2 PCA components. This way, the initial embedding is at least a good preconditioner, so the final output won't diverge too much from run to run.
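
A minimal sketch of that kind of PCA-seeded initialization, using scikit-learn for the PCA step; cuML's API for supplying a custom initial embedding may differ by version, so the usage lines show scikit-learn's TSNE, which accepts an ndarray init:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE  # sklearn shown here; cuML's custom-init API may differ by version

    def pca_init(X, seed=0, jitter=1e-4):
        """Top-2 PCA projection plus a little noise, to use as the initial embedding."""
        rng = np.random.default_rng(seed)
        Y0 = PCA(n_components=2, random_state=seed).fit_transform(X).astype(np.float32)
        return Y0 + jitter * rng.standard_normal(Y0.shape).astype(np.float32)

    # Y0 = pca_init(X)
    # emb = TSNE(init=Y0, random_state=0).fit_transform(X)  # scikit-learn accepts an ndarray init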

Another solution, which I strongly advise against, is to turn the tree search into a locking construct, i.e. synchronize the entire tree algorithm, which will clearly slow things down.

Sadly I know my comments won't be of much help.

@zbjornson (Contributor, Author)

Thanks for the tips. FAISS as used here is deterministic. It looks like the variation is due to atomics and parallelization.
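
To illustrate why parallel atomics cause this (not from the thread itself): floating-point addition is not associative, so any sum whose accumulation order changes between runs, as atomicAdd ordering does on the GPU, can change in the last bits, and those differences compound over t-SNE iterations. A small NumPy sketch of the order effect:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10_000).astype(np.float32)

    def accumulate(values):
        s = np.float32(0.0)
        for v in values:
            s += v  # sequential float32 accumulation
        return s

    # Same numbers, different accumulation orders -> typically slightly different sums,
    # which is what reordered atomicAdds produce on the GPU from run to run.
    print(float(accumulate(x)), float(accumulate(x[::-1])), float(accumulate(np.sort(x))))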

I tried PCA initialization, but was surprised by the extent of variation between runs:
[Screenshots: five runs with PCA init]
Runs 1,3,5 are similar. Runs 2,4 are similar. Run 2 looks like there's some legit FP error happening, but it's nearly impossible to debug without deterministic behavior.

Five runs with random init but the same seed:
[Screenshot: five runs with random init, same seed]

Five runs with random init, random seeds:
[Screenshot: five runs with random init, random seeds]


UMAP also appears non-deterministic, despite spectral init, although it's less pronounced. Local structure is very well preserved, and global structure is well preserved with a higher number of neighbors.
k=15: [image]
k=50: [image]
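
For reference, a minimal sketch of the comparison behind these plots, assuming cuML's UMAP with NumPy input/output (X is the blob data from the repro above):

    from cuml.manifold import UMAP
    import numpy as np

    # Repeat each setting twice with the same seed; on a fully deterministic build
    # the two runs would be identical.
    for k in (15, 50):
        emb_a = np.asarray(UMAP(n_neighbors=k, random_state=0).fit_transform(X))
        emb_b = np.asarray(UMAP(n_neighbors=k, random_state=0).fit_transform(X))
        print(k, np.abs(emb_a - emb_b).max())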

@danielhanchen (Contributor)

Oh, in regards to the plots where TSNE suddenly diverges: I actually noticed this issue before. [By the way, nice plots :) ]
#1383 had many changes I made a while back, but they weren't incorporated since I left NVIDIA.

To alleviate wild divergences, I leveraged two approaches:
(1) Mean-centre the components after IntegrationKernel every time during an epoch. This should be very fast, since you can accumulate the sums with atomicAdd inside IntegrationKernel.

    // Scale the accumulated coordinate sums by div_N to form the per-dimension means
    MLCommon::LinAlg::unaryOp(
      sums, sums, 2, [div_N] __device__(float x) { return x * div_N; }, stream);
    // Subtract the means from both embedding dimensions (stored at YY and YY + NNODES + 1)
    TSNE::mean_centre<<<MLCommon::ceildiv(n, 1024), 1024, 0, stream>>>(
      YY, YY + NNODES + 1, sums, n);
    CUDA_CHECK(cudaPeekAtLastError());

(2) In IntegrationKernel, if the components exceed some MAX_BOUNDS measure (say 100 or so, or even 1000), clamp them and reset the momentum vector and gains so that TSNE does not continue diverging:

    // Confirm Y1 and Y2 are within bounds (-MAX_BOUNDS, MAX_BOUNDS)
    // These are arbitrary but can help with outliers
    // Also reset both gains and old forces
    if (Y1[i] < -MAX_BOUNDS) {
      Y1[i] = -MAX_BOUNDS;
      gains1[i] = 1;
      old_forces1[i] = 0;
    }
    else if (Y1[i] > MAX_BOUNDS) {
      Y1[i] = MAX_BOUNDS;
      gains1[i] = 1;
      old_forces1[i] = 0;
    }
    if (Y2[i] < -MAX_BOUNDS) {
      Y2[i] = -MAX_BOUNDS;
      gains2[i] = 1;
      old_forces2[i] = 0;
    }
    else if (Y2[i] > MAX_BOUNDS) {
      Y2[i] = MAX_BOUNDS;
      gains2[i] = 1;
      old_forces2[i] = 0;
    }

@zbjornson (Contributor, Author)

Thanks @danielhanchen. Started landing those changes with #3018.

@zbjornson (Contributor, Author)

zbjornson commented Oct 20, 2020

Hrm, even with mean-centering and constraining to a bounding box, roughly 1 in 4 runs still give wonky results, and no runs give great results. Here are a few examples of the same dataset run several times with PCA init:

Exact:     [exact]   All 8 clusters clearly visible
Okay B-H:  [good-1]  All 8 clusters clearly visible, but some spreading.
Bad B-H 1: [bad-1]   Purple and green clusters tiny. Lots of stray points.
Bad B-H 2: [bad-2]   Green cluster tiny. Stray points.
Bad B-H 3: [bad-4]   Clumping in teal cluster. Lots of spreading.

I'm looking at incorporating Martin Burtscher's latest changes to the B-H kernels in hopes that numerical stability will be improved/FP errors reduced. Also looking at the newer FFT kernel approach. CannyLab's version is reproducible and gives consistently good results:

[Image: CannyLab result]

@zbjornson reopened this Oct 20, 2020
@zbjornson changed the title from "[BUG] t-SNE is not deterministic even with random_state" to "[BUG] t-SNE is not deterministic even with random_state; gives odd results" Oct 20, 2020
@zbjornson (Contributor, Author)

I have a version working with FFT. It gives higher embedding quality, solves the cluster spreading issue and is more consistent between runs. However, there's still something upstream of the main loop (e.g. in the perplexity search) that, compared to CannyLab's impl, is causing less spacing between blobs, more variability between runs and lower embedding quality, as well as occasional "pinpoint" results. See below:

CannyLab has better spacing

cuML-FFT: [cuml-fft-100k]    CannyLab: [canny-fft-100k]

FFT is more consistent than BH and resolves spreading. CannyLab is more consistent between runs. cuML-FFT has occasional (~1 in 15 runs) pinpointing or similar math errors (4th run).

Run:       1         2         3         4
cuML-BH:   [good-1]  [bad-1]   [bad-4]   [bad-2]
cuML-FFT:  [cuml-1]  [cuml-2]  [cuml-3]  [cuml-5]
CannyLab:  [cl-1]    [cl-2]    [cl-3]    [cl-4]

CannyLab ends with 1.8x better grad norm and converges much faster

[Plot: gradient norm vs. iteration for cuML and CannyLab]


This was a bigger project than I intended to take on, and I don't know if it would even be accepted into cuML. (For my own [work] purposes I'm likely going to use an optimized fork of CannyLab's.) I'm running out of steam for the moment to keep digging into the remaining cuML-FFT issues, but in addition to the above wins, the FFT version ...

  • is 20-30% faster
  • is significantly shorter and simpler
  • adds early stopping based on gradient norm (sketched after this list)
  • has more acceptable licensing than Burtscher's Barnes-Hut (Note that v4.5 has a more permissive license, but cuML is based on v3.1, which has a commercial-restrictive license. v4 uses a 3D tree.)

so I think it's worth finishing. The current B-H version doesn't seem usable.
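
For illustration, the early-stopping item above roughly amounts to something like this; the names (step, min_iter, tol) are hypothetical, not cuML's actual symbols:

    # Hypothetical sketch (illustrative names, not cuML's code) of gradient-norm-based
    # early stopping inside the optimization loop.
    def optimize(step, n_iter=1000, min_iter=250, tol=1e-7):
        prev_norm = float("inf")
        it = 0
        for it in range(n_iter):
            grad_norm = step(it)  # one gradient/update step; returns the gradient norm
            if it >= min_iter and abs(prev_norm - grad_norm) < tol:
                break  # converged: the gradient norm has stopped changing
            prev_norm = grad_norm
        return it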

@cjnolet @dantegd @drobison00 thoughts?

@cjnolet (Member)

cjnolet commented Oct 24, 2020

@zbjornson This looks like a pretty significant change and I do think it's worth investigating more closely as a potential alternative for cuML. Please bear with me (and anyone else who might be interested) while I/we get spun up on the details of this approach.

Unfortunately, today I found what appears to be a regression in 0.16, and I'm concerned it was introduced in recent changes: #3057. I believe this regression could be affecting your comparisons as well, and could also be the reason why the current version appears unusable. We might need some more comparisons across versions, but I don't recall seeing artifacts and outliers in the cuML t-SNE like those in your recent images.

What do you think about bringing over the FFT implementation and adding it as an additional option for now using algorithm='fft'?

However, there's still something upstream of the main loop (e.g. in the perplexity search) that, compared to CannyLab's impl, is causing less spacing between blobs, more variability between runs and lower embedding quality,

I've found that different implementations of the algorithm cause the perplexity/n_neighbors (and other hyper params) to have different effects on the resulting embedding. This makes me wonder whether the distances between the resulting clusters in the FFT version could be reproduced by adjusting the hyper-params and whether they can be directly compared using the same hyper-param settings.

@zbjornson (Contributor, Author)

What do you think about bringing over the FFT implementation and adding it as an additional option for now using algorithm='fft'

About to open a PR for this.

Unfortunately, today I found what appears to be a regression in 0.16, and I'm concerned it was introduced in recent changes: #3057.

Argh... I'll bisect if I have time this weekend.

This makes me wonder whether the distances between the resulting clusters in the FFT version could be reproduced by adjusting the hyper-params and whether they can be directly compared using the same hyper-param settings.

Section 4 of https://arxiv.org/abs/1712.09005 had an answer:

We present a novel variation called "late exaggeration," which refers to setting α > 1 for the last several hundred iterations. This approach seems to produce more easily interpretable visualizations: one recurring issue with t-SNE outputs (see Fig. 5) is that the arising structure, while clustered, has its clusters close to each other and does not allow for an easy identification of segments. By increasing α (and thus the attraction between points in the same cluster), the clusters contract and are more easily distinguishable.

CannyLab's impl doesn't do this, so it's a bit odd that this impl needs it. Anyway, that's one way to resolve the spacing issue.
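
For illustration (not the cuML or CannyLab code), an exaggeration schedule along the lines the paper describes could look like this; all constants are made-up examples:

    # Toy schedule: alpha multiplies the attractive, P-based term of the t-SNE gradient,
    # so alpha > 1 late in the run pulls clusters tighter and increases spacing between them.
    def exaggeration(it, n_iter, early=12.0, late=2.5, early_iters=250, late_iters=300):
        if it < early_iters:
            return early  # standard "early exaggeration"
        if it >= n_iter - late_iters:
            return late   # "late exaggeration" (alpha > 1), per Linderman et al.
        return 1.0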

I haven't resolved the occasional "pinpointing" issue. I'm not using any randomization, so I think that means it's due to some atomic operation that will be miserable to find and correct. It's likely outside of the main loop though given that this issue doesn't occur in the CannyLab impl and the loops are nearly identical. (The reason they don't need to use late exaggeration >1 is probably related.) Unfortunately this occurs more frequently with more rows.

@cjnolet (Member)

cjnolet commented Oct 28, 2020

ref #3058

@zbjornson (Contributor, Author)

#3084 resolved the artifacts and the remaining non-determinism appears to be from FP atomics. Don't think there's anything left to do (without major perf impacts).

@cjnolet reopened this Nov 12, 2020
@cjnolet (Member)

cjnolet commented Nov 12, 2020

Opening this back up to track continued investigation of potential data races in floating-point additions.

@cjnolet (Member)

cjnolet commented Nov 12, 2020

Tagging @avantikalal

@zbjornson changed the title from "[BUG] t-SNE is not deterministic even with random_state; gives odd results" back to "[BUG] t-SNE is not deterministic even with random_state" Nov 15, 2020
@trivialfis (Member)

Hi, may I ask what's the current status of this? I would like to look into it after UMAP. ;-)

@cjnolet (Member)

cjnolet commented May 13, 2021

@trivialfis, the new (experimental) FFT implementation has lower variance and better numerical stability but it's not completely deterministic. It would be great if you were able to make this deterministic as well.
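
For anyone picking this up, a hedged sketch of selecting the experimental FFT path; the parameter spelling has varied in this discussion (algorithm vs. method), so check the TSNE docs for the cuML release you are running:

    from cuml.manifold import TSNE

    # Select the FFT-based implementation discussed in this thread; the exact parameter
    # name may depend on the cuML version.
    tsne_fft = TSNE(method="fft", random_state=0)
    emb = tsne_fft.fit_transform(X)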

@trivialfis (Member)

Thanks for sharing the information. It will be on my to-do list.
