In the random case this comes down to the curse of dimensionality. Oddly enough, a spherical Gaussian in high-dimensional space actually has almost all of its mass in a thin spherical shell, not in the middle of the ball. Uniform distributions end up with points "in the corners" in high dimensions.

UMAP does try to correct for these factors, but when doing a train/test split and training on one set of data it learns the distribution only from that training data. The new test data gets transformed under that learned distribution, and so it generally ends up on the "outside" because, in practice, that's where most of the data is. In low dimensions this gets rendered as the new data forming a "shell" around the outside.

The question then becomes why this doesn't happen with small numbers of neighbors. The answer is that, with very local neighborhoods, UMAP tends to follow the very local variations in the data. Among those finer-grained, small-scale variations, the broader pattern of the distribution gets lost, so less of the new data appears in the "shell": it is instead tangled up in little local patterns.
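The shell concentration is easy to check empirically. Here is a small NumPy sketch (not from the original reply, just an illustration): it samples standard spherical Gaussians at increasing dimension and shows that the norms cluster ever more tightly around √d, i.e. nearly all points sit in a thin shell rather than near the origin.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # samples per dimension

for d in (2, 10, 100, 1000):
    x = rng.standard_normal((n, d))          # spherical Gaussian in R^d
    norms = np.linalg.norm(x, axis=1)        # distance of each point from the origin
    # The mean norm grows like sqrt(d), while the relative spread of the
    # norms shrinks (roughly like 1/sqrt(2d)) -- the "shell" gets thinner.
    print(f"d={d:5d}  mean norm={norms.mean():8.3f}  "
          f"sqrt(d)={np.sqrt(d):8.3f}  rel. std={norms.std() / norms.mean():.3f}")
```

At d=2 the norms vary by roughly half their mean; by d=1000 the relative spread is a couple of percent, so essentially no sample lands near the middle of the ball.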