In the random case this comes down to the curse of dimensionality. Oddly enough, a spherical Gaussian in high-dimensional space actually has almost all of its mass in a thin spherical shell, not in the middle of the ball. Uniform distributions end up with points "in the corners" in high dimensions.

UMAP does try to correct for these factors, but when doing a train/test split and training on one set of data it learns the distribution only from that training data. The new test data gets transformed under that learned distribution, and so it generally ends up on the "outside" because, in practice, that's where most of the data is. In low dimensions this gets rendered as the new data forming a "shell" around the outside.

The question then becomes why this doesn't happen with small numbers of neighbors. The answer is that, with very local neighborhoods, UMAP tends to follow the very local variations in the data. Among those finer-grained, small-scale variations, the broader pattern of the distribution gets lost, so less of the new data appears in the "shell": it is instead tangled up in little local patterns.
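The shell concentration is easy to check empirically. Here is a small NumPy sketch (not from the original reply, just an illustration): it samples standard spherical Gaussians at increasing dimension and shows that the norms cluster ever more tightly around √d, i.e. nearly all points sit in a thin shell rather than near the origin.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # samples per dimension

for d in (2, 10, 100, 1000):
    x = rng.standard_normal((n, d))          # spherical Gaussian in R^d
    norms = np.linalg.norm(x, axis=1)        # distance of each point from the origin
    # The mean norm grows like sqrt(d), while the relative spread of the
    # norms shrinks (roughly like 1/sqrt(2d)) -- the "shell" gets thinner.
    print(f"d={d:5d}  mean norm={norms.mean():8.3f}  "
          f"sqrt(d)={np.sqrt(d):8.3f}  rel. std={norms.std() / norms.mean():.3f}")
```

At d=2 the norms vary by roughly half their mean; by d=1000 the relative spread is a couple of percent, so essentially no sample lands near the middle of the ball.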