Hard and Rare samples estimation with NN based models (Team-12, ML-24)

Introduction

Large datasets often contain instances that do not contribute equally to the learning process. These instances may include mislabeled, difficult-to-learn, or redundant samples. Our objective is to reduce the size of the initial dataset by retaining only highly relevant samples. This speeds up model convergence, reduces storage requirements for useful data, and lowers computational overhead. We also aim to filter the unlabeled dataset proactively to reduce data labeling costs without compromising the final quality of the model.

Approaches

Supervised learning: SSFT
Unsupervised learning: AutoEncoder

Results

All results can be found in the corresponding folders and notebooks.

SSFT

A network with fewer parameters (0.3M vs. 3M) tends to reach decent SSFT metrics even without LR scheduling:

Visualization of samples taken from different clusters:

In the AutoEncoder scenario, we can treat a sample as learnt once its SSIM exceeds a fixed threshold (0.15 in our case):
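A minimal sketch of this thresholding rule, assuming SSIM is measured between an input image and its reconstruction. To stay dependency-free it uses a simplified single-window SSIM over the whole image rather than the usual sliding-window variant. `global_ssim` and `learnt_mask` are hypothetical helper names, not functions from this repository.

```python
from statistics import fmean


def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM computed once over two flattened images of equal length."""
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mx, my = fmean(x), fmean(y)
    vx = fmean([(a - mx) ** 2 for a in x])
    vy = fmean([(b - my) ** 2 for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


def learnt_mask(originals, reconstructions, threshold=0.15):
    """Mark a sample as learnt when its reconstruction SSIM exceeds the threshold."""
    return [
        global_ssim(x, y) > threshold
        for x, y in zip(originals, reconstructions)
    ]
```

For real images, a windowed SSIM implementation from an image-processing library would likely be a closer match to the numbers reported here.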

However, while training on the second split, SSIM on the samples from the first split keeps growing, meaning that the AE reconstruction task tends to generalize rather than overfit to the training samples:

Self- and cross-correlations are poor metrics for separating learnt from not-yet-learnt samples in the contrastive scenario with BarlowTwins:

AutoEncoder

Example of one of the 3000 mixed images:

Training process:

Metric: sum of the loss over epochs 1 to 100
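A minimal sketch of this metric, assuming the per-sample loss is recorded once per epoch as a list of rows (one row per epoch). The helper names and the tail fraction are illustrative assumptions, not values taken from the notebooks.

```python
def cumulative_loss(loss_history, first_epoch=1, last_epoch=100):
    """Sum each sample's loss over epochs first_epoch..last_epoch (1-based, inclusive).

    loss_history[e][i] is the loss of sample i at epoch e + 1.
    """
    selected = loss_history[first_epoch - 1:last_epoch]
    return [sum(per_sample) for per_sample in zip(*selected)]


def right_tail(scores, fraction=0.05):
    """Indices of the highest-scoring samples (hypothetical tail fraction)."""
    k = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order[:k]
```

Samples in the right tail are the ones the autoencoder reconstructs worst across training, which makes them candidates for hard or rare samples.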

Distribution:

Right tail:

Artificial label

The label "1" was applied to mixtures of images. 10% of the MNIST train images were summed and normalized to have zero mean and unit standard deviation. Normalization prevents the AE from separating the majority of latent representations of distorted images into a distinct cluster, increasing the correlation of the metrics.
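A minimal sketch of how such an artificial label could be built, assuming each selected image is summed with one other randomly chosen image and that every image (mixed or not) is standardized. Both choices, as well as the function names and the flattened-list image format, are assumptions for illustration.

```python
import random
from statistics import fmean, pstdev


def standardize(pixels):
    """Rescale a flattened image to zero mean and unit standard deviation."""
    mu = fmean(pixels)
    sigma = pstdev(pixels) or 1.0  # guard against constant images
    return [(p - mu) / sigma for p in pixels]


def make_mixed_dataset(images, mix_fraction=0.1, seed=0):
    """Return (images, labels) where label 1 marks a mixture of two images."""
    rng = random.Random(seed)
    n = len(images)
    mixed_ids = set(rng.sample(range(n), int(n * mix_fraction)))
    out, labels = [], []
    for i, img in enumerate(images):
        if i in mixed_ids:
            j = rng.randrange(n - 1)
            j += j >= i  # skip index i so an image is never mixed with itself
            img = [a + b for a, b in zip(img, images[j])]
        out.append(standardize(img))
        labels.append(1 if i in mixed_ids else 0)
    return out, labels
```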

SRCC of metrics with the artificial label

- Non-supervised metrics show a modest correlation with the artificial target.
- More complicated metrics like LID and Entropy yield better results.
- Loss based metrics exhibit greater correlation with the artificial target.
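For reference, SRCC (Spearman rank correlation coefficient) is the Pearson correlation of the ranks. A minimal dependency-free sketch follows; ties receive average ranks, which matters here because the artificial label is binary. The function names are illustrative.

```python
from statistics import fmean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def srcc(metric_values, target):
    """Spearman rank correlation between a per-sample metric and a target."""
    return pearson(average_ranks(metric_values), average_ranks(target))
```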

Non-supervised metrics:

Non-supervised metric	SRCC
H_mean_from_0	0.2524
LID_mean_from_10	0.2403
H_mean_from_400	0.2298
H_last	0.2285
LID_var_from_10	0.2124
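The LID metrics refer to local intrinsic dimensionality. The exact estimator used in the notebooks is not stated here, so the following is only a sketch of the common maximum-likelihood estimator over the k nearest neighbours, for example applied to bottleneck embeddings. The function name and the default `k` are assumptions.

```python
import math


def lid_mle(query, reference, k=20):
    """Maximum-likelihood LID estimate of `query` from its k nearest neighbours.

    Uses LID = -1 / mean(log(r_i / r_k)), where r_i are the sorted neighbour
    distances and r_k is the largest of them. Zero distances are skipped.
    """
    dists = sorted(math.dist(query, r) for r in reference)
    dists = [d for d in dists if d > 0][:k]
    r_k = dists[-1]
    mean_log = sum(math.log(d / r_k) for d in dists) / len(dists)
    if mean_log == 0:
        return math.inf  # all neighbours at the same distance
    return -1.0 / mean_log
```

Points lying along a line should give an estimate close to 1, and points spread over a plane an estimate close to 2.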

Loss based metrics:

loss metric	SRCC
loss_last	0.3826
loss_mean_from_50	0.3801
loss_mean_from_20	0.3793
loss_mean_from_0	0.3785
loss_diff_last_20	0.3067
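The metric names in these tables suggest per-sample aggregates over the training history. A minimal sketch under that reading, assuming `loss_last` is the final-epoch value and `loss_mean_from_k` is the mean from epoch index k onward (0-based). This interpretation and the helper names are assumptions, and the `loss_diff_*` metrics are left out because their definition is not stated here.

```python
from statistics import fmean


def last_value(history):
    """Value of a per-sample metric at the final recorded epoch."""
    return history[-1]


def mean_from(history, start_epoch):
    """Mean of a per-sample metric from `start_epoch` (0-based) to the end."""
    return fmean(history[start_epoch:])


def loss_features(history, starts=(0, 20, 50)):
    """Collect the aggregates for one sample into a dict keyed by metric name."""
    features = {"loss_last": last_value(history)}
    for s in starts:
        if s < len(history):  # skip windows that start past the history
            features[f"loss_mean_from_{s}"] = mean_from(history, s)
    return features
```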

SSFT metrics - learning and forgetting times

SRCC of metrics with the SSFT metrics

- Small correlation between the SSFT metrics and both the non-supervised and loss based metrics (SRCC below 0.08 for forgetting time and at most about 0.11 for learning time).
- Again, more complicated metrics like LID and SIL score yield better results.
- Loss based metrics exhibit smaller correlation with the SSFT metrics.
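A minimal sketch of the two SSFT-style times, assuming the usual reading: learning time is the first epoch from which a sample stays correctly classified while training on the first split, and forgetting time is the first epoch from which it stays misclassified while training continues on the second split. The function names and the "n + 1 means never" convention are assumptions.

```python
def learning_time(correct_by_epoch):
    """First epoch (1-based) from which the sample stays correctly classified.

    Returns len(correct_by_epoch) + 1 if the sample is not correct at the end.
    """
    n = len(correct_by_epoch)
    t = n + 1
    # Walk backwards over the trailing run of correct predictions.
    for epoch in range(n, 0, -1):
        if correct_by_epoch[epoch - 1]:
            t = epoch
        else:
            break
    return t


def forgetting_time(correct_by_epoch):
    """First second-split epoch (1-based) from which the sample stays misclassified.

    Returns len(correct_by_epoch) + 1 if the sample is never forgotten.
    """
    return learning_time([not c for c in correct_by_epoch])
```

Each function takes one sample's per-epoch correctness flags, so applying it to every sample gives the per-sample times that the correlations above are computed against.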

Non-supervised metrics - Forgetting time:

Non-supervised metric	SRCC Forgetting time
sil_score__mean_from0	0.0791
sil_score__last	0.0752
LID__mean_from_10	0.0746
LID__std_from_10	0.0667
LID__var_from_10	0.0667

Non-supervised metrics - Learning time:

Non-supervised metric	SRCC Learning time
sil_score__mean_from0	0.1119
sil_score__last	0.1065
LID__mean_from_10	0.0578
sil_score__std_from_10	0.0510
LID__last	0.0501

Loss based metrics:

loss metric	SRCC Forgetting time	SRCC Learning time
loss_last	0.0643	0.0753
loss_mean_from_50	0.0605	0.0734
loss_mean_from_20	0.0594	0.0723
loss_mean_from_0	0.0570	0.0691
loss_diff_last_0	0.0450	0.0456

Reproduction

To train the autoencoder and get bottleneck embeddings with sample-wise loss values, run ssl-ae/ae-reconstruction-L1.ipynb.
To compute the non-supervised and loss based metrics and plot histograms and images from them, run ssl-ae/ae-analysis.ipynb (the full run takes around an hour).
To get correlations with the artificial label and SSFT metrics, run ssl-ae/ae-correlations.ipynb.
