Stratification of multilabel data #698

dokato · 2024-05-14T10:22:33Z

dokato
May 14, 2024
Maintainer

A new multilabel classification task was recently added: #440

Looking at datasets available on HF, they're typically quite large, e.g. >50k examples. What's the best way to train/test split it, or just sample it?

Maybe something from: http://scikit.ml/stratification.html ?

cc @x-tabdeveloping as you added that in

x-tabdeveloping · 2024-05-14T11:06:46Z

x-tabdeveloping
May 14, 2024
Collaborator

I have added some stratified subsampling for multilabel data in this PR: #694. You can have a look at the code there.
But in essence I think you can just use scikit-learn's train_test_split with an encoded labels array as the stratify argument.

Something like this:

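from numpy.typing import ArrayLike
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

X: ArrayLike = [...]
labels: list[list[str]] = [...]
# Binary indicator matrix: one row per example, one column per label
encoded_labels = MultiLabelBinarizer().fit_transform(labels)

# train_test_split returns both splits of X first, then both splits of labels
X_train, X_test, y_train, y_test = train_test_split(X, labels, stratify=encoded_labels)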
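For anyone who wants to try the idea on something concrete, here is a minimal runnable sketch on made-up toy data (the label names, sizes, and `random_state` are assumptions for illustration, not taken from any of our datasets). When `stratify` is a 2D array, scikit-learn treats each distinct row, i.e. each label combination, as one class, so this is stratification on label combinations rather than on individual labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (made up): 40 examples, 4 distinct label sets, 10 examples each.
label_sets = [["news"], ["sports"], ["news", "politics"], ["sports", "politics"]]
labels = [label_sets[i % 4] for i in range(40)]
X = np.arange(40).reshape(-1, 1)  # stand-in features: just the row index

# Binary indicator matrix with one column per label, shape (40, 3).
encoded = MultiLabelBinarizer().fit_transform(labels)

# Stratify on the encoded matrix; each distinct row acts as one class.
X_train, X_test, y_train, y_test = train_test_split(
    X, encoded, test_size=0.25, stratify=encoded, random_state=0
)

# Compare the per-label positive rate in the full data and in each split.
print("full :", encoded.mean(axis=0))
print("train:", y_train.mean(axis=0))
print("test :", y_test.mean(axis=0))
```

On balanced toy data like this the per-label rates in train and test stay close to the full-data rates. Real multilabel datasets usually have many rare combinations, which is where this approach gets shaky.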

6 replies

x-tabdeveloping May 14, 2024
Collaborator

It's not merged yet, so you haven't overlooked anything :D
I'm not sure how well it works, though: a lot of the time it throws a warning and doesn't reduce the dataset size, so we might have to look into it.

dokato May 14, 2024
Maintainer Author

Yeah, on the surface it doesn't seem like a trivial problem, especially when dealing with narrowly represented classes.
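To make the rare-class issue concrete: stratifying on the full label matrix means stratifying on label combinations, and scikit-learn's stratified split raises a ValueError as soon as one combination occurs only once. A small helper for spotting such combinations before splitting could look like the sketch below. The function name `rare_label_combinations` and the toy labels are hypothetical, made up for illustration.

```python
from collections import Counter


def rare_label_combinations(labels, min_count=2):
    """Return the label combinations that occur fewer than `min_count` times.

    Hypothetical helper: `labels` is a list of label lists, one per example.
    """
    # frozenset makes each combination hashable and independent of label order.
    counts = Counter(frozenset(sample) for sample in labels)
    return {combo: n for combo, n in counts.items() if n < min_count}


# Made-up toy labels: the combination {"b", "c"} appears only once.
labels = [["a"], ["a"], ["a", "b"], ["a", "b"], ["b", "c"]]
print(rare_label_combinations(labels))
```

What to do with the examples it flags (drop them, force them into train, or stratify on single labels instead) is a separate design decision that this sketch does not settle.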

dokato May 15, 2024
Maintainer Author

@x-tabdeveloping have a look at a quick and dirty attempt at using iterative_train_test_split from the scikit-multilearn library:
#546
It works quite neatly and fast, though it requires a few workarounds to make it run with our flow, plus we would need to add another dependency for just this one function. LMK what you think...

x-tabdeveloping May 16, 2024
Collaborator

@dokato Can you make this a separate PR so we can discuss it there? I have some thoughts about it.

dokato May 20, 2024
Maintainer Author

Discussion moved to: #760
