I have added stratified subsampling for multilabel data in this PR: #694. You can have a look at the code there. Something like this:
from numpy.typing import ArrayLike
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
X: ArrayLike = [...]
labels: list[list[str]] = [...]
# Binarize the label lists so they can be passed to `stratify`
encoded_labels = MultiLabelBinarizer().fit_transform(labels)
# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, labels, stratify=encoded_labels)
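To make the idea concrete without any dependencies, here is a rough sketch of subsampling a large multilabel dataset while keeping the share of each distinct label combination roughly intact. This is not the code from #694; `stratified_subsample` and its parameters are hypothetical names made up for illustration, and the "keep at least one example per combination" rule is an assumption of this sketch.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples, seed=0):
    """Return sorted indices of roughly n_samples examples, keeping the
    proportion of each distinct label combination approximately intact.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Group example indices by their (order-independent) label combination.
    groups = defaultdict(list)
    for idx, label_set in enumerate(labels):
        groups[frozenset(label_set)].append(idx)
    fraction = n_samples / len(labels)
    chosen = []
    for indices in groups.values():
        # Assumption: keep at least one example per combination so rare ones survive.
        k = min(len(indices), max(1, round(len(indices) * fraction)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy data: 100 examples with three distinct label combinations.
labels = [["a"], ["a", "b"], ["b"], ["a"]] * 25
subset = stratified_subsample(labels, n_samples=20)
```

Because of rounding and the one-per-combination floor, the result can be slightly larger or smaller than `n_samples` when there are many rare combinations.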
Recently, a new multilabel classification task was added: #440
The multilabel datasets available on HF are typically quite large, e.g. >50k examples. What's the best way to train/test split them, or to just subsample them?
Maybe something from http://scikit.ml/stratification.html ?
cc @x-tabdeveloping as you added that task