Hi, I apologise if this is a stupid question, but I am using CRFsuite for IOB labelling, and when I run the same experiment three times under identical conditions, the results are sometimes (not always) very different per run. In some instances, the standard deviation of the F1 scores across the three runs is over 5%.
For each run, I use the exact same training and test set (which are completely separate). I do use cross-validation for hyperparameter optimisation, but I set the random seed there to avoid changes between runs. So basically, I do the following with identical data three times:
from sklearn.model_selection import GridSearchCV, KFold
from sklearn_crfsuite import metrics

# crf, hyperparam_search_space, scorer, nr_folds and the data are defined elsewhere
grid_search = GridSearchCV(crf, hyperparam_search_space, scoring=scorer, verbose=True,
                           cv=KFold(n_splits=nr_folds, random_state=42))
grid_search.fit(x_train, y_train)
optimised_crf = grid_search.best_estimator_
y_pred = optimised_crf.predict(x_test)
final_score = metrics.flat_f1_score(y_test, y_pred, average='macro', labels=["I", "O", "B"])
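
For reference, this is the kind of determinism check I have in mind (a hypothetical sketch, not part of my actual pipeline; it reuses the names from the snippet above, and passes shuffle=True so that random_state actually takes effect in KFold): fit the same search twice on the same data with fixed CV splits and compare what comes out.

import numpy as np
from sklearn.model_selection import GridSearchCV, KFold

# Fit the same grid search twice on the same data with fixed, shuffled CV splits
cv = KFold(n_splits=nr_folds, shuffle=True, random_state=42)
gs_a = GridSearchCV(crf, hyperparam_search_space, scoring=scorer, cv=cv).fit(x_train, y_train)
gs_b = GridSearchCV(crf, hyperparam_search_space, scoring=scorer, cv=cv).fit(x_train, y_train)

# If everything is seeded, both searches should pick the same hyperparameters
# and report the same cross-validation scores
print(gs_a.best_params_ == gs_b.best_params_)
print(np.allclose(gs_a.cv_results_['mean_test_score'], gs_b.cv_results_['mean_test_score']))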
To illustrate, here are the results from three identical runs on identical data:
Example 1:
f1 (micro): 83.2%, 81.6%, 66.2%
f1 (macro): 71.8%, 71.6%, 57.5%
Example 2:
f1 (micro): 81.1%, 77.6%, 66.7%
f1 (macro): 53.5%, 57.3%, 47.1%
The differences are not always this large (and when they are, it is often because one of the runs has a much lower score than the others). Micro F1 scores are also more stable than macro F1 scores (the data is imbalanced; the I label sometimes makes up only about 10% of the tags, for instance).
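To show what I mean about macro F1 being more sensitive when one label is rare, here is a small made-up example (toy numbers, not my actual data): a few extra mistakes on the rare I label barely move micro F1 but pull macro F1 down a lot.

from sklearn.metrics import f1_score

# Toy data: 90% "O" tags, 10% "I" tags (roughly the imbalance described above)
y_true = ['O'] * 90 + ['I'] * 10
y_pred_a = ['O'] * 90 + ['I'] * 7 + ['O'] * 3   # misses 3 of the 10 "I" tags
y_pred_b = ['O'] * 90 + ['I'] * 3 + ['O'] * 7   # misses 7 of the 10 "I" tags

for y_pred in (y_pred_a, y_pred_b):
    print('micro:', f1_score(y_true, y_pred, average='micro'),
          'macro:', f1_score(y_true, y_pred, average='macro'))
# micro drops from 0.97 to 0.93, while macro drops from about 0.90 to about 0.71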
So my questions are:
Why are the differences sometimes this large, when the exact same data is used, with the same shuffle for hyperparameter optimisation?
Which random seeds need to be set to stabilise these results?
Thank you!