
Why are my results so different on identical runs? #118

Open
AylaRT opened this issue Jan 13, 2021 · 0 comments

AylaRT commented Jan 13, 2021

Hi, I apologise if this is a stupid question. I am using CRFsuite for IOB labelling, and when I run the exact same experiment in 3 trials, the results are sometimes (not always) very different per run. In some instances, the standard deviation of the f1-scores over the three runs is over 5%.

For each run, I am using the exact same training and test set (which are completely separate). I do use cross-validation for hyperparameter optimisation, but I set the random_state of the KFold splitter there to avoid changes between runs. So basically, I do the following with identical data 3 times:

from sklearn.model_selection import GridSearchCV, KFold
from sklearn_crfsuite import metrics

# random_state only takes effect when shuffle=True; without shuffling the splits are deterministic anyway
cv = KFold(n_splits=nr_folds, shuffle=True, random_state=42)
grid_search = GridSearchCV(crf, hyperparam_search_space, scoring=scorer, verbose=True, cv=cv)
grid_search.fit(x_train, y_train)
optimised_crf = grid_search.best_estimator_
y_pred = optimised_crf.predict(x_test)
final_score = metrics.flat_f1_score(y_test, y_pred, average='macro', labels=["I", "O", "B"])

To illustrate, here are the results from 3 identical runs on identical data:
Example 1:
f1 (micro): 83.2%, 81.6%, 66.2%
f1 (macro): 71.8%, 71.6%, 57.5%

Example 2:
f1 (micro): 81.1%, 77.6%, 66.7%
f1 (macro): 53.5%, 57.3%, 47.1%

The differences are not always this large (and when they are, it is often because one of the runs has a much lower score). Micro f1 scores are also more stable than macro f1 scores (the data is imbalanced; sometimes only about 10% of the labels are "I", for instance).
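To give a sense of why the macro score swings so much more than the micro score on imbalanced IOB data, here is a small toy illustration (the label counts are invented for illustration, not taken from my data):

from sklearn.metrics import f1_score

# Toy tag sequence with roughly 10% "I" labels (invented counts).
y_true = ["O"] * 80 + ["B"] * 10 + ["I"] * 10
# Run A gets 1 "I" token wrong, run B gets 5 "I" tokens wrong.
y_pred_a = ["O"] * 80 + ["B"] * 10 + ["I"] * 9 + ["O"]
y_pred_b = ["O"] * 80 + ["B"] * 10 + ["I"] * 5 + ["O"] * 5

for name, y_pred in [("run A", y_pred_a), ("run B", y_pred_b)]:
    micro = f1_score(y_true, y_pred, average="micro", labels=["I", "O", "B"])
    macro = f1_score(y_true, y_pred, average="macro", labels=["I", "O", "B"])
    print(f"{name}: micro={micro:.3f}, macro={macro:.3f}")

A handful of extra errors on the rare "I" class barely moves the micro f1, but it pulls the macro f1 down by several points, which looks a lot like the gaps between my runs.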

So my questions are:

  • why are the differences sometimes this large, when the exact same data is used, with the same shuffle for hyperparameter optimisation?
  • which random seeds need to be set to stabilise these results? (see the seeding sketch after this list)
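To make the second question concrete, this is the kind of seeding I have been trying (reusing crf, hyperparam_search_space, scorer and nr_folds from the snippet above). I am assuming here that the Python/NumPy RNGs and the KFold splitter are the only sources of randomness in the pipeline, which may well be wrong:

import random
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold

SEED = 42
random.seed(SEED)      # Python's built-in RNG
np.random.seed(SEED)   # NumPy's global RNG, which scikit-learn falls back on when random_state is None

# Pin the CV splits explicitly so every run sees identical folds.
cv = KFold(n_splits=nr_folds, shuffle=True, random_state=SEED)
grid_search = GridSearchCV(crf, hyperparam_search_space, scoring=scorer, verbose=True, cv=cv)

Is this the right set of seeds, or does the CRF training itself need a seed somewhere?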

thank you!
