Poor performance when running splitter model #27
So I think this is actually part of a larger issue with how the data are formatted. In the most recent versions, since the implementation of the multitask model, I fixed an issue in the data prep step so that all the training data have the same length, both the Rodrigues data and Wellcome's own data, and this seems to have caused a dip in performance. It's very curious, because the best performing multitask model (2020.3.18) was built on Reach data with a sequence length of 250 tokens, but Rodrigues data that was loaded as one continuous list of tokens. I think this means that only one Rodrigues example gets loaded, because the string of tokens is truncated to 250 by the tokenizer. I retrained the multitask model with the data created for the 2020.3.18 model, flawed as it may be, and it gives a big jump in model performance. @lizgzil shall we use this model as the default in the meantime, and get to the bottom of the sequence length issues at a later date?
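The truncation problem described above can be sketched as follows. This is a hypothetical illustration (the function name and token values are made up, not taken from the deep_reference_parser code): if a continuous token list is simply truncated to the sequence length, only one training example survives, whereas chunking the same list preserves all of the data.

```python
def to_examples(tokens, max_len=250):
    """Compare truncating vs. chunking a continuous token list
    into fixed-length training examples (illustrative only)."""
    # Truncation: everything beyond max_len is silently dropped,
    # so the whole dataset collapses into a single example.
    truncated = [tokens[:max_len]]
    # Chunking: the same tokens yield one example per max_len window.
    chunked = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    return truncated, chunked

tokens = [f"tok{i}" for i in range(1000)]
truncated, chunked = to_examples(tokens)
# truncated has 1 example; chunked has 4 examples of 250 tokens each.
```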
Yeah, I think that makes sense. Does it also kind of make sense that adding lots of Rodrigues data to the training data dips model performance, because it was created for a different task?
@ivyleavedtoadflax I see this is the latest deep reference parser wheel in S3:
I did wonder... but then when I included no Rodrigues data, it performed worse. I think it probably needs a bit of experimentation.
I usually create a Makefile recipe like:
There may already be one in … I think when you integrate with Reach you can probably just point directly at GitHub, not … So typically what I did in the past was to create a release, then make a wheel locally with …
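For illustration, a minimal recipe along those lines might look like the sketch below. The target names, bucket path, and use of `setup.py bdist_wheel` are assumptions, not taken from this repo's actual Makefile:

```makefile
# Build a wheel locally (requires the `wheel` package).
.PHONY: wheel
wheel:
	python setup.py bdist_wheel

# Hypothetical deploy target; the S3 bucket is a placeholder.
.PHONY: deploy
deploy: wheel
	aws s3 cp dist/*.whl s3://<your-bucket>/
```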
This might be a good way to automatically release and add attributes btw: https://github.com/marketplace/actions/automatic-releases
Ooh nice!
Note that this issue occurs only in #25, not the master branch. The splitting model implemented in 3e48684 performs badly, e.g.:
It was expected that this model would perform less well than the model implemented in 2020.3.1; however, it seems to be worse than expected.
The new model (`2020.3.6`) is required to ensure compatibility with the changes implemented in #25. Changes to the Rodrigues data format mean that this model trains in less than one hour, instead of around 16 hours. Some experimentation with hyper-parameters is probably all that is needed to bring this model up to scratch, and in any case it is largely superseded by the multitask `split_parse` model. If a high-quality splitting model is required immediately, revert to an earlier pre-release version for now, all of which perform very well.
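Reverting to an earlier pre-release would presumably mean installing the corresponding wheel directly; something like the command below, where the path and filename are placeholders (the actual wheel lives in S3, per the comment above):

```shell
pip install path/to/deep_reference_parser-<version>-py3-none-any.whl
```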