
Integrated working PyTorch-CRF in MM #413

Merged
vrdn-23 merged 22 commits into master from vidamoda/feature/pytorch_crf on Jul 12, 2022
Conversation

vrdn-23
Contributor

@vrdn-23 vrdn-23 commented Apr 29, 2022

Folks,

This is the first draft PR that I have ready for moving from the old rusty sklearn-crfsuite to the new (hopefully better) pytorch-CRF. A couple of things to note here:

  • I haven't removed the original CRF code from the Mindmeld repo just yet; I still want to make sure we don't have major accuracy drops when it comes to the ANLP repo. My individual experiments on the ANLP repo suggest that we should be fine, but there could always be pitfalls I didn't account for while integrating, and I want to be sure before we move on to something new. For now, the way this is structured is that we can switch between the CRF models based on "crf" and "torch-crf" (see the config sketch after this list). Ideally I would either remove the old CRF completely by the time this PR is merged, or mark it as deprecated for a couple of RCs and refactor it out once we move to the next major release.
  • I still need to add some unit tests to this, although I'm not sure which unit tests would make sense for a PyTorch model.
  • The design for this isn't perfect, so I would appreciate any suggestions or modifications toward a cleaner codebase. I'm trying to stick as closely as possible to the crfsuite implementation we already have, so that when we do eventually switch models it doesn't involve a lot of refactoring and most things are taken care of by the new PyTorch class. If this is the wrong approach, please let me know!
  • The compute_marginals function, which is required for predict_proba in the CRF, has been borrowed from a PR raised against the original pytorch-crf repo. For reasons unknown it never got merged into the master branch, but I've gone through the implementation and it looks solid, and the original author also seemed to approve of it (a sketch of the underlying forward-backward computation is at the end of this comment).
  • For the time being, I've decided not to provide GPU support for the CRF. The reasons for this are: 1) my experiments have shown that a smaller batch size provides a much faster convergence rate for the CRF, and in ANLP specifically I've noticed that moving the tensors to GPU with such a small batch size takes longer than training on CPU; 2) the implementation seems fast enough on CPU because we're dealing with sparse tensors, and it doesn't seem to take up too much memory. I don't have the numbers for this specifically, but I'll see if I can collect them during my integration with ANLP and provide a more quantitative analysis.
  • The torch versions we are dealing with are not the ones I've been using locally, but I don't want to make a scene, so I'm going to stick with the old torch versions we're currently relying on. Hopefully we will have another PR soon to ensure that ALL the libraries we use in Mindmeld are on their latest possible stable versions!
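To make the switch concrete, here is a hypothetical config.py sketch following MindMeld's usual ENTITY_RECOGNIZER_CONFIG shape; the 'params' shown ('feat_type', 'patience') are illustrative placeholders, not the final API:

```python
# Hypothetical sketch: flipping the entity recognizer between the
# sklearn-crfsuite backend ("crf") and the new PyTorch one ("torch-crf").
ENTITY_RECOGNIZER_CONFIG = {
    'model_type': 'tagger',
    'model_settings': {
        'classifier_type': 'torch-crf',  # set to 'crf' for the old sklearn-crfsuite model
    },
    'params': {
        'feat_type': 'hash',  # placeholder: 'dict' or 'hash' feature extraction
        'patience': 5,        # placeholder: early-stopping patience
    },
}
```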

If anyone feels that we should still provide GPU support, please let me know and I'd be happy to add it back.

Please feel free to comment or let me know if you have any questions regarding the PyTorch code, or if I've missed anything essential with regards to integrating this new monstrosity!
Thanks for all the help in advance! Looking forward to meeting everyone at a newer scikit-learn version soon! :)
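P.S. Since the borrowed compute_marginals code came up above, here is a minimal, self-contained sketch of forward-backward marginals for a single linear-chain sequence. Shapes and names are illustrative and do not mirror pytorch-crf's batched internals; treat it as a reference for the math rather than the PR's actual code:

```python
import torch

def crf_marginals(emissions, transitions, start, end):
    # emissions: (seq_len, num_tags) per-token tag scores
    # transitions: (num_tags, num_tags) tag-to-tag scores
    # start, end: (num_tags,) start/end transition scores
    seq_len, num_tags = emissions.shape

    # Forward pass: alpha[i, t] = log-score of all prefixes ending in tag t at position i.
    alpha = torch.empty(seq_len, num_tags)
    alpha[0] = start + emissions[0]
    for i in range(1, seq_len):
        alpha[i] = torch.logsumexp(alpha[i - 1].unsqueeze(1) + transitions, dim=0) + emissions[i]

    # Backward pass: beta[i, t] = log-score of all suffixes continuing from tag t at position i.
    beta = torch.empty(seq_len, num_tags)
    beta[-1] = end
    for i in range(seq_len - 2, -1, -1):
        beta[i] = torch.logsumexp(transitions + (emissions[i + 1] + beta[i + 1]).unsqueeze(0), dim=1)

    log_z = torch.logsumexp(alpha[-1] + end, dim=0)  # log partition function
    return torch.exp(alpha + beta - log_z)           # (seq_len, num_tags); each row sums to 1
```

Each row of the result is a per-token distribution over tags, which is what predict_proba ultimately exposes.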

@snow0x2d0
Contributor

EEEEEEeeeeeEEEEE

@mattfrey-cisco
Contributor

@vrdn-23 - Have you profiled memory usage of this (especially for the feature extraction)? This may not be important to the build anymore, but it would be good to know.

@vrdn-23
Contributor Author

vrdn-23 commented May 10, 2022

Folks,
Just a small update since this hasn't seen any activity from my side. I ran some experiments last week to determine the runtime and memory consumption of the new CRF module on my local system, but the results have been inconsistent depending on what else I was running in the background. So I moved the experiments to an online server this week, and they are running as we speak.

I have enough evidence from when I first ran the experiments to suggest that the accuracy is not too bad, and I'm convinced this PR is ready to graduate from its "Draft" status. It would be great if everyone could let me know what looks good and what needs a complete overhaul. I will be giving the PR my full attention starting this Monday (05/16) and hope to have the time/memory/accuracy analysis done and posted on the PR by then.

I hope to have all outstanding comments resolved by Monday EOD, and I'm looking forward to addressing new ones!
TIA!

@vrdn-23 vrdn-23 marked this pull request as ready for review May 10, 2022 22:15
@vrdn-23
Contributor Author

vrdn-23 commented May 19, 2022

I don't have a better way of formatting this as a markdown table, so I'm going to leave it as a picture, but these are the main things worth knowing about this data:

  • These numbers are the averages across 15 runs.
  • The configurations tried here are the original CRF (sklearn-crfsuite); the torch-crf using the "dict" (dictionary) feature extractor; the torch-crf using the "hash" (feature hasher) extractor with a patience of 3; and the torch-crf "hash" with a patience of 5 (see the vectorizer sketch after this list). All of them use a batch size of 10, and other defaults are kept unless mentioned otherwise.
  • For the time column, the times shown are the total time taken to train and predict on the entire train and test dataset respectively.
  • These are the only 3 intents in ANLP that use the CRF, so they were the only ones I ran these numbers on.
  • The memory and time measurements are based on the memory usage and time taken to run the "er.fit()" and "er.predict()" lines for each intent-specific entity recognizer.
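To make the "dict" vs "hash" distinction concrete, here is a small scikit-learn analogy (the PR's actual wiring may differ): DictVectorizer grows one column per observed feature, while FeatureHasher buckets everything into a fixed width, which is what keeps the feature matrix small on large vocabularies.

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

# Two token-level feature dicts, roughly as a tagger might emit them.
feats = [{'word': 'call', 'prev': '<s>'}, {'word': 'bob', 'prev': 'call'}]

dict_matrix = DictVectorizer().fit_transform(feats)               # one column per feature seen
hash_matrix = FeatureHasher(n_features=2 ** 10).transform(feats)  # fixed 1024 columns

print(dict_matrix.shape)  # (2, 4): grows with the vocabulary
print(hash_matrix.shape)  # (2, 1024): capped regardless of vocabulary size
```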

The configuration I would currently recommend for use in ANLP is the "hash" configuration with patience 5. The rest of the experimental settings are included to give an overall picture of how the torch-crf performs.
Some key takeaways that can be made from these values are:

  • We do see an increase in memory used by the torch-crf. It is highest with the "dict" feature extractor, which is expected since the dictionary extractor keeps every observed feature while the feature hasher reduces the number of sparse features. There is a 50-100 MiB increase in training memory in the other configurations; if that needs to be addressed, I can look into reducing the memory footprint of the code further. As expected, increasing the patience does not increase memory, so choosing the patience-5 configuration does not hurt us here.
  • The training times are not as bad as I had initially anticipated. Apart from the "dict" configuration, we are at worst 3-4 minutes slower than the original CRF. Personally I don't see how we can speed this up. Reducing the patience obviously improves training time, but I think that comes at the cost of final accuracy, which is why I prefer a longer patience over a shorter one (see the early-stopping sketch after this list).
  • One good thing I noticed is that for bigger datasets (especially call_person), inference happens much faster than with the original CRF. The improvement is less obvious on smaller datasets, probably because the overhead of initializing a PyTorch data loader from scratch each time dominates there.
  • The accuracies are not bad relative to the original model. At most we see a 1-2% drop across configs, but the primary reason is that we are sacrificing 15% of our training data for validation; my initial experiments showed that we could match the original CRF's accuracies if allowed to train on the full ANLP data. The best way to fix this would be to invest time into a dev set that we can rely on for ANLP.
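For reference, the patience knob above is plain early stopping on the held-out split. A hedged sketch, where train_one_epoch and dev_loss are hypothetical placeholders rather than this PR's actual functions:

```python
def train_with_patience(model, train_one_epoch, dev_loss, patience=5, max_epochs=100):
    # train_one_epoch / dev_loss are caller-supplied placeholders: one runs a
    # single training epoch, the other scores the model on the validation split.
    best_loss, stale_epochs = float('inf'), 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        loss = dev_loss(model)
        if loss < best_loss:
            best_loss, stale_epochs = loss, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # no dev improvement for `patience` consecutive epochs
    return model
```

A larger patience just tolerates more stale epochs before stopping, which is why it trades training time for a better final model.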

I realize that this is a lot to take in, but if you have any questions or comments, please feel free to reach out to me 1-on-1 or drop a comment here! I'm hoping that with enough review comments we can make this code a little cleaner/faster/lighter, or all of the above!

@vrdn-23 vrdn-23 requested a review from vembar May 19, 2022 22:59
@vrdn-23 vrdn-23 requested a review from tmehlinger May 20, 2022 22:38
@mattfrey-cisco
Contributor

@vrdn-23 - I don't think the memory usage matters too much anymore; @tmehlinger can weigh in on that. When we were building for SPLAT, we had to use the env var documented here: https://www.mindmeld.com/docs/userguide/getting_started.html#mm-crf-features-in-memory

If you rerun the tests with that set to '0', the original CRF training will use significantly less memory. I'm fine with removing that option, but we need to verify our build pipelines can handle the memory usage, and we should remove that env var from the documentation.
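For anyone rerunning those tests, setting it before training should suffice (variable name taken from the doc anchor above; per the comment, '0' makes the original CRF training use significantly less memory):

```python
import os

os.environ['MM_CRF_FEATURES_IN_MEMORY'] = '0'  # '0' = don't hold CRF features in memory
```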

@vrdn-23
Contributor Author

vrdn-23 commented Jun 17, 2022

The current test failures are related to installing a package called SudachiPy for the Japanese spaCy tokenizer. I'm looking into it, and also into a potential bug related to incremental loading.

@vrdn-23
Contributor Author

vrdn-23 commented Jun 29, 2022

Created #424 to address the incremental build failure.

@vrdn-23 vrdn-23 requested a review from snow0x2d0 June 29, 2022 08:38
@vrdn-23 vrdn-23 merged commit 238412f into master Jul 12, 2022
@vrdn-23 vrdn-23 deleted the vidamoda/feature/pytorch_crf branch August 12, 2022 22:23