Release 0.3.2
This is an update to release 0.3.1 with critical bug fixes, a few new features, and many more pre-packaged embeddings.
New Features
Embeddings
More word embeddings (#194 )
We added FastText embeddings for 10 languages ('en', 'de', 'fr', 'pl', 'it', 'es', 'pt', 'nl', 'ar', 'sv'). Load them using the two-letter language code, like this:
```python
french_embedding = WordEmbeddings('fr')
```
More character LM embeddings (#204 #187 )
Thanks to a contribution by @stefan-it, we added CharLMEmbeddings for Bulgarian and Slovenian. Load them like this:
```python
flm_embeddings = CharLMEmbeddings('slovenian-forward')
blm_embeddings = CharLMEmbeddings('slovenian-backward')
```
Custom embeddings (#170 )
Added an explanation of how to use your own custom word embeddings. Simply convert them to gensim.KeyedVectors format and point the embedding class to the file:
```python
custom_embedding = WordEmbeddings('path/to/your/custom/embeddings.gensim')
```
New embedding type: DocumentPoolEmbeddings (#191)
Added a new embedding class for document-level embeddings. You can now choose between different pooling operations, e.g. min, max, and mean. Create the new embeddings like this:
```python
word_embeddings = WordEmbeddings('glove')
pool_embeddings = DocumentPoolEmbeddings([word_embeddings], mode='min')
```
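Conceptually, document pooling aggregates the word-level vectors element-wise into a single document vector. As an illustrative sketch with numpy (not the actual flair implementation), the three pooling modes look like this:

```python
import numpy as np

def pool_document(word_vectors, mode='min'):
    """Aggregate word-level embeddings into one document vector.

    Illustrative sketch of what DocumentPoolEmbeddings computes;
    not the actual flair code.
    """
    stacked = np.stack(word_vectors)  # shape: (num_words, embedding_dim)
    if mode == 'min':
        return stacked.min(axis=0)
    if mode == 'max':
        return stacked.max(axis=0)
    if mode == 'mean':
        return stacked.mean(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

vectors = [np.array([1.0, 4.0]), np.array([3.0, 2.0])]
print(pool_document(vectors, mode='min'))  # element-wise minimum: [1. 2.]
```

Whatever the mode, the resulting document vector has the same dimensionality as the underlying word embeddings.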
Language model
New method: generate_text() (#167)
The LanguageModel class now has a built-in generate_text() method to sample text from the LM. Run code like this:
```python
# load your language model
model = LanguageModel.load_language_model('path/to/your/lm')

# generate 20,000 characters
text = model.generate_text(20000)
print(text)
```
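Under the hood, text generation of this kind samples one character at a time from the model's output distribution. A minimal, self-contained sketch of that sampling step (the logits and character set below are hypothetical toy values, not flair's code):

```python
import math
import random

def sample_char(logits, chars, temperature=1.0, rng=random):
    """Sample one character from softmax(logits / temperature).

    Toy sketch of the per-step sampling inside an LM's text generator.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF sampling over the character distribution
    r = rng.random()
    cumulative = 0.0
    for char, p in zip(chars, probs):
        cumulative += p
        if r <= cumulative:
            return char
    return chars[-1]

rng = random.Random(0)
text = ''.join(sample_char([2.0, 1.0, 0.1], 'abc', rng=rng) for _ in range(10))
print(text)  # a 10-character string drawn from {'a', 'b', 'c'}
```

Lowering the temperature sharpens the distribution toward the most likely character; raising it produces more varied output.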
Metrics
Class-based metrics in Metric class (#164)
Refactored the Metric class to provide per-class metrics as well as micro- and macro-averaged F1 scores.
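The difference between the two averages is easy to see in code. This sketch computes both from per-class counts (illustrative only; it is not the flair Metric API):

```python
from collections import namedtuple

Counts = namedtuple('Counts', 'tp fp fn')

def f1(tp, fp, fn):
    """F1 score from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def micro_macro_f1(per_class):
    """per_class maps class name -> Counts. Returns (micro_f1, macro_f1)."""
    # micro: pool the counts over all classes, then compute one F1
    micro = f1(sum(c.tp for c in per_class.values()),
               sum(c.fp for c in per_class.values()),
               sum(c.fn for c in per_class.values()))
    # macro: compute F1 per class, then average the scores
    macro = sum(f1(c.tp, c.fp, c.fn) for c in per_class.values()) / len(per_class)
    return micro, macro

counts = {'PER': Counts(tp=8, fp=2, fn=0), 'LOC': Counts(tp=1, fp=0, fn=9)}
print(micro_macro_f1(counts))
```

Micro averaging weights each instance equally, so frequent classes dominate; macro averaging weights each class equally, which exposes poor performance on rare classes.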
Bug Fixes
Fix serialization error on macOS and Windows (#174)
On these systems, we got errors when serializing or loading large models. We've put in place a workaround that limits model size so serialization works there. An added bonus is that models are now smaller.
"Frozen" dropout (#184 )
Fixed a potentially serious issue in which dropout was frozen after the first epoch for embeddings produced from the character LM, meaning that the same dimensions stayed dropped throughout training.
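For intuition, correct dropout behavior means a fresh mask is sampled on every forward pass in training mode. A small PyTorch illustration of that expected behavior (not the flair patch itself):

```python
import torch

torch.manual_seed(0)
dropout = torch.nn.Dropout(p=0.5)
dropout.train()  # training mode: a new mask is sampled on every call

x = torch.ones(1, 1000)
mask_a = dropout(x) != 0
mask_b = dropout(x) != 0
# with a "frozen" mask, mask_a and mask_b would be identical on every pass;
# correct behavior is that consecutive passes drop different dimensions
print(torch.equal(mask_a, mask_b))
```

In eval mode, dropout is disabled entirely and the input passes through unchanged.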
Testing step in language model trainer (#178 )
Previously, the language model was never evaluated on test data during training. A final testing step has been added back in.
Testing
Distinguish between unit and integration tests (#183)
Instructions on how to run tests with pipenv (#161 )
Optimizations
Disable autograd during testing and prediction (#175)
Since gradients are not needed during testing and prediction, disabling autograd gives a minor speedup.
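In PyTorch terms, this means wrapping the forward pass in torch.no_grad(), which stops autograd from recording operations or allocating gradient buffers. A minimal sketch with a stand-in model:

```python
import torch

# stand-in model for illustration; any nn.Module works the same way
model = torch.nn.Linear(4, 2)

# during evaluation or prediction, disable autograd around the forward pass
with torch.no_grad():
    out = model(torch.randn(1, 4))

print(out.requires_grad)  # False: no computation graph was built
```

Outside the context manager, the same forward pass would build a graph and the output would have requires_grad=True.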