
Release 0.3.2

@alanakbik released this 12 Nov 17:01

This is an update over release 0.3.1 with some critical bug fixes, a few new features and a lot more pre-packaged embeddings.

New Features

Embeddings

More word embeddings (#194)

We added FastText embeddings for 10 languages ('en', 'de', 'fr', 'pl', 'it', 'es', 'pt', 'nl', 'ar', 'sv'). Load them using the two-letter language code, like this:

from flair.embeddings import WordEmbeddings

french_embedding = WordEmbeddings('fr')
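
As a rough usage illustration (the example sentence is made up, not part of the release), the loaded embeddings can then be applied to a sentence like this:

from flair.data import Sentence

# embed an example sentence and inspect the resulting token vectors
sentence = Sentence('Bonjour le monde')
french_embedding.embed(sentence)
for token in sentence:
    print(token, token.get_embedding().size())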

More character LM embeddings (#204, #187)

Thanks to a contribution by @stefan-it, we added CharLMEmbeddings for Bulgarian and Slovenian. Load them like this:

from flair.embeddings import CharLMEmbeddings

flm_embeddings = CharLMEmbeddings('slovenian-forward')
blm_embeddings = CharLMEmbeddings('slovenian-backward')
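
Character LM embeddings are usually combined with other embeddings; one possible way to do this, sketched with the existing StackedEmbeddings class, looks like this:

from flair.embeddings import StackedEmbeddings

# combine the forward and backward LM embeddings into a single representation
stacked_embeddings = StackedEmbeddings([flm_embeddings, blm_embeddings])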

Custom embeddings (#170)

We added an explanation of how to use your own custom word embeddings. Simply convert them to gensim.KeyedVectors format and point the embedding class to the resulting file:

from flair.embeddings import WordEmbeddings

custom_embedding = WordEmbeddings('path/to/your/custom/embeddings.gensim')
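
If your vectors are not yet in this format, the conversion step could look roughly like the following sketch (file names are placeholders; adapt them to your setup):

from gensim.models import KeyedVectors

# load vectors in word2vec text format (placeholder file name)
vectors = KeyedVectors.load_word2vec_format('custom_vectors.txt', binary=False)

# save in gensim's native format so WordEmbeddings can load the file
vectors.save('embeddings.gensim')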

New embeddings type: DocumentPoolEmbeddings (#191)

We added a new embedding class for document-level embeddings. You can now choose between different pooling options, e.g. min, max and average. Create the new embeddings like this:

from flair.embeddings import DocumentPoolEmbeddings, WordEmbeddings

word_embeddings = WordEmbeddings('glove')
pool_embeddings = DocumentPoolEmbeddings([word_embeddings], mode='min')
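
Continuing the snippet above, a brief usage sketch (the example sentence is made up):

from flair.data import Sentence

# embed a sentence and retrieve its pooled document vector
sentence = Sentence('The grass is green .')
pool_embeddings.embed(sentence)
print(sentence.get_embedding())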

Language model

New method: generate_text() (#167)

The LanguageModel class now has a built-in generate_text() method to sample text from the LM. Run code like this:

from flair.models import LanguageModel

# load your language model
model = LanguageModel.load_language_model('path/to/your/lm')

# generate 20,000 characters
text = model.generate_text(20000)
print(text)

Metrics

Class-based metrics in Metric class (#164)

Refactored the Metric class to provide per-class metrics, as well as micro- and macro-averaged F1 scores.

Bug Fixes

Fix serialization error on macOS and Windows (#174)

On these platforms, we got errors when serializing or loading large models. We put a workaround in place that limits model size so that serialization works on those systems. As an added bonus, models are now smaller.

"Frozen" dropout (#184 )

Fixed a potentially serious issue in which dropout was frozen in the first epoch for embeddings produced by the character LM, meaning that the same dimensions stayed dropped throughout training.

Testing step in language model trainer (#178)

Previously, the language model was never applied to the test data during training. A final testing step has been added back in.

Testing

Distinguish between unit and integration tests (#183)

Instructions on how to run tests with pipenv (#161)

Optimizations

Disable autograd during testing and prediction (#175)

Since autograd is not needed here, this gives us minor speedups.
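
To illustrate the general idea, here is a minimal PyTorch sketch (the model and inputs are stand-ins, not Flair API):

import torch

# gradients are not needed for prediction or evaluation,
# so disabling autograd saves memory and a little time
model = torch.nn.Linear(10, 2)   # stand-in for any trained model
inputs = torch.randn(4, 10)      # stand-in for a batch of inputs

with torch.no_grad():
    predictions = model(inputs)

print(predictions.requires_grad)  # False: no computation graph is built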