Release 0.3.2
This is an update to release 0.3.1 with critical bug fixes, a few new features, and many more pre-packaged embeddings.
New Features
Embeddings
More word embeddings (#194 )
We added FastText embeddings for 10 languages ('en', 'de', 'fr', 'pl', 'it', 'es', 'pt', 'nl', 'ar', 'sv'). Load them using the two-letter language code, like this:
```python
french_embedding = WordEmbeddings('fr')
```
More character LM embeddings (#204 #187 )
Thanks to a contribution by @stefan-it, we added CharLMEmbeddings for Bulgarian and Slovenian. Load them like this:
```python
flm_embeddings = CharLMEmbeddings('slovenian-forward')
blm_embeddings = CharLMEmbeddings('slovenian-backward')
```
Custom embeddings (#170 )
Added an explanation of how to use your own custom word embeddings. Simply convert them to gensim.KeyedVectors format and point the embedding class to the file:
```python
custom_embedding = WordEmbeddings('path/to/your/custom/embeddings.gensim')
```
New embedding type: DocumentPoolEmbeddings (#191)
Added a new embedding class for document-level embeddings. You can now choose between different pooling operations, e.g. min, max, and mean. Create the new embeddings like this:
```python
word_embeddings = WordEmbeddings('glove')
pool_embeddings = DocumentPoolEmbeddings([word_embeddings], mode='min')
```
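Conceptually, document pooling aggregates the word-level vectors element-wise into a single document vector. As an illustrative sketch with numpy (not the actual flair implementation), the three pooling modes look like this:

```python
import numpy as np

def pool_document(word_vectors, mode='min'):
    """Aggregate word-level embeddings into one document vector.

    Illustrative sketch of what DocumentPoolEmbeddings computes;
    not the actual flair code.
    """
    stacked = np.stack(word_vectors)  # shape: (num_words, embedding_dim)
    if mode == 'min':
        return stacked.min(axis=0)
    if mode == 'max':
        return stacked.max(axis=0)
    if mode == 'mean':
        return stacked.mean(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

vectors = [np.array([1.0, 4.0]), np.array([3.0, 2.0])]
print(pool_document(vectors, mode='min'))  # element-wise minimum: [1. 2.]
```

Whatever the mode, the resulting document vector has the same dimensionality as the underlying word embeddings.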
Language model
New method: generate_text() (#167)
The LanguageModel class now has a built-in generate_text() method to sample text from the LM. Run code like this:
```python
# load your language model
model = LanguageModel.load_language_model('path/to/your/lm')

# generate 20,000 characters
text = model.generate_text(20000)
print(text)
```
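Under the hood, text generation of this kind samples one character at a time from the model's output distribution. A minimal, self-contained sketch of that sampling step (the logits and character set below are hypothetical toy values, not flair's code):

```python
import math
import random

def sample_char(logits, chars, temperature=1.0, rng=random):
    """Sample one character from softmax(logits / temperature).

    Toy sketch of the per-step sampling inside an LM's text generator.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF sampling over the character distribution
    r = rng.random()
    cumulative = 0.0
    for char, p in zip(chars, probs):
        cumulative += p
        if r <= cumulative:
            return char
    return chars[-1]

rng = random.Random(0)
text = ''.join(sample_char([2.0, 1.0, 0.1], 'abc', rng=rng) for _ in range(10))
print(text)  # a 10-character string drawn from {'a', 'b', 'c'}
```

Lowering the temperature sharpens the distribution toward the most likely character; raising it produces more varied output.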
Metrics
Class-based metrics in Metric class (#164)
Refactored the Metric class to provide per-class metrics as well as micro- and macro-averaged F1 scores.
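The difference between the two averages is easy to see in code. This sketch computes both from per-class counts (illustrative only; it is not the flair Metric API):

```python
from collections import namedtuple

Counts = namedtuple('Counts', 'tp fp fn')

def f1(tp, fp, fn):
    """F1 score from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def micro_macro_f1(per_class):
    """per_class maps class name -> Counts. Returns (micro_f1, macro_f1)."""
    # micro: pool the counts over all classes, then compute one F1
    micro = f1(sum(c.tp for c in per_class.values()),
               sum(c.fp for c in per_class.values()),
               sum(c.fn for c in per_class.values()))
    # macro: compute F1 per class, then average the scores
    macro = sum(f1(c.tp, c.fp, c.fn) for c in per_class.values()) / len(per_class)
    return micro, macro

counts = {'PER': Counts(tp=8, fp=2, fn=0), 'LOC': Counts(tp=1, fp=0, fn=9)}
print(micro_macro_f1(counts))
```

Micro averaging weights each instance equally, so frequent classes dominate; macro averaging weights each class equally, which exposes poor performance on rare classes.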
Bug Fixes
Fix serialization error on macOS and Windows (#174)
On these systems, we got errors when serializing or loading large models. We've put in place a workaround that limits model size so serialization works there. An added bonus is that models are now smaller.
"Frozen" dropout (#184 )
Fixed a potentially serious issue in which dropout was frozen after the first epoch for embeddings produced from the character LM, meaning that the same dimensions stayed dropped throughout training.
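For intuition, correct dropout behavior means a fresh mask is sampled on every forward pass in training mode. A small PyTorch illustration of that expected behavior (not the flair patch itself):

```python
import torch

torch.manual_seed(0)
dropout = torch.nn.Dropout(p=0.5)
dropout.train()  # training mode: a new mask is sampled on every call

x = torch.ones(1, 1000)
mask_a = dropout(x) != 0
mask_b = dropout(x) != 0
# with a "frozen" mask, mask_a and mask_b would be identical on every pass;
# correct behavior is that consecutive passes drop different dimensions
print(torch.equal(mask_a, mask_b))
```

In eval mode, dropout is disabled entirely and the input passes through unchanged.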
Testing step in language model trainer (#178 )
Previously, the language model was never evaluated on test data during training. A final testing step has been added back in.
Testing
Distinguish between unit and integration tests (#183)
Instructions on how to run tests with pipenv (#161 )
Optimizations
Disable autograd during testing and prediction (#175)
Since gradients are not needed during testing and prediction, disabling autograd gives a minor speedup.
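In PyTorch terms, this means wrapping the forward pass in torch.no_grad(), which stops autograd from recording operations or allocating gradient buffers. A minimal sketch with a stand-in model:

```python
import torch

# stand-in model for illustration; any nn.Module works the same way
model = torch.nn.Linear(4, 2)

# during evaluation or prediction, disable autograd around the forward pass
with torch.no_grad():
    out = model(torch.randn(1, 4))

print(out.requires_grad)  # False: no computation graph was built
```

Outside the context manager, the same forward pass would build a graph and the output would have requires_grad=True.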