
Release 0.4.1

Released by @alanakbik on 22 Feb 13:36

Release 0.4.1 brings lots of new features: new embeddings (RNN, Transformer and byte pair embeddings), new languages (Japanese, Spanish, Basque), new datasets, bug fixes, and speed improvements (2x training speed for language models).

New Embeddings

Biomedical Embeddings

Added the first embeddings trained over PubMed data: Flair embeddings (forward and backward) and ELMo embeddings.

Load them, for instance, like this:

from flair.embeddings import FlairEmbeddings, ELMoEmbeddings

# Flair embeddings trained on PubMed
flair_embedding_forward = FlairEmbeddings('pubmed-forward')
flair_embedding_backward = FlairEmbeddings('pubmed-backward')

# ELMo embeddings trained on PubMed
elmo_embeddings = ELMoEmbeddings('pubmed')

Byte Pair Embeddings

Added byte pair embeddings, based on the BPEmb library by @bheinzerling, with support for 275 languages. These are very useful if you want to train small models. Load them, for instance, with:

from flair.embeddings import BytePairEmbeddings

# initialize byte pair embeddings for English
embeddings = BytePairEmbeddings(language='en')
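
These behave like any other token-level embedding in flair; a minimal usage sketch:

from flair.data import Sentence

# embed an example sentence
sentence = Sentence('This is a test .')
embeddings.embed(sentence)

# every token now carries its byte pair embedding vector
for token in sentence:
    print(token, token.embedding.shape)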

Transformer-XL Embeddings

Transformer-XL embeddings added by @stefan-it. Load with:

from flair.embeddings import TransformerXLEmbeddings

# initialize Transformer-XL embeddings (uses the pre-trained English model by default)
embeddings = TransformerXLEmbeddings()

ELMo Transformer Embeddings

An experimental transformer version of the ELMo embeddings, added by @stefan-it.
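
These notes don't include a loading snippet; a minimal sketch, assuming the class is exposed as ELMoTransformerEmbeddings and takes the path to a pre-trained transformer ELMo weights file (both the parameter name and the path below are assumptions):

from flair.embeddings import ELMoTransformerEmbeddings

# hypothetical path to pre-trained transformer ELMo weights; the 'model_file'
# parameter name is an assumption, not confirmed by these notes
embeddings = ELMoTransformerEmbeddings(model_file='path/to/elmo_transformer_weights.hdf5')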

DocumentRNNEmbeddings

The new DocumentRNNEmbeddings class replaces the now-deprecated DocumentLSTMEmbeddings. This class allows you to choose which type of RNN you want to use. By default, it uses a GRU.

Initialize like this:

from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

# token-level embeddings that the RNN pools over
glove_embedding = WordEmbeddings('glove')

# document-level embeddings using an LSTM instead of the default GRU
document_lstm_embeddings = DocumentRNNEmbeddings([glove_embedding], rnn_type='LSTM')
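
To get a single vector for a whole text, embed a Sentence and read off its embedding (standard flair usage):

from flair.data import Sentence

# embed an example document
sentence = Sentence('The grass is green .')
document_lstm_embeddings.embed(sentence)

# the sentence now has a single document-level embedding vector
print(sentence.get_embedding())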

New languages

Japanese

FlairEmbeddings for Japanese trained by @frtacoa and @minh-agent:

from flair.embeddings import FlairEmbeddings

# Japanese forward and backward embeddings
embeddings_fw = FlairEmbeddings('japanese-forward')
embeddings_bw = FlairEmbeddings('japanese-backward')

Spanish

Added pre-computed FlairEmbeddings for Spanish, trained over Wikipedia by @iamyihwa (see #80).

To load Spanish FlairEmbeddings, simply do:

from flair.embeddings import FlairEmbeddings

# default forward and backward embeddings
embeddings_fw = FlairEmbeddings('spanish-forward')
embeddings_bw = FlairEmbeddings('spanish-backward')

# CPU-friendly forward and backward embeddings
embeddings_fw_fast = FlairEmbeddings('spanish-forward-fast')
embeddings_bw_fast = FlairEmbeddings('spanish-backward-fast')

Basque

  • @stefan-it trained FlairEmbeddings for Basque, which we now include. Load with:

forward_lm_embeddings = FlairEmbeddings('basque-forward')
backward_lm_embeddings = FlairEmbeddings('basque-backward')

  • Added Basque FastText embeddings. Load with:

wikipedia_embeddings = WordEmbeddings('eu-wiki')
crawl_embeddings = WordEmbeddings('eu-crawl')

New Datasets

  • IMDB dataset #410. Load with (imports shown in the sketch after this list):

corpus = NLPTaskDataFetcher.load_corpus(NLPTask.IMDB)

  • TREC_6 and TREC_50 #450. Load with:

corpus = NLPTaskDataFetcher.load_corpus(NLPTask.TREC_6)

  • Added download routines for Basque Universal Dependencies and Named Entities. Load with:

corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_BASQUE)
corpus_ner = NLPTaskDataFetcher.load_corpus(NLPTask.NER_BASQUE)
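
All of these loaders share the same imports; a minimal, self-contained sketch, assuming the 0.4.1 module layout in which the fetcher lives in flair.data_fetcher:

from flair.data_fetcher import NLPTaskDataFetcher, NLPTask

# downloads the corpus on first call, then loads it from the local cache
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.IMDB)

# print basic corpus statistics (train/dev/test splits)
print(corpus)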

Other features

FlairEmbeddings for long text

FlairEmbeddings can now be generated for arbitrarily long strings without causing out-of-memory errors. See #444.
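
Nothing changes on the API side; a minimal sketch of embedding a long input that would previously have risked an out-of-memory error:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

embeddings = FlairEmbeddings('news-forward-fast')

# a sentence over a very long string; long inputs no longer cause OOM
long_text = ' '.join(['word'] * 10000)
sentence = Sentence(long_text)
embeddings.embed(sentence)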

Function for calculating perplexity of a string #531

Use like this:

from flair.embeddings import FlairEmbeddings

# get language model
language_model = FlairEmbeddings('news-forward-fast').lm

# calculate perplexity for a grammatical sentence
grammatical = 'The company made a profit'
perplexity_grammatical_sentence = language_model.calculate_perplexity(grammatical)

# calculate perplexity for an ungrammatical sentence
ungrammatical = 'Nook negh qapla!'
perplexity_ungrammatical_sentence = language_model.calculate_perplexity(ungrammatical)

# print both
print(f'"{grammatical}" - perplexity is {perplexity_grammatical_sentence}')
print(f'"{ungrammatical}" - perplexity is {perplexity_ungrammatical_sentence}')

Bug fixes

  • Overflow error in text generation #322
  • Sentence embeddings are now vectors #368
  • Macro-average F-score computation #521
  • Character embeddings on CUDA #434
  • Accuracy calculation #553

Speed improvements

  • Asynchronous loading of mini batches in language model training (roughly doubles training speed) #406
  • Only send mini-batches to GPU #350
  • Speed up sequence tagger prediction #353
  • Use new CUDA semantics #402
  • Reduce CPU-GPU shuffling #459
  • LM memory tweaks #466