
Release 0.4.1

Released by @alanakbik on 22 Feb 13:36

Release 0.4.1 brings lots of new features: new embeddings (RNN, Transformer and byte pair embeddings), new languages (Japanese, Spanish, Basque), new datasets, bug fixes, and speed improvements (2x training speed for language models).

New Embeddings

Biomedical Embeddings

Added the first embeddings trained over PubMed data: Flair embeddings (forward and backward) and ELMo embeddings.

Load them, for instance, like this:

from flair.embeddings import FlairEmbeddings, ELMoEmbeddings

# Flair embeddings trained on PubMed
flair_embedding_forward = FlairEmbeddings('pubmed-forward')
flair_embedding_backward = FlairEmbeddings('pubmed-backward')

# ELMo embeddings trained on PubMed
elmo_embeddings = ELMoEmbeddings('pubmed')

Byte Pair Embeddings

Added byte pair embeddings, based on the BPEmb library by @bheinzerling, with support for 275 languages. These are very useful if you want to train small models. Load them, for instance, with:

from flair.embeddings import BytePairEmbeddings

# initialize byte pair embeddings for English
embeddings = BytePairEmbeddings(language='en')
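
These behave like any other token-level embedding in flair; a minimal usage sketch:

from flair.data import Sentence

# embed an example sentence
sentence = Sentence('This is a test .')
embeddings.embed(sentence)

# every token now carries its byte pair embedding vector
for token in sentence:
    print(token, token.embedding.shape)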

Transformer-XL Embeddings

Transformer-XL embeddings added by @stefan-it. Load with:

from flair.embeddings import TransformerXLEmbeddings

# initialize Transformer-XL embeddings (uses the pre-trained English model by default)
embeddings = TransformerXLEmbeddings()

ELMo Transformer Embeddings

An experimental transformer version of the ELMo embeddings, added by @stefan-it.
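
These notes don't include a loading snippet; a minimal sketch, assuming the class is exposed as ELMoTransformerEmbeddings and takes the path to a pre-trained transformer ELMo weights file (both the parameter name and the path below are assumptions):

from flair.embeddings import ELMoTransformerEmbeddings

# hypothetical path to pre-trained transformer ELMo weights; the 'model_file'
# parameter name is an assumption, not confirmed by these notes
embeddings = ELMoTransformerEmbeddings(model_file='path/to/elmo_transformer_weights.hdf5')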

DocumentRNNEmbeddings

The new DocumentRNNEmbeddings class replaces the now-deprecated DocumentLSTMEmbeddings. This class allows you to choose which type of RNN you want to use. By default, it uses a GRU.

Initialize like this:

from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

# token-level embeddings that the RNN pools over
glove_embedding = WordEmbeddings('glove')

# document-level embeddings using an LSTM instead of the default GRU
document_lstm_embeddings = DocumentRNNEmbeddings([glove_embedding], rnn_type='LSTM')
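
To get a single vector for a whole text, embed a Sentence and read off its embedding (standard flair usage):

from flair.data import Sentence

# embed an example document
sentence = Sentence('The grass is green .')
document_lstm_embeddings.embed(sentence)

# the sentence now has a single document-level embedding vector
print(sentence.get_embedding())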

New languages

Japanese

FlairEmbeddings for Japanese trained by @frtacoa and @minh-agent:

from flair.embeddings import FlairEmbeddings

# Japanese forward and backward embeddings
embeddings_fw = FlairEmbeddings('japanese-forward')
embeddings_bw = FlairEmbeddings('japanese-backward')

Spanish

Added pre-computed FlairEmbeddings for Spanish, trained over Wikipedia by @iamyihwa (see #80).

To load Spanish FlairEmbeddings, simply do:

from flair.embeddings import FlairEmbeddings

# default forward and backward embeddings
embeddings_fw = FlairEmbeddings('spanish-forward')
embeddings_bw = FlairEmbeddings('spanish-backward')

# CPU-friendly forward and backward embeddings
embeddings_fw_fast = FlairEmbeddings('spanish-forward-fast')
embeddings_bw_fast = FlairEmbeddings('spanish-backward-fast')

Basque

  • @stefan-it trained FlairEmbeddings for Basque, which we now include. Load with:

forward_lm_embeddings = FlairEmbeddings('basque-forward')
backward_lm_embeddings = FlairEmbeddings('basque-backward')

  • Added Basque FastText embeddings. Load with:

wikipedia_embeddings = WordEmbeddings('eu-wiki')
crawl_embeddings = WordEmbeddings('eu-crawl')

New Datasets

  • IMDB dataset #410. Load with (imports shown in the sketch after this list):

corpus = NLPTaskDataFetcher.load_corpus(NLPTask.IMDB)

  • TREC_6 and TREC_50 #450. Load with:

corpus = NLPTaskDataFetcher.load_corpus(NLPTask.TREC_6)

  • Added download routines for Basque Universal Dependencies and Named Entities. Load with:

corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_BASQUE)
corpus_ner = NLPTaskDataFetcher.load_corpus(NLPTask.NER_BASQUE)
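
All of these loaders share the same imports; a minimal, self-contained sketch, assuming the 0.4.1 module layout in which the fetcher lives in flair.data_fetcher:

from flair.data_fetcher import NLPTaskDataFetcher, NLPTask

# downloads the corpus on first call, then loads it from the local cache
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.IMDB)

# print basic corpus statistics (train/dev/test splits)
print(corpus)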

Other features

FlairEmbeddings for long text

FlairEmbeddings can now be generated for arbitrarily long strings without causing out-of-memory errors. See #444.
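
Nothing changes on the API side; a minimal sketch of embedding a long input that would previously have risked an out-of-memory error:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

embeddings = FlairEmbeddings('news-forward-fast')

# a sentence over a very long string; long inputs no longer cause OOM
long_text = ' '.join(['word'] * 10000)
sentence = Sentence(long_text)
embeddings.embed(sentence)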

Function for calculating perplexity of a string #531

Use like this:

from flair.embeddings import FlairEmbeddings

# get language model
language_model = FlairEmbeddings('news-forward-fast').lm

# calculate perplexity for a grammatical sentence
grammatical = 'The company made a profit'
perplexity_grammatical_sentence = language_model.calculate_perplexity(grammatical)

# calculate perplexity for an ungrammatical sentence
ungrammatical = 'Nook negh qapla!'
perplexity_ungrammatical_sentence = language_model.calculate_perplexity(ungrammatical)

# print both
print(f'"{grammatical}" - perplexity is {perplexity_grammatical_sentence}')
print(f'"{ungrammatical}" - perplexity is {perplexity_ungrammatical_sentence}')

Bug fixes

  • Overflow error in text generation #322
  • Sentence embeddings are now vectors #368
  • Macro-average F-score computation #521
  • Character embeddings on CUDA #434
  • Accuracy calculation #553

Speed improvements

  • Asynchronous loading of mini batches in language model training (roughly doubles training speed) #406
  • Only send mini-batches to GPU #350
  • Speed up sequence tagger prediction #353
  • Use new CUDA semantics #402
  • Reduce CPU-GPU shuffling #459
  • LM memory tweaks #466