Release 0.4.1
Release 0.4.1 brings lots of new features: new embeddings (RNN, Transformer and byte pair embeddings), new languages (Japanese, Spanish, Basque), new datasets, bug fixes and speed improvements (2x training speed for language models).
New Embeddings
Biomedical Embeddings
Added the first embeddings trained over PubMed data: forward and backward FlairEmbeddings as well as ELMoEmbeddings.
Load these for instance with:
from flair.embeddings import FlairEmbeddings, ELMoEmbeddings

# Flair embeddings trained over PubMed
flair_embedding_forward = FlairEmbeddings('pubmed-forward')
flair_embedding_backward = FlairEmbeddings('pubmed-backward')

# ELMo embeddings trained over PubMed
elmo_embeddings = ELMoEmbeddings('pubmed')
Byte Pair Embeddings
Added byte pair embeddings by @bheinzerling, with support for 275 languages. These are very useful if you want to train small models. Load them for instance with:
from flair.embeddings import BytePairEmbeddings

# initialize English byte pair embeddings
embeddings = BytePairEmbeddings(language='en')
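Once initialized, these behave like any other token-level embedding; a short usage sketch continuing from the line above (the example sentence is illustrative):

from flair.data import Sentence

# embed an example sentence
sentence = Sentence('The grass is green .')
embeddings.embed(sentence)

# each token now carries a byte pair embedding vector
for token in sentence:
    print(token.text, token.embedding.shape)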
Transformer-XL Embeddings
Transformer-XL embeddings added by @stefan-it. Load with:
from flair.embeddings import TransformerXLEmbeddings

# initialize Transformer-XL embeddings
embeddings = TransformerXLEmbeddings()
ELMo Transformer Embeddings
Experimental transformer version of ELMo embeddings, added by @stefan-it.
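A minimal loading sketch, assuming the class name ELMoTransformerEmbeddings and a locally available pre-trained transformer ELMo model (the path below is a placeholder):

from flair.embeddings import ELMoTransformerEmbeddings

# placeholder path - point this at a pre-trained transformer ELMo model
embeddings = ELMoTransformerEmbeddings('path/to/transformer_elmo_model.tar.gz')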
DocumentRNNEmbeddings
The new DocumentRNNEmbeddings class replaces the now-deprecated DocumentLSTMEmbeddings. This class allows you to choose which type of RNN you want to use. By default, it uses a GRU.
Initialize like this:
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

# document embeddings computed by an LSTM over GloVe word embeddings
glove_embedding = WordEmbeddings('glove')
document_lstm_embeddings = DocumentRNNEmbeddings([glove_embedding], rnn_type='LSTM')
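Once initialized, it embeds whole sentences rather than individual tokens; a short usage sketch continuing from the snippet above:

from flair.data import Sentence

# embed an example document
sentence = Sentence('The grass is green .')
document_lstm_embeddings.embed(sentence)

# the entire sentence now has a single embedding vector
print(sentence.get_embedding().shape)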
New Languages
Japanese
FlairEmbeddings for Japanese, trained by @frtacoa and @minh-agent:
# forward and backward embedding
embeddings_fw = FlairEmbeddings('japanese-forward')
embeddings_bw = FlairEmbeddings('japanese-backward')
Spanish
Added FlairEmbeddings for Spanish, pre-computed over Wikipedia by @iamyihwa (see #80).
To load the Spanish FlairEmbeddings, simply do:
# default forward and backward embedding
embeddings_fw = FlairEmbeddings('spanish-forward')
embeddings_bw = FlairEmbeddings('spanish-backward')
# CPU-friendly forward and backward embedding
embeddings_fw_fast = FlairEmbeddings('spanish-forward-fast')
embeddings_bw_fast = FlairEmbeddings('spanish-backward-fast')
Basque
- @stefan-it trained FlairEmbeddings for Basque, which we now include. Load with:
forward_lm_embeddings = FlairEmbeddings('basque-forward')
backward_lm_embeddings = FlairEmbeddings('basque-backward')
- Added Basque FastText embeddings. Load with:
wikipedia_embeddings = WordEmbeddings('eu-wiki')
crawl_embeddings = WordEmbeddings('eu-crawl')
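The new Basque embeddings can also be combined into a single representation; a sketch using Flair's StackedEmbeddings:

from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# stack Basque FastText and Flair embeddings
stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('eu-wiki'),
    FlairEmbeddings('basque-forward'),
    FlairEmbeddings('basque-backward'),
])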
New Datasets
- IMDB dataset #410 - load with:
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.IMDB)
- TREC_6 and TREC_50 #450 - load with:
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.TREC_6)
- Added download routines for Basque Universal Dependencies and Named Entities - load with:
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_BASQUE)
corpus_ner = NLPTaskDataFetcher.load_corpus(NLPTask.NER_BASQUE)
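Each loaded corpus exposes train, dev and test splits; a quick sketch, assuming NLPTaskDataFetcher and NLPTask are imported from flair.data_fetcher:

from flair.data_fetcher import NLPTaskDataFetcher, NLPTask

# download and load the IMDB corpus
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.IMDB)

# inspect the split sizes
print(len(corpus.train), len(corpus.dev), len(corpus.test))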
Other Features
FlairEmbeddings for long text
FlairEmbeddings can now be generated for arbitrarily long strings without causing out-of-memory errors. See #444
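For example, a long document can now be embedded as one string; a small sketch:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

embeddings = FlairEmbeddings('news-forward-fast')

# build an artificially long input
long_text = ' '.join(['word'] * 5000)
sentence = Sentence(long_text)

# embedding arbitrarily long input no longer runs out of memory
embeddings.embed(sentence)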
Added a function for calculating the perplexity of a string #531
Use like this:
from flair.embeddings import FlairEmbeddings
# get language model
language_model = FlairEmbeddings('news-forward-fast').lm
# calculate perplexity for grammatical sentence
grammatical = 'The company made a profit'
perplexity_grammatical_sentence = language_model.calculate_perplexity(grammatical)
# calculate perplexity for ungrammatical sentence
ungrammatical = 'Nook negh qapla!'
perplexity_ungrammatical_sentence = language_model.calculate_perplexity(ungrammatical)
# print both
print(f'"{grammatical}" - perplexity is {perplexity_grammatical_sentence}')
print(f'"{ungrammatical}" - perplexity is {perplexity_ungrammatical_sentence}')
Bug fixes
- Fixed an overflow error in text generation #322
- Sentence embeddings are now vectors #368
- Fixed macro-averaged F-score computation #521
- Fixed character embeddings on CUDA #434
- Fixed accuracy calculation #553