Release 0.4.2
Release 0.4.2 includes new features such as streaming data loading (allowing training over very large datasets), support for OpenAI GPT embeddings, pre-trained Flair embeddings for many new languages, better classification baselines using one-hot embeddings and fine-tunable document pool embeddings, and text regression as a third task next to sequence labeling and text classification.
New way of loading data (#768)
The data loading part has been completely refactored to enable streaming data loading from disk using PyTorch's DataLoaders. That is, training no longer requires the full dataset to be kept in memory, allowing us to train models over much larger datasets. This version also changes the syntax for loading datasets.
Old way (now deprecated):
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
New way:
import flair.datasets
corpus = flair.datasets.UD_ENGLISH()
To use streaming loading, i.e. to not load the full dataset into memory, you can pass the in_memory parameter:
import flair.datasets
corpus = flair.datasets.UD_ENGLISH(in_memory=False)
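With in_memory=False, training works as before since batches are read through PyTorch's DataLoaders. A minimal training sketch, assuming a SequenceTagger named tagger has already been set up as usual (the output path is illustrative):
import flair.datasets
from flair.trainers import ModelTrainer
# load the corpus in streaming mode
corpus = flair.datasets.UD_ENGLISH(in_memory=False)
# train the already initialized model over the streamed corpus
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/example-pos', max_epochs=10)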
Embeddings
Flair embeddings (#614)
This release brings Flair embeddings to 11 new languages (thanks @stefan-it!): Arabic (ar), Danish (da), Persian (fa), Finnish (fi), Hebrew (he), Hindi (hi), Croatian (hr), Indonesian (id), Italian (it), Norwegian (no) and Swedish (sv). It also improves support for Bulgarian (bg), Czech (cs), Basque (eu), Dutch (nl) and Slovenian (sl), and adds special language models for historical German. Load with the language code, e.g.
# load Flair embeddings for Italian
embeddings = FlairEmbeddings('it-forward')
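As with the other languages, the forward and backward models can be combined in a StackedEmbeddings; a short sketch for Italian:
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
# stack forward and backward Italian Flair embeddings
embeddings = StackedEmbeddings([
    FlairEmbeddings('it-forward'),
    FlairEmbeddings('it-backward'),
])
# embed an example sentence
sentence = Sentence('Il gatto dorme sul tavolo .')
embeddings.embed(sentence)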
One-hot encoded embeddings (#747)
Some classification baselines work astonishingly well with simple learnable word embeddings. To support testing these baselines, we've added learnable word embeddings that start from a one-hot encoding of words. To initialize them, you need to pass a corpus, which is used to build the vocabulary.
# load corpus
import flair.datasets
from flair.embeddings import OneHotEmbeddings
corpus = flair.datasets.UD_ENGLISH()
# init learnable word embeddings with corpus
embeddings = OneHotEmbeddings(corpus)
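They are applied like any other word embedding; a quick usage sketch:
from flair.data import Sentence
# embed an example sentence
sentence = Sentence('The grass is green .')
embeddings.embed(sentence)
# each token now carries a learnable embedding
for token in sentence:
    print(token, token.embedding.shape)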
More options in DocumentPoolEmbeddings (#747)
We now allow users to specify a fine-tuning option that is applied before the pooling operation in document pool embeddings. The options are 'none' (no fine-tuning), 'linear' (linear remapping of word embeddings) and 'nonlinear' (nonlinear remapping of word embeddings). 'nonlinear' should be used together with WordEmbeddings, while 'none' should be used with OneHotEmbeddings (fine-tuning is not necessary there since they are already learned on the data). So, to replicate FastText classification you can either do:
# instantiate one-hot encoded word embeddings
from flair.embeddings import OneHotEmbeddings, DocumentPoolEmbeddings
embeddings = OneHotEmbeddings(corpus)
# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='none')
or
# instantiate pre-trained word embeddings
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
embeddings = WordEmbeddings('glove')
# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='nonlinear')
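In both cases, documents are then embedded the same way; a brief sketch:
from flair.data import Sentence
# embed an example document
sentence = Sentence('The grass is green .')
document_embeddings.embed(sentence)
# the pooled document vector
print(sentence.get_embedding())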
OpenAI GPT Embeddings (#624)
We now support embeddings from the OpenAI GPT model. We use the excellent pytorch-pretrained-BERT library to download the GPT model, tokenize the input and extract embeddings from the subtokens.
Initialize with:
from flair.embeddings import OpenAIGPTEmbeddings
embeddings = OpenAIGPTEmbeddings()
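Then embed text as with any other embedding class, e.g.:
from flair.data import Sentence
# embed an example sentence
sentence = Sentence('Berlin and Munich are nice cities .')
embeddings.embed(sentence)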
Portuguese embeddings from NILC (#576)
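This release also adds Portuguese word embeddings trained by NILC. A minimal loading sketch, under the assumption that these vectors are registered under the 'pt' language code of WordEmbeddings:
from flair.embeddings import WordEmbeddings
# load Portuguese word embeddings (assumption: 'pt' resolves to the NILC vectors)
embeddings = WordEmbeddings('pt')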
Extensibility to new downstream tasks (#681)
Previously, we had the SequenceTagger and TextClassifier as the two downstream tasks supported by Flair. The ModelTrainer had specific methods to train these two models, making it difficult for users to add new types of tasks (such as text regression) to Flair.
This release refactors the flair.nn.Model and ModelTrainer functionality to make it uniform across tagging models and to enable users to add new tasks to Flair. Now, by implementing the 5 methods of the flair.nn.Model interface, a custom model immediately becomes trainable with the ModelTrainer. Three types of downstream tasks currently implement this interface:
- the SequenceTagger,
- the TextClassifier,
- and the beta TextRegressor.
The code refactor removes a lot of code redundancy and slims down the interfaces of the downstream task classes. As the sole breaking change, it removes the load_from_file() methods, whose functionality is now part of the load() method. I.e., if previously you loaded a self-trained model like this:
tagger = SequenceTagger.load_from_file('/path/to/model.pt')
You now do it like this:
tagger = SequenceTagger.load('/path/to/model.pt')
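The loaded model is then used as before, e.g.:
from flair.data import Sentence
# tag an example sentence with the loaded model
sentence = Sentence('George Washington went to Washington .')
tagger.predict(sentence)
print(sentence.to_tagged_string())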
New features
- New beta support for text regression (#564)
- Return confidence scores for single-label classification (#664); a short sketch follows this list
- Add method to find probability for each class in case of multi-class classification (#693)
- Capability to change the threshold during multi-label classification (#707)
- Support for customized ELMo embeddings (#661)
- Detect multi-label problems automatically: Previously, users always had to specify whether their text classification problem was multi_label or not. Now, this is detected automatically if users do not specify. So now you can init like this:
# corpus
from flair.datasets import TREC_6
from flair.models import TextClassifier
corpus = TREC_6()
# make label_dictionary
label_dictionary = corpus.make_label_dictionary()
# init text classifier without the multi_label flag; it is detected automatically
classifier = TextClassifier(document_embeddings, label_dictionary)
- Added better module descriptions to embeddings and dropout so that more parameters get printed by default, for better model logging (#747)
- Make 'cache_root' a global variable so that different directories can be chosen for caching (#667)
- Both string and Token objects can now be passed to the add_token method (#628)
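To illustrate the confidence scores mentioned above (#664), here is a minimal sketch, assuming a trained TextClassifier named classifier:
from flair.data import Sentence
# predict a label; each predicted label carries a confidence score
sentence = Sentence('I love this movie !')
classifier.predict(sentence)
for label in sentence.labels:
    print(label.value, label.score)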
New datasets
- Added IMDB classification corpus to flair.datasets (#749)
- Added TREC_6 classification corpus to flair.datasets (#749)
- Added 20 newsgroups classification corpus to flair.datasets (NEWSGROUPS object)
- WASSA-17 emotion intensity text regression tasks (#695)
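Each new corpus is loaded like any other dataset in flair.datasets, e.g.:
import flair.datasets
# load the IMDB classification corpus
corpus = flair.datasets.IMDB()
print(corpus)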