Release 0.4.2
Release 0.4.2 includes new features such as streaming data loading (allowing training over very large datasets), support for OpenAI GPT embeddings, pre-trained Flair embeddings for many new languages, better classification baselines using one-hot embeddings and fine-tunable document pool embeddings, and text regression as a third task next to sequence labeling and text classification.
New way of loading data (#768)
The data loading part has been completely refactored to enable streaming data loading from disk using PyTorch's DataLoaders. That is, training no longer requires the full dataset to be kept in memory, allowing us to train models over much larger datasets. This version also changes the syntax for loading datasets.
Old way (now deprecated):
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
New way:
import flair.datasets
corpus = flair.datasets.UD_ENGLISH()
To use streaming loading, i.e. to not load the full dataset into memory, you can pass the in_memory parameter:
import flair.datasets
corpus = flair.datasets.UD_ENGLISH(in_memory=False)
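With in_memory=False, training works as before since batches are read through PyTorch's DataLoaders. A minimal training sketch, assuming a SequenceTagger named tagger has already been set up as usual (the output path is illustrative):
import flair.datasets
from flair.trainers import ModelTrainer
# load the corpus in streaming mode
corpus = flair.datasets.UD_ENGLISH(in_memory=False)
# train the already initialized model over the streamed corpus
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/example-pos', max_epochs=10)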
Embeddings
Flair embeddings (#614)
This release brings Flair embeddings to 11 new languages (thanks @stefan-it!): Arabic (ar), Danish (da), Persian (fa), Finnish (fi), Hebrew (he), Hindi (hi), Croatian (hr), Indonesian (id), Italian (it), Norwegian (no) and Swedish (sv). It also improves support for Bulgarian (bg), Czech (cs), Basque (eu), Dutch (nl) and Slovenian (sl), and adds special language models for historical German. Load with the language code, e.g.
# load Flair embeddings for Italian
embeddings = FlairEmbeddings('it-forward')
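As with the other languages, the forward and backward models can be combined in a StackedEmbeddings; a short sketch for Italian:
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
# stack forward and backward Italian Flair embeddings
embeddings = StackedEmbeddings([
    FlairEmbeddings('it-forward'),
    FlairEmbeddings('it-backward'),
])
# embed an example sentence
sentence = Sentence('Il gatto dorme sul tavolo .')
embeddings.embed(sentence)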
One-hot encoded embeddings (#747)
Some classification baselines work astonishingly well with simple learnable word embeddings. To support testing these baselines, we've added learnable word embeddings that start from a one-hot encoding of words. To initialize them, you need to pass a corpus, which is used to build the vocabulary.
# load corpus
import flair.datasets
from flair.embeddings import OneHotEmbeddings
corpus = flair.datasets.UD_ENGLISH()
# init learnable word embeddings with corpus
embeddings = OneHotEmbeddings(corpus)
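They are applied like any other word embedding; a quick usage sketch:
from flair.data import Sentence
# embed an example sentence
sentence = Sentence('The grass is green .')
embeddings.embed(sentence)
# each token now carries a learnable embedding
for token in sentence:
    print(token, token.embedding.shape)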
More options in DocumentPoolEmbeddings (#747)
We now allow users to specify a fine-tuning option that is applied before the pooling operation in document pool embeddings. The options are 'none' (no fine-tuning), 'linear' (linear remapping of word embeddings) and 'nonlinear' (nonlinear remapping of word embeddings). 'nonlinear' should be used together with WordEmbeddings, while 'none' should be used with OneHotEmbeddings (fine-tuning is not necessary there since they are already learned on the data). So, to replicate FastText classification you can either do:
# instantiate one-hot encoded word embeddings
from flair.embeddings import OneHotEmbeddings, DocumentPoolEmbeddings
embeddings = OneHotEmbeddings(corpus)
# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='none')
or
# instantiate pre-trained word embeddings
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
embeddings = WordEmbeddings('glove')
# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='nonlinear')
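In both cases, documents are then embedded the same way; a brief sketch:
from flair.data import Sentence
# embed an example document
sentence = Sentence('The grass is green .')
document_embeddings.embed(sentence)
# the pooled document vector
print(sentence.get_embedding())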
OpenAI GPT Embeddings (#624)
We now support embeddings from the OpenAI GPT model. We use the excellent pytorch-pretrained-BERT library to download the GPT model, tokenize the input and extract embeddings from the subtokens.
Initialize with:
from flair.embeddings import OpenAIGPTEmbeddings
embeddings = OpenAIGPTEmbeddings()
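Then embed text as with any other embedding class, e.g.:
from flair.data import Sentence
# embed an example sentence
sentence = Sentence('Berlin and Munich are nice cities .')
embeddings.embed(sentence)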
Portuguese embeddings from NILC (#576)
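This release also adds Portuguese word embeddings trained by NILC. A minimal loading sketch, under the assumption that these vectors are registered under the 'pt' language code of WordEmbeddings:
from flair.embeddings import WordEmbeddings
# load Portuguese word embeddings (assumption: 'pt' resolves to the NILC vectors)
embeddings = WordEmbeddings('pt')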
Extensibility to new downstream tasks (#681)
Previously, we had the SequenceTagger and TextClassifier as the two downstream tasks supported by Flair. The ModelTrainer had specific methods to train these two models, making it difficult for users to add new types of tasks (such as text regression) to Flair.
This release refactors the flair.nn.Model and ModelTrainer functionality to make it uniform across tagging models and to enable users to add new tasks to Flair. Now, by implementing the 5 methods of the flair.nn.Model interface, a custom model immediately becomes trainable with the ModelTrainer. Three types of downstream tasks currently implement this interface:
- the SequenceTagger,
- the TextClassifier,
- and the beta TextRegressor.
The code refactor removes a lot of code redundancy and slims down the interfaces of the downstream task classes. As the sole breaking change, it removes the load_from_file() methods, whose functionality is now part of the load() method. I.e., if previously you loaded a self-trained model like this:
tagger = SequenceTagger.load_from_file('/path/to/model.pt')
You now do it like this:
tagger = SequenceTagger.load('/path/to/model.pt')
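The loaded model is then used as before, e.g.:
from flair.data import Sentence
# tag an example sentence with the loaded model
sentence = Sentence('George Washington went to Washington .')
tagger.predict(sentence)
print(sentence.to_tagged_string())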
New features
- New beta support for text regression (#564)
- Return confidence scores for single-label classification (#664); a short sketch follows this list
- Add method to find probability for each class in case of multi-class classification (#693)
- Capability to change the threshold during multi-label classification (#707)
- Support for customized ELMo embeddings (#661)
- Detect multi-label problems automatically: Previously, users always had to specify whether their text classification problem was multi_label or not. Now, this is detected automatically if users do not specify. So now you can init like this:
# corpus
from flair.datasets import TREC_6
from flair.models import TextClassifier
corpus = TREC_6()
# make label_dictionary
label_dictionary = corpus.make_label_dictionary()
# init text classifier without the multi_label flag; it is detected automatically
classifier = TextClassifier(document_embeddings, label_dictionary)
- Added better module descriptions to embeddings and dropout so that more parameters get printed by default, for better model logging (#747)
- Make 'cache_root' a global variable so that different directories can be chosen for caching (#667)
- Both string and Token objects can now be passed to the add_token method (#628)
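To illustrate the confidence scores mentioned above (#664), here is a minimal sketch, assuming a trained TextClassifier named classifier:
from flair.data import Sentence
# predict a label; each predicted label carries a confidence score
sentence = Sentence('I love this movie !')
classifier.predict(sentence)
for label in sentence.labels:
    print(label.value, label.score)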
New datasets
- Added IMDB classification corpus to flair.datasets (#749)
- Added TREC_6 classification corpus to flair.datasets (#749)
- Added 20 newsgroups classification corpus to flair.datasets (NEWSGROUPS object)
- WASSA-17 emotion intensity text regression tasks (#695)
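Each new corpus is loaded like any other dataset in flair.datasets, e.g.:
import flair.datasets
# load the IMDB classification corpus
corpus = flair.datasets.IMDB()
print(corpus)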