Release 0.4.4
Release 0.4.4 introduces dramatic improvements in inference speed for taggers (thanks to many contributions by @pommedeterresautee), Flair embeddings in 300 languages (thanks @stefan-it), modular tokenization and many new features and refactorings.
Speed optimizations
Many refactorings by @pommedeterresautee to improve inference speed of sequence tagger (#1038 #1053 #1068 #1093 #1130), Flair embeddings (#1074 #1095 #1107 #1132 #1145), word embeddings (#1084),
embeddings memory management (#1082 #1117), general optimizations (#1112) and classification (#1187).
The combined improvements increase inference speed by a factor of 2-3!
New features
Modular tokenization (#1022)
You can now pass custom tokenizers to Sentence objects and Dataset loaders to use tokenizers other than the included segtok library by implementing a tokenizer method. Currently, built-in support exists for whitespace tokenization, segtok tokenization and Japanese tokenization with mecab (requires mecab to be installed). In the future, we expect support for additional external tokenizers to be added.
For instance, if you wish to use Japanese tokenization performed by mecab, you can instantiate the Sentence object like this:
from flair.data import build_japanese_tokenizer
from flair.data import Sentence
# instantiate Japanese tokenizer
japanese_tokenizer = build_japanese_tokenizer()
# init sentence and pass this tokenizer
sentence = Sentence("私はベルリンが好きです。", use_tokenizer=japanese_tokenizer)
print(sentence)
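To make the idea of a pluggable tokenizer more concrete, here is a minimal, library-free sketch of a whitespace tokenizer that also records character offsets. It assumes a tokenizer is simply a callable that maps a string to a list of tokens. SimpleToken and whitespace_tokenizer are hypothetical names used only for illustration; a real Flair tokenizer would return Flair's own token objects instead.

```python
from typing import List


class SimpleToken:
    """Hypothetical stand-in for a library token: surface text plus start offset."""

    def __init__(self, text: str, start_pos: int) -> None:
        self.text = text
        self.start_pos = start_pos


def whitespace_tokenizer(text: str) -> List[SimpleToken]:
    """Split on whitespace and remember where each token starts in the input."""
    tokens: List[SimpleToken] = []
    word_start = None  # index where the current word began, or None between words
    for index, char in enumerate(text):
        if char.isspace():
            if word_start is not None:
                tokens.append(SimpleToken(text[word_start:index], word_start))
                word_start = None
        elif word_start is None:
            word_start = index
    # flush the last word if the text does not end in whitespace
    if word_start is not None:
        tokens.append(SimpleToken(text[word_start:], word_start))
    return tokens


tokens = whitespace_tokenizer("I love  Berlin .")
print([(t.text, t.start_pos) for t in tokens])
```

Keeping the start offsets makes it possible to map tokens back to the original string, which is useful when the input contains repeated whitespace.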
Flair Embeddings for 300 languages (#1146)
Thanks to @stefan-it, there is now a massively multilingual Flair embeddings model that covers 300 languages. See #1099 for more info on these embeddings and this repo for more details.
This replaces the old multilingual Flair embeddings that were trained for 6 languages. Load them with:
embeddings_fw = FlairEmbeddings('multi-forward')
embeddings_bw = FlairEmbeddings('multi-backward')
Multilingual Character Dictionaries (#1157)
Adds two multilingual character dictionaries computed by @stefan-it.
Load with
dictionary = Dictionary.load('chars-large')
print(len(dictionary.idx2item))
dictionary = Dictionary.load('chars-xl')
print(len(dictionary.idx2item))
Batch-growth annealing (#1138)
The paper Don't Decay the Learning Rate, Increase the Batch Size makes the case for increasing the batch size over time instead of annealing the learning rate.
This version adds the possibility to have arbitrarily large mini-batch sizes with an accumulating gradient strategy. It introduces the parameter mini_batch_chunk_size that you can set to break down large mini-batches into smaller chunks for processing purposes.
So let's say you want to have a mini-batch size of 128, but your memory cannot handle more than 32 samples at a time. Then you can train like this:
trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "path/to/experiment/folder",
    # set large mini-batch size
    mini_batch_size=128,
    # set chunk size to lower memory requirements
    mini_batch_chunk_size=32,
)
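The accumulating gradient strategy can be illustrated without any deep learning library. The sketch below shows the general technique and is not Flair's actual implementation: it computes the gradient of a mean squared error loss for a one-parameter linear model, once over the full mini-batch and once by summing chunk gradients that are scaled by each chunk's share of the batch. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def mse_gradient(weight: float, batch: List[Sample]) -> float:
    """Gradient of mean((weight * x - y) ** 2) with respect to weight."""
    return sum(2.0 * (weight * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(weight: float, batch: List[Sample], chunk_size: int) -> float:
    """Process the batch chunk by chunk and accumulate the gradient."""
    total = 0.0
    for start in range(0, len(batch), chunk_size):
        chunk = batch[start:start + chunk_size]
        # weight each chunk by its share of the full batch so that the sum
        # equals the gradient of the mean loss over the whole batch
        total += mse_gradient(weight, chunk) * len(chunk) / len(batch)
    return total


# 128 samples processed in chunks of 32, mirroring the numbers in the text
batch = [(float(i), 3.0 * i + 1.0) for i in range(128)]
full = mse_gradient(0.5, batch)
chunked = accumulated_gradient(0.5, batch, chunk_size=32)
print(math.isclose(full, chunked, rel_tol=1e-9))
```

Because only one chunk is held in memory at a time, peak memory depends on the chunk size while the resulting update matches that of the large mini-batch.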
Because mini-batch sizes can now be raised arbitrarily, we can execute the annealing strategy from the above paper. Do it like this:
trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "path/to/experiment/folder",
    # set initial mini-batch size
    mini_batch_size=32,
    # choose batch growth annealing
    batch_growth_annealing=True,
)
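The resulting schedule can be sketched as follows. This is an illustration under stated assumptions, not Flair's actual logic: it assumes the batch size is multiplied by a fixed factor whenever the dev score has failed to improve for a given number of consecutive epochs. The function name, the patience parameter and the growth factor are all hypothetical.

```python
from typing import List


def batch_size_schedule(dev_scores: List[float], initial_batch_size: int,
                        patience: int = 2, growth_factor: int = 2) -> List[int]:
    """Return the batch size used in each epoch.

    Assumption for illustration: the batch size is multiplied by
    growth_factor once the dev score has failed to improve for
    `patience` consecutive epochs.
    """
    batch_size = initial_batch_size
    best_score = float("-inf")
    bad_epochs = 0
    schedule = []
    for score in dev_scores:
        schedule.append(batch_size)  # batch size used for this epoch
        if score > best_score:
            best_score = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            # plateau reached: grow the batch instead of decaying the learning rate
            batch_size *= growth_factor
            bad_epochs = 0
    return schedule


scores = [0.70, 0.75, 0.74, 0.75, 0.78, 0.77, 0.77, 0.76]
print(batch_size_schedule(scores, initial_batch_size=32))
# → [32, 32, 32, 32, 64, 64, 64, 128]
```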
Document-level sequence labeling (#1194)
Introduces the option for reading entire documents into one Sentence object for sequence labeling. This option is now supported for the CONLL_03, CONLL_03_GERMAN and CONLL_03_DUTCH datasets, which indicate document boundaries.
Here's how to train a model on CoNLL-03 on the document level:
from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
# read CoNLL-03 with document_as_sequence=True
corpus = CONLL_03(in_memory=True, document_as_sequence=True)
# what tag do we want to predict?
tag_type = 'ner'
# make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
# init simple tagger with GloVe embeddings
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=WordEmbeddings('glove'),
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
)
# initialize trainer
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
# start training
trainer.train(
    'path/to/your/experiment',
    # set a much smaller mini-batch size because documents are huge
    mini_batch_size=2,
)
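To illustrate what reading at the document level means, the following library-free sketch groups column-format lines into one token sequence per document. It assumes document boundaries are marked by lines whose first column is -DOCSTART-, as in the CoNLL-03 files, and that the first column holds the token text. The function read_documents is a hypothetical helper, not part of Flair.

```python
from typing import Iterable, List

Document = List[str]  # flat list of tokens for one document


def read_documents(lines: Iterable[str], doc_marker: str = "-DOCSTART-") -> List[Document]:
    """Group column-format lines into one token sequence per document."""
    documents: List[Document] = []
    current: Document = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line marks a sentence boundary, ignored at document level
        if fields[0] == doc_marker:
            if current:
                documents.append(current)
            current = []
            continue
        current.append(fields[0])  # first column holds the token text
    if current:
        documents.append(current)
    return documents


sample = [
    "-DOCSTART- -X- O O", "",
    "EU B-ORG", "rejects O", "",
    "German B-MISC", "call O", "",
    "-DOCSTART- -X- O O", "",
    "Peter B-PER", "Blackburn I-PER",
]
print(read_documents(sample))
```

Since a whole document becomes a single sequence, sequences get much longer than individual sentences, which is why a small mini-batch size is advisable.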
Option to evaluate on training split (#1202)
Previously, the ModelTrainer only allowed monitoring of dev and test splits during training. Now, you can also monitor the train split to better check if your method is overfitting.
Support for Danish tagging (#1183)
Adds support for Danish POS and NER thanks to @AmaliePauli!
Use like this:
from flair.data import Sentence
from flair.models import SequenceTagger
# example sentence
sentence = Sentence("København er en fantastisk by .")
# load Danish NER model and predict
ner_tagger = SequenceTagger.load('da-ner')
ner_tagger.predict(sentence)
# print annotations (NER)
print(sentence.to_tagged_string())
# load Danish POS model and predict
pos_tagger = SequenceTagger.load('da-pos')
pos_tagger.predict(sentence)
# print annotations (NER + POS)
print(sentence.to_tagged_string())
Support for DistilBERT embeddings (#1044)
You can use them like this:
from flair.data import Sentence
from flair.embeddings import BertEmbeddings
embeddings = BertEmbeddings("distilbert-base-uncased")
s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)
for token in s.tokens:
    print(token.embedding)
    print(token.embedding.shape)
MongoDataset for reading text classification data from a Mongo database (#1192)
Adds the option of reading data from MongoDB. See this documentation on how to use this feature.
Feidegger corpus (#1199)
Adds a dataset downloader for the Feidegger corpus consisting of text-image pairs. Instantiate the corpus like this:
from flair.datasets import FeideggerCorpus
# instantiate Feidegger corpus
corpus = FeideggerCorpus()
# print a text-image pair
print(corpus.train[0])
Refactorings
Refactor checkpointing mechanism (#1101)
Refactored the checkpointing mechanism and slimmed down interfaces / code required to load checkpoints.
In detail:
- The methods save_checkpoint and load_checkpoint are no longer part of the flair.nn.Model interface. Instead, saving and restoring checkpoints is now (fully) performed by the ModelTrainer.
- The optimizer state and scheduler state are removed from the ModelTrainer constructor since they are no longer required here.
- Loading a checkpoint is now one line of code (previously two lines).
# 1. initialize trainer as always with a model and a corpus
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(model, corpus)
# 2. train your model for 2 epochs
trainer.train(
    'experiment/folder',
    max_epochs=2,
    # example checkpointing
    checkpoint=True,
)
# 3. load last checkpoint with one line of code
trainer = ModelTrainer.load_checkpoint('experiment/folder/checkpoint.pt', corpus)
# 4. continue training for 2 extra epochs
trainer.train('experiment/folder_2', max_epochs=4)
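The design idea, that the trainer alone owns everything needed to resume training, can be sketched without Flair. The ToyTrainer class below is purely hypothetical: it bundles model state, optimizer state and the epoch counter into a single JSON file, so that restoring takes one call. Flair's real checkpoint format and internals may differ.

```python
import json
import os
import tempfile
from typing import Dict, Optional


class ToyTrainer:
    """Hypothetical trainer that owns everything needed to resume training."""

    def __init__(self, model_state: Optional[Dict[str, float]] = None,
                 optimizer_state: Optional[Dict[str, float]] = None,
                 epoch: int = 0) -> None:
        self.model_state = model_state if model_state is not None else {"weight": 0.0}
        self.optimizer_state = optimizer_state if optimizer_state is not None else {"lr": 0.1}
        self.epoch = epoch

    def train(self, max_epochs: int) -> None:
        # stand-in for real training: advance the epoch and nudge the weight
        while self.epoch < max_epochs:
            self.epoch += 1
            self.model_state["weight"] += self.optimizer_state["lr"]

    def save_checkpoint(self, path: str) -> None:
        state = {"model_state": self.model_state,
                 "optimizer_state": self.optimizer_state,
                 "epoch": self.epoch}
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(state, handle)

    @classmethod
    def load_checkpoint(cls, path: str) -> "ToyTrainer":
        # one call restores model, optimizer and progress together
        with open(path, encoding="utf-8") as handle:
            return cls(**json.load(handle))


with tempfile.TemporaryDirectory() as folder:
    checkpoint_path = os.path.join(folder, "checkpoint.json")
    trainer = ToyTrainer()
    trainer.train(max_epochs=2)
    trainer.save_checkpoint(checkpoint_path)
    resumed = ToyTrainer.load_checkpoint(checkpoint_path)
    resumed.train(max_epochs=4)  # continues from epoch 2, not from scratch

print(resumed.epoch)
```

As in the example above, max_epochs is a total: resuming from epoch 2 with max_epochs=4 runs two more epochs.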
Refactor data sampling during training (#1154)
Adds a FlairSampler interface to better enable passing custom samplers to the ModelTrainer.
For instance, if you want to always shuffle your dataset in chunks of 5 to 10 sentences, you provide a sampler like this:
from flair.samplers import ChunkSampler
# your trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
# execute training run
trainer.train(
    'path/to/experiment/folder',
    max_epochs=150,
    # sample data in chunks of 5 to 10
    sampler=ChunkSampler(block_size=5, plus_window=5),
)
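What chunk-wise shuffling does to the data order can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than the code of ChunkSampler itself: it assumes one block length between block_size and block_size + plus_window is drawn per pass, the data is cut into consecutive blocks of that length, and only the order of the blocks is shuffled. The function chunk_shuffle and its seed parameter are hypothetical.

```python
import random
from typing import List, Optional


def chunk_shuffle(num_items: int, block_size: int = 5, plus_window: int = 5,
                  seed: Optional[int] = None) -> List[int]:
    """Return the indices 0..num_items-1 shuffled block-wise."""
    rng = random.Random(seed)
    # draw a block length in [block_size, block_size + plus_window]
    length = block_size + rng.randint(0, plus_window)
    indices = list(range(num_items))
    # cut into consecutive blocks; the last block may be shorter
    blocks = [indices[i:i + length] for i in range(0, num_items, length)]
    # shuffle the blocks but keep the order inside each block
    rng.shuffle(blocks)
    return [index for block in blocks for index in block]


order = chunk_shuffle(23, block_size=5, plus_window=5, seed=0)
print(order)
```

Neighbouring sentences stay together inside a block, which preserves some local context while still varying the order between epochs.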
Other refactorings
- Switch everything to batch first mode (#1077)
- Refactor classification to be more consistent with SequenceTagger (#1151)
- PyTorch-Transformers -> Transformers (#1163)
- In-place transpose of tensors (#1047)
Enhancements
Documentation fixes (#1045 #1098 #1121 #1157 #1160 #1168)
Add option to set rnn_type used in SequenceTagger (#1113)
Accept string as input in NER predict (#1142)
Example usage:
from flair.models import SequenceTagger
# init tagger
tagger = SequenceTagger.load('ner')
# predict over list of strings
sentences = tagger.predict(
    [
        'George Washington went to Berlin .',
        'George Berlin lived in Washington .',
    ]
)
# output predictions
for sentence in sentences:
    print(sentence.to_tagged_string())
Enable One-hot Embeddings of other Tags (#1191)
Bug fixes
- Fix the learning rate finder (#1119)
- Fix OneHotEmbeddings on CUDA (#1147)
- Fix encoding error in CSVClassificationDataset (#1055)
- Fix encoding errors related to old Windows chars (#1135)
- Fix length error in CharacterEmbeddings (#1088)
- Fix tokenizer inserting empty tokens into the Sentence object (#1226)
- Ensure StackedEmbeddings always has the same embedding order (#1114)
- Use $HOME instead of ~ for cache_root (#1134)