Release 0.4.0: Supervised learning datasets and baselines · pytorch/text

Highlights

torchtext 0.4.0 includes several example scripts that showcase how to create data, build vocabularies, train, test and run inference for common supervised learning baselines. We further provide a tutorial to explain these examples in more detail.

For an advanced application of these constructs, see the iterable_train.py example.

We would like to thank the open source community, who continue to send pull requests for new features and bug fixes.

ngrams_iterator: an iterator that yields n-grams based on a given list or iterator of strings (#567 #577)
build_vocab_from_iterator (#567)
extract_archive (#569)
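
To make the first two utilities concrete, here is a minimal standalone sketch of the idea behind n-gram iteration and building a vocabulary from an iterator. It is not torchtext's implementation: the helper names sketch_ngrams and sketch_build_vocab, the default specials, and the tie-breaking order are assumptions chosen for illustration, and the real functions may order or index their output differently.

```python
from collections import Counter


def sketch_ngrams(tokens, max_n):
    """Yield every n-gram of length 1..max_n as a space-joined string.

    Hypothetical helper, not the torchtext function.
    """
    tokens = list(tokens)  # accept any iterable of strings
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def sketch_build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Map tokens to indices: specials first, then by descending frequency.

    Hypothetical helper; the specials and ordering are assumptions.
    """
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    itos = list(specials)
    # Sort by frequency (descending), breaking ties alphabetically.
    for token, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if token not in specials:
            itos.append(token)
    return {token: index for index, token in enumerate(itos)}


tokens = ["here", "we", "are"]
grams = list(sketch_ngrams(tokens, 2))   # unigrams, then bigrams
vocab = sketch_build_vocab([grams])      # one "document" of n-grams
```

Feeding the n-grams, rather than the raw tokens, into the vocabulary is what lets a bag-of-n-grams baseline treat "here we" as a single feature.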

Added logging to download_from_url (#569)
Added fast, basic English sentence normalization to get_tokenizer (#569 #568)
Updated docs theme to pytorch_sphinx_theme (#573)
Refined Example.fromJSON() to support nested keys when parsing nested JSON datasets (#563)
Added __len__ and get_vecs_by_tokens to the Vectors class; get_vecs_by_tokens generates vectors from a list of tokens (#561)
Added templates for torchtext users to bring up issues (#553 #574)
Added a new argument specials in Field.build_vocab to save the user-defined special tokens (#495)
Added a new argument is_target to the RawField class to indicate whether the field is a target variable; it is False by default (#459). Set the is_target argument in LabelField to True so that it takes effect (#450)
Added the option to serialize fields with torch.save or pickle.dump, allowing tokenizers in different languages (#453)
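
As an illustration of what nested-key support means for a nested JSON dataset, the sketch below resolves a key path against a decoded JSON record. It is a standalone approximation, not the Example.fromJSON() code: the helper name get_nested, the dotted-path separator, and the sample record are all assumptions made for this example.

```python
import json


def get_nested(record, dotted_key):
    """Follow a dotted key path through nested dicts (hypothetical helper).

    When a list is encountered, the remaining path is applied to each item.
    """
    value = record
    for part in dotted_key.split("."):
        if isinstance(value, list):
            value = [item[part] for item in value]
        else:
            value = value[part]
    return value


# Sample record invented for illustration only.
record = json.loads(
    '{"id": 7, "review": {"text": "great film", '
    '"tags": [{"name": "drama"}, {"name": "classic"}]}}'
)

text = get_nested(record, "review.text")
tag_names = get_nested(record, "review.tags.name")
```

A missing key raises KeyError here, which keeps malformed records from silently producing empty fields.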

Allow caching from unverified SSL in CharNGram (#554)
Fix the wrong unk index by generating the unk_index according to the specials (#531)
Update Moses tokenizer link in README.rst file (#529)
Fix the url to load wiki.simple.vec (#525), fix the dead url to load fastText vectors (#521)
Fix UnicodeDecodeError for loading sequence tagging dataset (#506)
Fix collisions between OOV words and in-vocab words caused by Issue #447 (#482)
Fix a mistake in the progress bar of Vectors class (#480)
Add the dependency on six under 'install_requires' in the setup.py file (PR #475 for Issue #465)
Fix a bug in Field class that caused the stop_words attribute to be overwritten (PR #458 for Issue #457)
Transpose the text and target tensors if the text field in BPTTIterator has 'batch_first' set to True (#462)
Add <unk> to default specials (#567)
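
The unk index fix (#531) is easier to picture with a small example. The sketch below shows why the index for unknown words should be derived from the specials list rather than assumed to be 0. It is a standalone illustration under stated assumptions: build_stoi and lookup are hypothetical helpers, and the notes do not show how the actual fix is implemented.

```python
UNK = "<unk>"


def build_stoi(tokens, specials):
    """Build a token-to-index map with specials first (hypothetical helper).

    The unknown-token index is read from the position of "<unk>" in
    specials instead of being hard-coded.
    """
    unique_tokens = [t for t in dict.fromkeys(tokens) if t not in specials]
    itos = list(specials) + unique_tokens
    stoi = {token: index for index, token in enumerate(itos)}
    unk_index = specials.index(UNK) if UNK in specials else None
    return stoi, unk_index


def lookup(stoi, unk_index, token):
    """Return the token's index, falling back to unk_index for OOV tokens."""
    if token in stoi:
        return stoi[token]
    if unk_index is None:
        raise KeyError(token)
    return unk_index


# "<unk>" is deliberately not the first special token.
stoi, unk_index = build_stoi(["cat", "dog", "cat"], ["<pad>", "<unk>"])
oov_index = lookup(stoi, unk_index, "bird")
```

With a hard-coded index of 0, the out-of-vocabulary word "bird" would have been mapped to the padding token in this example.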