0.4.0: Supervised learning datasets and baselines
Highlights
Supervised learning baselines
torchtext 0.4.0 includes several example scripts that showcase how to create data, build vocabularies, train, test and run inference for common supervised learning baselines. We further provide a tutorial to explain these examples in more detail.
For an advanced application of these constructs see the iterable_train.py example.
Community
We would like to thank the open source community, which continues to send pull requests for new features and bug fixes.
Major New Features
- New datasets for supervised learning (#557 #565 #580)
  - AG_NEWS
  - SogouNews
  - DBpedia
  - YelpReviewPolarity
  - YelpReviewFull
  - YahooAnswers
  - AmazonReviewPolarity
  - AmazonReviewFull
- Tutorials and examples:
  - Reference examples (#569 #575 #571 #576) to
    - Create/save text classification datasets
    - Train and test a text classification model using one-line data loading and iterator-based Datasets
    - Set up online inference based on a trained model
  - A tutorial to showcase and illustrate these examples
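The "create a text classification dataset" step can be pictured with a small standard-library sketch. It does not use torchtext and is not the code from the reference examples. It assumes CSV rows of the form `label,title,description` with labels starting at 1, and the helper name `rows_to_examples`, the whitespace tokenizer, and the tiny `stoi` mapping are all hypothetical.

```python
import csv
import io


def rows_to_examples(csv_text, stoi, unk_index=0):
    """Turn CSV rows into (label, token_ids) pairs plus the set of labels seen."""
    examples = []
    labels = set()
    for row in csv.reader(io.StringIO(csv_text)):
        label = int(row[0]) - 1  # assumption: labels in the file start at 1
        # Assumption: lowercase + whitespace split stands in for a real tokenizer.
        tokens = " ".join(row[1:]).lower().split()
        ids = [stoi.get(token, unk_index) for token in tokens]
        examples.append((label, ids))
        labels.add(label)
    return examples, labels


sample = '3,"Wall St. Bears","Short-sellers are back"\n1,"Peace talks","Talks resume"\n'
stoi = {"<unk>": 0, "wall": 1, "talks": 2}  # hypothetical toy vocabulary
examples, labels = rows_to_examples(sample, stoi)
print(examples)
print(sorted(labels))
```

Each example pairs a zero-based label with a list of vocabulary indices, which is the kind of structure a classifier's collate function would then batch.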
New Features
- ngrams_iterator: an iterator that yields n-grams based on a given list or iterator of strings (#567 #577)
- build_vocab_from_iterator (#567)
- extract_archive (#569)
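To illustrate what an n-gram iterator and a vocabulary built from an iterator do, here is a minimal standard-library sketch. It is not torchtext's implementation: the names `ngrams_sketch` and `build_vocab_sketch`, the default specials, and the tie-breaking order are assumptions made for this example.

```python
from collections import Counter


def ngrams_sketch(tokens, ngrams):
    """Yield every token, then every n-gram (n = 2..ngrams) joined by spaces."""
    tokens = list(tokens)
    for token in tokens:
        yield token
    for n in range(2, ngrams + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def build_vocab_sketch(iterator, specials=("<unk>", "<pad>")):
    """Count tokens from an iterator of token sequences and map each to an index."""
    counter = Counter()
    for tokens in iterator:
        counter.update(tokens)
    itos = list(specials)
    seen = set(itos)
    # Most frequent first; ties broken alphabetically (an assumption of this sketch).
    for token, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if token not in seen:
            itos.append(token)
            seen.add(token)
    return {token: index for index, token in enumerate(itos)}


corpus = [["here", "we", "are"], ["here", "we", "go"]]
bigrams = list(ngrams_sketch(corpus[0], 2))
vocab = build_vocab_sketch(ngrams_sketch(doc, 2) for doc in corpus)
print(bigrams)
print(vocab["here we"])
```

Feeding the n-gram stream into the vocabulary builder means bigrams such as "here we" get their own index alongside single tokens.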
Improvements
- Added logging to download_from_url (#569)
- Added fast, basic English sentence normalization to get_tokenizer (#569 #568)
- Updated docs theme to pytorch_sphinx_theme (#573)
- Refined Example.fromJSON() to support parsing nested keys in nested JSON datasets (#563)
- Added `__len__` & `get_vecs_by_tokens` in 'Vectors' class to generate vector from a list of tokens (#561)
- Added templates for torchtext users to bring up issues (#553 #574)
- Added a new argument `specials` in Field.build_vocab to save the user-defined special tokens (#495)
- Added a new argument `is_target` in `RawField` class to show whether the field is a target variable - False by default (#459). Adjusted `is_target` argument in LabelField to True to take it into effect (#450)
- Added the option to serialize fields with `torch.save` or `pickle.dump`, allow tokenizers in different languages (#453)
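As an illustration of the nested-key support mentioned for Example.fromJSON() above, the sketch below resolves a key path inside a parsed JSON record. It assumes nested keys are written as dot-separated paths and that a list along the path is mapped element by element. The helper `get_nested` is hypothetical and stands apart from torchtext.

```python
import json
from functools import reduce


def get_nested(obj, dotted_key):
    """Follow a dot-separated key path through nested dicts and lists of dicts."""
    def step(current, key):
        if isinstance(current, list):
            # Apply the key to every element when the path passes through a list.
            return [item[key] for item in current]
        return current[key]

    return reduce(step, dotted_key.split("."), obj)


record = json.loads(
    '{"id": 7, "foods": {"vegetables": [{"name": "leek"}, {"name": "kale"}]}}'
)
names = get_nested(record, "foods.vegetables.name")
print(names)
```

A flat key such as `"id"` goes through the same function unchanged, so one lookup path covers both flat and nested records.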
Bug Fixes
- Allow caching from unverified SSL in `CharNGram` (#554)
- Fix the wrong `unk` index by generating the unk_index according to the specials (#531)
- Update Moses tokenizer link in README.rst file (#529)
- Fix the url to load `wiki.simple.vec` (#525), fix the dead url to load `fastText` vectors (#521)
- Fix `UnicodeDecodeError` for loading sequence tagging dataset (#506)
- Fix collisions between oov words and in-vocab words caused by Issue #447 (#482)
- Fix a mistake in the processing bar of Vectors class (#480)
- Add the dependency to `six` under 'install_requires' in the setup.py file (PR #475 for Issue #465)
- Fix a bug in `Field` class which causes overwriting the `stop_words` attribute (PR #458 for Issue #457)
- Transpose the text and target tensors if the text field in BPTTIterator has 'batch_first' set to True (#462)
- Add `<unk>` to default specials (#567)
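The unk index fix listed above can be pictured with a small sketch: the fallback index for unknown words is taken from the position of the unknown token in the specials list rather than hard-coded. This is an illustration only, not the library's code. The function `build_stoi` and its behaviour when the unknown token is absent are assumptions of this example.

```python
from collections import defaultdict

UNK = "<unk>"


def build_stoi(tokens, specials):
    """Map tokens to indices; unknown tokens fall back to the index of UNK."""
    specials = list(specials)
    # Specials come first, followed by the remaining tokens in first-seen order.
    itos = specials + [t for t in dict.fromkeys(tokens) if t not in specials]
    if UNK in specials:
        unk_index = specials.index(UNK)  # derived from the specials, not fixed at 0
        stoi = defaultdict(lambda: unk_index)
    else:
        stoi = defaultdict()  # no default factory: unknown tokens raise KeyError
    stoi.update({token: index for index, token in enumerate(itos)})
    return stoi


stoi = build_stoi(["the", "cat", "the"], specials=["<pad>", "<unk>"])
print(stoi["cat"], stoi["zebra"])
```

With `<pad>` listed first, an out-of-vocabulary word maps to index 1, the position of `<unk>`, instead of colliding with the padding index.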
Backward Compatibility
- Dropped support for Python 2.7.9 (#552)