v2.5.0
Release 2.5
Guides, tutorials, and API docs are now published at http://tensorflow.org/text, which should make our documentation easier to find. We worked hard on improving docs across the board, so let us know if anything needs further clarification.
Major Features and Improvements
- API docs, guides, & tutorials are now available on http://tensorflow.org/text.
- New guides & tutorials, including tokenizers, subword tokenizers, and a BERT text preprocessing guide.
- Add RoundRobinTrimmer.
- Add a function to generate a BERT vocab from a tf.data.Dataset.
- Add detokenize methods for BertTokenizer and WordpieceTokenizer.
- Enable NFD and NFKD in NormalizeWithOffset op.
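To illustrate the idea behind the new RoundRobinTrimmer, here is a minimal pure-Python sketch of round-robin budget allocation: a shared length budget is handed out one item at a time across segments, and anything beyond a segment's allocation is dropped. The function names `round_robin_budget` and `trim` are hypothetical and exist only for this example; this is not the library's implementation and does not use the tensorflow_text API.

```python
from typing import List, Sequence


def round_robin_budget(segment_lengths: Sequence[int], max_seq_length: int) -> List[int]:
    """Return how many leading items each segment keeps under a shared budget.

    Hypothetical helper: hands out one slot per segment per pass until the
    budget is used up or every segment is fully covered.
    """
    kept = [0] * len(segment_lengths)
    remaining = max_seq_length
    progressed = True
    while remaining > 0 and progressed:
        progressed = False
        for i, length in enumerate(segment_lengths):
            if remaining == 0:
                break
            if kept[i] < length:
                kept[i] += 1
                remaining -= 1
                progressed = True
    return kept


def trim(segments: Sequence[Sequence[str]], max_seq_length: int) -> List[List[str]]:
    """Keep only the leading items of each segment that fit its allocation."""
    budgets = round_robin_budget([len(s) for s in segments], max_seq_length)
    return [list(s[:b]) for s, b in zip(segments, budgets)]


segments = [["a", "b", "c", "d"], ["e"], ["f", "g", "h"]]
trimmed = trim(segments, max_seq_length=5)
print(trimmed)
```

Compared with always trimming the longest segment first, a round-robin pass gives short segments a fair chance to survive intact before longer ones consume the budget.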
Bug Fixes and Other Changes
- Many API updates (e.g., adding descriptions & examples) to various ops.
- Let SentencePieceTokenizer optionally return the nbest tokenizations instead of sampling from them.
- Fix a bug in split mode tokenizers that caused tests to fail on Windows.
- Fix broadcasting bugs in RoundRobinTrimmer.
- Add WordpieceTokenizeWithOffsets with ALLOW_STATEFUL_OP_FOR_DATASET_FUNCTIONS for tf.data.
- Remove PersistentTensor from sentencepiece_kernels.cc
- Document examples are now tested.
- Fix benchmarking of graph mode ops through use of tf.function.
- Set the default for mask_token for StringLookup and IntegerLookup to None
- Update the sentence_breaking_ops docstring to indicate that it's deprecated.
- Add an i18n-friendly BasicTokenizer that can preserve accents.
- For Windows, always include ICU data files since they need to be built in statically.
- Rename documentation file WordShape.md to WordShape_cls.md. Fix #361.
- Convert input to tensor to allow for numpy inputs to state based sentence breaker.
- Add classifiers to py packages and fix header image.
- Fix for the model server test.
- Update regression test for break_sentences_with_offsets.
- Add a shape attribute to the ToDense Keras layer.
- Add support for [batch, 1] shaped inputs in StateBasedSentenceBreaker.
- Refactor saved_model.py to make it easier to comment out blocks of related code to identify problems.
- Add regression test for Find Source Offsets.
- Fix unselectable_ids shape check in ItemSelector.
- Switch out architecture image in tf.Text documentation.
- Fix regression test for state_based_sentence_breaker_v2.
- Update run_build with enable_runfiles flag.
- Update the version of bazel_skylib to match TF's and fix a possible visibility issue.
- Simplify tf-text WORKSPACE, by relying on tf_workspace().
- Update transformer.ipynb to use a saved text.BertTokenizer.
- Update mobile targets to use :mobile rather than separate :android & :ios targets.
- Make tools part of the tensorflow_text pip package.
- Import tools from the tf-text package, instead of cloning the git repo.
- Minor cleanups to make some code compile on the android build system.
- Fix pip install command in README.
- Fix tools pip package inclusion.
- Add a tensorflow.org-compatible docs generator for tf-text.
- Sample random tokens correctly during MLM.
- Treat Sentencepiece ops as stateful in tf.data pipelines.
- Replace use of TFT's deprecated dataset_schema.from_feature_spec with its replacement, schema_utils.schema_from_feature_spec.
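Two entries in these notes touch on Unicode normalization: NFD/NFKD support in NormalizeWithOffset and a BasicTokenizer that can preserve accents. As background, the sketch below uses only Python's standard `unicodedata` module (not tensorflow_text) to show what the two forms do and why accent stripping is lossy. The helper name `strip_accents` is hypothetical, and stripping combining marks after NFD is shown as a common approach, assumed here rather than taken from the library's code.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop combining marks (category 'Mn').

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


word = "caf\u00e9"      # "café" with a precomposed e-acute
ligature = "\ufb01"     # the single-character "fi" ligature

# NFD splits the accented letter into a base letter plus a combining mark.
nfd_word = unicodedata.normalize("NFD", word)

# NFD leaves compatibility characters alone; NFKD also expands them.
nfd_ligature = unicodedata.normalize("NFD", ligature)
nfkd_ligature = unicodedata.normalize("NFKD", ligature)

print(len(word), len(nfd_word))
print(strip_accents(word))
print(nfd_ligature == ligature, nfkd_ligature)
```

Because stripping accents maps distinct words onto the same string in many languages, an option to keep them matters for non-English text.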