Merge pull request #348 from BrikerMan/develop
Release v1.1.2
BrikerMan authored Mar 27, 2020
2 parents 4ae3ae6 + ce671ac commit 6d2970e
Showing 17 changed files with 210 additions and 57 deletions.
37 changes: 3 additions & 34 deletions README.md
@@ -58,7 +58,7 @@ Kashgari is a simple and powerful NLP Transfer learning framework, build a state

| Task | Language | Dataset | Score | Detail |
| ------------------------ | -------- | ------------------------- | -------------- | -------------------------------------------------------------------------------------------------------- |
-| Named Entity Recognition | Chinese | People's Daily Ner Corpus | **94.46** (F1) | [Text Labeling Performance Report](https://kashgari.bmio.net/tutorial/text-labeling/#performance-report) |
+| Named Entity Recognition | Chinese | People's Daily Ner Corpus | **94.46** (F1) | [Text Labeling Performance Report](https://kashgari.rtfd.io/tutorial/text-labeling.html#performance-report) |

## Tutorials

@@ -170,7 +170,7 @@ Support this project by becoming a sponsor. Your issues and feature request will

## Contributors ✨

-Thanks goes to these wonderful people. And there are many ways to get involved. Start with the [contributor guidelines](https://kashgari.bmio.net/about/contributing/) and then check these open issues for specific tasks.
+Thanks goes to these wonderful people. And there are many ways to get involved. Start with the [contributor guidelines](./docs/about/contributing.md) and then check these open issues for specific tasks.

<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
<!-- prettier-ignore-start -->
@@ -199,35 +199,4 @@ This library is inspired by and references following frameworks and papers.
- [flair - A very simple framework for state-of-the-art Natural Language Processing (NLP)](https://github.com/zalandoresearch/flair)
- [anago - Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging](https://github.com/Hironsan/anago)
- [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors)

-This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!
-
-## Contributors
-
-### Code Contributors
-
-This project exists thanks to all the people who contribute. [[Contribute](CONTRIBUTING.md)].
-<a href="https://github.com/BrikerMan/Kashgari/graphs/contributors"><img src="https://opencollective.com/Kashgari/contributors.svg?width=890&button=false" /></a>
-
-### Financial Contributors
-
-Become a financial contributor and help us sustain our community. [[Contribute](https://opencollective.com/Kashgari/contribute)]
-
-#### Individuals
-
-<a href="https://opencollective.com/Kashgari"><img src="https://opencollective.com/Kashgari/individuals.svg?width=890"></a>
-
-#### Organizations
-
-Support this project with your organization. Your logo will show up here with a link to your website. [[Contribute](https://opencollective.com/Kashgari/contribute)]
-
-<a href="https://opencollective.com/Kashgari/organization/0/website"><img src="https://opencollective.com/Kashgari/organization/0/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/1/website"><img src="https://opencollective.com/Kashgari/organization/1/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/2/website"><img src="https://opencollective.com/Kashgari/organization/2/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/3/website"><img src="https://opencollective.com/Kashgari/organization/3/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/4/website"><img src="https://opencollective.com/Kashgari/organization/4/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/5/website"><img src="https://opencollective.com/Kashgari/organization/5/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/6/website"><img src="https://opencollective.com/Kashgari/organization/6/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/7/website"><img src="https://opencollective.com/Kashgari/organization/7/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/8/website"><img src="https://opencollective.com/Kashgari/organization/8/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/9/website"><img src="https://opencollective.com/Kashgari/organization/9/avatar.svg"></a>
+- [bert4keras - Our light reimplement of bert for keras](https://github.com/bojone/bert4keras/)
5 changes: 5 additions & 0 deletions docs/about/release-notes.md
@@ -17,6 +17,11 @@ pip show kashgari

## Current Release

+### [1.1.2] - 2020.03.27
+
+- ✨ Add save best model callback `KashgariModelCheckpoint`.
+- ⬆️ Upgrading `bert4keras` version to `0.6.5`.
+
### [1.1.1] - 2020.03.13

- ✨ Add BERTEmbeddingV2.
2 changes: 1 addition & 1 deletion docs/api/embeddings.md
@@ -11,7 +11,7 @@ Embedding layers have its own \_\_init\_\_ function, check it out from their doc
| [BERTEmbedding](../embeddings/bert-embedding.md) | pre-trained BERT embedding |
| [GPT2Embedding](../embeddings/gpt2-embedding.md) | pre-trained GPT-2 embedding |
| [NumericFeaturesEmbedding](../embeddings/numeric-features-embedding.md) | random init `tf.keras.layers.Embedding` layer for numeric feature embedding |
-| [StackedEmbedding](../embeddings/stacked-embeddingmd) | stack other embeddings for multi-input model |
+| [StackedEmbedding](../embeddings/stacked-embedding.md) | stack other embeddings for multi-input model |

All embedding layer shares same API except the `__init__` function.

4 changes: 3 additions & 1 deletion docs/conf.py
@@ -261,12 +261,14 @@ def setup(app):
rst_readme = os.path.join(docs_path, 'README.rst')

# Update all .md files, for fixing links
-update_markdown_content(docs_path)
+if os.environ.get('READTHEDOCS') == 'True':
+    update_markdown_content(docs_path)

# Change readme to rst file, and include in Sphinx index
with open(rst_readme, 'w') as f:
    md_content = open(original_readme, 'r').read()
+    md_content = md_content.replace('(./docs/', '(./')
    md_content = md_content.replace('.md)', '.html)')
    f.write(convert(md_content))
print(f'Saved RST file to {rst_readme}')

6 changes: 3 additions & 3 deletions docs/embeddings/bert-embedding_v2.md
@@ -18,7 +18,7 @@ BERTEmbeddingV2 support models:
When using pre-trained embedding, remember to use same tokenize tool with the embedding model, this will allow to access the full power of the embedding

```python
-kashgari.embeddings.BERTEmbedding(vacab_path: str,
+kashgari.embeddings.BERTEmbedding(vocab_path: str,
                                  config_path: str,
                                  checkpoint_path: str,
                                  bert_type: str = 'bert',
@@ -30,10 +30,10 @@ kashgari.embeddings.BERTEmbedding(vacab_path: str,

**Arguments**

-- **vacab_path**: path of model's `vacab.txt` file
+- **vocab_path**: path of model's `vacab.txt` file
- **config_path**: path of model's `model.json` file
- **checkpoint_path**: path of model's checkpoint file
-- **bert_type**: `bert`, `albert`, `nezha`. Type of BERT model.
+- **bert_type**: `bert`, `albert`, `nezha`, `electra`, `gpt2_ml`, `t5`. Type of BERT model.
- **task**: `kashgari.CLASSIFICATION` `kashgari.LABELING`. Downstream task type, If you only need to feature extraction, just set it as `kashgari.CLASSIFICATION`.
- **sequence_length**: `'auto'` or integer. When using `'auto'`, use the 95% of corpus length as sequence length. If using an integer, let's say `50`, the input output sequence length will set to 50.

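
As a usage sketch with the corrected `vocab_path` argument; the checkpoint paths below are placeholders for a locally downloaded model, not files shipped with this commit:

```python
import kashgari
from kashgari.embeddings import BERTEmbeddingV2

# Placeholder paths; point these at a real downloaded BERT/ALBERT checkpoint.
embedding = BERTEmbeddingV2(vocab_path='albert_base/vocab_chinese.txt',
                            config_path='albert_base/albert_config.json',
                            checkpoint_path='albert_base/model.ckpt-best',
                            bert_type='albert',
                            task=kashgari.CLASSIFICATION,
                            sequence_length=100)
```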
2 changes: 1 addition & 1 deletion docs/embeddings/index.md
@@ -9,7 +9,7 @@ Kashgari provides several embeddings for language representation. Embedding laye
| [BERTEmbedding](bert-embedding.md) | pre-trained BERT embedding |
| [GPT2Embedding](gpt2-embedding.md) | pre-trained GPT-2 embedding |
| [NumericFeaturesEmbedding](numeric-features-embedding.md) | random init `tf.keras.layers.Embedding` layer for numeric feature embedding |
-| [StackedEmbedding](./stacked-embeddingmd) | stack other embeddings for multi-input model |
+| [StackedEmbedding](./stacked-embedding.md) | stack other embeddings for multi-input model |

All embedding classes inherit from the `Embedding` class and implement the `embed()` to embed your input sequence and `embed_model` property which you need to build you own Model. By providing the `embed()` function and `embed_model` property, Kashgari hides the the complexity of different language embedding from users, all you need to care is which language embedding you need.

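
As a concrete sketch of the shared `embed()` API described in this file, using `BareEmbedding` (the tiny sample corpus here is invented for illustration):

```python
import kashgari
from kashgari.embeddings import BareEmbedding

embedding = BareEmbedding(task=kashgari.CLASSIFICATION,
                          sequence_length=10,
                          embedding_size=100)
# The vocabulary must be built from a corpus before embed() can run.
embedding.analyze_corpus([['all', 'work', 'and', 'no', 'play']], ['label_a'])
vectors = embedding.embed([['all', 'work', 'and', 'no', 'play']])
print(vectors.shape)  # expected shape: (1, 10, 100), one sentence padded to length 10
```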
2 changes: 1 addition & 1 deletion docs/embeddings/numeric-features-embedding.md
@@ -17,4 +17,4 @@ kashgari.embeddings.NumericFeaturesEmbedding(feature_count: int,
- **feature_count**: count of the features of this embedding.
- **feature_name**: name of the feature.
- **sequence_length**: `'auto'`, `'variable'` or integer. When using `'auto'`, use the 95% of corpus length as sequence length. When using `'variable'`, model input shape will set to None, which can handle various length of input, it will use the length of max sequence in every batch for sequence length. If using an integer, let's say `50`, the input output sequence length will set to 50.
-- **embedding_size**: Dimension of the dense embedding.
\ No newline at end of file
+- **embedding_size**: Dimension of the dense embedding.
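
A short construction sketch for this embedding; the feature name and counts are invented for illustration, and the stacking step is shown after the StackedEmbedding section below:

```python
from kashgari.embeddings import NumericFeaturesEmbedding

# A hypothetical binary per-token feature such as "is_bold": two possible values.
feature_embedding = NumericFeaturesEmbedding(feature_count=2,
                                             feature_name='is_bold',
                                             sequence_length=100,
                                             embedding_size=30)
```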
2 changes: 1 addition & 1 deletion docs/embeddings/stacked-embedding.md
@@ -11,4 +11,4 @@ kashgari.embeddings.StackedEmbedding(embeddings: List[Embedding],

**Arguments**

-- **embeddings**: list of embedding object.
\ No newline at end of file
+- **embeddings**: list of embedding object.
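
Continuing the sketch above, a stacked multi-input embedding might be assembled like this (the task choice and sequence lengths are assumptions):

```python
import kashgari
from kashgari.embeddings import (BareEmbedding, NumericFeaturesEmbedding,
                                 StackedEmbedding)

text_embedding = BareEmbedding(task=kashgari.LABELING, sequence_length=100)
feature_embedding = NumericFeaturesEmbedding(feature_count=2,
                                             feature_name='is_bold',
                                             sequence_length=100)
# Order matters: downstream models consume inputs in this same order.
stack_embedding = StackedEmbedding([text_embedding, feature_embedding])
```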
4 changes: 2 additions & 2 deletions docs/tutorial/text-classification.md
@@ -14,7 +14,7 @@ You could easily switch from one model to another just by changing one line of c
| CNN\_LSTM\_Model | |
| CNN\_GRU\_Model | |
| AVCNN\_Model | |
-| KMax\_CNN]\_Model | |
+| KMax\_CNN\_Model | |
| R\_CNN\_Model | |
| AVRNN\_Model | |
| Dropout\_BiGRU\_Model | |
@@ -128,7 +128,7 @@ model.build_model(train_x, train_y, valid_x, valid_y)
optimizer = RAdam()
model.compile_model(optimizer=optimizer)

-# Train model 
+# Train model
model.fit(train_x, train_y, valid_x, valid_y)
```

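
To make the "change one line" claim concrete, a minimal sketch; the SMP2018 corpus is borrowed from this repo's other examples and is an assumption here:

```python
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model, CNN_Model

train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')

# Swapping BiLSTM_Model() for CNN_Model() is the only change needed.
model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y)
```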
2 changes: 1 addition & 1 deletion kashgari/__version__.py
@@ -7,4 +7,4 @@
# file: __version__.py.py
# time: 2019-05-20 16:32

-__version__ = '1.1.1'
+__version__ = '1.1.2'
113 changes: 111 additions & 2 deletions kashgari/callbacks.py
@@ -7,11 +7,16 @@
# file: callbacks.py
# time: 2019-05-22 15:00

+import logging
+import os
+
-from seqeval import metrics as seq_metrics
from sklearn import metrics
-from kashgari import macros
from tensorflow.python import keras
+from tensorflow.python.keras import backend as K
+
+from kashgari import macros
+from kashgari.tasks.base_model import BaseModel
+from seqeval import metrics as seq_metrics


class EvalCallBack(keras.callbacks.Callback):
@@ -59,5 +64,109 @@ def on_epoch_end(self, epoch, logs=None):
        print(f"\nepoch: {epoch} precision: {precision:.6f}, recall: {recall:.6f}, f1: {f1:.6f}")


+class KashgariModelCheckpoint(keras.callbacks.ModelCheckpoint):
+    """Save the model after every epoch.
+    Arguments:
+        filepath: string, path to save the model file.
+        monitor: quantity to monitor.
+        verbose: verbosity mode, 0 or 1.
+        save_best_only: if `save_best_only=True`, the latest best model according
+            to the quantity monitored will not be overwritten.
+        mode: one of {auto, min, max}. If `save_best_only=True`, the decision to
+            overwrite the current save file is made based on either the maximization
+            or the minimization of the monitored quantity. For `val_acc`, this
+            should be `max`, for `val_loss` this should be `min`, etc. In `auto`
+            mode, the direction is automatically inferred from the name of the
+            monitored quantity.
+        save_weights_only: if True, then only the model's weights will be saved
+            (`model.save_weights(filepath)`), else the full model is saved
+            (`model.save(filepath)`).
+        save_freq: `'epoch'` or integer. When using `'epoch'`, the callback saves
+            the model after each epoch. When using integer, the callback saves the
+            model at end of a batch at which this many samples have been seen since
+            last saving. Note that if the saving isn't aligned to epochs, the
+            monitored metric may potentially be less reliable (it could reflect as
+            little as 1 batch, since the metrics get reset every epoch). Defaults to
+            `'epoch'`
+        **kwargs: Additional arguments for backwards compatibility. Possible key
+            is `period`.
+    """
+
+    def __init__(self,
+                 filepath,
+                 monitor='val_loss',
+                 verbose=0,
+                 save_best_only=False,
+                 save_weights_only=False,
+                 mode='auto',
+                 save_freq='epoch',
+                 kash_model: BaseModel = None,
+                 **kwargs):
+        super(KashgariModelCheckpoint, self).__init__(
+            filepath=filepath,
+            monitor=monitor,
+            verbose=verbose,
+            save_best_only=save_best_only,
+            save_weights_only=save_weights_only,
+            mode=mode,
+            save_freq=save_freq,
+            **kwargs)
+        self.kash_model = kash_model
+
+    def _save_model(self, epoch, logs):
+        """Saves the model.
+        Arguments:
+            epoch: the epoch this iteration is in.
+            logs: the `logs` dict passed in to `on_batch_end` or `on_epoch_end`.
+        """
+        logs = logs or {}
+
+        if isinstance(self.save_freq,
+                      int) or self.epochs_since_last_save >= self.period:
+            self.epochs_since_last_save = 0
+            file_handle, filepath = self._get_file_handle_and_path(epoch, logs)
+
+            if self.save_best_only:
+                current = logs.get(self.monitor)
+                if current is None:
+                    logging.warning('Can save best model only with %s available, '
+                                    'skipping.', self.monitor)
+                else:
+                    if self.monitor_op(current, self.best):
+                        if self.verbose > 0:
+                            print('\nEpoch %05d: %s improved from %0.5f to %0.5f,'
+                                  ' saving model to %s' % (epoch + 1, self.monitor, self.best,
+                                                           current, filepath))
+                        self.best = current
+                        if self.save_weights_only:
+                            filepath = os.path.join(filepath, 'cp')
+                            self.model.save_weights(filepath, overwrite=True)
+                        else:
+                            self.kash_model.save(filepath)
+                    else:
+                        if self.verbose > 0:
+                            print('\nEpoch %05d: %s did not improve from %0.5f' %
+                                  (epoch + 1, self.monitor, self.best))
+            else:
+                if self.verbose > 0:
+                    print('\nEpoch %05d: saving model to %s' % (epoch + 1, filepath))
+                if self.save_weights_only:
+                    if K.in_multi_worker_mode():
+                        # TODO(rchao): Save to an additional training state file for FT,
+                        # instead of adding an attr to weight file. With this we can support
+                        # the cases of all combinations with `save_weights_only`,
+                        # `save_best_only`, and `save_format` parameters.
+                        # pylint: disable=protected-access
+                        self.model._ckpt_saved_epoch = epoch
+                    filepath = os.path.join(filepath, 'cp')
+                    self.model.save_weights(filepath, overwrite=True)
+                else:
+                    self.kash_model.save(filepath)
+
+            self._maybe_remove_file(file_handle, filepath)


if __name__ == "__main__":
    print("Hello world")
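
A usage sketch for this new callback; the corpus, checkpoint path, and epoch count are illustrative assumptions, and `fit` accepting a `callbacks` list follows the usual Keras convention:

```python
from kashgari.callbacks import KashgariModelCheckpoint
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model

train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')

model = BiLSTM_Model()
# Save only the best full Kashgari model, judged by validation loss.
checkpoint = KashgariModelCheckpoint(filepath='checkpoints/best',
                                     monitor='val_loss',
                                     save_best_only=True,
                                     verbose=1,
                                     kash_model=model)
model.fit(train_x, train_y, valid_x, valid_y,
          epochs=10, callbacks=[checkpoint])
```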
9 changes: 6 additions & 3 deletions kashgari/embeddings/bert_embedding_v2.py
@@ -30,13 +30,16 @@ class BERTEmbeddingV2(BERTEmbedding):
    def info(self):
        info = super(BERTEmbedding, self).info()
        info['config'] = {
-            'model_folder': self.model_folder,
+            'vocab_path': self.vocab_path,
+            'config_path': self.config_path,
+            'checkpoint_path': self.checkpoint_path,
+            'bert_type': self.bert_type,
            'sequence_length': self.sequence_length
        }
        return info

    def __init__(self,
-                 vacab_path: str,
+                 vocab_path: str,
                 config_path: str,
                 checkpoint_path: str,
                 bert_type: str = 'bert',
@@ -47,7 +50,7 @@ def __init__(self,
        """
        """
        self.model_folder = ''
-        self.vacab_path = vacab_path
+        self.vocab_path = vocab_path
        self.config_path = config_path
        self.checkpoint_path = checkpoint_path
        super(BERTEmbedding, self).__init__(task=task,
3 changes: 1 addition & 2 deletions kashgari/tasks/classification/models.py
@@ -7,7 +7,6 @@
# file: models.py
# time: 2019-05-22 11:26

-import logging
import tensorflow as tf
from typing import Dict, Any
from kashgari.layers import L, AttentionWeightedAverageLayer, KMaxPoolingLayer
@@ -683,7 +682,7 @@ def build_model_arc(self):

if __name__ == "__main__":
    print(BiLSTM_Model.get_default_hyper_parameters())
-    logging.basicConfig(level=logging.DEBUG)
+    # logging.basicConfig(level=logging.DEBUG)
    from kashgari.corpus import SMP2018ECDTCorpus

    x, y = SMP2018ECDTCorpus.load_data()
3 changes: 2 additions & 1 deletion kashgari/utils.py
@@ -46,7 +46,8 @@ def custom_object_scope():
    return tf.keras.utils.custom_object_scope(custom_objects)


-def load_model(model_path: str, load_weights: bool = True) -> Union[BaseClassificationModel, BaseLabelingModel]:
+def load_model(model_path: str,
+               load_weights: bool = True) -> Union[BaseClassificationModel, BaseLabelingModel]:
    """
    Load saved model from saved model from `model.save` function
    Args:
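
For context, the save/load round trip this helper supports, as a sketch; the model directory and sample input are placeholders:

```python
import kashgari

# Directory produced earlier by model.save('saved_classification_model')
loaded_model = kashgari.utils.load_model('saved_classification_model')
print(loaded_model.predict([['this', 'is', 'a', 'sample']]))
```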
2 changes: 1 addition & 1 deletion requirements.txt
@@ -6,4 +6,4 @@ keras-gpt-2>=0.8.0
gensim>=3.5.0
seqeval==0.0.10
pandas>=0.23.0
-bert4keras==0.5.9
+bert4keras==0.6.5