Merge pull request #348 from BrikerMan/develop
Release v1.1.2
BrikerMan authored Mar 27, 2020
2 parents 4ae3ae6 + ce671ac commit 6d2970e
Showing 17 changed files with 210 additions and 57 deletions.
37 changes: 3 additions & 34 deletions README.md
@@ -58,7 +58,7 @@ Kashgari is a simple and powerful NLP Transfer learning framework, build a state

| Task | Language | Dataset | Score | Detail |
| ------------------------ | -------- | ------------------------- | -------------- | -------------------------------------------------------------------------------------------------------- |
-| Named Entity Recognition | Chinese | People's Daily Ner Corpus | **94.46** (F1) | [Text Labeling Performance Report](https://kashgari.bmio.net/tutorial/text-labeling/#performance-report) |
+| Named Entity Recognition | Chinese | People's Daily Ner Corpus | **94.46** (F1) | [Text Labeling Performance Report](https://kashgari.rtfd.io/tutorial/text-labeling.html#performance-report) |

## Tutorials

@@ -170,7 +170,7 @@ Support this project by becoming a sponsor. Your issues and feature request will

## Contributors ✨

-Thanks goes to these wonderful people. And there are many ways to get involved. Start with the [contributor guidelines](https://kashgari.bmio.net/about/contributing/) and then check these open issues for specific tasks.
+Thanks goes to these wonderful people. And there are many ways to get involved. Start with the [contributor guidelines](./docs/about/contributing.md) and then check these open issues for specific tasks.

<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
<!-- prettier-ignore-start -->
@@ -199,35 +199,4 @@ This library is inspired by and references following frameworks and papers.
- [flair - A very simple framework for state-of-the-art Natural Language Processing (NLP)](https://github.com/zalandoresearch/flair)
- [anago - Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging](https://github.com/Hironsan/anago)
- [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors)

-This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!
-
-## Contributors
-
-### Code Contributors
-
-This project exists thanks to all the people who contribute. [[Contribute](CONTRIBUTING.md)].
-<a href="https://github.com/BrikerMan/Kashgari/graphs/contributors"><img src="https://opencollective.com/Kashgari/contributors.svg?width=890&button=false" /></a>
-
-### Financial Contributors
-
-Become a financial contributor and help us sustain our community. [[Contribute](https://opencollective.com/Kashgari/contribute)]
-
-#### Individuals
-
-<a href="https://opencollective.com/Kashgari"><img src="https://opencollective.com/Kashgari/individuals.svg?width=890"></a>
-
-#### Organizations
-
-Support this project with your organization. Your logo will show up here with a link to your website. [[Contribute](https://opencollective.com/Kashgari/contribute)]
-
-<a href="https://opencollective.com/Kashgari/organization/0/website"><img src="https://opencollective.com/Kashgari/organization/0/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/1/website"><img src="https://opencollective.com/Kashgari/organization/1/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/2/website"><img src="https://opencollective.com/Kashgari/organization/2/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/3/website"><img src="https://opencollective.com/Kashgari/organization/3/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/4/website"><img src="https://opencollective.com/Kashgari/organization/4/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/5/website"><img src="https://opencollective.com/Kashgari/organization/5/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/6/website"><img src="https://opencollective.com/Kashgari/organization/6/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/7/website"><img src="https://opencollective.com/Kashgari/organization/7/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/8/website"><img src="https://opencollective.com/Kashgari/organization/8/avatar.svg"></a>
-<a href="https://opencollective.com/Kashgari/organization/9/website"><img src="https://opencollective.com/Kashgari/organization/9/avatar.svg"></a>
+- [bert4keras - Our light reimplement of bert for keras](https://github.com/bojone/bert4keras/)
5 changes: 5 additions & 0 deletions docs/about/release-notes.md
@@ -17,6 +17,11 @@ pip show kashgari

## Current Release

+### [1.1.2] - 2020.03.27
+
+- ✨ Add save best model callback `KashgariModelCheckpoint`.
+- ⬆️ Upgrading `bert4keras` version to `0.6.5`.
+
### [1.1.1] - 2020.03.13

- ✨ Add BERTEmbeddingV2.
2 changes: 1 addition & 1 deletion docs/api/embeddings.md
@@ -11,7 +11,7 @@ Embedding layers have its own \_\_init\_\_ function, check it out from their doc
| [BERTEmbedding](../embeddings/bert-embedding.md) | pre-trained BERT embedding |
| [GPT2Embedding](../embeddings/gpt2-embedding.md) | pre-trained GPT-2 embedding |
| [NumericFeaturesEmbedding](../embeddings/numeric-features-embedding.md) | random init `tf.keras.layers.Embedding` layer for numeric feature embedding |
-| [StackedEmbedding](../embeddings/stacked-embeddingmd) | stack other embeddings for multi-input model |
+| [StackedEmbedding](../embeddings/stacked-embedding.md) | stack other embeddings for multi-input model |

All embedding layer shares same API except the `__init__` function.

4 changes: 3 additions & 1 deletion docs/conf.py
@@ -261,12 +261,14 @@ def setup(app):
rst_readme = os.path.join(docs_path, 'README.rst')

# Update all .md files, for fixing links
-update_markdown_content(docs_path)
+if os.environ.get('READTHEDOCS') == 'True':
+    update_markdown_content(docs_path)

# Change readme to rst file, and include in Sphinx index
with open(rst_readme, 'w') as f:
    md_content = open(original_readme, 'r').read()
+    md_content = md_content.replace('(./docs/', '(./')
    md_content = md_content.replace('.md)', '.html)')
    f.write(convert(md_content))
print(f'Saved RST file to {rst_readme}')

6 changes: 3 additions & 3 deletions docs/embeddings/bert-embedding_v2.md
@@ -18,7 +18,7 @@ BERTEmbeddingV2 support models:
When using pre-trained embedding, remember to use same tokenize tool with the embedding model, this will allow to access the full power of the embedding

```python
-kashgari.embeddings.BERTEmbedding(vacab_path: str,
+kashgari.embeddings.BERTEmbedding(vocab_path: str,
                                  config_path: str,
                                  checkpoint_path: str,
                                  bert_type: str = 'bert',
@@ -30,10 +30,10 @@ kashgari.embeddings.BERTEmbedding(vacab_path: str,

**Arguments**

-- **vacab_path**: path of model's `vacab.txt` file
+- **vocab_path**: path of model's `vacab.txt` file
- **config_path**: path of model's `model.json` file
- **checkpoint_path**: path of model's checkpoint file
-- **bert_type**: `bert`, `albert`, `nezha`. Type of BERT model.
+- **bert_type**: `bert`, `albert`, `nezha`, `electra`, `gpt2_ml`, `t5`. Type of BERT model.
- **task**: `kashgari.CLASSIFICATION` `kashgari.LABELING`. Downstream task type, If you only need to feature extraction, just set it as `kashgari.CLASSIFICATION`.
- **sequence_length**: `'auto'` or integer. When using `'auto'`, use the 95% of corpus length as sequence length. If using an integer, let's say `50`, the input output sequence length will set to 50.

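
As a usage sketch with the corrected `vocab_path` argument; the checkpoint paths below are placeholders for a locally downloaded model, not files shipped with this commit:

```python
import kashgari
from kashgari.embeddings import BERTEmbeddingV2

# Placeholder paths; point these at a real downloaded BERT/ALBERT checkpoint.
embedding = BERTEmbeddingV2(vocab_path='albert_base/vocab_chinese.txt',
                            config_path='albert_base/albert_config.json',
                            checkpoint_path='albert_base/model.ckpt-best',
                            bert_type='albert',
                            task=kashgari.CLASSIFICATION,
                            sequence_length=100)
```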
2 changes: 1 addition & 1 deletion docs/embeddings/index.md
@@ -9,7 +9,7 @@ Kashgari provides several embeddings for language representation. Embedding laye
| [BERTEmbedding](bert-embedding.md) | pre-trained BERT embedding |
| [GPT2Embedding](gpt2-embedding.md) | pre-trained GPT-2 embedding |
| [NumericFeaturesEmbedding](numeric-features-embedding.md) | random init `tf.keras.layers.Embedding` layer for numeric feature embedding |
-| [StackedEmbedding](./stacked-embeddingmd) | stack other embeddings for multi-input model |
+| [StackedEmbedding](./stacked-embedding.md) | stack other embeddings for multi-input model |

All embedding classes inherit from the `Embedding` class and implement the `embed()` to embed your input sequence and `embed_model` property which you need to build you own Model. By providing the `embed()` function and `embed_model` property, Kashgari hides the the complexity of different language embedding from users, all you need to care is which language embedding you need.

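
As a concrete sketch of the shared `embed()` API described in this file, using `BareEmbedding` (the tiny sample corpus here is invented for illustration):

```python
import kashgari
from kashgari.embeddings import BareEmbedding

embedding = BareEmbedding(task=kashgari.CLASSIFICATION,
                          sequence_length=10,
                          embedding_size=100)
# The vocabulary must be built from a corpus before embed() can run.
embedding.analyze_corpus([['all', 'work', 'and', 'no', 'play']], ['label_a'])
vectors = embedding.embed([['all', 'work', 'and', 'no', 'play']])
print(vectors.shape)  # expected shape: (1, 10, 100), one sentence padded to length 10
```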
2 changes: 1 addition & 1 deletion docs/embeddings/numeric-features-embedding.md
@@ -17,4 +17,4 @@ kashgari.embeddings.NumericFeaturesEmbedding(feature_count: int,
- **feature_count**: count of the features of this embedding.
- **feature_name**: name of the feature.
- **sequence_length**: `'auto'`, `'variable'` or integer. When using `'auto'`, use the 95% of corpus length as sequence length. When using `'variable'`, model input shape will set to None, which can handle various length of input, it will use the length of max sequence in every batch for sequence length. If using an integer, let's say `50`, the input output sequence length will set to 50.
-- **embedding_size**: Dimension of the dense embedding.
\ No newline at end of file
+- **embedding_size**: Dimension of the dense embedding.
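
A short construction sketch for this embedding; the feature name and counts are invented for illustration, and the stacking step is shown after the StackedEmbedding section below:

```python
from kashgari.embeddings import NumericFeaturesEmbedding

# A hypothetical binary per-token feature such as "is_bold": two possible values.
feature_embedding = NumericFeaturesEmbedding(feature_count=2,
                                             feature_name='is_bold',
                                             sequence_length=100,
                                             embedding_size=30)
```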
2 changes: 1 addition & 1 deletion docs/embeddings/stacked-embedding.md
@@ -11,4 +11,4 @@ kashgari.embeddings.StackedEmbedding(embeddings: List[Embedding],

**Arguments**

-- **embeddings**: list of embedding object.
\ No newline at end of file
+- **embeddings**: list of embedding object.
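
Continuing the sketch above, a stacked multi-input embedding might be assembled like this (the task choice and sequence lengths are assumptions):

```python
import kashgari
from kashgari.embeddings import (BareEmbedding, NumericFeaturesEmbedding,
                                 StackedEmbedding)

text_embedding = BareEmbedding(task=kashgari.LABELING, sequence_length=100)
feature_embedding = NumericFeaturesEmbedding(feature_count=2,
                                             feature_name='is_bold',
                                             sequence_length=100)
# Order matters: downstream models consume inputs in this same order.
stack_embedding = StackedEmbedding([text_embedding, feature_embedding])
```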
4 changes: 2 additions & 2 deletions docs/tutorial/text-classification.md
@@ -14,7 +14,7 @@ You could easily switch from one model to another just by changing one line of c
| CNN\_LSTM\_Model | |
| CNN\_GRU\_Model | |
| AVCNN\_Model | |
-| KMax\_CNN]\_Model | |
+| KMax\_CNN\_Model | |
| R\_CNN\_Model | |
| AVRNN\_Model | |
| Dropout\_BiGRU\_Model | |
@@ -128,7 +128,7 @@ model.build_model(train_x, train_y, valid_x, valid_y)
optimizer = RAdam()
model.compile_model(optimizer=optimizer)

-# Train model 
+# Train model
model.fit(train_x, train_y, valid_x, valid_y)
```

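
To make the "change one line" claim concrete, a minimal sketch; the SMP2018 corpus is borrowed from this repo's other examples and is an assumption here:

```python
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model, CNN_Model

train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')

# Swapping BiLSTM_Model() for CNN_Model() is the only change needed.
model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y)
```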
2 changes: 1 addition & 1 deletion kashgari/__version__.py
@@ -7,4 +7,4 @@
# file: __version__.py.py
# time: 2019-05-20 16:32

-__version__ = '1.1.1'
+__version__ = '1.1.2'
113 changes: 111 additions & 2 deletions kashgari/callbacks.py
@@ -7,11 +7,16 @@
# file: callbacks.py
# time: 2019-05-22 15:00

+import logging
+import os
+
-from seqeval import metrics as seq_metrics
from sklearn import metrics
-from kashgari import macros
from tensorflow.python import keras
+from tensorflow.python.keras import backend as K
+
+from kashgari import macros
+from kashgari.tasks.base_model import BaseModel
+from seqeval import metrics as seq_metrics


class EvalCallBack(keras.callbacks.Callback):
@@ -59,5 +64,109 @@ def on_epoch_end(self, epoch, logs=None):
        print(f"\nepoch: {epoch} precision: {precision:.6f}, recall: {recall:.6f}, f1: {f1:.6f}")


+class KashgariModelCheckpoint(keras.callbacks.ModelCheckpoint):
+    """Save the model after every epoch.
+    Arguments:
+        filepath: string, path to save the model file.
+        monitor: quantity to monitor.
+        verbose: verbosity mode, 0 or 1.
+        save_best_only: if `save_best_only=True`, the latest best model according
+            to the quantity monitored will not be overwritten.
+        mode: one of {auto, min, max}. If `save_best_only=True`, the decision to
+            overwrite the current save file is made based on either the maximization
+            or the minimization of the monitored quantity. For `val_acc`, this
+            should be `max`, for `val_loss` this should be `min`, etc. In `auto`
+            mode, the direction is automatically inferred from the name of the
+            monitored quantity.
+        save_weights_only: if True, then only the model's weights will be saved
+            (`model.save_weights(filepath)`), else the full model is saved
+            (`model.save(filepath)`).
+        save_freq: `'epoch'` or integer. When using `'epoch'`, the callback saves
+            the model after each epoch. When using integer, the callback saves the
+            model at end of a batch at which this many samples have been seen since
+            last saving. Note that if the saving isn't aligned to epochs, the
+            monitored metric may potentially be less reliable (it could reflect as
+            little as 1 batch, since the metrics get reset every epoch). Defaults to
+            `'epoch'`
+        **kwargs: Additional arguments for backwards compatibility. Possible key
+            is `period`.
+    """
+
+    def __init__(self,
+                 filepath,
+                 monitor='val_loss',
+                 verbose=0,
+                 save_best_only=False,
+                 save_weights_only=False,
+                 mode='auto',
+                 save_freq='epoch',
+                 kash_model: BaseModel = None,
+                 **kwargs):
+        super(KashgariModelCheckpoint, self).__init__(
+            filepath=filepath,
+            monitor=monitor,
+            verbose=verbose,
+            save_best_only=save_best_only,
+            save_weights_only=save_weights_only,
+            mode=mode,
+            save_freq=save_freq,
+            **kwargs)
+        self.kash_model = kash_model
+
+    def _save_model(self, epoch, logs):
+        """Saves the model.
+        Arguments:
+            epoch: the epoch this iteration is in.
+            logs: the `logs` dict passed in to `on_batch_end` or `on_epoch_end`.
+        """
+        logs = logs or {}
+
+        if isinstance(self.save_freq,
+                      int) or self.epochs_since_last_save >= self.period:
+            self.epochs_since_last_save = 0
+            file_handle, filepath = self._get_file_handle_and_path(epoch, logs)
+
+            if self.save_best_only:
+                current = logs.get(self.monitor)
+                if current is None:
+                    logging.warning('Can save best model only with %s available, '
+                                    'skipping.', self.monitor)
+                else:
+                    if self.monitor_op(current, self.best):
+                        if self.verbose > 0:
+                            print('\nEpoch %05d: %s improved from %0.5f to %0.5f,'
+                                  ' saving model to %s' % (epoch + 1, self.monitor, self.best,
+                                                           current, filepath))
+                        self.best = current
+                        if self.save_weights_only:
+                            filepath = os.path.join(filepath, 'cp')
+                            self.model.save_weights(filepath, overwrite=True)
+                        else:
+                            self.kash_model.save(filepath)
+                    else:
+                        if self.verbose > 0:
+                            print('\nEpoch %05d: %s did not improve from %0.5f' %
+                                  (epoch + 1, self.monitor, self.best))
+            else:
+                if self.verbose > 0:
+                    print('\nEpoch %05d: saving model to %s' % (epoch + 1, filepath))
+                if self.save_weights_only:
+                    if K.in_multi_worker_mode():
+                        # TODO(rchao): Save to an additional training state file for FT,
+                        # instead of adding an attr to weight file. With this we can support
+                        # the cases of all combinations with `save_weights_only`,
+                        # `save_best_only`, and `save_format` parameters.
+                        # pylint: disable=protected-access
+                        self.model._ckpt_saved_epoch = epoch
+                    filepath = os.path.join(filepath, 'cp')
+                    self.model.save_weights(filepath, overwrite=True)
+                else:
+                    self.kash_model.save(filepath)
+
+            self._maybe_remove_file(file_handle, filepath)


if __name__ == "__main__":
    print("Hello world")
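
A usage sketch for this new callback; the corpus, checkpoint path, and epoch count are illustrative assumptions, and `fit` accepting a `callbacks` list follows the usual Keras convention:

```python
from kashgari.callbacks import KashgariModelCheckpoint
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model

train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')

model = BiLSTM_Model()
# Save only the best full Kashgari model, judged by validation loss.
checkpoint = KashgariModelCheckpoint(filepath='checkpoints/best',
                                     monitor='val_loss',
                                     save_best_only=True,
                                     verbose=1,
                                     kash_model=model)
model.fit(train_x, train_y, valid_x, valid_y,
          epochs=10, callbacks=[checkpoint])
```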
9 changes: 6 additions & 3 deletions kashgari/embeddings/bert_embedding_v2.py
@@ -30,13 +30,16 @@ class BERTEmbeddingV2(BERTEmbedding):
    def info(self):
        info = super(BERTEmbedding, self).info()
        info['config'] = {
-            'model_folder': self.model_folder,
+            'vocab_path': self.vocab_path,
+            'config_path': self.config_path,
+            'checkpoint_path': self.checkpoint_path,
+            'bert_type': self.bert_type,
            'sequence_length': self.sequence_length
        }
        return info

    def __init__(self,
-                 vacab_path: str,
+                 vocab_path: str,
                 config_path: str,
                 checkpoint_path: str,
                 bert_type: str = 'bert',
@@ -47,7 +50,7 @@ def __init__(self,
        """
        """
        self.model_folder = ''
-        self.vacab_path = vacab_path
+        self.vocab_path = vocab_path
        self.config_path = config_path
        self.checkpoint_path = checkpoint_path
        super(BERTEmbedding, self).__init__(task=task,
3 changes: 1 addition & 2 deletions kashgari/tasks/classification/models.py
@@ -7,7 +7,6 @@
# file: models.py
# time: 2019-05-22 11:26

-import logging
import tensorflow as tf
from typing import Dict, Any
from kashgari.layers import L, AttentionWeightedAverageLayer, KMaxPoolingLayer
@@ -683,7 +682,7 @@ def build_model_arc(self):

if __name__ == "__main__":
    print(BiLSTM_Model.get_default_hyper_parameters())
-    logging.basicConfig(level=logging.DEBUG)
+    # logging.basicConfig(level=logging.DEBUG)
    from kashgari.corpus import SMP2018ECDTCorpus

    x, y = SMP2018ECDTCorpus.load_data()
3 changes: 2 additions & 1 deletion kashgari/utils.py
@@ -46,7 +46,8 @@ def custom_object_scope():
    return tf.keras.utils.custom_object_scope(custom_objects)


-def load_model(model_path: str, load_weights: bool = True) -> Union[BaseClassificationModel, BaseLabelingModel]:
+def load_model(model_path: str,
+               load_weights: bool = True) -> Union[BaseClassificationModel, BaseLabelingModel]:
    """
    Load saved model from saved model from `model.save` function
    Args:
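
For context, the save/load round trip this helper supports, as a sketch; the model directory and sample input are placeholders:

```python
import kashgari

# Directory produced earlier by model.save('saved_classification_model')
loaded_model = kashgari.utils.load_model('saved_classification_model')
print(loaded_model.predict([['this', 'is', 'a', 'sample']]))
```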
2 changes: 1 addition & 1 deletion requirements.txt
@@ -6,4 +6,4 @@ keras-gpt-2>=0.8.0
gensim>=3.5.0
seqeval==0.0.10
pandas>=0.23.0
-bert4keras==0.5.9
+bert4keras==0.6.5