This repository is a fork of the official code for Microsoft's HMNet paper, published at EMNLP 2020. It is implemented in PyTorch. The paper to cite is:
@Article{zhu2020a,
author = {Zhu, Chenguang and Xu, Ruochen and Zeng, Michael and Huang, Xuedong},
title = {A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining},
year = {2020},
month = {November},
url = {https://www.microsoft.com/en-us/research/publication/end-to-end-abstractive-summarization-for-meetings/},
journal = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
}
We modified the code to run inference on a single example. The predicted summary is stored in a summary.txt file.
We also added a preprocessing script that converts a Microsoft Teams meeting transcription file into the AMI jsonl format.
- Install the requirements.
- If you have a transcript for a meeting, paste it into preprocess.py.
- Edit name_role_dict in preprocess.py so that it lists the names of the meeting participants (this should be extracted automatically in future versions).
- Run preprocess.py. This produces the jsonl.gzip file that serves as input to the model; the file is stored at ExampleRawData/meeting_summarization/AMI_proprec/test/test_raw2.jsonl.gzip
- Add the pretrained model to the repo and set its path in the conf file (the AMI-finetuned model is currently used).
- Run this line (drop the leading ! outside a notebook): !python PyLearn.py evaluate ExampleConf/conf_eval_hmnet_AMI. This writes the summary to summary.txt.
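The preprocessing step above can be pictured with the following minimal sketch. It is not the repository's preprocess.py: the record layout (`id`, `meeting`, `speaker`, `role`, `utt`, `word`, `summary`), the helper names `build_record` and `write_jsonl_gz`, and the sample names and roles are all assumptions for illustration, and the real AMI format may carry additional fields.

```python
import gzip
import json

# Hypothetical speaker-to-role mapping, analogous to name_role_dict in preprocess.py.
name_role_dict = {"Alice": "PM", "Bob": "UI"}


def build_record(meeting_id, turns, roles):
    """Turn (speaker, text) pairs into one meeting record (assumed layout)."""
    meeting = []
    for speaker, text in turns:
        meeting.append({
            "speaker": speaker,
            # Fall back to a placeholder role when a name is missing from the mapping.
            "role": roles.get(speaker, "unknown"),
            # Naive whitespace tokenization, for illustration only.
            "utt": {"word": text.split()},
        })
    # The summary is left empty because it is unknown at inference time.
    return {"id": meeting_id, "meeting": meeting, "summary": []}


def write_jsonl_gz(path, records):
    """Write one JSON object per line into a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

A call such as `write_jsonl_gz(path, [build_record("meeting_1", turns, name_role_dict)])` would then produce a single-example file, where `turns` is a list of `(speaker, text)` pairs taken from the transcript.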
We recommend running the model inside a Docker container:
Build the Docker image
cd Docker
sudo docker build . -t hmnet
Run a container from the image
sudo nvidia-docker run -it hmnet /bin/bash
Get the pretrained HMNet ready at ExampleInitModel/HMNet-pretrained. Please see document.
Finetune on AMI dataset
CUDA_VISIBLE_DEVICES="0,1,2,3" mpirun -np 4 --allow-run-as-root python PyLearn.py train ExampleConf/conf_hmnet_AMI
The training log, model, and settings can be found at ExampleConf/conf_hmnet_AMI_conf~/run_1
- ExampleRawData/meeting_summarization/AMI_proprec: The preprocessed AMI dataset. The *.json files point to the path to each split. Each folder (train, dev or test) contains the compressed chunks of data in the format for infinibatch.
- ExampleRawData/meeting_summarization/ICSI_proprec: Same as above for ICSI dataset.
- ExampleInitModel/transfo-xl-wt103: Here we only used the vocabulary from Transformer-XL, provided by Huggingface.
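To sanity-check a preprocessed split, the compressed chunks can be inspected directly. The sketch below assumes each chunk is a gzip-compressed file with one JSON object per line; the function name `count_records` and the default glob pattern are illustrative and not part of the repository.

```python
import glob
import gzip
import json
import os


def count_records(split_dir, pattern="*.gz*"):
    """Return {chunk file name: number of JSON lines} for one split folder."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(split_dir, pattern))):
        n = 0
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    json.loads(line)  # raises ValueError if a line is not valid JSON
                    n += 1
        counts[os.path.basename(path)] = n
    return counts
```

For example, `count_records("ExampleRawData/meeting_summarization/AMI_proprec/test")` would report how many examples each chunk in the test split holds, provided the chunks follow the assumed layout.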
In ExampleConf/conf_eval_hmnet_AMI, for the line
PYLEARN_MODEL ###
replace ### with the real checkpoint path. Use a path relative to the location of this configuration file.
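The relative path can be computed rather than typed by hand. The following sketch is one possible way to do it, assuming the configuration line consists of the key `PYLEARN_MODEL` followed by whitespace and the `###` placeholder; the helper name `set_model_path` is hypothetical and not part of the repository.

```python
import os


def set_model_path(conf_path, checkpoint_path, key="PYLEARN_MODEL", placeholder="###"):
    """Replace the placeholder on the `key` line with a path relative to the conf file."""
    conf_dir = os.path.dirname(os.path.abspath(conf_path))
    rel_path = os.path.relpath(os.path.abspath(checkpoint_path), start=conf_dir)

    with open(conf_path, encoding="utf-8") as f:
        lines = f.readlines()

    updated = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0] == key and parts[1] == placeholder:
            # Replace only the placeholder so the original separator is preserved.
            line = line.replace(placeholder, rel_path, 1)
        updated.append(line)

    with open(conf_path, "w", encoding="utf-8") as f:
        f.writelines(updated)
    return rel_path
```

Calling `set_model_path("ExampleConf/conf_eval_hmnet_AMI", checkpoint)` with your own checkpoint location rewrites the file in place and returns the relative path that was written, so keep a backup of the configuration if you want to restore the placeholder.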
CUDA_VISIBLE_DEVICES="0,1,2,3" mpirun -np 4 --allow-run-as-root python PyLearn.py evaluate ExampleConf/conf_eval_hmnet_AMI
The decoding results can be found at ExampleConf/conf_eval_hmnet_AMI_conf~/run_1
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include Microsoft, Azure, DotNet, AspNet, Xamarin, and our GitHub organizations.
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets Microsoft's definition of a security vulnerability, please report it to us as described below.
Please do not report security vulnerabilities through public GitHub issues.
Instead, please report them to the Microsoft Security Response Center (MSRC) at https://msrc.microsoft.com/create-report.
If you prefer to submit without logging in, send email to secure@microsoft.com. If possible, encrypt your message with our PGP key; please download it from the Microsoft Security Response Center PGP Key page.
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at microsoft.com/msrc.
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
- Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
- Full paths of source file(s) related to the manifestation of the issue
- The location of the affected source code (tag/branch/commit or direct URL)
- Any special configuration required to reproduce the issue
- Step-by-step instructions to reproduce the issue
- Proof-of-concept or exploit code (if possible)
- Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our Microsoft Bug Bounty Program page for more details about our active programs.
We prefer all communications to be in English.
Microsoft follows the principle of Coordinated Vulnerability Disclosure.