Dataset and Codes for EMNLP 2022 Main Conference Long Paper titled "ECTSum: A New Benchmark Dataset For Bullet Point Summarization of Long Earnings Call Transcripts"

chuangtc/ECTSum-GPT3

 
 

ECTSum: A New Benchmark Dataset For Bullet Point Summarization of Long Earnings Call Transcripts

Long Paper Accepted at the EMNLP 2022 Main Conference!

  • ArXiv Preprint: https://arxiv.org/pdf/2210.12467.pdf
  • Poster: https://rajdeep345.github.io/files/pdf/research/ECTSum_EMNLP2022_Poster.pdf
  • Pre-recorded Video: https://drive.google.com/file/d/1DW2i2ApgiE6V7ViiayX5zdJSRXdAEbsy/view

    Dataset

    The ECTSum dataset can be found under the data folder.

    Proposed ECTSum dataset

    Dataset       # Docs.    Coverage  Density  Compression  # Tokens (Doc.)  # Tokens (Summary)
    ------------  ---------  --------  -------  -----------  ---------------  ------------------
    Arxiv/PubMed  346,187    0.87      3.94     31.17        5179.22          257.44
    BillSum       23,455     _         4.12     13.64        1813.0           207.7
    BigPatent     1,341,362  0.86      2.38     36.84        3629.04          116.67
    GovReport     19,466     _         19.01    19.01        9409.4           553.4
    BookSum       12,293     0.78      1.69     15.97        5101.88          505.32
    ------------  ---------  --------  -------  -----------  ---------------  ------------------
    ECTSum        2,425      0.85      2.43     103.67       2916.44          49.23
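The Coverage, Density, and Compression columns are not defined in this README. A minimal sketch follows, assuming they are the extractive-fragment statistics commonly reported for summarization datasets: coverage is the fraction of summary tokens that fall inside fragments copied from the document, density is the average squared fragment length per summary token, and compression is the document-to-summary token ratio. The function names and the lowercase whitespace tokenization are illustrative assumptions, not the authors' code, so the numbers it produces will not necessarily match the table.

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest article spans reused by the summary."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest match starting at summary[i]
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
        i += max(best, 1)  # skip past the fragment, or one unmatched token
    return fragments


def summary_stats(article_text, summary_text):
    """Return coverage, density, and compression for one document/summary pair."""
    # Assumption: simple lowercase whitespace tokenization.
    article = article_text.lower().split()
    summary = summary_text.lower().split()
    if not summary:
        raise ValueError("summary must contain at least one token")
    fragments = extractive_fragments(article, summary)
    n = len(summary)
    return {
        "coverage": sum(len(f) for f in fragments) / n,
        "density": sum(len(f) ** 2 for f in fragments) / n,
        "compression": len(article) / n,
    }


if __name__ == "__main__":
    doc = "revenue rose ten percent in the quarter and margins improved"
    summ = "revenue rose ten percent margins fell"
    print(summary_stats(doc, summ))
```

Under these definitions, a very high compression value with moderate coverage and low density, as in the ECTSum row, would describe short summaries that reuse document wording only in brief fragments.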

    Codes

    Codes and instructions for our proposed model ECT-BPS can be found under codes/ECT-BPS
    Codes and instructions for our baseline models can be found under codes/baselines

    Data Preparation for ECT-BPS

    Preparing the data for training the Extractive Module

    Set up Python 3.9 and PyTorch on a Mac M1

    brew install pyenv
    echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
    echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
    echo 'eval "$(pyenv init -)"' >> ~/.zshrc

    Open a new terminal tab so that the updated ~/.zshrc takes effect

    cd <project_folder>
    pyenv install 3.9.16
    pyenv local 3.9.16
    pyenv which python
    pip install torch torchvision torchaudio

    Set up PyTorch on Windows

    pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu

    Install libraries on Mac and Windows

    pip install sentence-transformers
    pip install num2words
    pip install word2number
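Before running the preparation script, it can help to confirm that the packages installed above are importable from the active interpreter. The sketch below uses only the standard library. The mapping from pip package names to import names is an assumption based on how these packages are usually imported, and `missing_packages` is a hypothetical helper that is not part of this repository.

```python
from importlib.util import find_spec

# Assumed mapping: pip package name -> top-level import name.
REQUIRED = {
    "torch": "torch",
    "torchvision": "torchvision",
    "torchaudio": "torchaudio",
    "sentence-transformers": "sentence_transformers",
    "num2words": "num2words",
    "word2number": "word2number",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported as missing, check that `pyenv which python` (on Mac) points at the interpreter that the `pip` commands installed into.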

    Prepare the data

    python prepare_data_gpt3.py

    Data Location

    The processed data is saved under out-data/.
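To check that the preparation step wrote something to out-data/, a small script can count the files it finds there. This README does not describe the layout or file types inside that folder, so the sketch below only groups files by extension. `summarize_output` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def summarize_output(out_dir="out-data"):
    """Count files under out_dir, grouped by file extension."""
    root = Path(out_dir)
    if not root.is_dir():
        return {}  # nothing has been written yet
    counts = {}
    for path in root.rglob("*"):
        if path.is_file():
            key = path.suffix or "(no extension)"
            counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    counts = summarize_output()
    if not counts:
        print("No files found under out-data/.")
    for suffix, count in sorted(counts.items()):
        print(f"{suffix}: {count} file(s)")
```

An empty result after running the script suggests it was started from a different working directory than the project folder.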

    Updates

  • 1st November 2022 - ECTSum Dataset released
  • 30th November 2022 - Codes and Instructions released for training the Extractive Module of ECT-BPS
  • 28th February 2023 - Dataset for GPT-3 created