First, clone this repository:
git lfs install
git clone --recursive https://github.com/lovodkin93/Controlled_Text_Reduction.git
cd Controlled_Text_Reduction
If you haven't installed virtualenv yet, run:
python3 -m pip install --user virtualenv
Then create and activate a virtual environment and install the dependencies:
python3 -m venv venvs/controlled_text_reduction_env
source venvs/controlled_text_reduction_env/bin/activate
python3 -m pip install -r requirements.txt
python -m spacy download en_core_web_sm
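Optionally, as a quick sanity check, you can verify that the environment and the spaCy model installed correctly:

```python
# Optional sanity check: confirm the spaCy English model downloaded above loads.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Controlled Text Reduction setup looks good.")
print([token.text for token in doc])
```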
The next step is to preprocess the data. To do that, run:
python -m src.preprocess_from_highlight_rows_to_document_rows configs/preprocess/<CONFIG_PREPROCESS_FILE>
You can find examples of `<CONFIG_PREPROCESS_FILE>` under `configs/preprocess/`.
Some notes about the `<CONFIG_PREPROCESS_FILE>`:
- The `train_file_highlight_rows` parameter should be a csv file with the following columns: `topic`, `summaryFile`, `documentFile`, `docSpanOffsets`, where `topic` should be identical to `summaryFile`, and the `summaryFile` and `documentFile` columns should include the names of the extracted documents and summaries (see the next point, and the illustrative sketch after these notes).
- The `doc_data_dir` parameter should point to the data path, which should have the following structure, where a summary and its related document directory share the same name:
  - <DATA_PATH>
    - summaries
      - A.txt
      - B.txt
      - ...
    - A
      - doc_A1
      - doc_A2
      - ...
    - B
      - doc_B1
      - doc_B2
      - ...
- The `output_file_path` parameter determines where the output file is saved; it should point to a path inside the `data/` folder.
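To make the notes above concrete, here is a minimal, hypothetical sketch of a highlight-rows csv and a matching preprocess config. The file names, topic names, and the `docSpanOffsets` format are illustrative assumptions, not values taken from the repository; consult the examples under `configs/preprocess/` for the real format.

```python
# A hypothetical sketch of the preprocess inputs described in the notes above.
# ASSUMPTIONS: all file names and the offset format ("start, end" character
# spans) are illustrative; check configs/preprocess/ for the actual format.
import csv
import json

# Highlight-rows csv: one row per highlighted document span.
# Note that `topic` is identical to `summaryFile`, per the notes above.
with open("data/my_train_highlight_rows.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["topic", "summaryFile", "documentFile", "docSpanOffsets"]
    )
    writer.writeheader()
    writer.writerow(
        {
            "topic": "A.txt",
            "summaryFile": "A.txt",
            "documentFile": "doc_A1",
            "docSpanOffsets": "0, 142",  # assumed "start, end" character offsets
        }
    )

# Matching preprocess config, using only the three documented parameters.
config = {
    "train_file_highlight_rows": "data/my_train_highlight_rows.csv",
    "doc_data_dir": "data/my_dataset/",  # the <DATA_PATH> from the tree above
    "output_file_path": "data/my_train_document_rows.csv",
}
with open("configs/preprocess/my_preprocess_config.json", "w") as f:
    json.dump(config, f, indent=2)
```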
The full Controlled Text Reduction dataset can be found here: link to dataset
Once the datasets have been preprocessed and saved in the `data/` folder, you can proceed to training the models. To do that, run:
python -m src.run_experiments configs/train/<CONFIG_TRAIN_FILE>
You can find examples of `<CONFIG_TRAIN_FILE>` under `configs/train/`.
You have different options (an example invocation follows the list):
- To train a model which receives as input text+highlights, follow `train_led__4096__global_text_and_highlights.json`.
- To train a model which receives as input only the text (without highlights), follow `train_led__4096__no_highlights.json`.
- To train a model which receives as input only the highlights (their concatenation), follow `train_led__4096__global_only_highlights.json`.
- To further finetune an already finetuned model (e.g., first finetuning on CNN-DM and the DUC training set, and then further finetuning on DUC alone), follow `further_finetune_led_4096_global_on_highlights_pretrained_CNNDM_full_and_duc.json`.
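For example, to train the text+highlights variant:
python -m src.run_experiments configs/train/train_led__4096__global_text_and_highlights.json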
To test or evaluate a trained model, run:
python -m src.run_experiments configs/eval/<CONFIG_TEST_FILE>
You can find examples of `<CONFIG_TEST_FILE>` under `configs/eval/`.
Again, you have several options:
- To test a simple concatenation, follow `test_simple_concatenation.json` (or `eval_simple_concatenation.json` for the dev set).
- To test a model which receives as input text+highlights, follow `test_led_text_and_highlights.json` (or `eval_led_text_and_highlights.json` for the dev set).
- To test a model which receives as input only the text (without highlights), follow `test_led_no_highlights.json` (or `eval_led_no_highlights.json` for the dev set).
- To test a model which receives as input only the highlights, follow `test_led_only_highlights.json` (or `eval_led_only_highlights.json` for the dev set).
- To run a mixed experiment, where highlights are paired with a different summary of the same document, follow `test_led_mixed.json`.
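For example, to evaluate the text+highlights model on the test set:
python -m src.run_experiments configs/eval/test_led_text_and_highlights.json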
We also uploaded our best model to the Hugging Face Hub for easy use. Please refer to: https://huggingface.co/biu-nlp/led-base-controlled-text-reduction
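As a minimal sketch of loading the released checkpoint with the `transformers` library: note that the exact way highlights are marked in the model's input is defined by the repository's preprocessing, so the plain-text input below is only an illustration, not the real input format.

```python
# Minimal sketch: load the released checkpoint and generate a reduction.
# ASSUMPTION: the input here is plain text; the repository's own preprocessing
# marks highlighted spans in the input, which this sketch does not reproduce.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "biu-nlp/led-base-controlled-text-reduction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "Some source document whose targeted content was highlighted upstream."
inputs = tokenizer(text, return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=256, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```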
If you find this repository useful in your research, please cite the following paper:
@inproceedings{slobodkin-etal-2022-controlled,
title = "Controlled Text Reduction",
author = "Slobodkin, Aviv and
Roit, Paul and
Hirsch, Eran and
Ernst, Ori and
Dagan, Ido",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.385",
pages = "5699--5715",
    abstract = "Producing a reduced version of a source text, as in generic or focused summarization, inherently involves two distinct subtasks: deciding on targeted content and generating a coherent text conveying it. While some popular approaches address summarization as a single end-to-end task, prominent works support decomposed modeling for individual subtasks. Further, semi-automated text reduction is also very appealing, where users may identify targeted content while models would generate a corresponding coherent summary. In this paper, we focus on the second subtask, of generating coherent text given pre-selected content. Concretely, we formalize \textit{Controlled Text Reduction} as a standalone task, whose input is a source text with marked spans of targeted content ({``}highlighting{''}). A model then needs to generate a coherent text that includes all and only the target information. We advocate the potential of such models, both for modular fully-automatic summarization, as well as for semi-automated human-in-the-loop use cases. Facilitating proper research, we crowdsource high-quality dev and test datasets for the task. Further, we automatically generate a larger {``}silver{''} training dataset from available summarization benchmarks, leveraging a pretrained summary-source alignment model. Finally, employing these datasets, we present a supervised baseline model, showing promising results and insightful analyses.",
}