Skip to content

Latest commit

 

History

History
233 lines (151 loc) · 9.64 KB

README.rst

File metadata and controls

233 lines (151 loc) · 9.64 KB

SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation

License License2 PyPI pyversions

Developed with 💛 at Expert.ai Research Lab

  • License: Apache 2.0 (Software) and Attribution 4.0 International (Datasets)
  • Paper: arXiv

Content

This repository contains code and datasets for the paper titled SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation. It is organized as follows:

  • data/processed: Contains the SPACE-IDEAS and SPACE-IDEAS+ datasets.
  • The rest of the repository contains the code for conducting paper experiments.

Installation

The whole project is handled with make, go to a terminal an issue:

git clone https://github.com/expertailab/SPACE-IDEAS
cd SPACE-IDEAS
make setup
conda activate ideas_annotation
make install-as-pkg

Reproducibility

Data split: To split the SPACE-IDEAS dataset in train and test splits, we can run the split_data.py script:

python scripts/split_data.py

Two files, train.jsonl and test.jsonl, will be created in the data/processed folder.

Single-sentence classification:

To train a single sentence classifier using the training SPACE-IDEAS data without context, we run:

python ideas_annotation/modeling/idea_dataset_sentence_classification.py --input_train_dataset data/processed/train.jsonl --input_test_dataset data/processed/test.jsonl

If we want to use the context, we run:

python ideas_annotation/modeling/idea_dataset_sentence_classification.py --input_train_dataset data/processed/train.jsonl --input_test_dataset data/processed/test.jsonl --use_context

To train using the SPACE-IDEAS plus dataset, we have to change the input_train_dataset to :

python ideas_annotation/modeling/idea_dataset_sentence_classification.py --input_train_dataset data/processed/space-ideas_plus.jsonl --input_test_dataset data/processed/test.jsonl --use_context

Sequential sentence classification:

We need to split the train set in train2 and dev set, we can do this with:

python scripts/split_train_data.py

Two files, train2.jsonl and dev.jsonl, will be created in the data/processed folder.

We clone the sequential_sentence_classification repository, create a new conda environment and install the required allennlp library.

git clone https://github.com/expertailab/sequential_sentence_classification.git
cd sequential_sentence_classification/
git checkout allennlp2
conda create -n sequential_sentence_classification python=3.9
conda activate sequential_sentence_classification
pip install allennlp==2.0.0

We have to modify the train.sh script in scripts folder, with the data paths:

TRAIN_PATH=../data/processed/train2.jsonl
DEV_PATH=../data/processed/dev.jsonl
TEST_PATH=../data/processed/test.jsonl

We can now run the trainining stript with:

./scripts/train.sh tmp_output_dir_space-ideas

The trained model will be at tmp_output_dir_space-ideas/model.tar.gz, we can get the test predictions with:

python -m allennlp predict tmp_output_dir_space-ideas/model.tar.gz ../data/processed/test.jsonl --include-package sequential_sentence_classification --predictor SeqClassificationPredictor --cuda-device 0 --output-file space-ideas-predictions.json

Now we can obtain the prediction metrics with:

cd ..
conda activate ideas_annotation
python scripts/sequential_sentence_classification_metrics.py --prediction_test_file sequential_sentence_classification/space-ideas-predictions.json --gold_test_file data/processed/test.jsonl

Sequential Transfer Learning

Single-sentence classification:

We can train a model, using for example SPACE-IDEAS plus dataset, and use that trained model to finetune on the SPACE-IDEAS dataset, we can do this with the following command:

python ideas_annotation/modeling/idea_dataset_sentence_classification.py --model $PATH_TO_TRAINED_MODEL --input_train_dataset data/processed/train.jsonl --input_test_dataset data/processed/test.jsonl --use_context

Sequential sentence classification:

First we need to train a model using the SPACE-IDEAS plus dataset, we can do it by changing the TRAIN_PATH variable in the train.sh script and point to the dataset location (../data/processed/space-ideas_plus.jsonl). Then we launch the training with:

cd sequential_sentence_classification/
conda activate sequential_sentence_classification
./scripts/train.sh tmp_output_dir_space-ideas-plus

When the training is finished, we will have a model.tar.gz file in the "tmp_output_dir_space-ideas-plus" folder. To finally train using the SPACE-IDEAS dataset, we need to change the "config.jsonnet" file in the "sequential_sentence_classification" folder, we need to change the "model" field in line 40, to the following:

..
"model": {
   "type": "from_archive",
   "archive_file": "tmp_output_dir_space-ideas-plus/model.tar.gz"
},
..

Then we change again the TRAIN_PATH variable in the train.sh script to point to the dataset location (../data/processed/train2.jsonl), and launch the training with:

./scripts/train.sh tmp_output_dir_space-ideas_from_space-ideas-plus

The trained model will be at tmp_output_dir_space-ideas_from_space-ideas-plus/model.tar.gz, we can get the test predictions with:

python -m allennlp predict tmp_output_dir_space-ideas_from_space-ideas-plus/model.tar.gz ../data/processed/test.jsonl --include-package sequential_sentence_classification --predictor SeqClassificationPredictor --cuda-device 0 --output-file space-ideas-predictions_from_space-ideas-plus.json

Now we can obtain the prediction metrics with:

cd ..
conda activate ideas_annotation
python scripts/sequential_sentence_classification_metrics.py --prediction_test_file sequential_sentence_classification/space-ideas-predictions_from_space-ideas-plus.json --gold_test_file data/processed/test.jsonl

Multi-Task Learning

Single-sentence classification:

By deafult, we can do multitask training using all the available datasets (SPACE-IDEAS, SPACE-IDEAS plus) with:

python scripts/merge_space-ideas_dataset.py
python ideas_annotation/modeling/idea_dataset_multitask_sentence_classification.py

Sequential sentence classification:

To run the multitask traininig with sequential sentence classification, we need to install a variation of the grouphug library. We can install it with:

git clone https://github.com/expertailab/grouphug.git
cd grouphug
pip install .
cd ..

Now we can run the idea_dataset_multitask_sentence_classification.py script:

python ideas_annotation/modeling/idea_dataset_multitask_sentence_classification.py

In line 135 of the script, we can set the combinations of datasets that we want to train: ["chatgpt", "gold"].

How to cite

To cite this research please use the following:

@inproceedings{garcia-silva-etal-2024-space-ideas,
    title = "{SPACE}-{IDEAS}: A Dataset for Salient Information Detection in Space Innovation",
    author = "Garcia-Silva, Andres  and
      Berrio, Cristian  and
      Gomez-Perez, Jose Manuel",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1311",
    pages = "15087--15092",
    abstract = "Detecting salient parts in text using natural language processing has been widely used to mitigate the effects of information overflow. Nevertheless, most of the datasets available for this task are derived mainly from academic publications. We introduce SPACE-IDEAS, a dataset for salient information detection from innovation ideas related to the Space domain. The text in SPACE-IDEAS varies greatly and includes informal, technical, academic and business-oriented writing styles. In addition to a manually annotated dataset we release an extended version that is annotated using a large generative language model. We train different sentence and sequential sentence classifiers, and show that the automatically annotated dataset can be leveraged using multitask learning to train better classifiers.",
}

Expert.ai favicon Expert.ai

At Expert.ai we turn language into data so humans can make better decisions. Take a look here!