
CrowdChecked: Detecting Previously Fact-Checked Claims in Social Media

This repo contains the code and datasets for the paper "CrowdChecked: Detecting Previously Fact-Checked Claims in Social Media".

Abstract

While there has been substantial progress in developing systems to automate fact-checking, they still lack credibility in the eyes of the users. Thus, an interesting approach has emerged: to perform automatic fact-checking by verifying whether an input claim has been previously fact-checked by professional fact-checkers and to return back an article that explains their decision. This is a sensible approach as people trust manual fact-checking, and as many claims are repeated multiple times. Yet, a major issue when building such systems is the small number of known tweet--verifying article pairs available for training. Here, we aim to bridge this gap by making use of crowd fact-checking, i.e., mining claims in social media for which users have responded with a link to a fact-checking article. In particular, we mine a large-scale collection of 330,000 tweets paired with a corresponding fact-checking article. We further propose an end-to-end framework to learn from this noisy data based on modified self-adaptive training, in a distant supervision scenario. Our experiments on the CLEF'21 CheckThat! test set show improvements over the state of the art by two points absolute.

Code and Data

Models

  • stsb-bert-base: fine-tuned on CrowdChecked (Jaccard similarity, cutoff 0.30) + CheckThat! 2021 Task 2A.
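As a quick illustration of what such a model does at inference time, here is a minimal retrieval sketch with sentence-transformers. The checkpoint name below is a placeholder (the public base stsb-bert-base model, not the released fine-tuned weights), and the example texts are made up:

    # Minimal retrieval sketch: rank fact-checking articles against a tweet.
    # NOTE: "sentence-transformers/stsb-bert-base" is only the base checkpoint;
    # substitute the fine-tuned CrowdChecked model released with this repo.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("sentence-transformers/stsb-bert-base")

    tweet = "Example tweet making a checkable claim."
    vclaims = [
        "Snopes: verified claim text of article one ...",
        "Snopes: verified claim text of article two ...",
    ]

    # Encode the query tweet and the candidate verified claims.
    tweet_emb = model.encode(tweet, convert_to_tensor=True)
    vclaim_embs = model.encode(vclaims, convert_to_tensor=True)

    # Rank candidates by cosine similarity (higher = better match).
    scores = util.cos_sim(tweet_emb, vclaim_embs)[0]
    for text, score in sorted(zip(vclaims, scores.tolist()), key=lambda p: -p[1]):
        print(f"{score:.3f}\t{text[:60]}")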

Datasets

For the CrowdChecked dataset, we provide the following files:

  • The retrieved Snopes fact-checking articles (data/clef2021-format/vclaims.tar.gz).
  • The IDs of the claims from Twitter (data/clef2021-format/tweets-all-ids.tsv.tar.gz). We share only the IDs to comply with Twitter's policies; see the hydration sketch after this list.
  • The mapping between the tweets and their corresponding Snopes articles in the CLEF 2021 format (qrels, data/clef2021-format/qrels-train-*). The file suffix encodes the filtering method (cosine or Jaccard similarity) and the cutoff threshold, e.g., qrels-train-30.tsv.tar.gz uses Jaccard similarity with a cutoff threshold of 0.30.
  • The similarity predictions from SBERT used in the cosine similarity filtering (data/sbert_predictions_ids.csv.tar.gz).
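Since only the tweet IDs are distributed, the tweet texts must be re-fetched ("hydrated") through the Twitter API. Below is a minimal sketch using tweepy; the client setup and batching are our own assumptions, not code from this repository:

    # Hypothetical hydration helper: fetch tweet texts from their IDs.
    # Requires a Twitter API bearer token; tweepy is one possible client.
    import tweepy

    client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

    def hydrate(tweet_ids):
        """Yield (id, text) pairs for the given tweet IDs."""
        for i in range(0, len(tweet_ids), 100):  # API limit: 100 IDs per request
            resp = client.get_tweets(ids=tweet_ids[i:i + 100], tweet_fields=["text"])
            for tweet in resp.data or []:
                yield tweet.id, tweet.text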

The input and output formats are the same as in the CheckThat! 2021 competition (Task 2A). Please refer to the input/output format described in CheckThat! 2021 Task 2A, Input Data Format.
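For illustration, a qrels file can be loaded with pandas as follows; we assume the TREC-style four-column tab-separated layout used by the CheckThat! lab, and the column names are our own:

    # Loading sketch for a qrels file, assuming the TREC-style layout
    # <tweet_id> <unused> <vclaim_id> <relevance>; column names are ours.
    import pandas as pd

    qrels = pd.read_csv("qrels-train-30.tsv", sep="\t",
                        names=["tweet_id", "q0", "vclaim_id", "relevance"])

    # Map each tweet to the fact-checking article(s) it was paired with.
    tweet_to_vclaims = qrels.groupby("tweet_id")["vclaim_id"].apply(list)
    print(tweet_to_vclaims.head())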

TBA

Requirements

The project uses Poetry to manage its dependencies. Run the following commands to install them and spawn a shell in the project environment:

> poetry install
> poetry shell

We provide the corresponding requirements.txt for convenience.

Training

To train the model, you can use the following script:

    # CROWDCHECKED_PATH, QRELS_PATH, TWEETS_PATH are resolved from the ids shared in the `data` folder.
    # CLEF_PATH is the path to the `https://gitlab.com/checkthat_lab/clef2021-checkthat-lab` repo
    ${PYTHON_DIR}/python ${TRAINER_DIR}/trainer.py \
        --train_data_path ${CROWDCHECKED_PATH}/${QRELS_PATH} \
        --train_tweets_path ${CROWDCHECKED_PATH}/${TWEETS_PATH}.tsv \
        --dev_data_path ${CLEF_PATH}/data/subtask-2a--english/train/qrels-dev.tsv \
        --dev_tweets_path ${CLEF_PATH}/data/subtask-2a--english/train/tweets-train-dev.tsv \
        --test_data_path ${CLEF_PATH}/test-gold/subtask-2a--english/qrels-test.tsv \
        --test_tweets_path ${CLEF_PATH}/test-gold/subtask-2a--english/tweets-test.tsv \
        --vclaims_train_path ${CROWDCHECKED_PATH}/vclaims/ \
        --vclaims_dev_path ${CLEF_PATH}/data/subtask-2a--english/train/vclaims/ \
        --vclaims_test_path ${CLEF_PATH}/data/subtask-2a--english/train/vclaims/ \
        --model_name_or_path ${MODEL_NAME} \
        --output_dir ${OUTPUT_PATH} \
        --cache_dir cache \
        --max_seq_length 128 \
        --do_train \
        --do_eval \
        --do_predict \
        --logging_steps 500 \
        --per_gpu_train_batch_size 32 \
        --per_gpu_eval_batch_size 128 \
        --learning_rate 2e-05 \
        --weight_decay 0.01 \
        --adam_epsilon 1e-08 \
        --max_grad_norm 1.0 \
        --num_train_epochs 10 \
        --warmup_proportion 0.1 \
        --seed ${seed} \
        --remove_dates \
        --overwrite_output_dir
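The trainer implements the modified self-adaptive training described in the paper to cope with the noisy, distantly supervised labels. As a rough illustration of the underlying idea only (the paper's modification differs in its details), here is a generic sketch of the soft-target update; alpha and the update schedule are assumptions:

    # Illustration of (vanilla) self-adaptive training: noisy labels are kept
    # as soft targets and gradually mixed with the model's own predictions,
    # so the model can down-weight mislabeled pairs. Not the repo's trainer.
    import torch
    import torch.nn.functional as F

    def self_adaptive_step(logits, soft_targets, alpha=0.9):
        """One training step: update the soft targets, return the loss.

        logits:       model outputs, shape (batch, num_classes)
        soft_targets: running targets, same shape, initialized from the
                      noisy one-hot labels (requires_grad=False)
        """
        with torch.no_grad():
            probs = F.softmax(logits, dim=-1)
            # Exponential moving average of old targets and current predictions.
            soft_targets.mul_(alpha).add_((1 - alpha) * probs)
        # Cross-entropy against the updated soft targets.
        loss = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
        return loss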

References

Please cite as [1].

[1] M. Hardalov, A. Chernyavskiy, I. Koychev, D. Ilvovsky, and P. Nakov, "CrowdChecked: Detecting Previously Fact-Checked Claims in Social Media". In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 266–285, Online only.

@inproceedings{hardalov-etal-2022-crowdchecked,
    title = "{C}rowd{C}hecked: Detecting Previously Fact-Checked Claims in Social Media",
    author = "Hardalov, Momchil  and
      Chernyavskiy, Anton  and
      Koychev, Ivan  and
      Ilvovsky, Dmitry  and
      Nakov, Preslav",
    booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    series = "AACL-IJCNLP~'22",
    year = "2022",
    address = "Online only",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.aacl-main.22",
    pages = "266--285",
}

License

The dataset is licensed under CC BY-NC 4.0; see data/LICENSE. The code in this repository is licensed under the Apache 2.0 license.
