CANDLE 🕯: Extracting Cultural Commonsense Knowledge at Scale

Running spaCy on your input corpus

The first step is to run spaCy on your input corpus of choice, using the script candle/run_spacy.py. For example, to process one of the dummy files in the candle/data/input_corpus directory, run the following commands:

cd candle
python run_spacy.py \
    -i data/input_corpus/dummy-000.jsonl \
    -o data/spacy/dummy-000.spacy

The input file must be a JSONL file, where each line is a JSON object with the following fields:

text: The text of the document (required).
timestamp: The timestamp of the document (optional).
url: The URL of the document (optional).
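
If your corpus is not yet in this format, the following minimal sketch shows one way to produce such a file with the Python standard library. The helper name write_corpus, the example documents, and the timestamp format are illustrative assumptions, not part of CANDLE; only the three field names come from the description above.

```python
import json
import tempfile
from pathlib import Path


def write_corpus(records, path):
    """Write documents to a JSONL file, one JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # "text" is the only required field; "timestamp" and "url" are optional.
            if not record.get("text"):
                raise ValueError("every document needs a non-empty 'text' field")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical documents; the timestamp format shown here is an assumption.
docs = [
    {
        "text": "Example document one.",
        "timestamp": "2023-01-01T00:00:00Z",
        "url": "https://example.com/1",
    },
    {"text": "Example document two."},
]

# A scratch directory is used here; in practice, write into your input corpus directory.
corpus_path = Path(tempfile.mkdtemp()) / "my-corpus-000.jsonl"
write_corpus(docs, corpus_path)
```

The resulting file can then be passed to run_spacy.py through the -i argument.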

After running spaCy on all the input files, create a file that lists the paths to all the spaCy output files (see, e.g., candle/data/spacy/dummy.txt). Pass this file to the next steps using the spacy_file_list argument (see below).
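
The list file can be written by hand or generated. Below is a minimal sketch that collects every .spacy file in a directory, assuming the list holds one path per line (check candle/data/spacy/dummy.txt for the exact format expected). The function name write_spacy_file_list and the scratch files are hypothetical.

```python
import tempfile
from pathlib import Path


def write_spacy_file_list(spacy_dir, list_path):
    """Write the paths of all .spacy files in spacy_dir to list_path.

    Assumption: the pipeline expects one path per line.
    """
    paths = sorted(str(p) for p in Path(spacy_dir).glob("*.spacy"))
    Path(list_path).write_text("".join(p + "\n" for p in paths), encoding="utf-8")
    return paths


# Demo in a scratch directory with empty placeholder files.
scratch = Path(tempfile.mkdtemp())
for name in ("dummy-001.spacy", "dummy-000.spacy", "notes.txt"):
    (scratch / name).touch()

listed = write_spacy_file_list(scratch, scratch / "dummy.txt")
```

Only files with the .spacy extension are listed, in sorted order, so unrelated files in the same directory are ignored.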

CANDLE pipeline execution

The pipeline consists of 6 components (see candle/pipeline/pipeline.py), which are selected by number through the --components argument.

For example, to run the pipeline for the religions domain (see also candle/config_religions.yaml), follow these steps:

Start your local MongoDB instance:

cd /path/to/mongodb/folder
bin/mongod --dbpath /folder/to/save/the/database --bind_ip_all

Run the first 3 components:

cd candle/candle
python main.py \
  --config config_religions.yaml \
  --people_group religions \
  --spacy_file_list data/spacy/dummy.txt \
  --components 1 2 3

Run the last 3 components:

for facet in "food" "drink" "ritual"
do
  python main.py \
    --config config_religions.yaml \
    --people_group religions \
    --components 4 5 6 \
    --cluster_facet $facet \
    --cluster_nid data/religions/religion_ids.txt \
    --domain religions \
    --output_file _outputs/religions_$facet.jsonl
done
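
After the loop finishes, a quick sanity check is to count the records written for each facet. The sketch below assumes only that each output file holds one JSON object per line, as the .jsonl extension suggests; it makes no assumption about the fields inside each record. The helper names and the placeholder records in the demo are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Return the JSON objects stored one per line in path, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_per_facet(output_dir, domain, facets):
    """Map each facet to its record count, or None if its output file is missing."""
    counts = {}
    for facet in facets:
        path = Path(output_dir) / f"{domain}_{facet}.jsonl"
        counts[facet] = len(load_jsonl(path)) if path.exists() else None
    return counts


# Demo with a placeholder file; real output records will look different.
out_dir = Path(tempfile.mkdtemp())
(out_dir / "religions_food.jsonl").write_text(
    '{"placeholder": 1}\n{"placeholder": 2}\n', encoding="utf-8"
)
counts = count_per_facet(out_dir, "religions", ["food", "drink", "ritual"])
```

For the commands above, output_dir would be _outputs and domain would be religions; a None entry points to a facet whose run did not produce a file.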

Citation

If you use this code or our datasets, please cite the following paper:

@inproceedings{candle2023,
  title={Extracting Cultural Commonsense Knowledge at Scale},
  author={Nguyen, Tuan-Phong and Razniewski, Simon and Varde, Aparna and Weikum, Gerhard},
  booktitle={Proceedings of the ACM Web Conference},
  year={2023}
}

More information is available at https://candle.mpi-inf.mpg.de/
