StereoKG

Repository for the WOAH paper "StereoKG: Data-Driven Knowledge Graph Construction for Cultural Knowledge and Stereotypes".

Requirements

Software:

PRAW
NLTK
PyTorch
SimpleTransformers
HuggingFace Transformers

Account Credentials

PRAW - Set up a Reddit Account
Twitter - Access Twitter Developer API Credentials

Pipeline

Credentials

In Credentials.py, set all configuration values: the subreddits to query, the Twitter developer credentials, the absolute paths to the respective directories, and the list of all subject entities in the KG.
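As a rough illustration of what such a file might contain, here is a minimal sketch. Every variable name and value below is a hypothetical placeholder and is not taken from the repository; check the actual Credentials.py for the names the scripts expect. The entity names are borrowed from the example sentences in this README.

```python
# Illustrative layout only: all names and values here are hypothetical placeholders.
from pathlib import Path

# Subreddits to query with PRAW
SUBREDDITS = ["example_subreddit_1", "example_subreddit_2"]

# Reddit API credentials (PRAW needs a client id, client secret and user agent)
REDDIT_CREDENTIALS = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "user_agent": "YOUR_USER_AGENT",
}

# Twitter developer credentials
TWITTER_CREDENTIALS = {
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET",
}

# Absolute paths to the working directories
QUESTIONS_DIR = Path("/absolute/path/to/questions")
DATA_DIR = Path("/absolute/path/to/data")

# Subject entities in the KG
ENTITIES = ["germans", "indians", "christians"]
```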

Question Templates

The question templates for each entity are stored in the questions folder. To add an entity, create a questions file for it and enter the questions in the format

subject, <question pattern>
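A minimal sketch of how lines in this format could be parsed, assuming one template per line and that the subject itself contains no comma. The function name `parse_question_lines` and the example patterns are hypothetical and are not taken from the repository.

```python
from typing import List, Tuple


def parse_question_lines(lines: List[str]) -> List[Tuple[str, str]]:
    """Parse 'subject, <question pattern>' lines into (subject, pattern) pairs."""
    pairs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first comma only, so commas inside the pattern survive
        subject, sep, pattern = line.partition(",")
        if not sep:
            raise ValueError(f"expected 'subject, <question pattern>', got: {line!r}")
        pairs.append((subject.strip(), pattern.strip()))
    return pairs


# Hypothetical template lines for illustration
example_lines = ["germans, why are germans", "germans, why do germans"]
parsed = parse_question_lines(example_lines)
print(parsed)
```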

(1) Data Extraction

Scraping Reddit - scrape_reddit.py
Scraping Twitter - scrape_twitter.py

The extracted questions are converted to sentences and stored at the file path specified in Credentials.py. Example sentences:

germans are obsessed with the wednesday frog meme
indians are inherently happy
christians are supporting donald trump
Muslim women wear burkha
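To make the question-to-sentence step concrete, here is a small rule-based sketch. It is an assumption about how such a conversion could work, not the logic used in scrape_reddit.py or scrape_twitter.py; the function name and the two "why are" / "why do" rules are hypothetical.

```python
import re
from typing import Optional

# Hypothetical rules:
#   "why are <subject> <rest>" -> "<subject> are <rest>"
#   "why do <subject> <rest>"  -> "<subject> <rest>"
_QUESTION = re.compile(r"^why\s+(are|do)\s+(.+?)\??$", re.IGNORECASE)


def question_to_sentence(question: str, subject: str) -> Optional[str]:
    """Turn a 'why are/do <subject> ...' question into a statement, or return None."""
    match = _QUESTION.match(question.strip())
    if match is None:
        return None
    verb, rest = match.group(1).lower(), match.group(2)
    # The text after the verb must start with the subject entity
    if not rest.lower().startswith(subject.lower() + " "):
        return None
    predicate = rest[len(subject):].strip()
    if verb == "are":
        return f"{subject} are {predicate}"
    return f"{subject} {predicate}"


print(question_to_sentence("Why are germans obsessed with the wednesday frog meme?", "germans"))
print(question_to_sentence("why do muslim women wear burkha", "Muslim women"))
```

Questions that do not match either rule return None, so a caller can simply drop them.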

(2) Fast clustering

Run fast_clustering.py to cluster the sentences. If create_singleton_mode is true, singleton and non-singleton clusters are written to separate files. Example clusters:

['Christians believe everyone else should live and abide by their rules', 'Christians love to assume that everyone is supposed to live by the rules of their religion']
['indians are so loud', 'indians are always so loud in public places', 'indians are such noisy eaters']
['French people speak so fast', 'Serious question why do french people talk so quick', 'French people speak so quickly I am just a child']
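The idea of threshold-based clustering with a singleton split can be sketched as follows. This is a simplified stand-in, not the algorithm in fast_clustering.py: it uses bag-of-words cosine similarity in place of sentence embeddings and a greedy "first matching seed" assignment. The function names and the 0.5 threshold are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def bow(sentence: str) -> Counter:
    """Bag-of-words vector (a stand-in for a sentence embedding)."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(sentences: List[str], threshold: float = 0.5) -> List[List[str]]:
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    vectors = [bow(s) for s in sentences]
    clusters: List[List[int]] = []  # lists of sentence indices; first index is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no seed was close enough: start a new cluster
    return [[sentences[i] for i in members] for members in clusters]


def split_singletons(clusters: List[List[str]]) -> Tuple[List[List[str]], List[List[str]]]:
    """Separate one-sentence clusters from the rest."""
    singletons = [c for c in clusters if len(c) == 1]
    non_singletons = [c for c in clusters if len(c) > 1]
    return singletons, non_singletons


sentences = [
    "indians are so loud",
    "indians are always so loud in public places",
    "French people speak so fast",
]
singletons, non_singletons = split_singletons(cluster(sentences))
print(non_singletons)
print(singletons)
```

Word overlap cannot group paraphrases such as "speak so fast" and "talk so quick", which is why embedding-based similarity is the more realistic choice in practice.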

(3) Triple Generation

triple_generation.py extracts triples from the sentences in each cluster using a Python-based OpenIE wrapper.

(4) Triple Selection

This step uses the DistilBERT-CoLA model to choose the most grammatically acceptable triple from each cluster's candidate triples. Example triples:

<indians, don't sacrifice, freedom>
<christians, don't remember, sabbath day>
<atheists, are at, peace with death>
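The selection step amounts to scoring each candidate and keeping the highest-scoring one. Below is a minimal sketch with a pluggable scoring function. In practice the scorer would wrap an acceptability classifier such as DistilBERT-CoLA; here it is replaced by a lookup table of made-up scores so the example runs without downloading a model. The names `flatten_triple` and `select_best_triple` are hypothetical.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


def flatten_triple(triple: Triple) -> str:
    """Join subject, predicate and object into one string for scoring."""
    return " ".join(triple)


def select_best_triple(candidates: List[Triple], score: Callable[[str], float]) -> Triple:
    """Return the candidate whose flattened text gets the highest score."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda t: score(flatten_triple(t)))


# Made-up acceptability scores for illustration only (not model output)
toy_scores = {
    "atheists are at peace with death": 0.9,
    "atheists at peace": 0.4,
}

candidates = [
    ("atheists", "at", "peace"),
    ("atheists", "are at", "peace with death"),
]
best = select_best_triple(candidates, score=lambda text: toy_scores[text])
print(best)
```

Keeping the scorer as a parameter makes it easy to swap in a real classifier later without touching the selection logic.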

KG

The resultant KG of triples is saved in kg/stereoKG.tsv.

Using the triple_to_text model trained on WebNLG data, this same KG is converted to a verbalized form and saved in kg/stereoKG_linearised.txt.
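For downstream use, the triple file can be read with the standard csv module. The sketch below assumes three tab-separated columns (subject, predicate, object) per row; check kg/stereoKG.tsv for the actual column layout. The function name `read_triples` is hypothetical, and the sample rows reuse the example triples shown earlier.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_triples(handle: TextIO) -> Iterator[Tuple[str, ...]]:
    """Yield (subject, predicate, object) tuples from a tab-separated stream."""
    # QUOTE_NONE keeps apostrophes and quote characters in the text untouched
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # assumed layout: exactly three columns; skip anything else
        yield tuple(cell.strip() for cell in row)


# In-memory sample standing in for the real file
sample = io.StringIO(
    "indians\tdon't sacrifice\tfreedom\n"
    "atheists\tare at\tpeace with death\n"
)
triples = list(read_triples(sample))
print(triples)
```

To read the real file, pass an open handle instead, for example `open("kg/stereoKG.tsv", newline="", encoding="utf-8")`.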

Experiments

The code for the language model integration experiments with intermediate pretraining is in lm_integration.

Models

The best models in our research paper are Twitter-RoBERTa models with intermediate pretraining on the structured and unstructured KG triples. They can be found at the following links:

StereoKG-DT-SK - Domain (Twitter) pre-trained RoBERTa with intermediate pretraining on verbalized triples.
StereoKG-DT-UK - Domain (Twitter) pre-trained RoBERTa with intermediate pretraining on scraped sentences.
