EDM-BERT

EMBERT currently uses static representations of Wikipedia pages; we plan to use dynamic representations instead. We do this by finding, for each entity, the paragraph of its Wikipedia page that is most relevant to the query, using a BM25 retrieval model.
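
As a rough illustration, this paragraph selection can be done with Pyserini's BM25 searcher. The following is a minimal sketch, assuming a paragraph-level Lucene index in full_wiki_dump and a hypothetical convention that paragraph document ids start with the entity name; the repository's actual id scheme may differ.

# Minimal sketch: find the BM25-most-relevant paragraph of an entity's
# Wikipedia page for a given query. The index path and the docid-prefix
# convention are assumptions, not the repository's actual schema.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher('full_wiki_dump')  # paragraph-level index
searcher.set_bm25(k1=0.9, b=0.4)

def most_relevant_paragraph(query, entity, k=100):
    hits = searcher.search(query, k=k)
    for hit in hits:
        if hit.docid.startswith(entity):  # keep only this entity's paragraphs
            return searcher.doc(hit.docid).raw()
    return None  # entity has no paragraph in the top-k results

print(most_relevant_paragraph('doctor who', 'Neil_Gaiman'))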

Setup

  • Clone this repository.
  • Run Code/make.sh.
  • cd Code/XML_parser/
  • Download the entity file and the Wikipedia dump.
  • Install xml_split: sudo apt install xml-twig-tools
  • Strip the abstracts from the TSV file, keeping only the entity names: awk -F"\t" '{print $1}' short_abstracts_en_full.tsv > entities.tsv
  • Run the script that splits up the Wikipedia dump and creates JSON files for the Lucene index: ./build_jsons.sh 5Gb wikipedia.xml entities.tsv
  • Create the Lucene index (a quick sanity check for the result is sketched after this list): python -m pyserini.index -collection JsonCollection -input json -threads 20 -index "full_wiki_dump/" -storeDocvectors -storePositions -storeRaw
  • Run EDM-BERT with this index: python3 -m pygaggle.run.evaluate_document_ranker --split dev --method seq_class_transformer --model "output/monobert-large-msmarco-finetuned_acc_batch_testmodel_acc_batch_600k_64_e6" --dataset "data/DBpedia-Entity/" --index-dir "full_wiki_dump" --task msmarco --output-file ../Runs/testrun_mostrel.tsv --w2v "resources/wikipedia2vec/wikipedia-20190701/wikipedia2vec_500.pkl" --mapper "mappers/wikipedia2vec-500-cased.monobert-base-cased.linear.npy"
  • Evaluate the scores: python3 Code/Evaluation.py "Code/XML_Parser/qrels-v2.txt" "Runs/testrun_mostrel.tsv"
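
Once the index is built, it is worth confirming it is readable and non-empty before running the full pipeline. This is a minimal sketch using Pyserini's IndexReader; the index path matches the command above.

# Minimal sketch: confirm the freshly built Lucene index is readable
# and non-empty before running the reranker against it.
from pyserini.index.lucene import IndexReader

reader = IndexReader('full_wiki_dump')
print(reader.stats())  # prints document, term, and posting counts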

Original EMBERT Description

(Figure: EMBERT model architecture)

About this repository

GitHub page with supplementary information for the paper 'Entity-aware Transformers for Entity Search' by Emma Gerritse, Faegheh Hasibi, and Arjen de Vries, accepted at SIGIR 2022.

Structure

The structure of this GitHub repository is as follows. The Runs directory contains all runs from the paper, under the same names as in Table 2 of the paper.

All code is available in the Code directory. All models and supplementary materials can be downloaded by running

cd Code
./make.sh

Note that this will download around 40 GB of data.

We recommend running this code in a virtual environment with Python 3.7 (newer versions lead to conflicts with PyTorch), for example:

python3.7 -m venv venv
source venv/bin/activate

All Python packages can be installed with pip install -r requirements.txt; then run

pip install tensorflow==2.5.0
pip install numpy==1.20.3
pip install click==7.1.1

to complete the installation.
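
A quick way to confirm the pinned versions are active in the virtual environment (a minimal sketch that only checks the three packages pinned above):

# Minimal sketch: print the installed versions of the pinned packages.
import click
import numpy
import tensorflow

for name, module in (('tensorflow', tensorflow), ('numpy', numpy), ('click', click)):
    print(name, module.__version__)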

Reranking

To rerank, run the following command from the Code directory:

python -m pygaggle.run.evaluate_document_ranker --split dev --method seq_class_transformer --model pathtomodel --dataset pathtodata --index-dir pathtoindex --task msmarco --output-file pathtooutput --w2v pathtowikipedia2vec --mapper pathtomapper

For example:

python -m pygaggle.run.evaluate_document_ranker --split dev --method seq_class_transformer --model ../output/monobert-large-msmarco-finetuned_acc_batch_testmodel_acc_batch_600k_64_e6 --dataset ../data/DBpedia-Entity --index-dir ../indexes/lucene-index-dbpedia_annotated_full --task msmarco --output-file ../Runs/testrun.tsv --w2v ../resources/wikipedia2vec/wikipedia-20190701/wikipedia2vec_500.pkl --mapper ./mappers/wikipedia2vec-500-cased.monobert-base-cased.linear.npy
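
The --w2v and --mapper arguments supply pretrained Wikipedia2Vec entity embeddings and a linear map that projects them into BERT's wordpiece embedding space (as in E-BERT). Below is a minimal sketch of that projection, assuming the wikipedia2vec package and that the .npy file holds a (500, hidden_size) matrix; the shapes and the example entity are illustrative, not verified against the repository.

# Minimal sketch: project a Wikipedia2Vec entity vector into BERT's
# embedding space with the linear mapper. Shapes are assumptions.
import numpy as np
from wikipedia2vec import Wikipedia2Vec

w2v = Wikipedia2Vec.load('resources/wikipedia2vec/wikipedia-20190701/wikipedia2vec_500.pkl')
mapper = np.load('mappers/wikipedia2vec-500-cased.monobert-base-cased.linear.npy')

entity_vec = w2v.get_entity_vector('Neil Gaiman')  # 500-dim entity embedding
bert_space_vec = entity_vec @ mapper               # now lives in BERT's input space
print(bert_space_vec.shape)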

More information

Most of the code is based on either the E-BERT or the Pygaggle repository.

To use this on your own datasets, provide all documents and queries in the same format as in the Pygaggle repository, but annotate both the documents (before indexing) and the queries. Annotations should come right after the mention, for example: Neil Gaiman ENTITY/Neil_Gaiman novels. We used REL, but you can use any entity linker, as long as the part after the ENTITY/ prefix is a Wikipedia page title.
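
A minimal sketch of this annotation step follows; the mention list is hand-made here, where in practice it would come from an entity linker such as REL.

# Minimal sketch: insert an ENTITY/<Wikipedia_page> token right after each
# linked mention. Mentions are hand-supplied; a real pipeline would get
# them from an entity linker such as REL.
def annotate(text, mentions):
    # mentions: (surface_form, wikipedia_page) pairs occurring verbatim in text
    for surface, page in mentions:
        text = text.replace(surface, f'{surface} ENTITY/{page}', 1)
    return text

doc = 'Neil Gaiman wrote several acclaimed novels.'
print(annotate(doc, [('Neil Gaiman', 'Neil_Gaiman')]))
# -> Neil Gaiman ENTITY/Neil_Gaiman wrote several acclaimed novels.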

An example of finetuning can be found in Code/retraining_dbpedia_entity_folds.py.

Downloads

Everything needed to evaluate the model can be downloaded with the script in Code/make.sh. If you just want the separate models or Lucene indexes, they can be downloaded here:

  • TSV of DBpedia Entity
  • TSV of DBpedia Entity, annotated
  • Lucene index for DBpedia Entity
  • Lucene index for DBpedia Entity, annotated
  • Wikipedia2vec embeddings
  • EMBERT finetuned on annotated DBpedia Entity, all 5 folds
  • MonoBERT finetuned on DBpedia Entity, not annotated, all 5 folds
  • EMBERT finetuned on MS MARCO (EMBERT (1st) in the paper)

Citation and contact

You can cite us using

@inproceedings{Gerritse:2022:Entity,
  author    = {Gerritse, Emma and Hasibi, Faegheh and De Vries, Arjen},
  booktitle = {Proc. of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  series    = {SIGIR '22},
  title     = {{Entity-aware Transformers for Entity Search}},
  year      = {2022}
}

In case anything is missing, please open an issue or send an email to emma.gerritse@ru.nl.
