https://biendata.com/competition/wsdm2020/
ID: @nlp-rabbit
Python >= 3.6
git clone https://github.com/supercoderhawk/wsdm-digg-2020
pip3 install -r requirements.txt
python3 -m spacy download en
- Set up an Elasticsearch service (refer to link).
- Set ES_BASE_URL in constants.py to your configured Elasticsearch endpoint.
- Unzip the file and put all files under the data/ folder, then rename test.csv to test_release.csv.
- Download the model, unzip it, and put the files into the data folder.
- Execute bash scripts/prepare_data.sh in the project root folder to build the data for the next step.
- Execute bash scripts/run_end2end.sh in the project root folder.
The script above includes the following main parts:
- Run Elasticsearch to retrieve candidate papers. The core logic is in search\search.py, which is called by benchmark\benchmark.py.
- Rerank the candidates with BERT. The core logic is in reranking\predict.py, and the model code is in reranking\plm_rerank.py.
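The two parts listed above form a retrieve-then-rerank pipeline. The following is a minimal, self-contained sketch of that idea, not the code in search\search.py or reranking\predict.py. The function names (retrieve_then_rerank, term_overlap, jaccard), the toy scorers, and the sample papers are all hypothetical stand-ins: in the project the first stage is an Elasticsearch query and the second stage is a BERT model.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    papers: Dict[str, str],
    recall_score: Scorer,
    rerank_score: Scorer,
    recall_size: int = 3,
    top_k: int = 3,
) -> List[str]:
    """Return paper ids: recall a candidate pool first, then rerank only that pool."""
    # Stage 1: cheap scoring over the whole collection (stand-in for the search step).
    recalled = sorted(
        papers, key=lambda pid: recall_score(query, papers[pid]), reverse=True
    )[:recall_size]
    # Stage 2: costlier scoring over the recalled candidates only (stand-in for BERT).
    reranked = sorted(
        recalled, key=lambda pid: rerank_score(query, papers[pid]), reverse=True
    )
    return reranked[:top_k]


def term_overlap(query: str, text: str) -> float:
    """Toy recall scorer (assumption): number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def jaccard(query: str, text: str) -> float:
    """Toy rerank scorer (assumption): Jaccard similarity of the token sets."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


# Hypothetical sample data, only for illustration.
papers = {
    "p1": "graph neural networks for citation recommendation",
    "p2": "citation recommendation",
    "p3": "image classification with convolutional networks",
    "p4": "protein folding",
}
ranking = retrieve_then_rerank(
    "citation recommendation with neural networks",
    papers,
    term_overlap,
    jaccard,
    recall_size=3,
    top_k=2,
)
print(ranking)
```

The point of the split is cost: the expensive scorer only ever sees recall_size candidates, so anything the first stage misses cannot be recovered by the second.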
- Recall phase
  - Keyword and keyphrase extraction
    - Noun chunk extraction
    - TextRank keyword extraction
    - Candidate keyword filtering, keeping nouns, proper nouns, and adjectives
  - BM25-based search (Elasticsearch)
- Rerank phase
  - BERT-based rerank (SciBERT from AllenAI), single model without any ensemble methods
  - Training data is built from the first-stage (BM25) search results
  - The loss is a margin loss (hinge loss), which is widely used in ranking scenarios
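To make the BM25 step of the recall phase concrete, here is a small sketch of the textbook Okapi BM25 formula in plain Python. It is an illustration only: the project delegates scoring to Elasticsearch, whose internal implementation may differ in details. The function name bm25_scores and the sample documents are hypothetical; k1 = 1.2 and b = 0.75 are the Elasticsearch defaults for its BM25 similarity.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.2, b: float = 0.75
) -> List[float]:
    """Score every tokenized document against the tokenized query with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    # Fall back to 1.0 so that a collection of empty documents does not divide by zero.
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical tokenized documents and query keywords.
docs = [["bert", "ranking", "model"], ["bm25", "ranking"], ["image", "classification"]]
scores = bm25_scores(["bert", "ranking"], docs)
print(scores)
```

In the pipeline described above, the query terms would be the extracted keywords and keyphrases rather than the raw description text.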
The only model that needs to be trained is the BERT-based reranking model.
# prepare training data for reranking
bash scripts/prepare_rerank.sh
# training the rerank model
bash scripts/train_rerank.sh
# predict the result
bash scripts/predict_rerank.sh
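The margin (hinge) loss mentioned for the rerank phase can be sketched in a few lines. This is a plain-Python illustration under assumptions, not the code run by scripts/train_rerank.sh: the function name pairwise_hinge_loss, the margin value of 1.0, and the example scores are hypothetical. Each pair holds the model score of a relevant paper and of an irrelevant paper for the same query.

```python
from typing import Sequence


def pairwise_hinge_loss(
    pos_scores: Sequence[float], neg_scores: Sequence[float], margin: float = 1.0
) -> float:
    """Mean of max(0, margin - (pos - neg)) over (relevant, irrelevant) score pairs."""
    if len(pos_scores) != len(neg_scores):
        raise ValueError("pos_scores and neg_scores must have the same length")
    if not pos_scores:
        return 0.0
    # A pair contributes no loss once the relevant paper outscores
    # the irrelevant one by at least the margin.
    losses = [max(0.0, margin - (p - n)) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


# First pair is already separated by more than the margin (loss 0.0);
# second pair is ranked the wrong way round (loss 1.5).
loss = pairwise_hinge_loss([2.0, 0.5], [0.5, 1.0])
print(loss)
```

In PyTorch, torch.nn.MarginRankingLoss with a target of 1 computes the same quantity on tensors; whether the project uses that class is not stated here.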
- In this project, the abbreviation plm means Pretrained Language Model.
- Methods tried but not effective:
  - Bert-Knrm and Bert-ConvKnrm (paper: CEDR: Contextualized Embeddings for Document Ranking), code in reranking\plm_knrm.py and reranking\plm_conv_knrm.py
  - BERT-based sentence vectorization (paper: Universal Sentence Encoder, with the BERT CLS output replacing the vanilla transformer trained from scratch), code in vectorization\plm_vectorization.py and vectorization\predict.py