GitHub - spyflying/TrecPM-2017-Elasticsearch

UCAS (University of Chinese Academy of Sciences) Fall 2018 Modern Information Retrieval course project

Authors: Huang, Hui, Gao

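Code overview

elasticsearch_copy: the required Elasticsearch Python package; install it with pip or copy this directory directly
extract_xml_to_elastic.py/extract_xml_to_elastic_multiprocess.py: add the XML files in clinicaltrials_xml to an index named "ct", creating one document with the corresponding id per XML file; the multiprocess version runs in parallel across several processes
query_train: training; trains on the ground-truth documents of 24 topics; output: cache/cache*/keywords.txt
query_test: five-fold cross-validation; each round trains on 4 folds and tests on the remaining fold to produce query results; output: qresults/result.txt
test_all: trains on all samples, then tests every topic and generates query results
trec_eval.9.0: evaluation tool; build it with make, then run trec_eval qrel.txt result.txt to get the final scores
query_test_interaction.py: interactive evaluation script; enter a topic id (1~30) and it returns the query results sorted by score
run.sh: runs index building, training, testing, and evaluation in one step
search.sh: launches the interactive query interface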

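Data

clinicaltrials_xml: a three-level directory tree holding 240,000 electronic medical records in total, each stored as one .xml file
topics2017.xml: the queries, 30 topics in total; the query_elasticsearch.py script builds a query for each topic and generates the query results
qrels-final-trials.txt: the ground-truth file for topics2017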
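
As a rough illustration of how the topic fields could be read from topics2017.xml, here is a minimal sketch. It assumes the TREC PM 2017 topic layout (a `topic` element with a `number` attribute and `disease`, `gene`, and free-text `demographic` children); the function name `parse_topics` and the sample XML are hypothetical and not taken from the repository.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed topics2017.xml layout.
SAMPLE = """<topics>
  <topic number="1">
    <disease>Liposarcoma</disease>
    <gene>CDK4 Amplification</gene>
    <demographic>38-year-old male</demographic>
  </topic>
</topics>"""


def parse_topics(xml_text):
    """Return one dict per topic with id, disease, gene, age, and gender."""
    topics = []
    for node in ET.fromstring(xml_text).iter("topic"):
        demographic = node.findtext("demographic", default="")
        age_match = re.search(r"(\d+)-year-old", demographic)
        gender_match = re.search(r"\b(male|female)\b", demographic)
        topics.append({
            "id": int(node.get("number")),
            "disease": node.findtext("disease", default="").strip(),
            "gene": node.findtext("gene", default="").strip(),
            "age": int(age_match.group(1)) if age_match else None,
            "gender": gender_match.group(1) if gender_match else None,
        })
    return topics


topics = parse_topics(SAMPLE)
```

The age and gender are split out of the demographic text here because the queries filter on them separately; the actual scripts may extract them differently.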

Requirements

Elasticsearch

No installation is needed: download the tar package, extract it, enter the folder, and run:

./bin/elasticsearch
Python 3.6

nltk

Run the following commands in a terminal to install nltk, then use nltk from Python to download punkt

 user$ pip install nltk
 user$ python
 >>> import nltk
 >>> nltk.download('punkt')

Usage

Data

clinicaltrials/

Start Elasticsearch

./bin/elasticsearch

Build the index

python extract_xml_to_elastic_multiprocess.py
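
To give an idea of what building one index entry per XML file involves, the sketch below turns a single trial record into a document dict. The element paths are assumptions based on the usual ClinicalTrials.gov XML export, the sample record is hypothetical and heavily trimmed, and `parse_age` and `xml_to_doc` are illustrative names; sending the document to the "ct" index is left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, trimmed record; real files contain many more elements.
SAMPLE = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>Example trial</brief_title>
  <brief_summary><textblock>A short summary.</textblock></brief_summary>
  <eligibility>
    <criteria><textblock>Inclusion criteria: adults.</textblock></criteria>
    <gender>All</gender>
    <minimum_age>18 Years</minimum_age>
    <maximum_age>N/A</maximum_age>
  </eligibility>
  <keyword>melanoma</keyword>
  <keyword>BRAF</keyword>
</clinical_study>"""


def parse_age(text, default):
    """Turn '18 Years' into 18; use default for 'N/A' or a missing value."""
    match = re.match(r"(\d+)", text or "")
    return int(match.group(1)) if match else default


def xml_to_doc(xml_text):
    """Build (doc_id, doc) for one trial, using the field names seen in the queries."""
    root = ET.fromstring(xml_text)
    doc = {
        "brief_title": root.findtext("brief_title", default=""),
        "brief_summary": root.findtext("brief_summary/textblock", default="").strip(),
        "eligibility": root.findtext("eligibility/criteria/textblock", default="").strip(),
        "gender": root.findtext("eligibility/gender", default="all").lower(),
        # Assumed sentinels so that range filters still work without a bound.
        "minimum_age": parse_age(root.findtext("eligibility/minimum_age"), 0),
        "maximum_age": parse_age(root.findtext("eligibility/maximum_age"), 999),
        "keyword": [k.text for k in root.findall("keyword")],
    }
    return root.findtext("id_info/nct_id"), doc


doc_id, doc = xml_to_doc(SAMPLE)
```

Storing the ages as numbers is what lets a range filter compare them with the patient's age; this sketch ignores the unit (years versus months), which a real loader would need to handle.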

Train

python query_train.py

Test

python query_test.py

Compute P@5, P@10, P@15

trec_eval clinicaltrials/qrels-final-trials.txt qresults/result.txt
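
For readers who want to see what P@k measures, here is a small sketch that computes it from qrels and result lines. It assumes the standard TREC formats ("topic 0 doc_id relevance" for qrels, "topic Q0 doc_id rank score tag" for results); the function names and sample lines are hypothetical, and trec_eval remains the reference implementation.

```python
from collections import defaultdict


def load_qrels(lines):
    """qrels lines: 'topic 0 doc_id relevance' -> {topic: set of relevant doc ids}."""
    relevant = defaultdict(set)
    for line in lines:
        topic, _, doc_id, rel = line.split()
        if int(rel) > 0:
            relevant[topic].add(doc_id)
    return relevant


def load_run(lines):
    """result lines: 'topic Q0 doc_id rank score tag' -> {topic: doc ids by score}."""
    scored = defaultdict(list)
    for line in lines:
        topic, _, doc_id, _, score, _ = line.split()
        scored[topic].append((float(score), doc_id))
    # Order by descending score, as an evaluation tool would.
    return {t: [d for _, d in sorted(docs, reverse=True)] for t, docs in scored.items()}


def precision_at_k(relevant, ranked, k):
    """Mean over topics of (number of relevant docs in the top k) / k."""
    scores = [
        sum(doc in relevant.get(topic, set()) for doc in docs[:k]) / k
        for topic, docs in ranked.items()
    ]
    return sum(scores) / len(scores)


qrels = ["1 0 NCT1 2", "1 0 NCT2 0", "1 0 NCT3 1"]
run = ["1 Q0 NCT1 1 9.5 demo", "1 Q0 NCT2 2 7.1 demo", "1 Q0 NCT3 3 6.0 demo"]
p2 = precision_at_k(load_qrels(qrels), load_run(run), 2)
```

In this toy run only one of the top two documents is relevant, so `p2` is 0.5.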

Build the index, train, test, and evaluate in one step

./run.sh

Interactive interface

./search.sh

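Tasks

Improve query construction by modifying the es_query() function in query_elasticsearch.py; optimize the code
Get the test script running so that P@5, P@10, and P@15 can be measured
Write the interface program
Clean up the code so that everything runs in one step
Do we need to submit to TREC 2017 PM?
Finish the project report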

Method

Index construction: build a full-text index with Elasticsearch; scoring uses tf/idf (or the BM25 model)
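Query construction: the initial query is built from the "disease", "gene", "gender", and "age" fields of each topic. "disease" and "gene" each use a "multi_match" query that is matched against fields such as "brief_title", "brief_summary", "detailed_description", "eligibility", "keyword", and "mesh_term" and contributes to the relevance score; the "disease" and "gene" clauses are joined with AND

    # Both clauses go into one "must" list; a repeated "must" key would overwrite the first.
    "must": [
        {
            "multi_match": {
                "query": main_query,
                # "field^3" is the multi_match syntax for boosting a field.
                "fields": ["brief_title^3", "brief_summary", "detailed_description", "eligibility",
                    "keyword^3",
                    "mesh_term^3"],
            }
        },
        {
            "multi_match": {
                "query": gene_query,
                "fields": ["brief_title", "brief_summary", "detailed_description", "eligibility", "keyword",
                    "mesh_term"],
            }
        },
    ],

"gender" and "age" use "filter" to filter documents and do not contribute to the relevance score

    # Inside the "bool" query; both range conditions go into one list.
    "filter": [
        {"range": {"maximum_age": {"gte": age_query}}},
        {"range": {"minimum_age": {"lte": age_query}}},
    ]
    # At the top level of the search body, next to "query".
    "post_filter": {"term": {"gender": "all"}}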

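The query above does not use the topics' ground-truth information; it relies only on the existing index and search fields, so its precision is somewhat low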
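
The initial query described above could be assembled roughly as follows. This is a sketch under assumptions, not the repository's es_query(): the function name `build_query` is hypothetical, the boosted fields follow the fragment above using the `field^3` multi_match syntax, and the gender clause (accept the topic's gender or "all") is a guess at the intended filter.

```python
def build_query(main_query, gene_query, age_query, gender_query):
    """Assemble a search body from the four topic fields."""
    text_fields = ["brief_title", "brief_summary", "detailed_description",
                   "eligibility", "keyword", "mesh_term"]
    boosted = {"brief_title", "keyword", "mesh_term"}
    return {
        "query": {
            "bool": {
                # Both clauses must match, i.e. disease AND gene.
                "must": [
                    {"multi_match": {
                        "query": main_query,
                        "fields": [f + "^3" if f in boosted else f for f in text_fields],
                    }},
                    {"multi_match": {"query": gene_query, "fields": text_fields}},
                ],
                # Filters restrict the result set without changing the score.
                "filter": [
                    {"range": {"maximum_age": {"gte": age_query}}},
                    {"range": {"minimum_age": {"lte": age_query}}},
                    # Assumed intent: trials open to this gender or to everyone.
                    {"terms": {"gender": [gender_query, "all"]}},
                ],
            }
        }
    }


body = build_query("Liposarcoma", "CDK4 Amplification", 38, "male")
```

The resulting dict would be passed as the body of a search request against the "ct" index. Keeping age and gender in "filter" means they only include or exclude trials, which matches the statement that they do not take part in relevance scoring.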

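Training: 5-fold cross-validation, with 24 topics as training samples and 6 topics as test samples

Positive samples: all relevant documents in each topic's ground truth

Negative samples: 30 documents chosen at random from all non-relevant documents in each topic's ground truth

topic frequency: the total number of times a word occurs within one topic

doc frequency: the total number of times a word occurs across all documents of all topics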
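
The fold split and the sample selection described above can be sketched as follows. How topics are assigned to folds is not stated, so contiguous folds are an assumption here, as are the fixed random seed and the names `five_folds` and `sample_documents`.

```python
import random


def five_folds(topic_ids, n_folds=5):
    """Split topic ids into contiguous folds: 30 topics -> 5 folds of 6."""
    size = len(topic_ids) // n_folds
    return [topic_ids[i * size:(i + 1) * size] for i in range(n_folds)]


def sample_documents(judgments, n_negative=30, seed=0):
    """judgments: {doc_id: relevance}. Keep every positive, sample the negatives."""
    positives = [d for d, rel in judgments.items() if rel > 0]
    negatives = [d for d, rel in judgments.items() if rel == 0]
    rng = random.Random(seed)
    chosen = rng.sample(negatives, min(n_negative, len(negatives)))
    return positives, chosen


folds = five_folds(list(range(1, 31)))
test_topics = folds[0]
train_topics = [t for fold in folds[1:] for t in fold]

# Hypothetical judgments for one topic: 3 relevant and 40 non-relevant documents.
judgments = {f"POS{i}": 1 for i in range(3)}
judgments.update({f"NEG{i}": 0 for i in range(40)})
positives, negatives = sample_documents(judgments)
```

Each round then trains on 24 topics and tests on the remaining 6, and every training topic contributes all of its relevant documents plus at most 30 sampled non-relevant ones.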

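Training is based on the topic frequency and doc frequency of words in the relevant and non-relevant documents. A word counts as high-frequency if, among the 24 training topics, its topic frequency exceeds 50 in the documents of at least 12 topics and its doc frequency exceeds 300. During training, a word that is high-frequency in the relevant documents but not in the non-relevant documents is more likely to be characteristic of relevant documents, so it can be selected as a query word. Requiring a large topic frequency in 12 or more topics ensures that the selected words occur fairly evenly across the relevant documents of different topics, which makes them more likely to transfer to the test topics.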
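
A minimal sketch of this selection rule is shown below. The thresholds (12 topics, topic frequency 50, doc frequency 300) come from the description above; the function names, the input layout (a dict from topic id to tokenized documents), and the toy data are assumptions made for illustration.

```python
from collections import Counter


def high_frequency_words(topic_docs, min_topics=12, min_topic_freq=50, min_doc_freq=300):
    """topic_docs: {topic_id: [token lists]}. Returns the set of high-frequency words."""
    # Topic frequency: occurrences of each word within one topic's documents.
    topic_freq = {t: Counter(w for doc in docs for w in doc) for t, docs in topic_docs.items()}
    # Doc frequency: occurrences of each word across all documents of all topics.
    doc_freq = Counter()
    for counts in topic_freq.values():
        doc_freq.update(counts)
    result = set()
    for word, total in doc_freq.items():
        n_topics = sum(counts[word] > min_topic_freq for counts in topic_freq.values())
        if n_topics >= min_topics and total > min_doc_freq:
            result.add(word)
    return result


def select_keywords(relevant_docs, nonrelevant_docs, **thresholds):
    """Words that are high-frequency in relevant documents but not in non-relevant ones."""
    return (high_frequency_words(relevant_docs, **thresholds)
            - high_frequency_words(nonrelevant_docs, **thresholds))


# Toy data with tiny thresholds so the effect is visible.
relevant = {1: [["tumors", "dose", "the"]], 2: [["tumors", "the"]]}
nonrelevant = {1: [["the", "placebo"]], 2: [["the"]]}
keywords = select_keywords(relevant, nonrelevant,
                           min_topics=2, min_topic_freq=0, min_doc_freq=1)
```

In the toy data "the" is frequent on both sides and is dropped, "dose" appears in only one topic and fails the topic-count test, and "tumors" is the only word kept.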
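Testing: in addition to the query built from the topic fields, queries are also built from the keywords obtained from the training samples. To give different queries different weights, a "boost" field is added, with higher weights on the "disease" and "gene" queries; the weights were chosen based on performance on the test set.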

    "multi_match": {
        "query" : query_word[group_id],
        "fields" : ["brief_summary", "detailed_description"],
        "boost" : 1
     }
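
One way the keyword clauses could be combined with the topic clauses is sketched below. It assumes that `query_word` is a list of keyword groups, that the keyword clauses are optional ("should") while the topic clauses stay mandatory, and that the topic boost is 3.0; none of these details are confirmed by the text, and `with_keywords` is a hypothetical name.

```python
import copy


def with_keywords(must_clauses, query_word, topic_boost=3.0, keyword_boost=1.0):
    """Boost the topic clauses and add one optional clause per keyword group."""
    must = copy.deepcopy(must_clauses)  # leave the caller's clauses untouched
    for clause in must:
        clause["multi_match"]["boost"] = topic_boost
    should = [
        {"multi_match": {
            "query": query_word[group_id],
            "fields": ["brief_summary", "detailed_description"],
            "boost": keyword_boost,
        }}
        for group_id in range(len(query_word))
    ]
    return {"bool": {"must": must, "should": should}}


must_clauses = [
    {"multi_match": {"query": "Liposarcoma", "fields": ["brief_title", "brief_summary"]}},
    {"multi_match": {"query": "CDK4", "fields": ["brief_title", "brief_summary"]}},
]
query = with_keywords(must_clauses, ["clinical criteria", "advanced tumors"])
```

With this layout a trial still has to match the disease and gene clauses, and the learned keywords only raise the score of trials that also contain them.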

Selected query words:

    train=2,3,4,5; test=1: clinical including criteria advanced and/or drug tumors dose potential history solid combination
  
    train=1,3,4,5; test=2: combination advanced and/or drug tumors clinical dose criteria ≥ blood potential history including
  
    train=1,2,4,5; test=3: clinical including criteria advanced and/or drug tumors inclusion defined dose ≥ potential inhibitor history active metastases combination mutation
  
    train=1,2,3,5; test=4: blood clinical including criteria advanced and/or drug tumors dose uln ≥ potential inhibitor history solid combination mutation
  
    train=1,2,3,4; test=5: clinical trial including criteria advanced and/or drug tumors ≥ potential inhibitor history solid combination mutation

Although the number of query words differs from round to round, most of the words are the same.
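
The overlap can be checked directly from the five lists above. The snippet below copies them verbatim and computes the words shared by every round; the variable names are illustrative.

```python
from collections import Counter

# Keyword lists copied from the five rounds above (key = test fold).
selected = {
    1: "clinical including criteria advanced and/or drug tumors dose potential "
       "history solid combination",
    2: "combination advanced and/or drug tumors clinical dose criteria ≥ blood "
       "potential history including",
    3: "clinical including criteria advanced and/or drug tumors inclusion defined "
       "dose ≥ potential inhibitor history active metastases combination mutation",
    4: "blood clinical including criteria advanced and/or drug tumors dose uln ≥ "
       "potential inhibitor history solid combination mutation",
    5: "clinical trial including criteria advanced and/or drug tumors ≥ potential "
       "inhibitor history solid combination mutation",
}

keyword_sets = {fold: set(words.split()) for fold, words in selected.items()}
# Words selected in every round.
common = set.intersection(*keyword_sets.values())
# Number of rounds in which each word was selected.
fold_counts = Counter(word for words in keyword_sets.values() for word in words)

print(sorted(common))
```

Ten words (advanced, and/or, clinical, combination, criteria, drug, history, including, potential, tumors) appear in all five rounds, while words such as "dose" and "≥" appear in four of them.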
