Update README.md
AnkitSatpute authored Feb 2, 2024
1 parent 2d535c1 commit 41f55c9
Showing 1 changed file with 66 additions and 48 deletions.
114 changes: 66 additions & 48 deletions README.md
@@ -1,24 +1,30 @@
# A Gold Standard Dataset for Recommending Scientific Documents with Mathematical Content
# Introducing the First Gold Standard Dataset for the Recommendation of Research Papers with Mathematical Content

In this repository, we include the first gold standard dataset for recommending scientific documents with mathematical content. Along with the dataset, we provide scripts used to construct the dataset and to run an example evaluation.
In this repository, we include the first gold standard dataset for recommending scientific research articles with mathematical content. Along with the dataset, we provide scripts used to construct the dataset and to run an example evaluation.

## Repository contents

## Overview of the repo

- dataset: Contains the recommendation pairs and their contents. (For the content and description of each file, please refer to the "dataset" folder)
- src: Contains the Python scripts used to select the representative seed articles, as well as the scripts needed to use the dataset for evaluating a recommendation approach. (For the content and description of each file, please refer to the "src" folder)

## Main contents of the repository

- [Dataset](#Dataset)
- [Preprocessing and Seed documents selection](#Preprocessing-and-Seed-documents-selection)
- [Example use case of dataset](#Example-use-case-of-dataset)
- [Seed documents Creation](#Seed-documents-Creation)

## Dataset

As of Feb-2023, there are 421 recommendation pairs with 80 seed documents.
As of April-2023, there are 421 recommendation pairs with 80 seed research articles.

### 1. Recommendation pairs.

All recommendation pairs are available [in this file](https://github.com/gipplab/MathRecGoldStandData/blob/main/dataset/recommendationPairs.csv), identified by their zbMATHOpen_ID. For example, the document with ID 1566951 is ["Noncommutative symmetric algebras of two-sided vector spaces."](https://zbmath.org/?q=an%3A1566951)
All recommendation pairs are available in "dataset/recommendationPairs.csv", identified by their zbMATHOpen_ID. For example, the research article with ID 1566951 is ["Noncommutative symmetric algebras of two-sided vector spaces."](https://zbmath.org/?q=an%3A1566951)
Sample recommendation pairs from the curated dataset.


| Seed | 1st | 2nd | 3rd | 4th | 5th |
| Seed | 1st recommendation | 2nd recommendation | 3rd recommendation | 4th recommendation | 5th recommendation|
|---------|---------|---------|---------|---------|---------|
| 1566951 | 4181495 | 930151 | 5083606 | 1579464 | 6338806 |
| 1363213 | 1445144 | 1036371 | 6225939 | 2165994 | 1801581 |
@@ -27,76 +33,88 @@ Sample recommendation pairs from the curated dataset.
| 1591097 | 5049067 | 3867686 | 1758339 | 2136591 | |


The first column holds the seed documents, and the subsequent columns hold their ranked recommendations, ordered by decreasing relevance.
The first column holds the seed research articles, and the subsequent columns hold their ranked recommendations, ordered by decreasing relevance. To look up a research article manually in zbMATH Open, search at https://zbmath.org/ with the prefix `an:` (for example, `an:1566951`), or request the URL directly with the `an:` prefix in the search query, for example https://zbmath.org/?q=an%3A1566951, where `%3A` is the URL encoding of the colon.
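As a minimal sketch of working with the pairs, the snippet below parses rows shaped like the sample table and builds the zbMATH Open lookup URL for an ID. The column layout (seed first, then ranked recommendations, empty cells for missing ranks) follows the sample table; the CSV header names and the helper names `load_pairs` and `zbmath_url` are illustrative assumptions, not part of the repository.

```python
import csv
import io
from urllib.parse import urlencode

# Sample rows copied from the table in this README. The header names are
# assumed for illustration; the actual CSV header may differ.
SAMPLE_CSV = """Seed,1st,2nd,3rd,4th,5th
1566951,4181495,930151,5083606,1579464,6338806
1591097,5049067,3867686,1758339,2136591,
"""


def load_pairs(handle):
    """Map each seed ID to its recommendations, most relevant first."""
    reader = csv.reader(handle)
    next(reader)  # skip the (assumed) header row
    pairs = {}
    for row in reader:
        if not row:
            continue
        seed, *recommendations = (cell.strip() for cell in row)
        # Seeds with fewer than five recommendations have empty trailing cells.
        pairs[seed] = [rec for rec in recommendations if rec]
    return pairs


def zbmath_url(zbmath_id):
    """Build the zbMATH Open search URL for an ID using the 'an:' prefix."""
    return "https://zbmath.org/?" + urlencode({"q": f"an:{zbmath_id}"})


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
print(pairs["1591097"])
print(zbmath_url("1566951"))
```

To read the real file instead of the inline sample, the same function could be called on `open("dataset/recommendationPairs.csv", newline="")`, provided the file has a header row as assumed here.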


### 2. Document contents.

Each document's contents, such as title, abstract/review/summary, authors, MSC codes, full-text link, references, etc., are available in [the separate file](https://github.com/gipplab/MathRecGoldStandData/blob/main/dataset/documentContents.csv).
Each research article's contents, such as title, abstract/review/summary, authors, MSC codes, full-text link, references, etc., are available in dataset/documentContents.csv.

Example document from the file:
Example research article from the file:

| zbMATH_ID | Title | Abstract/Review/Summarry | Authors | Keywords | MSCs | Full text link | References |
|-----------|------------------------|----------------------------------------------------------------|----------------|-------------------------------------------|------------------|-----------------------------------------------|----------------------------------|
| 10342 | Maximal contact ...... | The author proves the following theorem: Fix an infinite...... | Cossart V..... | Samuel stratum and desingularization..... | [{code: 14E15... | https://doi.org/10.1215/S0012-7094-91-06303-9 | S. Abhyankar: Resolution of..... |


Additionally, the contents of any document in zbMATH Open can be fetched via the [zbMATH Open API](https://oai.zbmath.org/) or downloaded from the [repository](https://zenodo.org/record/6448360#.Y_UmrHbP02w).
Additionally, the contents of any research article in zbMATH Open can be fetched via the [zbMATH Open API](https://oai.zbmath.org/) or downloaded directly from the official [Zenodo repository](https://zenodo.org/record/6448360#.Y_UmrHbP02w).
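For fetching a single record through the API, a request URL might be built as sketched below. `GetRecord`, `identifier`, and `metadataPrefix` are standard OAI-PMH request parameters. The versioned endpoint path, the `oai:zbmath.org:<ID>` identifier scheme, and the metadata prefix are assumptions that this README does not state, so please verify them against the API documentation before use.

```python
from urllib.parse import urlencode

# Assumptions (not stated in this README): the versioned endpoint path, the
# "oai:zbmath.org:<ID>" identifier scheme, and the metadata prefix.
OAI_ENDPOINT = "https://oai.zbmath.org/v1/"
METADATA_PREFIX = "oai_zb_preview"


def get_record_url(zbmath_id):
    """Build an OAI-PMH GetRecord request URL for one zbMATH Open document."""
    params = {
        "verb": "GetRecord",  # standard OAI-PMH verb
        "identifier": f"oai:zbmath.org:{zbmath_id}",
        "metadataPrefix": METADATA_PREFIX,
    }
    return OAI_ENDPOINT + "?" + urlencode(params)


print(get_record_url(1566951))
```

The resulting URL can then be requested with any HTTP client, for example `urllib.request.urlopen`; OAI-PMH responses are XML.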

## Preprocessing and Seed documents selection
## Example use case of dataset

Please install dependencies from [requirements.txt](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/requirements.txt) before running any scripts.
We demonstrate an example evaluation of recommendation approaches with our dataset. To generate recommendations, we use the zbMATH Open collection, which contains 4.5 million documents. We rank recommendations in two ways. First, we use the BM25 algorithm (a modified TF-IDF scheme) with cosine similarity, as provided by the default search capability of [Elasticsearch (ES)](https://www.elastic.co/). Second, we utilize language models to generate embeddings and then use cosine similarity to retrieve relevant recommendations.
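The embedding-based ranking can be illustrated with a small sketch. It assumes that embeddings are already available as plain lists of floats; the toy vectors, the document names, and the helper `rank_by_similarity` are hypothetical and only show the ranking step, not the language models used in `src/baseline/langModelEval.py`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)


def rank_by_similarity(seed_vector, candidates, top_k=5):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(seed_vector, vector))
        for doc_id, vector in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-dimensional embeddings; real models use far more dimensions.
seed = [1.0, 0.0, 1.0]
candidates = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [1.0, 1.0, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity(seed, candidates)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

With 4.5 million documents, a brute-force loop like this would be slow; it is meant only to make the scoring explicit.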

### Preprocessing
1. Elasticsearch Versions used
1. Elasticsearch: [7.9.3](https://www.elastic.co/jp/downloads/past-releases/elasticsearch-7-9-3)
2. Kibana (only for testing purposes, not needed to run the evaluation): [7.9.3](https://www.elastic.co/downloads/past-releases/kibana-7-9-3)

The following table lists the scripts and the functions/steps involved in preprocessing.
The following table lists the scripts and their corresponding functionality for performing the example evaluation. Our scripts include experiments that can run on a local system. However, we expect at least 20 GB of free space to be available for Elasticsearch.

| No. | Functionality/Step | Script |
|-----|-----------------------------------------------------|---------------------|
| 1 | Load all zbMATH Open documents as local .txt files | [getAlldocs.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/preProcessing/getAlldocs.py) |
| 2 | Remove short/irrelevant documents | [remvShrtdocs.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/preProcessing/remvShrtdocs.py) |
| 3 | Extract TOIs and remove Non-English documents | [extractTOIs.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/preProcessing/extractTOIs.py) |
| 4 | Convert LaTeX to MathML and extract MOIs | [extractMOIs.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/preProcessing/extractMOIs.py) |
| 5 | Discipline-wise documents | [docsPerMSC.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/preProcessing/docsPerMSC.py) |

| No. | Functionality/Step | Script |
|-----|------------------------------------------------|-----------------------|
| 1 | Load zbMATH Open documents on ES | src/exampleEvaluation/loadDOcsonES.py |
| 2 | Indexing configuration (text and text + math) | src/exampleEvaluation/collectionsRef.py |
| 3 | Generate recommendations | src/exampleEvaluation/genRecms.py |
| 4 | Evaluate recommendation | src/exampleEvaluation/evalRecms.py |
| 5 | Generate embeddings and calculate cosine similarity to generate recommendations | src/baseline/langModelEval.py |

### Seed documents selection
The table above does not list every script. Please refer to the "main/src/" folder for more details.
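To give an idea of what evaluating recommendations against the gold standard can look like, here is a small sketch using two common ranking metrics, precision at k and reciprocal rank. This README does not state which metrics `evalRecms.py` computes, so the metric choice, the function names, and the predicted list are assumptions for illustration only. The gold IDs are the recommendations for seed 1566951 from the sample table.

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the first k predictions that appear in the gold set."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in predicted[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(predicted, relevant):
    """1 / rank of the first relevant prediction, or 0.0 if none is found."""
    for rank, doc_id in enumerate(predicted, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Gold recommendations for seed 1566951 (from the sample table).
gold = {"4181495", "930151", "5083606", "1579464", "6338806"}
# Hypothetical output of some recommender for the same seed.
predicted = ["1445144", "930151", "4181495", "1036371", "6338806"]

print("P@5:", precision_at_k(predicted, gold, k=5))
print("RR:", reciprocal_rank(predicted, gold))
```

Because the gold recommendations are ranked, a rank-aware metric such as nDCG could also be a reasonable choice; the sketch above ignores the gold ordering.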

The selection of representative seed documents follows a four-step procedure. Each step and its corresponding script are listed in the following table.
### Example recommendation generation

| Step No. | Name | Script |
|----------|-----------------------------------|----------------------|
| 1 | Mathematical discipline selection | [reprMSCsel.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/seedDocSelection/reprMSCsel.py) |
| 2 | Working dataset creation | [workingDset.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/seedDocSelection/workingDset.py) |
| 3 | Capture probability calculation | [captureProb.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/seedDocSelection/captureProb.py) |
| 4 | Final seeds selection | [finalSeedsSel.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/seedDocSelection/finalSeedsSel.py) |
#### Make sure the Elasticsearch cluster is running

`python src/exampleEvaluation/genRecms.py`
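One way to check that the cluster is reachable before running the script is to query Elasticsearch's cluster health endpoint. The sketch below assumes a local node on the default address `http://localhost:9200` without authentication; the helper names are illustrative and not part of the repository.

```python
import json
import urllib.error
import urllib.request


def cluster_is_ready(health):
    """True when the reported status is green or yellow.

    A single-node setup often reports yellow because replica shards
    cannot be assigned, which is usually fine for local experiments.
    """
    return health.get("status") in {"green", "yellow"}


def fetch_cluster_health(base_url="http://localhost:9200", timeout=5):
    """Query the cluster health endpoint; return None if it is unreachable."""
    try:
        url = f"{base_url}/_cluster/health"
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        return None


if __name__ == "__main__":
    health = fetch_cluster_health()
    if health is None:
        print("Elasticsearch is not reachable")
    elif cluster_is_ready(health):
        print("Elasticsearch is ready")
    else:
        print("Elasticsearch status:", health.get("status"))
```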

## Example use case of dataset

We demonstrate an example evaluation of recommendation approaches with our dataset.
For generating recommendations, we consider two document collections, i.e., zbMATH Open and Algebraic Geometry.
These collections contain 4.5 million documents and 86 thousand documents, respectively.
To rank recommendations, we use the BM25 algorithm (a modified TF-IDF scheme) with cosine similarity provided by the default search capability of [Elasticsearch(ES)](https://www.elastic.co/).
We use two document elements for comparison, text and math expressions.
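For intuition about BM25 ranking, the following is a minimal sketch of the textbook scoring formula with a non-negative IDF variant. Whitespace tokenization and the toy corpus are simplifying assumptions, and `k1=1.2`, `b=0.75` are the Elasticsearch defaults. The repository relies on Elasticsearch itself, whose analyzers and exact scoring differ from this sketch.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with the BM25 formula."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer documents are penalized via b.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


# Toy corpus for illustration only.
docs = [
    "noncommutative symmetric algebras of vector spaces",
    "symmetric functions and representation theory",
    "resolution of singularities in algebraic geometry",
]
scores = bm25_scores("symmetric algebras", docs)
for doc, score in zip(docs, scores):
    print(round(score, 3), doc)
```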
## Seed documents Creation

1. Versions used
1. Elasticsearch: [7.9.3](https://www.elastic.co/jp/downloads/past-releases/elasticsearch-7-9-3)
2. Kibana (only for testing purposes): [7.9.3](https://www.elastic.co/downloads/past-releases/kibana-7-9-3)
Here we list the scripts used to obtain the representative seeds from zbMATH Open. Note: These steps are not required if you only want to use the dataset. Please refer to the "Example use case of dataset" section for information on using the dataset for evaluation.

The following table lists the scripts and their corresponding functionality for performing the example evaluation. Please adjust the Elasticsearch configuration to your infrastructure. Our scripts include experiments that can run on a local system. However, we expect at least 40 GB of free space to be available for Elasticsearch and Kibana.
Please install dependencies from src/requirements.txt before running any scripts.

### Virtual environment
`python3 -m venv recseedsel` (More on creating [virtual environment](https://docs.python.org/3/library/venv.html))
`source recseedsel/bin/activate` ([activate](https://docs.python.org/3/tutorial/venv.html#:~:text=Once%20you%E2%80%99ve%20created%20a%20virtual%20environment%2C%20you%20may%20activate%20it.) virtual environment)

| No. | Functionality/Step | Script |
|-----|------------------------------------------------|-----------------------|
| 1 | Load zbMATH Open documents on ES | [loadDOcsonES.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/exampleEvaluation/loadDOcsonES.py) |
| 2 | Indexing configuration (text and text + math) | [collectionsRef.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/exampleEvaluation/collectionsRef.py) |
| 3 | Generate recommendations | [genRecms.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/exampleEvaluation/genRecms.py) |
| 4 | Evaluate recommendation | [evalRecms.py](https://github.com/gipplab/MathRecGoldStandData/blob/main/src/exampleEvaluation/evalRecms.py) |
### Install dependencies
`pip install -r src/requirements.txt`

### 1. Preprocessing

The following table lists the scripts and the functions/steps involved in preprocessing. The preprocessing steps are only required to compute the representative seed research articles.

| No. | Functionality/Step | Script |
|-----|-----------------------------------------------------|---------------------|
| 1 | Load all zbMATH Open documents as local .txt files | src/preProcessing/getAlldocs.py |
| 2 | Remove short/irrelevant documents | src/preProcessing/remvShrtdocs.py |
| 3 | Extract TOIs and remove Non-English documents | src/preProcessing/extractTOIs.py |
| 4 | Convert LaTeX to MathML and extract MOIs | src/preProcessing/extractMOIs.py |
| 5 | Discipline-wise documents | src/preProcessing/docsPerMSC.py |

The table above does not list every script. Please refer to the [seed documents selection](https://github.com/gipplab/MathRecGoldStandData/tree/main/src/) folder for more details.

### 2. Seed documents selection

The selection of representative seed documents follows a four-step procedure. Each step and its corresponding script are listed in the following table.

| Step No. | Name | Script |
|----------|-----------------------------------|----------------------|
| 1 | Mathematical discipline selection | src/seedDocSelection/reprMSCsel.py |
| 2 | Working dataset creation | src/seedDocSelection/workingDset.py |
| 3 | Capture probability calculation | src/seedDocSelection/captureProb.py |
| 4 | Final seeds selection | src/seedDocSelection/finalSeedsSel.py |


## License

Legal restrictions and copyright: The zbMATH Open data is subject to the Terms and Conditions for the zbMATH Open API Service of FIZ Karlsruhe – Leibniz-Institut für Informationsinfrastruktur GmbH. Content generated by zbMATH Open, such as reviews, classifications, software, or author disambiguation data, is distributed under CC-BY-SA 4.0. This defines the license for the whole dataset, which also contains non-copyrighted bibliographic metadata and reference data derived from I4OSC (CC0).
