This repository contains code for the paper Finding Paths for Explainable MOOC Recommendation: A Learner Perspective.
- Datasets
- Installation
- How to run UPGPR on Xuetang
- How to run UPGPR on COCO
- Additional information about UPGPR config files
- How to run the baselines
- Using a custom dataset
- Citation
Datasets
Download Xuetang from http://moocdata.cn/data/MOOCCube, extract the file and place the MOOCCube folder in data/mooc/
We assume that you will have at least the following two folders:
- data/mooc/MOOCCube/entities/
- data/mooc/MOOCCube/relations/
Get the coco dataset by contacting the authors of COCO: Semantic-Enriched Collection of Online Courses at Scale with Experimental Use Cases by email. Extract the file and place it in data/coco/
You should get one folder:
- data/coco/coco/
Note: because you might get a more recent version of the dataset, some of its characteristics (number of learners, courses, etc.) might be different.
Installation
Python 3.10 is required.
We recommend using a conda environment, but feel free to use whatever you are most comfortable with:
conda create -n upgpr python=3.10
conda activate upgpr
pip install -r requirements.txt
If you intend to run the skill extractor on the coco dataset, you will need to download en_core_web_lg:
python -m spacy download en_core_web_lg
UPGPR on Xuetang
python src/UPGPR/preprocess_mooc.py
After this process, all the files from MOOCCube have been standardized into the format needed by PGPR. The files are saved in the folder data/mooc/MOOCCube/processed_files.
We used the same file format as in the original PGPR repository: https://github.com/orcax/PGPR.
python src/UPGPR/make_dataset.py --config config/UPGPR/mooc.json
After this process, the files containing the train, validation and test sets and the Knowledge Graph have been created in tmp/mooc.
python src/UPGPR/train_transe_model.py --config config/UPGPR/mooc.json
The KG embeddings are saved in tmp/mooc.
python src/UPGPR/train_agent.py --config config/UPGPR/mooc.json
The agent is saved in tmp/mooc.
python src/UPGPR/test_agent.py --config config/UPGPR/mooc.json
The results are saved in tmp/mooc.
UPGPR on COCO
python src/UPGPR/extract_skills.py
After this process, the files course_skill.csv and learner_skill.csv have been created in data/coco/coco.
python src/UPGPR/preprocess_coco.py
After this process, all the files from coco have been standardized into the format needed by PGPR. The files are saved in the folder data/coco/coco/processed_files.
We used the same file format as in the original PGPR repository: https://github.com/orcax/PGPR.
python src/UPGPR/make_dataset.py --config config/UPGPR/coco.json
After this process, the files containing the train, validation and test sets and the Knowledge Graph have been created in tmp/coco.
python src/UPGPR/train_transe_model.py --config config/UPGPR/coco.json
The KG embeddings are saved in tmp/coco.
python src/UPGPR/train_agent.py --config config/UPGPR/coco.json
The agent is saved in tmp/coco.
python src/UPGPR/test_agent.py --config config/UPGPR/coco.json
The results are saved in tmp/coco.
Config files
To run the original PGPR, change the config files in config/UPGPR as follows:
- Set the "reward" attribute in "TRAIN_AGENT" and "TEST_AGENT" to "cosine".
- Set the "use_pattern" attribute in "TRAIN_AGENT" and "TEST_AGENT" to "true".
- Set the "max_path_len" attribute in "TRAIN_AGENT" and "TEST_AGENT" to 3.
To run UPGPR, change the config files in config/UPGPR as follows:
- Set the "reward" attribute in "TRAIN_AGENT" and "TEST_AGENT" to "binary_train".
- Set the "use_pattern" attribute in "TRAIN_AGENT" and "TEST_AGENT" to "false".
- Set the "max_path_len" attribute in "TRAIN_AGENT" and "TEST_AGENT" to an integer > 2
- If "max_path_len" is set to a value other than 3, change the "topk" attribute in "TEST_AGENT" to a list whose length equals "max_path_len".
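For illustration, a minimal sketch of what the UPGPR settings above could look like inside a config file in config/UPGPR. Only the attributes discussed here are shown; the "topk" values are placeholders (any list whose length equals "max_path_len" fits the rule above), and whether the booleans are JSON literals or strings depends on the actual config files:

```json
{
  "TRAIN_AGENT": {
    "reward": "binary_train",
    "use_pattern": false,
    "max_path_len": 3
  },
  "TEST_AGENT": {
    "reward": "binary_train",
    "use_pattern": false,
    "max_path_len": 3,
    "topk": [25, 5, 1]
  }
}
```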
Baselines
Process the Xuetang files for RecBole (requires data/mooc/MOOCCube/processed_files)
python src/baselines/format_moocube.py
After this process, all the files from MOOCCube have been standardized into the format needed by RecBole. The files are saved in the folder data/mooc/recbolemoocube.
We follow the same process for coco:
python src/baselines/format_coco.py
The files are saved in the folder data/coco/recbolecoco.
To run the baselines, choose a config file in config/baselines and run the following:
python src/baselines/baseline.py --config config/baselines/coco_Pop.yaml
This example runs the Pop baseline on the coco dataset.
You can ignore the warning "command line args [--config config/baselines/coco_Pop.yaml] will not be used in RecBole". The argument is used properly.
Custom
In the folder example, we provide a minimalistic example of a synthetic dataset to help understand the format of the files required by UPGPR. This dataset only illustrates the file format and is too small to be used to test the code.
Below, you will find a detailed description of the files:
- Enrolments file. You must have a file named "enrolments.txt" containing the enrolments of each student. The structure is the following: each line contains one enrolment, with the student ID and the course ID separated by a space. IDs must be integers. An example is provided here: enrolments.txt
- Entities files. For each entity type in your knowledge graph (student, course, teacher, school, etc.) you must have a file named "entity_name.txt" where each line contains the name of the entity whose ID is the line number minus 1. For example, in the file courses.txt, line 1 contains the course "Math", meaning that its ID is 0.
- Relations files. For each relation in your knowledge graph (course_teacher, course_school, teacher_school, etc.) you must have a file named "sourceentity_targetentity.txt" where each line corresponds to a source entity ID and contains all the target entity IDs related to that source entity. For example, in the file course_teachers.txt, line 3 contains "2 3", meaning that Charlie and Dave are teaching History.
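To make these conventions concrete, here is a small self-contained Python sketch using the example entities from this section. The names "Biology", "Alice", "Bob" and the enrolment pairs are made up for illustration, and the actual loading code in UPGPR may differ:

```python
# Sketch of the custom-dataset file conventions. File contents are written
# inline as strings; in practice they would be read from the files on disk.

# enrolments.txt: one "student_id course_id" pair per line (integer IDs).
enrolments_txt = "0 0\n0 2\n1 2\n"

# courses.txt: line number - 1 gives the course ID ("Math" on line 1 -> ID 0).
# "Biology" is a made-up filler so that "History" lands on line 3 (ID 2).
courses_txt = "Math\nBiology\nHistory\n"

# teachers.txt: same convention; "Alice" and "Bob" are made-up fillers
# so that Charlie has ID 2 and Dave has ID 3.
teachers_txt = "Alice\nBob\nCharlie\nDave\n"

# course_teachers.txt: line i holds the teacher IDs related to course i - 1.
course_teachers_txt = "0\n1\n2 3\n"

def parse_entities(text):
    # Map each entity ID (line number - 1) to the name on that line.
    return {i: name for i, name in enumerate(text.splitlines())}

def parse_relation(text):
    # Map each source entity ID (line number - 1) to its target entity IDs.
    return {i: [int(t) for t in line.split()]
            for i, line in enumerate(text.splitlines())}

enrolments = [tuple(map(int, line.split()))
              for line in enrolments_txt.splitlines()]
courses = parse_entities(courses_txt)
teachers = parse_entities(teachers_txt)
course_teachers = parse_relation(course_teachers_txt)

# Line 3 of course_teachers.txt is "2 3": Charlie and Dave teach History.
print(courses[2], "->", [teachers[t] for t in course_teachers[2]])
# -> History -> ['Charlie', 'Dave']
```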
You also need to adapt the config file to your custom dataset: modify the content of "KG_ARGS" in the config file to specify the entities, the relations, and the names of the files that contain these relations. You can have a look at the file example.json for an example of the content of "KG_ARGS" for our example dataset.
Citation
@article{frej2023finding,
title={Finding Paths for Explainable MOOC Recommendation: A Learner Perspective},
author={Frej, Jibril and Shah, Neel and Kne{\v{z}}evi{\'c}, Marta and Nazaretsky, Tanya and K{\"a}ser, Tanja},
journal={arXiv preprint arXiv:2312.10082},
year={2023}
}