Bibliometric analysis of recommender systems

This is the code used in the paper 'A survey on Neural Recommender Systems: insights from a bibliographic analysis' (link will be provided when available). To run the code, install the libraries listed in requirements.txt in your virtual environment:

pip install -r requirements.txt

The results will be saved in the data directory. In the same directory you will also find the data that we used to perform the analysis.

How to run

To run the analysis, run main.py in your environment. For instance, from the directory that contains the script:

python main.py

Or if you are using pipenv:

pipenv run python main.py

You can also specify two arguments:

-skip_data_viz: by default, the code produces plots of the data and saves them in the 'plots' directory. If this flag is specified, that step is skipped. The flag takes no value.
-clustering_model: the default value is 'lda'. You can also choose 'keybert' or 'dbscan' to obtain the results from these models instead.
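
As a rough illustration, the two arguments described above could be declared with argparse as in the sketch below. This is an assumption about how such a command line might be defined, not the actual contents of main.py; the function name build_parser is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two documented arguments."""
    parser = argparse.ArgumentParser(description="Bibliometric analysis (sketch)")
    # Flag without a value: when present, the plotting step is skipped.
    parser.add_argument("-skip_data_viz", action="store_true",
                        help="do not produce the plots")
    # Model selection, restricted to the three documented values.
    parser.add_argument("-clustering_model", default="lda",
                        choices=["lda", "keybert", "dbscan"],
                        help="model used to cluster the papers")
    return parser
```

With a parser like this, calling build_parser().parse_args([]) leaves skip_data_viz as False and clustering_model as 'lda', which matches the defaults described above.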

Example

pipenv run python main.py -skip_data_viz -clustering_model keybert

In this case you skip the plots and use KeyBERT to get the keywords for each cluster.

pipenv run python main.py -clustering_model dbscan

In this case you produce the plots, which are saved in the 'plots' directory, and use DBSCAN to cluster the papers.

pipenv run python main.py

And if you run the code like this, you produce the plots and use Latent Dirichlet Allocation (LDA) to cluster the papers.

Some of our results

We scraped all the open-access papers on recommender systems published in the last 5 years. The number of publications per year is shown in the figure below.

The number of publications per year grows steadily. We wanted to understand which main topics drove this growth. To do this, we clustered the 1000 most recently published papers with the LDA algorithm. The results are shown in the image below.

In this way we were able to find some topics (clusters) of interest that we summarized in the following table:

Topic	Label	Top-k words
0	Probabilistic approaches	condition, determine, variance, minimum, estimate, bound, error, standard, linear, parameter
1	Graph Neural Networks	capture, feature, batch, attention, dimension, representation, prediction, vector, matrix, layer
2	Computer Vision and Data Visualization	classification, label, machine, description, content, identify, accuracy, domain, extract, language
3	N.A.	effect, application, event, documentation, technology, control, communication, access, management, development, environment, support
4	AI Fairness	reward, reinforce, question, agent, person, prefer, experience, world, policy, feedback, decision

The topics were identified on the basis of the most frequent words in each cluster. We also checked the papers to verify that the words were a good indication and that a single topic described all the papers in the cluster. The only topic that we did not feel confident enough to label was the fourth one (topic 3), whose papers did not even seem to be about recommender systems, but rather about recommendations derived from a different type of analysis. If you have any idea about it, let us know by opening an issue or by contacting us!
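
The idea of reading a topic off the most frequent words in each cluster can be sketched with the standard library alone. The function name top_k_words and the input layout (a mapping from cluster id to tokenized documents) are assumptions made for this example, and the toy data is invented; the repository's own code may compute the top words differently, for instance from the LDA topic-word weights.

```python
from collections import Counter


def top_k_words(docs_by_cluster, k=10):
    """Return the k most frequent words for each cluster.

    docs_by_cluster maps a cluster id to a list of already tokenized
    documents (lists of lowercase words). This layout is an assumption.
    """
    top_words = {}
    for cluster_id, docs in docs_by_cluster.items():
        # Count every word occurrence across all documents of the cluster.
        counts = Counter(word for doc in docs for word in doc)
        top_words[cluster_id] = [word for word, _ in counts.most_common(k)]
    return top_words


# Toy data, invented for illustration only.
toy = {
    0: [["error", "bound", "error"], ["bound", "error", "variance"]],
    1: [["layer", "attention", "layer"], ["layer", "matrix"]],
}
print(top_k_words(toy, k=2))
```

A label would then be chosen by a person after reading the word lists and checking a sample of papers, as described above.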

About the scraper

We got the .pdf files gathered in the articles_pdf directory from arXiv through the scraper that you can find here. It is not possible, though, to get all the .pdf files from arXiv, for two reasons:

arXiv interrupts the connection after you have been scraping for a while. You would need to change your IP dynamically in order to keep scraping;
Not all papers have a .pdf file on arXiv, although this could perhaps be solved by scraping from ar5iv.

While there could be solutions to these two problems, we think it would be more useful to create a new library that finds the .pdf file of a paper online, if one exists. We may work on this in the future.
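
One partial mitigation for dropped connections, different from the dynamic IP change mentioned above, is to wait and retry. The sketch below shows a generic retry loop with exponential backoff. It is not part of the scraper in this repository, it does not address rate limits tied to an IP address, and every name in it (fetch_with_retries, fetch, base_delay) is hypothetical.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on connection errors.

    fetch is any callable that returns the file content or raises
    ConnectionError. The sleep function is injectable to ease testing.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** attempt)
```

Passing the download function as an argument keeps the sketch independent of any particular HTTP library.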

About the data

In the data directory you will find two directories and a .csv file with the data needed to run the scripts and perform the analysis.

To convert the .pdf files to .txt files, we slightly modified the repository from 'Extracting Body Text from Academic PDF Documents for Text Mining' (Yu et al., 2020). Our changes only made the code usable in our setup; the core application is the one that you can find in that repository. We may share our changes in the future so that it is easier to use.
