OSF Crawler Logo



OSF Crawler

This repository contains a crawler for the Open Science Framework website.

Features

This crawler:

  • automatically downloads information about registered research projects or preprints from the Open Science Framework website, either by crawling the website or by querying the official API, and stores it in a MongoDB database.
  • cleans the downloaded text with the natural language processing library spaCy (removing stop words and lemmatizing the remaining words), then applies the LDA algorithm of the topic modelling framework gensim to determine which topics the research covers.
  • outputs the most frequent tags, subjects, and words used in the titles and descriptions as an Excel file, along with the topics found by gensim and the corresponding coherence score of the LDA model.
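As a rough illustration of the frequency step described above, the following sketch uses only the Python standard library. It is not the crawler's actual code. The names `STOP_WORDS`, `SAMPLE_PAGE`, `extract_records`, `tokenize` and `most_frequent` are hypothetical, the sample payload is hand-written and merely assumes a JSON:API-style layout (`data` items carrying `attributes`), and a tiny stop-word set stands in for spaCy, so no lemmatization takes place. MongoDB storage, gensim topic modelling and the Excel export are left out.

```python
import re
from collections import Counter

# Hypothetical stand-in for spaCy's stop-word handling (no lemmatization here).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is"}

# Hand-written sample in an assumed JSON:API-style layout; not real OSF data.
SAMPLE_PAGE = {
    "data": [
        {
            "id": "abc12",
            "attributes": {
                "title": "The Effect of Sleep on Memory",
                "description": "A study of sleep and recall.",
                "tags": ["sleep", "memory"],
            },
        },
        {
            "id": "def34",
            "attributes": {
                "title": "Memory in Aging",
                "description": "Recall in older adults.",
                "tags": ["memory", "aging"],
            },
        },
    ],
    "links": {"next": None},
}


def extract_records(page):
    """Flatten one response page into plain dictionaries."""
    records = []
    for item in page.get("data", []):
        attrs = item.get("attributes", {})
        records.append(
            {
                "id": item.get("id"),
                "title": attrs.get("title") or "",
                "description": attrs.get("description") or "",
                "tags": list(attrs.get("tags") or []),
            }
        )
    return records


def tokenize(text):
    """Lower-case the text, keep alphabetic words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def most_frequent(records, top_n=3):
    """Count tags and title/description words across all records."""
    tag_counts = Counter(tag for r in records for tag in r["tags"])
    word_counts = Counter(
        word
        for r in records
        for word in tokenize(r["title"] + " " + r["description"])
    )
    return tag_counts.most_common(top_n), word_counts.most_common(top_n)


records = extract_records(SAMPLE_PAGE)
top_tags, top_words = most_frequent(records)
print(top_tags)
print(top_words)
```

In the real pipeline, the token lists produced by the cleanup step would presumably also feed the gensim LDA model, and the counts would be written to the Excel file with OpenPyXL rather than printed.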

Tools

| Purpose                    | Name          |
|----------------------------|---------------|
| Programming language       | Python 3.10   |
| Version control system     | Git           |
| HTML parser                | BeautifulSoup |
| Browser automation library | Pyppeteer     |
| NLP library                | spaCy         |
| Output generator           | OpenPyXL      |
| Asynchronous framework     | asyncio       |
| Topic modelling framework  | gensim        |
| NoSQL database             | MongoDB       |

Licence

The OSF Crawler is published under the MIT licence, which can be found in the LICENSE file.

References

The "Open Science Framework" logo was taken from the University of Oklahoma Libraries website.