academic-publishers

Pre-Print

This README gives only a brief overview. For more detail, see the pre-print on SocArXiv.

Purpose

The purpose of this project is to generate a list of major academic publishers and their scholarly journals through web scraping.

For the results, see the file Output\top-100-publishers.xlsx (last updated in July 2022).

Key Documents

There are three key documents for adding/scraping publishers:

  • Data\04_publishers.xlsx: the (adaptable) list of publishers to be scraped, including the URL and the relevant CSS selectors (for many of the publishers added in July 2022, the journals were counted manually, although the most relevant CSS selector for the journal names or links is still recorded);
  • Script\Function\function_getjournals.R: the scraping function;
  • Script\Analysis_06_Extract-Journals.R: the script that runs the scraping function.

Data Sources

Publishers

The list of publishers was compiled from the following four sources:

  • DOAJ (using the data dump in Dec. 2020)
  • Publons (using web scraping on 11 Dec 2020)
  • Scopus (using the csv-formatted source list from Oct. 2020)
  • Sherpa Romeo (using web scraping on 11 Dec 2020)

Journals

The list of journals was scraped from each publisher's website, using the URLs listed in Data\04_publishers.csv.

Methodological Approach

Publishers

Publisher data are extracted in files 01 to 04 in the Script folder, mainly using R's rvest package.

Each of the four data sources assigns its own journal count to a publisher, so each publisher has up to four, often differing, journal counts. The script takes the highest of these counts for each publisher and then orders the list by that value. This is done in file 05 in the Script folder.
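The "highest count per publisher, then sort" step can be sketched in base R as follows. This is a minimal illustration, not the code in file 05: the publisher names, the counts, and the column names (doaj, publons, scopus, romeo) are all made up for the example.

```r
# Hypothetical journal counts per publisher from the four sources.
# NA means the source does not list that publisher.
counts <- data.frame(
  publisher = c("Publisher A", "Publisher B", "Publisher C"),
  doaj      = c(15, 120, NA),
  publons   = c(NA, 300, 45),
  scopus    = c(10, 280, 60),
  romeo     = c(12, NA, 52),
  stringsAsFactors = FALSE
)

source_cols <- c("doaj", "publons", "scopus", "romeo")

# Row-wise maximum across the four sources, ignoring missing values.
counts$max_count <- do.call(pmax, c(counts[source_cols], na.rm = TRUE))

# Order publishers by their highest journal count, largest first.
ranked <- counts[order(counts$max_count, decreasing = TRUE), ]
print(ranked[, c("publisher", "max_count")])
```

Using pmax with na.rm = TRUE means a publisher that is missing from one or more sources is still ranked by the counts that do exist.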

In a further step, the script harmonizes duplicate publisher names (based on the data in Data\03_publishers_harmonization.txt).

The remaining steps were done manually, e.g. locating the links to the journal catalogues and collecting the relevant CSS selectors for each publisher (in Data\04_publishers.xlsx).
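As a rough idea of what name harmonization can look like, the sketch below maps variant spellings to one harmonized name with a lookup vector. The layout of Data\03_publishers_harmonization.txt is not described in this README, so the lookup table and all publisher names here are assumptions for illustration only.

```r
# Hypothetical lookup: variant spelling -> harmonized name.
harmonization <- c(
  "Example Press Ltd."  = "Example Press",
  "Example Press, Inc." = "Example Press"
)

publishers <- c("Example Press Ltd.", "Another House", "Example Press, Inc.")

# Look up each name; names without an entry come back as NA.
matched <- harmonization[publishers]

# Replace known variants and leave all other names unchanged.
harmonized <- unname(ifelse(is.na(matched), publishers, matched))

print(unique(harmonized))
```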

Journals

Finally, the publishers' websites are accessed via a uniform web scraping function (but with differing CSS selectors) so as to extract all of the publishers' journal names, including the URL to each journal. This is done in file 06 in the Script folder.

The CSS selectors for each publisher are saved in Data\04_publishers.xlsx.
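The idea of one scraping function driven by per-publisher CSS selectors can be sketched with rvest as shown below. This is not the repository's function_getjournals.R: the helper name get_journals_sketch, the inline HTML, and the selector "ul.journals a" are hypothetical and only serve to show the pattern.

```r
library(rvest)

# Hypothetical helper: extract journal names and links from a parsed page,
# given the CSS selector that matches the journal link elements.
get_journals_sketch <- function(page, css) {
  nodes <- html_elements(page, css)
  data.frame(
    journal = html_text2(nodes),
    url     = html_attr(nodes, "href"),
    stringsAsFactors = FALSE
  )
}

# Inline HTML stands in for a publisher's journal catalogue page.
page <- read_html('<html><body><ul class="journals">
  <li><a href="/journal-one">Journal One</a></li>
  <li><a href="/journal-two">Journal Two</a></li>
</ul></body></html>')

journals <- get_journals_sketch(page, "ul.journals a")
print(journals)
```

In practice, the page would presumably come from read_html() called on a publisher's catalogue URL, with the selector taken from the matching row of the publisher list.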

Results

Publishers

The full list of publishers and their journal counts is available in Output\top-100-publishers.xlsx.

Journals

The journal list is available in Output\Journals\alljournals-2022-03-02.csv. Note, however, that the list is incomplete: many publishers have not (yet) been scraped, and for these only the number of journals was counted based on CSS selectors.