Publicly available scholarly articles collection for NLP/IR applications

ProjectDossier/scholarly_articles_collections

Scholarly Articles Collections

Publicly available scholarly articles collections for NLP/IR applications

| Name and Link | Description | Contents | Notes | Data |
| --- | --- | --- | --- | --- |
| arXiv Dataset | A dataset of 1.7 million arXiv articles. | Only a metadata file in JSON format, with one entry per paper containing:<ul><li>id: arXiv ID (can be used to access the paper)</li><li>submitter: who submitted the paper</li><li>authors: authors of the paper</li><li>title: title of the paper</li><li>comments: additional info, such as number of pages and figures</li><li>journal-ref: information about the journal the paper was published in</li><li>doi: Digital Object Identifier</li><li>abstract: the abstract of the paper</li><li>categories: categories / tags in the arXiv system</li><li>versions: a version history</li></ul> | | |
| TREC 2019 Fair Ranking Track | Academic search task. The corpus for this project is the Semantic Scholar (S2) Open Corpus from the Allen Institute for Artificial Intelligence. | A large corpus of 81.1M English-language academic papers spanning many academic disciplines. The corpus consists of rich metadata, paper abstracts, resolved bibliographic references, and structured full text for 8.1M open access papers. Full text is annotated with automatically detected inline mentions of citations, figures, and tables, each linked to its corresponding paper object. (paper) | | Requires permission to access the data |
| Explicit Semantic Ranking Dataset | Dataset for the paper *Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding*. | It includes:<ul><li>the query log used in the paper</li><li>relevance judgements for the queries</li><li>ranking lists from Semantic Scholar</li><li>candidate documents</li><li>entity embeddings trained using the knowledge graph, and baselines, development methods, and alternative methods from the experiments</li></ul> | | Publicly available |
| AMiner Citation Network Dataset: DBLP+Citation, ACM Citation network | Citation data extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. | The first version contains 629,814 papers and 632,752 citations. Each paper is associated with an abstract, authors, year, venue, and title. | | Publicly available |
| CORE | Provides access to 89M free-to-read full-text research papers, with 29M full texts hosted directly. | Metadata and full-text research articles from thousands of data providers. On top of this continuously growing corpus | | API |
| Semantic Scholar Academic Graph | Multidisciplinary knowledge graph in which scientific papers and their authors are connected by citations of one paper by another. | Metadata from scholarly articles. | | API |
| CiteSeer | An evolving scientific literature digital library and search engine focused primarily on the literature in computer and information science. | Metadata, databases, data sets of PDF files, and text of PDF files. | | Requires permission to access the shared folders on Google Drive |
| Isearch | Information retrieval (IR) test collection to facilitate the evaluation of integrated search, i.e. search across a range of different sources but with one search box and one ranked result list. | Approx. 18,000 monographic records, 160,000 papers and journal articles in PDF, and 275,000 abstracts with a varied set of metadata and vocabularies from the physics domain; 65 topics based on real work tasks, with corresponding graded relevance assessments. | | |
| English Wikipedia | Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use, or database queries (such as for Wikipedia:Maintenance). | Abstracts of articles from the English Wikipedia (202,383 documents), as well as titles and URLs. | All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). | data (795 MB) |
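
The arXiv Dataset entry above lists the fields carried by each metadata record. As a minimal sketch of how such records might be read and filtered, the Python below assumes the metadata file stores one JSON object per line and that `categories` is a space-separated string; neither detail is stated in this README, so check them against the actual file. The helper names (`iter_arxiv_records`, `has_category`) and the sample records are hypothetical and exist only for illustration.

```python
import json
from typing import Iterable, Iterator

# Field names taken from the arXiv Dataset entry in the table above.
ARXIV_FIELDS = (
    "id", "submitter", "authors", "title", "comments",
    "journal-ref", "doi", "abstract", "categories", "versions",
)


def iter_arxiv_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line.

    Assumption: the file holds one JSON object per line (JSON Lines).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Keep only the documented fields; absent ones become None.
        yield {field: record.get(field) for field in ARXIV_FIELDS}


def has_category(record: dict, prefix: str) -> bool:
    """True if any category equals `prefix` or starts with `prefix.`.

    Assumption: `categories` is a space-separated string such as "cs.IR cs.CL".
    """
    categories = (record.get("categories") or "").split()
    return any(c == prefix or c.startswith(prefix + ".") for c in categories)


# Made-up sample records, not real arXiv entries.
SAMPLE_LINES = [
    '{"id": "0000.00001", "submitter": "A. Author", "authors": "A. Author, B. Author", '
    '"title": "An Example Paper", "comments": "10 pages, 2 figures", "journal-ref": null, '
    '"doi": null, "abstract": "A placeholder abstract.", "categories": "cs.IR cs.CL", '
    '"versions": [{"version": "v1"}]}',
    "",
    '{"id": "0000.00002", "title": "Another Example", "categories": "physics.optics"}',
]

records = list(iter_arxiv_records(SAMPLE_LINES))
cs_titles = [r["title"] for r in records if has_category(r, "cs")]
print(cs_titles)
```

Because `iter_arxiv_records` accepts any iterable of strings, an open file handle can be passed in place of `SAMPLE_LINES`, which avoids loading the whole metadata file into memory at once.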
