
Reconstructed "Great Recession News" Corpus (Research in Corpus Linguistics) with the help of Selenium Bot.


maciejskorski/GreatRecessionNews


Great Recession News Corpus

Overview

Reconstructed "Great Recession News" Corpus, described in "Building the Great Recession News Corpus (GRNC): A contemporary diachronic corpus of economy news in English" (Research in Corpus Linguistics, 2020).

The authors share neither the source data nor the list of articles. The corpus can only be accessed through paid Sketch Engine plans.

This repository offers an alternative by publishing the digital identifiers (URLs) of the documents used in the corpus. The content can then be retrieved for non-commercial purposes through the publishers' developer APIs or with scrapers.
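As a rough illustration of the API route, the sketch below maps a Guardian article URL from the corpus to a request URL for the Guardian Open Platform content API. It is not part of this repository: the function name `guardian_api_url`, the sample article path, and the choice of the `bodyText` field are assumptions made for the example, and a personal API key is required. Older article URLs hosted on other domains would need extra handling.

```python
from urllib.parse import urlencode, urlparse

# Base host of the Guardian Open Platform content API.
GUARDIAN_API_HOST = "https://content.guardianapis.com"


def guardian_api_url(article_url: str, api_key: str) -> str:
    """Build a content API request URL for one article URL (hypothetical helper)."""
    parsed = urlparse(article_url)
    # Assumption: the identifier is a theguardian.com URL.
    if not parsed.netloc.endswith("theguardian.com"):
        raise ValueError(f"not a Guardian URL: {article_url}")
    # The API addresses a single item by the same path as the website.
    path = parsed.path.strip("/")
    # show-fields=bodyText asks for the plain-text body of the article.
    query = urlencode({"api-key": api_key, "show-fields": "bodyText"})
    return f"{GUARDIAN_API_HOST}/{path}?{query}"


if __name__ == "__main__":
    # The article path below is made up for illustration.
    print(guardian_api_url(
        "https://www.theguardian.com/business/2008/sep/15/example-article",
        api_key="YOUR_KEY",
    ))
```

The resulting URL can be fetched with any HTTP client. The request itself is left out here so that the sketch runs without network access or a valid key.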

Data Description

Sketch Engine processed 18,915 articles from "The Guardian" and 13,069 articles from "New York Times". The data contain duplicate URLs, which the original paper does not mention; the table below compares unique and total counts. The URLs retrieved in this repository match exactly those available in Sketch Engine.

| source | unique urls | total urls |
| --- | --- | --- |
| New York Times | 12556 | 13069 |
| The Guardian | 18161 | 18915 |
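By the figures above, 513 New York Times entries and 754 Guardian entries are repeats (total minus unique). A minimal sketch of how such counts could be recomputed from a list of URLs follows; the function name `summarize_urls` and the sample URLs are hypothetical, and the comparison is exact string matching after trimming whitespace, which may differ from how the counts in this repository were produced.

```python
from collections import Counter
from typing import Iterable


def summarize_urls(urls: Iterable[str]) -> dict:
    """Count total, unique, and redundant entries in a list of URLs."""
    # Trim whitespace and skip empty lines before counting.
    counts = Counter(u.strip() for u in urls if u.strip())
    total = sum(counts.values())
    unique = len(counts)
    return {
        "total": total,
        "unique": unique,
        "redundant": total - unique,
        # URLs that occur more than once, with their multiplicity.
        "duplicated": {u: n for u, n in counts.items() if n > 1},
    }


if __name__ == "__main__":
    # Made-up URLs for illustration only.
    sample = [
        "https://example.org/a",
        "https://example.org/b",
        "https://example.org/a",
    ]
    print(summarize_urls(sample))
```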

Methodology

I developed a Selenium bot to extract the article identifiers that were available from Sketch Engine.
