Mechanical News

Mechanical News is an application framework that scrapes and saves the full text of online news articles to a database for social science research purposes.

Mechanical News is built on top of Scrapy and Flask. You write web scrapers that retrieve news articles (using Scrapy) and store them in the database, and then connect to a RESTful API (using Flask) to retrieve the articles from the database.

You run Mechanical News on your own server. Users (i.e., researchers) use an R library or Python package to access the articles in a tidy data format directly from the API, so they don't need to know anything about how Mechanical News works.
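As a rough illustration of what "tidy" could mean here, the sketch below flattens a JSON response into one row per article with a fixed set of columns. The JSON layout, the column names, and the function `to_tidy_rows` are assumptions made for this example; the actual API response format is not described in this README.

```python
import json
from typing import Any, Dict, List

# Hypothetical column set; the real API may expose different fields.
COLUMNS = ["url", "headline", "published", "section"]


def to_tidy_rows(payload: str) -> List[Dict[str, Any]]:
    """Turn a JSON array of article objects into one flat row per article."""
    articles = json.loads(payload)
    # Missing fields become None so that every row has the same columns.
    return [{col: article.get(col) for col in COLUMNS} for article in articles]


# Made-up response body, standing in for what the API might return.
sample = json.dumps([
    {"url": "https://example.com/a", "headline": "First", "section": "World"},
    {"url": "https://example.com/b", "headline": "Second", "published": "2019-01-01"},
])

rows = to_tidy_rows(sample)
```

Rows shaped like this can be passed straight to a data frame constructor in Python or R, which is the main point of keeping every row to the same columns.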

Features

  • Build your own Scrapy scraper (or use an existing scraper from the library)
  • Extract information from news articles
  • Store full-text news articles in a database
  • Run in different modes:
    • Scrape articles from news sites continuously (e.g., every day)
    • Scrape articles from specific URLs
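When articles are scraped continuously, the same URL is likely to be seen more than once, so the storage step needs some way to avoid duplicates. The sketch below shows one possible approach, keyed on the article URL. It uses Python's built-in `sqlite3` as a stand-in so that it runs without a server; Mechanical News itself targets MySQL, and the table layout and the `save_article` helper are hypothetical.

```python
import sqlite3

# Hypothetical, heavily simplified schema; not the project's real table layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    url      TEXT PRIMARY KEY,
    headline TEXT,
    body     TEXT
)
"""


def save_article(conn: sqlite3.Connection, url: str, headline: str, body: str) -> bool:
    """Insert an article; return False if the URL was already stored."""
    # INSERT OR IGNORE is SQLite syntax; MySQL would use INSERT IGNORE instead.
    cursor = conn.execute(
        "INSERT OR IGNORE INTO articles (url, headline, body) VALUES (?, ?, ?)",
        (url, headline, body),
    )
    conn.commit()
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

first = save_article(conn, "https://example.com/a", "Headline", "Full text ...")
second = save_article(conn, "https://example.com/a", "Headline", "Full text ...")
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```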

Extracted information from news articles

News content

  • headline
  • article lead
  • article body text
  • links in article body text
  • main image

Metadata

  • authors
  • date of publication
  • date of modification
  • news section (e.g., World, Sports, Tech)
  • tags
  • categories
  • language
  • type of page (e.g., text article, video, sound)
  • news genre (e.g., news, sports, opinion, entertainment)
  • whether the article is behind a paywall
  • HTTP response headers
  • metadata tags (e.g., OpenGraph, microformats)
  • when the article was present on the frontpage
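To give a sense of what extracting metadata tags can involve, here is a minimal sketch that collects OpenGraph tags from a page using only the Python standard library. It is not the extraction code used by Mechanical News; the class `OpenGraphParser`, the function `extract_opengraph`, and the sample HTML are all made up for this example.

```python
from html.parser import HTMLParser
from typing import Dict


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags into a dict."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        # Keep only OpenGraph properties that actually carry a value.
        if prop.startswith("og:") and content is not None:
            self.tags[prop] = content


def extract_opengraph(html: str) -> Dict[str, str]:
    """Return all OpenGraph properties found in the given HTML document."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags


# Made-up page, for illustration only.
sample_html = """
<html><head>
<meta property="og:title" content="Example headline">
<meta property="og:type" content="article" />
<meta name="description" content="Not an OpenGraph tag">
</head><body></body></html>
"""

og = extract_opengraph(sample_html)
```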

Overview of the architecture

Figure: Overview of the architecture of Mechanical News.

Install

Not yet available

Requirements:

  • Python 3.6+
  • MySQL 5.6+
  • Docker

Mechanical News has been tested on Windows 10, Red Hat 7.6, and Ubuntu 18.

Quick start

Scrape all news articles from the news frontpages using all available spiders in the /spiders directory by running this from the project path:

$ python run.py --crawl

Scrape all news articles from the frontpage of a specific site (bbc is the name of the spider):

$ python run.py --crawl bbc

Scrape the news article content from a specific URL:

$ python run.py --url https://www.bbc.com/XXX
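The commands above suggest that `--crawl` takes an optional spider name and that `--url` takes a single address. The sketch below shows one way such options could be parsed with `argparse`. It is not the actual `run.py`; the `ALL_SPIDERS` sentinel and the choice to make the two options mutually exclusive are assumptions.

```python
import argparse

# Hypothetical sentinel meaning "no spider name given, so crawl with all spiders".
ALL_SPIDERS = "*"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run.py")
    # Assumption: the two modes cannot be combined in a single invocation.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument(
        "--crawl",
        nargs="?",          # the spider name is optional
        const=ALL_SPIDERS,  # value used when --crawl is given without a name
        metavar="SPIDER",
        help="crawl news frontpages with all spiders, or only with SPIDER",
    )
    mode.add_argument(
        "--url",
        metavar="URL",
        help="scrape the article at a specific URL",
    )
    return parser


parser = build_parser()
crawl_all = parser.parse_args(["--crawl"])
crawl_one = parser.parse_args(["--crawl", "bbc"])
single_url = parser.parse_args(["--url", "https://www.bbc.com/XXX"])
```

Using `nargs="?"` together with `const` is what lets the same flag work both with and without a spider name.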

Available spiders

Show all spiders you have installed:

$ python run.py --list

This will list all spiders in your /spiders directory. A spider is responsible for scraping a news site.
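As a sketch of how such a listing could be produced, the example below derives spider names from the file names in a directory. It assumes one spider per `.py` file, named after the file; this README does not say how Mechanical News actually discovers spiders, and both helper functions are hypothetical.

```python
import os
from typing import Iterable, List


def spider_names(filenames: Iterable[str]) -> List[str]:
    """Derive spider names from file names, assuming one spider per .py file."""
    names = []
    for filename in filenames:
        stem, ext = os.path.splitext(filename)
        # Skip non-Python files, __init__.py, and other underscore-prefixed modules.
        if ext == ".py" and not stem.startswith("_"):
            names.append(stem)
    return sorted(names)


def list_spiders(spiders_dir: str = "spiders") -> List[str]:
    """List the spiders found in the given directory (hypothetical layout)."""
    return spider_names(os.listdir(spiders_dir))


# Made-up directory contents, for illustration only.
example = spider_names(["bbc.py", "__init__.py", "cnn.py", "notes.txt"])
```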

Documentation

See documentation wiki.

Contribute

Read how to contribute to Mechanical News by writing your own scrapers and sharing them.

Support

License

GNU General Public License v3.0

Similar projects

  • newspaper - library for automatic news article metadata extraction using heuristics. Mainly useful for English-language content and when you don't need specific metadata.
  • news-please - library and system for news article metadata extraction with database and search function, also built on Scrapy and newspaper. However, you cannot specify what information you want to extract.
  • Media Cloud - open data platform that allows researchers to answer quantitative questions about the content of online media. Roll your own server or use the cloud service. However, you cannot access full text due to copyright.