desiquant/news_scraper

DesiQuant News Scraper

A Scrapy crawler that scrapes market news from Indian financial news outlets.

⚠️ WARNING: Work in progress. Breaking changes will be introduced frequently. The dataset is not updated every day. The documentation and code need more detail and clarity.

Data Usage

The data is updated periodically (every hour) and saved to s3://desiquant/data/news.parquet. This file can be read as a pandas DataFrame as follows:

import s3fs  # must be installed so that pandas can read s3:// URLs
import pandas as pd

df = pd.read_parquet("s3://desiquant/data/news.parquet", storage_options={
    "key": "sceN1eFOJQmBIWHNEMd8",
    "secret": "w1BERx7F6LTe87sk9K9deoBcfYXNCwlol5xcLeev",
    "endpoint_url": "http://data.desiquant.com:9000",
})
df
    | url | title | article_text | author | date_modified | date_published | description | scrapy_parsed_at | scrapy_scraped_at
0   | https://www.moneycontrol.com/news/business/mar... | Stay stock-specific and maintain strict stop-l... | It was a historic week (ended July 19) for dom... | Jigar Patel | 2024-07-21T19:55:46+05:30 | 2024-07-21T19:55:46+05:30 | A breach of 24,500 could halt the current mome... | 2024-07-21 17:09:08.064765 | 2024-07-21 17:09:07+00:00
1   | https://www.moneycontrol.com/news/business/mar... | Wall St ends volatile session lower in afterma... | US stocks extended their slump on Friday as li... | Reuters | 2024-07-20T10:17:59+05:30 | 2024-07-20T10:17:59+05:30 | The Dow Jones Industrial Average fell 377.49 p... | 2024-07-21 17:09:08.474808 | 2024-07-21 17:09:08+00:00
2   | https://www.moneycontrol.com/news/business/mar... | Trade Spotlight: How should you trade Federal ... | The benchmark indices saw profit booking after... | Sunil Shankar Matkar | 2024-07-11T01:49:05+05:30 | 2024-07-11T01:46:54+05:30 | If the Nifty 50 breaks 24,200, the immediate s... | 2024-07-21 17:09:09.444510 | 2024-07-21 17:09:08+00:00
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
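
In the sample rows, date_published carries a +05:30 offset while scrapy_scraped_at is in UTC, so the two cannot be compared as raw strings. The following is a minimal sketch, using only the standard library, of normalising both to UTC and measuring how long after publication an article was scraped. It assumes the values are available as ISO 8601 strings; the helper names to_utc and scrape_lag are hypothetical and not part of this repository.

```python
from datetime import datetime, timedelta, timezone


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 string that carries a UTC offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Naive timestamps are rejected instead of guessing their timezone.
        raise ValueError(f"timestamp has no UTC offset: {timestamp!r}")
    return parsed.astimezone(timezone.utc)


def scrape_lag(date_published: str, scraped_at: str) -> timedelta:
    """Return the time elapsed between publication and scraping."""
    return to_utc(scraped_at) - to_utc(date_published)


# Values copied from row 0 of the sample output.
lag = scrape_lag("2024-07-21T19:55:46+05:30", "2024-07-21 17:09:07+00:00")
print(lag)  # 2:43:21
```

Note that scrapy_parsed_at has no offset in the sample output, so to_utc would reject it as written; how that column should be interpreted is not stated here.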

Scraper Usage

Run a spider. The output is saved to outputs/moneycontrol.jl in JSON Lines format.

# scrape all market articles from "2010-01-01" until today
scrapy crawl moneycontrol

# trial run: stops after scraping 10 items. useful for testing purposes
TRIAL_RUN=1 scrapy crawl moneycontrol
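
Since the spider output is a JSON Lines file, each line is an independent JSON object and the file can be read back one item at a time. The sketch below uses only the standard library. The function name read_jsonlines is hypothetical, and the url and title keys are an assumption based on the dataset columns shown under Data Usage.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonlines(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == "__main__":
    # Path taken from the text above; nothing is printed if the file is absent.
    output = Path("outputs/moneycontrol.jl")
    if output.exists():
        for item in read_jsonlines(str(output)):
            # Assumed keys; .get() returns None if an item lacks them.
            print(item.get("url"), item.get("title"))
```

Reading lazily with a generator keeps memory use flat even when a full crawl produces a large file.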

To view a list of all available spiders:

scrapy list

# businessstandard
# businesstoday
# economictimes
# financialexpress
# firstpost
# freepressjournal
# indianexpress
# ipfy
# moneycontrol
# ndtvprofit
# news18
# outlookindia
# thehindu
# thehindubusinessline
# zeenews

To run all the spiders in production:

# optional: view the scraping benchmark performed by Scrapy
scrapy bench

# run all the spiders
python run.py

Run tests to check if spiders are still working.

# view the parsed article
scrapy parse https://www.businesstoday.in/markets/stocks/story/upward-revision-in-eps-estimates-what-analysts-say-on-tcs-q1-results-stock-trading-strategy-436794-2024-07-11

# install the test dependencies, then test all spiders
pip install -e ".[test]"
pytest

Sitemaps

The sitemaps for each website are not always listed in robots.txt. Googling for keywords like "ndtvprofit.com daily sitemap xml" seems to retrieve the ones that are not mentioned there.

Publisher               | Sitemap Type     | Sitemap Link
News 18                 | Daily Sitemap    | Link
The Hindu               | Daily Sitemap    | Link
The Hindu Business Line | Daily Sitemap    | Link
Business Today          | Daily Sitemap    | Link
Money Control           | Daily Sitemap    | Link
Business Standard       | Sitemap Index    | Link
Economic Times          | Monthly Sitemaps | Link
Firstpost               | Daily Sitemap    | Link
NDTV Profit             | Daily Sitemap    | Link
Free Press Journal      | Daily Sitemap    | Link
Outlook India           | Daily Sitemap    | Link
Zee News                | Monthly Sitemap  | Link
Financial Express       | Daily Sitemap    | Link
Indian Express          | Daily Sitemap    | Link
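
Checking whether a site declares its sitemaps in robots.txt, as described at the start of this section, can be sketched as follows. This is a standard-library illustration, not the approach used by the spiders in this repository. The function name sitemaps_from_robots is hypothetical and example.com is a placeholder domain.

```python
from typing import List


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs declared on `Sitemap:` lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split on the first colon only so the
        # "https://" inside the URL is preserved.
        line = line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt body used only for illustration.
sample = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap-daily.xml
sitemap: https://example.com/sitemap-index.xml
"""
print(sitemaps_from_robots(sample))
```

An empty result means the site does not advertise a sitemap in robots.txt, which is when searching for it manually becomes necessary.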

More Sources

The following news websites were under consideration, but no daily sitemaps were found for them. Some potentially effective strategies (which require more research) for iteratively retrieving a list of all articles are mentioned below.

Notes:

TODO

  • Generate a final s3 dump file that can be used.
  • Run the scraper as prefect flow
  • Scraping mode - Update/dump
  • While running the test, if it fails, prevent scrapy from showing the entire output
  • export PYTHONDONTWRITEBYTECODE=1
  • pytest fails on a few spiders on the remote server
  • moneycontrol and indianexpress have very aggressive protection. They don't seem to allow even floating IPs from Hetzner, but IPs from Bright Data seem to work

Server Checklist

  • Attach floating IPs
  • Prevent pycache
  • Mount volume