DesiQuant News Scraper

A Scrapy crawler that scrapes market news from Indian financial news outlets.

⚠️ WARNING: Work in progress. Expect frequent breaking changes. The dataset is not updated every day. The code and documentation need more clarity.

Data Usage

The data is updated periodically (every hour) and saved to s3://desiquant/data/news.parquet. The file can be read into a pandas DataFrame as follows:

import s3fs  # not called directly; pandas needs it installed to read s3:// paths
import pandas as pd

df = pd.read_parquet("s3://desiquant/data/news.parquet", storage_options={
    "key": "sceN1eFOJQmBIWHNEMd8",
    "secret": "w1BERx7F6LTe87sk9K9deoBcfYXNCwlol5xcLeev",
    "endpoint_url": "http://data.desiquant.com:9000",
})
df

	url	title	article_text	author	date_modified	date_published	description	scrapy_parsed_at	scrapy_scraped_at
0	https://www.moneycontrol.com/news/business/mar...	Stay stock-specific and maintain strict stop-l...	It was a historic week (ended July 19) for dom...	Jigar Patel	2024-07-21T19:55:46+05:30	2024-07-21T19:55:46+05:30	A breach of 24,500 could halt the current mome...	2024-07-21 17:09:08.064765	2024-07-21 17:09:07+00:00
1	https://www.moneycontrol.com/news/business/mar...	Wall St ends volatile session lower in afterma...	US stocks extended their slump on Friday as li...	Reuters	2024-07-20T10:17:59+05:30	2024-07-20T10:17:59+05:30	The Dow Jones Industrial Average fell 377.49 p...	2024-07-21 17:09:08.474808	2024-07-21 17:09:08+00:00
2	https://www.moneycontrol.com/news/business/mar...	Trade Spotlight: How should you trade Federal ...	The benchmark indices saw profit booking after...	Sunil Shankar Matkar	2024-07-11T01:49:05+05:30	2024-07-11T01:46:54+05:30	If the Nifty 50 breaks 24,200, the immediate s...	2024-07-21 17:09:09.444510	2024-07-21 17:09:08+00:00
...	...	...	...	...	...	...	...	...	...
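
The timestamp columns in the sample output above follow different conventions: date_published and date_modified are ISO 8601 strings with a +05:30 offset, while scrapy_scraped_at appears to be in UTC. The following is a minimal sketch, using only the standard library, of normalizing one such string to UTC before comparing it with the scrape time. The helper name to_utc is hypothetical and not part of this repository.

```python
from datetime import datetime, timezone


def to_utc(iso_timestamp: str) -> datetime:
    """Parse an ISO 8601 string that carries a UTC offset and convert it to UTC."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are rejected rather than guessed at.
        raise ValueError(f"timestamp has no UTC offset: {iso_timestamp!r}")
    return parsed.astimezone(timezone.utc)


# Value taken from the first row of the sample output above.
published = to_utc("2024-07-21T19:55:46+05:30")
print(published.isoformat())  # 2024-07-21T14:25:46+00:00
```

For a whole column, pd.to_datetime(df["date_published"], utc=True) should give the same normalization in one call.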

Scraper Usage

Run a spider. The output is saved to outputs/moneycontrol.jl in JSON Lines format.

# scrape all market articles from "2010-01-01" until today
scrapy crawl moneycontrol

# trial run: stops after scraping 10 items. useful for testing purposes
TRIAL_RUN=1 scrapy crawl moneycontrol
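
A JSON Lines file holds one JSON object per line, so the spider output can be read back with the standard library alone. The sketch below assumes that each item carries the same fields as the parquet columns shown earlier (url, title, and so on); the function name iter_items and the two sample records are made up for illustration.

```python
import io
import json
from typing import Iterator, TextIO


def iter_items(lines: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for open("outputs/moneycontrol.jl", encoding="utf-8").
sample = io.StringIO(
    '{"url": "https://example.com/a", "title": "First"}\n'
    "\n"
    '{"url": "https://example.com/b", "title": "Second"}\n'
)
items = list(iter_items(sample))
print(len(items), items[0]["title"])  # 2 First
```

Reading line by line keeps memory use flat, which matters if a full crawl back to 2010 produces a large file.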

To view a list of all available spiders:

scrapy list

# businessstandard
# businesstoday
# economictimes
# financialexpress
# firstpost
# freepressjournal
# indianexpress
# ipfy
# moneycontrol
# ndtvprofit
# news18
# outlookindia
# thehindu
# thehindubusinessline
# zeenews

To run all the spiders in production:

# run Scrapy's built-in benchmark
scrapy bench

# run all the spiders
python run.py

Run tests to check if spiders are still working.

# view the parsed article
scrapy parse https://www.businesstoday.in/markets/stocks/story/upward-revision-in-eps-estimates-what-analysts-say-on-tcs-q1-results-stock-trading-strategy-436794-2024-07-11

# test all spiders
pip install -e ".[test]"
pytest

Sitemaps

The sitemaps for each website are not always listed in robots.txt. Googling for keywords like "ndtvprofit.com daily sitemap xml" seems to retrieve the ones that are not listed.

Publisher	Sitemap Type	Sitemap Link
News 18	Daily Sitemap	Link
The Hindu	Daily Sitemap	Link
The Hindu Business Line	Daily Sitemap	Link
Business Today	Daily Sitemap	Link
Money Control	Daily Sitemap	Link
Business Standard	Sitemap Index	Link
Economic Times	Monthly Sitemaps	Link
Firstpost	Daily Sitemap	Link
NDTV Profit	Daily Sitemap	Link
Free Press Journal	Daily Sitemap	Link
Outlook India	Daily Sitemap	Link
Zee News	Monthly Sitemap	Link
Financial Express	Daily Sitemap	Link
Indian Express	Daily Sitemap	Link
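
Once a sitemap has been located, the article URLs sit in its loc elements. The sketch below pulls them out with the standard library, assuming the publisher follows the sitemaps.org schema. The sample XML and the function name extract_urls are made up for illustration, and real publisher sitemaps may carry extra tags or namespaces.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> List[str]:
    """Return every <loc> value found in a sitemap or sitemap index document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up daily sitemap with two entries.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/markets/story-1</loc><lastmod>2024-07-21</lastmod></url>
  <url><loc>https://example.com/markets/story-2</loc></url>
</urlset>"""

print(extract_urls(sample))
```

The same function works on a sitemap index, where each loc points at another sitemap rather than an article. Scrapy also ships a SitemapSpider class for this purpose; whether the spiders in this project use it is not stated here.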

More Sources

The following news websites were considered, but no daily sitemaps were found for them. Some possible strategies (which require more research) for iteratively retrieving a list of all articles are noted below.

https://www.livemint.com/api/cms/story/v2/11720327511606 - Check Content Length in Head. TODO: Check for market slug with a smaller query
https://timesofindia.indiatimes.com/articleshow/81896735.cms - Redirect not showing in head, No sitemap as well.
https://www.indiainfoline.com/news/top-share-market-news/page/14072 - New articles have no ID in the url. Seems to allow old articles to redirect
https://in.investing.com/news/a/a-4293269 - Doesn't redirect to actual url
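
The "Check Content Length in Head" note above can be read as probing candidate article IDs with HEAD requests and inspecting the Content-Length header. The following is a rough sketch of such a probe using only the standard library. The function name head_content_length and the injectable opener parameter are hypothetical, and how to interpret the returned length (for example, treating a very small value as "no article") is an assumption that would need checking against the real site.

```python
import urllib.request
from typing import Callable, Optional


def head_content_length(
    url: str,
    timeout: float = 10.0,
    opener: Callable = urllib.request.urlopen,
) -> Optional[int]:
    """Send a HEAD request and return the Content-Length header as an int.

    Returns None when the request fails or the header is absent or non-numeric.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            value = response.headers.get("Content-Length")
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return None
    if value is None or not value.isdigit():
        return None
    return int(value)
```

A HEAD request skips the response body, so stepping through a range of IDs costs far less bandwidth than fetching each page. The opener parameter exists only so that the function can be exercised without network access.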

Notes:

TODO

Generate a final s3 dump file that can be used.
Run the scraper as prefect flow
Scraping mode - Update/dump
While running the test, if it fails, prevent scrapy from showing the entire output
export PYTHONDONTWRITEBYTECODE=1
pytest fails on a few spiders on the remote server
moneycontrol and indianexpress have very aggressive protection. They do not seem to allow even floating IPs from Hetzner, but Bright Data IPs seem to work.

Server Checklist

Attach floating IPs
Prevent pycache
Mount volume

