A scrapy crawler that scrapes market news from Indian financial news outlets
⚠️ WARNING: Work in progress. Expect frequent breaking changes. The dataset is not updated every day. The code and documentation need more detail and clarity.
The data is periodically updated (every 1 hour) and saved to s3://desiquant/data/news.parquet. This file can be read directly into a pandas DataFrame as follows:
import s3fs  # not called directly; pandas needs it installed to read s3:// paths
import pandas as pd
df = pd.read_parquet("s3://desiquant/data/news.parquet", storage_options={
"key": "sceN1eFOJQmBIWHNEMd8",
"secret": "w1BERx7F6LTe87sk9K9deoBcfYXNCwlol5xcLeev",
"endpoint_url": "http://data.desiquant.com:9000",
})
df
| | url | title | article_text | author | date_modified | date_published | description | scrapy_parsed_at | scrapy_scraped_at |
---|---|---|---|---|---|---|---|---|---|
0 | https://www.moneycontrol.com/news/business/mar... | Stay stock-specific and maintain strict stop-l... | It was a historic week (ended July 19) for dom... | Jigar Patel | 2024-07-21T19:55:46+05:30 | 2024-07-21T19:55:46+05:30 | A breach of 24,500 could halt the current mome... | 2024-07-21 17:09:08.064765 | 2024-07-21 17:09:07+00:00 |
1 | https://www.moneycontrol.com/news/business/mar... | Wall St ends volatile session lower in afterma... | US stocks extended their slump on Friday as li... | Reuters | 2024-07-20T10:17:59+05:30 | 2024-07-20T10:17:59+05:30 | The Dow Jones Industrial Average fell 377.49 p... | 2024-07-21 17:09:08.474808 | 2024-07-21 17:09:08+00:00 |
2 | https://www.moneycontrol.com/news/business/mar... | Trade Spotlight: How should you trade Federal ... | The benchmark indices saw profit booking after... | Sunil Shankar Matkar | 2024-07-11T01:49:05+05:30 | 2024-07-11T01:46:54+05:30 | If the Nifty 50 breaks 24,200, the immediate s... | 2024-07-21 17:09:09.444510 | 2024-07-21 17:09:08+00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
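The sample rows above show that `date_published` carries an IST offset (+05:30) while `scrapy_scraped_at` is in UTC, so the two need to be normalised to one timezone before they can be compared. A minimal sketch using only the standard library, with values copied from the first sample row; the helper name `publish_to_scrape_lag` is hypothetical and not part of this project:

```python
from datetime import datetime, timedelta, timezone


def publish_to_scrape_lag(date_published: str, scrapy_scraped_at: str) -> timedelta:
    """Return how long after publication an article was scraped."""
    # Both strings are ISO 8601 with an explicit UTC offset, so
    # fromisoformat() yields timezone-aware datetimes.
    published = datetime.fromisoformat(date_published).astimezone(timezone.utc)
    scraped = datetime.fromisoformat(scrapy_scraped_at).astimezone(timezone.utc)
    return scraped - published


# Values taken from row 0 of the preview table.
lag = publish_to_scrape_lag("2024-07-21T19:55:46+05:30", "2024-07-21 17:09:07+00:00")
print(lag)  # 2:43:21
```

With the full DataFrame, the same normalisation can presumably be done column-wise with `pd.to_datetime(..., utc=True)`.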
Run a spider. The output is saved to outputs/moneycontrol.jl in JSON Lines format.
# scrape all market articles from "2010-01-01" till today
scrapy crawl moneycontrol
# trial run: stops after scraping 10 items. useful for testing purposes
TRIAL_RUN=1 scrapy crawl moneycontrol
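A JSON Lines file holds one JSON object per line, so the spider output can be inspected without loading it all at once. A minimal sketch, assuming the items use the same field names as the dataset preview; the sample records and the helper `read_jsonlines` are made up for illustration:

```python
import json
import tempfile
from pathlib import Path


def read_jsonlines(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


# Stand-in for outputs/moneycontrol.jl; the records are invented and
# only reuse field names from the dataset preview.
sample = Path(tempfile.mkdtemp()) / "sample.jl"
sample.write_text(
    '{"url": "https://example.com/a", "title": "First", "author": "A"}\n'
    '{"url": "https://example.com/b", "title": "Second", "author": "B"}\n',
    encoding="utf-8",
)

items = list(read_jsonlines(sample))
print(len(items), items[0]["title"])  # 2 First
```

To read a real run, point `read_jsonlines` at outputs/moneycontrol.jl instead of the temporary file.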
To view a list of all available spiders:
scrapy list
# businessstandard
# businesstoday
# economictimes
# financialexpress
# firstpost
# freepressjournal
# indianexpress
# ipfy
# moneycontrol
# ndtvprofit
# news18
# outlookindia
# thehindu
# thehindubusinessline
# zeenews
To run all the spiders in production:
# view scraping benchmark tests performed by scrapy
scrapy bench
# run all the spiders
python run.py
Run tests to check if spiders are still working.
# view the parsed article
scrapy parse https://www.businesstoday.in/markets/stocks/story/upward-revision-in-eps-estimates-what-analysts-say-on-tcs-q1-results-stock-trading-strategy-436794-2024-07-11
# install the test dependencies (quoted so shells like zsh do not expand the brackets)
pip install -e ".[test]"
# test all spiders
pytest
The sitemaps for each website are not always listed in robots.txt. Googling for keywords like "ndtvprofit.com daily sitemap xml" seems to retrieve the ones that are not mentioned there.
Publisher | Sitemap Type | Sitemap Link |
---|---|---|
News 18 | Daily Sitemap | Link |
The Hindu | Daily Sitemap | Link |
The Hindu Business Line | Daily Sitemap | Link |
Business Today | Daily Sitemap | Link |
Money Control | Daily Sitemap | Link |
Business Standard | Sitemap Index | Link |
Economic Times | Monthly Sitemaps | Link |
Firstpost | Daily Sitemap | Link |
NDTV Profit | Daily Sitemap | Link |
Free Press Journal | Daily Sitemap | Link |
Outlook India | Daily Sitemap | Link |
Zee News | Monthly Sitemap | Link |
Financial Express | Daily Sitemap | Link |
Indian Express | Daily Sitemap | Link |
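Before falling back to a search engine, it can help to check programmatically whether a site's robots.txt declares any sitemaps at all. A minimal sketch using the standard library's `urllib.robotparser` (`site_maps()` requires Python 3.8+); the robots.txt body and the `example.com` URLs are hypothetical, and the helper `sitemaps_from_robots` is not part of this project:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; real publishers' files will differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Sitemap: https://example.com/sitemap-index.xml
Sitemap: https://example.com/sitemap-daily.xml
"""


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the Sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []


print(sitemaps_from_robots(ROBOTS_TXT))
```

An empty result is the case described above, where the sitemap has to be found some other way.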
More Sources
The following news websites were considered, but no daily sitemaps were found for them. Some candidate strategies (which require more research) for iteratively retrieving a list of all articles are noted below.
- https://www.livemint.com/api/cms/story/v2/11720327511606 - Check the Content-Length in the HEAD response. TODO: Check for the market slug with a smaller query.
- https://timesofindia.indiatimes.com/articleshow/81896735.cms - The redirect does not show up in the HEAD response, and there is no sitemap either.
- https://www.indiainfoline.com/news/top-share-market-news/page/14072 - New articles have no ID in the URL. Old articles seem to be allowed to redirect.
- https://in.investing.com/news/a/a-4293269 - Doesn't redirect to the actual URL.
- Generate a final s3 dump file that can be used.
- Run the scraper as prefect flow
- Scraping mode - Update/dump
- While running the test, if it fails, prevent scrapy from showing the entire output
- export PYTHONDONTWRITEBYTECODE=1
- pytest fails on a few spiders on the remote server
- moneycontrol and indianexpress have very aggressive protection. They don't seem to allow even floating IPs from Hetzner, but Bright Data IPs seem to work
- Attach floating IPs
- Prevent pycache
- Mount volume