houyongkuo/NewsSpider_CNN_ByKeywords

CNN_NewsSpider_ByKeywords

CNN (Cable News Network) was founded in June 1980 by Ted Turner of Turner Broadcasting System (TBS). It delivers round-the-clock news programming to cable and satellite TV subscribers and is headquartered in Atlanta, Georgia, USA. CNN website: https://edition.cnn.com

Crawl English News Demo (CNN as an example)

Installation

First, create the Anaconda environment that provides the required dependencies by running:
conda env create -f newsspider.yml

Set types, sections, sort and other parameters

Second, you can restrict the crawl to specific categories of content by adjusting the filter parameters in the search URL.
Sections = us, politics, world, opinion, health, business, entertainment, sport, travel, style. To search all of CNN, omit the sections parameter.
URL = 'https://edition.cnn.com/search?size=10&q=' + keywords + '&sections=us,politics,world,opinion,health' + '&types=article' + '&sort=relevance'
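The same URL can also be assembled from a parameter dictionary, which makes it easier to switch sections or drop them entirely. This is a minimal sketch, not code from the repository: the function name build_search_url and its arguments are hypothetical, and only the base URL and the parameter names (size, q, sections, types, sort) are taken from the line above.

```python
from urllib.parse import urlencode

# Base search endpoint, taken from the URL shown above.
BASE_URL = "https://edition.cnn.com/search"


def build_search_url(keywords, sections=None, types="article",
                     sort="relevance", size=10):
    """Build a CNN search URL (hypothetical helper, for illustration only)."""
    params = {"size": size, "q": keywords}
    # Leave the sections parameter out entirely to search all of CNN.
    if sections:
        params["sections"] = ",".join(sections)
    params["types"] = types
    params["sort"] = sort
    # safe="," keeps the commas in the sections list unescaped.
    return BASE_URL + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    print(build_search_url("climate", ["us", "politics", "world"]))
    print(build_search_url("climate change"))
```

Using urlencode also escapes spaces and special characters in the keyword, which plain string concatenation does not.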

Set keyword and savedir

You can set the keyword and savedir in CNN_NewsSpider_Keywords.py.
keyword =
savedir =

Crawl

After setting the necessary information, run the following command to start crawling:
python CNN_NewsSpider_Keywords.py

Post-processing

After crawling finishes, you can use Extract_TextContent.py to extract the text content and adapt it for custom post-processing, such as deduplication, to get the desired content.
python Extract_TextContent.py

Duplicate content can be removed with the check function, and the sequence refilled, by running:
python Fill_Sequence.py
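How the check function and Fill_Sequence.py work internally is not shown here. As a rough illustration of the idea only, deduplication followed by renumbering might look like the sketch below. The function names deduplicate and renumber are hypothetical, and hashing whitespace-normalized, lowercased text is an assumed strategy rather than the repository's actual method.

```python
import hashlib


def deduplicate(texts):
    """Return texts with duplicates removed, keeping first occurrences.

    Two texts count as duplicates when they match after collapsing
    whitespace and lowercasing (an assumption made for this sketch).
    """
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def renumber(texts, start=1):
    """Pair each remaining text with a gap-free sequence number."""
    return list(enumerate(texts, start=start))


if __name__ == "__main__":
    articles = ["First story.", "first   story.", "Second story."]
    for index, body in renumber(deduplicate(articles)):
        print(index, body)
```

Renumbering after deduplication keeps the sequence free of the gaps that removed items would otherwise leave.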

Crawl Chinese News Demo (People's Daily as an example)

The code for this part borrows heavily from people-daily-crawler-date.
Source site: People's Daily Online (人民日报-人民网, people.com.cn).

References

Much of the code is based on people-daily-crawler-date. Thanks to its authors for open-sourcing their work.

About

Script for crawling CNN news by keyword
