CNN_NewsSpider_ByKeywords

CNN (Cable News Network) was founded in June 1980 by Ted Turner of Turner Broadcasting System (TBS). It provides 24-hour news programming to cable and satellite TV viewers and is headquartered in Atlanta, Georgia, USA. CNN Website: CNN

Crawl English News Demo (CNN as an example)

Installation

First, create the Anaconda environment that provides the required dependencies by running:
conda env create -f newsspider.yml

Set Types, sections, sort and other info

Second, restrict the categories of content to be crawled by setting the filter parameters in the search URL.
Sections = us, politics, world, opinion, health, business, entertainment, sport, travel, style. To search all of CNN, omit the sections parameter.
URL = 'https://edition.cnn.com/search?size=10&q=' + keywords + '&sections=us,politics,world,opinion,health' + '&types=article' + '&sort=relevance'
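The URL concatenation above can also be written with the standard library, which takes care of escaping keywords that contain spaces. This is only a sketch: the function name build_search_url and its defaults are assumptions for illustration, not code from CNN_NewsSpider_Keywords.py, and it reproduces only the query parameters shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://edition.cnn.com/search"


def build_search_url(keywords,
                     sections=("us", "politics", "world", "opinion", "health"),
                     types="article", sort="relevance", size=10):
    """Assemble a search URL (hypothetical helper, not part of the repository).

    Pass sections=() to leave the sections parameter out and search all of CNN.
    """
    params = {"size": size, "q": keywords}
    if sections:
        params["sections"] = ",".join(sections)
    params["types"] = types
    params["sort"] = sort
    # safe="," keeps the comma-separated section list unescaped in the URL
    return BASE_URL + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    print(build_search_url("climate change"))
    print(build_search_url("climate change", sections=()))
```

With the default arguments and a single-word keyword, the result matches the URL string shown above character for character.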

Set keyword and savedir

You can set the keyword and savedir in CNN_NewsSpider_Keywords.py.
keyword =
savedir =
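As a rough illustration of how these two settings might be used together, the sketch below creates one output folder per keyword. The helper prepare_savedir, the folder layout, and the example values are all assumptions made for this example; the actual script may organize its output differently.

```python
from pathlib import Path


def prepare_savedir(savedir, keyword):
    """Create (if needed) and return a per-keyword output folder.

    Hypothetical helper: the per-keyword layout is an assumption.
    """
    # Replace spaces so the keyword is convenient to use as a folder name
    folder = Path(savedir) / keyword.strip().replace(" ", "_")
    folder.mkdir(parents=True, exist_ok=True)
    return folder


if __name__ == "__main__":
    # Example values only; set your own keyword and savedir
    print(prepare_savedir("./cnn_news", "climate change"))
```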

Crawl

After setting the necessary information, run the following command to start crawling:
python CNN_NewsSpider_Keywords.py

Post-processing

After crawling finishes, use Extract_TextContent.py to extract the text content and apply custom post-processing, such as deduplication, to get the desired content.
python Extract_TextContent.py

Duplicate content can be removed with the check function and the recursive sequence step by running:
python Fill_Sequence.py
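For readers who want to see what this kind of post-processing can look like, here is a minimal sketch of one common approach: drop articles whose normalized text has already been seen, then assign consecutive sequence numbers to what remains. It is not the check function from Fill_Sequence.py; the function names deduplicate and renumber and the hash-based comparison are assumptions for illustration.

```python
import hashlib


def deduplicate(articles):
    """Keep the first occurrence of each article text.

    Texts are compared after collapsing whitespace and lowercasing.
    """
    seen = set()
    unique = []
    for text in articles:
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def renumber(articles, start=1):
    """Pair each remaining article with a consecutive sequence number."""
    return list(enumerate(articles, start=start))


if __name__ == "__main__":
    docs = ["Breaking  news today", "breaking news today", "Another story"]
    for number, text in renumber(deduplicate(docs)):
        print(number, text)
```

Hashing the normalized text keeps memory use small when article bodies are long, at the cost of catching only exact duplicates after normalization, not near-duplicates.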

Crawl Chinese News Demo (People's Daily as an example)

The code for this part borrows heavily from people-daily-crawler-date.
People's Daily - People's Daily Online (people.com.cn).

References

Much of the code references people-daily-crawler-date; thanks to its authors for open-sourcing their work.

