CNN (Cable News Network) was founded in June 1980 by Ted Turner of Turner Broadcasting System (TBS). It delivers 24-hour news programming to cable and satellite TV viewers and is headquartered in Atlanta, Georgia, USA. CNN Website: CNN
First, create the Anaconda environment that provides the required dependencies by running:
conda env create -f newsspider.yml
Second, you can restrict crawling to specific content categories by filtering on the sections parameter of the search URL.
Sections = us, politics, world, opinion, health, business, entertainment, sport, travel, style. To search all of CNN, omit the sections parameter.
URL = 'https://edition.cnn.com/search?size=10&q=' + keywords + '&sections=us,politics,world,opinion,health' + '&types=article' + '&sort=relevance'
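The URL construction above can also be sketched with the standard library, which takes care of percent-encoding keywords that contain spaces or special characters. This is a minimal sketch, not the code in CNN_NewsSpider_Keywords.py: the function name `build_search_url` and its parameters are hypothetical, and only the query parameters shown in the URL above (`size`, `q`, `sections`, `types`, `sort`) are assumed to be accepted by the search page.

```python
from urllib.parse import urlencode

# Base address taken from the URL shown above.
BASE_URL = "https://edition.cnn.com/search"


def build_search_url(keywords, sections=None, size=10, sort="relevance"):
    """Build a CNN search URL (hypothetical helper, for illustration only).

    keywords: search string, e.g. "climate change"
    sections: list of section names, or None to search all of CNN
    """
    params = {"size": size, "q": keywords}
    if sections:
        # Sections are passed as one comma-separated value.
        params["sections"] = ",".join(sections)
    params["types"] = "article"
    params["sort"] = sort
    # safe="," keeps the commas between section names unescaped.
    return BASE_URL + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    print(build_search_url("climate change", ["us", "politics", "world"]))
    print(build_search_url("election"))  # no sections: search all of CNN
```

Passing `sections=None` leaves the parameter out entirely, which matches the "all of CNN" case described above.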
You can set the keyword and savedir in CNN_NewsSpider_Keywords.py.
keyword =
savedir =
After setting these values, run the following command to start crawling:
python CNN_NewsSpider_Keywords.py
After crawling finishes, you can use Extract_TextContent.py to extract the text content and apply custom post-processing, such as deduplication, to get the desired content.
python Extract_TextContent.py
Duplicate content can be removed with the check function, and the sequence can be refilled, by running:
python Fill_Sequence.py
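For readers who want to customize the deduplication step, one common approach is to hash a normalized form of each article and keep only the first occurrence. The sketch below is an illustration under that assumption; it is not the check function used in this repository, and the name `deduplicate` is hypothetical.

```python
import hashlib


def deduplicate(articles):
    """Return articles with exact duplicates removed, keeping first occurrences.

    Two texts count as duplicates if they match after collapsing
    whitespace and lowercasing (an assumption made for this sketch).
    """
    seen = set()
    unique = []
    for text in articles:
        # Normalize so trivial whitespace or case differences do not matter.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    docs = ["Breaking news today", "breaking  news today", "Another story"]
    print(deduplicate(docs))
```

Hashing only catches exact matches after normalization; near-duplicates, such as the same story with an updated headline, would need a similarity measure instead.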
The code of this part is heavily borrowed from people-daily-crawler-date
People's Daily - People's Daily Online (people.com.cn).
Much of the code references people-daily-crawler-date; thanks to its authors for open-sourcing it.