Python Script to Scrape GKToday Website to create monthly magazine

himanshudabas/GKTodayScrape

GKTodayScrape - Python script to scrape GKToday.in and Telegram Bot to serve those PDFs

A Python script that scrapes the GKToday website to create monthly magazines and quiz PDFs. Both the magazines and the quizzes are created in .docx and .pdf format.

The following screenshots show the script and the Telegram bot. You can see more screenshots here.

Script in action:

[Screenshot: scraping script in action]

Telegram bot serving the scraped files in PDF format:

[Screenshot: Telegram bot]

NOTE:

I have tested this script on Windows and Linux, but I have only set up the Telegram bot on Linux. On Windows, you can follow this guide through Step 2. The steps after Step 2 cover setting up the Telegram bot and apply to Linux only.

1. Setup

Clone the project into your home folder.

>> git clone ~/

1.1. Install LibreOffice in order to convert the .docx to .pdf

LibreOffice Download Link

If you are on Windows, you need to add LibreOffice to the PATH.
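As a rough illustration of why LibreOffice has to be on the PATH, the conversion step can be sketched as a headless `soffice` call from Python. This is a minimal sketch, not necessarily how scrape.py performs the conversion; the function names `build_convert_command` and `convert_to_pdf` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, binary="soffice"):
    """Return the LibreOffice command line that converts one .docx to .pdf."""
    return [
        binary,
        "--headless",            # run without opening the GUI
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_pdf(docx_path, out_dir):
    """Run the conversion and return the expected path of the PDF.

    Hypothetical helper: raises early if soffice is not on the PATH,
    which is the situation the PATH step above is meant to prevent.
    """
    if shutil.which("soffice") is None:
        raise FileNotFoundError("soffice not found on PATH")
    subprocess.run(build_convert_command(docx_path, out_dir), check=True)
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")
```

If `convert_to_pdf` raises `FileNotFoundError`, the LibreOffice program directory is most likely missing from the PATH.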

1.2. Setup Python Virtual Environment

Create a directory named gktoday in your home directory.

>> mkdir ~/gktoday

>> cd ~/gktoday

Initialize python virtual environment and activate it.

>> python3 -m venv env

>> source ./env/bin/activate

Install the required python libraries.

(env) >> pip install bs4 requests python-docx flask python-telegram-bot
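To confirm that the install succeeded inside the virtual environment, a quick check along these lines may help. It is only a sketch: the mapping from pip package names to import names reflects how these packages are normally imported, and the helper `missing_packages` is hypothetical.

```python
import importlib.util

# pip package name -> top-level module name it provides
REQUIRED = {
    "bs4": "bs4",
    "requests": "requests",
    "python-docx": "docx",
    "flask": "flask",
    "python-telegram-bot": "telegram",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose modules cannot be found, sorted."""
    return sorted(
        pkg
        for pkg, module in required.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Run it with the environment's interpreter (the `(env)` prompt) so that it inspects the virtual environment rather than the system Python.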

2. Run the Script

(env) >> python ~/GKTodayScrape/scrape.py

3. Setup of Telegram Bot

The following section will help you set up your own Telegram bot to serve the converted PDF magazines.

3.1. Create your Telegram Bot and get its API Key

How to Build Your First Telegram Bot: A Guide for Absolute Beginners

3.2. Get an SSL certificate

Note: Telegram webhooks only work over HTTPS, so you need an SSL certificate for this to work. Follow this guide to get one (yes, it's free).
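To make the HTTPS requirement concrete, the sketch below builds the Telegram Bot API `setWebhook` request URL and rejects any webhook address that is not `https://`. The token and domain in the usage line are placeholders, and `build_set_webhook_url` is a hypothetical helper, not part of app.py.

```python
from urllib.parse import urlencode, urlsplit

API_BASE = "https://api.telegram.org"


def build_set_webhook_url(token, webhook_url):
    """Return the Bot API URL that registers `webhook_url` as the webhook.

    Raises ValueError for non-HTTPS addresses, since Telegram only
    delivers webhook updates over HTTPS.
    """
    if urlsplit(webhook_url).scheme != "https":
        raise ValueError("Telegram webhooks require an https:// URL")
    query = urlencode({"url": webhook_url})
    return f"{API_BASE}/bot{token}/setWebhook?{query}"


if __name__ == "__main__":
    # Placeholder token and domain, for illustration only.
    print(build_set_webhook_url("123:ABC", "https://example.com/hook"))
```

Opening the printed URL (with your real token and domain) in a browser or with an HTTP client asks Telegram to register the webhook.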

Running Your Flask Application Over HTTPS

3.3. Run the Flask Server in the background

>> nohup ~/gktoday/env/bin/python ~/GKTodayScrape/app.py >> ~/gktoday/log/nohup_app.py.log 2>&1 &

This also appends the server's output to ~/gktoday/log/nohup_app.py.log. The ~/gktoday/log directory must already exist, because the shell redirection does not create it.

If something goes wrong, you can check this log file for errors.

4. Add cronjob

Note: This is optional, but you might want to add a cronjob to your Linux server to periodically scrape files from GKToday.in.

Open the crontab, add the following line, and save it. The schedule below runs the script every four hours, on the hour.

>> crontab -e

0 */4 * * * ~/gktoday/env/bin/python ~/GKTodayScrape/scrape.py >> ~/gktoday/log/cron_scrape.log 2>&1
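For readers unfamiliar with cron syntax, `0 */4 * * *` fires at minute 0 of every hour divisible by four (00:00, 04:00, 08:00, and so on). The sketch below computes the next firing time for such a schedule; `next_run` is a hypothetical helper written for illustration and is not used by the project.

```python
from datetime import datetime, timedelta


def next_run(now, step_hours=4):
    """Return the next firing time of '0 */N * * *' strictly after `now`."""
    # Start from the top of the current hour, then step forward
    # one hour at a time until the hour is a multiple of step_hours.
    candidate = now.replace(minute=0, second=0, microsecond=0)
    candidate += timedelta(hours=1)
    while candidate.hour % step_hours != 0:
        candidate += timedelta(hours=1)
    return candidate


if __name__ == "__main__":
    print(next_run(datetime.now()))
```

For example, a call at 10:15 yields 12:00 the same day, and a call at 23:30 yields 00:00 the next day.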
