A Playwright-based web scraper that collects internship listings from Internshala, written in Python. Scraped data is stored in a CSV file.
This project is for educational purposes only. I am not responsible for any misuse of this code.
- Store data in a CSV file
- Store data in a Google Sheet (optional)
- Keeps the Google Sheet data synced using GitHub Actions (optional)
Make sure you have the following dependencies installed:
- Python 3.x
- Playwright library for Python
You can install them with the following command:
pip install playwright && playwright install chromium
- Clone the repository
- Install the dependencies using
pip install -r requirements.txt
- Run the script using
python main.py
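The CSV storage step can be sketched roughly as follows. The field names (`title`, `company`, `stipend`, `location`) are assumptions for illustration, not necessarily the scraper's actual schema:

```python
import csv

# Hypothetical rows as the scraper might collect them (field names are assumed)
internships = [
    {"title": "Data Science Intern", "company": "Acme", "stipend": "10000", "location": "Remote"},
    {"title": "Web Dev Intern", "company": "Globex", "stipend": "8000", "location": "Delhi"},
]

def save_to_csv(rows, path="internships.csv"):
    # Write a header row followed by one row per internship
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

save_to_csv(internships)
```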
Optional steps (for Google Sheets mode only):
- Create a new Google Sheet
- Create a new project in Google Cloud Platform
- Follow this guide for setting up the Google Sheets API
- Download the JSON key file and add all the credentials to the `.env` file (refer to `.env.example`)
- Get the Google Sheet ID from the URL, e.g. https://docs.google.com/spreadsheets/d/GOOGLE_SHEET_ID/edit
- Add `GOOGLE_SHEET_ID` to the `.env` file
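Once `GOOGLE_SHEET_ID` is in the `.env` file, the script can rebuild the sheet URL from the environment. A minimal sketch, assuming the variable name above; the loading helper is illustrative (the real script would typically load `.env` via python-dotenv):

```python
import os

# Normally loaded from the .env file (e.g. via python-dotenv); set here for illustration
os.environ.setdefault("GOOGLE_SHEET_ID", "GOOGLE_SHEET_ID")

def sheet_url():
    # Build the spreadsheet URL from the ID stored in the environment
    sheet_id = os.environ["GOOGLE_SHEET_ID"]
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/edit"

print(sheet_url())
```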
Optional steps (for syncing Google Sheets using GitHub Actions):
- GitHub Actions are already set up in the repository
- Install the GitHub CLI, or add the secrets from the `.env` file to the repository manually
- With the GitHub CLI, run:
gh secret set -R <your-username/your-repo> -f .env
- `--headful`: Run the script in non-headless mode (show the browser): python main.py --headful
- `--gs`: Run the script in Google Sheets mode (store data in Google Sheets): python main.py --gs
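The two flags above can be parsed with argparse; a minimal sketch of how `main.py` might wire them up (the actual script may differ):

```python
import argparse

def parse_args(argv=None):
    # Both flags are optional booleans, matching the modes described above
    parser = argparse.ArgumentParser(description="Internshala scraper")
    parser.add_argument("--headful", action="store_true",
                        help="run in non-headless mode (show the browser)")
    parser.add_argument("--gs", action="store_true",
                        help="store data in Google Sheets as well")
    return parser.parse_args(argv)

args = parse_args(["--headful"])
print(args.headful, args.gs)  # → True False
```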