This repository contains a Python script and a GitHub Actions workflow that automate scraping data from multiple Apple App Store app pages. The workflow runs daily at a predefined time (09:00 UTC) or can be triggered manually, and it collects information such as:
- App ranking in the Medical category.
- App star rating.
- Total number of ratings.
The scraped data is appended to a CSV file stored in a separate branch (`data-branch`) to keep the main branch clean.
- Scrapes data from multiple App Store app pages.
- Runs automatically every day at 09:00 UTC.
- Stores data in a CSV file (`apps_ranking.csv`) in the `data-branch`.
- Easily extendable to scrape additional apps by editing the workflow.
- The GitHub Actions workflow is configured to run daily using a cron job (`cron: '0 9 * * *'`) or can be manually triggered from the "Actions" tab in the repository.
- For each app URL defined in the workflow, the script scrapes the app's ranking, star rating, and total number of ratings (see the sketch after this list).
- The scraped data is then appended to the `apps_ranking.csv` file in the `data-branch`.
- Each job runs in parallel for faster scraping of multiple apps.
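A minimal sketch of what that scraping step can look like, assuming `requests` and `beautifulsoup4` are available; the actual parsing logic in `run_scraper.py` may differ:

```python
# Hypothetical sketch: fetch an App Store page and pull the fields this
# workflow collects. Selector details are assumptions, not the repo's API.
import json
import re

import requests
from bs4 import BeautifulSoup

def scrape_app(app_url: str) -> dict:
    html = requests.get(app_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # App Store pages typically embed app metadata as JSON-LD,
    # including the aggregate star rating and rating count.
    rating, rating_count = None, None
    ld_tag = soup.find("script", type="application/ld+json")
    if ld_tag and ld_tag.string:
        data = json.loads(ld_tag.string)
        agg = data.get("aggregateRating", {})
        rating = agg.get("ratingValue")
        rating_count = agg.get("reviewCount")

    # The category ranking appears in the page as text like "#3 in Medical".
    match = re.search(r"#\d+ in Medical", html)
    ranking = match.group(0) if match else None

    return {"app_url": app_url, "ranking": ranking,
            "rating": rating, "rating_count": rating_count}
```

Parsing the embedded JSON-LD tends to be more robust than scraping the rendered markup, since page layout changes more often than the metadata block.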
If you want to test the scraper locally before running it on GitHub Actions, you can follow these steps:
- Python 3.x installed on your local machine.
- Clone the repository:
```bash
git clone https://github.com/your-username/your-repo.git
cd your-repo
```
- Install the required dependencies: the dependencies are listed in the `requirements.txt` file. To install them, run:
```bash
pip install -r requirements.txt
```
- Run the scraper: you can run the scraper for a specific app URL by executing the `run_scraper.py` script and passing the app URL as a parameter:
```bash
python run_scraper.py --app_url "https://apps.apple.com/us/app/google/id284815942"
```
- Check the output: the scraped data will be appended to `apps_ranking.csv`, which will be created in the local directory if it doesn't already exist.
- Python 3.x
- GitHub repository
The workflow is defined in `.github/workflows/scraper.yml` and will scrape data from the app URLs defined in the `APP_URLS` environment variable.
To modify the list of apps being scraped, add or remove app URLs in the list, one URL per line:
```yaml
APP_URLS: |
  https://apps.apple.com/us/app/google/id284815942
```
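The block scalar (`|`) keeps the value as one multi-line string, so the script has to split it into individual URLs. A hedged sketch of how that can be done (the actual parsing in `run_scraper.py` may differ):

```python
import os

# APP_URLS arrives as one multi-line string: one App Store URL per line.
app_urls = [
    line.strip()
    for line in os.environ.get("APP_URLS", "").splitlines()
    if line.strip()
]
```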
There are two ways to run the scraping workflow:
- Automatic Daily Runs: The workflow will run automatically every day at 09:00 UTC based on the cron schedule.
- Manual Trigger: You can manually trigger the workflow via the "Actions" tab in the GitHub repository:
- Go to the Actions tab.
- Select the "Daily Scraper" workflow.
- Click on the "Run workflow" button to start the scraper immediately.
The Python script `run_scraper.py` is designed to take an `--app_url` argument, which is passed by the GitHub Actions workflow. The script scrapes the app's ranking, star rating, and total number of ratings and appends them to the CSV file.
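A minimal sketch of what the script's entry point can look like; the argument and function names here are assumptions, not the repository's confirmed API:

```python
# Hypothetical sketch of the CLI surface described above.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Scrape ranking and rating data for one App Store app.")
    parser.add_argument("--app_url", required=True,
                        help="URL of the App Store page to scrape")
    args = parser.parse_args()
    # Hand off to the scraping logic (see the scrape_app sketch above).
    print(f"Would scrape: {args.app_url}")

if __name__ == "__main__":
    main()
```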
Feel free to modify the scraping logic or add additional data points to be extracted as needed.
- The scraped data is stored in a CSV file (`apps_ranking.csv`) in the `data-branch`.
- Each row in the CSV contains the following columns:
- Timestamp: The date and time of the scraping run.
- App URL: The URL of the scraped app.
- Ranking: The app's ranking in its category.
- Star Rating: The app's star rating.
- Total Number of Ratings: The total number of user ratings.
All scraped data is committed to the `data-branch` to keep the main branch clean. You can access the `data-branch` directly or fetch the CSV file from there.
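For reference, here is a minimal sketch of the append step using the column layout listed above; the implementation in `run_scraper.py` may differ, and the sample values are purely illustrative:

```python
import csv
import os
from datetime import datetime, timezone

FIELDS = ["timestamp", "app_url", "ranking", "star_rating", "total_ratings"]

def append_row(path: str, row: dict) -> None:
    # Create the file with a header on first use, then append rows.
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

append_row("apps_ranking.csv", {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "app_url": "https://apps.apple.com/us/app/google/id284815942",
    "ranking": "#3 in Medical",   # illustrative value
    "star_rating": 4.5,           # illustrative value
    "total_ratings": 12345,       # illustrative value
})
```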