Web scraping outdoorsy events in the Washington, DC metro area for an events calendar on Capital Nature.
Capital Nature is a 501(c)(3) nonprofit organization dedicated to bringing nature into the lives of DC area residents and visitors. To that end, they maintain an events calendar listing all of the area's great nature events.
For each event source identified by Capital Nature, we use Python (3.6.x) to web-scrape events from their websites. As part of the scrape, we transform the data to fit our schema. The scrapers output three separate spreadsheets (csv) that the Capital Nature team then uploads to their WordPress website.
You can run the scrapers three different ways:
- Locally
  - csv reports are written to `./data`, `./logs`, and `./reports`
- Locally, with Docker
  - good for local testing using an environment that mimics AWS Lambda, with the option to write results locally
- In AWS
  - csv reports are written to S3
Regardless of how you choose to get going, install Git Large File Storage before proceeding. We use it to help manage the Lambda deployment and dependencies, which are ~50 MB.
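If you haven't used Git LFS before, setup typically looks like this once the `git-lfs` package is installed (the commands below assume a standard install):

```bash
# One-time setup: hook Git LFS into your Git configuration
git lfs install

# Inside the repository, fetch any LFS-tracked files
git lfs pull
```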
- Navigate into the repository:
cd capital-nature-ingest
- Create and activate a virtual environment, then install the dependencies:
python -m venv env
source env/bin/activate # env\Scripts\activate if you're on Windows
pip install -r requirements.txt
- Set environment variables:
Before getting the events, you'll need to have a National Park Service (NPS) API key and an Eventbrite API key.
- Get one for NPS here
- Get one for Eventbrite here. For the Eventbrite token, we've found it helpful to follow the instructions in this blog post: after signing up, open the top-right dropdown, click Account Settings > Developer Links > API Keys, then click Create API Key (or go to this link).
Once you've got your tokens, add them as environment variables called `NPS_KEY` and `EVENTBRITE_TOKEN`, or simply run the script and input them when prompted.
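For example, on macOS or Linux you could export them in your shell before running the scrapers (the values below are placeholders):

```bash
# Placeholder values -- substitute your own keys
export NPS_KEY="your-nps-api-key"
export EVENTBRITE_TOKEN="your-eventbrite-token"
```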
- Run the scrapers:
python get_events.py
Read this about the script's output.
The Docker environment mimics that of AWS Lambda and is ideal for local testing. Since some of this project's dependencies have to be compiled specifically for AWS, you'll need to run a build script before building the Docker image and running the container.
- From the root of the repository, build the project:
build.sh
That command zips the essential components of this project and then combines them with Amazon-Linux-2-compatible versions of numpy and pandas from `layer/aws-lambda-py3.6-pandas-numpy.zip`.
- Build the image:
docker build -t scrapers ./lambda-releases
- Run the container:
docker run --rm -e EVENTBRITE_TOKEN=$EVENTBRITE_TOKEN -e NPS_KEY=$NPS_KEY scrapers
If you want to write the results locally, you can mount any combination of the `/data`, `/logs`, and `/reports` directories to your local filesystem:
docker run --rm -e EVENTBRITE_TOKEN=$EVENTBRITE_TOKEN -e NPS_KEY=$NPS_KEY -v `pwd`:/var/task/data -v `pwd`:/var/task/logs -v `pwd`:/var/task/reports scrapers
The above will write the three data spreadsheets, all of the log files for broken scrapers, and a results spreadsheet to your current working directory.
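If you'd prefer each kind of output in its own host folder, you could mount the three directories separately instead (the host paths below are just an example):

```bash
mkdir -p data logs reports
docker run --rm \
  -e EVENTBRITE_TOKEN=$EVENTBRITE_TOKEN \
  -e NPS_KEY=$NPS_KEY \
  -v "$(pwd)/data:/var/task/data" \
  -v "$(pwd)/logs:/var/task/logs" \
  -v "$(pwd)/reports:/var/task/reports" \
  scrapers
```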
Follow the instructions here to install and configure the AWS Cloud Development Kit (CDK). We're currently using version `1.18.0`.
Note that you'll need to install Node.js as part of this step if you don't already have it.
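If you want to match the version we use, one way to install the CDK CLI (assuming Node.js and npm are already installed) is:

```bash
npm install -g aws-cdk@1.18.0
```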
If you're new to AWS, you can give your Capital Nature user account the following AWS-managed IAM policies (one way to attach them with the AWS CLI is shown after the list):
- AWSLambdaFullAccess
- AmazonS3FullAccess
- CloudWatchFullAccess
- AWSCloudFormationFullAccess
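For example, assuming a user named `capital-nature` (a placeholder) and an AWS CLI session with sufficient permissions:

```bash
# Attach each AWS-managed policy to the user (user name is a placeholder)
for policy in AWSLambdaFullAccess AmazonS3FullAccess CloudWatchFullAccess AWSCloudFormationFullAccess; do
  aws iam attach-user-policy \
    --user-name capital-nature \
    --policy-arn "arn:aws:iam::aws:policy/${policy}"
done
```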
Finally, in order to use the CDK, you must specify your account's credentials and AWS Region. There are multiple ways to do this, but the following examples use the `--profile` option with `cdk` commands. This means our credentials and region are specified in `~/.aws/config`.
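For example, a named profile might look something like this (the profile name, region, and keys below are placeholders; many setups keep the access keys in `~/.aws/credentials` rather than `~/.aws/config`):

```ini
# ~/.aws/config
[profile capital-nature]
region = us-east-1
output = json

# ~/.aws/credentials
[capital-nature]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```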
If you haven't run this command before (i.e., during the Docker setup), then you need to zip up the relevant components of this project for deployment to AWS Lambda:
build.sh
If you haven't done so already, activate a virtual environment and install the dependencies:
python -m venv .env
source .env/bin/activate
pip install -r requirements.txt
First, run the following command to verify your CDK version:
cdk --version
This should show `1.18.0`.
Now we can deploy/redeploy the app:
cdk deploy --profile <your profile name>
After that command has finished, the resources specified in `app.py` have been deployed to the AWS account you configured with the CDK. You can now log into your AWS Console and take a look at the newly created resources.
You can optionally see the CloudFormation template generated by the CDK. To do so, run `cdk synth`, then check the output file in the `cdk.out/` directory.
You can destroy the AWS resources created by this app with `cdk destroy --profile <your profile name>`. Note that we've given the S3 bucket a `removalPolicy` of `cdk.RemovalPolicy.DESTROY` so that it isn't orphaned at the end of this process (you can read more about that here).
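For reference, here's a minimal sketch of how such a removal policy is attached to a bucket in a CDK v1 Python stack; the stack, construct, and bucket names are illustrative and not necessarily what `app.py` uses:

```python
from aws_cdk import aws_s3 as s3
from aws_cdk import core as cdk


class ExampleStack(cdk.Stack):
    """Illustrative stack: a bucket that is removed on `cdk destroy`."""

    def __init__(self, scope: cdk.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # DESTROY deletes the (empty) bucket when the stack is destroyed,
        # instead of the default RETAIN, so no orphaned bucket is left behind.
        s3.Bucket(
            self,
            "EventsBucket",
            removal_policy=cdk.RemovalPolicy.DESTROY,
        )


app = cdk.App()
ExampleStack(app, "capital-nature-example")
app.synth()
```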
After running the scrapers locally, you'll have three csv files in a new `data/` directory (unless you used the Docker approach, in which case they'll be wherever you chose to mount the volume):
- `cap-nature-events-<date>.csv` (all of the events)
- `cap-nature-organizers-<date>.csv` (a list of the event sources, which builds off the previous list each successive time you run this)
- `cap-nature-venues-<date>.csv` (a list of the event venues, which builds off the previous list each successive time you run this)
These files are used by the Capital Nature team to update their website.
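As a quick sanity check on the output, you could load one of these files with pandas (the filename below is illustrative; use the actual date from your run):

```python
import pandas as pd

# Illustrative filename -- substitute the date from your own run
events = pd.read_csv("data/cap-nature-events-2020-01-01.csv")

print(events.shape)   # number of scraped events and columns
print(events.head())  # peek at the first few rows
```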
Two other directories are also created in the process:
- `/reports`
- `/logs`
The `/reports` directory holds spreadsheets that summarize the results of `get_events.py`. There's a row for each event source and columns with data on the number of events scraped, the number of errors encountered, and the event source's status (e.g. "operational") given the presence of errors and/or data. A single report is generated each time you run `get_events.py` and includes the date in the filename to let you connect it to the data files in the `/data` directory. Because of this, if you run `get_events.py` more than once in one day, the previous report is overwritten.
The `/logs` directory naturally contains the logs for each run of `get_events.py`. These files include tracebacks and are useful for developers who want to debug errors. A log file gets generated for each event source that raises errors. The date is included in the filename, but running `get_events.py` more than once in one day will overwrite the day's previous file.
To track bugs, request new features, or just submit interesting ideas, we use GitHub issues.
If you'd like to lend a hand, hop on over to our Issues to see what event sources still need scraping. If you see one that you'd like to tackle, assign yourself to that issue and/or leave a comment saying so. NOTE: You need to join the DataKindDC GitHub organization in order to assign yourself. If you don't want to join the organization, then just leave a comment and one of us will assign you. Doing this will let others know that you're working on that event source and that they shouldn't duplicate your efforts.
Once you've found something you want to work on, please read our contributing guideline for details on how to contribute using git and GitHub.