Olesya Altunina edited this page Nov 30, 2022 · 10 revisions

This repo contains packages used to run the AWS data pipeline (see README) for the Crowdbreaks project.

Repo structure

The repo is organized as follows:

  • .github/workflows contains GitHub Actions workflows for deploying to ECS/Lambda/Sagemaker.
  • awstools package contains most of the functions that use AWS SDK for Python + 'global' configs. These helpers are then used throughout the rest of the repo, including Lambda functions, streamer package and Sagemaker tools.
  • AWS Lambda functions
    • Streamer
      • lambda-es-rotation is used for rotating Elasticsearch indices. It is triggered by an AWS EventBridge cron event crowdbreaks-es-monthly-rotation.
      • lambda-s3-to-es is used to preprocess raw data to fit Elasticsearch schema, including retrieving geo info & predictions using existing Sagemaker endpoints.
      • lambda-streamer-management is used to manage the streamer status on the Crowdbreaks website.
    • Auto MTurking on the Crowdbreaks website
      • lambda-sample-for-annotations is used for creating random samples of the recent data for annotation.
      • lambda-subsample-annotations is used for creating a small evaluation subsample of the annotation results.
  • streamer package is used for streaming data from the Twitter API v1.1 (v2 is also an option) filtering endpoint and sending it to AWS Kinesis Firehose.
  • Dockerfile is used to build streamer for ECS. If moved to another folder, please change .github/workflows/aws-*.yml files to build from there.

Lambda triggers

Currently, Lambda triggers are not set automatically (except the S3 triggers when a config is updated). If the functions are recreated from scratch (for example, by deleting all functions and running a 'push-create-lambda' workflow), make sure to set the corresponding triggers in the AWS console.
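As an alternative to the console, a scheduled trigger such as the one for lambda-es-rotation could be attached from a script. The following is a minimal sketch, not code from this repo. It assumes the two clients passed in are boto3 clients (boto3.client("events") and boto3.client("lambda")); the statement ID and target ID are placeholders, and the schedule expression is left to the caller because the actual schedule is not documented here.

```python
def attach_schedule_trigger(events_client, lambda_client, function_name,
                            function_arn, rule_name, schedule_expression):
    """Create or update an EventBridge rule and point it at a Lambda function.

    events_client and lambda_client are expected to be boto3 clients,
    e.g. boto3.client("events") and boto3.client("lambda").
    """
    # Create or update the scheduled rule; the response carries its ARN.
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule_expression,
        State="ENABLED",
    )["RuleArn"]

    # Allow EventBridge to invoke the function from this rule.
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=f"{rule_name}-invoke",  # placeholder statement ID
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    # Register the function as the rule's target.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": function_arn}],  # placeholder target ID
    )
    return rule_arn
```

For the rotation function, rule_name would be crowdbreaks-es-monthly-rotation, and a monthly schedule could be written as cron(0 0 1 * ? *) (an assumed value, the first day of each month at 00:00 UTC). Note that add_permission fails if a statement with the same ID already exists on the function, so the sketch is meant for freshly recreated functions.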

How the secrets are stored

There are 3 sources of secrets: AWS, Elasticsearch and Twitter.

The secrets are stored in four different places:

Elasticsearch

Elasticsearch is served through https://www.elastic.co; the credentials are stored in 1Password. Make sure that the clusters are not overflowing (especially crowdbreaks-stg, since it has very little memory). Delete older indices if the used storage is getting close to the limit.
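Deciding which indices to delete can be sketched as a small helper that drops the oldest indices first until usage is back under a threshold. This is only an illustration: the 80% threshold, the tuple layout and the index names in the usage example are assumptions, not values taken from the clusters.

```python
from datetime import date


def pick_indices_to_delete(indices, capacity_bytes, max_usage=0.8):
    """Return names of the oldest indices to delete so that total usage
    drops to at most max_usage * capacity_bytes.

    indices: list of (name, created, size_bytes) tuples.
    """
    limit = max_usage * capacity_bytes
    used = sum(size for _, _, size in indices)
    to_delete = []
    # Walk from oldest to newest, dropping indices until under the limit.
    for name, _, size in sorted(indices, key=lambda item: item[1]):
        if used <= limit:
            break
        to_delete.append(name)
        used -= size
    return to_delete


# Hypothetical index names and sizes, for illustration only.
example = [
    ("project_a_2022-01", date(2022, 1, 1), 40),
    ("project_a_2022-03", date(2022, 3, 1), 30),
    ("project_a_2022-02", date(2022, 2, 1), 30),
]
print(pick_indices_to_delete(example, capacity_bytes=100))
```

The helper only selects candidates; the actual deletion would still be done in the Elastic console or through the Elasticsearch API after reviewing the list.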

Billing

Elasticsearch usage and billing are shown in the AWS Marketplace account or on https://www.elastic.co; they are not shown in AWS Cost Explorer.

How to launch streamer for Twitter API v2

In case Twitter v1 gets deprecated, here is how to launch streamer for Twitter API v2 (just for storage, not for anything else yet).

  1. Open the Dockerfile in the root folder of the repo.
  2. Change CMD run-stream -> CMD run-stream-v2 and save.
  3. Run Actions -> Deploy to Amazon ECS (Production) (aws-prd.yml) -> Run workflow -> Branch: main.
  4. Restart streamer using the website or AWS ECS.
    • You can check the logs of the ECS task to make sure that the correct version is running: the first log should contain the version.
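Step 2 above can also be scripted. The sketch below rewrites the CMD line of a Dockerfile given as text; it assumes the command appears on a line starting with CMD, in either shell or JSON-array form, since the exact form used in the repo's Dockerfile is not shown here.

```python
import re


def switch_to_v2(dockerfile_text):
    """Replace run-stream with run-stream-v2 on CMD lines only."""
    # The (?!-) lookahead leaves an existing run-stream-v2 untouched.
    pattern = re.compile(r"^(CMD\b.*?)\brun-stream\b(?!-)", re.MULTILINE)
    return pattern.sub(r"\1run-stream-v2", dockerfile_text)


print(switch_to_v2("CMD run-stream"))
```

To apply it to the file, read the Dockerfile with pathlib.Path("Dockerfile").read_text(), pass the text through switch_to_v2, and write the result back with write_text. Running it a second time leaves the file unchanged.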

To make sure that the streams are running, either check the CrowdbreaksStreaming dashboard on AWS CloudWatch, or check that S3 is up to date for active streams.

Also make sure that the right app is connected to the Crowdbreaks project on the Twitter Developer Portal. The bearer token will not work if the app is not linked to the project.