
Pandemic-Knowledge

Pandemic Knowledge logo

Code style: black

A fully-featured multi-source data pipeline for continuously extracting knowledge from COVID-19 data.

  • Contamination figures
  • Vaccination figures
  • Death figures
  • COVID-19-related news (Google News, Twitter)

What you can achieve

Screenshots from the repository:

  • Live contaminations map + latest news (live contamination and vaccination world map)
  • Last 7 days news (latest news, live!)
  • France 3-weeks live map, Kibana Canvas (France live status)
  • Live vaccinations map (world vaccination map)

Context

This project was carried out over 4 days as part of an MSc hackathon at ETNA, a French computer science school.

The goal was both to prototype a big data pipeline and to contribute to an open-source project.

Install

Below, you'll find the procedure to process COVID-related files and news into the Pandemic Knowledge database (Elasticsearch).

Each process is scheduled to run every 24 hours, so updated files are re-processed and the latest news is retrieved automatically.
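
For illustration, here is a minimal sketch of what such a scheduled flow could look like with Prefect (the flow, task and project names are hypothetical, not the project's actual code):

from datetime import timedelta

from prefect import Flow, task
from prefect.schedules import IntervalSchedule


@task
def inject_latest_data():
    # Placeholder for the real "delete + inject" logic described below.
    print("Fetching and indexing the latest COVID-19 data...")


# Re-run the flow every 24 hours.
schedule = IntervalSchedule(interval=timedelta(hours=24))

with Flow("inject-covid-data", schedule=schedule) as flow:
    inject_latest_data()

if __name__ == "__main__":
    # Register against the Prefect server so the agents can pick it up
    # (project name is an assumption).
    flow.register(project_name="pandemic-knowledge")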

Env file

Running this project on your local computer? Just copy the .env.example file:

cp .env.example .env

Open this .env file and edit password-related variables.

Initialize elasticsearch

Raise your host's vm.max_map_count limit so Elasticsearch can handle high I/O:

sudo sysctl -w vm.max_map_count=500000

Then:

docker-compose -f create-certs.yml run --rm create_certs
docker-compose up -d es01 es02 es03 kibana
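
Once the cluster is up, you can check that it answers over TLS. A minimal sketch, assuming the default elastic user and the password from your .env file (verify_certs=False because the certificates are self-signed):

from elasticsearch import Elasticsearch

es = Elasticsearch(
    ["https://localhost:9200"],
    http_auth=("elastic", "your-elastic-password"),  # value from your .env
    verify_certs=False,  # self-signed certificates generated by create-certs.yml
)
print(es.cluster.health()["status"])  # expect "green" or "yellow"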

Initialize Prefect

Create a ~/.prefect/config.toml file with the following content:

# debug mode
debug = true

# base configuration directory (typically you won't change this!)
home_dir = "~/.prefect"

backend = "server"

[server]
host = "http://172.17.0.1"
port = "4200"
host_port = "4200"
endpoint = "${server.host}:${server.port}"

Run Prefect:

docker-compose up -d prefect_postgres prefect_hasura prefect_graphql prefect_towel prefect_apollo prefect_ui

We need to create a tenant. Execute on your host:

pip3 install prefect
prefect backend server
prefect server create-tenant --name default --slug default

Access the web UI at localhost:8081

Run Prefect workers

Agents are services that run your scheduled flows.

  1. Open and optionally edit the agent/config.toml file.

  2. Let's instantiate 3 workers:

docker-compose -f agent/docker-compose.yml up -d --build --scale agent=3 agent

ℹ️ You can run the agents on a machine other than the one hosting the Prefect server. Edit the agent/config.toml file accordingly.

COVID-19 data

Injection scripts are scheduled in Prefect so that data is automatically refreshed with the latest figures and news (delete + inject).

There are several data sources supported by Pandemic Knowledge.

  1. Start MinIO and import your files into the buckets mentioned above.

    For Our World In Data, create the contamination-owid bucket and import the CSV file inside.

    docker-compose up -d minio

    MinIO is available at localhost:9000

  2. Download the dependencies and start the injection service of your choice (a sketch of what such a service does is shown after this list). For instance:

    pip3 install -r ./flow/requirements.txt
    docker-compose -f insert.docker-compose.yml up --build insert_owid
  3. In Kibana, create an index pattern contamination_owid_*

  4. Once injected, we recommend adjusting the number of replicas in the Kibana Dev Tools console:

    PUT /contamination_owid_*/_settings
    {
        "index" : {
            "number_of_replicas" : "2"
        }
    }
  5. Start making your dashboards in Kibana!
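
As referenced in step 2, here is a minimal sketch of what an injection service boils down to: read the CSV from MinIO and bulk-index it into Elasticsearch. The bucket name and index pattern follow the README; the credentials, file name and exact index name are assumptions:

import csv
import io

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from minio import Minio

minio_client = Minio(
    "localhost:9000",
    access_key="minio-access-key",  # from your .env
    secret_key="minio-secret-key",  # from your .env
    secure=False,
)

es = Elasticsearch(
    ["https://localhost:9200"],
    http_auth=("elastic", "your-elastic-password"),
    verify_certs=False,
)

# Download the CSV previously uploaded to the contamination-owid bucket
# (the file name is an assumption).
obj = minio_client.get_object("contamination-owid", "owid-covid-data.csv")
rows = csv.DictReader(io.StringIO(obj.read().decode("utf-8")))

# Index each CSV line as one document; the exact index name behind the
# contamination_owid_* pattern is an assumption.
bulk(es, ({"_index": "contamination_owid_2021", "_source": row} for row in rows))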

News data

There are two sources for news:

  • Google News (elasticsearch index: news_googlenews)
  • Twitter (elasticsearch index: news_tweets)
  1. Run the Google News crawler and/or the Twitter crawler (a sketch of what a crawler does is shown after this list):
docker-compose -f crawl.docker-compose.yml up --build crawl_google_news # and/or crawl_tweets
  2. In Kibana, create a news_* index pattern

  3. Edit the index pattern fields:

Name | Type   | Format
img  | string | Url (Type: Image, with an empty URL template)
link | string | Url
  4. Create your visualisations
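
As referenced in step 1, a minimal sketch of what the Google News crawler amounts to: fetch COVID-19 headlines and index them into news_googlenews with the img and link fields used above. The feed URL and document fields are assumptions, not the project's exact code:

import feedparser
from elasticsearch import Elasticsearch

es = Elasticsearch(
    ["https://localhost:9200"],
    http_auth=("elastic", "your-elastic-password"),
    verify_certs=False,
)

# Fetch COVID-19 headlines (feed URL is an assumption).
feed = feedparser.parse("https://news.google.com/rss/search?q=covid-19")

for entry in feed.entries:
    doc = {
        "title": entry.title,
        "link": entry.link,            # rendered with the Url format in Kibana
        "date": entry.get("published"),
        "img": None,                   # filled when a thumbnail is available
    }
    es.index(index="news_googlenews", body=doc)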

News web app

Browse through the news with our web application.


  1. Make sure you've accepted the self-signed certificate of Elasticsearch at https://localhost:9200

  2. Start up the app:

    docker-compose -f news_app/docker-compose.yml up --build -d
  3. Discover the app at localhost:8080
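
Behind the scenes, the app queries Elasticsearch directly. A minimal sketch of the kind of query it could run to list the latest news across both indices (the date and title field names are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch(
    ["https://localhost:9200"],
    http_auth=("elastic", "your-elastic-password"),
    verify_certs=False,
)

# Latest news of the past 7 days across news_googlenews and news_tweets.
response = es.search(
    index="news_*",
    body={
        "size": 20,
        "sort": [{"date": {"order": "desc"}}],
        "query": {"range": {"date": {"gte": "now-7d/d"}}},
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_source"].get("title"), hit["_source"].get("link"))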


TODOs

Possible improvements:

  • Use Dask to parallelize the processing of CSV lines in batches of 1000 (see the sketch below)
  • Remove indices only when the source processing succeeds (add the new index, then remove the old one)
  • Remove indices only when crawling succeeds (add the new index, then remove the old one)
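
As mentioned in the first item, a minimal sketch of the Dask idea, assuming lines can be indexed batch by batch (function and file names are hypothetical):

import dask.bag as db


def index_batch(lines):
    # Placeholder: parse and bulk-index one batch of CSV lines.
    return len(lines)


def split_batches(partition, size=1000):
    # Cut one Dask partition into fixed-size batches of lines.
    partition = list(partition)
    return [partition[i:i + size] for i in range(0, len(partition), size)]


# Split the file into partitions that Dask processes in parallel,
# then cut each partition into batches of 1000 lines.
lines = db.read_text("owid-covid-data.csv", blocksize="1MiB")
batches = lines.map_partitions(split_batches)
processed = batches.map(index_batch).sum().compute()
print(processed, "lines processed")
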
Useful commands

To stop everything:

docker-compose down
docker-compose -f agent/docker-compose.yml down
docker-compose -f insert.docker-compose.yml down
docker-compose -f crawl.docker-compose.yml down

To start each service, step by step:

docker-compose up -d es01 es02 es03 kibana
docker-compose up -d minio
docker-compose up -d prefect_postgres prefect_hasura prefect_graphql prefect_towel prefect_apollo prefect_ui
docker-compose -f agent/docker-compose.yml up -d --build --scale agent=3 agent
