Welcome to our coding challenge for data engineers! In this challenge, you'll be given a dataset containing information about a fictional digital health company. Your task is to extract the necessary data from this dataset, transform it, and load it into a PostgreSQL or Microsoft SQL Server database running in a Docker container.
To complete this challenge, follow the steps below:
- Import this repo to your personal GitHub account (see https://github.com/new/import)
- Check out the `develop` branch
- Create a virtual environment and activate it (use `pipenv` or `conda`)
- Write Python code to extract, transform, and load the data into a PostgreSQL or Microsoft SQL Server database running in a Docker container. Choose the database you are more comfortable with.
- The goal is to create a database schema that Data Analysts can use to analyze the data
- You can use any Python packages or Python-based workflow management tool (e.g., Airflow, Prefect) of your choice to complete this task. If you use a workflow management tool, please add it to the docker-compose file.
- Optional: write tests for your code to ensure it works correctly
- Analyze the number of cases per patient, the average age, and the most common ICPC codes
- Add the packages you used to the `requirements.txt` or `environment.yml` file
- Commit your changes and push them to your imported repository
- Create a pull request into the `main` branch
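The analysis step above (cases per patient, average age, most common ICPC codes) could be sketched with pandas. The tiny in-memory frames below are illustrative stand-ins, not the real dataset:

```python
import pandas as pd

# Tiny stand-ins for the patients and cases tables (illustrative values only).
patients = pd.DataFrame({
    "patient_id": [1, 2],
    "patient_date_of_birth": pd.to_datetime(["1980-05-17", "1990-01-01"]),
})
cases = pd.DataFrame({
    "case_id": [10, 11, 12],
    "patient_id": [1, 1, 2],
    "icpc_codes": ["A01 R05", "R05", "A01"],
})

# Number of cases per patient.
cases_per_patient = cases.groupby("patient_id")["case_id"].count()

# Average age in years (simple day-count approximation).
age_years = (pd.Timestamp("today") - patients["patient_date_of_birth"]).dt.days / 365.25
average_age = age_years.mean()

# Most common ICPC codes: split the whitespace-separated field first.
icpc_counts = cases["icpc_codes"].str.split().explode().value_counts()
```

The same aggregations could equally be expressed as SQL views in the database so that Data Analysts can query them directly.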
The first dataset contains information about the fictional digital health company's patients. The data is stored in a CSV file called `patients.csv`, located in the `data` directory.

The file has the following columns:

- `patient_id`: The unique ID of the patient
- `patient_name`: The name of the patient
- `patient_email`: The email of the patient
- `patient_phone`: The phone number of the patient
- `patient_address`: The address of the patient
- `patient_city`: The city of the patient
- `patient_state`: The state of the patient
- `patient_zip`: The zip code of the patient
- `patient_country`: The country of the patient
- `patient_date_of_birth`: The date of birth of the patient
- `updated_at`: The datetime the patient was last updated
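Extracting this file could look like the following sketch, which uses pandas and assumes the column names listed above. An in-memory sample stands in for `data/patients.csv` so the snippet is self-contained:

```python
import io
import pandas as pd

# Sample rows standing in for data/patients.csv (illustrative values, not real data).
sample = io.StringIO(
    "patient_id,patient_name,patient_email,patient_phone,patient_address,"
    "patient_city,patient_state,patient_zip,patient_country,"
    "patient_date_of_birth,updated_at\n"
    "1,Jane Doe,jane@example.com,555-0100,1 Main St,Springfield,IL,62701,US,"
    "1980-05-17,2023-01-02T10:00:00\n"
)

# In the challenge you would pass "data/patients.csv" instead of the buffer.
patients = pd.read_csv(sample, parse_dates=["patient_date_of_birth", "updated_at"])

# Derive an age column, useful for the average-age analysis later on.
patients["age"] = (pd.Timestamp("today") - patients["patient_date_of_birth"]).dt.days // 365
```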
The second dataset contains information about the company's cases. The data is stored in an NDJSON file (one JSON object per line) called `cases.ndjson`, located in the `data` directory.

Each record has the following fields:

- `case_id`: The unique ID of the case
- `case_type`: The type of case: 1 for triage, 2 for non-medical, 3 for medical
- `patient_id`: The unique ID of the patient
- `case_datetime`: The datetime the case was created
- `case_closed`: Whether the case is closed or not
- `case_closed_datetime`: The datetime the case was closed
- `case_closed_reason`: The reason the case was closed
- `icpc_codes`: The ICPC codes associated with the case, whitespace-separated
- `updated_at`: The datetime the case was last updated
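A possible extraction sketch for the NDJSON file, again with pandas and an in-memory sample standing in for `data/cases.ndjson`. Normalizing the whitespace-separated `icpc_codes` field into one row per code is one reasonable transform for an analyst-friendly schema:

```python
import io
import pandas as pd

# A sample line standing in for data/cases.ndjson (illustrative values, not real data).
sample = io.StringIO(
    '{"case_id": 1, "case_type": 3, "patient_id": 1, '
    '"case_datetime": "2023-01-01T09:00:00", "case_closed": true, '
    '"case_closed_datetime": "2023-01-01T10:00:00", '
    '"case_closed_reason": "resolved", "icpc_codes": "A01 R05", '
    '"updated_at": "2023-01-02T10:00:00"}\n'
)

# NDJSON means one JSON object per line, hence lines=True.
cases = pd.read_json(sample, lines=True)

# Explode the whitespace-separated ICPC codes into a link table
# (one row per case/code pair), which is easier to aggregate in SQL.
case_icpc = (
    cases[["case_id", "icpc_codes"]]
    .assign(icpc_code=cases["icpc_codes"].str.split())
    .explode("icpc_code")
    .drop(columns="icpc_codes")
)
```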
Your code should satisfy the following requirements:
- Use the provided `docker-compose.yml` file to spin up either a PostgreSQL or a Microsoft SQL Server container
- Create a database schema to store the data
- Extract the necessary data from the cases and patients files
- Transform the data to fit the schema
- Load the data into the database

Optional:

- Write tests for your code
- Create a Dockerfile to build a Docker image containing your code and integrate it in the `docker-compose.yml` file
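The load step could be sketched with pandas and SQLAlchemy. For the Postgres container defined in the compose file below, the connection URL would be `postgresql://db_user:db_password@localhost:5432/db_name`; this sketch substitutes an in-memory SQLite engine so it runs standalone:

```python
import pandas as pd
from sqlalchemy import create_engine

# For the challenge, point this at the container from docker-compose.yml, e.g.
#   create_engine("postgresql://db_user:db_password@localhost:5432/db_name")
# An in-memory SQLite engine is used here so the sketch runs without Docker.
engine = create_engine("sqlite://")

# Stand-in frame; in practice this is the transformed patients DataFrame.
patients = pd.DataFrame({"patient_id": [1], "patient_name": ["Jane Doe"]})

# Simple full-refresh load: replace the table on each run. Switch to
# if_exists="append" plus an upsert strategy for incremental loads.
patients.to_sql("patients", engine, if_exists="replace", index=False)

# Read back to verify the load.
loaded = pd.read_sql("SELECT * FROM patients", engine)
```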
Here's an example `docker-compose.yml` file that can be used to spin up either a PostgreSQL or a Microsoft SQL Server container:

```yaml
version: "3.9"
services:
  db:
    image: postgres:13-alpine  # Use "mcr.microsoft.com/mssql/server:2019-latest" for MSSQL
    restart: always
    environment:
      POSTGRES_USER: db_user          # For MSSQL, set ACCEPT_EULA: "Y" instead
      POSTGRES_PASSWORD: db_password  # For MSSQL, set SA_PASSWORD to your desired password
      POSTGRES_DB: db_name            # MSSQL has no equivalent; create the database via init.sql
    volumes:
      - ./init:/docker-entrypoint-initdb.d  # Use "./init:/var/opt/mssql/init" for MSSQL
    ports:
      - "5432:5432"  # Use "1433:1433" for MSSQL
```
To use this `docker-compose.yml` file, run `docker-compose up -d` in the same directory as the file. This will spin up a container running either PostgreSQL or Microsoft SQL Server, depending on which image is used.

Note that for Microsoft SQL Server, you will need to create an `init` directory with a SQL script that creates your database and any necessary tables. The script should be named `init.sql`. For PostgreSQL, this step is not necessary, as the `POSTGRES_DB` environment variable will automatically create a database with the specified name.

Once the container is running, you can connect to it using a database client such as `psql` or `sqlcmd`, depending on which database you chose.
When you're finished, create a PR to the main branch of this repository. We'll review your code and get back to you as soon as possible.