This Python application is an implementation of the requirements from ETL off a SQS Queue.

- Download Docker.
- In a terminal window, get the Docker image for Postgres by running the command: `docker pull fetchdocker/data-takehome-postgres`
- Get the Docker image for LocalStack by running the command: `docker pull fetchdocker/data-takehome-localstack`
- Clone the repository: `git clone https://github.com/dschott68/fetch.git`, then `cd fetch`
- Start Docker.
- Build the image: `docker build -t fetch-app .`
After the requirements are met and the application is built:
- Start Docker.
- In a terminal window, go to the directory where the repo was cloned. Start LocalStack and Postgres by running the command: `docker-compose up`
- In a new terminal window, go to the directory where the repo was cloned. Start the application by running the command: `docker run -p 4000:80 fetch-app`
FetchApp uses the boto3 API and a resource instance for an object-oriented interface. The implementation follows the AWS Sample Tutorial. The message body is converted from JSON to a Python dictionary and then yielded for processing. Finally, the message is deleted from the queue.
The message JSON is converted to a Python dictionary. By using a generator function that yields an individual dictionary for each message, no extra collection data structure is needed to store the messages. This is more memory-efficient for large datasets.
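A minimal sketch of this read-yield-delete pattern, assuming a LocalStack endpoint, region and queue name that are illustrative rather than taken from the repo:

```python
import json
import boto3

def messages(queue_name: str):
    # Endpoint URL and region are assumptions for local testing against LocalStack.
    sqs = boto3.resource("sqs", endpoint_url="http://localhost:4566", region_name="us-east-1")
    queue = sqs.get_queue_by_name(QueueName=queue_name)
    while True:
        batch = queue.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=5)
        if not batch:
            break
        for message in batch:
            yield json.loads(message.body)   # JSON body -> Python dictionary
            message.delete()                 # remove the processed message from the queue
```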
FetchApp uses a SHA-256 hash to mask the values. Duplicate values that are hashed will result in the same hash value, so they can be identified by looking for duplicate hash values. SHA-256 is widely used and secure. If a higher level of security is a requirement, SHA-512 or even AES 256-bit encryption could be used.
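For illustration, a masking helper along these lines (the sample values are made up) preserves duplicate detection:

```python
import hashlib

def mask(value: str) -> str:
    # Equal inputs always produce equal digests, so duplicates stay detectable after masking.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

assert mask("203.0.113.10") == mask("203.0.113.10")   # duplicates share a hash
assert mask("203.0.113.10") != mask("203.0.113.11")   # distinct values differ
```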
FetchApp uses the popular `psycopg2` module for connecting and writing to Postgres. The connection string is currently supplied in the `config.ini` file. For production, the user and password information should not be located in the `config.ini` file. This information should be obtained using a Security-team-approved mechanism for accessing securely stored user and password data. For production, both a connection pool and retry logic should be implemented. Batch inserts would likely be more performant than individual insert statements.
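As a sketch of the batch-insert idea (the table and column names are assumptions, and the connection parameters would come from configuration rather than being hard-coded):

```python
import psycopg2
from psycopg2.extras import execute_values

# Illustrative connection parameters; in the app these come from config.ini.
conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="postgres")

rows = [
    ("user-1", "hash-of-ip", "hash-of-device", "US", 20304, "ios"),       # made-up rows
    ("user-2", "hash-of-ip", "hash-of-device", "CA", 20305, "android"),
]

with conn, conn.cursor() as cur:
    # One round trip for the whole batch instead of one INSERT per row.
    execute_values(
        cur,
        "INSERT INTO user_logins (user_id, masked_ip, masked_device_id, locale, app_version, device_type) VALUES %s",
        rows,
    )
```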
For the assignment, building a Docker image and running it in Docker on localhost makes the most sense, as LocalStack and Postgres Docker images were supplied. The `Dockerfile` builds the image, and `docker run` runs the application.
Tutorial: Using a Lambda function to access an Amazon RDS database provides an excellent guide and a recommendation on where and how to run this application. Since the SQS Queue is already in AWS, it is wise to also run the application in AWS. AWS Lambda is a wise choice as it is serverless (saving the team from managing infrastructure) and autoscaling. If scale gets too large, then Amazon EKS is a possible path. For how the application will run, Building Lambda functions with Python provides a guide. For deployment, the company's standard CI/CD should be used. AWS CodeDeploy is one tool for handling deployment.
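If the Lambda route is taken, the entry point could look roughly like the sketch below; the handler and the `transform_and_write` helper are hypothetical names, not part of the current application:

```python
import json

def lambda_handler(event, context):
    # With an SQS trigger, Lambda delivers a batch of messages in event["Records"].
    for record in event["Records"]:
        message = json.loads(record["body"])   # the same JSON body the app parses today
        transform_and_write(message)           # hypothetical reuse of the existing ETL logic
```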
- Security for keys, user name, password is a must.
- A database connection pool.
- Asynchronous I/O for SQS queue read and database write.
- Threading.
- Mocks and unit tests.
- Integration tests.
- Stats for observability.
- Logging improvements. Remove PII data from logs.
- Exponential backoff retry logic for SQS Queue reads, database connections and database writes (see the sketch after this list).
- `boto3` uses the HTTP protocol to communicate with AWS; check that all calls have a timeout set to avoid infinite waits.
- Batch database inserts rather than individual insert statements.
- Transformation and validation rules for data prior to writing into the database.
- Write failed messages out to a separate storage location for easy access for investigations.
- Review and implement best practices for `boto3`, SQS, `psycopg2`, Postgres, AWS.
- Security review of hashing and PII data.
- DBA review.
- Load testing.
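A generic exponential-backoff wrapper, as mentioned in the retry item above, might look like the following sketch (the delays and attempt counts are arbitrary choices):

```python
import random
import time

def with_backoff(operation, max_attempts=5, base_delay=0.5):
    """Run operation(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # Sleep 0.5s, 1s, 2s, 4s, ... plus a little jitter before retrying.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))

# Usage (hypothetical): with_backoff(lambda: cursor.execute(insert_sql, row))
```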
Threading and asynchronous I/O are important. Since both the SQS queue wait/read and the database writes are I/O operations, this is the first area to look at for scale. Several Python modules can be looked at: `asyncio`, `aioboto3`, `aiodbc`.
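Purely as an illustration of the shape such a pipeline could take (the `read_messages` and `write_row` coroutines are hypothetical async wrappers around the SQS read and the Postgres write):

```python
import asyncio

async def pipeline():
    while True:
        messages = await read_messages()                            # hypothetical async SQS read
        if not messages:
            break
        await asyncio.gather(*(write_row(m) for m in messages))     # hypothetical concurrent writes

# asyncio.run(pipeline())
```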
Use of a database connection pool is standard and should be implemented even before there is a need for scale.
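With `psycopg2` this could be as simple as the following sketch (the pool bounds and connection string are placeholders, not the app's real configuration):

```python
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=1, maxconn=10,
                              dsn="host=localhost port=5432 dbname=postgres user=postgres password=postgres")

conn = pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
finally:
    pool.putconn(conn)   # return the connection to the pool instead of closing it
```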
The current implementation of a SHA-256 hash is not directly recoverable. If staying with a hash, the original data could be stored in an access-controlled storage location. An alternative to hashing is encryption. The secret key can be used to recover the PII data via decryption by authorized personnel.
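As an example of the encryption alternative, assuming the third-party `cryptography` package and a key held in a secrets manager:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, fetched from a secrets manager, never hard-coded
fernet = Fernet(key)

token = fernet.encrypt(b"203.0.113.10")   # ciphertext stored in place of the raw PII
original = fernet.decrypt(token)          # recoverable only by holders of the key
```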
The application shuts down after the queue is empty. In production it should keep waiting for future messages.
The table INSERT will fail if the data already exists in Postgres. Requirements should define how to handle this: as an update, a skip or a failure.
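If the choice were to treat duplicates as updates, a Postgres upsert along these lines could work; the table, columns and conflict target are assumptions and would need a matching unique constraint:

```python
upsert_sql = """
    INSERT INTO user_logins (user_id, masked_ip, locale, app_version)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (user_id) DO UPDATE
    SET masked_ip = EXCLUDED.masked_ip,
        locale = EXCLUDED.locale,
        app_version = EXCLUDED.app_version
"""
# To skip duplicates instead: ON CONFLICT (user_id) DO NOTHING
# cursor.execute(upsert_sql, row)
```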
The region field is sometimes empty. An appropriate default should be specified.
The length of the individual message fields is not checked to be sure it is not longer than the database column size. Length should be checked, and requirements should define what to do with data that is too long: truncate, fail, etc.
The `app_version` message field is a dot-separated string. The table's `app_version` column is an integer. No specification was given for how to handle this. With the current implementation, 2.3.4 becomes 20304. It should be decided whether the table column should stay as an integer, change to a string, or be split into separate major, minor and patch fields. If changed to a string, it is useful to make sure the string sorts in lexicographical order to aid in comparing older vs newer app versions.
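For illustration, one plausible reading of the integer conversion described above, alongside a zero-padded string form that sorts lexicographically (both are sketches, not the repo's code):

```python
def version_to_int(version: str) -> int:
    # One plausible reading of the described behavior: "2.3.4" -> 20304.
    major, minor, patch = (int(part) for part in version.split("."))
    return major * 10000 + minor * 100 + patch

def version_to_sortable_str(version: str) -> str:
    # Zero-padded so plain string comparison matches version order: "2.3.4" -> "0002.0003.0004".
    return ".".join(part.zfill(4) for part in version.split("."))
```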