Backend API FULFIL.IO Assignment

This is a Django backend API that reads data from a CSV file, delegates the computation to a Celery worker, and stores the results in PostgreSQL. You can then interact with that data: retrieve, add, delete, and update.
To solve this problem, we used `django`, `djangorestframework`, `celery` & `redis`, `django-signals`, `socketIO`, `PostgreSQL`, and `pandas`.
- `django`: one of the best Python web frameworks.
- `djangorestframework`: we need to build a small REST API, so Django REST Framework is a natural fit.
- `celery` & `redis`: perform asynchronous tasks, with Redis as the message broker.
- `django-signals`: handles the webhook configuration.
- `socketIO`: sends socket messages to the client. It is an alternative to SSE, which does not work correctly with Django.
- `PostgreSQL`: the database used to store our data.
- `pandas`: reads the input CSV file and deduplicates the data.
To run my backend solution, you must have `python`, `pip`, `redis-server`, and `PostgreSQL` installed on your system, and configure the Redis server and PostgreSQL to work with Django.
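As a minimal sketch of that configuration (the database name, user, and password below are placeholders, and the Celery settings assume Redis is running locally on its default port), the relevant `settings.py` entries could look like this:

```python
# settings.py -- illustrative excerpt; credentials and names are placeholders

# PostgreSQL as the Django database backend
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "backend_assignment",  # placeholder database name
        "USER": "postgres",            # placeholder user
        "PASSWORD": "change-me",       # placeholder password
        "HOST": "localhost",
        "PORT": "5432",
    }
}

# Celery uses the local Redis server as broker and result backend
CELERY_BROKER_URL = "redis://localhost:6379/0"
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"
```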
To clone my code, run the command below in the CLI:

```
git clone "https://github.com/adrienTchounkeu/backend_assignment_fulfil.git"
```
You can also download the project by clicking the link Backend_assignment_fulfil
After downloading the code, open the CLI in the root directory and execute the command:

```
pip install -r requirements.txt
```

NB: `requirements.txt` is a file that contains all the project dependencies.
After all the project dependencies are installed, run the command:

```
python manage.py runserver   # on Windows
python3 manage.py runserver  # on Linux
```
To run the Celery worker, run the command:

```
celery -A backend_assignment worker -l info --pool=solo
```
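The `-A backend_assignment` flag points the worker at the project's Celery application. As a sketch of the standard Celery/Django wiring (not necessarily the exact file in this repository), `backend_assignment/celery.py` typically looks like this:

```python
# backend_assignment/celery.py -- standard Celery/Django bootstrap
import os

from celery import Celery

# Tell Celery where the Django settings live
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "backend_assignment.settings")

app = Celery("backend_assignment")
# Read all CELERY_* options from the Django settings
app.config_from_object("django.conf:settings", namespace="CELERY")
# Discover tasks.py modules in the installed apps
app.autodiscover_tasks()
```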
NB: The server generally starts on port 8000.
The Backend API is available through the link https://backend-assignment-fulfil.herokuapp.com
- To deploy my application, two add-ons were needed: PostgreSQL and Redis. I therefore had to connect my Visa card to Heroku, because it is otherwise impossible to add more than one add-on.
- Due to some dyno (Heroku process) limitations, my backend is not working properly: some endpoints neither return the correct response nor perform the request. It works perfectly in the local environment, though.
NB: You will see many useless commits in the commit history, from when I was trying to figure out Heroku deployment errors.
- The backend communicates with the frontend app, written in Vue.js. You can access it through the link Frontend_assignment_fulfil.
Before starting to code, we have to understand the problem and think through the solution. We structured the project as follows:
- Choose a good tool for reading large CSV files: pandas, for instance.
- Create custom signals to dispatch when there is a manual create/update action (see the signal sketch after this list).
- After losing a lot of time trying to integrate SSE with Django, I finally chose SocketIO to send live stream events to the client.
- To avoid high performance costs in the app, use a worker to handle asynchronous tasks, plus a Redis server to act as the Celery broker and channel our socket events.
- A high-performance SQL database: PostgreSQL, for instance.
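A minimal sketch of such a custom signal (the signal name, the `Product` model, and the webhook helper are illustrative, not the repository's exact code):

```python
# signals.py -- sketch of a custom signal fired on manual create/update
import django.dispatch

# Custom signal carrying the affected record and the kind of change
record_changed = django.dispatch.Signal()

def notify_webhooks(sender, instance=None, action=None, **kwargs):
    """Receiver that would POST the change to the configured webhook URLs."""
    # e.g. requests.post(url, json={"action": action, "sku": instance.sku})
    ...

record_changed.connect(notify_webhooks)

# In a view, after a manual create or update:
# record_changed.send(sender=Product, instance=product, action="update")
```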
To solve the problem, we made the following hypothesis:

- The file is stored in order for the worker to process it efficiently.
To solve the problem, we use pandas `DataFrame` functions, workers, brokers, sockets, and signals:

- read large CSV files with `pd.read_csv` in chunks of 100,000 rows
- drop duplicates on `sku` in the DataFrames with `DataFrame.drop_duplicates`
- the `bulk_create` Django ORM function to store all the data at once
- Celery workers to perform asynchronous tasks, along with brokers
- sockets to send data-status event messages to the client
- signals to handle the webhook configuration
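A condensed sketch of that pipeline (the `Product` model, its fields, and the exact task signature are assumptions based on the description above, not the repository's code):

```python
# tasks.py -- sketch of the chunked CSV import described above
import pandas as pd
from celery import shared_task

from .models import Product  # assumed model with a unique `sku` field

@shared_task
def import_csv(path):
    seen = set()  # skus already inserted by earlier chunks
    # Read the large CSV in 100,000-row chunks to bound memory usage
    for chunk in pd.read_csv(path, chunksize=100_000):
        # Deduplicate on sku within the chunk, keeping the last occurrence
        chunk = chunk.drop_duplicates(subset="sku", keep="last")
        # Skip rows whose sku already appeared in a previous chunk
        chunk = chunk[~chunk["sku"].isin(seen)]
        seen.update(chunk["sku"])
        # One bulk INSERT per chunk instead of one query per row
        Product.objects.bulk_create(
            [Product(**row) for row in chunk.to_dict("records")]
        )
```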
No tests were written for the endpoints and functions.
Even though my code solves the problem, it has some performance and resource-usage issues. To optimize the solution, I would:

- implement parallelization to speed up reading CSV files (see the sketch after this list)
- use SSE to establish a unidirectional connection with the client, for speed and security reasons
- consider Flask along with SQLAlchemy, which, after a lot of research, seems to best fit the solution because it works smoothly with SSE
- regarding deployment, implement the solution on a well-configured server (Linux, for instance) rather than on an easy-deploy service with huge limitations
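To illustrate the parallelization idea (a sketch assuming each chunk can be processed independently; `process_chunk` is a hypothetical stand-in for the real per-chunk work):

```python
# Sketch: fan CSV chunks out to a process pool
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def process_chunk(chunk: pd.DataFrame) -> int:
    # Placeholder for the real work (deduplication, inserts, ...)
    return len(chunk.drop_duplicates(subset="sku"))

def parallel_import(path: str) -> int:
    # Each 100,000-row chunk is pickled and sent to a worker process
    with ProcessPoolExecutor() as pool:
        results = pool.map(process_chunk, pd.read_csv(path, chunksize=100_000))
        return sum(results)
```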
Assuming that we have files coming from multiple sources, we will encounter the following problems:
- performance issues while reading files
- storing huge amounts of data
- querying huge amounts of data
- computing huge amounts of data
To solve these problems, we should first create indexes on our database columns to optimize queries, use a server with plenty of memory and processing power, and finally use efficient tools to read and deduplicate the data; Dask should be tested because of its apparently proven performance.
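For instance, a minimal Dask version of the read-and-deduplicate step (shown only to illustrate the idea, with a hypothetical file name) would be:

```python
# Sketch: read and deduplicate a large CSV with Dask instead of pandas
import dask.dataframe as dd

df = dd.read_csv("products.csv")            # lazily partitions the file
deduped = df.drop_duplicates(subset="sku")  # builds a plan, runs nothing yet
result = deduped.compute()                  # executes the plan in parallel
```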