
Backend API FULFIL.IO Assignment


Backend API FULFIL.IO Assignment is a Django backend API that reads data from a CSV file, delegates its computation to a Celery worker, and stores the results in PostgreSQL. You can then interact with that data: retrieve, add, delete, and update.


To solve this problem, we used Django, Django REST Framework, Celery & Redis, Django signals, Socket.IO, PostgreSQL, and pandas.

  • django: one of the best Python web frameworks.
  • djangorestframework: the task is to build a small REST API, so Django REST Framework is well suited to the solution.
  • celery & redis: perform asynchronous tasks, with Redis as the broker.
  • django-signals: handles the webhook configuration.
  • socketIO: sends socket messages to the client. It is an alternative to SSE, which does not work correctly with Django.
  • PostgreSQL: the database used to store our data.
  • pandas: reads the input CSV file and deduplicates the data.

To run my backend solution, you must have Python, pip, redis-server, and PostgreSQL installed on your system, and configure the Redis server and PostgreSQL with Django.
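For reference, a minimal sketch of the relevant settings.py entries, assuming Redis and PostgreSQL run locally on their default ports; the database name, user, and password below are placeholders, not the project's actual values:

# settings.py -- placeholder credentials, adjust to your setup
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "fulfil_db",       # placeholder database name
        "USER": "postgres",
        "PASSWORD": "<password>",
        "HOST": "localhost",
        "PORT": "5432",
    }
}

CELERY_BROKER_URL = "redis://localhost:6379/0"  # Celery talks to the local Redis server
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"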

To clone my code, run the command below in the CLI

git clone "https://github.com/adrienTchounkeu/backend_assignment_fulfil.git"

You can also download the project by clicking the Backend_assignment_fulfil link.

After downloading the code, open the CLI in the root directory and execute the command:

pip install -r requirements.txt

NB: "requirements.txt is a file which contains all the project dependencies"

After all the project dependencies are installed, run the command

python manage.py runserver # on Windows

or

python3 manage.py runserver # on Linux

To run the Celery worker, run the command

celery -A backend_assignment worker -l info --pool=solo # to launch celery

NB: The server generally starts on port 8000
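The -A backend_assignment flag in the Celery command assumes the project package exposes a Celery app; a minimal sketch of what backend_assignment/celery.py conventionally contains:

# backend_assignment/celery.py -- standard Celery/Django wiring
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "backend_assignment.settings")

app = Celery("backend_assignment")
app.config_from_object("django.conf:settings", namespace="CELERY")  # read the CELERY_* settings
app.autodiscover_tasks()  # pick up tasks.py modules from the installed apps

With that in place, any function decorated with @shared_task in an installed app is picked up by the worker.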

The Backend API is available through the link https://backend-assignment-fulfil.herokuapp.com

  • To deploy my application, two add-ons were needed: PostgreSQL and Redis. I therefore had to link my Visa card to my Heroku account, because more than one add-on cannot be added otherwise.
  • Due to some dyno (Heroku process) limitations, my backend is not working properly: some endpoints neither return the correct response nor perform the request. It works perfectly in the local environment, though.

NB: You will see many useless commits in the commit history from when I was trying to figure out Heroku deployment errors.

Before starting to code, we have to understand the problem and think through the solution. We structured our project as follows:

  • Choose a good tool for reading large CSV files: pandas, for instance.
  • Create custom signals to dispatch when there is a manual create/update action (see the sketch after this list).
  • After losing a lot of time trying to integrate SSE with Django, I finally chose Socket.IO to send live stream events to the client.
  • To avoid a high performance cost in our app, we use a worker to handle asynchronous tasks and a Redis server to work alongside Celery and channel our socket events.
  • A high-performance SQL database: PostgreSQL, for instance.
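To illustrate the custom-signals idea, a hypothetical sketch of dispatching webhooks from such a signal; the signal name, receiver, and Webhook model are assumptions, not the project's actual code:

# signals.py -- hypothetical names throughout
import django.dispatch
import requests

from .models import Webhook  # assumed model storing registered webhook URLs

# custom signal fired on manual create/update actions
product_changed = django.dispatch.Signal()

def notify_webhooks(sender, instance, action, **kwargs):
    # POST the change to every registered webhook URL
    for hook in Webhook.objects.all():
        requests.post(hook.url, json={"sku": instance.sku, "action": action})

product_changed.connect(notify_webhooks)

A view performing a manual edit would then fire product_changed.send(sender=Product, instance=obj, action="update").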

To solve the problem, we made the following assumption:

  • The file is stored in order for the worker to process it efficiently.

To solve the problem, we use DataFrames and pandas (imported as pd) functions, workers, brokers, sockets, and signals; the import pipeline is sketched after the list:

  • read large CSV files with pd.read_csv in chunks (chunksize=100000)
  • drop duplicates on sku in the DataFrames with drop_duplicates
  • the Django ORM's bulk_create function to store all the data at once
  • Celery workers to perform asynchronous tasks, along with brokers
  • sockets to send data-status event messages to the client
  • signals to handle the webhook configuration
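Put together, a hedged sketch of that import pipeline; the Product model, its fields, and the task name are assumptions:

# tasks.py -- hypothetical module and model names
import pandas as pd
from celery import shared_task

from .models import Product  # assumed model with a unique sku field

@shared_task
def import_csv(path):
    # stream the CSV in 100,000-row chunks instead of loading it whole
    for chunk in pd.read_csv(path, chunksize=100000):
        chunk = chunk.drop_duplicates(subset="sku")  # dedupe on sku within the chunk
        # insert the whole chunk in one query
        Product.objects.bulk_create(
            [Product(**row) for row in chunk.to_dict("records")],
            ignore_conflicts=True,  # skip skus that are already stored
        )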

No tests were written for the endpoints and functions.

Even though my code solves the problem, it has some performance and resource-usage issues. To optimize my solution, I would:

  • implement parallelization to optimize reading CSV files
  • use SSE to establish a unidirectional connection with the client, for speed and security reasons
  • after a lot of research, consider Flask with SQLAlchemy, which may best fit the solution because it works smoothly with SSE
  • regarding deployment, implement the solution on a well-configured server (Linux, for instance) rather than an easy-to-deploy service with heavy limitations

Assuming that we have files coming from multiple sources, we will encounter the following problems:

  • performance issues while reading files
  • storing huge amounts of data
  • querying huge amounts of data
  • computing huge amounts of data

To solve these problems, we need to begin by creating indexes on our database columns to optimize queries, use a server with ample memory and processing power, and finally use efficient tools to read and deduplicate the data; Dask must be tested because of its apparently proven performance.
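As a rough illustration of that last point, a hypothetical Dask sketch; the file pattern and the sku column are assumptions:

# Dask reads and deduplicates many CSV files in parallel
import dask.dataframe as dd

df = dd.read_csv("products-*.csv")          # lazy, partitioned read across files
deduped = df.drop_duplicates(subset="sku")  # same API as pandas
result = deduped.compute()                  # run the parallel computation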
