bmat-report-isrc
is a Data Analysis REST API that reads multiple CSV files, which must be analyzed, cleaned, and processed to produce the Top 10k ISRCs.
To solve this problem, we used `python`, `flask`, `pandas`, and `numpy`.
- `python`: one of the best programming languages for data analysis purposes.
- `flask`: we need to build a small REST API. Even though Django is a great Python web framework, Flask is built for rapid development, provides API support, and has a lightweight codebase, so it fits this solution best.
- `pandas`: the input CSV files contain millions of lines, and pandas is among the best modules for data analysis thanks to its DataFrames.
- `numpy`: works along with pandas.
To run my solution, you must have `python` and `pip` installed on your system.
To clone my code, run the command below in the CLI:
git clone "https://github.com/adrienTchounkeu/bmat-report-isrcs.git"
You can also download the project by clicking the link bmat-report-isrcs
After downloading the code, open the CLI in the root directory and execute the command:
pip install -r requirements.txt
NB: "requirements.txt is a file which contains all the project dependencies"
Once all the project dependencies are installed, run the command
python app.py # on Windows
or
python3 app.py # on Linux
You can also run using the flask command
flask run
NB: The first method is preferred
Before diving into the solution, we must first download the CSV files, then inspect them to gather all the relevant information, and finally use that information to solve the problem. This is one of the most important steps; we can call it the Understanding Data step. The information gathered from the files (see the inspection sketch after this list):
- the files contain several million lines
- isrc files contain 1 column, `{isrc: string}`, whereas report files contain 5 columns, `[{date: string}, {isrc: string}, {title: string}, {artists: string}, {streams: long}]`
- beware of duplicate isrc values in the report files
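As a quick illustration of this Understanding Data step, the snippet below peeks at a few rows of each file type to confirm the columns and data types. The filenames are hypothetical placeholders for the downloaded files.

```python
import pandas as pd

# Peek at a handful of rows to confirm columns and dtypes
# (filenames are placeholders for the downloaded CSV files).
isrcs_sample = pd.read_csv("isrcs.csv", nrows=5)
report_sample = pd.read_csv("report_2020-11-10.csv", nrows=5)

print(isrcs_sample.dtypes)   # expected: isrc -> object (string)
print(report_sample.dtypes)  # expected: date, isrc, title, artists -> object; streams -> int64
print(report_sample.head())
```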
To solve the problem, we made some assumptions:
- the data in the files are well formatted
- the user can only enter days in [10, 11, 12, 13, 14]
- the number of plays of each isrc is the sum of all the streams of tracks with the same isrc (in case it appears on multiple lines of the report files)
- the first endpoint `/report/<date>` ingests the Top 10k ISRCs into a file `top10k_2020-11-{date}.csv` stored in the folder named `ingests`
- the second endpoint `/tracks` lists all the tracks of the ingested data. To filter by date and/or isrc, just call the endpoint with arguments: `/tracks?date={target_date}&isrc={target_isrc}` (a minimal sketch of both endpoints follows this list)
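The routes below are only a minimal Flask sketch of how these two endpoints could be wired up; the actual implementation lives in app.py, and the file naming, the single-report-per-date assumption, and the isrc-only filter are simplifications of the assumptions above.

```python
import glob
import os

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/report/<int:date>")
def report(date):
    # Simplified stand-in for the real processing: read one report file,
    # sum streams per ISRC, and keep the 10k most-streamed ISRCs.
    df = pd.read_csv(f"report_2020-11-{date}.csv")
    top10k = (df.groupby("isrc", as_index=False)["streams"].sum()
                .sort_values("streams", ascending=False)
                .head(10000))
    os.makedirs("ingests", exist_ok=True)
    top10k.to_csv(f"ingests/top10k_2020-11-{date}.csv", index=False)
    return jsonify(top10k.to_dict(orient="records"))

@app.route("/tracks")
def tracks():
    # Optional filter: /tracks?isrc={target_isrc}
    # (assumes at least one file has already been ingested)
    df = pd.concat(pd.read_csv(f) for f in glob.glob("ingests/*.csv"))
    isrc = request.args.get("isrc")
    if isrc:
        df = df[df["isrc"] == isrc]
    return jsonify(df.to_dict(orient="records"))

if __name__ == "__main__":
    app.run(port=5000)
```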
To solve the problem, we use DataFrames and the following `pandas` (imported as `pd`) functions, combined in the sketch after this list:
- read large CSV files with `pd.read_csv` in chunks (chunksize=1000000)
- merge DataFrames with `pd.merge`
- group a DataFrame with `DataFrame.groupby`, which returns a `DataFrameGroupBy` object
- sum a `DataFrameGroupBy` object with `DataFrameGroupBy.sum`
- concatenate DataFrames with `pd.concat`
- sort values with `DataFrame.sort_values`
- write to a CSV file with `DataFrame.to_csv`
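As an illustration of how these functions fit together, here is a sketch of the chunked Top 10k computation for a single day. The file names and chunk size are assumptions taken from the lists above; the real implementation in app.py may differ.

```python
import pandas as pd

CHUNK_SIZE = 1_000_000

# Read the ISRC whitelist (a single 'isrc' column).
isrcs = pd.read_csv("isrcs.csv")

# Read the report file in chunks, keep only known ISRCs,
# and pre-aggregate the streams per ISRC inside each chunk.
partial_sums = []
for chunk in pd.read_csv("report_2020-11-10.csv", chunksize=CHUNK_SIZE):
    chunk = pd.merge(chunk, isrcs, on="isrc", how="inner")
    partial_sums.append(chunk.groupby("isrc")["streams"].sum())

# Combine the per-chunk sums, aggregate again, and keep the Top 10k.
streams = pd.concat(partial_sums).groupby(level=0).sum()
top10k = streams.sort_values(ascending=False).head(10_000).reset_index()

# Persist the result in the ingests folder.
top10k.to_csv("ingests/top10k_2020-11-10.csv", index=False)
```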
Below are the results when calling my endpoints `/report/<date>` and `/tracks`:
- After running the server with the command `python app.py`, the server is available on port 5000, i.e. at `127.0.0.1:5000`
- To test the first endpoint, send a GET request to `127.0.0.1:5000/report/10` (see the request example after this list). You can track the progress on the server command line. In the end, a file named `ingests/top10k_2020-11-10.csv` is created in the `ingests` folder and the server returns all the information
- To test the second endpoint, send a GET request to `127.0.0.1:5000/tracks`. The server returns all the information
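As an example of exercising the endpoints from Python (curl or a browser works just as well), here is a small sketch using the `requests` library, assuming the server is running locally on port 5000:

```python
import requests

BASE_URL = "http://127.0.0.1:5000"

# Trigger the ingestion of the Top 10k ISRCs for 2020-11-10.
report = requests.get(f"{BASE_URL}/report/10")
print(report.status_code, len(report.json()), "rows returned")

# List the ingested tracks, optionally filtered by date and/or isrc.
tracks = requests.get(f"{BASE_URL}/tracks", params={"date": "2020-11-10"})
print(tracks.json()[:3])
```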
NB: You need a high-quality internet connection to speed up the file-download step
No unit tests have been written for the endpoints
Even though my code solves the problem, it has some performance and resource-usage issues. To optimize my solution, I would (see the sketch after this list):
- implement parallelization to optimize the reading of the CSV files
- resize DataFrames before memory-intensive computations
- manually free the memory of intermediate DataFrames that are no longer used
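The snippet below sketches two of these ideas, downcasting ("resizing") numeric columns and explicitly freeing DataFrames that are no longer needed; it is illustrative only and not part of the current code.

```python
import gc

import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the needed columns and downcast numerics before heavy computations."""
    df["streams"] = pd.to_numeric(df["streams"], downcast="unsigned")
    return df[["isrc", "streams"]]

chunk = pd.read_csv("report_2020-11-10.csv", nrows=1_000_000)
small = shrink(chunk)

# Manually free the original chunk once it is no longer needed.
del chunk
gc.collect()
```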
Assuming that we have files coming from more than one country and stream counts from more than one DSP, we will have major problems with:
- storing ingested data
- searching on huge amounts of data
- computing on huge amounts of data
To solve this problem, we need a near real-time search engine such as Elasticsearch. I would then store the ingested data in Elasticsearch and query it to retrieve the relevant information.
After installing Elasticsearch, my API can easily communicate with it through an Elasticsearch client written in Python. My REST API would then just perform the storing, querying, and retrieving functions.
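As a sketch of that idea, assuming a local Elasticsearch instance and the official `elasticsearch` Python client (8.x), storing and querying ingested tracks could look like this:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Store one ingested track per document (values here are made up).
es.index(index="ingested-tracks", document={
    "date": "2020-11-10",
    "isrc": "XX1234567890",
    "title": "Some Track",
    "artists": "Some Artist",
    "streams": 12345,
})

# Retrieve the most-streamed tracks for a given date.
result = es.search(
    index="ingested-tracks",
    query={"term": {"date": "2020-11-10"}},
    sort=[{"streams": {"order": "desc"}}],
    size=10,
)
for hit in result["hits"]["hits"]:
    print(hit["_source"]["isrc"], hit["_source"]["streams"])
```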