A simple example of loading data from CSV into SQLite DB, and then representing some data as charts.
git clone https://github.com/org-not-included/simple_analytics_pipeline
cd simple_analytics_pipeline
source setup.sh
A data engineer is responsible for Extracting (reading), Transforming, and Loading (saving) data between different locations in your environment. Primarily loading data into/out of your database/data warehouse.
As part of their skillset, they need to understand:
- Languages:
Python
(coding)SQL
(querying syntax)
- Tools:
Pandas
(parse/transform)SQLite
(store)Matplotlib
(visualize)
- Use python3 sqlite to initialize a local db connection:
import sqlite3
conn = sqlite3.connect(‘your_database_name’)
1/2. (Extract/Transform) Use pandas (io) to convert the csv into a python object (dictionary):
dataframe = pandas.read_csv(filename)
The reason to convert it into a dict is because you can then use pandas to load a DataFrame into the database your_database_name
.
Read a csv file and use pandas to read/convert it: Create a dataframe which is a pandas object (python dict with extra functionality) of the data that is read from csv.
csv -> Pandas (io read) -> python dict -> Pandas (io write) -> sqlite table with table name
- (Load) Save the Pandas DataFrame into your database, in a new table called
sample_table
.
dataframe.to_sql(‘sample_table’, conn)
Use Dataframe to write into your connection. Result is a table with the name of your choice stored in your_database_name
specified by the connection.
- (Optional) Visualize the Result
def create_chart(sql, conn, x, y):
# Fetch query results as pandas dataframe (dictionary object)
dataframe = pd.read_sql_query(sql, con=conn)
# pprint(dataframe)
dataframe.plot(x=x, y=y, kind="bar")
plt.show()