If there are more tutorials, I am eager to put them all as notebooks in the root rather than in subfolders (each with its own env). The reason is that each of these notebooks should work with a standard set of requirements. If they don't, that should force us to update some of our packages (if there is a good reason why a notebook needs a different environment, then we move it to a subfolder).
If setting up on the Analytical Platform, it is worth going through how to set up an environment for the JupyterLab deployment.
In a terminal (make sure you are in the folder that you cloned this repo into):
```bash
python -m venv env
source env/bin/activate
python -m pip install pip==20.3.3
python -m pip install -r platform-requirements.txt
python -m ipykernel install --user --name="mojap-aws-tools-demo" --display-name="mojap-aws-tools-demo"
```
Select one of the notebooks and then change the kernel to `mojap-aws-tools-demo`. If it doesn't show straight away, give Jupyter a few minutes to discover the new kernel; it should appear.
Otherwise...
```bash
python -m venv env
source env/bin/activate
pip install --upgrade pip
```
Finally, do one of the following:

```bash
pip install -r requirements.txt
```
The former is what I ran to set up my env at the time of writing and the latter is my frozen env, just in case something changes or you want to recreate my setup exactly.
simple_database_creation_manipulation.ipynb
Get some data, create a database, and add that data as a table in the database. Then get some more data and add it to the existing table.
A great place to start if you don't know what Athena, Glue and S3 are, or how they fit together with our tools. Note that what we run through here is the easiest way to put things together (it is not as airtight as using some of the other tools we use), but it will get you most of the way there, so don't worry about it.
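
If you just want a feel for the pattern before opening the notebook, the sketch below shows the general idea using awswrangler. The bucket, database and table names are made up for illustration, and exact keyword arguments may vary between awswrangler versions.

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Create a Glue database to hold the table (errors if it already exists)
wr.catalog.create_database("my_demo_db")

# Write the dataframe to S3 as a parquet dataset and register it as a Glue
# table in one call; mode="append" means a later call with more data just
# adds it to the existing table
wr.s3.to_parquet(
    df=df,
    path="s3://my-demo-bucket/my_demo_db/my_table/",
    dataset=True,
    database="my_demo_db",
    table="my_table",
    mode="append",
)

# Query the table back with Athena
result = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_demo_db")
```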
creating_and_maintaining_database_tables_in_athena.ipynb
How to create new databases, add tables to a database using SQL queries, and add more data to these tables (all using Athena SQL).
Uses this set of tools to maintain a database that receives regular deltas of data extracts. We demonstrate how to create a table of deltas, followed by another set of tables that report on what the deltas looked like at a given date.
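
As a rough, hedged sketch of the Athena SQL side of this (the notebook itself may run these statements via pydbtools or another helper), the snippet below uses awswrangler to submit the queries. The database, table and bucket names are placeholders, and the `wait` argument assumes a reasonably recent awswrangler version.

```python
import awswrangler as wr

# Create a database for the delta extracts
wr.athena.start_query_execution("CREATE DATABASE IF NOT EXISTS deltas_db", wait=True)

# Register an external table over delta files already sitting in S3
wr.athena.start_query_execution(
    """
    CREATE EXTERNAL TABLE IF NOT EXISTS deltas_db.deltas (
        id INT,
        value STRING,
        extract_date DATE
    )
    STORED AS PARQUET
    LOCATION 's3://my-demo-bucket/deltas_db/deltas/'
    """,
    wait=True,
)

# Build a report table from the deltas as they looked at a given date (CTAS)
wr.athena.start_query_execution(
    """
    CREATE TABLE deltas_db.report_snapshot AS
    SELECT * FROM deltas_db.deltas
    WHERE extract_date <= DATE '2021-01-01'
    """,
    wait=True,
)
```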
data_conformance_and_dbs.ipynb
This is a big, in-depth one, probably more for data engineers or people wanting to make sure their metadata is exact.
Uses awswrangler, mojap-metadata and arrow-pd-parser to ensure conformance to and from data of different formats. It also demonstrates how to write data to/from S3 using arrow-pd-parser, pyarrow and/or awswrangler, and how to create Glue schemas to query that data using awswrangler and/or the glue converter from mojap-metadata.
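
A hedged sketch of that round trip is below. The metadata file, bucket and database names are invented, and the exact signatures of `reader.read`, `writer.write` and `GlueConverter.generate_from_meta` may differ between versions of these packages.

```python
import boto3
from arrow_pd_parser import reader, writer
from mojap_metadata import Metadata
from mojap_metadata.converters.glue_converter import GlueConverter

# Load a table schema written in the mojap-metadata format
meta = Metadata.from_json("metadata/my_table.json")

# Read a local CSV, casting columns so they conform to the metadata
df = reader.read("data/my_table.csv", metadata=meta)

# Write the conformed data out to S3 as parquet
writer.write(df, "s3://my-demo-bucket/my_db/my_table/data.parquet", metadata=meta)

# Generate a Glue table definition from the same metadata and register it
gc = GlueConverter()
spec = gc.generate_from_meta(
    meta,
    database_name="my_db",
    table_location="s3://my-demo-bucket/my_db/my_table/",
)
boto3.client("glue").create_table(**spec)
```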
How to use our tools to migrate etl-manager schemas to our Metadata schemas.
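
Roughly, the migration looks like the sketch below, assuming the etl-manager converter that ships with mojap-metadata; the module path, method names and file paths here are assumptions and may differ from what the notebook actually uses.

```python
import json
from mojap_metadata.converters.etl_manager_converter import EtlManagerConverter

# Load an existing etl-manager table schema
with open("etl_manager_schemas/my_table.json") as f:
    etl_schema = json.load(f)

# Convert it into a mojap-metadata Metadata object and save it in the new format
emc = EtlManagerConverter()
meta = emc.generate_to_meta(etl_schema)
meta.to_json("metadata/my_table.json")
```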
A simple demo of how to create temporary Athena tables using pydbtools.
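
A minimal sketch of the idea, assuming pydbtools' temporary-table helpers (the source database and query below are placeholders):

```python
import pydbtools as pydb

# Create a temporary table from a query; it lands in the __temp__ database
pydb.create_temp_table(
    "SELECT id, value FROM my_db.my_table WHERE value IS NOT NULL",
    table_name="clean_rows",
)

# Query the temporary table like any other Athena table
df = pydb.read_sql_query("SELECT COUNT(*) AS n FROM __temp__.clean_rows")
```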
A simple demo of how to use SQL templating in pydbtools.
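
A minimal sketch of the templating idea, assuming pydbtools' Jinja-based helpers; the function name is to the best of my knowledge and the table/parameter values are placeholders:

```python
import pydbtools as pydb

template = "SELECT * FROM {{ db }}.{{ table }} WHERE extract_date = DATE '{{ run_date }}'"

# Fill in the template parameters, then run the rendered SQL against Athena
sql = pydb.render_sql_template(
    template, {"db": "my_db", "table": "deltas", "run_date": "2021-01-01"}
)
df = pydb.read_sql_query(sql)
```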