Data Science Scripts

This is a collection of data-science exercises, gathered from various articles and videos, that I have used for learning.

___Mileage may vary___

Usage

I set up an arguably complicated Makefile structure, since I have so many projects in here.

Here is a demonstration of how it works for the related building_ml_powered_apps project: Demo

Projects and Explanations

Here are some of the included projects:

Installations

gum

gum is a tool for making good-looking, interactive shell scripts. I use it in some of the scripts in here.

I have included the installation for my system below:

sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://repo.charm.sh/apt/gpg.key | sudo gpg --dearmor -o /etc/apt/keyrings/charm.gpg
echo "deb [signed-by=/etc/apt/keyrings/charm.gpg] https://repo.charm.sh/apt/ * *" | sudo tee /etc/apt/sources.list.d/charm.list
sudo apt update && sudo apt install gum

Problematic Installations

While creating some of the projects / examples, I ran into issues with my recent version of Python (3.11). For instance, TensorFlow will not be happy with you, which is probably a good reason to start learning PyTorch.

Anyway, opinions aside (this is programming, right??), here are some notes on libraries that had trouble with my setup.

umap-learn

To install umap-learn, I needed to get a version of LLVM working.

In my attempt, it looks like the library (more precisely, the llvmlite build it pulls in) is still too old to work with Python 3.11 =(

$ poetry add umap-learn

FileNotFoundError: [Errno 2] No such file or directory: 'llvm-config'
RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config

$ apt-cache search "llvm-.*-dev" | grep -v ocaml | sort
llvm-11-dev - Modular compiler and toolchain technologies, libraries and headers

$ sudo apt install llvm-11-dev

$ ls /usr/bin/llvm-config*
/usr/bin/llvm-config-11

$ LLVM_CONFIG=/usr/bin/llvm-config-11 poetry add umap-learn
RuntimeError: Building llvmlite requires LLVM 10.0.x or 9.0.x, got '11.1.0'. Be sure to set LLVM_CONFIG to the right executable path.
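
The last error comes down to a version mismatch: the llvmlite build accepts only LLVM 9.0.x or 10.0.x, while the system package provides 11.1.0. Here is a minimal Python sketch of that check, assuming the accepted series are exactly the two named in the error message. The function `llvm_is_supported` and the constant `SUPPORTED_SERIES` are made up for illustration and are not part of llvmlite.

```python
# Hypothetical helper, not llvmlite's own code: it only mirrors the check
# implied by the error message above.
SUPPORTED_SERIES = ("9.0", "10.0")  # assumption: copied from the error text


def llvm_is_supported(version: str, supported=SUPPORTED_SERIES) -> bool:
    """Return True if `version` (e.g. '10.0.1') falls in a supported major.minor series."""
    parts = version.strip().split(".")
    if len(parts) < 2:
        return False
    series = ".".join(parts[:2])
    return series in supported


if __name__ == "__main__":
    for candidate in ("9.0.1", "10.0.0", "11.1.0"):
        print(candidate, llvm_is_supported(candidate))
```

In practice the version string would come from running `llvm-config --version` on whatever executable `LLVM_CONFIG` points to.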

Resources

TODO: Add this to the Makefile: line count of files in a given directory

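# -print0 / -0 keep paths with spaces intact; cat | wc -l gives a single total even for many files
find ./quantitative_finance_algorithmic_trading -name '*.py' -print0 | xargs -0 cat | wc -l
find ./machine_learning_and_deep_learning_bootcamp -name '*.py' -print0 | xargs -0 cat | wc -l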
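
Until this lands in the Makefile, the same count can be sketched in Python. This is only an illustration: the function name `count_py_lines` is made up here, and it counts newline characters, which is what `wc -l` reports.

```python
from pathlib import Path


def count_py_lines(directory: str) -> int:
    """Total number of newline characters across all *.py files under `directory`."""
    total = 0
    for path in Path(directory).rglob("*.py"):
        if path.is_file():
            # Count newline bytes, matching what `wc -l` reports.
            total += path.read_bytes().count(b"\n")
    return total


if __name__ == "__main__":
    for project in (
        "./quantitative_finance_algorithmic_trading",
        "./machine_learning_and_deep_learning_bootcamp",
    ):
        print(project, count_py_lines(project))
```

Reading bytes rather than text avoids decoding errors on files with unusual encodings.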

Here is a pandas cookbook that would be a good source of examples to add to this repo.

Another interesting thing to check out is the Python profiler py-spy, which lets you profile and debug any running Python program, even if it is serving production traffic.

Face recognition API in python

Deep learning lessons from Fast AI

Interesting transaction manager called pyWave

Tool called Oxen for versioned datasets

An interesting Hacker News article about the challenges people have with AI in development scenarios. It sounds like the frustrations come mostly from deployment / infrastructure and rapid prototyping.

We serve our models with FastAPI, containerize them, and then deploy them to our GKE clusters. Depending on the model, we choose different machines - some require GPUs, most are decent on CPU. We take models up or down based on usage, so we have cold starts unless otherwise specified by customers. We expose access to the model via a POST call through our cloud app. We track inputs and outputs, as we expect that people will become interested in fine tuning models based on their past usage.
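
The "track inputs and outputs" part of that setup can be sketched without any web framework. Below is a minimal, standard-library-only illustration, not the commenter's actual code: `TrackedModel`, its `predict` method, `load_records`, and the JSON Lines log format are all assumptions made for the example. In a setup like theirs, a call like this would sit behind the POST handler.

```python
import json
import time
from typing import Any, Callable, List


class TrackedModel:
    """Wrap a prediction function and record every input/output pair."""

    def __init__(self, predict_fn: Callable[[Any], Any], log_path: str):
        self._predict_fn = predict_fn
        self._log_path = log_path

    def predict(self, payload: Any) -> Any:
        result = self._predict_fn(payload)
        record = {"timestamp": time.time(), "input": payload, "output": result}
        # Append one JSON object per line so the log can later be replayed,
        # for example as a fine-tuning dataset.
        with open(self._log_path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
        return result


def load_records(log_path: str) -> List[dict]:
    """Read back all logged input/output pairs."""
    with open(log_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

The payloads and results must be JSON-serializable for this to work; a real service would likely write to a database or message queue instead of a local file.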

For the original "davinci" models (now 3 generations behind if you count Instruct, ChatGPT, and upcoming DV"), OpenAI recommends "Aim for at least ~500 examples" as a starting point for fine-tuning

How do you monitor and debug models?

When your engineering team builds great tooling, monitoring and debugging get much easier. Stitch Fix has built an internal tool that takes in a modeling pipeline and creates a Docker container, validates arguments and return types, exposes the inference pipeline as an API, deploys it on our infrastructure, and builds a dashboard on top of it. This tooling allows data scientists to directly fix any errors that happen during or after deployment.

How do you deploy new model versions?

In addition, data scientists run experiments by using a custom-built A/B testing service that allows them to define granular parameters. They then analyze test results, and if they are deemed conclusive by the team, they deploy the new version themselves.

When it comes to deployment, we use a system similar to canary development where we start by deploying the new version to one instance and progressively update instances while monitoring performance. Data scientists have access to a dashboard that shows the number of instances under each version and continuous performance metrics as the deployment progresses.
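
The progressive rollout described above can be sketched as a small simulation. This is an illustration under stated assumptions, not Stitch Fix's system: the function `canary_rollout`, the `health_check` callback, and the doubling batch size are all invented for the example.

```python
from typing import Callable, List


def canary_rollout(
    instances: List[str],
    new_version: str,
    health_check: Callable[[List[str]], bool],
) -> List[str]:
    """Progressively move instances to `new_version`, rolling back on failure.

    `instances` holds the version currently running on each instance.
    `health_check` receives the candidate fleet state and returns True while
    performance is acceptable (a hypothetical stand-in for real monitoring).
    """
    original = list(instances)
    current = list(instances)
    updated = 0
    batch = 1  # start with a single canary instance
    while updated < len(current):
        end = min(updated + batch, len(current))
        for i in range(updated, end):
            current[i] = new_version
        if not health_check(current):
            return original  # roll everything back to the previous versions
        updated = end
        batch *= 2  # assumption: widen the rollout after each healthy step
    return current
```

For example, `canary_rollout(["v1"] * 5, "v2", lambda fleet: True)` updates one instance, then two, then the remaining two, checking health after each step.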