GitHub - transiteration/scaling-ml: A GitHub repository showcasing the implementation of AI scaling techniques and integration with MLflow for streamlined experiment tracking and management in machine learning workflows.

Scaling-ML

Scaling-ML is a project that classifies news headlines into 10 categories. Its core is fine-tuning a BERT model, supported by MLflow for experiment tracking, Ray for scaling and distributed computing, and MLOps components for managing machine learning workflows.

Set Up

Clone the repository:

git clone https://github.com/your-username/scaling-ml.git
cd scaling-ml

With your virtual environment activated, add the repository root to PYTHONPATH and install the dependencies:

export PYTHONPATH=$PYTHONPATH:$PWD
pip install -r requirements.txt

Scripts Overview

scripts
├── app.py
├── config.py
├── data.py
├── evaluate.py
├── model.py
├── predict.py
├── train.py
├── tune.py
└── utils.py

app.py - Implementation of FastAPI web service for serving a model.
config.py - Configuration of logging settings, directory structures, and MLflow registry.
data.py - Functions and a class for data preprocessing tasks in a scalable machine learning project.
evaluate.py - Evaluating the performance of a model, calculating precision, recall and F1 score.
model.py - Language model for fine-tuning, extended with a fully connected layer for classification tasks.
predict.py - TorchPredictor class for making predictions using a PyTorch-based model.
train.py - Training process using Ray for distributed training.
tune.py - Hyperparameter tuning for Language Model using Ray Tune.
utils.py - Various utility functions for handling data, setting random seeds, saving and loading dictionaries, etc.

Dataset

For training, a small portion of the News Category Dataset was used. The dataset contains headlines and short descriptions of news articles.

How to Train

export DATASET_LOC="path/to/dataset"
export TRAIN_LOOP_CONFIG='{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}'
python3 scripts/train.py \
--experiment_name "llm_train" \
--dataset_loc $DATASET_LOC \
--train_loop_config "$TRAIN_LOOP_CONFIG" \
--num_workers 1 \
--cpu_per_worker 1 \
--gpu_per_worker 0 \
--num_epochs 1 \
--batch_size 128 \
--results_fp results.json

experiment_name: A name for the experiment or run, in this case "llm_train".
dataset_loc: The location of the training dataset, replace with the actual path.
train_loop_config: The configuration for the training loop, replace with the actual configuration.
num_workers: The number of workers used for parallel processing. Adjust based on available CPU resources.
cpu_per_worker: The number of CPU cores assigned to each worker. Adjust based on available CPU resources.
gpu_per_worker: The number of GPUs assigned to each worker. Adjust based on available GPU resources.
num_epochs: The number of training epochs.
batch_size: The batch size used during training.
results_fp: The file path to save the results.
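
As a rough illustration of how the flags above could be consumed, here is a minimal argparse sketch. It is not taken from scripts/train.py: the function name parse_train_args and the default values are assumptions made for this example. The point it shows is that --train_loop_config arrives as a JSON string and can be decoded into a dictionary at parse time.

```python
import argparse
import json


def parse_train_args(argv=None):
    """Parse the training flags listed above (hypothetical parser, not the repo's)."""
    parser = argparse.ArgumentParser(description="Training CLI sketch")
    parser.add_argument("--experiment_name", type=str, required=True)
    parser.add_argument("--dataset_loc", type=str, required=True)
    # json.loads converts the quoted JSON string into a dict while parsing.
    parser.add_argument("--train_loop_config", type=json.loads, default={})
    # Defaults below are assumptions that mirror the example command.
    parser.add_argument("--num_workers", type=int, default=1)
    parser.add_argument("--cpu_per_worker", type=int, default=1)
    parser.add_argument("--gpu_per_worker", type=int, default=0)
    parser.add_argument("--num_epochs", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--results_fp", type=str, default=None)
    return parser.parse_args(argv)


args = parse_train_args([
    "--experiment_name", "llm_train",
    "--dataset_loc", "path/to/dataset",
    "--train_loop_config",
    '{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}',
])
print(args.experiment_name, args.train_loop_config["lr_patience"])
```

Decoding the JSON in the parser keeps the training code free of string handling, and a malformed config fails early with a usage error instead of midway through a run.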

How to Tune

export DATASET_LOC="path/to/dataset"
export INITIAL_PARAMS='{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}'
python3 scripts/tune.py \
--experiment_name "llm_tune" \
--dataset_loc "$DATASET_LOC" \
--initial_params "$INITIAL_PARAMS" \
--num_workers 1 \
--cpu_per_worker 1 \
--gpu_per_worker 0 \
--num_runs 1 \
--grace_period 1 \
--num_epochs 1 \
--batch_size 128 \
--results_fp results.json

num_runs: The number of tuning runs to perform.
grace_period: The grace period for early stopping during hyperparameter tuning.

Note: adjust the values of the --num_workers, --cpu_per_worker, and --gpu_per_worker parameters above according to the resources available on your system.

Experiment Tracking with MLflow

mlflow server -h 0.0.0.0 -p 8080 --backend-store-uri /path/to/mlflow/folder

Evaluation

export RUN_ID=YOUR_MLFLOW_EXPERIMENT_RUN_ID
python3 evaluate.py --run_id $RUN_ID --dataset_loc "path/to/dataset" --results_fp results.json

{
  "timestamp": "January 22, 2024 09:57:12 AM",
  "precision": 0.9463323229539818,
  "recall": 0.9424083769633508,
  "f1": 0.9437224104301406,
  "num_samples": 1000.0
}

run_id: ID of the specific MLflow run to load from.
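
To make the reported metrics concrete, the sketch below computes precision, recall, and F1 for a multi-class problem in plain Python. The support-weighted averaging used here is an assumption; the README does not state which averaging scheme evaluate.py applies. The function name weighted_prf and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 (averaging scheme is an assumption)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0  # per-class precision
        r_c = tp / n                                # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n / total                          # weight by class support
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for illustration only.
y_true = ["TRAVEL", "TRAVEL", "SPORTS", "POLITICS"]
y_pred = ["TRAVEL", "SPORTS", "SPORTS", "POLITICS"]
metrics = weighted_prf(y_true, y_pred)
print(metrics)
```

With support weighting, recall equals plain accuracy, which is one reason precision and recall can differ slightly, as they do in the results above.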

Inference

python3 predict.py --run_id $RUN_ID --headline "Airport Guide: Chicago O'Hare" --keyword "destination"

[
  {
    "prediction": "TRAVEL",
    "probabilities": {
      "BUSINESS": 0.0024151806719601154,
      "ENTERTAINMENT": 0.002721842611208558,
      "FOOD & DRINK": 0.001193400239571929,
      "PARENTING": 0.0015436559915542603,
      "POLITICS": 0.0012392215430736542,
      "SPORTS": 0.0020724297501146793,
      "STYLE & BEAUTY": 0.0018642042996361852,
      "TRAVEL": 0.9841892123222351,
      "WELLNESS": 0.0013303911546245217,
      "WORLD NEWS": 0.0014305398799479008
    }
  }
]
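
The output above pairs a predicted label with one probability per class. A minimal sketch of how such a result can be produced from raw model scores follows. It assumes the classifier emits one logit per class and that a softmax is applied, which the README does not confirm; the logit values and the helper names softmax and format_prediction are invented for illustration.

```python
import math

# Class names as they appear in the example output above.
CLASSES = [
    "BUSINESS", "ENTERTAINMENT", "FOOD & DRINK", "PARENTING", "POLITICS",
    "SPORTS", "STYLE & BEAUTY", "TRAVEL", "WELLNESS", "WORLD NEWS",
]


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def format_prediction(logits, classes=CLASSES):
    """Return the top class plus a class-to-probability mapping."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"prediction": classes[best], "probabilities": dict(zip(classes, probs))}


# Made-up logits with a strong score at the TRAVEL position.
logits = [0.1, 0.2, -0.5, -0.3, -0.6, 0.0, -0.2, 6.0, -0.4, -0.3]
result = format_prediction(logits)
print(result["prediction"])
```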

Application

python3 app.py --run_id $RUN_ID --num_cpus 2

Now, we can send requests to our application:

import json
import requests
headline = "Reboot Your Skin For Spring With These Facial Treatments"
keywords = "skin-facial-treatments"
json_data = json.dumps({"headline": headline, "keywords": keywords})
out = requests.post("http://127.0.0.1:8010/predict", data=json_data).json()
print(out["results"][0])

{
  "prediction": "STYLE & BEAUTY",
  "probabilities": {
    "BUSINESS": 0.002265132963657379,
    "ENTERTAINMENT": 0.008689943701028824,
    "FOOD & DRINK": 0.0011296054581180215,
    "PARENTING": 0.002621663035824895,
    "POLITICS": 0.002141285454854369,
    "SPORTS": 0.0017548275645822287,
    "STYLE & BEAUTY": 0.9760453104972839,
    "TRAVEL": 0.0024237297475337982,
    "WELLNESS": 0.001382972695864737,
    "WORLD NEWS": 0.0015455639222636819
  }
}
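
On the client side, it can help to separate building the request body from reading the response. The sketch below does this without any network call. The response shape (a "results" list whose items carry "prediction" and "probabilities") follows the example above, while the helper names build_payload and top_k and the rounded sample values are assumptions made for this example.

```python
import json


def build_payload(headline, keywords):
    """Serialize the two request fields used in the example above."""
    return json.dumps({"headline": headline, "keywords": keywords})


def top_k(response, k=3):
    """Return the k most probable (class, probability) pairs from a response dict."""
    probs = response["results"][0]["probabilities"]
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Abbreviated sample response with rounded values, for illustration only.
sample_response = {
    "results": [
        {
            "prediction": "STYLE & BEAUTY",
            "probabilities": {
                "ENTERTAINMENT": 0.0087,
                "PARENTING": 0.0026,
                "STYLE & BEAUTY": 0.976,
                "TRAVEL": 0.0024,
            },
        }
    ]
}

payload = build_payload(
    "Reboot Your Skin For Spring With These Facial Treatments",
    "skin-facial-treatments",
)
for label, prob in top_k(sample_response, k=2):
    print(f"{label}: {prob:.4f}")
```

The payload string can be passed as the data argument of requests.post in the snippet above, and top_k can be applied to the decoded JSON response.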

Testing the Code

How to test the written code for asserted inputs and outputs:

python3 -m pytest tests/code --verbose --disable-warnings

How to test the Model behaviour:

python3 -m pytest --run-id $RUN_ID tests/model --verbose --disable-warnings

Workload

A workload.sh script is provided to execute all stages of this project with a single command. Change the resource parameters (cpu_nums, gpu_nums, etc.) to suit your needs.

bash workload.sh

Extras

A Makefile is provided to format the scripts and clean caches from the directories:

make style && make clean

Serve the documentation for functions and classes locally:

python3 -m mkdocs serve
