
Rename Stack to Stacks and Fix Bugs #107

Merged · 8 commits · Oct 19, 2023
Changes from 4 commits
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -11,4 +11,4 @@
__pycache__/
.cache
*.pyc
mlops-stack.iml
mlops-stacks.iml
2 changes: 1 addition & 1 deletion Pipeline.md
@@ -1,5 +1,5 @@
# ML Pipeline Structure and Devloop
The default stack contains an ML pipeline with CI/CD workflows to test and deploy
MLOps Stacks contains an ML pipeline with CI/CD workflows to test and deploy
automated model training and batch inference jobs across your dev, staging, and prod Databricks
workspaces.

Expand Down
68 changes: 33 additions & 35 deletions README.md
@@ -1,7 +1,6 @@
# Databricks MLOps Stack
# Databricks MLOps Stacks

> **_NOTE:_** This feature is in [private preview](https://docs.databricks.com/release-notes/release-types.html). The interface/APIs may change and no formal support is available during the preview. However, you can still create new production-grade ML projects using the stack.
If interested in trying it out, please fill out this [form](https://docs.google.com/forms/d/e/1FAIpQLSfHXCmkbsEURjQQvtUGObgh2D5q1eD4YRHnUxZ0M4Hu0W63WA/viewform), and you’ll be contacted by a Databricks representative.
> **_NOTE:_** This feature is in [public preview](https://docs.databricks.com/release-notes/release-types.html).

This repo provides a customizable stack for starting new ML projects
on Databricks that follow production best-practices out of the box.
Expand All @@ -19,25 +18,25 @@ Your organization can use the default stack as is or customize it as needed, e.g
adapt individual components to fit your organization's best practices. See the
[stack customization guide](stack-customization.md) for more details.

Using Databricks MLOps stack, data scientists can quickly get started iterating on ML code for new projects while ops engineers set up CI/CD and ML service state
management, with an easy transition to production. You can also use MLOps stack as a building block
Using Databricks MLOps Stacks, data scientists can quickly get started iterating on ML code for new projects while ops engineers set up CI/CD and ML service state
management, with an easy transition to production. You can also use MLOps Stacks as a building block
in automation for creating new data science projects with production-grade CI/CD pre-configured.

![MLOps Stack diagram](doc-images/mlops-stack.png)
![MLOps Stacks diagram](doc-images/mlops-stacks.png)

See the [FAQ](#FAQ) for questions on common use cases.

## ML pipeline structure and devloop
[See this page](Pipeline.md) for detailed description and diagrams of the ML pipeline
structure defined in the default stack.

## Using this stack
## Using MLOps Stacks

### Prerequisites
- Python 3.8+
- [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) >= v0.204.0
- [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) >= v0.208.1

[Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) v0.204.0 contains [Databricks asset bundle templates](https://docs.databricks.com/en/dev-tools/bundles/templates.html) for the purpose of project creation.
[Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) v0.208.1 contains [Databricks asset bundle templates](https://docs.databricks.com/en/dev-tools/bundles/templates.html) for the purpose of project creation.

Please follow [the instruction](https://docs.databricks.com/en/dev-tools/cli/databricks-cli-ref.html#install-the-cli) to install and set up databricks CLI. Releases of databricks CLI can be found in the [releases section](https://github.com/databricks/cli/releases) of databricks/cli repository.

Expand All @@ -47,7 +46,7 @@ Please follow [the instruction](https://docs.databricks.com/en/dev-tools/cli/dat

To create a new project, run:

databricks bundle init https://github.com/databricks/mlops-stack
databricks bundle init mlops-stacks
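For scripted setups, the init prompts can be pre-answered from a JSON config file. The sketch below is an illustration only: the parameter name comes from the prompt list in this README, the value is a hypothetical placeholder, and it assumes the `--config-file` flag (shown later in this README for local checkouts) also works with the `mlops-stacks` template alias.

```shell
# Sketch: pre-answer the init prompts from a JSON file.
# The parameter value here is a hypothetical placeholder.
cat > mlops-params.json <<'EOF'
{
  "input_project_name": "my-mlops-project"
}
EOF

# Sanity-check the JSON before handing it to the CLI.
python3 -m json.tool mlops-params.json > /dev/null && echo "config OK"

# Requires Databricks CLI >= v0.208.1; uncomment to run for real:
# databricks bundle init mlops-stacks --config-file mlops-params.json
```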

Contributor: This is awesome. Shall we cut a release and update the alias pointing to the release version later?

Collaborator (Author): Yup! We'll be cutting the MLOps Stacks v0.2 release after this PR merges, and the deco team has plans to add this semantic versioning to the bundle aliases.
This will prompt for parameters for project initialization. Some of these parameters are required to get started:
* ``input_project_name``: name of the current project
Expand Down Expand Up @@ -78,42 +77,41 @@ See the generated ``README.md`` for next steps!

## FAQ

### Do I need separate dev/staging/prod workspaces to use this stack?
### Do I need separate dev/staging/prod workspaces to use MLOps Stacks?
We recommend using separate dev/staging/prod Databricks workspaces for stronger
isolation between environments. For example, Databricks REST API rate limits
are applied per-workspace, so if using [Databricks Model Serving](https://docs.databricks.com/applications/mlflow/model-serving.html),
using separate workspaces can help prevent high load in staging from DOSing your
production model serving endpoints.

However, you can run the stack against just a single workspace, against a dev and
staging/prod workspace, etc. Just supply the same workspace URL for
However, you can create a single workspace stack, by supplying the same workspace URL for
`input_databricks_staging_workspace_host` and `input_databricks_prod_workspace_host`. If you go this route, we
recommend using different service principals to manage staging vs prod resources,
to ensure that CI workloads run in staging cannot interfere with production resources.
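Concretely, a single-workspace setup just repeats one host URL for both parameters. A minimal sketch, where the workspace URL is a hypothetical placeholder:

```shell
# Hypothetical single-workspace config: staging and prod share one workspace URL.
cat > single-workspace.json <<'EOF'
{
  "input_databricks_staging_workspace_host": "https://my-workspace.cloud.databricks.com",
  "input_databricks_prod_workspace_host": "https://my-workspace.cloud.databricks.com"
}
EOF

# Confirm both hosts match before initializing the project from this config.
python3 - <<'PY'
import json
cfg = json.load(open("single-workspace.json"))
assert cfg["input_databricks_staging_workspace_host"] == cfg["input_databricks_prod_workspace_host"]
print("single-workspace config OK")
PY
```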

### I have an existing ML project. Can I productionize it using this stack?
Yes. Currently, you can instantiate a new project from the stack and copy relevant components
into your existing project to productionize it. The stack is modularized, so
### I have an existing ML project. Can I productionize it using MLOps Stacks?
Yes. Currently, you can instantiate a new project and copy relevant components
into your existing project to productionize it. MLOps Stacks is modularized, so
you can e.g. copy just the GitHub Actions workflows under `.github` or ML resource configs
under ``{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/resources``
and ``{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/bundle.yml`` into your existing project.
and ``{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/databricks.yml`` into your existing project.

Contributor: 👍 Thanks

### Can I adopt individual components of the stack?
For this use case, we recommend instantiating the full stack via [Databricks asset bundle templates](https://docs.databricks.com/en/dev-tools/bundles/templates.html)
and copying the relevant stack subdirectories. For example, all ML resource configs
### Can I adopt individual components of MLOps Stacks?
For this use case, we recommend instantiating via [Databricks asset bundle templates](https://docs.databricks.com/en/dev-tools/bundles/templates.html)
and copying the relevant subdirectories. For example, all ML resource configs
are defined under ``{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/resources``
and ``{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/bundle.yml``, while CI/CD is defined e.g. under `.github`
and ``{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/databricks.yml``, while CI/CD is defined e.g. under `.github`
if using GitHub Actions, or under `.azure` if using Azure DevOps.

### Can I customize this stack?
### Can I customize my MLOps Stack?
Yes. We provide the default stack in this repo as a production-friendly starting point for MLOps.
However, in many cases you may need to customize the stack to match your organization's
best practices. See [the stack customization guide](stack-customization.md)
for details on how to do this.

### Does the MLOps stack cover data (ETL) pipelines?
### Does the MLOps Stacks cover data (ETL) pipelines?

Since MLOps Stack is based on [databricks CLI bundles](https://docs.databricks.com/dev-tools/cli/bundle-commands.html),
Since MLOps Stacks is based on [databricks CLI bundles](https://docs.databricks.com/dev-tools/cli/bundle-commands.html),
it's not limited only to ML workflows and assets - it works for assets across the Databricks Lakehouse. For instance, while the existing ML
code samples contain feature engineering, training, model validation, deployment and batch inference workflows,
you can use it for Delta Live Tables pipelines as well.
Expand All @@ -127,7 +125,7 @@ Please provide feedback (bug reports, feature requests, etc) via GitHub issues.
We welcome community contributions. For substantial changes, we ask that you first file a GitHub issue to facilitate
discussion, before opening a pull request.

This stack is implemented as a [Databricks asset bundle template](https://docs.databricks.com/en/dev-tools/bundles/templates.html)
MLOps Stacks is implemented as a [Databricks asset bundle template](https://docs.databricks.com/en/dev-tools/bundles/templates.html)
that generates new projects given user-supplied parameters. Parametrized project code can be found under
the `{{.input_root_dir}}` directory.

Expand Down Expand Up @@ -164,25 +162,25 @@ Run integration tests only:
pytest tests --large-only
```

### Previewing stack changes
When making changes to the stack, it can be convenient to see how those changes affect
an actual new ML project created from the stack. To do this, you can create an example
project from your local checkout of the stack, and inspect its contents/run tests within
### Previewing changes
When making changes to MLOps Stacks, it can be convenient to see how those changes affect
a generated new ML project. To do this, you can create an example
project from your local checkout of the repo, and inspect its contents/run tests within
the project.

We provide example project configs for Azure (using both GitHub and Azure DevOps) and AWS (using GitHub) under `tests/example-project-configs`.
To create an example Azure project, using Azure DevOps as the CI/CD platform, run the following from the desired parent directory
of the example project:

```
# Note: update MLOPS_STACK_PATH to the path to your local checkout of the stack
MLOPS_STACK_PATH=~/mlops-stack
databricks bundle init "$MLOPS_STACK_PATH" --config-file "$MLOPS_STACK_PATH/tests/example-project-configs/azure/azure-devops.json"
# Note: update MLOPS_STACKS_PATH to the path to your local checkout of the MLOps Stacks repo
MLOPS_STACKS_PATH=~/mlops-stacks
databricks bundle init "$MLOPS_STACKS_PATH" --config-file "$MLOPS_STACKS_PATH/tests/example-project-configs/azure/azure-devops.json"
```

To create an example AWS project, using GitHub Actions for CI/CD, run:
```
# Note: update MLOPS_STACK_PATH to the path to your local checkout of the stack
MLOPS_STACK_PATH=~/mlops-stack
databricks bundle init "$MLOPS_STACK_PATH" --config-file "$MLOPS_STACK_PATH/tests/example-project-configs/aws/aws-github.json"
# Note: update MLOPS_STACKS_PATH to the path to your local checkout of the MLOps Stacks repo
MLOPS_STACKS_PATH=~/mlops-stacks
databricks bundle init "$MLOPS_STACKS_PATH" --config-file "$MLOPS_STACKS_PATH/tests/example-project-configs/aws/aws-github.json"
```
9 changes: 5 additions & 4 deletions databricks_template_schema.json
Expand Up @@ -4,7 +4,7 @@
"order": 1,
"type": "string",
"default": "my-mlops-project",
"description": "Welcome to MLOps Stack. For detailed information on project generation, see the README at https://github.com/databricks/mlops-stack/blob/main/README.md. \n\nProject Name"
"description": "Welcome to MLOps Stacks. For detailed information on project generation, see the README at https://github.com/databricks/mlops-stacks/blob/main/README.md. \n\nProject Name"
},
"input_root_dir": {
"order": 2,
Expand Down Expand Up @@ -63,8 +63,8 @@
"input_schema_name": {
"order": 11,
"type": "string",
"description": "\nName of schema to use when registering a model in Unity Catalog. \nNote that this schema must already exist. Default",
"default": "schema_name"
"description": "\nName of schema to use when registering a model in Unity Catalog. \nNote that this schema must already exist, and we recommend keeping the name the same as the project name. Default",
"default": "my-mlops-project"
},
"input_unity_catalog_read_user_group": {
"order": 12,
Expand All @@ -84,5 +84,6 @@
"description": "\nWhether to include MLflow Recipes. \nChoose from no, yes",
"default": "no"
}
}
},
"success_message" : "\n✨ Your MLOps Stack has been created in the '{{.project_name}}' directory!\n\nPlease refer to the README.md of your project for further instructions on getting started."
}
Binary file removed doc-images/mlops-stack.png
Binary file not shown.
Binary file added doc-images/mlops-stacks.png
Contributor: It's 9x larger. I don't think it's a big deal.

Collaborator (Author): Yeah I had to take the original image, overlay the new text, and then take a screenshot since I couldn't find the original file where we created the diagram 🤦‍♂️

42 changes: 20 additions & 22 deletions stack-customization.md
@@ -1,9 +1,9 @@
# Stack Customization Guide
We provide the default stack in this repo as a production-friendly starting point for MLOps.
# MLOps Stacks Customization Guide
We provide the default MLOps Stack in this repo as a production-friendly starting point for MLOps.

For generic enhancements not specific to your organization
(e.g. add support for a new CI/CD provider), we encourage you to consider contributing the
change back to the default stack, so that the community can help maintain and enhance it.
change back to the MLOps Stacks repo, so that the community can help maintain and enhance it.

However, in many cases you may need to customize the stack, for example if:
* You have different Databricks workspace environments (e.g. a "test" workspace for CI, in addition to dev/staging/prod)
Expand All @@ -19,20 +19,20 @@ default stack. Before getting started, we encourage you to read
the [contributor guide](README.md#contributing) to learn how to
make, preview, and test changes to your custom stack.

### Fork the default stack repo
Fork the default stack repo. You may want to create a private fork if you're tailoring
### Fork the MLOps Stacks repo
Fork the MLOps Stacks repo. You may want to create a private fork if you're tailoring
the stack to the specific needs of your organization, or a public fork if you're creating
a generic new stack.

### (optional) Set up CI for your new stack
Tests for the default stack are defined under the `tests/` directory and are
### (optional) Set up CI
Tests for MLOps Stacks are defined under the `tests/` directory and are
executed in CI by Github Actions workflows defined under `.github/`. We encourage you to configure
CI in your own stack repo to ensure the stack continues to work as you make changes.
CI in your own MLOps Stacks repo to ensure it continues to work as you make changes.
If you use GitHub Actions for CI, the provided workflows should work out of the box.
Otherwise, you'll need to translate the workflows under `.github/` to the CI provider of your
choice.

### Update stack parameters
### Update MLOps Stacks parameters
Update parameters in your fork as needed in `databricks_template_schema.json` and update corresponding template variable in `library/template_variables.tmpl`. Pruning the set of
parameters makes it easier for data scientists to start new projects, at the cost of reduced flexibility.

Expand All @@ -41,16 +41,15 @@ For example, you may have a fixed set of staging & prod Databricks workspaces (o
also run all of your ML pipelines on a single cloud, in which case the `input_cloud` parameter is unnecessary.

The easiest way to prune parameters and replace them with hardcoded values is to follow
the [contributor guide](README.md#previewing-stack-changes) to generate an example project with
parameters substituted-in, and then copy the generated project contents back into your stack.
the [contributor guide](README.md#previewing-changes) to generate an example project with
parameters substituted-in, and then copy the generated project contents back into your MLOps Stacks repo.

## Customize individual components

### Example ML code
The default stack provides example ML code using [MLflow recipes](https://mlflow.org/docs/latest/recipes.html#).
MLOps Stacks provides example ML code.
You may want to customize the example code, e.g. further prune it down into a skeleton for data scientists
to fill out, or remove and replace the use of MLflow Recipes if you expect data scientists to work on problem
types that are currently unsupported by MLflow Recipes.
to fill out.

If you customize this component, you can still use the CI/CD and ML resource components to build production ML pipelines, as long as you provide ML
notebooks with the expected interface. For example, model training under ``template/{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/training/notebooks/`` and inference under
Expand All @@ -60,14 +59,13 @@ You may also want to update developer-facing docs under `template/{{.input_root_
or `template/{{.input_root_dir}}/docs/ml-developer-guide-fs.md`, which will be read by users of your stack.

### CI/CD workflows
The default stack currently has the following sub-components for CI/CD:
MLOps Stacks currently has the following sub-components for CI/CD:
* CI/CD workflow logic defined under `template/{{.input_root_dir}}/.github/` for testing and deploying ML code and models
* Automated scripts and docs for setting up CI/CD under `template/{{.input_root_dir}}/.mlops-setup-scripts/`
* Logic to trigger model deployment through REST API calls to your CD system, when model training completes.
This logic is currently captured in ``template/{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/deployment/model_deployment/notebooks/TriggerModelDeploy.py``
This logic is currently captured in ``template/{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/deployment/model_deployment/notebooks/ModelDeployment.py``

Contributor: This issue must be from a long time ago 🤦.
### ML resource configs
Root ML resource config file can be found as ``{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/bundle.yml``.
Root ML resource config file can be found as ``{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/databricks.yml``.
It defines the ML config resources to be included and workspace host for each deployment target.

ML resource configs (databricks CLI bundles code definitions of ML jobs, experiments, models etc) can be found under
Expand All @@ -80,7 +78,7 @@ When updating this component, you may want to update developer-facing docs in
``template/{{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/resources/README.md``.

### Docs
After making stack customizations, make any changes needed to
the stack docs under `template/{{.input_root_dir}}/docs` and in the main README
(`template/{{.input_root_dir}}/README.md`) to reflect any updates you've made to the stack.
For example, you may want to include a link to your custom stack in `template/{{.input_root_dir}}/README.md`.
After making customizations, make any changes needed to
the docs under `template/{{.input_root_dir}}/docs` and in the main README
(`template/{{.input_root_dir}}/README.md`) to reflect any updates you've made to the MLOps Stacks repo.
For example, you may want to include a link to your custom MLOps Stacks repo in `template/{{.input_root_dir}}/README.md`.
5 changes: 5 additions & 0 deletions template/update_layout.tmpl
Expand Up @@ -61,6 +61,11 @@
{{ skip (printf `%s/%s` $root_dir `docs/ml-developer-guide-fs.md`) }}
{{ end }}

# Remove utils if using Models in Unity Catalog
{{ if (eq .input_include_models_in_unity_catalog `yes`) }}
{{ skip (printf `%s/%s/%s` $root_dir $project_name_alphanumeric_underscore `utils.py`) }}
{{ end }}

# Remove template files
{{ skip `update_layout` }}
{{ skip `run_validations` }}
@@ -1,6 +1,6 @@
# This Azure Pipeline validates and deploys bundle config (ML resource config and more)
# defined under {{template `project_name_alphanumeric_underscore` .}}/databricks-resource/*
# and {{template `project_name_alphanumeric_underscore` .}}/bundle.yml.
# defined under {{template `project_name_alphanumeric_underscore` .}}/resources/*
# and {{template `project_name_alphanumeric_underscore` .}}/databricks.yml.
# The bundle is validated (CI) upon making a PR against the {{template `default_branch` .}} branch.
# Bundle resources defined for staging are deployed when a PR is merged into the {{template `default_branch` .}} branch.
# Bundle resources defined for prod are deployed when a PR is merged into the {{template `release_branch` .}} branch.
Expand Down