Snowflake Cortex is a set of machine learning (ML) and large language model (LLM) functions designed to simplify building models on data stored in the Snowflake Data Warehouse (DWH). This guide provides a simple walkthrough for preparing data, training a model, and invoking the function that detects anomalies in a data set.
This guide covers two different concepts:
- Basic (unsupervised) anomaly detection
- Supervised anomaly detection
To run this guide effectively, you will need to meet the following requirements:
- Snowflake Platform Account: Ensure you have an active account on the Snowflake platform. If you don't have one yet, sign up on the Snowflake website.
- User/Role Configuration with Permissions:
  - Configure a user or role with the permissions needed to select from tables and create views within your Snowflake account.
  - This user or role should have privileges to access the database objects (tables, views) required for data preparation, model training, and anomaly detection.
- Configured Warehouse:
  - Set up a Snowflake warehouse that can run queries.
  - Ensure that the warehouse is sized appropriately for your dataset and computational requirements.
- Test Data According to Data Schema:
  - Prepare test data that adheres to the data schema used for anomaly detection in this guide.
  - The data should include all fields, in the expected formats, that the anomaly detection model requires.
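The role and warehouse requirements above could be set up roughly as follows. This is a minimal sketch, not a definitive configuration: the role name `ANOMALY_ANALYST` and the warehouse name `ANOMALY_WH` are hypothetical, the warehouse size is only a starting point, and the statements assume you run them with a role that is allowed to create roles and grant privileges.

```sql
-- Hypothetical role used for the whole walkthrough.
CREATE ROLE IF NOT EXISTS ANOMALY_ANALYST;

-- Access to the database and schema used in this guide.
GRANT USAGE ON DATABASE MYDATABASE TO ROLE ANOMALY_ANALYST;
GRANT USAGE ON SCHEMA MYDATABASE.PUBLIC TO ROLE ANOMALY_ANALYST;

-- Read the source table (run this once the EVENTS table exists)
-- and create the views used for training and detection.
GRANT SELECT ON TABLE MYDATABASE.PUBLIC.EVENTS TO ROLE ANOMALY_ANALYST;
GRANT CREATE VIEW ON SCHEMA MYDATABASE.PUBLIC TO ROLE ANOMALY_ANALYST;

-- Privilege needed to create anomaly detection models in the schema.
GRANT CREATE SNOWFLAKE.ML.ANOMALY_DETECTION
    ON SCHEMA MYDATABASE.PUBLIC TO ROLE ANOMALY_ANALYST;

-- Hypothetical small warehouse; resize it if training is slow.
CREATE WAREHOUSE IF NOT EXISTS ANOMALY_WH
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;
GRANT USAGE ON WAREHOUSE ANOMALY_WH TO ROLE ANOMALY_ANALYST;
```

Granting only these privileges keeps the role limited to what the walkthrough needs; grant the role to your user with `GRANT ROLE ANOMALY_ANALYST TO USER <your_user>`.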
This guide is based on the following data schema:
CREATE OR REPLACE TABLE MYDATABASE.PUBLIC.EVENTS (
    TIME TIMESTAMP_NTZ(9) NOT NULL,
    EVENT_TYPE VARCHAR(50) NOT NULL
);
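The table stores one row per event, while anomaly detection works on a numeric time series. A minimal sketch of how the events might be turned into such a series is shown below; it assumes a daily granularity (the guide refers to anomalies by day) and uses the hypothetical column names `DAY` and `EVENT_COUNT`.

```sql
-- Preview the daily event counts that a model would be trained on.
-- DATE_TRUNC on a TIMESTAMP_NTZ column keeps the TIMESTAMP_NTZ type.
SELECT
    DATE_TRUNC('DAY', TIME) AS DAY,
    COUNT(*)                AS EVENT_COUNT
FROM MYDATABASE.PUBLIC.EVENTS
GROUP BY DATE_TRUNC('DAY', TIME)
ORDER BY DAY;
```

Days without any events do not appear in this result, so check the output for gaps before training.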
The guide provides a set of SQL queries for each of the two concepts:
- Basic Anomaly Detection: A simple (unsupervised) anomaly detection model trained on historical data.
- Supervised Anomaly Detection: A supervised anomaly detection model trained on historical data enriched with an additional label that flags whether a given data point in the historical data set is an anomaly.
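The two concepts can be sketched with the `SNOWFLAKE.ML.ANOMALY_DETECTION` class as follows. This is an illustration under stated assumptions rather than the exact queries of the guide: events are aggregated into daily counts, the cutoff date `2024-03-01` between historical and new data is hypothetical, and all view, column, and model names other than `MYDATABASE.PUBLIC.EVENTS` are made up for the example. The only labeled day is '2024-02-07', the known anomaly in the historical data.

```sql
-- Historical (training) data: daily counts before a hypothetical cutoff.
CREATE OR REPLACE VIEW MYDATABASE.PUBLIC.EVENTS_DAILY_TRAIN AS
SELECT DATE_TRUNC('DAY', TIME) AS DAY,
       COUNT(*)::FLOAT         AS EVENT_COUNT
FROM MYDATABASE.PUBLIC.EVENTS
WHERE TIME < '2024-03-01'
GROUP BY DATE_TRUNC('DAY', TIME);

-- Same data plus a Boolean label marking the known anomaly.
-- Kept in a separate view so the basic model never sees the label column.
CREATE OR REPLACE VIEW MYDATABASE.PUBLIC.EVENTS_DAILY_TRAIN_LABELED AS
SELECT DAY,
       EVENT_COUNT,
       (DAY = '2024-02-07'::TIMESTAMP_NTZ) AS IS_KNOWN_ANOMALY
FROM MYDATABASE.PUBLIC.EVENTS_DAILY_TRAIN;

-- New data to analyze: daily counts from the cutoff onward.
CREATE OR REPLACE VIEW MYDATABASE.PUBLIC.EVENTS_DAILY_NEW AS
SELECT DATE_TRUNC('DAY', TIME) AS DAY,
       COUNT(*)::FLOAT         AS EVENT_COUNT
FROM MYDATABASE.PUBLIC.EVENTS
WHERE TIME >= '2024-03-01'
GROUP BY DATE_TRUNC('DAY', TIME);

-- Basic (unsupervised) model: empty label column name.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION MYDATABASE.PUBLIC.BASIC_MODEL(
    INPUT_DATA        => SYSTEM$REFERENCE('VIEW', 'MYDATABASE.PUBLIC.EVENTS_DAILY_TRAIN'),
    TIMESTAMP_COLNAME => 'DAY',
    TARGET_COLNAME    => 'EVENT_COUNT',
    LABEL_COLNAME     => ''
);

-- Supervised model: trained with the label column.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION MYDATABASE.PUBLIC.SUPERVISED_MODEL(
    INPUT_DATA        => SYSTEM$REFERENCE('VIEW', 'MYDATABASE.PUBLIC.EVENTS_DAILY_TRAIN_LABELED'),
    TIMESTAMP_COLNAME => 'DAY',
    TARGET_COLNAME    => 'EVENT_COUNT',
    LABEL_COLNAME     => 'IS_KNOWN_ANOMALY'
);

-- Run both models against the same new data.
CALL MYDATABASE.PUBLIC.BASIC_MODEL!DETECT_ANOMALIES(
    INPUT_DATA        => SYSTEM$REFERENCE('VIEW', 'MYDATABASE.PUBLIC.EVENTS_DAILY_NEW'),
    TIMESTAMP_COLNAME => 'DAY',
    TARGET_COLNAME    => 'EVENT_COUNT'
);

CALL MYDATABASE.PUBLIC.SUPERVISED_MODEL!DETECT_ANOMALIES(
    INPUT_DATA        => SYSTEM$REFERENCE('VIEW', 'MYDATABASE.PUBLIC.EVENTS_DAILY_NEW'),
    TIMESTAMP_COLNAME => 'DAY',
    TARGET_COLNAME    => 'EVENT_COUNT'
);
```

The label is only needed at training time; both models are evaluated on the same unlabeled view, which makes their results directly comparable.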
I ran a couple of tests comparing the anomalies detected by the basic model and by the supervised model. My original dataset contained an anomaly on '2024-02-07'. The supervised model, trained with that day labeled as an anomaly, found one fewer anomaly in the new data than the basic model.
The figures below show the detected anomalies using both models:
The figure below shows the difference between the forecasts produced by the two models. The difference is small because the anomaly on '2024-02-07' is minor and only slightly affects the forecast. However, the label still changes the detection result, which indicates that the supervised model works better for this data.
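A comparison like this can also be done in SQL by storing each model's output and joining the two results by timestamp. The sketch below assumes two already trained models named `BASIC_MODEL` and `SUPERVISED_MODEL` and a view `EVENTS_DAILY_NEW` with columns `DAY` and `EVENT_COUNT`; all of these names, and the result table names, are hypothetical. The columns `TS`, `Y`, `FORECAST`, and `IS_ANOMALY` are taken from the documented output of `DETECT_ANOMALIES`.

```sql
-- Store the basic model's output. RESULT_SCAN must run directly after
-- the CALL, in the same session, so it picks up that query's result.
CALL MYDATABASE.PUBLIC.BASIC_MODEL!DETECT_ANOMALIES(
    INPUT_DATA        => SYSTEM$REFERENCE('VIEW', 'MYDATABASE.PUBLIC.EVENTS_DAILY_NEW'),
    TIMESTAMP_COLNAME => 'DAY',
    TARGET_COLNAME    => 'EVENT_COUNT'
);
CREATE OR REPLACE TABLE MYDATABASE.PUBLIC.BASIC_RESULTS AS
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

-- Store the supervised model's output the same way.
CALL MYDATABASE.PUBLIC.SUPERVISED_MODEL!DETECT_ANOMALIES(
    INPUT_DATA        => SYSTEM$REFERENCE('VIEW', 'MYDATABASE.PUBLIC.EVENTS_DAILY_NEW'),
    TIMESTAMP_COLNAME => 'DAY',
    TARGET_COLNAME    => 'EVENT_COUNT'
);
CREATE OR REPLACE TABLE MYDATABASE.PUBLIC.SUPERVISED_RESULTS AS
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

-- Side-by-side comparison of forecasts and anomaly flags per day.
SELECT
    b.TS,
    b.Y                     AS OBSERVED,
    b.FORECAST              AS BASIC_FORECAST,
    s.FORECAST              AS SUPERVISED_FORECAST,
    s.FORECAST - b.FORECAST AS FORECAST_DIFF,
    b.IS_ANOMALY            AS BASIC_IS_ANOMALY,
    s.IS_ANOMALY            AS SUPERVISED_IS_ANOMALY
FROM MYDATABASE.PUBLIC.BASIC_RESULTS b
JOIN MYDATABASE.PUBLIC.SUPERVISED_RESULTS s
    ON b.TS = s.TS
ORDER BY b.TS;
```

Rows where `BASIC_IS_ANOMALY` and `SUPERVISED_IS_ANOMALY` differ are the days on which the label changed the outcome.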