ML-SQL Language Structures


This document specifies a preliminary set of keywords that we can use to create a SQL-like language for machine learning. A few keywords are used for all machine learning tasks; however, there are slight differences between supervised (classification and regression) and unsupervised (clustering, etc.) algorithms, which can be seen below.

These keywords were formed based on the example dataflows included in this repository as well as prior machine learning projects.

At the bottom I also include a few examples of regression and clustering tasks written in this language.

Assumptions

I will assume we start with a data file (e.g. a .csv, .txt, or .data file) already downloaded locally on our computer. From here our language will be able to perform machine learning tasks on these datasets.



Universal Keywords

READ

(file, separator, header, column_names)

Reads file into a matrix to be operated on. The file should be in some CSV-like format with values separated by the separator. Additionally, the presence of a header can be specified, along with names for the columns via column_names.
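As a rough illustration of what READ could map to, here is a minimal Python sketch using only the standard library. The function name `read_table`, its defaults, and the choice to return a `(names, rows)` pair are assumptions made for this example; the spec itself does not fix an implementation.

```python
import csv


def read_table(file, separator=",", header=False, column_names=None):
    """Hypothetical helper mirroring READ's four arguments.

    Returns (names, rows), where rows is a list of lists of strings.
    """
    # csv.reader expects a single-character delimiter.
    with open(file, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter=separator) if row]

    if header and rows:
        # The first line holds names; explicit column_names override them.
        file_names, rows = rows[0], rows[1:]
        names = column_names or file_names
    else:
        # No header: fall back to positional names unless names are given.
        width = len(rows[0]) if rows else 0
        names = column_names or [str(i) for i in range(width)]
    return names, rows
```

Values are left as strings here; converting them to numbers is a separate step that this sketch does not attempt.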

REPLACE

([column_name, method={mean, NaN, nearest_neighbor, etc., drop_row}])

Replaces missing or NaN values in the specified columns with the column mean, 0, NaN, etc., or drops the affected rows. Specified as a list of tuples with the column_name and the method.
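One possible reading of the (column, method) list is sketched below in Python. Only the `mean` and `drop_row` methods are covered, columns are addressed by index, and the helper names `is_missing` and `replace_missing` are assumptions introduced for illustration.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def replace_missing(rows, rules):
    """Apply (column_index, method) rules in order; returns a new list of rows."""
    rows = [list(row) for row in rows]  # work on a copy
    for column, method in rules:
        if method == "drop_row":
            rows = [row for row in rows if not is_missing(row[column])]
        elif method == "mean":
            present = [row[column] for row in rows if not is_missing(row[column])]
            if not present:
                raise ValueError(f"column {column} has no observed values")
            mean = sum(present) / len(present)
            for row in rows:
                if is_missing(row[column]):
                    row[column] = mean
        else:
            raise ValueError(f"method not covered by this sketch: {method}")
    return rows
```

For example, `replace_missing(rows, [(0, "mean"), (1, "drop_row")])` fills gaps in column 0 with that column's mean and then drops any row still missing a value in column 1.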

TRANSFORM

(mean_scale, variance_scale, PCA=?, combine, etc.)

Used to scale, combine, or modify the data that is read in.
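The spec does not define mean_scale and variance_scale precisely. The Python sketch below assumes mean_scale means subtracting the column mean and variance_scale means dividing by the population standard deviation; the function name `scale_column` is hypothetical.

```python
import statistics


def scale_column(values, mean_scale=True, variance_scale=True):
    """Center and/or rescale one numeric column (assumed semantics)."""
    mean = statistics.fmean(values) if mean_scale else 0.0
    spread = statistics.pstdev(values) if variance_scale else 1.0
    if spread == 0:
        # A constant column cannot be rescaled; leave its scale unchanged.
        spread = 1.0
    return [(v - mean) / spread for v in values]
```

With both flags on, the result has mean 0 and unit variance, which is the usual preparation step before PCA or a regularized model.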

SPLIT

(train, test, validation)

Splits the data into training, testing, and validation sets for model building. train, test, and validation can either be fractions that sum to 1 or hard-coded values that specify the relative sizes of each of the sets.
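Because the three sizes may be given either as fractions or as relative weights, one simple approach is to normalize them by their sum. The Python sketch below does that; the name `split_rows`, the shuffling step, and the `seed` parameter are assumptions for this example and not part of the spec.

```python
import random


def split_rows(rows, train, test, validation=0.0, seed=0):
    """Shuffle rows and cut them into (train, test, validation) lists."""
    sizes = [train, test, validation]
    total = sum(sizes)
    if total <= 0 or any(size < 0 for size in sizes):
        raise ValueError("sizes must be non-negative and not all zero")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    # Cumulative cut points, so the three pieces always cover every row once.
    n = len(shuffled)
    cut1 = round(n * train / total)
    cut2 = round(n * (train + test) / total)
    return shuffled[:cut1], shuffled[cut1:cut2], shuffled[cut2:]
```

Under this reading, `train = .8 test = .2` and `train = 8 test = 2` describe the same split.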

EVALUATE

(deviance, r-squared, residuals, etc.)

Defines the metrics that the user wants to see to evaluate their machine learning model.
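Two of the listed metrics are easy to state concretely. The Python sketch below uses the standard definitions of residuals (actual minus predicted) and of r-squared as one minus the ratio of the residual to the total sum of squares; the function names are illustrative only.

```python
def residuals(actual, predicted):
    """Per-observation errors: actual minus predicted."""
    return [a - p for a, p in zip(actual, predicted)]


def r_squared(actual, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum(r * r for r in residuals(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    # ss_tot is zero when every actual value is identical; the ratio is
    # undefined in that case and this sketch lets the ZeroDivisionError surface.
    return 1 - ss_res / ss_tot
```

A model that always predicts the mean of the labels scores 0, and a perfect fit scores 1.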

VISUALIZE

(scatter_plot, diagnostics, etc.)

Visualizations or graphs the user wants to see to evaluate model performance.


Classification/Regression Keywords

CLASSIFY / REGRESS

(predictors, labels, algorithm={svm, regression, lasso, ridge, etc.})

Specifies the supervised machine learning algorithm being used to classify data or fit a regression. The predictors and labels are specified using column names or indices.

USING

(lambda, c, etc.)

Used to specify hyperparameters or values for some machine learning algorithms.
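To show how a USING value could feed into the chosen algorithm, here is a minimal Python sketch of ridge regression with a single predictor, where `lam` plays the role of lambda. It assumes the intercept is not penalized, and the function name `fit_ridge_1d` is hypothetical; a real backend would handle multiple predictors.

```python
def fit_ridge_1d(x, y, lam=0.0):
    """Fit y ~ intercept + slope * x, penalizing lam * slope**2."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Centered cross-product and sum of squares.
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    # Closed-form minimizer; lam = 0 gives ordinary least squares.
    # Assumes sxx + lam > 0, i.e. x is not constant or lam is positive.
    slope = sxy / (sxx + lam)
    intercept = y_mean - slope * x_mean
    return slope, intercept
```

Larger values of `lam` shrink the slope toward zero, which is the effect a user would be tuning with `USING lambda = ...`.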


Clustering Keywords

CLUSTER

(columns, algorithm={k-means, nonparametric})

Specifies the unsupervised machine learning algorithm being used to cluster data. The columns are specified using column names or indices.

USING

(number_clusters, lambda, etc.)

Used to specify hyperparameters or values for some machine learning algorithms.
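For the clustering case, the sketch below shows how number_clusters could drive a plain k-means loop (Lloyd's algorithm) in Python. The function name `k_means`, the fixed iteration count, and the random initialization from the data points are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def k_means(points, number_clusters, iterations=20, seed=0):
    """Return (centers, labels) after a fixed number of Lloyd iterations."""
    centers = random.Random(seed).sample(points, number_clusters)
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for point in points:
            groups[nearest_center(point, centers)].append(point)
        # Update step: move each center to the mean of its group.
        # An empty group keeps its previous center.
        centers = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else centers[i]
            for i, group in enumerate(groups)
        ]
    labels = [nearest_center(point, centers) for point in points]
    return centers, labels
```

Points are expected as equal-length tuples of numbers, for example the rows selected by the columns argument.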



Examples

  1. Auto-mpg dataset (regression)

    • READ "auto-mpg.data" REPLACE [2: drop_row, 6: drop_row] SPLIT train = .8 test = .2 REGRESS simple predictors = [1:7] labels = 8 USING lambda = .01 EVALUATE r-squared VISUALIZE scatter_plot
  2. Seeds dataset (cluster)

    • READ "seeds.txt" REPLACE [2: mean, 6: mean] TRANSFORM PCA=2 CLUSTER all k-means USING clusters = 3 EVALUATE residuals VISUALIZE diagnostics
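Since each statement is a flat sequence of keyword clauses, a first parsing pass can simply split on the keywords. The Python sketch below is one way to do that; the `KEYWORDS` list is taken from the sections above, while the function name `split_clauses` and the decision to return raw argument text are assumptions for this example.

```python
import re

KEYWORDS = [
    "READ", "REPLACE", "TRANSFORM", "SPLIT", "CLASSIFY",
    "REGRESS", "CLUSTER", "USING", "EVALUATE", "VISUALIZE",
]


def split_clauses(query):
    """Split a statement into (keyword, argument_text) pairs, in order."""
    pattern = r"\b(" + "|".join(KEYWORDS) + r")\b"
    # With a capturing group, re.split returns
    # [prefix, keyword, arguments, keyword, arguments, ...].
    parts = re.split(pattern, query)
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts), 2)]


seeds = (
    'READ "seeds.txt" REPLACE [2: mean, 6: mean] TRANSFORM PCA=2 '
    "CLUSTER all k-means USING clusters = 3 EVALUATE residuals VISUALIZE diagnostics"
)
for keyword, arguments in split_clauses(seeds):
    print(keyword, "->", arguments)
```

The match is case-sensitive, so lowercase words such as `clusters` are not mistaken for keywords. An uppercase keyword appearing inside a quoted file name would still be split incorrectly, which a fuller tokenizer would need to handle.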