diff --git a/notebooks/10 Automated Machine Learning.ipynb b/notebooks/10 Automated Machine Learning.ipynb
index b6301cc..bc49289 100644
--- a/notebooks/10 Automated Machine Learning.ipynb
+++ b/notebooks/10 Automated Machine Learning.ipynb
@@ -11,7 +11,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "In this notebook, we demonstrated how to use the Kx kdb+/q Automated Machine Learning library. The example below use samples from the Telco Customer Churn dataset.\n",
+ "In this notebook, we demonstrate how to use the Kx kdb+/q Automated Machine Learning library. The examples below use samples from the Telco Customer Churn dataset and the IMDB movie review dataset.\n",
"\n",
"\n",
"To run the below notebook, ensure that dependencies specified in requirements.txt have been correctly installed.\n",
@@ -29,7 +29,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The Machine Learning Toolkit ([ML-Toolkit](https://github.com/KxSystems/ml)) contains general use utilities, an implementation of the FRESH (Feature Extraction based on Scalable Hypothesis tests) algorithm and cross validation functions. The primary purpose of these libraries are to provide kdb+/q users with access to commonly-used ML functions for preprocessing data, extracting features and scoring results."
+ "The Machine Learning Toolkit ([ML-Toolkit](https://github.com/KxSystems/ml)) contains general-use utilities, an implementation of the FRESH (Feature Extraction based on Scalable Hypothesis tests) algorithm, cross-validation functions, clustering libraries and time series functionality. The primary purpose of these libraries is to provide kdb+/q users with access to commonly used ML functions for preprocessing data, extracting features and scoring results."
]
},
{
@@ -48,15 +48,16 @@
"- Data preprocessing\n",
"- Feature engineering and feature selection\n",
"- Model selection\n",
- "- Hyperparameter Tuning\n",
+ "- Hyperparameter tuning\n",
"- Report generation and model persistence\n",
"\n",
"Each of these steps is outlined in depth within the documentation for this platform [here](https://code.kx.com/q/ml/automl). This allows users to understand the processes by which decisions are being made and the transformations which their data undergo during the production of the output models.\n",
"\n",
- "At present the supported machine learning problem types are classification and regression and based on:\n",
+ "At present, the supported machine learning problem types are classification and regression tasks based on:\n",
"\n",
"- One-to-one feature to target non time-series\n",
"- FRESH based feature extraction and model production\n",
+ "- NLP-based feature creation and word2vec transformation\n",
"\n",
"The problems which can be solved by this framework will be expanded over time as will the available functionality."
]
@@ -67,7 +68,7 @@
"source": [
"### Multi-processing\n",
"\n",
- "This library supports multi-processed grid-search/cross-validation procedures and FRESH feature creation provided a user set `-s -8` in the JUPYTERQ_SERVERARGS, access to which can be found [here](https://code.kx.com/q/ml/jupyterq/notebooks/#server-command-line-arguments). In this demo, we use 8 worker processes and open a centralised port as below."
+ "This library supports multi-processed grid-search/cross-validation procedures and FRESH feature creation, provided a user sets `-s -8` in the JUPYTERQ_SERVERARGS entry of the appropriate JSON file. Instructions for doing so can be found [here](https://code.kx.com/q/ml/jupyterq/notebooks/#server-command-line-arguments). In this demo, we use 8 worker processes and open a centralised port as below."
]
},
{
@@ -92,11 +93,24 @@
"metadata": {
"scrolled": true
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Documentation can be found at https://code.kx.com/q/ml/automl/\n"
+ ]
+ }
+ ],
"source": [
"// load in automl\n",
"\\l automl/automl.q\n",
- ".automl.loadfile`:init.q"
+ ".automl.loadfile`:init.q\n",
+ "\n",
+ "// load utils\n",
+ "\\l ../utils/util.q\n",
+ "\\l ../utils/graphics.q"
]
},
{
@@ -125,13 +139,6 @@
"---"
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Default Configurations"
- ]
- },
{
"cell_type": "markdown",
"metadata": {},
@@ -143,9 +150,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The [Telco Customer Churn dataset](https://www.kaggle.com/blastchar/telco-customer-churn/data) contains entries for 7043 customers. In each case below, we aim to create a model which can accurately predict customer churn based on 20 features relating to each customer.\n",
+ "The [Telco Customer Churn dataset](https://www.kaggle.com/blastchar/telco-customer-churn/data) contains the following information:\n",
+ "* Data, provided by IBM, on 7,043 customers of a telecom provider\n",
+ "* Customer feature information including\n",
+ " * What form of internet does the user have (DSL/Fiber Optic)?\n",
+ " * What are the user's monthly payments?\n",
+ " * How long has the customer been in their contract?\n",
+ " * What services does the customer use (e.g. phone, internet, online backup, streaming)?\n",
+ "* A target variable 'Churn' indicating if a user has cancelled their contract in the last month\n",
"\n",
- "Below we load in the data and select a subset of 5000 random data points to train and test the pipeline on. We also load in additional graphics and utility functions required throughout this notebook."
+ "\n",
+ "In each of the examples below, we aim to create a model which can accurately predict customer churn based on 20 features relating to each customer."
]
},
{
@@ -159,17 +174,6 @@
"cell_type": "code",
"execution_count": 3,
"metadata": {},
- "outputs": [],
- "source": [
- "// load utils\n",
- "\\l ../utils/util.q\n",
- "\\l ../utils/graphics.q"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
"outputs": [
{
"name": "stdout",
@@ -220,12 +224,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "As we want to run both `.automl.run` and `.automl.new` we start by splitting our data into a training and testing set, where 10% has been chosen for the testing set. Note that we have set a random seed so that results can be replicated."
+ "In order to test both the model generation and prediction steps of the workflow, we split the dataset into a training set and a testing set as follows:\n",
+ "\n",
+ "| Dataset form | Purpose | Percentage (%)|\n",
+ "|--------------|:-------------------------------------------------------------------|---------------|\n",
+ "| Training | Generate model for deployment using `.automl.fit` | 90 |\n",
+ "| Testing | Independent dataset to test application of `predict` functionality | 10 |\n",
+ "\n",
+ "__*Note:*__ \n",
+ "\n",
+ " We have set a random seed so that results can be replicated."
]
},
{
"cell_type": "code",
- "execution_count": 5,
+ "execution_count": 4,
"metadata": {},
"outputs": [
{
@@ -248,19 +261,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "### User Interface"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "At the highest level the automated machine learning library contains two primary callable functions:\n",
- "\n",
- "- `.automl.run` = Run the automated machine learning pipeline on user defined data and target\n",
- "- `.automl.new` = Using a previously fit model and set of instructions to produce an appropriate pipeline derived from a defined run, predict the target value for new tabular data\n",
- "\n",
- "Both of these functions are modifiable by a user to suit specific use cases and have been designed to cover a wide range of functional options and to be extensible to a users needs."
+ "## Default Configurations"
]
},
{
@@ -274,61 +275,49 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Below we demonstrate how to apply `.automl.run` to our features and targets in the default setting, where the function has the syntax:\n",
- "\n",
- "```.automl.run[tab;tgt;ftype;ptype;dict]```\n",
+ "The automated machine learning pipeline will use the training features (`xtrain`) and targets (`ytrain`) from `telcoInputs` above as input to `.automl.fit`. \n",
"\n",
- "Where:\n",
- "- `tab` is unkeyed tabular data from which the models will be created\n",
- "- `tgt` is the target vector\n",
- "- `ftype` type of feature extraction being completed on the dataset as a symbol (``` `fresh```/``` `normal```)\n",
- "- `ptype` type of problem, regression/class, as a symbol (``` `reg```/``` `class```)\n",
- "- `dict` is one of `(::)` for default behaviour, a kdb+ dictionary or path to a user defined flat file for modifying default parameters.\n",
+ "Appropriate preprocessing steps, including feature creation and selection, will be applied to the data before it is passed to a variety of machine learning models, from which the best-performing model is chosen. \n",
"\n",
"In this case, we select ``` `normal``` feature extraction as we have a 1-to-1 mapping between features and targets. We also use ``` `class``` for the problem type as we are dealing with a binary classification problem.\n",
"\n",
- "**NB:** For the purposes of this demonstration we will pass in a dictionary in place of the default parameter `(::)`. In order to ensure replication for users of this notebook the random seed parameter ``` `seed``` is set in this example with the remaining parameters defaulted."
+ "**Important:** \n",
+ "\n",
+ " For the purposes of this demonstration we will pass in a dictionary in place of the default parameter `(::)`. In order to ensure replication for users of this notebook, the random seed parameter ``` `seed``` is set in this example with the remaining parameters defaulted."
]
},
{
"cell_type": "code",
- "execution_count": 6,
+ "execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
- "tab:telcoInputs`xtrain / features\n",
- "tgt:telcoInputs`ytrain / targets\n",
- "ftype:`normal / normal feature extraction\n",
- "ptype:`class / classification problem\n",
- "dict:enlist[`seed]!enlist 42 / default configuration"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Outputs"
+ "telcoFeats :telcoInputs`xtrain / features\n",
+ "telcoTarget :telcoInputs`ytrain / targets\n",
+ "featureType1:`normal / normal feature extraction\n",
+ "problemType1:`class / classification problem\n",
+ "paramDict1 :enlist[`seed]!enlist 350 / default configuration with a fixed random seed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "In the default configuration, the following items will be returned/saved during an individual run:\n",
+ "In the default configuration, information generated during the fitting of the model will be saved to the outputs folder. This includes metadata, graphs, reports and the fitted model.\n",
"\n",
- "- The best model, saved as a hdf5 file for keras models, or \"pickled\" byte objects for sklearn models.\n",
- "- A saved report indicating the procedure taken and scores achieved.\n",
- "- A saved byte encoded dictionary denoting the procedure to be taken for reproducing results or running on new data.\n",
- "- Results from each step of the pipeline published to console.\n",
+ "In addition to saving outputs, the function returns a dictionary with two keys:\n",
"\n",
- "In addition to the saved outputs, the function will also return the date and time of the current run. This allows users to run the best model from a defined run on new data by passing the date and time to `.automl.new` (see example [below](#Test-on-new-data)).\n",
+ " Return key | Description\n",
+ "-------------|:-------------\n",
+ " `modelInfo` | Metadata information generated from the pipeline such as preprocessing steps taken, significant features chosen and any other information needed to replicate the results.\n",
+ " `predict` | A function containing all relevant information and procedures required to generate new predictions using the fitted model\n",
"\n",
- "We can now run `.aml.run` using the default setting with out training set from the Telco Customer Churn dataset."
+ "We can now run `.automl.fit` using the default settings with our training set from the Telco Customer Churn dataset."
]
},
{
"cell_type": "code",
- "execution_count": 7,
+ "execution_count": 6,
"metadata": {
"scrolled": false
},
@@ -337,62 +326,131 @@
"name": "stdout",
"output_type": "stream",
"text": [
+ "Executing node: automlConfig\n",
+ "Executing node: configuration\n",
+ "Executing node: targetDataConfig\n",
+ "Executing node: targetData\n",
+ "Executing node: featureDataConfig\n",
+ "Executing node: featureData\n",
+ "Executing node: dataCheck\n",
+ "Executing node: featureDescription\n",
"\n",
"The following is a breakdown of information for each of the relevant columns in the dataset\n",
"\n",
- " | count unique mean std min max type\n",
- "-------| ----------------------------------\n",
- "comment| 1350 1350 :: :: :: :: text\n",
+ "\n",
+ " | count unique mean std min max type \n",
+ "------ | --------------------------------------------------------\n",
+ "tenure | 4500 73 32.326 24.55931 0i 72i numeric \n",
+ "MonthlyCharges | 4500 1251 64.88498 30.49795 18.55 118.75 numeric \n",
+ "TotalCharges | 4500 3178 2284.252 2275.078 18.85 8672.45 numeric \n",
+ "customerID | 4500 3310 :: :: :: :: categorical\n",
+ "gender | 4500 2 :: :: :: :: categorical\n",
+ "Partner | 4500 2 :: :: :: :: categorical\n",
+ "Dependents | 4500 2 :: :: :: :: categorical\n",
+ "PhoneService | 4500 2 :: :: :: :: categorical\n",
+ "MultipleLines | 4500 3 :: :: :: :: categorical\n",
+ "InternetService | 4500 3 :: :: :: :: categorical\n",
+ "OnlineSecurity | 4500 3 :: :: :: :: categorical\n",
+ "OnlineBackup | 4500 3 :: :: :: :: categorical\n",
+ "DeviceProtection| 4500 3 :: :: :: :: categorical\n",
+ "TechSupport | 4500 3 :: :: :: :: categorical\n",
+ "StreamingTV | 4500 3 :: :: :: :: categorical\n",
+ "StreamingMovies | 4500 3 :: :: :: :: categorical\n",
+ "Contract | 4500 3 :: :: :: :: categorical\n",
+ "PaperlessBilling| 4500 2 :: :: :: :: categorical\n",
+ "PaymentMethod | 4500 4 :: :: :: :: categorical\n",
+ "SeniorCitizen | 4500 2 :: :: :: :: boolean \n",
+ "\n",
+ "\n",
+ "Executing node: dataPreprocessing\n",
"\n",
"Data preprocessing complete, starting feature creation\n",
"\n",
- "Feature creation and significance testing complete\n",
+ "Executing node: featureCreation\n",
+ "Executing node: labelEncode\n",
+ "Executing node: featureSignificance\n",
+ "\n",
+ "Total number of significant features being passed to the models = 40\n",
+ "\n",
+ "Executing node: trainTestSplit\n",
+ "Executing node: modelGeneration\n",
+ "Executing node: selectModels\n",
+ "\n",
"Starting initial model selection - allow ample time for large datasets\n",
"\n",
- "Total features being passed to the models = 88\n",
+ "Executing node: runModels\n",
"\n",
- "Scores for all models, using .ml.accuracy\n",
- "RandomForestClassifier | 0.7546512\n",
- "GradientBoostingClassifier| 0.7512166\n",
- "AdaBoostClassifier | 0.748891\n",
- "SVC | 0.7476879\n",
- "LinearSVC | 0.746552\n",
- "MLPClassifier | 0.7454093\n",
- "LogisticRegression | 0.7430972\n",
- "binarykeras | 0.7338083\n",
- "KNeighborsClassifier | 0.7291504\n",
- "GaussianNB | 0.620453\n",
+ "Scores for all models using .ml.accuracy\n",
+ "\n",
+ "\n",
+ "RandomForestClassifier | 0.8482639\n",
+ "LogisticRegression | 0.7993056\n",
+ "GradientBoostingClassifier| 0.7989583\n",
+ "AdaBoostClassifier | 0.7920139\n",
+ "KNeighborsClassifier | 0.7666667\n",
+ "MLPClassifier | 0.7493056\n",
+ "GaussianNB | 0.7451389\n",
+ "SVC | 0.7225694\n",
+ "BinaryKeras | 0.7038194\n",
+ "LinearSVC | 0.59375\n",
"\n",
- "Best scoring model = RandomForestClassifier\n",
- "Score for validation predictions using best model = 0.6944444\n",
"\n",
"\n",
- "Feature impact calculated for features associated with RandomForestClassifier model\n",
- "Plots saved in /outputs/2020.09.22/run_13.34.41.579/images/\n",
+ "Best scoring model = RandomForestClassifier\n",
+ "\n",
+ "Executing node: optimizeModels\n",
"\n",
"Continuing to hyperparameter search and final model fitting on testing set\n",
"\n",
- "Best model fitting now complete - final score on testing set = 0.7407407\n",
+ "\n",
+ "Best model fitting now complete - final score on testing set = 0.8655556\n",
+ "\n",
"\n",
"Confusion matrix for testing set:\n",
"\n",
- " | pred_0 pred_1\n",
+ "\n",
+ " | true_0 true_1\n",
"------| -------------\n",
- "true_0| 86 35 \n",
- "true_1| 35 114 \n",
+ "pred_0| 619 49 \n",
+ "pred_1| 72 160 \n",
+ "\n",
+ "\n",
+ "Executing node: predictParams\n",
+ "Executing node: preprocParams\n",
+ "Executing node: pathConstruct\n",
+ "Executing node: saveGraph\n",
+ "\n",
+ "Saving down graphs to /Users/dianeodonoghue/q/automl/outputs/dateTimeModels/2020.12.22/run_13.23.40.301/images/\n",
+ "\n",
+ "Executing node: saveReport\n",
"\n",
- "Saving down procedure report to /outputs/2020.09.22/run_13.34.41.579/report/\n",
- "Saving down RandomForestClassifier model to /outputs/2020.09.22/run_13.34.41.579/models/\n",
- "Saving down model parameters to /outputs/2020.09.22/run_13.34.41.579/config/\n",
+ "Saving down procedure report to /Users/dianeodonoghue/q/automl/outputs/dateTimeModels/2020.12.22/run_13.23.40.301/report/\n",
"\n",
- ".automl.run took 00:02:30.535\n"
+ "Executing node: saveMeta\n",
+ "\n",
+ "Saving down model parameters to /Users/dianeodonoghue/q/automl/outputs/dateTimeModels/2020.12.22/run_13.23.40.301/config/\n",
+ "\n",
+ "Executing node: saveModels\n",
+ "\n",
+ "Saving down model to /Users/dianeodonoghue/q/automl/outputs/dateTimeModels/2020.12.22/run_13.23.40.301/models/\n",
+ "\n",
+ "\n",
+ ".automl.fit took 00:00:37.773\n",
+ "\n",
+ "Return of .automl.fit:\n",
+ "modelInfo| `startDate`startTime`featureExtractionType`problemType`saveOption`..\n",
+ "predict | {[config;features]\n",
+ " original_print:utils.printing;\n",
+ " utils.printi..\n",
+ "\n"
]
}
],
"source": [
"start:.z.t\n",
- "r1:.automl.run[tab;tgt;ftype;ptype;dict]\n",
- "-1\"\\n.automl.run took \",string .z.t-start;"
+ "model1:.automl.fit[telcoFeats;telcoTarget;featureType1;problemType1;paramDict1]\n",
+ "-1\"\\n.automl.fit took \",string .z.t-start;\n",
+ "-1\"\\nReturn of .automl.fit:\\n\",.Q.s[model1];"
]
},
{
@@ -403,19 +461,17 @@
"\n",
"\n",
"\n",
- "We see that in the above example, 8 features were passed to the model following the application of feature extraction and significance testing. \n",
+ "We see that in the above example, 40 features were passed to the model following the application of feature extraction and significance testing. \n",
"\n",
- "**NB:** In the default case, normal feature extraction only uses the original features passed into the system, while FRESH feature extraction would apply the functions available for FRESH within the ML-Toolkit as defined by `.ml.fresh.params`.\n",
- "\n",
- "Looking at the feature impact above, we can see that `tenure` had the highest feature impact in the dataset when applied to the best model.\n",
+ "Looking at the feature impact plot above, we can see that `Contract_One year` had the highest feature impact in the dataset when applied to the best model, indicating this was the most important feature when generating predictions.\n",
"\n",
"#### Confusion matrix\n",
"\n",
"\n",
"\n",
- "A confusion matrix is also produced within the pipeline for classification problems. We see that the final `RandomForestClassifier` model correctly classified 724 data points. \n",
+ "A confusion matrix is also produced within the pipeline for classification problems. We see that the final `RandomForestClassifier` model correctly classified 779 out of 900 data points. \n",
"\n",
- "All other outputs from this run have been stored in a directory of format `/outputs/date/run_time/`"
+ "All other outputs from this run have been stored in a directory of format `/outputs/dateTimeModels/date/run_time/`"
]
},
{
@@ -429,38 +485,26 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We can apply the workflow and fitted model associated with our specified run to new data using:\n",
- "\n",
- "```.automl.new[tab;dt;tm]```\n",
- "\n",
- "Where:\n",
- "\n",
- "- `tab` is an unkeyed tabular dataset which has the same schema as the input data from the run specified in fpath\n",
- "- `dt` is the date for a specified run as a date `yyyy.mm.dd` or a string of format `\"yyyy.mm.dd\"`\n",
- "- `tm` is the timestamp for a specified run as a timestamp `hh:mm:ss.xxx` or a string of format `\"hh:mm:ss.xxx\"` or `\"hh.mm.ss.xxx\"` \n",
- "\n",
- "**NB:** Outputs from previous runs, such as `models` or `config`, are stored in the `outputs` directory and are organised such that we have the following file structure: `outputs/dt/run_tm/`, e.g. `outputs/2001.01.01/run_12.00.00.000\"`.\n",
+ "We can apply the workflow associated with our specified run to new data using the `predict` function returned by `.automl.fit`.\n",
"\n",
- "The function will return the target predictions for new data based on the previously fitted model and workflow.\n",
- "\n",
- "Below we apply `.automl.new` to the test set we created earlier and pass in the date and time of the previous run. This will apply the best model from the run above to our new data:"
+ "The function will return the target predictions for new data based on the previously fitted model and workflow."
]
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Run applied to dataset:\n",
+ "Model applied to dataset:\n",
"\n",
- "Run date: 2020.09.22. Run time: 13:10:17.224.\n",
+ "Model date: 2020.12.22. Model time: 13:23:40.301.\n",
"\n",
"Predictions: \n",
- "0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 1 0..\n",
+ "0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0..\n",
"\n",
"Targets:\n",
"0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0..\n"
@@ -468,10 +512,10 @@
}
],
"source": [
- "-1\"Run applied to dataset:\\n\";\n",
- ".util.print_runid . r1;\n",
+ "-1\"Model applied to dataset:\\n\";\n",
+ ".util.printDateTimeId model1.modelInfo;\n",
"-1\"\\nPredictions: \";\n",
- "show pred:.automl.new[telcoInputs`xtest]. r1\n",
+ "show pred1:model1.predict[telcoInputs`xtest]\n",
"-1\"\\nTargets:\";\n",
"show telcoInputs`ytest"
]
@@ -485,30 +529,30 @@
},
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Accuracy on model run using hold-out data: 0.816\n"
+ "Accuracy on model run using hold-out data: 0.866\n"
]
}
],
"source": [
- "-1\"Accuracy on model run using hold-out data: \",string acc1:.ml.accuracy[telcoInputs`ytest;pred];"
+ "-1\"Accuracy on model run using hold-out data: \",string accuracy1:.ml.accuracy[telcoInputs`ytest;pred1];"
]
},
{
"cell_type": "code",
- "execution_count": 10,
+ "execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "
"
+ ""
]
},
"metadata": {},
@@ -516,7 +560,7 @@
},
{
"data": {
- "image/png": "",
+ "image/png": "",
"text/plain": [
""
]
@@ -526,7 +570,7 @@
}
],
"source": [
- "displayCM[value .ml.confmat[telcoInputs`ytest;pred];`0`1;\"Test Set Confusion Matrix\";()];"
+ ".util.displayCM[value .ml.confmat[telcoInputs`ytest;pred1];`0`1;\"Test Set Confusion Matrix\";()];"
]
},
{
@@ -547,7 +591,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The function `.automl.run` can also be applied to textual data using its default configuration.\n",
+ "The function `.automl.fit` can also be applied to text data using its default configuration.\n",
"\n",
"As with the example above, data must be presented with a 1-to-1 mapping between features and targets.\n",
"\n",
@@ -560,7 +604,7 @@
"source": [
"### IMDB Dataset\n",
"\n",
- "The [IMBD](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) dataset contains reviews of over 50,000 movies for NLP or text analytics. The dataset consists of 2 columns, containing textual reviews and their associated positive or negative sentiment classification."
+ "The [IMDB](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) dataset contains 50,000 movie reviews for NLP or text analysis. The dataset consists of 2 columns, containing the text of each review and a target indicating whether the review is positive or negative."
]
},
{
@@ -579,14 +623,14 @@
},
{
"cell_type": "code",
- "execution_count": 11,
+ "execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Shape of feature data is: 1500 x 1 x 585\n",
+ "Shape of feature data is: 1000 x 1 x 585\n",
"\n",
"comment \n",
"-------------------------------------------------------------------------------\n",
@@ -598,16 +642,16 @@
"\n",
"Distribution of target values:\n",
"\n",
- "target| num pcnt \n",
- "------| ---------\n",
- "0 | 740 49.33\n",
- "1 | 760 50.67\n"
+ "target| num pcnt\n",
+ "------| --------\n",
+ "0 | 477 47.7\n",
+ "1 | 523 52.3\n"
]
}
],
"source": [
"// load data\n",
- "imdbData:1500#(\"SI\";enlist \",\")0:`:../data/IMBD.csv\n",
+ "imdbData:1000#(\"SI\";enlist \",\")0:`:../data/IMBD.csv\n",
"\n",
"// convert text data to string\n",
"imdbData:update string each comment from imdbData\n",
@@ -627,22 +671,22 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We now split the data into training and testing sets to be used with `.automl.run` and `.automl.new`."
+ "We now split the data into a training set, to be used with `.automl.fit`, and an independent testing set, to which the `predict` function will be applied."
]
},
{
"cell_type": "code",
- "execution_count": 12,
+ "execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "xtrain| +(,`comment)!,(\"Three years ago, Rachel(Therese Fretwell) was partyin..\n",
- "ytrain| 0 1 0 1 0 1 0 1 1 0 0 0 0 0 1 0 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1..\n",
- "xtest | +(,`comment)!,(\"CAROL'S JOURNEY is a pleasure to watch for so many re..\n",
- "ytest | 1 0 1 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 0 1 1 1 1..\n"
+ "xtrain| +(,`comment)!,(\"The creativeness of this movie was lost from the begi..\n",
+ "ytrain| 0 0 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 0 0 1 1 1 0 0 0 0 1 0 1 1 1 1 0..\n",
+ "xtest | +(,`comment)!,(\"I'm watching this on the Star World network overseas ..\n",
+ "ytest | 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 0 0 1 1..\n"
]
}
],
@@ -663,34 +707,27 @@
"source": [
"The below example demonstrates a binary classification problem. Notice that this time `nlp` is being passed as the feature extraction type.\n",
"\n",
- "The default configuration for AutoML will again be used, with a random seed included so that results can be replicated."
+ "The default parameters are modified slightly: the model is saved under the name `nlpModelNotebook`, and the `overWriteFiles` parameter is set to `1b` so that users can run this notebook multiple times, overwriting the saved model on each run. A random seed is again set so that results can be replicated."
]
},
{
"cell_type": "code",
- "execution_count": 13,
+ "execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
- "tab:imdbInputs`xtrain / features\n",
- "tgt:imdbInputs`ytrain / targets\n",
- "ftype:`nlp / NLP feature extraction\n",
- "ptype:`class / classification problem\n",
- "dict:enlist[`seed]!enlist 168 / default configuration"
+ "IMBDfeats :imdbInputs`xtrain / features\n",
+ "IMBDtarget :imdbInputs`ytrain / targets\n",
+ "featureType2:`nlp / NLP feature extraction\n",
+ "problemType2:`class / classification problem\n",
+ "paramDict2 :`savedModelName`overWriteFiles`seed!(`nlpModelNotebook;1b;100) / define name of model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Outputs"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can now run `automl.run` utilizing the NLP functionality."
+ "We can now run `.automl.fit` utilizing the NLP functionality."
]
},
{
@@ -704,7 +741,7 @@
},
{
"cell_type": "code",
- "execution_count": 14,
+ "execution_count": 13,
"metadata": {
"scrolled": false
},
@@ -713,62 +750,107 @@
"name": "stdout",
"output_type": "stream",
"text": [
+ "Executing node: automlConfig\n",
+ "Executing node: configuration\n",
+ "Executing node: targetDataConfig\n",
+ "Executing node: targetData\n",
+ "Executing node: featureDataConfig\n",
+ "Executing node: featureData\n",
+ "Executing node: dataCheck\n",
+ "\n",
+ "For full reproducibility between q processes of the NLP word2vec implementation, the PYTHONHASHSEED environment variable must be set upon initialization of q. See https://code.kx.com/q/ml/automl/ug/options/#seed for details.\n",
+ "\n",
+ "Executing node: featureDescription\n",
"\n",
"The following is a breakdown of information for each of the relevant columns in the dataset\n",
"\n",
+ "\n",
" | count unique mean std min max type\n",
"-------| ----------------------------------\n",
- "comment| 1350 1350 :: :: :: :: text\n",
+ "comment| 900 900 :: :: :: :: text\n",
+ "\n",
+ "\n",
+ "Executing node: dataPreprocessing\n",
"\n",
"Data preprocessing complete, starting feature creation\n",
"\n",
- "Feature creation and significance testing complete\n",
+ "Executing node: featureCreation\n",
+ "Executing node: labelEncode\n",
+ "Executing node: featureSignificance\n",
+ "\n",
+ "Total number of significant features being passed to the models = 254\n",
+ "\n",
+ "Executing node: trainTestSplit\n",
+ "Executing node: modelGeneration\n",
+ "Executing node: selectModels\n",
+ "\n",
"Starting initial model selection - allow ample time for large datasets\n",
"\n",
- "Total features being passed to the models = 88\n",
+ "Executing node: runModels\n",
+ "\n",
+ "Scores for all models using .ml.accuracy\n",
+ "\n",
+ "\n",
+ "RandomForestClassifier | 0.7586657\n",
+ "GradientBoostingClassifier| 0.7447826\n",
+ "SVC | 0.736027\n",
+ "MLPClassifier | 0.7342579\n",
+ "AdaBoostClassifier | 0.7327286\n",
+ "LinearSVC | 0.7291904\n",
+ "KNeighborsClassifier | 0.7186657\n",
+ "LogisticRegression | 0.709985\n",
+ "BinaryKeras | 0.6977961\n",
+ "GaussianNB | 0.6910795\n",
"\n",
- "Scores for all models, using .ml.accuracy\n",
- "RandomForestClassifier | 0.7546512\n",
- "GradientBoostingClassifier| 0.7512166\n",
- "AdaBoostClassifier | 0.748891\n",
- "SVC | 0.7476879\n",
- "LinearSVC | 0.746552\n",
- "MLPClassifier | 0.7454093\n",
- "LogisticRegression | 0.7430972\n",
- "binarykeras | 0.7338083\n",
- "KNeighborsClassifier | 0.7291504\n",
- "GaussianNB | 0.620453\n",
"\n",
- "Best scoring model = RandomForestClassifier\n",
- "Score for validation predictions using best model = 0.6944444\n",
"\n",
+ "Best scoring model = RandomForestClassifier\n",
"\n",
- "Feature impact calculated for features associated with RandomForestClassifier model\n",
- "Plots saved in /outputs/2020.09.22/run_13.10.48.282/images/\n",
+ "Executing node: optimizeModels\n",
"\n",
"Continuing to hyperparameter search and final model fitting on testing set\n",
"\n",
- "Best model fitting now complete - final score on testing set = 0.7407407\n",
+ "\n",
+ "Best model fitting now complete - final score on testing set = 0.7611111\n",
+ "\n",
"\n",
"Confusion matrix for testing set:\n",
"\n",
- " | pred_0 pred_1\n",
+ "\n",
+ " | true_0 true_1\n",
"------| -------------\n",
- "true_0| 86 35 \n",
- "true_1| 35 114 \n",
+ "pred_0| 56 27 \n",
+ "pred_1| 16 81 \n",
+ "\n",
+ "\n",
+ "Executing node: predictParams\n",
+ "Executing node: preprocParams\n",
+ "Executing node: pathConstruct\n",
+ "Executing node: saveGraph\n",
+ "\n",
+ "Saving down graphs to /Users/dianeodonoghue/q/automl/outputs/namedModels/nlpModelNotebook/images/\n",
+ "\n",
+ "Executing node: saveReport\n",
+ "\n",
+ "Saving down procedure report to /Users/dianeodonoghue/q/automl/outputs/namedModels/nlpModelNotebook/report/\n",
+ "\n",
+ "Executing node: saveMeta\n",
+ "\n",
+ "Saving down model parameters to /Users/dianeodonoghue/q/automl/outputs/namedModels/nlpModelNotebook/config/\n",
"\n",
- "Saving down procedure report to /outputs/2020.09.22/run_13.10.48.282/report/\n",
- "Saving down RandomForestClassifier model to /outputs/2020.09.22/run_13.10.48.282/models/\n",
- "Saving down model parameters to /outputs/2020.09.22/run_13.10.48.282/config/\n",
+ "Executing node: saveModels\n",
"\n",
- ".automl.run took 00:02:15.215\n"
+ "Saving down model to /Users/dianeodonoghue/q/automl/outputs/namedModels/nlpModelNotebook/models/\n",
+ "\n",
+ "\n",
+ ".automl.fit took 00:01:17.022\n"
]
}
],
"source": [
"start:.z.t\n",
- "r2:.automl.run[tab;tgt;ftype;ptype;dict]\n",
- "-1\"\\n.automl.run took \",string .z.t-start;"
+ ".automl.fit[IMBDfeats;IMBDtarget;featureType2;problemType2;paramDict2];\n",
+ "-1\"\\n.automl.fit took \",string .z.t-start;"
]
},
{
@@ -781,15 +863,15 @@
"\n",
"From the above example, we can see that even though one feature was passed to the model, multiple features were created using the `nlp` feature creation methods. If any additional non-textual data had been present, the `normal` feature creation procedures would have been applied to it. \n",
"\n",
- "Looking at the feature impact above, we can see that the features created by the `word2vec` module (`colx`) were deemed to be the most important \n",
+ "Looking at the feature impact above, we can see that the majority of the features created by the `word2vec` module (`colx`) were deemed to be important, along with various features created using the spaCy NLP library.\n",
"\n",
"#### Confusion matrix\n",
"\n",
" \n",
"\n",
- "A confusion matrix is also produced within the pipeline for classification problems. We see that the final `MLPClassifier` model correctly classified 284 data points. \n",
+ "A confusion matrix is also produced within the pipeline for classification problems. We see that the final `RandomForestClassifier` model correctly classified 137 out of 180 data points. \n",
"\n",
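+ "As a quick sanity check, the accuracy implied by a confusion matrix can be recomputed by hand. The snippet below is a minimal sketch in plain q (the names `cm`, `correct` and `total` are illustrative and not part of the AutoML library); it hard-codes the testing-set matrix printed in the pipeline output, with rows as predictions and columns as true labels.\n",
+ "\n",
+ "```q\n",
+ "cm:(56 27;16 81)        / rows: pred_0,pred_1; columns: true_0,true_1\n",
+ "correct:sum cm@'til 2   / diagonal entries, 56+81\n",
+ "total:sum raze cm       / all testing-set data points\n",
+ "correct%total           / fraction classified correctly\n",
+ "```\n",
+ "\n",
+ "The resulting ratio agrees with the testing-set score reported in the pipeline output.\n",
+ "\n",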
- "All other outputs from this run have been stored in a directory of format `/outputs/date/run_time/`"
+ "All other outputs from this run have been stored in a directory of the format `/outputs/namedModels/modelName/`."
]
},
{
@@ -803,7 +885,34 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The best model created within `automl.run` , is applied to the unseen test data to evaluate the models performance"
+ "A saved model can be retrieved from disk using `.automl.getModel`, which returns the model metadata together with a prediction function that can be applied to new data. The desired model is identified by passing either its name or its date and time."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "modelInfo| `modelLib`modelFunc`startDate`startTime`featureExtractionType`prob..\n",
+ "predict | {[config;features]\n",
+ " original_print:utils.printing;\n",
+ " utils.printi..\n"
+ ]
+ }
+ ],
+ "source": [
+ "show model2:.automl.getModel[enlist[`savedModelName]!enlist \"nlpModelNotebook\"]"
+ ]
+ },
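+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A model saved without a name can, in principle, be retrieved in the same way by date and time. The following is a hedged sketch rather than an executed cell: it assumes that `.automl.getModel` accepts a dictionary keyed on `startDate` and `startTime` (the key names that appear in `modelInfo` above), and that a run with the date and time shown exists under the outputs directory. The names `modelDetails` and `modelByTime` are illustrative.\n",
+ "\n",
+ "```q\n",
+ "// assumed keys: startDate and startTime; the run must already be saved to disk\n",
+ "modelDetails:`startDate`startTime!(2020.12.22;13:23:40.301)\n",
+ "modelByTime:.automl.getModel modelDetails\n",
+ "```"
+ ]
+ },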
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The best model created within `.automl.fit` is applied to the unseen test data to evaluate the model's performance."
]
},
{
@@ -815,23 +924,23 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Run applied to dataset:\n",
+ "Model applied to dataset:\n",
"\n",
- "Run date: 2020.09.22. Run time: 13:10:48.282.\n",
+ "Model Name: nlpModelNotebook.\n",
"\n",
"Predictions: \n",
- "1 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 0..\n",
+ "1 1 1 1 0 1 1 0 1 0 0 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 1 1 0 1 0 1..\n",
"\n",
"Targets:\n",
- "1 0 1 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 0 1 1 1 1 0 1 1 1..\n"
+ "1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1 1..\n"
]
}
],
"source": [
- "-1\"Run applied to dataset:\\n\";\n",
- ".util.print_runid . r2;\n",
+ "-1\"Model applied to dataset:\\n\";\n",
+ ".util.printSavedModelId model2.modelInfo;\n",
"-1\"\\nPredictions: \";\n",
- "show imdbPred:.automl.new[imdbInputs`xtest]. r2\n",
+ "show pred2:model2.predict[imdbInputs`xtest]\n",
"-1\"\\nTargets:\";\n",
"show imdbInputs`ytest"
]
@@ -845,12 +954,12 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Accuracy on model run using hold-out data: 0.6933333\n"
+ "Accuracy on model run using hold-out data: 0.78\n"
]
}
],
"source": [
- "-1\"Accuracy on model run using hold-out data: \",string acc2:.ml.accuracy[imdbInputs`ytest;imdbPred];"
+ "-1\"Accuracy on model run using hold-out data: \",string accuracy2:.ml.accuracy[imdbInputs`ytest;pred2];"
]
},
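+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Accuracy here is the proportion of predictions that match their targets. As an illustrative sketch in plain q, assuming `.ml.accuracy` follows this standard definition, the same quantity can be computed on a toy example (the vectors `toyPred` and `toyTarg` are made up for illustration):\n",
+ "\n",
+ "```q\n",
+ "toyPred:1 1 0 1 0\n",
+ "toyTarg:1 0 0 1 1\n",
+ "avg toyPred=toyTarg   / 3 of 5 entries match, giving 0.6\n",
+ "```"
+ ]
+ },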
{
@@ -861,7 +970,7 @@
{
"data": {
"text/plain": [
- ""
+ ""
]
},
"metadata": {},
@@ -869,7 +978,7 @@
},
{
"data": {
- "image/png": "",
+ "image/png": "",
"text/plain": [
""
]
@@ -879,7 +988,7 @@
}
],
"source": [
- "displayCM[value .ml.confmat[imdbInputs`ytest;imdbPred];`0`1;\"Test Set Confusion Matrix\";()];"
+ ".util.displayCM[value .ml.confmat[imdbInputs`ytest;pred2];`0`1;\"Test Set Confusion Matrix\";()];"
]
},
{
@@ -900,44 +1009,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "In the previous section of the notebook, we showcased how to apply default parameters within the pipeline (excluding the random seed). In this section we will focus on how the final parameter of `.automl.run` can be modified to apply changes to the default behaviour.\n",
- "\n",
- "There are two options for how this final parameter can be input:\n",
- "- **kdb+ dictionary** outlining the changes to default behaviour that are to be made\n",
- "- The path to a **flat file** containing more human readable configuration updates."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Advanced parameters"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The following lists the parameters which can be altered by users to modify the functionality of the automl platform. In each case, the parameter name corresponds to the kdb+ dictionary key which would be passed, alongside its user defined value, to the `.automl.run` function in order to update functionality.\n",
- "\n",
- "Parameters:\n",
- "\n",
- "```txt\n",
- "aggcols Aggregation columns for FRESH\n",
- "funcs Functions to be applied for feature extraction\n",
- "gs Grid search function and no. of folds/percentage of data in validation set\n",
- "hld Size of the testing set on which the final model is tested\n",
- "hp Type of hyperparameter search to perform - `grid`random`sobol\n",
- "rs Random search function and no. of folds/percentage of data in validation set\n",
- "saveopt Saving options outlining what is to be saved to disk from a run\n",
- "scf Scoring functions for classification/regression tasks\n",
- "seed Random seed to be used\n",
- "sigfeats Feature significance procedure to be applied to the data\n",
- "sz Size of validation set used.\n",
- "trials Number of random/Sobol-random hyperparameters to generate\n",
- "tts Train-test split function to be applied\n",
- "xv Cross-validation function and # of folds/percentage of data in validation set\n",
- "```"
+ "In the previous section of the notebook, we showcased how to apply default parameters within the pipeline. In this section we will focus on how the final parameter of `.automl.fit` can be modified to apply changes to the default behaviour."
]
},
{
@@ -951,11 +1023,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "In this case we use the Telco dataset and alter the parameter dictionary `p` in the following ways:\n",
- "1. Added a **random seed**: Here we have altered the ``` `seed``` parameter to be `75`.\n",
- "2. Added **feature extraction**: As mentioned above, in the default setting no functions are applied to the table during feature extraction. Below we apply `.automl.prep.i.truncsvd` to the data, this is a truncated singular value decomposition outlined [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) and applied to all combinations of columns of type float.\n",
- "3. Changed the size of the **testing** and **holdout** sets to be 10% of the data at each stage.\n",
- "4. Changed the **hyperparameter search** type from the default of grid search to random search. Note that Sobol-random search is also available."
+ "Below we apply `.automl.featureCreation.normal.truncSingleDecomp` to the data. This is a truncated singular value decomposition, outlined [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), which is applied to all combinations of columns of type float.\n",
+ "\n",
+ "A random seed of `100` will also be set."
]
},
{
@@ -967,16 +1037,17 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "seed | 75\n",
- "funcs| `.automl.prep.i.truncsvd\n",
- "sz | 0.1\n",
- "hld | 0.1\n",
- "hp | `random\n"
+ "seed | 100\n",
+ "functions| {[features]\n",
+ " truncCols:.ml.i.fndcols[features;\"f\"];\n",
+ " truncCols@:..\n"
]
}
],
"source": [
- "show p:`seed`funcs`sz`hld`hp!(75;`.automl.prep.i.truncsvd;.1;.1;`random)"
+ "paramKeys:`seed`functions // parameter names to amend\n",
+ "paramVals:(100;.automl.featureCreation.normal.truncSingleDecomp) // amended values\n",
+ "show paramDict3:paramKeys!paramVals"
]
},
{
@@ -990,9 +1061,18 @@
"name": "stdout",
"output_type": "stream",
"text": [
+ "Executing node: automlConfig\n",
+ "Executing node: configuration\n",
+ "Executing node: targetDataConfig\n",
+ "Executing node: targetData\n",
+ "Executing node: featureDataConfig\n",
+ "Executing node: featureData\n",
+ "Executing node: dataCheck\n",
+ "Executing node: featureDescription\n",
"\n",
"The following is a breakdown of information for each of the relevant columns in the dataset\n",
"\n",
+ "\n",
" | count unique mean std min max type \n",
"------ | --------------------------------------------------------\n",
"tenure | 4500 73 32.326 24.55931 0i 72i numeric \n",
@@ -1016,62 +1096,95 @@
"PaymentMethod | 4500 4 :: :: :: :: categorical\n",
"SeniorCitizen | 4500 2 :: :: :: :: boolean \n",
"\n",
+ "\n",
+ "Executing node: dataPreprocessing\n",
+ "\n",
"Data preprocessing complete, starting feature creation\n",
"\n",
- "Feature creation and significance testing complete\n",
+ "Executing node: featureCreation\n",
+ "Executing node: labelEncode\n",
+ "Executing node: featureSignificance\n",
+ "\n",
+ "Total number of significant features being passed to the models = 1021\n",
+ "\n",
+ "Executing node: trainTestSplit\n",
+ "Executing node: modelGeneration\n",
+ "Executing node: selectModels\n",
+ "\n",
"Starting initial model selection - allow ample time for large datasets\n",
"\n",
- "Total features being passed to the models = 270\n",
+ "Executing node: runModels\n",
"\n",
- "Scores for all models, using .ml.accuracy\n",
- "RandomForestClassifier | 0.8403292\n",
- "GradientBoostingClassifier| 0.8101509\n",
- "AdaBoostClassifier | 0.7975309\n",
- "LogisticRegression | 0.7964335\n",
- "MLPClassifier | 0.7876543\n",
- "KNeighborsClassifier | 0.7805213\n",
- "binarykeras | 0.7780521\n",
- "SVC | 0.7572016\n",
- "LinearSVC | 0.7322359\n",
- "GaussianNB | 0.7308642\n",
+ "Scores for all models using .ml.accuracy\n",
+ "\n",
+ "\n",
+ "RandomForestClassifier | 0.8204861\n",
+ "GradientBoostingClassifier| 0.8065972\n",
+ "AdaBoostClassifier | 0.7982639\n",
+ "LogisticRegression | 0.7975694\n",
+ "KNeighborsClassifier | 0.771875\n",
+ "MLPClassifier | 0.7670139\n",
+ "GaussianNB | 0.7392361\n",
+ "SVC | 0.7309028\n",
+ "BinaryKeras | 0.6506944\n",
+ "LinearSVC | 0.6340278\n",
"\n",
- "Best scoring model = RandomForestClassifier\n",
- "Score for validation predictions using best model = 0.8691358\n",
"\n",
"\n",
- "Feature impact calculated for features associated with RandomForestClassifier model\n",
- "Plots saved in /outputs/2020.09.22/run_13.13.15.766/images/\n",
+ "Best scoring model = RandomForestClassifier\n",
+ "\n",
+ "Executing node: optimizeModels\n",
"\n",
"Continuing to hyperparameter search and final model fitting on testing set\n",
- "Number of distinct hp sets less than n, returning 222 sets.\n",
"\n",
- "Best model fitting now complete - final score on testing set = 0.8177778\n",
+ "\n",
+ "Best model fitting now complete - final score on testing set = 0.8644444\n",
+ "\n",
"\n",
"Confusion matrix for testing set:\n",
"\n",
- " | pred_0 pred_1\n",
+ "\n",
+ " | true_0 true_1\n",
"------| -------------\n",
- "true_0| 302 32 \n",
- "true_1| 50 66 \n",
+ "pred_0| 596 41 \n",
+ "pred_1| 81 182 \n",
+ "\n",
+ "\n",
+ "Executing node: predictParams\n",
+ "Executing node: preprocParams\n",
+ "Executing node: pathConstruct\n",
+ "Executing node: saveGraph\n",
+ "\n",
+ "Saving down graphs to /Users/dianeodonoghue/q/automl/outputs/dateTimeModels/2020.12.22/run_13.35.10.726/images/\n",
"\n",
- "Saving down procedure report to /outputs/2020.09.22/run_13.13.15.766/report/\n",
- "Saving down RandomForestClassifier model to /outputs/2020.09.22/run_13.13.15.766/models/\n",
- "Saving down model parameters to /outputs/2020.09.22/run_13.13.15.766/config/\n",
+ "Executing node: saveReport\n",
"\n",
- ".automl.run took 00:12:50.186\n"
+ "Saving down procedure report to /Users/dianeodonoghue/q/automl/outputs/dateTimeModels/2020.12.22/run_13.35.10.726/report/\n",
+ "\n",
+ "Executing node: saveMeta\n",
+ "\n",
+ "Saving down model parameters to /Users/dianeodonoghue/q/automl/outputs/dateTimeModels/2020.12.22/run_13.35.10.726/config/\n",
+ "\n",
+ "Executing node: saveModels\n",
+ "\n",
+ "Saving down model to /Users/dianeodonoghue/q/automl/outputs/dateTimeModels/2020.12.22/run_13.35.10.726/models/\n",
+ "\n",
+ "\n",
+ ".automl.fit took 00:02:01.191\n"
]
}
],
"source": [
"start:.z.t\n",
- "r3:.automl.run[telcoInputs`xtrain;telcoInputs`ytrain;`normal;`class;p]\n",
- "-1\"\\n.automl.run took \",string .z.t-start;"
+ "model3:.automl.fit[telcoInputs`xtrain;telcoInputs`ytrain;`normal;`class;paramDict3]\n",
+ "-1\"\\n.automl.fit took \",string .z.t-start;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
+ "#### Feature impact\n",
" "
]
},
@@ -1079,9 +1192,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We can see by looking at the feature impact that many of the most impactful features are now derived from those generated when `.automl.prep.i.truncsvd` was applied during feature extraction, this gives some insight into the potential benefit of this form of feature extraction. \n",
+ "Looking at the feature impact, we can see that a number of the most impactful features are now derived from those generated when `.automl.featureCreation.normal.truncSingleDecomp` was applied during feature extraction. This gives some insight into the potential benefit of this form of feature extraction. \n",
"\n",
- "While benefiting from increases in accuracy the addition of larger numbers of features can have the effect of slowing training time and scoring time which have have an impact in time critical use-cases.\n",
+ "While the model may benefit from an increase in accuracy, adding a larger number of features can slow both training and scoring, which may have an impact in time-critical use cases.\n",
"\n",
"We can now predict on the hold-out dataset in order to compare accuracy results to the default case."
]
@@ -1095,12 +1208,12 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Run applied to dataset:\n",
+ "Model applied to dataset:\n",
"\n",
- "Run date: 2020.09.22. Run time: 13:13:15.766.\n",
+ "Model date: 2020.12.22. Model time: 13:35:10.726.\n",
"\n",
"Predictions: \n",
- "0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0..\n",
+ "0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0..\n",
"\n",
"Targets:\n",
"0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0..\n"
@@ -1108,14 +1221,21 @@
}
],
"source": [
- "-1\"Run applied to dataset:\\n\";\n",
- ".util.print_runid . r3;\n",
+ "-1\"Model applied to dataset:\\n\";\n",
+ ".util.printDateTimeId model3.modelInfo;\n",
"-1\"\\nPredictions: \";\n",
- "show pred:.automl.new[telcoInputs`xtest]. r3\n",
+ "show pred3:model3.predict[telcoInputs`xtest]\n",
"-1\"\\nTargets:\";\n",
"show telcoInputs`ytest"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see that by adding feature extraction in the normal case, we have improved the accuracy on the held-out data slightly, from 0.866 to 0.878. This is highlighted in the confusion matrix below."
+ ]
+ },
{
"cell_type": "code",
"execution_count": 21,
@@ -1125,26 +1245,19 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Run date: 2020.09.22. Run time: 13:10:17.224.\n",
- "Accuracy on default model run using held-out data: 0.816\n",
+ "Model date: 2020.12.22. Model time: 13:23:40.301.\n",
+ "Accuracy on default model run using held-out data: 0.866\n",
"\n",
- "Run date: 2020.09.22. Run time: 13:13:15.766.\n",
- "Accuracy on custom model run using held-out data : 0.82\n"
+ "Model date: 2020.12.22. Model time: 13:35:10.726.\n",
+ "Accuracy on custom model run using held-out data : 0.878\n"
]
}
],
"source": [
- ".util.print_runid . r1;\n",
- "-1\"Accuracy on default model run using held-out data: \",string[acc1],\"\\n\";\n",
- ".util.print_runid . r3;\n",
- "-1\"Accuracy on custom model run using held-out data : \",string .ml.accuracy[telcoInputs`ytest;pred];"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can see that by adding feature extraction in the normal case we have improved our accuracy by ~ 8%. This is highlighted in the confusion matrix below."
+ ".util.printDateTimeId model1.modelInfo;\n",
+ "-1\"Accuracy on default model run using held-out data: \",string[accuracy1],\"\\n\";\n",
+ ".util.printDateTimeId model3.modelInfo;\n",
+ "-1\"Accuracy on custom model run using held-out data : \",string .ml.accuracy[telcoInputs`ytest;pred3];"
]
},
{
@@ -1154,16 +1267,7 @@
"outputs": [
{
"data": {
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "data": {
- "image/png": "",
+ "image/png": "",
"text/plain": [
""
]
@@ -1173,7 +1277,7 @@
}
],
"source": [
- "displayCM[value .ml.confmat[telcoInputs`ytest;pred];`0`1;\"Test Set Confusion Matrix\";()];"
+ ".util.displayCM[value .ml.confmat[telcoInputs`ytest;pred3];`0`1;\"Test Set Confusion Matrix\";()];"
]
},
{
@@ -1187,40 +1291,68 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "In this example we again use the Telco dataset and highlight how to change the save options, contained under `saveopt` within the parameter dictionary.\n",
+ "In this example we again use the Telco dataset and highlight how to change the save options contained under `saveOption` within the parameter dictionary.\n",
"\n",
- "In the default case, not modified in the examples above, the system will save all outputs to disk (reports, images, config file and models). This can be altered by the user to reduce the number of outputs saved to disk, where:\n",
+ "In the default case, not modified in the examples above, the system will save all outputs to disk (reports, images, config file and models). Below we will set the `saveOption` to be `0`, which means that the results will be displayed to console but nothing is persisted to disk.\n",
"\n",
- "- `0` = Nothing is saved the models will run and display results to console but nothing persisted\n",
- "- `1` = Save the model and configuration file only, will not generate a report for the user or any images\n",
- "- `2` = Save all possible outputs to disk for the user including reports, images, config and models\n",
+ "The other alterations made to `paramDict` for the model below were:\n",
+ "1. Added a **random seed**: Here we have altered the ``` `seed``` parameter to be `175`.\n",
+ "2. Changed the size of the **holdout** sets to be 30% of the data at each stage.\n",
+ "3. Changed the **hyperparameter search** type from the default of grid search to random search and set the number of repetitions to `2`. Note that Sobol-random search is also available.\n",
"\n",
- "We demonstrate the case for `0` below for a subset of 1000 data points."
+ "A smaller subset of 1000 data points will be used."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "saveOption | 0\n",
+ "seed | 175\n",
+ "holdoutSize | 0.3\n",
+ "hyperparameterSearchType| `random\n",
+ "randomSearchArgument | 2\n"
+ ]
+ }
+ ],
"source": [
"\\S 42\n",
- "feat:1000?telcoFeat\n",
- "targ:1000?telcoTarg"
+ "features:1000?telcoFeat\n",
+ "target :1000?telcoTarg\n",
+ "\n",
+ "paramKeys:`saveOption`seed`holdoutSize`hyperparameterSearchType`randomSearchArgument // parameter names to amend\n",
+ "paramVals:(0;175;.3;`random;2) // amended values\n",
+ "show paramDict:paramKeys!paramVals"
]
},
{
"cell_type": "code",
"execution_count": 24,
- "metadata": {},
+ "metadata": {
+ "scrolled": false
+ },
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
+ "Executing node: automlConfig\n",
+ "Executing node: configuration\n",
+ "Executing node: targetDataConfig\n",
+ "Executing node: targetData\n",
+ "Executing node: featureDataConfig\n",
+ "Executing node: featureData\n",
+ "Executing node: dataCheck\n",
+ "Executing node: featureDescription\n",
"\n",
"The following is a breakdown of information for each of the relevant columns in the dataset\n",
"\n",
+ "\n",
" | count unique mean std min max type \n",
"------ | --------------------------------------------------------\n",
"tenure | 1000 73 33.551 25.0546 0i 72i numeric \n",
@@ -1244,43 +1376,72 @@
"PaymentMethod | 1000 4 :: :: :: :: categorical\n",
"SeniorCitizen | 1000 2 :: :: :: :: boolean \n",
"\n",
+ "\n",
+ "Executing node: dataPreprocessing\n",
+ "\n",
"Data preprocessing complete, starting feature creation\n",
"\n",
- "Feature creation and significance testing complete\n",
+ "Executing node: featureCreation\n",
+ "Executing node: labelEncode\n",
+ "Executing node: featureSignificance\n",
+ "\n",
+ "Total number of significant features being passed to the models = 12\n",
+ "\n",
+ "Executing node: trainTestSplit\n",
+ "Executing node: modelGeneration\n",
+ "Executing node: selectModels\n",
+ "\n",
"Starting initial model selection - allow ample time for large datasets\n",
"\n",
- "Total features being passed to the models = 12\n",
+ "Executing node: runModels\n",
+ "\n",
+ "Scores for all models using .ml.accuracy\n",
+ "\n",
+ "\n",
+ "MLPClassifier | 0.7482143\n",
+ "LogisticRegression | 0.7446429\n",
+ "AdaBoostClassifier | 0.7428571\n",
+ "SVC | 0.7375\n",
+ "LinearSVC | 0.7357143\n",
+ "GaussianNB | 0.725\n",
+ "GradientBoostingClassifier| 0.7125\n",
+ "BinaryKeras | 0.7017857\n",
+ "KNeighborsClassifier | 0.6857143\n",
+ "RandomForestClassifier | 0.6839286\n",
"\n",
- "Scores for all models, using .ml.accuracy\n",
- "AdaBoostClassifier | 0.775\n",
- "LogisticRegression | 0.775\n",
- "LinearSVC | 0.775\n",
- "SVC | 0.771875\n",
- "MLPClassifier | 0.765625\n",
- "GradientBoostingClassifier| 0.759375\n",
- "KNeighborsClassifier | 0.7546875\n",
- "RandomForestClassifier | 0.74375\n",
- "GaussianNB | 0.7328125\n",
- "binarykeras | 0.684375\n",
"\n",
- "Best scoring model = AdaBoostClassifier\n",
- "Score for validation predictions using best model = 0.71875\n",
+ "\n",
+ "Best scoring model = MLPClassifier\n",
+ "\n",
+ "Executing node: optimizeModels\n",
"\n",
"Continuing to hyperparameter search and final model fitting on testing set\n",
"\n",
- "Best model fitting now complete - final score on testing set = 0.71\n",
+ "\n",
+ "Best model fitting now complete - final score on testing set = 0.805\n",
+ "\n",
"\n",
"Confusion matrix for testing set:\n",
"\n",
- " | pred_0 pred_1\n",
+ "\n",
+ " | true_0 true_1\n",
"------| -------------\n",
- "true_0| 142 0 \n",
- "true_1| 58 0 \n"
+ "pred_0| 161 0 \n",
+ "pred_1| 39 0 \n",
+ "\n",
+ "\n",
+ "Executing node: predictParams\n",
+ "Executing node: preprocParams\n",
+ "Executing node: pathConstruct\n",
+ "Executing node: saveGraph\n",
+ "Executing node: saveReport\n",
+ "Executing node: saveMeta\n",
+ "Executing node: saveModels\n"
]
}
],
"source": [
- ".automl.run[feat;targ;`normal;`class;enlist[`saveopt]!enlist 0];"
+ ".automl.fit[features;target;`normal;`class;paramDict];"
]
},
{
@@ -1301,11 +1462,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "In this example, the IMDB dataset is used with the following changes made to the input dictionary `p`:\n",
+ "In this example, the IMDB dataset is used with the following changes made to the input dictionary `paramDict`:\n",
"\n",
"1. **Word2vec transformation** changed from default `continuous bag of words` method to `skip-gram`.\n",
- "2. **Significant feature function** changed to use the Benjamini-Hochberg-Yekutieli procedure.\n",
- "3. **Random seed** set as `100`."
+ "2. **Significant feature function** changed to use the percentile-based procedure.\n",
+ "3. **Random seed** set as `275`.\n",
+ "\n",
+ "In this example, printing to screen will also be suppressed and redirected to a log file called `logFile`."
]
},
{
@@ -1317,20 +1480,27 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "sigfeats| `.automl.newsigfeat\n",
- "w2v | 1\n",
- "seed | 150\n"
+ "significantFeatures| `.automl.newSigFeat\n",
+ "w2v | 1\n",
+ "seed | 275\n",
+ "loggingFile | \"logFile\"\n"
]
}
],
"source": [
+ ".automl.updatePrinting[] // Disable printing to screen \n",
+ ".automl.updateLogging[] // Redirect printing to log file\n",
+ "\n",
+ "\n",
"// new significant feature function \n",
- ".automl.newsigfeat:{[x;y]\n",
- " .ml.fresh.significantfeatures[x;y;.ml.fresh.benjhoch 0.05]\n",
+ ".automl.newSigFeat:{[x;y]\n",
+ " .ml.fresh.significantfeatures[x;y;.ml.fresh.percentile 0.10]\n",
" }\n",
"\n",
"// new parameter dictionary\n",
- "show p:`sigfeats`w2v`seed!(`.automl.newsigfeat;1;150)"
+ "paramKeys:`significantFeatures`w2v`seed`loggingFile // parameter names to amend\n",
+ "paramVals:(`.automl.newSigFeat;1;275;\"logFile\") // amended values\n",
+ "show paramDict4:paramKeys!paramVals"
]
},
{
@@ -1344,66 +1514,43 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "\n",
- "The following is a breakdown of information for each of the relevant columns in the dataset\n",
- "\n",
- " | count unique mean std min max type\n",
- "-------| ----------------------------------\n",
- "comment| 1350 1350 :: :: :: :: text\n",
- "\n",
- "Data preprocessing complete, starting feature creation\n",
- "\n",
- "Feature creation and significance testing complete\n",
- "Starting initial model selection - allow ample time for large datasets\n",
- "\n",
- "Total features being passed to the models = 230\n",
- "\n",
- "Scores for all models, using .ml.accuracy\n",
- "MLPClassifier | 0.8090335\n",
- "LinearSVC | 0.7940046\n",
- "RandomForestClassifier | 0.7881839\n",
- "GradientBoostingClassifier| 0.7824304\n",
- "AdaBoostClassifier | 0.773202\n",
- "LogisticRegression | 0.743077\n",
- "SVC | 0.7350181\n",
- "KNeighborsClassifier | 0.7245665\n",
- "binarykeras | 0.7130125\n",
- "GaussianNB | 0.6747883\n",
- "\n",
- "Best scoring model = MLPClassifier\n",
- "Score for validation predictions using best model = 0.8518519\n",
- "\n",
- "\n",
- "Feature impact calculated for features associated with MLPClassifier model\n",
- "Plots saved in /outputs/2020.09.22/run_13.26.49.103/images/\n",
- "\n",
- "Continuing to hyperparameter search and final model fitting on testing set\n",
- "\n",
- "Best model fitting now complete - final score on testing set = 0.7814815\n",
- "\n",
- "Confusion matrix for testing set:\n",
- "\n",
- " | pred_0 pred_1\n",
- "------| -------------\n",
- "true_0| 95 32 \n",
- "true_1| 27 116 \n",
- "\n",
- "Saving down procedure report to /outputs/2020.09.22/run_13.26.49.103/report/\n",
- "Saving down MLPClassifier model to /outputs/2020.09.22/run_13.26.49.103/models/\n",
- "Saving down model parameters to /outputs/2020.09.22/run_13.26.49.103/config/\n"
+ "Executing node: automlConfig\n",
+ "Executing node: configuration\n",
+ "Executing node: targetDataConfig\n",
+ "Executing node: targetData\n",
+ "Executing node: featureDataConfig\n",
+ "Executing node: featureData\n",
+ "Executing node: dataCheck\n",
+ "Executing node: featureDescription\n",
+ "Executing node: dataPreprocessing\n",
+ "Executing node: featureCreation\n",
+ "Executing node: labelEncode\n",
+ "Executing node: featureSignificance\n",
+ "Executing node: trainTestSplit\n",
+ "Executing node: modelGeneration\n",
+ "Executing node: selectModels\n",
+ "Executing node: runModels\n",
+ "Executing node: optimizeModels\n",
+ "Executing node: predictParams\n",
+ "Executing node: preprocParams\n",
+ "Executing node: pathConstruct\n",
+ "Executing node: saveGraph\n",
+ "Executing node: saveReport\n",
+ "Executing node: saveMeta\n",
+ "Executing node: saveModels\n"
]
}
],
"source": [
"// run automl with new parameters\n",
- "r4:.automl.run[imdbInputs`xtrain;imdbInputs`ytrain;`nlp;`class;p]"
+ "model4:.automl.fit[imdbInputs`xtrain;imdbInputs`ytrain;`nlp;`class;paramDict4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "We can see that the features deemed important compared with the initial run have varied. In this iteration, some of the features that were created from the NLP parser such as `CARDINAL` and compound were identified as significant in predicting the target value\n",
+ "We can see that the features deemed important have varied compared with the initial run. In this iteration, very few of the features created using the spaCy NLP library were deemed to be significant when predicting the target value.\n",
"\n",
" "
]
@@ -1417,162 +1564,53 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Run date: 2020.09.22. Run time: 13:26:49.103.\n",
+ "Model date: 2020.12.22. Model time: 13:38:49.990.\n",
"\n",
"Predictions: \n",
- "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0..\n",
+ "1 1 1 1 0 0 1 0 1 0 0 0 1 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 1 1 0 1 1 0 1 1 1..\n",
"\n",
"Targets:\n",
- "1 0 1 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 0 1 1 1 1 0 1 1 1..\n"
+ "1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1 1..\n"
]
}
],
"source": [
- ".util.print_runid . r4;\n",
+ ".util.printDateTimeId model4.modelInfo;\n",
"-1\"\\nPredictions: \";\n",
- "show pred:.automl.new[imdbInputs`xtest]. r4\n",
+ "show pred4:model4.predict[imdbInputs`xtest]\n",
"-1\"\\nTargets:\";\n",
"show imdbInputs`ytest"
]
},
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Run date: 2020.09.22. Run time: 13:10:48.282.\n",
- "Accuracy on default model run using held-out data: 0.6933333\n",
- "\n",
- "Run date: 2020.09.22. Run time: 13:10:48.282.\n",
- "Accuracy on custom model run using held-out data : 0.6933333\n"
- ]
- }
- ],
- "source": [
- ".util.print_runid . r2;\n",
- "-1\"Accuracy on default model run using held-out data: \",string[acc2],\"\\n\";\n",
- ".util.print_runid . r2;\n",
- "-1\"Accuracy on custom model run using held-out data : \",string .ml.accuracy[imdbInputs`ytest;imdbPred];"
- ]
- },
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Example 4"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We have shown in the previous examples that the pipeline can be altered by passing in a dictionary of parameters as the last argument in `.automl.run`. As mentioned previously, we can also pass the path to a flat file.\n",
- "\n",
- "Default flat files are saved to `automl/code/models/` where users can change parameters within each file. These are generated by a user using the function `.automl.savedefault` as follows:"
+ "Below we can see how changing the `w2v` implementation decreased the accuracy of the model compared with the initial run."
]
},
{
"cell_type": "code",
- "execution_count": 29,
- "metadata": {},
- "outputs": [],
- "source": [
- ".automl.savedefault[\"new_normal_defaults.txt\";`normal]"
- ]
- },
- {
- "cell_type": "markdown",
+ "execution_count": 28,
"metadata": {},
- "source": [
- "We can then run the pipeline using this new file as our final argument:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "metadata": {
- "scrolled": false
- },
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
+ "Model date: 2020.12.22. Model time: 13:24:18.472.\n",
+ "Accuracy on default model run using held-out data: 0.78\n",
"\n",
- "The following is a breakdown of information for each of the relevant columns in the dataset\n",
- "\n",
- " | count unique mean std min max type \n",
- "------ | --------------------------------------------------------\n",
- "tenure | 5000 73 32.2504 24.55779 0i 72i numeric \n",
- "MonthlyCharges | 5000 1285 64.82904 30.46505 18.55 118.75 numeric \n",
- "TotalCharges | 5000 3413 2281.232 2279.37 18.85 8672.45 numeric \n",
- "customerID | 5000 3566 :: :: :: :: categorical\n",
- "gender | 5000 2 :: :: :: :: categorical\n",
- "Partner | 5000 2 :: :: :: :: categorical\n",
- "Dependents | 5000 2 :: :: :: :: categorical\n",
- "PhoneService | 5000 2 :: :: :: :: categorical\n",
- "MultipleLines | 5000 3 :: :: :: :: categorical\n",
- "InternetService | 5000 3 :: :: :: :: categorical\n",
- "OnlineSecurity | 5000 3 :: :: :: :: categorical\n",
- "OnlineBackup | 5000 3 :: :: :: :: categorical\n",
- "DeviceProtection| 5000 3 :: :: :: :: categorical\n",
- "TechSupport | 5000 3 :: :: :: :: categorical\n",
- "StreamingTV | 5000 3 :: :: :: :: categorical\n",
- "StreamingMovies | 5000 3 :: :: :: :: categorical\n",
- "Contract | 5000 3 :: :: :: :: categorical\n",
- "PaperlessBilling| 5000 2 :: :: :: :: categorical\n",
- "PaymentMethod | 5000 4 :: :: :: :: categorical\n",
- "SeniorCitizen | 5000 2 :: :: :: :: boolean \n",
- "\n",
- "Data preprocessing complete, starting feature creation\n",
- "\n",
- "Feature creation and significance testing complete\n",
- "Starting initial model selection - allow ample time for large datasets\n",
- "\n",
- "Total features being passed to the models = 7\n",
- "\n",
- "Scores for all models, using .ml.accuracy\n",
- "AdaBoostClassifier | 0.8009375\n",
- "MLPClassifier | 0.8003125\n",
- "LogisticRegression | 0.796875\n",
- "GradientBoostingClassifier| 0.795625\n",
- "RandomForestClassifier | 0.790625\n",
- "KNeighborsClassifier | 0.7903125\n",
- "binarykeras | 0.78125\n",
- "SVC | 0.75875\n",
- "GaussianNB | 0.7475\n",
- "LinearSVC | 0.7353125\n",
- "\n",
- "Best scoring model = AdaBoostClassifier\n",
- "Score for validation predictions using best model = 0.8025\n",
- "\n",
- "\n",
- "Feature impact calculated for features associated with AdaBoostClassifier model\n",
- "Plots saved in /outputs/2020.09.22/run_13.33.53.803/images/\n",
- "\n",
- "Continuing to hyperparameter search and final model fitting on testing set\n",
- "\n",
- "Best model fitting now complete - final score on testing set = 0.733\n",
- "\n",
- "Confusion matrix for testing set:\n",
- "\n",
- " | pred_0 pred_1\n",
- "------| -------------\n",
- "true_0| 733 0 \n",
- "true_1| 267 0 \n",
- "\n",
- "Saving down procedure report to /outputs/2020.09.22/run_13.33.53.803/report/\n",
- "Saving down AdaBoostClassifier model to /outputs/2020.09.22/run_13.33.53.803/models/\n",
- "Saving down model parameters to /outputs/2020.09.22/run_13.33.53.803/config/\n"
+ "Model date: 2020.12.22. Model time: 13:38:49.990.\n",
+ "Accuracy on custom model run using held-out data : 0.78\n"
]
}
],
"source": [
- ".automl.run[telcoFeat;telcoTarg;`normal;`class;\"new_normal_defaults.txt\"];"
+ ".util.printDateTimeId model2.modelInfo;\n",
+ "-1\"Accuracy on default model run using held-out data: \",string[accuracy2],\"\\n\";\n",
+ ".util.printDateTimeId model4.modelInfo;\n",
+ "-1\"Accuracy on custom model run using held-out data : \",string .ml.accuracy[imdbInputs`ytest;pred4];"
]
},
{
diff --git a/notebooks/images/run1conf.png b/notebooks/images/run1conf.png
index 6029608..dd3ce1e 100644
Binary files a/notebooks/images/run1conf.png and b/notebooks/images/run1conf.png differ
diff --git a/notebooks/images/run1impact.png b/notebooks/images/run1impact.png
index 28476ae..eb111df 100644
Binary files a/notebooks/images/run1impact.png and b/notebooks/images/run1impact.png differ
diff --git a/notebooks/images/run2conf.png b/notebooks/images/run2conf.png
index 5a90067..bc48251 100644
Binary files a/notebooks/images/run2conf.png and b/notebooks/images/run2conf.png differ
diff --git a/notebooks/images/run2impact.png b/notebooks/images/run2impact.png
index 4afaadf..cf1c132 100644
Binary files a/notebooks/images/run2impact.png and b/notebooks/images/run2impact.png differ
diff --git a/notebooks/images/run3impact.png b/notebooks/images/run3impact.png
index 861b37c..7d770e5 100644
Binary files a/notebooks/images/run3impact.png and b/notebooks/images/run3impact.png differ
diff --git a/notebooks/images/run4impact.png b/notebooks/images/run4impact.png
index 8ff941f..25fe9a2 100644
Binary files a/notebooks/images/run4impact.png and b/notebooks/images/run4impact.png differ
diff --git a/utils/util.q b/utils/util.q
index 74cf50e..4e849b6 100644
--- a/utils/util.q
+++ b/utils/util.q
@@ -8,7 +8,8 @@ npa:.p.import[`numpy]`:array
round:{y*"j"$x%y}
imax:{x?max x}
mattab:{flip value flip x}
-print_runid:{-1"Run date: ",string[x],". Run time: ",string[y],"."}
+printDateTimeId:{-1"Model date: ",string[x`startDate],". Model time: ",string[x`startTime],"."}
+printSavedModelId:{-1"Model Name: ",string[x`savedModelName],"."}
// @kind function
// @category misc