06.5 TensorFlow (Custom)
Custom models based on TensorFlow or CNTK can provide more specialized image classification than Cognitive Services or pre-trained models. In this walkthrough you will learn how to create and train a complex TensorFlow/CNTK model. Creating and training custom models usually relies on Python scripts; in this case, you will use Keras with the TensorFlow/CNTK backends. Microsoft provides ML Workbench and the VS AI Tools, which offer a better UI for running those Python scripts when creating and training TensorFlow models.
If your computer has an NVIDIA graphics card, you can install the NVIDIA CUDA Toolkit so the model training will be significantly faster.
TensorFlow 1.5 supports CUDA Toolkit 9.0, which is available for Windows, macOS and Linux. This version can be downloaded from the CUDA Toolkit archive. First, install CUDA Toolkit 9.0:
As with the CUDA Toolkit, avoid installing the latest cuDNN version until TensorFlow support for it is confirmed. Until then, install cuDNN version 7 for CUDA 9.0, which can be downloaded from the cuDNN Download page, as shown in the following figure:
cuDNN supports Windows, macOS and Linux, and to download this package you will need to register as an NVIDIA developer.
Install the package following these instructions:
PREREQUISITES
CUDA 9.0 and a GPU of compute capability 3.0 or higher are required.
ALL PLATFORMS
Extract the downloaded cuDNN .ZIP file to a folder of your choice. That folder is referred to below as <installpath>.
Then follow the platform-specific instructions below.
WINDOWS
Add <installpath>\cuda\bin to the PATH environment variable. That folder should contain the `cudnn64_7.dll` library file (the version depends on your installed dependencies).
LINUX
cd <installpath>
export LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH
OS X
cd <installpath>
export DYLD_LIBRARY_PATH=`pwd`:$DYLD_LIBRARY_PATH
POSSIBLE ISSUE IN WINDOWS WITH NVIDIA GPU:
In Windows, if you don't set the PATH environment variable to point to the folder holding the cudnn64_7.dll file, as just explained, you will get the following error later when you try to run the training process:
ImportError: Could not find 'cudnn64_7.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable. Note that installing cuDNN is a separate step from installing CUDA, and this DLL is often found in a different directory from the CUDA DLLs. You may install the necessary DLL by downloading cuDNN 6 from this URL: https://developer.nvidia.com/cudnn
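Once CUDA and cuDNN are installed and reachable from the PATH, you can quickly verify that TensorFlow detects the GPU before launching any training. The following is a minimal sanity check, assuming TensorFlow 1.x is already installed in your Python environment:

```python
# Minimal check (TensorFlow 1.x): list the devices TensorFlow can see.
# If CUDA/cuDNN are set up correctly, a "/device:GPU:0" entry should appear.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, "-", device.device_type)
```

If only CPU devices are listed, revisit the PATH / LD_LIBRARY_PATH settings described above.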
Deep learning neural networks (models) can be re-trained to classify objects that were not used to train the original model.
In order to train custom models (custom neural networks), you will use Microsoft Azure ML Workbench, which you need to install on your machine after creating your Azure ML account.
Set it up by using the following procedure:
https://docs.microsoft.com/en-us/azure/machine-learning/preview/quickstart-installation
After setting up your Azure ML Experimentation account, you should be able to download the Azure ML Workbench desktop application from there as in the following screenshot:
Select either the Windows or Mac setup depending on your workstation OS.
- After the installation process of ML Workbench is complete, select the Launch Workbench button on the last screen of the installer. If you have closed the installer, start the app from the shortcut named Azure Machine Learning Workbench on your desktop or in the Start menu.
- Sign in to Workbench using the same account that you used earlier to provision your Azure resources.
- When the sign-in succeeds, Workbench attempts to find the Machine Learning Experimentation accounts that you created earlier, searching all Azure subscriptions to which your credential has access. When at least one Experimentation account is found, Workbench opens with that account and lists the workspaces and projects found in it.
eShopOnContainersAI provides a custom project for you that handles the training of an existing neural network. First, we will check how to open the project from the ML Workbench desktop application and from Visual Studio Code.
Run the AML Workbench; after you sign in with your Microsoft account (the one authorized for your Azure Experimentation account), the main Workbench dashboard is displayed.
From there, click on the File -> Add Existing Projects as a Folder menu. You can also perform the same action by clicking on the + button and then selecting Add Existing Folder as Project, as shown in the figure:
Then, in the right slide panel, click on Browse to navigate to the eShopOnContainersAI folder (your local GitHub repo), select the `workbench` folder at the root of the repo, and accept, so that you see a form similar to the following:
The project name will be pre-populated with the project folder's name, but you can change it to a more meaningful name, such as eShopWorkbench. An experimentation project will be created in Azure using the project name you provide here.
The project description is optional, as is the Visualstudio.com Git repository. This repository is the one that will be used to host experimentation tracking data.
Additionally, check the Workspace field if you have several Experimentation Workspaces, as this determines your Azure accounting; otherwise, this field will be pre-populated with your Experimentation Workspace.
Finish the creation form by clicking on the Create button.
Creating a new project, or adding an existing folder as a project, triggers the creation of an Experimentation workspace project in Azure. You can check this newly created resource in the Azure portal, as in the following screenshot:
You should set up an external editor to use with ML Workbench. Use Workbench itself only to edit files for which it provides a rich editing experience, such as Python Notebooks, Data Sources and Data Preparation packages.
For other project files, open/edit them with a different tool/editor. This can be configured via the File -> Configure Project IDE menu.
On the right slide panel, fill in the name of the program and enter the path to the application to use for opening the project folder. In the example below, we choose Visual Studio Code (check the Visual Studio Code documentation for installation steps), a lightweight IDE with good Python support, but you can also use Visual Studio or any other third-party editor (PyCharm, Eclipse, ...).
Once the Project IDE has been configured and a project is open in Workbench, you can open the project folder via the File -> Open Project menu, which opens the project in the configured IDE, as in the following screenshot showing Visual Studio Code while the eShopWorkbench project was open in ML Workbench:
In order to execute the training project, you first need to install the Python dependencies (packages). Workbench uses Conda as its package manager, so to install those dependencies you will need either the Command Prompt or the PowerShell terminal, both available from the File menu in ML Workbench.
The following commands are run using the Command Prompt on Windows, but you could also use a PowerShell terminal.
Once you open the Command Prompt or PowerShell, it is recommended to close ML Workbench at this point, as we are going to install dependencies.
First, you can check the Conda environment in use by running the following command:
conda info --envs
This command shows which Conda environment is active, marked with an asterisk (*) next to its path, as shown below:
You should have at least one environment, named `root`, and it should be the active one.
If your computer has an NVIDIA GPU, you should install the package dependencies using the following command:
conda env update -f .\aml_config\conda_dependencies_gpu.yml -n root
On the other hand, if your computer only has a CPU or a non-NVIDIA GPU, you should use the `conda_dependencies.yml` file instead of `conda_dependencies_gpu.yml` in the previous command, meaning a command like the following:
conda env update -f .\aml_config\conda_dependencies.yml -n root
IMPORTANT NOTE: In many cases you will get errors when installing the Python dependency packages. In that (fairly common) case, close Workbench if you had it open and run the same command again.
Once it finishes, you can make sure you have all the dependencies by re-running the same command; you should then get many messages, one per requirement, saying "Requirement already satisfied:".
EXCEPTIONAL CASE: If you still get errors after trying multiple times (a rare case), you might need to update Conda and all the environment packages by executing the following command:
Optional command if you need to troubleshoot the dependencies installation:
conda update --all
After installing the Conda dependencies, you can check the packages installed in the current environment by running the command:
conda list
You will see a list considerably longer than the screenshot below:
As a recommendation, once the packages have been installed correctly, close ML Workbench (if you had it open) and open it again so that it loads the newly installed dependencies.
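If you want to double-check from Python itself that the environment is ready, the following minimal snippet (assuming TensorFlow and Keras were installed by the Conda dependency files) prints the installed versions and whether a GPU is visible:

```python
# Sanity check of the deep learning stack installed by the Conda environment.
import tensorflow as tf
import keras  # standalone Keras package used by the training script

print("TensorFlow version:", tf.__version__)
print("Keras version:", keras.__version__)
print("GPU available:", tf.test.is_gpu_available())
```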
You need a set of images/pictures in order to train the custom model. Those images have to be copied in a very specific folder structure within the \workbench\data folder.
Inside the workbench project folder, you should have a `data` folder with two folders within it, named `train` and `validation`, for the training set and the validation set respectively.
As examples, we have set up a few product types such as `frisbee`, `bracelet`, `thermometer`, etc., with many pictures/photos per type.
You could add additional but similar item folders into both parent folders with your own images, following the same pattern shown in the screenshot below:
In the previous figure example, you can observe that the deep learning model will try to classify each image into one of those categories (labels).
Each folder in the `train` folder has a sibling folder in the `validation` folder, for instance, `data/train/thermometer` and `data/validation/thermometer`.
Inside each of these folders, you need to provide images appropriate for each label/type. You need at least around 15 images per category/label, and those images should be distributed between the `train` and `validation` folders, with approximately 80% of the images in the `train` folder and 20% in the `validation` folder.
For instance, the following screenshot shows the content of the `data\train\frisbee` folder:
Note that you also have a sibling folder named `data\validation\frisbee` with additional frisbees.
You can provide as many labels/types as you want, as pairs of folders following the naming scheme already described. However, it is recommended that you first train with the images provided in our GitHub repo, to confirm that your infrastructure is working properly. Afterwards, you can add your own labels and images.
TROUBLESHOOTING: We have found some issues when the image file names are too long. Also, make sure that the file extensions are correct; for instance, they should be .png or .jpg.
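If you want to sanity-check the folder layout and image counts before launching a training run, a small script like the following (a hypothetical helper, not part of the repo) can be run from the workbench folder:

```python
# Hypothetical helper: list labels and image counts in data/train and data/validation.
import os

DATA_DIR = "data"
VALID_EXTENSIONS = (".jpg", ".jpeg", ".png")

for split in ("train", "validation"):
    split_dir = os.path.join(DATA_DIR, split)
    for label in sorted(os.listdir(split_dir)):
        label_dir = os.path.join(split_dir, label)
        if not os.path.isdir(label_dir):
            continue
        images = [name for name in os.listdir(label_dir)
                  if name.lower().endswith(VALID_EXTENSIONS)]
        print("{}/{}: {} images".format(split, label, len(images)))
```

Every label folder should appear under both splits, with roughly the 80/20 distribution described above.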
Take into account that the total training time can be quite long (many minutes) if you are not using a GPU, and it also depends on the number of image types to train.
Step 6: Train and generate your custom-model and labels/tags with ML Workbench and Keras+Python scripts
The aim of this scenario is to build a classification model based on a custom set of images. After placing the image files to classify, you need to train a deep learning network and save the trained model to a file, which will later be consumed by the microservice.
The model can be generated directly using a Python console, or using an Experiment Job in ML Workbench. In both cases you will have a historical record of the training you have made over time.
You need to open the Workbench PowerShell / Command Prompt console. Once the console is open, issue the following command to execute the training on the data folder:
python ./training.py tensorflow
That command launches the training script, which uses a TensorFlow deep learning model built with Keras, a Python library that can generate trained models to be executed on TensorFlow or CNTK.
You should see an execution similar to the following screenshot:
After that execution completes, the trained custom model is saved in the `outputs` folder with the name `model_tf.pb`.
On the other hand, if you want to generate a trained model targeting the CNTK deep learning framework, you should pass the `cntk` parameter instead, as follows:
python ./training.py cntk
The model will be saved in the outputs folder as `model_cntk.pb`.
In both cases, the training script execution also produces a file named `labels.txt` placed within the same `outputs` folder; this file contains the labels used for training the model.
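The actual training logic lives in the repo's training.py script. To give an idea of what this kind of Keras training looks like, here is a simplified, illustrative sketch of a transfer-learning workflow over the data/train and data/validation folders; the base network, image size, hyperparameters and output handling below are assumptions for illustration, not the exact values used by the repo (the real script also handles backend selection from its command-line argument and freezing the graph to a .pb file):

```python
# Illustrative sketch only: the repo's training.py implements the real workflow.
import os
os.environ.setdefault("KERAS_BACKEND", "tensorflow")  # or "cntk"; must be set before importing keras

from keras.applications import InceptionV3
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (224, 224)  # assumed input size

train_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/train", target_size=IMG_SIZE, batch_size=16, class_mode="categorical")
val_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/validation", target_size=IMG_SIZE, batch_size=16, class_mode="categorical")

# Reuse a pre-trained convolutional base and train only a new classification head.
base = InceptionV3(weights="imagenet", include_top=False, input_shape=IMG_SIZE + (3,))
for layer in base.layers:
    layer.trainable = False

x = GlobalAveragePooling2D()(base.output)
predictions = Dense(len(train_gen.class_indices), activation="softmax")(x)
model = Model(inputs=base.input, outputs=predictions)

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit_generator(train_gen, steps_per_epoch=len(train_gen), epochs=10,
                    validation_data=val_gen, validation_steps=len(val_gen))

# Write the labels in the order of their class indices; exporting the frozen
# model_tf.pb graph is backend-specific and handled by the real training.py.
os.makedirs("outputs", exist_ok=True)
with open("outputs/labels.txt", "w") as labels_file:
    labels = sorted(train_gen.class_indices, key=train_gen.class_indices.get)
    labels_file.write("\n".join(labels))
```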
Copy the output files into eShopOnContainers AI.ProductSearchImageBased.TensorFlow.API microservice code
Copy the files `model_tf.pb` and `labels.txt` from the `outputs` folder:
And paste them into the following folder within the TensorFlow.API microservice's code:
\src\Services\AI.ProductSearchImageBased\AI.ProductSearchImageBased.TensorFlow.API\Resources\models
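Before wiring the files into the microservice, you can optionally sanity-check the frozen graph from Python. The snippet below is a rough sketch for TensorFlow 1.x; the input/output tensor names and image size are assumptions, so inspect the graph's operation names to find the real ones for your model:

```python
# Hypothetical sanity check: load model_tf.pb and run one prediction (TensorFlow 1.x).
import numpy as np
import tensorflow as tf

with tf.gfile.GFile("outputs/model_tf.pb", "rb") as model_file:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(model_file.read())

with open("outputs/labels.txt") as labels_file:
    labels = [line.strip() for line in labels_file if line.strip()]

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")
    # Uncomment to discover the real input/output tensor names:
    # for op in graph.get_operations(): print(op.name)

with tf.Session(graph=graph) as sess:
    fake_image = np.random.rand(1, 224, 224, 3).astype(np.float32)  # assumed input shape
    # "input:0" and "output:0" are placeholder tensor names, not guaranteed by the repo.
    scores = sess.run("output:0", feed_dict={"input:0": fake_image})
    print("Predicted label:", labels[int(np.argmax(scores))])
```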
Under the covers, this approach still uses the same Python scripts to train and generate the model. However, running the jobs from ML Workbench gives you a history of your training executions and, optionally, lets you run them on remote Azure VMs specialized for deep learning training.
To build the model using an Experiment Job in ML Workbench, open the ML Workbench desktop app and load the eShop workbench project you added previously. The displayed name of the project depends on how you named it the first time you loaded it, such as "eShopWorkbench" or "workbench".
Once you are in the workbench project in the ML Workbench app, select the `local` configuration, select the `training.py` script, type `tensorflow` as the argument, and click on the Run button to launch the Experiment Job.
You can then observe the Jobs panel on the right-hand side; the job just submitted will appear as processing. The job will run for quite a few minutes, depending on your computer's hardware resources (mainly CPU and GPU). If there is any error, you should be able to investigate it by opening the logs.
You can also check previously finished jobs by clicking on the "Runs" button in the left-hand side menu and then clicking on any of the runs.
Within each run's info you can explore how the training performed and, most importantly, you can access the generated artifacts, in this case the trained model and the labels file:
Once you see the information of a particular run, you can download the output files/artifacts: click on the `model_tf.pb` file and download it, then click on the `labels.txt` file and download it (each download handles a single file, not both).
You will need to copy these files into the eShopOnContainersAI microservice code folders in order to configure and run it, as explained at the end of this Wiki post.
This approach allows you to run the training without needing all the Python dependencies installed on your local machine, although you need Docker installed and it will run slower than with a local GPU.
Steps:
- Make sure you pull the microsoft/mmlspark:plus-0.7.91 image with this Docker command:
docker pull microsoft/mmlspark:plus-0.7.91
- From ML Workbench, run the same script as before, but choose `docker` as the execution environment:
Using the console is more verbose but, at the same time, allows more fine-tuning and automation options.
First, you need to open a Workbench PowerShell / Command Prompt console from the ML Workbench menu option, so the console has access to the Azure CLI (az commands). Once the console is open, log in to Azure using the Azure CLI and choose a default subscription:
# to authenticate
az login
# to list subscriptions
az account list -o table
# to set current subscription to a particular subscription ID
az account set -s "subscription_id"
or
az account set -s "subscription_name"
You can run the same previous command to list the subscriptions and check which subscription is now the default/current subscription to be used when running az commands.
Then, you submit your experiment using the local environment, with the following command:
az ml experiment submit -c local training.py tensorflow
After the execution finishes, look up your run in the Runs panel of the Workbench application and download the model as explained in the previous Workbench section.
If you didn't create a VSTS Git repo when you initially created the ML Workbench project, you can do it later and associate a Team Services Git repo with the existing Machine Learning project by executing the following command from the PowerShell/Command Prompt console opened from the ML Workbench desktop app:
# Ensure that you are in the project path so Azure CLI has the context of your current project.
$ az ml project update --repo https://<Team Services account name>.visualstudio.com/_git/<project name>
Training a model on a machine with a specialized deep learning GPU (such as a Tesla-based GPU) can dramatically reduce the time needed for each training run. For instance, training on the sample images included in eShopOnContainersAI took the following times in our tests:
- MacBook Pro 2017 - NO NVIDIA GPU - Running Windows 10 natively (Bootcamp): 27:31 minutes
- MacBook Pro 2017 - NO NVIDIA GPU - Running Docker image: 41:25 minutes
- Surface Studio with local NVIDIA GPU (GEForce GTX 980M): 8:43 minutes
- Azure VM NC6 with NVIDIA GPU (Tesla K80): 2:24 minutes
(DISCLAIMER: The previous scores can vary depending on the context of each machine)
Therefore, if you want to train your deep network using a GPU-based Docker environment, the best and fastest option is to run it on a remote Azure Virtual Machine with a GPU. You can turn it on for just a few minutes and turn it off once you are done, so you pay only for what you use.
There are several flavours of virtual machine available, but we recommend the Deep Learning Virtual Machine, although you could also configure the Data Science Virtual Machine for Linux (Ubuntu) if using a GPU-based VM.
It is very important that you select the following options:
- Linux operating system (usually Ubuntu is recommended)
- HDD storage option (not SSD, currently, Azure GPU VMs require a standard disk)
- NC6 series is a good option (in general, any virtual machine series with GPU capability)
NOTE: Sometimes you might need to choose a different datacenter as GPU-based VMs are not available in all Azure datacenters. For instance, it is available in "West US 2" but not in "Central US".
Once the VM is created, you need its public IP address or DNS name, user name and password. You can get the IP/DNS from the VM's configuration in the Azure portal. Then, execute the following command from the local command prompt console opened from the ML Workbench desktop app:
az ml computetarget attach remotedocker --name "azure-docker-gpu" --address "<public_IP_or_DNS_address>" --username "<user>" --password "<password>"
This will create a couple of files within the `aml_config` folder, named `azure-docker-gpu.compute` and `azure-docker-gpu.runconfig`.
IMPORTANT: In order to use the VM's GPU, you need to edit those files with the following modifications:
- Edit the `azure-docker-gpu.compute` file:
  - Add the following line at the end of the file so the VM will use the GPU:
    nvidiaDocker: true
  - Change the Docker image to use: use microsoft/mmlspark:plus-gpu-0.9.9 instead of microsoft/mmlspark:plus-0.9.9. Note that version numbers can vary; the important point is to use the "gpu" version of the image.
- Edit the `azure-docker-gpu.runconfig` file and set it like the following:
ArgumentVector:
- $file
CondaDependenciesFile: aml_config/conda_dependencies_gpu.yml
EnvironmentVariables: null
Framework: "Python"
PrepareEnvironment: true
#SparkDependenciesFile: aml_config/spark_dependencies.yml
Target: azure-docker-gpu
TrackedRun: true
UseSampling: true
In particular, make sure you have `CondaDependenciesFile: aml_config/conda_dependencies_gpu.yml` and `Framework: "Python"`.
Now you need to prepare the remote Azure VM with your project's environment and pull the Docker image that will be used for training. Run the following command to configure the remote Azure VM:
az ml experiment prepare -c azure-docker-gpu
This remote Azure VM preparation will take a few minutes and should end like in the following screenshot:
After setting up the virtual machine with the previous command (it pulled the needed Docker images plus additional configuration), you should be ready to execute the training using this remote VM context.
You can launch it from the Workbench PowerShell terminal by running the following command:
az ml experiment submit -c azure-docker-gpu training.py tensorflow
Or you can launch the experiment job from Workbench, selecting the `docker-gpu` context and proceeding as explained before:
Step 6: Configure and run the eShopOnContainersAI product search image-based with the produced custom model
After updating the `.env` file and the models folder, you need to redeploy the whole solution.
- In the `.env` file, make sure that the ESHOPAI_PRODUCTSEARCHIMAGEBASED_APPROACH=TensorFlowCustom environment variable is set.
- Paste the trained `model_tf.pb` and `labels.txt` files into the \src\Services\AI.ProductSearchImageBased\AI.ProductSearchImageBased.TensorFlow.API\Resources\models folder.
- Run eShopOnContainersAI and perform an image-based product search, for instance by providing any frisbee picture:
- You should get some results with the frisbees in the actual catalog: