
06.5 TensorFlow (Custom)


Introduction

image

Custom models based on TensorFlow or CNTK can provide more specialized image classification than Cognitive Services or pre-trained models. In this walkthrough you will learn how to create and train a custom TensorFlow/CNTK model. Creating and training custom models usually relies on Python scripts; in this case, you will use Keras with the TensorFlow/CNTK backends. Microsoft provides ML Workbench and the VS AI Tools, which offer a better UI for running those Python scripts when creating and training TensorFlow models.

Training Setup

Step 1: (Optional) NVIDIA support for faster model training

If your computer has an NVIDIA graphics card, you can install the NVIDIA CUDA Toolkit so the model training will be significantly faster.

Install NVIDIA CUDA Toolkit

TensorFlow 1.5 supports CUDA Toolkit 9.0, which is available for the Windows, macOS and Linux operating systems and can be downloaded from the CUDA Toolkit archive. First, install CUDA Toolkit 9.0:

image

Install NVIDIA cuDNN

As with the CUDA Toolkit, you should avoid installing the latest version until TensorFlow support for it has been confirmed. For now, install cuDNN version 7 for CUDA 9.0, which can be downloaded from the cuDNN Download page, as shown in the following figure:

image

cuDNN supports the Windows, macOS and Linux operating systems, and to download the package you will need to register as an NVIDIA developer.

Install the package by following these instructions:

PREREQUISITES

CUDA 9.0 and a GPU of compute capability 3.0 or higher are required.

ALL PLATFORMS

Extract the downloaded cuDNN .ZIP file to a folder of your choice. That folder is referred to below as <installpath>.
Then follow the platform-specific instructions.

WINDOWS

Add <installpath>\cuda\bin to the PATH environment variable. That folder should contain the `cudnn64_7.dll` library file (the exact file name depends on the cuDNN version you installed).

LINUX

cd <installpath>
export LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH

OS X

cd <installpath>
export DYLD_LIBRARY_PATH=`pwd`:$DYLD_LIBRARY_PATH

POSSIBLE ISSUE ON WINDOWS WITH AN NVIDIA GPU: On Windows, if you don't add the folder holding the cudnn64_7.dll file to the PATH environment variable as just explained, you will get the following error later when you run the training process:

ImportError: Could not find 'cudnn64_7.dll'. TensorFlow requires that this DLL be installed in a directory that is named in your %PATH% environment variable. Note that installing cuDNN is a separate step from installing CUDA, and this DLL is often found in a different directory from the CUDA DLLs. You may install the necessary DLL by downloading cuDNN 6 from this URL: https://developer.nvidia.com/cudnn
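
Once CUDA and cuDNN are installed and on the PATH, you can run a quick sanity check from any Python prompt. This is a minimal sketch (it assumes TensorFlow is already installed in your active environment, which happens later in this walkthrough); it should list a GPU device when everything is configured correctly:

from tensorflow.python.client import device_lib
# Lists the CPU and GPU devices visible to TensorFlow; a GPU entry confirms CUDA/cuDNN were found
print(device_lib.list_local_devices())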

Step 2: Create Azure Machine Learning account and install Azure Machine Learning Workbench

Deep learning neural networks (models) can be trained to classify objects that were not used to train the original model.

In order to train custom models (custom neural networks), you will use Microsoft Azure ML Workbench, which you need to install on your machine after creating the Azure ML account.

Set it up by using the following procedure:

https://docs.microsoft.com/en-us/azure/machine-learning/preview/quickstart-installation

After setting up your Azure ML Experimentation account, you should be able to download the Azure ML Workbench desktop application from there as in the following screenshot:

image

Select either the Windows or Mac setup depending on your workstation OS.

Run Azure Machine Learning Workbench to sign in for the first time

  1. After the installation of ML Workbench is complete, select the Launch Workbench button on the last screen of the installer. If you have closed the installer, find the Azure Machine Learning Workbench shortcut on your desktop or Start menu to start the app.

  2. Sign in to Workbench by using the same account that you used earlier to provision your Azure resources.

  3. When the sign-in process has succeeded, Workbench attempts to find the Machine Learning Experimentation accounts that you created earlier. It searches all Azure subscriptions to which your credentials have access. When at least one Experimentation account is found, Workbench opens with that account. It then lists the workspaces and projects found in that account.

Step 3: Configure Azure ML Workbench

eShopOnContainersAI provides a custom project for you that handles the training of an existing neural network. First, we will check how to open the project from the ML Workbench desktop application and from Visual Studio Code.

Run AML Workbench and sign in with your Microsoft account (the one authorized for your Azure Experimentation account); the main Workbench dashboard will be displayed on screen.

From there, click File -> Add Existing Folder as Project. You can also do the same by clicking the + button and then selecting Add Existing Folder as Project, as shown in the figure:

image

Then, in the right slide-out panel, click Browse to navigate to the eShopOnContainersAI folder (your local clone of the GitHub repo), select the workbench folder at the root of the repo, and accept, so that you see a form similar to the following:

image

The project name is pre-populated with the project folder's name, but you can change it to a more meaningful name, such as eShopWorkbench. An experimentation project will be created in Azure using the project name you provide here.

The project description is optional, as is the Visualstudio.com Git repository. That repository is the one that will host the experimentation tracking data.

Additionally, check the Workspace field if you have several Experimentation workspaces, since this determines which one is used for your Azure accounting; otherwise, this field is pre-populated with your only Experimentation workspace.

Finish up the creation form by clicking on the Create button.

Creating a new project, or adding an existing folder as a project, triggers the creation of an Experimentation project in Azure. You can check this newly created resource in the Azure portal, as in the following screenshot:

image

Configure an external editor for ML Workbench (such as Visual Studio Code)

You should set up an external editor for ML Workbench. Use Workbench itself only to edit the files for which it provides a rich editing experience, such as Python notebooks, data sources and data preparation packages.

Open and edit the other project files with a different tool/editor. This can be configured via File -> Configure Project IDE.

On the right slide-out panel, fill in the name of the program and the path to the application used to open the project folder. In the example below, we choose Visual Studio Code (see the Visual Studio Code documentation for installation steps), a lightweight and versatile editor with Python support, but you can also use Visual Studio or any other third-party editor (PyCharm, Eclipse, ...).

image

Once the Project IDE has been configured and a project is open in the Workbench, you can open the project folder via File -> Open Project, which opens the project in the configured IDE, as in the following screenshot of Visual Studio Code with the eShopWorkbench project open in ML Workbench:

image

Step 4: Install required Python packages

In order to run the training project, you first need to install its Python dependencies (packages). Workbench uses Conda as its package manager, so to install those dependencies you will use either the Command Prompt or the PowerShell terminal, both available from the File menu in ML Workbench.

image

The following commands are run from the Command Prompt on Windows, but you could also use a PowerShell terminal.

Once the Command Prompt or PowerShell is open, it is recommended to close ML Workbench, since we are about to install dependencies.

First, you can check the Conda environment in use by running the following command:

conda info --envs

This command shows which Conda environment is active, marked with a * next to its path, as shown below:

image

You should have at least one environment, named root, and it should be the active one.

If your computer has an NVIDIA GPU, install the package dependencies with the following command:

conda env update -f .\aml_config\conda_dependencies_gpu.yml -n root

On the other hand, if your computer has only a CPU or a non-NVIDIA GPU, use the conda_dependencies.yml file instead of conda_dependencies_gpu.yml in the previous command, that is:

conda env update -f .\aml_config\conda_dependencies.yml -n root

IMPORTANT NOTE: It is fairly common to get errors while installing the Python dependency packages. In that case, close Workbench if you had it open and run the same command again.

Once it finishes, you can confirm that all the dependencies are installed by re-running the same command; you will get one message per requirement saying "Requirement already satisfied:".

EXCEPTIONAL CASE: If you still get errors after several attempts (a rare case), you might need to update Conda and all the environment packages by executing the following optional troubleshooting command:

conda update --all

After the Conda dependencies are installed, you can list the packages installed in the current environment by running:

conda list

You will see a list considerably longer than the screenshot below:

image

As a recommendation, once the packages are installed correctly, close ML Workbench (if it was open) and open it again so that it loads the newly installed dependencies.

Step 5: Preparing folders with sample item images for the model's training

You need a set of images/pictures in order to train the custom model. Those images have to be copied into a very specific folder structure within the \workbench\data folder.

Inside the workbench project folder, you should have a data folder with two folders within it named train and validation, for the training set and the validation set respectively.

As examples, we have included a few product types such as frisbee, bracelet and thermometer, each with several related pictures/photos. You can add more item folders with your own images to both parent folders, following the same pattern shown in the screenshot below:

image

In the previous figure, you can see that the deep learning model will try to classify each image into one of those categories (labels).

Each folder in the train folder has a sibling folder in the validation folder, for instance data/train/thermometer and data/validation/thermometer.

Inside each of these folders, you provide images appropriate to that label/type. You need at least around 15 images per category/label, distributed between the train and validation folders with approximately 80% of the images in the train folder and 20% in the validation folder (so with 15 images, roughly 12 for training and 3 for validation).
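
For illustration, the expected layout looks roughly like the sketch below (folder names are the labels; frisbee, bracelet and thermometer come from the sample data, and the individual file names are just hypothetical examples):

workbench\data
  train
    bracelet
    frisbee
      frisbee-01.jpg
      frisbee-02.jpg
      ...
    thermometer
  validation
    bracelet
    frisbee
      frisbee-13.jpg
      ...
    thermometer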

For instance, the following screenshot shows the contents of the data\train\frisbee folder:

image

Note that you also have a sibling folder named data\validation\frisbee with additional frisbees.

You can provide as many labels/types as you want, as pairs of folders following the naming scheme already described. However, it is recommended that you first train with the images provided in our GitHub repo, to confirm that your setup is working properly. Afterwards, you can add your own labels and images.

TROUBLESHOOTING: We have seen issues when image file names are too long. Also, make sure the file extensions are correct; for instance, they should be .png or .jpg.

Keep in mind that the total training time can be quite long (many minutes) if you are not using a GPU, and it also depends on the number of image types being trained.

Step 6: Train and generate your custom-model and labels/tags with ML Workbench and Keras+Python scripts

The aim of this scenario is to build a classification model based on a custom set of images. After placing the image files to classify, you train a deep learning network and save the trained model to a file, which will later be consumed by the microservice.

The model can be generated directly from a Python console, or through an Experiment Job in ML Workbench. In both cases you will have a historical record of the training runs you have made over time.

Training from the Python Console (CLI)

You need to open the Workbench PowerShell / Command Prompt console. Once the console is open, issue the following command to run the training on the data folder:

python ./training.py tensorflow

That command launches the training script, which builds a TensorFlow deep learning model through Keras, a Python library that can generate trained models to be executed on TensorFlow or CNTK.
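
To give an idea of what such a script does, the following is a minimal transfer-learning sketch in Keras. It is not the project's actual training.py; it only illustrates the general approach, assuming Keras with the TensorFlow backend and the data/train and data/validation folder layout described earlier:

from keras.applications.inception_v3 import InceptionV3
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# One sub-folder per label; the generators infer the classes from the folder names
train_gen = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    'data/train', target_size=(299, 299), batch_size=16)
val_gen = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    'data/validation', target_size=(299, 299), batch_size=16)

# Reuse an ImageNet-pretrained network and train only a new classification head
base = InceptionV3(weights='imagenet', include_top=False)
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(len(train_gen.class_indices), activation='softmax')(x)
model = Model(inputs=base.input, outputs=outputs)
for layer in base.layers:
    layer.trainable = False  # freeze the pre-trained layers

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(train_gen, steps_per_epoch=len(train_gen), epochs=10,
                    validation_data=val_gen, validation_steps=len(val_gen))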

You should see an execution similar to the following screenshot:

image

After that execution completes, the trained custom model is saved in the outputs folder, using the name model_tf.pb.

On the other hand, if you want to generate a trained model targeting CNTK, pass the cntk parameter instead:

python ./training.py cntk

The model will be saved in the outputs folder as model_cntk.pb.

In both cases, the training script execution also produces a file named labels.txt placed within the same outputs folder; this file contains the labels used for training the model.
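
Optionally, before copying the files into the microservice you can do a quick check that the exported frozen graph loads correctly. This is a hedged sketch using the TensorFlow 1.x Python API (the file path assumes you run it from the workbench project folder):

import tensorflow as tf

# Read the frozen graph produced by the training script
with tf.gfile.GFile('outputs/model_tf.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# Import it into a new graph; a failure here means the .pb file is not usable
with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
print('Operations in the graph:', len(graph.get_operations()))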

Copy the output files into eShopOnContainers AI.ProductSearchImageBased.TensorFlow.API microservice code

Copy the files model_tf.pb and labels.txt from the outputs folder:

image

And paste them into the following folder within the TensorFlow.API microservice's code:

\src\Services\AI.ProductSearchImageBased\AI.ProductSearchImageBased.TensorFlow.API\Resources\models

image

Training from the ML Workbench desktop application

Under the covers, this approach still uses the same Python scripts to train and generate the model. However, running the jobs from ML Workbench keeps a history of your training executions and, optionally, lets you run them on remote Azure VMs specialized for deep learning training.

To build the model using an Experiment Job in ML Workbench, open the ML Workbench desktop app and load the eShop workbench project you added earlier. The displayed name of the project depends on how you named it the first time you loaded it, such as "eShopWorkbench" or "workbench".

ML Workbench Experiment Job

Once the workbench project is open in the ML Workbench app, select the local configuration, select the training.py script, type tensorflow as the argument, and click the Run button to launch the Experiment Job.

image

You can then observe the Jobs panel on the right-hand side; the job just submitted will appear as processing. The job will run for quite a few minutes, depending on your computer's hardware resources (mainly CPU and GPU). If there is any error, you should be able to investigate it by opening the logs.

You can also check previously finished jobs by clicking the "Runs" button in the left-hand menu and then clicking any of the runs.

image

Within each run's info you can explore how its training performed and, most importantly, access the generated artifacts, in this case the trained model and labels:

image

Once you are viewing the information for a particular run, you can download the output files/artifacts: in this case, click the model_tf.pb file and download it, then click the labels.txt file and download it (each download handles a single file, not both).

You will need to copy these files into the eShopOnContainersAI microservice code folders in order to configure and run it, as explained at the end of this wiki post.

(OPTIONAL) Using a Docker container to run the training execution

This approach lets you run the training without having all the Python dependencies installed on your local machine, although you need Docker installed and it will run slower than with a local GPU.

Steps:

  • Make sure you pull the microsoft/mmlspark:plus-0.7.91 image with this Docker command:

docker pull microsoft/mmlspark:plus-0.7.91

  • From ML Workbench, run the same script as before but choose docker as the execution environment:

image

(OPTIONAL) Using the CLI (Command Line Interface)

Using the console is more verbose but, at the same time, allows more fine-tuning and automation options.

First, you need to open a Workbench PowerShell / Command Prompt console from the ML Workbench menu, so the console has access to the Azure CLI (az commands). Once the console is open, log in to Azure using the Azure CLI and choose a default subscription:

# to authenticate
az login

# to list subscriptions
az account list -o table

# to set the current subscription to a particular subscription ID
az account set -s "subscription_id"
# or to a particular subscription name
az account set -s "subscription_name"

You can re-run the command that lists the subscriptions to check which one is now the default/current subscription used when running az commands.

Then, you submit your experiment using the local environment, with the following command:

az ml experiment submit -c local training.py tensorflow

After the execution finishes, look up your run in the Runs panel of the Workbench application and download the model as explained in the previous Workbench section.

image

(OPTIONAL) Associate an existing Machine Learning project with a Team Services Git repo

If you didn't create a VSTS Git repo when you initially created the ML Workbench project, you can do it later and associate a Team Services Git repo with the existing Machine Learning project by running the following command from the PowerShell/Command Prompt console opened from the ML Workbench desktop app:

# Ensure that you are in the project path so Azure CLI has the context of your current project.
$ az ml project update --repo https://<Team Services account name>.visualstudio.com/_git/<project name>

(OPTIONAL) Using an Azure VM with docker and GPU to run your training much faster

Training a model on a specialized deep learning GPU-based machine (such as a Tesla-based VM) can dramatically reduce the time needed for each training run. For instance, training on the sample images included in eShopOnContainersAI took the following times in our tests:

  • MacBook Pro 2017 - NO NVIDIA GPU - Running Windows 10 natively (Bootcamp): 27:31 minutes
  • MacBook Pro 2017 - NO NVIDIA GPU - Running Docker image: 41:25 minutes
  • Surface Studio with local NVIDIA GPU (GEForce GTX 980M): 8:43 minutes
  • Azure VM NC6 with NVIDIA GPU (Tesla K80): 2:24 minutes

image

(DISCLAIMER: the previous timings can vary depending on each machine's configuration.)

Therefore, if you want to train your deep network in a GPU-based Docker container, the best and fastest option is to run it on a remote Azure virtual machine with a GPU. You can turn it on for just a few minutes and turn it off once you are done, so you only pay for what you use.

There are several flavours of virtual machines available, but we recommend the Deep Learning Virtual Machine, although you could also configure the Data Science Virtual Machine for Linux (Ubuntu) if using a GPU-based VM.

It's very important that you select the following options:

  • Linux operating system (usually Ubuntu is recommended)
  • HDD storage option (not SSD; currently, Azure GPU VMs require a standard disk)
  • NC6 series is a good option (in general, any virtual machine series with GPU capability)

NOTE: Sometimes you might need to choose a different datacenter, as GPU-based VMs are not available in all Azure datacenters. For instance, they are available in "West US 2" but not in "Central US".

image

Once the VM is created, you need its public IP address or DNS name, plus the username and password. You can get the IP/DNS from the VM's configuration in the Azure portal. Then execute the following command from the local command prompt console opened from the ML Workbench desktop app:

az ml computetarget attach remotedocker --name "azure-docker-gpu" --address "<public_IP_or_DNS_address>" --username "<user>" --password "<password>" 

image

This will create a couple of files within the folder aml_config, named azure-docker-gpu.compute and azure-docker-gpu.runconfig.

IMPORTANT: In order to use the VM's GPU, you need to edit those files with the following modifications:

  1. Edit the azure-docker-gpu.compute
  • Add the following line at the end of the file so the VM will use the GPU:

nvidiaDocker: true

  • Change the Docker image to use: use microsoft/mmlspark:plus-gpu-0.9.9 instead of microsoft/mmlspark:plus-0.9.9. Note that version numbers can vary; the important point is to use the "gpu" variant. (A sketch of the edited file appears after these steps.)
  2. Edit the azure-docker-gpu.runconfig and set it like the following:
ArgumentVector:
- $file
CondaDependenciesFile: aml_config/conda_dependencies_gpu.yml
EnvironmentVariables: null
Framework: "Python"
PrepareEnvironment: true
#SparkDependenciesFile: aml_config/spark_dependencies.yml
Target: azure-docker-gpu
TrackedRun: true
UseSampling: true

In particular, make sure you have CondaDependenciesFile: aml_config/conda_dependencies_gpu.yml and Framework: "Python".
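
As a reference, after both edits the azure-docker-gpu.compute file could look roughly like the sketch below. This is only an assumption-laden illustration: the exact set of fields is generated by the attach command, so edit your own file rather than replacing it with this content.

# azure-docker-gpu.compute (sketch; keep any other fields the attach command generated)
type: "remotedocker"
address: "<public_IP_or_DNS_address>"
username: "<user>"
baseDockerImage: microsoft/mmlspark:plus-gpu-0.9.9
nvidiaDocker: true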

Now you need to prepare the remote Azure VM with your project's environment and pull the Docker image that will be used for the training, by running the following command, which configures the remote Azure VM:

az ml experiment prepare -c azure-docker-gpu

This remote Azure VM preparation will take a few minutes and should end like in the following screenshot:

image

After setting up the virtual machine with the previous command (it pulled the needed Docker images plus additional configuration), you should be ready to execute the training using this remote VM context.

You can launch it either from the Workbench PowerShell terminal, by running the following command:

az ml experiment submit -c azure-docker-gpu training.py tensorflow

Or you can launch the experiment job from Workbench, selecting the azure-docker-gpu context and proceeding as explained before:

image

Step 7: Configure and run the eShopOnContainersAI image-based product search with the produced custom model

After updating the .env file and the models folder (as described below), you need to redeploy the whole solution.

Copy/paste the produced custom model and labels into the eShopOnContainers microservice
  • In the .env file, make sure the environment variable is set to ESHOPAI_PRODUCTSEARCHIMAGEBASED_APPROACH=TensorFlowCustom

  • Paste the trained model model_tf.pb and labels.txt files into the eShopOnContainersAI AI.ProductSearchImageBased.TensorFlow.API\Resources\models folder

  • Run eShopOnContainersAI and do an image-based product search, for instance by providing a frisbee picture:

image

  • You should get results showing the frisbees in the actual catalog:

image
