This repository contains code for distributed computing with TensorFlow.
TensorFlow v2.0: https://www.tensorflow.org/install
TensorFlow supports distributed computing, which lets parts of a computational graph execute in a different process, possibly on a completely different server. For example, say you have a powerful GPU in your home desktop and want to do some training on it using data generated by the not-so-powerful Raspberry Pi running your current robotics project: distributed TensorFlow helps you do just that!
Distributed TensorFlow works as a server-client model: to compute parts of the graph on a different server, we need to create a set of workers that will perform the heavy lifting on the Raspberry Pi's behalf. The server doing all the heavy lifting is called a worker, and the one providing the graph is called a parameter server. Data flowing from the parameter server to the worker is called a forward pass, whereas the opposite is called a backward pass.
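Concretely, a cluster is described by mapping job names to lists of host:port addresses. Below is a minimal sketch of that mapping in Python (the address is a placeholder; the scripts later in this README load the same structure from clusterspec.json):

import tensorflow as tf

# Hypothetical single-worker cluster: one "worker" job listening on port 2222.
# Replace x.x.x.x with the IP address of the GPU machine.
cluster = tf.train.ClusterSpec({"worker": ["x.x.x.x:2222"]})
print(cluster.as_dict())  # {'worker': ['x.x.x.x:2222']}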
Below is an example script that runs in a single process; after that we will move to multiple processes.
import tensorflow as tf  # import TensorFlow

tf.compat.v1.disable_eager_execution()  # use graph mode so the v1 Session API works under TF 2.x

a = tf.constant(3.0)  # declare constants
b = tf.constant(2.0)
x = tf.add(a, b)  # add
y = tf.multiply(a, b)  # multiply
z = tf.add(x, y)  # add

with tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(z))
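Running this script on a single machine prints 11.0 (3 + 2 = 5, 3 × 2 = 6, 5 + 6 = 11), along with the device placement of each operation, since log_device_placement is enabled.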
In order to distribute the computation, you create a session on the parameter server; it will compute the graph, distributing parts of it to the workers in the cluster.
On the parameter-server side, the example code shown above is modified to support distribution, with a simple JSON file pointing at the IP address of the worker machine (the one with the GPU).
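For reference, a single-worker clusterspec.json looks like this (the address is a placeholder; the two-worker variant is shown further below):

{"worker": ["x.x.x.x:2222"]}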
import json
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # graph mode, so the v1 Session API works under TF 2.x

# Load the cluster description from the JSON file
with open('cluster_spec/clusterspec.json', 'r') as f:
    clusterspec = json.load(f)
cluster = tf.train.ClusterSpec(clusterspec)

a = tf.constant(3.0)
b = tf.constant(2.0)

# Place these operations on the first task of the "worker" job (the GPU machine)
with tf.device("/job:worker/task:0"):
    x = tf.add(a, b)
    y = tf.multiply(a, b)
    z = tf.add(x, y)

# Connect to the worker's gRPC server and run the graph
with tf.compat.v1.Session('grpc://X.X.X.X:2222', config=tf.compat.v1.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(z))
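This script is the cluster-tf.py referred to below; run it on the parameter-server machine (the Raspberry Pi in the earlier example) after the worker script described below has been started.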
Replace x.x.x.x:2222 with the IP address of the server that has the GPU in it (try ifconfig on Linux or ipconfig on Windows), both in cluster-tf.py and in the clusterspec.json file.
On the worker side, make sure the correct flavour of TensorFlow is installed (i.e. TensorFlow-GPU), and copy over the same clusterspec.json file, with x.x.x.x:2222 replaced by the IP address of the worker server.
import sys
import json
import tensorflow as tf

# The worker's task index is passed as a command line argument
task_number = int(sys.argv[1])

# Load the same cluster description used by the parameter server
with open('cluster_spec/clusterspec.json', 'r') as f:
    cluster_spec = json.load(f)
cluster = tf.train.ClusterSpec(cluster_spec)

# Start a gRPC server for this worker task
# (tf.distribute.Server is the TF 2.x name for tf.train.Server)
server = tf.distribute.Server(cluster, job_name="worker", task_index=task_number)

print("Starting server #{}".format(task_number))
server.start()
server.join()  # block forever, serving requests from the parameter server
Run the above code, gpu_worker_0.py, with 0 as the command line argument. If you have more than one worker, create additional copies of gpu_worker_0.py with the 0 in the filename replaced by the worker number, and update clusterspec.json with the IP address of each additional worker, in this manner: {"worker":["x.x.x.x:2222","y.y.y.y:2223"]}. Each new worker is then started with 1, 2, 3 and so on as its command line argument.
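If you do add a second worker, the parameter-server script can pin different operations to different worker tasks. Below is a minimal sketch, assuming the two-worker clusterspec.json shown above (addresses are placeholders) and that both workers are already running the worker script:

import json
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

with open('cluster_spec/clusterspec.json', 'r') as f:
    cluster = tf.train.ClusterSpec(json.load(f))  # now lists two worker tasks

a = tf.constant(3.0)
b = tf.constant(2.0)

with tf.device("/job:worker/task:0"):   # first worker (x.x.x.x:2222)
    x = tf.add(a, b)
with tf.device("/job:worker/task:1"):   # second worker (y.y.y.y:2223)
    y = tf.multiply(a, b)
z = tf.add(x, y)

with tf.compat.v1.Session('grpc://x.x.x.x:2222', config=tf.compat.v1.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(z))  # still 11.0, but the work is split across the two workers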
With cluster-tf.py and gpu_worker_0.py running on two different servers on the same network, the output screen of the worker shows device placement on the GPU (in my case a GTX 1050 Ti), and the parameter-server side prints the result of the mathematical operations, here 11.0.
If your worker server is a Windows computer, make sure you can ping its IP address; if not, add the IP address you are pinging from to the firewall exclusions.
- Debjyoti Chowdhury - Initial work - MyGithub
This project is licensed under the GNU General Public License v3.0 - see the LICENSE.md file for details
- Distributed Computing with TensorFlow: https://databricks.com/tensorflow/distributed-computing-with-tensorflow