Training Loss NAN #17

Open
hychiang-git opened this issue Sep 14, 2021 · 0 comments

Hi, I tried to reproduce your experiment with CIFAR10, but the training loss becomes NaN. I am running on a machine with four GPUs and tensorflow-gpu 1.12.

(screenshot of the training log showing the loss going to NaN)
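Not part of the original report, but a generic first debugging step for this kind of failure is to find the exact step at which the loss first becomes NaN (in TF 1.x, `tf.check_numerics(loss, "loss")` can also be inserted into the graph to fail fast). A minimal sketch for scanning logged loss values; the helper name `first_nan_step` is hypothetical:

```python
import math

def first_nan_step(losses):
    """Return the index of the first NaN in a sequence of logged losses, or None."""
    for step, loss in enumerate(losses):
        if math.isnan(loss):
            return step
    return None

# Example: loss diverges at step 2
print(first_nan_step([0.91, 0.74, float("nan"), float("nan")]))  # 2
```

Knowing whether the NaN appears at step 0 (bad initialization/scale) or after many steps (divergence, e.g. learning rate or gradient quantization overflow) usually narrows the cause considerably.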

Here is the Option.py I used; I only modified saveModel:

import time
import tensorflow as tf

debug = False
Time = time.strftime('%Y-%m-%d', time.localtime())
Notes = 'vgg7_2888'
# Notes = 'temp'

GPU = [0]
batchSize = 128

dataSet = 'CIFAR10'

loadModel = None
# loadModel = '../model/' + '2017-12-06' + '(' + 'vgg7 2888' + ')' + '.tf'
# saveModel = None
saveModel = '../model/' + Time + '_' + Notes + '.tf'

bitsW = 2  # bit width of weights
bitsA = 8  # bit width of activations
bitsG = 8  # bit width of gradients
bitsE = 8  # bit width of errors

bitsR = 16  # bit width of randomizer

lr = tf.Variable(initial_value=0., trainable=False, name='lr', dtype=tf.float32)
lr_schedule = [0, 8, 200, 1, 250, 1./8, 300, 0]

L2 = 0

lossFunc = 'SSE'
# lossFunc = tf.losses.softmax_cross_entropy
optimizer = tf.train.GradientDescentOptimizer(1)  # lr is controlled in Quantize.G
# optimizer = tf.train.MomentumOptimizer(lr, 0.9, use_nesterov=True)

# shared variables, defined by other files
seed = None
sess = None
W_scale = []
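For reference, the flat lr_schedule list above appears to alternate epoch boundaries and learning-rate values. A minimal sketch of one plausible reading, assuming (start_epoch, lr) pairs — the helper name lr_at_epoch is hypothetical, not part of the WAGE code:

```python
def lr_at_epoch(schedule, epoch):
    """Return the lr in effect at `epoch`, reading `schedule` as
    alternating (start_epoch, lr) pairs, e.g. [0, 8, 200, 1, ...]."""
    lr = schedule[1]
    for start, value in zip(schedule[0::2], schedule[1::2]):
        if epoch >= start:
            lr = value
    return lr

sched = [0, 8, 200, 1, 250, 1. / 8, 300, 0]
print(lr_at_epoch(sched, 0))    # 8
print(lr_at_epoch(sched, 220))  # 1
print(lr_at_epoch(sched, 320))  # 0
```

Under this reading, the effective rate starts at 8 (large because lr is folded into Quantize.G rather than the optimizer, whose rate is fixed at 1) and drops to 0 at epoch 300.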

WAGE folder structure:

.
|-- README.md
|-- dataSet
|   |-- CIFAR10.npz
|   |-- CIFAR10.py
|   |-- cifar-10-batches-py
|   |   |-- batches.meta
|   |   |-- data_batch_1
|   |   |-- data_batch_2
|   |   |-- data_batch_3
|   |   |-- data_batch_4
|   |   |-- data_batch_5
|   |   |-- readme.html
|   |   `-- test_batch
|   `-- cifar-10-python.tar.gz
|-- log
|   |-- 2018-01-30(vgg7\ 2888).txt
|   |-- 2021-09-14(temp).txt
|   `-- 2021-09-14(vgg7_2888).txt
|-- model
`-- source
    |-- Log.py
    |-- Log.pyc
    |-- NN.py
    |-- NN.pyc
    |-- Option.py
    |-- Option.pyc
    |-- Quantize.py
    |-- Quantize.pyc
    |-- Top.py
    |-- getData.py
    |-- getData.pyc
    |-- myInitializer.py
    `-- myInitializer.pyc