Loss becomes NaN when setting use_global_stat=True for batchnorm #13902
-
Description

I trained a model and used it to perform prediction. While building the predictor, if I set the argument for_training=False, the prediction result is bad, as bad as predicting with a randomly initialized model.

Environment info (Required)

Package used (Python/R/Scala/Julia):

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):
Build config:

Error Message:

the training log:
Minimum reproducible example

(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide a link to the existing example.)

Steps to reproduce

(Paste the commands you ran that produced the error.)
What's wrong with my code?
Replies: 2 comments
-
@FCInter Thank you for submitting the issue! I'm labeling it so the MXNet community members can help resolve it.
-
You can try lowering the learning rate to 1/10 or 1/100 of the original value. When self.use_global_stats is True, the normalized activations are not strictly zero-centered with unit variance, so training is more difficult.
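To illustrate why training gets harder, here is a minimal NumPy sketch (not MXNet's actual implementation; the function name and statistics are made up for clarity) contrasting the two normalization modes. With batch statistics the output is exactly zero-centered with unit variance; with stale global (moving-average) statistics it is not, so gradients can be larger and a smaller learning rate helps.

```python
import numpy as np

def batchnorm(x, moving_mean, moving_var, use_global_stats, eps=1e-5):
    """Hypothetical sketch of BatchNorm normalization (scale/shift omitted)."""
    if use_global_stats:
        # Normalize with stored (global) statistics: output is only
        # approximately zero-centered, since the moving averages can
        # lag behind the current batch's distribution.
        mean, var = moving_mean, moving_var
    else:
        # Normalize with the current batch's own statistics: output is
        # exactly zero-centered with unit variance per feature.
        mean, var = x.mean(axis=0), x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# A batch whose true distribution has drifted away from the stored stats.
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))

# Batch statistics: mean of the output is numerically zero.
y_batch = batchnorm(x, None, None, use_global_stats=False)

# Stale global statistics (here mean 0, var 1): output mean drifts to ~3,
# which is the kind of mismatch that can destabilize training.
y_global = batchnorm(x, moving_mean=0.0, moving_var=1.0,
                     use_global_stats=True)

print(abs(y_batch.mean()))   # close to 0
print(abs(y_global.mean()))  # noticeably larger
```

The larger the gap between the stored statistics and the current batch, the larger the effective activations, which is why scaling the learning rate down by 10x or 100x is a reasonable first thing to try.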