Memory leak #33

Open
tanwimallick opened this issue Jun 17, 2019 · 6 comments

@tanwimallick

def run_epoch_generator(self, sess, model, data_generator, return_output=False, training=False, writer=None):
    output_dim = self._model_kwargs.get('output_dim')
    preds = model.outputs
    labels = model.labels[..., :output_dim]
    loss = self._loss_fn(preds=preds, labels=labels)

This part of the code has a memory leak; I get an OOM error after several epochs.
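
A quick way to confirm that the graph itself is growing (a rough sketch; the helper name here is made up) is to log the number of ops in the default graph after each epoch:

import tensorflow as tf

def log_graph_size(epoch):
    # If run_epoch_generator keeps creating new loss ops, this count
    # grows after every epoch instead of staying constant.
    num_ops = len(tf.get_default_graph().get_operations())
    print('epoch %d: %d ops in the default graph' % (epoch, num_ops))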

@liyaguang
Owner

Thanks for the information. I will investigate this issue. It would also be appreciated if you could provide more details, e.g., the error message, log, parameters, etc.

@tanwimallick
Author

tanwimallick commented Jun 19, 2019

The error message is:
2019-06-06 20:04:31.386792: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 43.75MiB. Current allocation summary follows.
2019-06-06 20:04:31.386936: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256): Total Chunks: 664, Chunks in use: 664. 166.0KiB allocated for chunks. 166.0KiB in use in bin. 8.9KiB client-requested in use in bin.

2019-06-06 20:04:31.396827: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[44800,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I tried plotting the memory consumption after each epoch and got the following plot:
[memory consumption plot: OOM]

The hyperparameter configuration was:

batch_size: 256, cl_decay_steps: 2000, filter_type: 'laplacian', horizon: 12, input_dim: 2, l1_decay: 0, max_diffusion_step: 1, num_nodes: 175, num_rnn_layers: 2, output_dim: 1, rnn_units: 64, seq_len: 12,
use_curriculum_learning: True, base_lr: 0.01, epochs: 62, epsilon: 0.001, global_step: 0, lr_decay_ratio: 0.05, max_grad_norm: 9, max_to_keep: 100, min_learning_rate: 2e-06, optimizer: adagrad, patience: 50, steps: [20, 30, 40, 50], test_every_n_epochs: 10

I got the error after 30 epochs.

@liyaguang liyaguang self-assigned this Jun 19, 2019
@ivechan

ivechan commented Jul 17, 2019

Is there any solution or suggestion? :)

@ivechan

ivechan commented Jul 17, 2019

It seems that the following code adds new nodes to the computation graph on every epoch, so the graph keeps growing larger and larger.

labels = model.labels[..., :output_dim]
loss = self._loss_fn(preds=preds, labels=labels)

A possible solution is to create the loss node in the graph during DCRNNModel initialization instead of in run_epoch_generator.
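
For illustration, a minimal sketch of that idea (the constructor arguments and names here are illustrative, not the actual DCRNNModel signature):

class DCRNNModel(object):
    def __init__(self, outputs, labels, output_dim, loss_fn):
        # In the real model these tensors come from building the DCRNN graph.
        self.outputs = outputs
        self.labels = labels
        # Build the loss op once, at graph-construction time,
        # instead of on every call to run_epoch_generator.
        self.loss = loss_fn(preds=self.outputs,
                            labels=self.labels[..., :output_dim])

run_epoch_generator would then just fetch model.loss instead of calling self._loss_fn again.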

@parkitny

Any further updates on when this fix will be added?

@tanwimallick
Author

tanwimallick commented Jul 24, 2019

It is better to define the loss node in the graph during DCRNNModel initialization; then model.loss and model.mae can be used inside run_epoch_generator.

For a quick fix, I initialized the training and testing loss separately during the initialization of DCRNNSupervisor.

# Inside DCRNNSupervisor.__init__: build the train and test loss nodes once.
preds = self._train_model.outputs
labels = self._train_model.labels[..., :output_dim]

self.preds_test = self._test_model.outputs
self.labels_test = self._test_model.labels[..., :output_dim]

self._train_loss = self._loss_fn(preds=preds, labels=labels)
self._test_loss = self._loss_fn(preds=self.preds_test, labels=self.labels_test)

Inside run_epoch_generator:

if training:
    fetches = {
        'loss': self._train_loss,
        'mae': self._train_loss,
        'global_step': tf.train.get_or_create_global_step()
    }
else:
    fetches = {
        'loss': self._test_loss,
        'mae': self._test_loss,
        'global_step': tf.train.get_or_create_global_step()
    }

Also, how did you plot the learned localized filters centered at different nodes (Figure 7 in the paper)? Is that code available?
