Memory leak #33

Open
tanwimallick opened this issue Jun 17, 2019 · 6 comments

@tanwimallick

def run_epoch_generator(self, sess, model, data_generator, return_output=False, training=False, writer=None):
    output_dim = self._model_kwargs.get('output_dim')
    preds = model.outputs
    labels = model.labels[..., :output_dim]
    loss = self._loss_fn(preds=preds, labels=labels)

This part of the code has a memory leak; I get an OOM error after several epochs.
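
A quick way to confirm that the graph itself is growing (a rough sketch; the helper name here is made up) is to log the number of ops in the default graph after each epoch:

import tensorflow as tf

def log_graph_size(epoch):
    # If run_epoch_generator keeps creating new loss ops, this count
    # grows after every epoch instead of staying constant.
    num_ops = len(tf.get_default_graph().get_operations())
    print('epoch %d: %d ops in the default graph' % (epoch, num_ops))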

@liyaguang
Owner

Thanks for the information. I will investigate this issue. It would also be appreciated if you could provide more details, e.g., the error message, log, parameters, etc.

@tanwimallick
Author

tanwimallick commented Jun 19, 2019

The error message is:
2019-06-06 20:04:31.386792: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 43.75MiB. Current allocation summary follows.
2019-06-06 20:04:31.386936: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256): Total Chunks: 664, Chunks in use: 664. 166.0KiB allocated for chunks. 166.0KiB in use in bin. 8.9KiB client-requested in use in bin.

2019-06-06 20:04:31.396827: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[44800,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I tried plotting the memory consumption after each epoch and got the following plot:
[memory consumption plot: OOM]

The hyperparameter configuration was:

batch_size: 256, cl_decay_steps: 2000, filter_type: 'laplacian', horizon: 12, input_dim: 2, l1_decay: 0, max_diffusion_step: 1, num_nodes: 175, num_rnn_layers: 2, output_dim: 1, rnn_units: 64, seq_len: 12,
use_curriculum_learning: True, base_lr: 0.01, epochs: 62, epsilon: 0.001, global_step: 0, lr_decay_ratio: 0.05, max_grad_norm: 9, max_to_keep: 100, min_learning_rate: 2e-06, optimizer: adagrad, patience: 50, steps: [20, 30, 40, 50], test_every_n_epochs: 10

I got the error after 30 epochs.

@liyaguang liyaguang self-assigned this Jun 19, 2019
@ivechan

ivechan commented Jul 17, 2019

Is there any solution or suggestion? :)

@ivechan

ivechan commented Jul 17, 2019

It seems that the following code adds new nodes to the computation graph on every epoch, so the graph keeps growing larger and larger.

labels = model.labels[..., :output_dim]
loss = self._loss_fn(preds=preds, labels=labels)

A possible solution is to create the loss node in the graph during DCRNNModel initialization instead of in run_epoch_generator.
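
For illustration, a minimal sketch of that idea (the constructor arguments and names here are illustrative, not the actual DCRNNModel signature):

class DCRNNModel(object):
    def __init__(self, outputs, labels, output_dim, loss_fn):
        # In the real model these tensors come from building the DCRNN graph.
        self.outputs = outputs
        self.labels = labels
        # Build the loss op once, at graph-construction time,
        # instead of on every call to run_epoch_generator.
        self.loss = loss_fn(preds=self.outputs,
                            labels=self.labels[..., :output_dim])

run_epoch_generator would then just fetch model.loss instead of calling self._loss_fn again.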

@parkitny

Any further updates on when this fix will be added?

@tanwimallick
Author

tanwimallick commented Jul 24, 2019

It is better to define the loss node in the graph during DCRNNModel initialization; then model.loss and model.mae can be used inside run_epoch_generator.

For a quick fix, I initialized the training and testing loss separately during the initialization of DCRNNSupervisor.

# Inside DCRNNSupervisor.__init__: build the train and test loss nodes once.
preds = self._train_model.outputs
labels = self._train_model.labels[..., :output_dim]

self.preds_test = self._test_model.outputs
self.labels_test = self._test_model.labels[..., :output_dim]

self._train_loss = self._loss_fn(preds=preds, labels=labels)
self._test_loss = self._loss_fn(preds=self.preds_test, labels=self.labels_test)

Inside run_epoch_generator:

if training:
    fetches = {
        'loss': self._train_loss,
        'mae': self._train_loss,
        'global_step': tf.train.get_or_create_global_step()
    }
else:
    fetches = {
        'loss': self._test_loss,
        'mae': self._test_loss,
        'global_step': tf.train.get_or_create_global_step()
    }

Also, how did you plot the learned localized filters centered at different nodes (Figure 7 in the paper)? Is that code available?
