Failure to reproduce results of prior experiments #8

Closed
orsharir opened this issue Jun 2, 2017 · 27 comments

@orsharir (Member) commented Jun 2, 2017

I've tried to reproduce our previous results of networks trained in Caffe, but I cannot get them to converge during training -- the loss function is either stuck or increasing. This seems to be some sort of bug in the current implementation, but it's difficult to pinpoint where the fault is, given that the tests pass.

I'll upload my code later, while I try to gather more specific information on the source of this issue.

@elhanan7 (Contributor) commented Jun 4, 2017

Did you rely on unsupervised initialization?

@orsharir (Member Author) commented Jun 4, 2017

I've tried both with and without it. We can discuss it in more detail in our meeting. Are you coming today?

@orsharir (Member Author) commented Jun 4, 2017

Example in Keras: basic_net_with_keras.py.txt

@orsharir (Member Author) commented Jun 4, 2017

This is an updated version of the test script: basic_net_with_keras.py.txt

I've made minor modifications relative to the last one, so that the exact same configuration (same network, same initialization, and same optimization algorithm) works fine under Caffe.

I've also found one bug which might have been a contributing factor (though fixing it doesn't solve the issue): in your Dirichlet initialization code, you forgot to take the log at the end (each set of parameters is a probability vector in log-space). I've added my corrected version in this file.
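A minimal sketch of the intended initialization, assuming the parameters are stored as log-probabilities over the last axis (the function name and shape convention here are illustrative, not the actual code):

```python
import numpy as np

def dirichlet_log_init(shape, alpha=1.0, rng=np.random):
    """Sample a Dirichlet-distributed probability vector over the last axis
    for every leading index, then return it in log-space."""
    leading = int(np.prod(shape[:-1]))
    probs = rng.dirichlet(np.full(shape[-1], alpha), size=leading)
    return np.log(probs).reshape(shape).astype('float32')
```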

@orsharir (Member Author) commented Jun 4, 2017

I've also found another issue with the unshared-regions behavior. I've worked around it locally in the above script, but it should be fixed properly. See issue #11 for details.

@orsharir (Member Author) commented Jun 7, 2017

This is the weights file of a network with the same structure that was trained in Caffe:
ght_model_train_1_iter_250.caffemodel.zip

@orsharir (Member Author) commented Jun 7, 2017

These are the same weights but in numpy format, each saved in its own file: weights.zip

@orsharir (Member Author) commented Jun 7, 2017

I've trained the above network for 250 iterations with batch size 100. At the end of training, the loss was on the order of 0.1~0.3, so expect values of that order (it's not precise because I forgot to test the network at the end).

@elhanan7 (Contributor) commented Jun 7, 2017

After initializing with your weights, the result is the same: no learning.
I passed a single example through the network and printed out the mean activations (a sketch of one way to do this is shown after the log):

2017-06-07 21:06:41.899961: I tensorflow/core/kernels/logging_ops.cc:79] Mean Sim[-37.658592]
2017-06-07 21:06:41.900483: I tensorflow/core/kernels/logging_ops.cc:79] Mean Mex[-15.049721]
2017-06-07 21:06:41.901129: I tensorflow/core/kernels/logging_ops.cc:79] Mean Mex[-66.746323]
2017-06-07 21:06:41.902419: I tensorflow/core/kernels/logging_ops.cc:79] Mean Mex[-275.81625]
2017-06-07 21:06:41.905782: I tensorflow/core/kernels/logging_ops.cc:79] Mean Mex[-1113.9332]
2017-06-07 21:06:41.906367: I tensorflow/core/kernels/logging_ops.cc:79] Mean Mex[-4464.123]

Is this normal?
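A minimal sketch of printing per-layer mean activations, assuming Keras with the TensorFlow backend and the model/data from basic_net_with_keras.py (names are illustrative; K.learning_phase() would also need to be fed if the model contains dropout or batch normalization):

```python
from keras import backend as K

# Build a function mapping an input batch to every layer's output.
get_activations = K.function([model.input],
                             [layer.output for layer in model.layers])

x_single = x_train[:1]  # a single example
for layer, act in zip(model.layers, get_activations([x_single])):
    print('Mean %s[%f]' % (layer.name, act.mean()))
```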

@orsharir (Member Author) commented Jun 7, 2017

The point of initializing with those weights is not to continue training from that point, but simply to test that the forward pass of the network is correct. In other words, set the weights and then evaluate the network (no training!) on the dataset (a small subset is fine, of course) to make sure the loss is around the levels I wrote above.

Try to do that and update me on the results.
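A minimal sketch of that procedure, assuming the model and test arrays from basic_net_with_keras.py and the per-layer numpy arrays from weights.zip (the .npy naming convention below is only an illustration):

```python
import numpy as np

# Load the Caffe-trained parameters into the corresponding Keras layers.
for layer in model.layers:
    current = layer.get_weights()
    if not current:
        continue  # layer has no parameters
    # assumed naming: one .npy file per parameter array, e.g. "mex1_0.npy"
    layer.set_weights([np.load('%s_%d.npy' % (layer.name, i))
                       for i in range(len(current))])

# Evaluate only -- no training step -- on a small subset of the data.
# Assumes the model was compiled with metrics=['accuracy'].
loss, acc = model.evaluate(x_test[:1000], y_test[:1000], batch_size=100)
print('loss = %.3f, accuracy = %.3f' % (loss, acc))
```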

@orsharir (Member Author) commented Jun 7, 2017

And regarding the activations, it is normal for them to grow to very large magnitudes because of the sum pooling. For example, the mean similarity activation is about -37 and its spatial extent is 16x16, so had we used global sum pooling at that point we'd get roughly -9500, which is of the same order as what you get at the last MEX layer (about -4464 in your log).
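A back-of-the-envelope check of that scale, using the rounded numbers above:

```python
mean_sim = -37.0          # mean similarity activation (from the log above)
window = 16 * 16          # spatial extent of the similarity map
print(mean_sim * window)  # -9472.0: sum pooling scales the mean by the window size
```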

@elhanan7 (Contributor) commented Jun 7, 2017

I evaluated the model with the weights, and it gave really bad results. It turns out that the weights expect the data to be in the range [-1, 1], while what was given is in [0, 1]. After fixing that we get:

loss = 0.642
accuracy = 0.83  

And still no learning.

BTW, do you use gradient clipping in Caffe?

@orsharir (Member Author) commented Jun 7, 2017

Actually, the data should be in the -0.5 to 0.5 range (I thought I did that in the script I sent you). Also, are these results on the training set or the test set?

And no, I didn't use gradient clipping in Caffe.

Given the above results, I'd assume that the issue is with the gradients. Maybe try to output more detailed statistics on them, i.e. min, max, mean, std, etc. Output these statistics for the weights I gave you, without modifying them, and average the results over a few mini-batches.
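A minimal sketch of such a check, assuming Keras with the TensorFlow backend, a compiled model ending in a softmax over 10 classes, and the raw MNIST data/one-hot labels from basic_net_with_keras.py (all names are illustrative):

```python
import numpy as np
from keras import backend as K

# Rescale the raw MNIST pixels (assumed in [0, 255]) to the expected [-0.5, 0.5].
x_scaled = x_train.astype('float32') / 255.0 - 0.5

# Cross-entropy written out explicitly, so we can take symbolic gradients
# w.r.t. every trainable weight of the model.
y_true = K.placeholder(shape=(None, 10))
loss = -K.mean(K.sum(y_true * K.log(model.output + 1e-8), axis=-1))
grads = K.gradients(loss, model.trainable_weights)
get_grads = K.function([model.input, y_true], grads)

# Average min/max/mean/std of each gradient over a few mini-batches of 100.
stats = []
for i in range(0, 1000, 100):
    gs = get_grads([x_scaled[i:i + 100], y_train[i:i + 100]])
    stats.append([(g.min(), g.max(), g.mean(), g.std()) for g in gs])

for w, s in zip(model.trainable_weights, np.mean(stats, axis=0)):
    print(w.name, 'min=%g max=%g mean=%g std=%g' % tuple(s))
```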

@orsharir (Member Author) commented:

Hi @elhanan7
Do you have any updates regarding this issue?

@elhanan7 (Contributor) commented Jun 14, 2017

I did the numeric vs. computed test for the gradients of the Keras network w.r.t. the offsets (of the first MEX layer).
They are different, and a subset of the computed gradients is very large.
The next steps are to find the specific gradient that is the culprit, and to understand why the tests didn't catch this behaviour.

I just saw that you offered to meet on Thursday. Do you still want to, maybe in the morning?

gradients.zip

@orsharir (Member Author) commented:

Thanks for the update. Let's discuss this tomorrow (Thursday) in more detail. Can you meet at 10:30?

@elhanan7 (Contributor) commented:

Yes, that works

@orsharir (Member Author) commented:

I've tried to open the gradient files you attached, but something seems wrong. First, the shapes are (256, 1), whereas I expected them to be the same as the ones from the network. Second, the numeric gradients are simply 0, which seems like a mistake.

@elhanan7 (Contributor) commented:

About the different size: that is because I removed the similarity layer to make the numeric gradient computation tractable. About the zeros: maybe I did something wrong when computing the gradients (it is not clear how to do this for a Keras model).

@orsharir (Member Author) commented:

What sharing pattern do you use, and how many instances? Regarding the zeros, it could be that you are not computing the numeric gradients correctly. I suggest you follow Caffe's gradient-checking code; look at test_gradient_check_util.hpp for details. Some hints for reading that source code:

  • Start from CheckGradientSingle.
  • The blobs_ array is the array of "blob" objects, each representing a set of parameters of a layer (e.g. one for the templates and one for the weights in the case of the Similarity layer). The bottom array holds the blobs representing the inputs to the layer, and the top array holds the blobs representing its outputs.
  • Notice that they define a new "loss" that is used only when testing gradients: the sum of squares of the layer's outputs. You could probably define this loss directly in Keras.
  • Notice that they zero out all the parameters before testing the numeric gradient, but use random values for the input arrays (random Gaussian with zero mean and std=1.0).

Also, don't forget that you should also check the gradient w.r.t. the input to the layer, and not just w.r.t. the parameters; a rough Keras sketch of this procedure is given below.

Hope this helps.
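A rough sketch of that check for a single Keras layer, assuming the TensorFlow backend (check_layer_gradients, the layer object, and the input shape are placeholders; the loop below only covers the gradient w.r.t. the input, and the same central-difference loop would be repeated for each weight array using K.set_value):

```python
import numpy as np
from keras import backend as K

def check_layer_gradients(layer, input_shape, step=1e-2, seed=0):
    rng = np.random.RandomState(seed)
    x_val = rng.randn(*input_shape).astype('float32')  # random Gaussian inputs
    x = K.placeholder(shape=input_shape)
    y = layer(x)
    loss = K.sum(K.square(y))                           # Caffe's gradient-check loss

    # Zero out the parameters, as described above.
    layer.set_weights([np.zeros_like(w) for w in layer.get_weights()])

    f_loss = K.function([x], [loss])
    f_grad = K.function([x], K.gradients(loss, [x]))
    analytic = f_grad([x_val])[0]

    # Numeric gradient w.r.t. the input, by central differences.
    numeric = np.zeros_like(x_val)
    for idx in np.ndindex(*x_val.shape):
        x_plus, x_minus = x_val.copy(), x_val.copy()
        x_plus[idx] += step
        x_minus[idx] -= step
        numeric[idx] = (f_loss([x_plus])[0] - f_loss([x_minus])[0]) / (2 * step)

    print('max |analytic - numeric| w.r.t. the input:',
          np.abs(analytic - numeric).max())
```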

@orsharir (Member Author) commented:

Extract this zip to the following directory Generative-ConvACs/exp/mnist/ght_model/train inside HUJI-Deep/Generative-ConvACs.

@elhanan7 (Contributor) commented:

I compiled the Generative-ConvACs code and ran the run.py file in exp/mnist/ght_model/train.
The result:

All required datasets are present.
Generating pre-trained model:

All required datasets are present.
Invalid train plan!

Try `python hyper_train.py --help` for more information
Error calling hyper_train script
=============== DONE ===============
Invalid train plan!

Try `python hyper_train.py --help` for more information
Error calling hyper_train script

@orsharir (Member Author) commented:

I have just tried cloning, compiling, and unzipping training_files.zip myself, and it worked fine. Are you sure you have followed all of the steps (cloning with --recursive etc.)? Just in case it makes a difference, here are my Makefile.config and my .cshrc files.

@orsharir (Member Author) commented:

Also, have you tried it on one of the school's computers (e.g. gsm)?

@elhanan7 (Contributor) commented:

It seems that the bug was that I didn't pass the block parameter into the gradients.
The tests didn't catch this because I also didn't pass the block parameter in the tests, so all of them ran with the default [1, 1, 1] blocks. Now the ght model is able to learn:

(attached plots: loss and acc)

@orsharir (Member Author) commented:

That's great news! However, given that this is the second time there has been an issue with passing the correct parameters, I suggest you go through all of the parameters (for both MEX and Similarity) and double-check that they are indeed all correct.

I'll try to run a few more tests myself, and if it all goes well, then I'll notify Nadav that he can start "beta testing" the new framework.

@orsharir (Member Author) commented:

I've added #15 to help prevent similar kinds of issues in the future, and possibly detect other cases that we are not currently aware of.

I'm currently assuming this issue is fixed, so I'm closing it.
