Failure to reproduce results of prior experiments #8

Closed
orsharir opened this issue Jun 2, 2017 · 27 comments

@orsharir (Member) commented Jun 2, 2017

I've tried to reproduce our previous results of networks trained in Caffe, but I cannot get them to converge during training -- the loss function is either stuck or increasing. This seems to be some sort of bug in the current implementation, but it's difficult to pinpoint where the fault is, given that the tests pass.

I'll upload my code later, while I try to gather more specific information on the source of this issue.

@elhanan7 (Contributor) commented Jun 4, 2017

Did you rely on unsupervised initialization?

@orsharir (Member Author) commented Jun 4, 2017

I've tried both with and without it. We can discuss it in more detail in our meeting. Are you coming today?

@orsharir (Member Author) commented Jun 4, 2017

Example in Keras: basic_net_with_keras.py.txt

@orsharir (Member Author) commented Jun 4, 2017

This is an updated version of the test script: basic_net_with_keras.py.txt

I've made minor modifications relative to the last one, so that the exact same configuration (same network, same initialization, and same optimization algorithm) works fine under Caffe.

I've also found one bug which might have been a contributing factor (though fixing it doesn't solve the issue): in your Dirichlet initialization code, you forgot to take the log at the end (each set of parameters is a probability vector in log-space). I've added my corrected version in this file.
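A minimal sketch of the intended initialization, assuming the parameters are stored as log-probabilities over the last axis (the function name and shape convention here are illustrative, not the actual code):

```python
import numpy as np

def dirichlet_log_init(shape, alpha=1.0, rng=np.random):
    """Sample a Dirichlet-distributed probability vector over the last axis
    for every leading index, then return it in log-space."""
    leading = int(np.prod(shape[:-1]))
    probs = rng.dirichlet(np.full(shape[-1], alpha), size=leading)
    return np.log(probs).reshape(shape).astype('float32')
```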

@orsharir (Member Author) commented Jun 4, 2017

I've also found another issue with the unshared-regions behavior. I've worked around it locally in the above script, but it should be fixed properly. See issue #11 for details.

@orsharir (Member Author) commented Jun 7, 2017

This is the weights file of a network with the same structure that was trained in Caffe:
ght_model_train_1_iter_250.caffemodel.zip

@orsharir (Member Author) commented Jun 7, 2017

These are the same weights but in numpy format, each saved in its own file: weights.zip

@orsharir (Member Author) commented Jun 7, 2017

I've trained the above network for 250 iterations with batch size 100. At the end of training, the loss was on the order of 0.1~0.3, so expect values of that order (it's not precise because I forgot to test the network at the end).

@elhanan7 (Contributor) commented Jun 7, 2017

After initializing with your weights, the result is the same: no learning.
I passed a single example through the network and printed out the mean activations (a sketch of one way to do this is shown after the log):

2017-06-07 21:06:41.899961: I tensorflow/core/kernels/logging_ops.cc:79] Mean Sim[-37.658592]
2017-06-07 21:06:41.900483: I tensorflow/core/kernels/logging_ops.cc:79] Mean Mex[-15.049721]
2017-06-07 21:06:41.901129: I tensorflow/core/kernels/logging_ops.cc:79] Mean Mex[-66.746323]
2017-06-07 21:06:41.902419: I tensorflow/core/kernels/logging_ops.cc:79] Mean Mex[-275.81625]
2017-06-07 21:06:41.905782: I tensorflow/core/kernels/logging_ops.cc:79] Mean Mex[-1113.9332]
2017-06-07 21:06:41.906367: I tensorflow/core/kernels/logging_ops.cc:79] Mean Mex[-4464.123]

Is this normal?
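A minimal sketch of printing per-layer mean activations, assuming Keras with the TensorFlow backend and the model/data from basic_net_with_keras.py (names are illustrative; K.learning_phase() would also need to be fed if the model contains dropout or batch normalization):

```python
from keras import backend as K

# Build a function mapping an input batch to every layer's output.
get_activations = K.function([model.input],
                             [layer.output for layer in model.layers])

x_single = x_train[:1]  # a single example
for layer, act in zip(model.layers, get_activations([x_single])):
    print('Mean %s[%f]' % (layer.name, act.mean()))
```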

@orsharir (Member Author) commented Jun 7, 2017

The point of initializing with those weights is not to continue training from that point, but simply to test that the forward pass of the network is correct. In other words, set the weights and then evaluate the network (no training!) on the dataset (a small subset is fine, of course) to make sure the loss is around the levels I wrote above.

Try to do that and update me on the results.
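A minimal sketch of that procedure, assuming the model and test arrays from basic_net_with_keras.py and the per-layer numpy arrays from weights.zip (the .npy naming convention below is only an illustration):

```python
import numpy as np

# Load the Caffe-trained parameters into the corresponding Keras layers.
for layer in model.layers:
    current = layer.get_weights()
    if not current:
        continue  # layer has no parameters
    # assumed naming: one .npy file per parameter array, e.g. "mex1_0.npy"
    layer.set_weights([np.load('%s_%d.npy' % (layer.name, i))
                       for i in range(len(current))])

# Evaluate only -- no training step -- on a small subset of the data.
# Assumes the model was compiled with metrics=['accuracy'].
loss, acc = model.evaluate(x_test[:1000], y_test[:1000], batch_size=100)
print('loss = %.3f, accuracy = %.3f' % (loss, acc))
```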

@orsharir (Member Author) commented Jun 7, 2017

And regarding the activations, it is normal for them to grow to very large magnitudes because of the sum pooling. For example, the mean similarity activation is about -37 and its spatial extent is 16x16, so had we used global sum pooling at that point we'd get roughly -9500, which is of the same order as what you get at the last MEX layer (about -4464 in your log).
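A back-of-the-envelope check of that scale, using the rounded numbers above:

```python
mean_sim = -37.0          # mean similarity activation (from the log above)
window = 16 * 16          # spatial extent of the similarity map
print(mean_sim * window)  # -9472.0: sum pooling scales the mean by the window size
```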

@elhanan7 (Contributor) commented Jun 7, 2017

I evaluated the model with the weights, and it gave really bad results. It turns out that the weights expect the data to be in the range [-1, 1], while what was given is in [0, 1]. After fixing that we get:

loss = 0.642
accuracy = 0.83  

And still no learning.

BTW, do you use gradient clipping in Caffe?

@orsharir (Member Author) commented Jun 7, 2017

Actually, the data should be in the -0.5 to 0.5 range (I thought I did that in the script I sent you). Also, are these results on the training set or the test set?

And no, I didn't use gradient clipping in Caffe.

Given the above results, I'd assume that the issue is with the gradients. Maybe try to output more detailed statistics on them, i.e. min, max, mean, std, etc. Output these statistics for the weights I gave you, without modifying them, and average the results over a few mini-batches.
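A minimal sketch of such a check, assuming Keras with the TensorFlow backend, a compiled model ending in a softmax over 10 classes, and the raw MNIST data/one-hot labels from basic_net_with_keras.py (all names are illustrative):

```python
import numpy as np
from keras import backend as K

# Rescale the raw MNIST pixels (assumed in [0, 255]) to the expected [-0.5, 0.5].
x_scaled = x_train.astype('float32') / 255.0 - 0.5

# Cross-entropy written out explicitly, so we can take symbolic gradients
# w.r.t. every trainable weight of the model.
y_true = K.placeholder(shape=(None, 10))
loss = -K.mean(K.sum(y_true * K.log(model.output + 1e-8), axis=-1))
grads = K.gradients(loss, model.trainable_weights)
get_grads = K.function([model.input, y_true], grads)

# Average min/max/mean/std of each gradient over a few mini-batches of 100.
stats = []
for i in range(0, 1000, 100):
    gs = get_grads([x_scaled[i:i + 100], y_train[i:i + 100]])
    stats.append([(g.min(), g.max(), g.mean(), g.std()) for g in gs])

for w, s in zip(model.trainable_weights, np.mean(stats, axis=0)):
    print(w.name, 'min=%g max=%g mean=%g std=%g' % tuple(s))
```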

@orsharir (Member Author) commented:

Hi @elhanan7
Do you have any updates regarding this issue?

@elhanan7 (Contributor) commented Jun 14, 2017

I did the numeric vs. computed test for the gradients of the Keras network w.r.t. the offsets (of the first MEX layer).
They are different, and a subset of the computed gradients is very large.
The next steps are to find the specific gradient that is the culprit, and to understand why the tests didn't catch this behaviour.

I just saw that you offered to meet on Thursday. Do you still want to, maybe in the morning?

gradients.zip

@orsharir (Member Author) commented:

Thanks for the update. Let's discuss this tomorrow (Thursday) in more detail. Can you meet at 10:30?

@elhanan7 (Contributor) commented:

Yes, that works

@orsharir (Member Author) commented:

I've tried to open the gradient files you attached, but something seems wrong. First, the shapes are (256, 1), whereas I expected them to be the same as the ones from the network. Second, the numeric gradients are simply 0, which seems like a mistake.

@elhanan7 (Contributor) commented:

About the different size: that is because I removed the similarity layer to make the numeric gradient computation tractable. About the zeros: maybe I did something wrong when computing the gradients (it is not clear how to do this for a Keras model).

@orsharir (Member Author) commented:

What sharing pattern do you use, and how many instances? Regarding the zeros, it could be that you are not computing the numeric gradients correctly. I suggest you follow Caffe's gradient-checking code; look at test_gradient_check_util.hpp for details. Some hints for reading that source code:

  • Start from CheckGradientSingle.
  • The blobs_ array is the array of "blob" objects, each representing a set of parameters of a layer (e.g. one for the templates and one for the weights in the case of the Similarity layer). The bottom array holds the blobs representing the inputs to the layer, and the top array holds the blobs representing its outputs.
  • Notice that they define a new "loss" that is used only when testing gradients: the sum of squares of the layer's outputs. You could probably define this loss directly in Keras.
  • Notice that they zero out all the parameters before testing the numeric gradient, but use random values for the input arrays (random Gaussian with zero mean and std=1.0).

Also, don't forget that you should also check the gradient w.r.t. the input to the layer, and not just w.r.t. the parameters; a rough Keras sketch of this procedure is given below.

Hope this helps.
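A rough sketch of that check for a single Keras layer, assuming the TensorFlow backend (check_layer_gradients, the layer object, and the input shape are placeholders; the loop below only covers the gradient w.r.t. the input, and the same central-difference loop would be repeated for each weight array using K.set_value):

```python
import numpy as np
from keras import backend as K

def check_layer_gradients(layer, input_shape, step=1e-2, seed=0):
    rng = np.random.RandomState(seed)
    x_val = rng.randn(*input_shape).astype('float32')  # random Gaussian inputs
    x = K.placeholder(shape=input_shape)
    y = layer(x)
    loss = K.sum(K.square(y))                           # Caffe's gradient-check loss

    # Zero out the parameters, as described above.
    layer.set_weights([np.zeros_like(w) for w in layer.get_weights()])

    f_loss = K.function([x], [loss])
    f_grad = K.function([x], K.gradients(loss, [x]))
    analytic = f_grad([x_val])[0]

    # Numeric gradient w.r.t. the input, by central differences.
    numeric = np.zeros_like(x_val)
    for idx in np.ndindex(*x_val.shape):
        x_plus, x_minus = x_val.copy(), x_val.copy()
        x_plus[idx] += step
        x_minus[idx] -= step
        numeric[idx] = (f_loss([x_plus])[0] - f_loss([x_minus])[0]) / (2 * step)

    print('max |analytic - numeric| w.r.t. the input:',
          np.abs(analytic - numeric).max())
```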

@orsharir (Member Author) commented:

Extract this zip to the following directory Generative-ConvACs/exp/mnist/ght_model/train inside HUJI-Deep/Generative-ConvACs.

@elhanan7 (Contributor) commented:

I compiled the Generative-ConvACs code and ran the run.py file in exp/mnist/ght_model/train.
The result:

All required datasets are present.
Generating pre-trained model:

All required datasets are present.
Invalid train plan!

Try `python hyper_train.py --help` for more information
Error calling hyper_train script
=============== DONE ===============
Invalid train plan!

Try `python hyper_train.py --help` for more information
Error calling hyper_train script

@orsharir (Member Author) commented:

I have just tried cloning, compiling, and unzipping training_files.zip myself, and it worked fine. Are you sure you have followed all of the steps (cloning with --recursive etc.)? Just in case it makes a difference, here are my Makefile.config and my .cshrc files.

@orsharir (Member Author) commented:

Also, have you tried it on one of the school's computers (e.g. gsm)?

@elhanan7 (Contributor) commented:

It seems that the bug was that I didn't pass the block parameter into the gradients.
The tests didn't catch this because I also didn't pass the block parameter in the tests, so all of them ran with the default [1, 1, 1] blocks. Now the ght model is able to learn:

(attached plots: loss and acc)

@orsharir (Member Author) commented:

That's great news! However, given that this is the second time there has been an issue with passing the correct parameters, I suggest you go through all of the parameters (for both MEX and Similarity) and double-check that they are indeed all correct.

I'll try to run a few more tests myself, and if it all goes well, then I'll notify Nadav that he can start "beta testing" the new framework.

@orsharir (Member Author) commented:

I've added #15 to help prevent similar kinds of issues in the future, and possibly detect other cases that we are not currently aware of.

I'm currently assuming this issue is fixed, so I'm closing it.
