Failure to reproduce results of prior experiments #8
Did you rely on unsupervised initialization?
I've tried both with and without it. We can discuss it in more detail in our meeting. Are you coming today?
Example in Keras: basic_net_with_keras.py.txt
This is an updated version of the test script: basic_net_with_keras.py.txt. I've made minor modifications from the last one, such that the exact same configuration (same network, same initialization, and same optimization algorithm) works okay under Caffe. I've also found one bug which might have been a contributing factor (though it doesn't solve the issue): in your Dirichlet initialization code, you forgot to take the log at the end (each set of parameters is a probability vector in log-space). I've added my corrected version in this file.
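For reference, a minimal sketch of a corrected Dirichlet initializer in log-space (NumPy only; the function name and shapes are illustrative, not the project's actual API):

```python
import numpy as np

def dirichlet_log_init(num_vectors, dim, alpha=1.0, seed=None):
    """Sample each parameter set from Dirichlet(alpha) and map to log-space.

    Each row of the result is log(p) for a probability vector p, since the
    parameters are probability vectors in log-space. Omitting the final
    np.log() is the bug described above.
    """
    rng = np.random.default_rng(seed)
    probs = rng.dirichlet(np.full(dim, alpha), size=num_vectors)
    return np.log(probs)  # this log was missing in the buggy version

w = dirichlet_log_init(4, 8, alpha=0.5, seed=0)
# rows of exp(w) sum to 1, and every entry of w is <= 0
```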
I've also found another issue with the unshared-regions behavior. I've fixed this locally in the above script, but it should be fixed in the codebase as well. See issue #11 for details.
This is the weights file of a network of the same structure that was trained in Caffe:
These are the same weights but in numpy format, each saved in its own file: weights.zip
I've trained the above network for 250 iterations with batch size 100. At the end of training the loss was on the order of 0.1~0.3, so expect values on this order (it's not precise because I forgot to test the network at the end).
After initializing with your weights, the result is the same, no learning.
Is this normal?
The point of initializing with those weights is not to train the network from this point, but simply to test that the forward pass of the network is correct, i.e. you'll need to set the weights and then evaluate the network (no training!) on the dataset (it could be just a small subset, of course) to make sure the loss is around the levels I've written above. Try to do that and update me on the results.
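As a sketch of what this forward-pass sanity check looks like (pure NumPy; `forward_fn` and the toy linear network below are stand-ins for the real model, not the project's code):

```python
import numpy as np

def mean_loss_frozen(forward_fn, weights, x, y):
    """Mean cross-entropy loss of a frozen network on a data subset.

    No optimizer step is taken anywhere, so the loaded weights are left
    exactly as they were -- this only exercises the forward pass.
    """
    logits = forward_fn(weights, x)                      # shape (N, C)
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(y)), y].mean())

# toy stand-in network: one linear layer, "loaded" weights as a numpy array
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))
y = rng.integers(0, 3, size=32)
w = rng.normal(size=(10, 3))
loss = mean_loss_frozen(lambda w_, x_: x_ @ w_, w, x, y)
```

With the real weights, the check is just: load, evaluate on a subset, and compare the resulting loss against the 0.1~0.3 range quoted above.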
And regarding the activations, it is normal for them to grow to a very large number - this is because of the sum pooling. For example, consider the activations of the similarity layer (-37): its spatial extent is 16x16, so had we used global sum pooling at that point, we'd get around -9400, which is on the same order as what you get at the last MEX layer (-3364).
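The arithmetic behind that estimate, for the record (the -37 per-location value and the 16x16 extent are the figures quoted above):

```python
per_location = -37        # similarity activation at one spatial location
spatial_extent = 16 * 16  # 16x16 spatial map, 256 locations
print(per_location * spatial_extent)  # -9472, i.e. around -9400
```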
I evaluated the model with the weights, and it gave really bad results. It turns out that the weights expect the data to be in the range [-1, 1], while the data given was in [0, 1]. After fixing that we get:
And still no learning. BTW, did you use gradient clipping in Caffe?
Actually, the data should be in the -0.5 to 0.5 range (I thought I did that in the script I sent you). Also, are these results on the training set or the test set? And no, I didn't use gradient clipping in Caffe. Given the above results, I'd assume that the issue is with the gradients. Maybe try to output more detailed statistics on them, i.e. min, max, mean, std, etc. Output these statistics for the weights I gave you, with no modifications to the weights, and average the results over a few mini-batches.
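A minimal way to gather those statistics (NumPy only; `tensor_stats` and `averaged_stats` are hypothetical helpers, and the gradient arrays would come from whatever the framework exposes):

```python
import numpy as np

def tensor_stats(arr):
    """Summary statistics for one weight or gradient tensor."""
    a = np.asarray(arr, dtype=np.float64)
    return {"min": float(a.min()), "max": float(a.max()),
            "mean": float(a.mean()), "std": float(a.std())}

def averaged_stats(grads_per_batch):
    """Average the per-batch statistics over several mini-batches."""
    stats = [tensor_stats(g) for g in grads_per_batch]
    return {k: float(np.mean([s[k] for s in stats])) for k in stats[0]}

# example: gradients of the same tensor from two mini-batches
batches = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])]
avg = averaged_stats(batches)
# avg["mean"] averages the per-batch means 2.0 and 4.0, giving 3.0
```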
Hi @elhanan7
I did the numeric vs. computed test for the gradients of the Keras network w.r.t. the offsets (of the first MEX). I just saw that you offered to meet on Thursday. Do you still want to, maybe in the morning?
Thanks for the update. Let's discuss this tomorrow (Thursday) in more detail. Can you meet at 10:30?
Yes, that works
I've tried to open the gradient files you've attached, but something seems wrong. First, the shapes are (256, 1), and I expected them to be the same as the ones from the network. Second, the numeric gradients are simply 0, which seems like a mistake.
About the different size: that is because I removed the similarity layer to make the numeric gradient computation tractable. About the zeroes, maybe I did something wrong when computing the gradients (it is not clear how to do this for a Keras model).
What sharing pattern do you use, and how many instances? Regarding getting 0, it could be that you are not computing the numeric gradients correctly. I suggest you follow the code of Caffe for checking the gradients, look at test_gradient_check_util.hpp for details -- some hints for reading this source code:
Also, don't forget that you should also check the gradient w.r.t. the input to the layer, and not just w.r.t. the parameters. Hope this helps.
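For reference, a central-difference numeric gradient check in the spirit of Caffe's checker (pure NumPy; `loss_fn` is any scalar-valued function of a flat parameter array -- this is a generic sketch, not the project's code):

```python
import numpy as np

def numeric_gradient(loss_fn, params, eps=1e-4):
    """Central-difference estimate of d(loss)/d(params).

    Perturbs one parameter at a time by +/- eps, in the same spirit as
    Caffe's test_gradient_check_util.hpp. Costs O(n) loss evaluations.
    """
    params = params.astype(np.float64)  # work on a float64 copy
    grad = np.zeros_like(params)
    for i in range(params.size):
        orig = params[i]
        params[i] = orig + eps
        loss_plus = loss_fn(params)
        params[i] = orig - eps
        loss_minus = loss_fn(params)
        params[i] = orig  # restore before moving to the next parameter
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# sanity check on f(p) = sum(p**2), whose analytic gradient is 2*p
p = np.array([0.5, -1.0, 2.0])
g = numeric_gradient(lambda q: float((q ** 2).sum()), p)
# g is approximately 2 * p
```

If the numeric gradient comes out exactly 0 for every parameter, the usual culprits are a loss function that does not actually read the perturbed parameters, or an eps too small for the precision of the forward pass.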
Extract this zip to the following directory
I compiled the Generative-CAC code and ran the 'run.py' file in exp/mnist/ght_model/train
I have just tried cloning, compiling, and unzipping training_files.zip myself, and it worked fine. Are you sure you have followed all of the steps (cloning with |
Also, have you tried it on one of the school's computers (e.g. gsm)?
That's great news! However, given that it's the second time that there was an issue with passing the correct parameters, I suggest you go through all parameters (for both MEX and Similarity) and double check that they are indeed all correct. I'll try to run a few more tests myself, and if it all goes well, then I'll notify Nadav that he can start "beta testing" the new framework.
I've added #15 to help prevent similar kinds of issues in the future, and possibly detect other cases that we are not currently aware of. I'm currently assuming this issue is fixed, so I'm closing it.
I've tried to reproduce our previous results of networks trained in Caffe, but I cannot get them to converge during training -- the loss function is either stuck or increasing. This seems to be some sort of bug in the current implementation; however, it's difficult to pinpoint where the fault lies, given that the tests pass.
I'll upload my code later on; in the meantime, I'll try to gather more specific information on the source of this issue.