
Ocd #19

Open
Chung-I wants to merge 3 commits into master
Conversation

@Chung-I commented Mar 12, 2019

Implements Optimal Completion Distillation.
Adds a new config, libri_ocd_example.yaml, which enables OCD training.
Not well tested; there might be bugs.
Temperature annealing is not yet implemented; the temperature is currently fixed at 1e-8 (sharpest). A sketch of what this setting does to the targets is shown below.
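For reference, a minimal sketch (not part of this PR, with a hypothetical q_val tensor of Q-values) of how the temperature shapes the OCD target distribution:

```python
import torch
import torch.nn.functional as F

# Hypothetical Q-values over a 5-token vocabulary at one decoding step:
# tokens 1 and 3 both lead to an optimal completion (Q = 0); the rest don't.
q_val = torch.tensor([-1.0, 0.0, -2.0, 0.0, -1.0])

for temp in (1.0, 0.1, 1e-8):
    target = F.softmax(q_val / temp, dim=-1)
    print(temp, target)

# As temp -> 0 (e.g. 1e-8, the "sharpest" setting mentioned above), the
# target collapses to a uniform distribution over the optimal tokens
# (here 0.5 on tokens 1 and 3).
```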

@Liangtaiwan

@Alexander-H-Liu
I think this is a wonderful PR; can you merge it ASAP?

@xingchensong commented May 23, 2019

@Chung-I I notice that you used cross entropy in ocd_loss rather than KL divergence (which is what the paper 'Optimal Completion Distillation for Sequence Learning' officially uses). Is this PR a correct implementation of ocd_loss? Thanks.

@xingchensong commented May 23, 2019

Should ocd_loss be like this?

optimal_probs = F.softmax(q_val / temp, dim=-1)

loss += (optimal_probs * (torch.log(optimal_probs) - F.log_softmax(out_probs[b, :len_sample, :], dim=-1))).sum(dim=-1).mean()
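For what it's worth, PyTorch's built-in F.kl_div computes the same pointwise term target * (log(target) - input). A self-contained sketch with dummy shapes (the names q_val and out_logits mirror the snippet above and are assumptions, not the PR's actual tensors):

```python
import torch
import torch.nn.functional as F

# Dummy stand-ins: Q-values and model logits for one sample of length 4
# over a 10-token vocabulary (shapes assumed for illustration).
temp = 1.0
q_val = torch.randn(4, 10)
out_logits = torch.randn(4, 10, requires_grad=True)

optimal_probs = F.softmax(q_val / temp, dim=-1)
log_q = F.log_softmax(out_logits, dim=-1)

# F.kl_div expects log-probabilities as input and probabilities as target,
# and treats entries with zero target probability as contributing 0; the
# manual torch.log(optimal_probs) form instead yields NaN if a target
# probability underflows to exactly 0 (possible at very sharp temperatures
# like 1e-8).
loss = F.kl_div(log_q, optimal_probs, reduction='none').sum(dim=-1).mean()
```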

@Chung-I (Author) commented May 25, 2019

Yes, as the paper indicates, the loss they use is the KL divergence; however, for backpropagation in this scenario the two losses are equivalent in terms of gradient computation. Consider this:
KL(p||q) = ∫ p(x) log [p(x)/q(x)] dx = ∫ p(x) log p(x) dx - ∫ p(x) log q(x) dx = H(p, q) - H(p).

So H(p, q) - KL(p||q) = H(p).

H(p), while it varies with the number of targets and the temperature τ, does not depend on q and therefore contributes nothing to the gradients:
d KL(p||q) / d q = d [H(p, q) - H(p)] / d q = d H(p, q) / d q.

So the two losses are equivalent under backprop despite having different values.

Of course, H(p, q) is not a divergence, since a divergence requires D(p||q) = 0 if and only if p = q:
when p = q, H(p, q) = H(p) > 0, while KL(p||q) = 0.

It's true that if you really want to see how much q differs from p, the KL divergence is the right loss to use. But after discussing with Alex (the owner of the repo), we decided to just drop the H(p) term and use H(p, q).
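A quick numeric check of this claim (a sketch, not code from the PR): with a fixed target p, the gradients of H(p, q) and KL(p||q) with respect to the logits coincide.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = F.softmax(torch.randn(5), dim=-1)        # fixed target distribution
logits = torch.randn(5, requires_grad=True)  # model outputs

log_q = F.log_softmax(logits, dim=-1)
ce = -(p * log_q).sum()                      # H(p, q): what the PR uses
kl = (p * (p.log() - log_q)).sum()           # KL(p||q) = H(p, q) - H(p)

g_ce = torch.autograd.grad(ce, logits, retain_graph=True)[0]
g_kl = torch.autograd.grad(kl, logits)[0]
print(torch.allclose(g_ce, g_kl))            # True: identical gradients
```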

@xingchensong

I see, thanks for your reply! There is a question I'd like to consult you on: do we need to implement the backward pass ourselves when designing a new loss? Recently I was trying to reproduce CTC (which uses dynamic programming). Existing CTC repos such as Baidu's warp-ctc implement not only the forward part but also compute the gradient by hand, yet it seems we don't need to do so for ocd_loss, so I'm confused. Should we compute the gradient ourselves?

@Chung-I (Author) commented May 27, 2019

I think PyTorch does automatic differentiation for you.

Baidu implemented their own backward function because they wanted their own optimized version (DeepSpeech2, page 27).
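To make the distinction concrete, here is a minimal sketch (not tied to this repo or to warp-ctc's actual kernels): you only write backward by hand when you wrap a computation in a custom torch.autograd.Function; losses built from ordinary differentiable torch ops, like ocd_loss, get their backward from autograd automatically.

```python
import torch

class MySquare(torch.autograd.Function):
    """Toy custom op with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x  # hand-derived gradient of x^2

x = torch.randn(3, requires_grad=True)
MySquare.apply(x).sum().backward()
print(torch.allclose(x.grad, 2 * x))  # True
```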
