Comparison to ConFIG? #147

ogencoglu · 2024-09-08T19:20:09Z

ogencoglu
Sep 8, 2024

Not super familiar with the literature lately. Can you explain what are the differences of your approach to ConFIG?

PierreQuinton · 2024-09-09T06:17:43Z

PierreQuinton
Sep 9, 2024
Maintainer

Thank you for your issue, note that TorchJD is a framework to do JD (not necessarily a framework to do UPGrad), we could in principle implement config in TorchJD as an aggregator, is that what you are asking?
I think that you linked this to us since they address the notion of conflict. In the characteristics of their method they claim that it is non-conflicting but I could not find any proof there.

If $J\in\mathbb R^{m\times n}$, and $d\in \mathbb R^m$ correspond to the vector of norms of rows of $J$, then their method gives an update proportional to $(\text{diag}(d) J)^{-T} 1$, I will assume that the $^{-T}$ is the Moore Penrose pseudo inverse as we can only invert square matrices of full rank, I will also note it with $^\dagger$. Rewritting, we get $g_u \propto (\text{diag}(d) J)^\dagger 1=J^\dagger \text{diag}(d) 1=J^\dagger d$.
Now in order for this to be non conflicting, we need $J g_u$ to have non-negative entries, but $J g_u \propto J J^\dagger d=P d$ with $P$ the projection onto the span of $J$. By selecting $J$ carefully we can get any $P$ and $d$ we want, as long as $P$ is a projection matrix and $d$ has non-negative entries. If $P$ is full rank, then it is non-conflicting, however as long as the span of $P$ does not intersect with the positive cone of $\mathbb R^m$, then we can find a $d$ that will be projected outside of this positive cone. The aggregator cannot be non-conflicting when generalized with the Moore Penrose pseudo inverse and if we don't generalize it, it would probably be rather unstable.

This aggregators feels like a mix of IMTL-G and PCGrad which are both not typically performing well, I doubt that this would lead to better performance than UPGrad but we can try if you think this is worth the shot!

When trying to make the sum non-conflicting, there is essentially two very natural ways by projecting onto the non-conflicting cone (dual cone of the rows of the Jacobian), one consist of projecting the sum of the gradients onto the non-conflicting cone, this is DualProj. The other is to first project each grdients onto the non-conflicting cone and then summing them, this is UPGrad. It turns out that UPGrad appears to give better performances and this is why we highlight it.

0 replies

ogencoglu · 2024-09-09T07:07:18Z

ogencoglu
Sep 9, 2024
Author

Then it would make sense to ping the original authors as they can respond better.

@qiauil @thunil

0 replies

qiauil · 2024-09-09T07:42:27Z

qiauil
Sep 9, 2024

@PierreQuinton

Hi, thanks for the interesting comment!

I am not sure whether I fully understand the notation there, but I think there are some interesting points I would like to discuss.

The gradient matrix is usually full rank as the number of optimization parameters is usually larger than the number of losses, and the losses are usually linearly independent. We have discussed this point in Appendix A.3 in our manuscript.
We are actually not using Moore Penrose pseudo inverse in the code but a more stable way, i.e., to calculate the least squares solution of the following equation:

$$[\mathcal{U}({g}_1),\mathcal{U}({g}_2),\cdots, \mathcal{U}({g}_m)]^{\top} {x}=\mathbf{1}_m,$$

Once this equation is solved, then x should have positive dot products to any $g_i$, i.e., inside the non-conflicting cone.

I have just checked the torched package, and I think it is great work! I really like your ideas and will have a deeper look into your work and see whether we can make some comparisons!

Cheers,
Qiang

0 replies

PierreQuinton · 2024-09-09T08:16:50Z

PierreQuinton
Sep 9, 2024
Maintainer

@qiauil in that case, your formulation is equivalent to the one with the Moore Penrose pseudo inverse, in this case I agree that it is non-conflicting. However consider a point on the Pareto front, then it is Pareto stationary and therefore there is $w\in\mathbb R^m$ (actually positive) such that $\sum_{i\in[m]} w_i g_i=0$, this means that your matrix is not full rank. If your aggregator was non-conflicting then in that case it would have to decide $g_u$ with $g_u\cdot g_i=0$ for all $i$. Suppose for instance that we have two gradients with $g_1=-a g_2$ with $g_2$ unitary and $a>0$ (meaning we are on the Pareto front), then the solution $x$ will be proportional to $g_1$ either with positive or negative coefficient depending on $a$, in both case you will conflict with either $g_1$ or $g_2$ (unless $x=0$ which happens if and only if $a=1$). As you are optimizing for the Pareto front, you will in the limit get into a regime where one left singular vector of the Jacobian will be all positive with a very low singular value, in that case the condition number of the system that your solve will be very high and the associated matrix will be degenerate, this means that having control over it will be slightly hard.

That being said, I am curious of the performance of your aggregator and I think this could be implemented in TorchJD. The advantage is that you would be able to parallelize the computation of the gradients (using vmap from torch), so it should be faster! I'm guessing that the function ConFIG_update is the one computing the direction from a Jacobian.

EDIT: I think that you could avoid the problem of diving by zero by solving equivalently $[g_1,\dots, g_m]^T x=[| g_1|, \dots, | g_m |]$.

0 replies

qiauil · 2024-09-09T08:43:42Z

qiauil
Sep 9, 2024

@PierreQuinton

Thanks for your comments!

Yes, I would agree that near the Pareto front, it is hard to get a non-conflict direction. We also considered this point during the implementation. However, we found that it actually works well near the Pareto front due to the implementation of pseudo-inverse/ least square solution and are trying to get an approximate solution of the above equations. When near to the Pareto front, we will get a zero gradient. We can show a simple example here:

import torch
import time
b=torch.ones(2)
xs=[]
for i in range(1000):
    A=torch.rand(1,1000)
    A=torch.cat((A,-A),dim=0)
    xs.append(torch.linalg.lstsq(A,b).solution)
xs=torch.stack(xs)
print(torch.mean(xs))
print(torch.max(xs))
print(torch.min(xs))

which gives us 4.3675e-11,3.0981e-10, and -2.2936e-10.

Yes, the ConFIG_update returns the final update direction. Your suggestions on changing the equation seem to be very interesting! We will try it!

0 replies

PierreQuinton · 2024-09-09T09:00:37Z

PierreQuinton
Sep 9, 2024
Maintainer

@qiauil this correspond to the example I gave with $a=1$, I said that $x=0$ if and only if $a=1$. If you try with the following, then you do not get $0$ :

import torch
import time
b=torch.ones(2)
xs=[]
for i in range(1000):
    A=torch.rand(1,1000)
    A=torch.cat((A,-0.1*A),dim=0)
    xs.append(torch.linalg.lstsq(A,b).solution)

xs=torch.stack(xs)
print(torch.mean(xs))
print(torch.max(xs))
print(torch.min(xs))

here I get 0.0013, 0.0029 and 0.0.

0 replies

qiauil · 2024-09-09T09:06:04Z

qiauil
Sep 9, 2024

@qiauil this correspond to the example I gave with a = 1 , I said that x = 0 if and only if a = 1 . If you try with the following, then you do not get 0 :
import torch
import time
b=torch.ones(2)
xs=[]
for i in range(1000):
    A=torch.rand(1,1000)
    A=torch.cat((A,-0.1*A),dim=0)
    xs.append(torch.linalg.lstsq(A,b).solution)

xs=torch.stack(xs)
print(torch.mean(xs))
print(torch.max(xs))
print(torch.min(xs))
here I get 0.0013, 0.0029 and 0.0.

Hi, @PierreQuinton

Sorry, I forgot the unit vector operation; the following example will work:

import torch
import time
b=torch.ones(2)
xs=[]
for i in range(1000):
    A=torch.rand(1,1000)
    A=torch.cat((A,-0.1*A),dim=0)
    A=torch.nan_to_num((A/(A.norm(dim=1)).unsqueeze(1)),0)
    xs.append(torch.linalg.lstsq(A,b).solution)
xs=torch.stack(xs)
print(torch.mean(xs))
print(torch.max(xs))
print(torch.min(xs))

it gives 3.4520e-10, 9.2795e-09 and -1.0622e-08.

0 replies

PierreQuinton · 2024-09-09T09:31:14Z

PierreQuinton
Sep 9, 2024
Maintainer

@qiauil I think that you are right, in 2 dimension this will be non-conflicting! I still have doubts about higher dimensions. Also I would rather see a proof than an example, I will try to prove/disprove it later.

0 replies

ogencoglu · 2024-09-09T09:40:12Z

ogencoglu
Sep 9, 2024
Author

I need more popcorn. Be right back...

0 replies

ValerianRey · 2024-09-17T17:18:52Z

ValerianRey
Sep 17, 2024
Maintainer

FYI I just enabled discussions in the repo, and transferred this from issue to discussion.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comparison to ConFIG? #147

{{title}}

Replies: 10 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Comparison to ConFIG? #147

ogencoglu Sep 8, 2024

Replies: 10 comments

PierreQuinton Sep 9, 2024 Maintainer

ogencoglu Sep 9, 2024 Author

qiauil Sep 9, 2024

PierreQuinton Sep 9, 2024 Maintainer

qiauil Sep 9, 2024

PierreQuinton Sep 9, 2024 Maintainer

qiauil Sep 9, 2024

PierreQuinton Sep 9, 2024 Maintainer

ogencoglu Sep 9, 2024 Author

ValerianRey Sep 17, 2024 Maintainer

ogencoglu
Sep 8, 2024

PierreQuinton
Sep 9, 2024
Maintainer

ogencoglu
Sep 9, 2024
Author

qiauil
Sep 9, 2024

PierreQuinton
Sep 9, 2024
Maintainer

qiauil
Sep 9, 2024

PierreQuinton
Sep 9, 2024
Maintainer

qiauil
Sep 9, 2024

PierreQuinton
Sep 9, 2024
Maintainer

ogencoglu
Sep 9, 2024
Author

ValerianRey
Sep 17, 2024
Maintainer