
What should we be storing, and when? #237

Open
sdmccabe opened this issue Jun 20, 2019 · 1 comment
Labels: help wanted, question

Comments

@sdmccabe (Collaborator)

Our approach of using self.results to store intermediate results arose mostly from the network reconstruction context; there we typically wanted to store, e.g., the pure weights matrix so that we could play with thresholding. For distances, especially, some of these intermediate representations may be less useful; we should think about what, precisely, we want to store for each.
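For concreteness, here is a minimal sketch of the reconstruction workflow this refers to. The CorrelationMatrix reconstructor and the 'weights_matrix' key are meant as examples of the current pattern; the exact class and key names vary across methods, so treat them as assumptions rather than guaranteed API.

import numpy as np
import networkx as nx
import netrd

TS = np.random.rand(10, 500)                      # toy time series: 10 nodes, 500 observations

recon = netrd.reconstruction.CorrelationMatrix()  # any reconstructor that fills self.results
recon.fit(TS)

# Because the raw weights are stored, we can re-threshold cheaply
# without re-running the (possibly expensive) reconstruction.
W = recon.results['weights_matrix']               # key name is an assumption
for tau in (0.2, 0.5, 0.8):
    A = (np.abs(W) >= tau).astype(int)            # keep edges with |weight| above the threshold
    np.fill_diagonal(A, 0)
    G = nx.from_numpy_array(A)
    print(tau, G.number_of_edges())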

I think this might be covered under our big issue about representing distances (#174), but I'm opening a separate issue to put it on the table. For example, we store the eigenvalues in NBD so that we can experiment with different distances; if we implement #174, that may change the utility of storing them.

sdmccabe added the help wanted and question labels on Jun 20, 2019
@leotrs (Collaborator) commented Jun 20, 2019

I think the point is to figure out where netrd fits within a user's workflow. For example, if I have two graphs, I might want to load them in memory, compute/plot their eigenvalues, and then compute the distance.

graph1, graph2 = ...   # some graphs
vals1, vals2 = compute_eigenvalues(graph1), compute_eigenvalues(graph2)
plt.scatter(vals1.real, vals1.imag)   # non-backtracking eigenvalues are complex, so plot them in the complex plane
plt.scatter(vals2.real, vals2.imag)
dist = graph_distance(graph1, graph2)

Right now, netrd forces a different workflow which I think is less intuitive/natural.

graph1, graph2 = ...   # some graphs
distance = netrd.distance.NonBacktrackingSpectral()
dist = distance.dist(graph1, graph2)
vals1, vals2 = distance.results['vals']   # results live on the distance object, not on the returned value
plt.scatter(vals1.real, vals1.imag)
plt.scatter(vals2.real, vals2.imag)

I can't really think of a situation where I would want to compute the NBD between two graphs but never use the eigenvalues. So I will always want distance.results to store the eigenvalues.

One alternative is to have netrd accept precomputed intermediate values, like this:

vals1, vals2 = compute_eigenvalues(graph1), compute_eigenvalues(graph2)
# ... some intermediate analysis
dist = distance.dist(graph1, graph2, vals1=vals1, vals2=vals2)

Though of course it would be much cleaner to have

vals1, vals2 = compute_eigenvalues(graph1), compute_eigenvalues(graph2)
# ... some intermediate analysis
dist = netrd.EMD(vals1, vals2)    # or dist = netrd.JSD(vals1, vals2), etc

In this case, the distance module of netrd would really become two sub-modules: one that computes statistics of a graph, and one that implements many different ways of comparing those statistics (plus a file putting the two together and implementing more complicated methods that involve pre-/post-processing, such as LaplacianSpectralDistance).
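As a rough sketch of what that split could look like, assuming nothing about current netrd internals: spectral_descriptor stands in for the "statistics" piece and scipy's wasserstein_distance for an EMD-style comparison; neither is existing netrd API.

import numpy as np
import networkx as nx
from scipy.stats import wasserstein_distance

# "statistics" piece: turn a graph into a descriptor
def spectral_descriptor(G):
    A = nx.to_numpy_array(G)
    return np.sort(np.linalg.eigvals(A).real)   # adjacency spectrum as a stand-in

# "comparison" piece: compare two descriptors
def emd(vals1, vals2):
    return wasserstein_distance(vals1, vals2)   # earth mover's distance between 1-D samples

# putting the two together, roughly the proposed netrd.EMD(vals1, vals2)
G1, G2 = nx.erdos_renyi_graph(50, 0.1), nx.erdos_renyi_graph(50, 0.2)
vals1, vals2 = spectral_descriptor(G1), spectral_descriptor(G2)
# ... intermediate analysis on vals1, vals2 ...
print(emd(vals1, vals2))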


Having said all of this, I admit that this is just what seems most natural to me, and we should try to design netrd so it can be used with many different workflows. Storing all the intermediate work, or accepting it as a parameter, seems like the way to go here precisely because it allows for those different alternatives.
