
Suboptimal scaling for expressions of constant treewidth #117

Open · yaroslavvb opened this issue Dec 1, 2019 · 12 comments

yaroslavvb commented Dec 1, 2019

I'm seeing opt_einsum generate suboptimal schedules for einsum graphs of small treewidth.

For instance, the family of Apollonian networks consists of treewidth-3 graphs whose optimal decomposition is discovered by greedily triangulating the graph with the min-fill heuristic. The optimal scaling is O(n^4) regardless of graph size.
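For concreteness, here is a minimal sketch of the min-fill elimination heuristic on a plain dict-of-sets adjacency; it is illustrative only, not the code used to generate the figure below.

import itertools

def minfill_order(adj):
    """Greedy min-fill elimination on a graph given as {vertex: set(neighbours)}.
    Repeatedly eliminates the vertex whose neighbourhood needs the fewest fill-in
    edges to become a clique. Returns (elimination order, width of the ordering),
    where the width is an upper bound on the treewidth."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy

    def fill_cost(v):
        # number of missing edges among the neighbours of v
        return sum(1 for a, b in itertools.combinations(adj[v], 2) if b not in adj[a])

    order, width = [], 0
    while adj:
        v = min(adj, key=fill_cost)
        ns = adj[v]
        width = max(width, len(ns))
        for a, b in itertools.combinations(ns, 2):  # turn the neighbourhood into a clique
            adj[a].add(b)
            adj[b].add(a)
        for n in ns:
            adj[n].discard(v)
        del adj[v]
        order.append(v)
    return order, width

# A 4-clique (the smallest Apollonian network) already has width 3,
# matching the O(n^4) scaling mentioned above.
clique4 = {v: {u for u in range(4) if u != v} for v in range(4)}
print(minfill_order(clique4))  # ([0, 1, 2, 3], 3)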

Here's an example graph and its tree decomposition (generated with this code)
[image: app3 — example graph and its tree decomposition]

This problem corresponds to the following einsum:
BC,BD,BE,BF,BG,BI,BJ,BL,BM,CD,CE,CF,CH,CI,CK,CO,CP,DE,DG,DH,DL,DN,DO,DQ,EF,EG,EH,EJ,EK,EM,EN,EP,EQ,FI,FJ,FK,GL,GM,GN,HO,HP,HQ->
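For reproduction, the scaling can be checked along these lines (a sketch; the dimension 10 per index is arbitrary):

import opt_einsum as oe

eq = ('BC,BD,BE,BF,BG,BI,BJ,BL,BM,CD,CE,CF,CH,CI,CK,CO,CP,DE,DG,DH,DL,'
      'DN,DO,DQ,EF,EG,EH,EJ,EK,EM,EN,EP,EQ,FI,FJ,FK,GL,GM,GN,HO,HP,HQ->')
terms = eq[:-2].split(',')

for opt in ('auto', 'dp'):
    # shapes=True lets contract_path work from shapes instead of real arrays
    _, info = oe.contract_path(eq, *[(10, 10)] * len(terms), shapes=True, optimize=opt)
    print(opt, 'scaling:', max(info.scale_list))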

The default optimizer gives O(n^9) scaling. optimize='dp' recovers O(n^4), but it seems a bit slow and doesn't work for the next-largest graph, where the default contraction order is O(n^20).

[image: app4 — the next-largest Apollonian graph]

An even larger example is Apollonian-6 with 1095 edges, where opt_einsum's optimize='auto' gives an O(n^60) contraction order -- colab

jcmgray (Collaborator) commented Dec 1, 2019

It does seem the current 'greedy' heuristic struggles a lot with this type of graph. Maybe because it is a very vertex-centric approach, whereas here the size of the hyper-edges means there are far fewer of them overall.

I have some optimizers based on the line-graph tree decomposition, and they both find the optimal scaling quickly for the largest (Apollonian-6) graph:

  • QuickBB-based: scaling 4, FLOP count 8.5e10
  • FlowCutter-based: scaling 4, FLOP count 8.5e10

A hypergraph partitioner based optimizer, which usually outperforms both of those, still struggles a bit and gets:

  • scaling 5, FLOP count 8.8e11

Whereas a method that adaptively updates the usual greedy heuristic only gets:

  • scaling 22, FLOP count 2.0e44

dgasmith (Owner) commented Dec 1, 2019

We have always focused on generalized expressions rather than specific network topologies. Are there other networks beyond Apollonian that we should consider adding to the test suite and optimizing for?

yaroslavvb (Author) commented Dec 1, 2019

Perhaps random subgraphs of k-trees?

Any einsum graph computable in O(n^{k+1}) time is a subgraph of a k-tree, so if you make it work with probability 1 for some k, you have achieved optimality for all problems of a certain size. k <= 4 covers the bulk of real-world problems that are feasible to compute exactly.

I've added some k-trees to the colab, generated using the code here, which follows the simplex "gluing" approach from https://en.wikipedia.org/wiki/K-tree

[image: Screenshot 2019-12-01 — example k-trees]
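For reference, a minimal sketch of the gluing construction (this is not the linked generation code; the function name and parameters are made up for illustration):

import itertools
import random

def random_ktree(k, n_extra, seed=0):
    """Grow a k-tree by the 'gluing' construction: start from a (k+1)-clique,
    then repeatedly attach a new vertex to a randomly chosen existing k-clique."""
    rng = random.Random(seed)
    edges = set(itertools.combinations(range(k + 1), 2))
    k_cliques = list(itertools.combinations(range(k + 1), k))
    for v in range(k + 1, k + 1 + n_extra):
        base = rng.choice(k_cliques)            # existing k-clique to glue onto
        edges.update((u, v) for u in base)      # connect the new vertex to it
        # every k-subset of (base + new vertex) becomes a gluable k-clique
        k_cliques.extend(itertools.combinations(base + (v,), k))
    return sorted(edges)

print(len(random_ktree(3, 5)))  # 3-tree on 9 vertices: 6 + 5*3 = 21 edges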

In the real world you have a fixed compute budget, which limits the largest size of the intermediate factors. Under this constraint the problem of optimal decomposition is fixed-parameter tractable: finding an optimal decomposition is linear in the size of the graph. That particular algorithm doesn't seem practical, but even for heuristics, a "maximum size of intermediate factor" constraint should limit the search space significantly.

Ideally, an einsum optimizer could just tell you that the computation is not possible within your scaling budget, so you would then know to reformulate the problem. Section 10.5 of the Kleinberg/Tardos book gives a linear-time algorithm that either produces a decomposition of width 4w or a proof that the treewidth is more than w.
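In code, the user-facing behaviour I have in mind would look roughly like this, reusing the minfill_order sketch above as a stand-in width estimator (the names here are hypothetical):

def plan_or_fail(adj, width_budget):
    """Return an elimination order, or fail loudly if the estimated width
    (and hence the scaling exponent) exceeds the user's budget."""
    order, width = minfill_order(adj)
    if width > width_budget:
        raise RuntimeError(f"estimated width {width} exceeds budget {width_budget}; "
                           "reformulate the problem")
    return order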

BTW, I'm curious what your workflow is for evaluating these methods.

  • How do you enable the FlowCutter method?
  • What is an easier way to evaluate decompositions that doesn't print a ton of stuff to stdout? (Currently I'm hitting Jupyter output limits on larger graphs.)

jcmgray (Collaborator) commented Dec 2, 2019

What is an easier way to evaluate decompositions that doesn't print a ton of stuff to stdout? (Currently I'm hitting Jupyter output limits on larger graphs.)

Ah, the printed info is actually an object (with a verbose __repr__):

import opt_einsum as oe

# eq and shapes are whatever equation and shape list you are testing
path, info = oe.contract_path(eq, *shapes, shapes=True)
info.opt_cost              # total FLOP count of the chosen path
info.largest_intermediate  # size of the largest intermediate tensor
max(info.scale_list)       # worst per-contraction scaling exponent
# etc.

jcmgray (Collaborator) commented Dec 2, 2019

How do you enable the FlowCutter method?

  1. create a line-graph of the einsum equation (sketched below)
  2. feed it to flowcutter, which finds a tree decomposition
  3. generate an edge elimination ordering from the tree decomposition
  4. contract the tensors based on that edge order, using 'auto-hq' when the subgraph induced by a (hyper)edge has more than 2 tensors
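A minimal sketch of step 1, assuming NetworkX for the graph object; the tree-decomposition and contraction steps use external tools and are not shown:

import itertools
import networkx as nx

def einsum_line_graph(eq):
    """Line graph of an einsum equation: one node per index, with an edge
    between two indices whenever they appear together in some tensor."""
    lhs = eq.split('->')[0]
    g = nx.Graph()
    for term in lhs.split(','):
        g.add_nodes_from(term)
        for a, b in itertools.combinations(sorted(set(term)), 2):
            g.add_edge(a, b)
    return g

g = einsum_line_graph('ab,bc,ca->')
print(sorted(g.edges()))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]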

yaroslavvb (Author) commented Dec 8, 2019

A couple more examples: I expected to always get an O(n^2) schedule from tree-structured graphs, but I get an O(n^4) schedule 66% of the time using optimize='auto'. optimize='dp' recovers it correctly, though.

Here's an example that takes 100 random subgraphs of the tree-structured expression below and prints the scaling of the resulting einsum optimization:

[image: tree-structured expression graph]

import re
from collections import defaultdict

import numpy as np
import opt_einsum as oe

# Edge list of the tree-structured expression pictured above.
graph='{{1,2},{1,3},{1,4},{1,5},{1,6},{2,7},{2,8},{2,9},{2,10},{2,11},{3,12},{3,13},{3,14},{3,15},{3,16},{4,17},{4,18},{4,19},{4,20},{4,21},{5,22},{5,23},{5,24},{5,25},{5,26},{6,27},{6,28},{6,29},{6,30},{6,31},{7,32},{7,33},{7,34},{7,35},{7,36},{8,37},{8,38},{8,39},{8,40},{8,41},{9,42},{9,43},{9,44},{9,45},{9,46},{10,47},{10,48},{10,49},{10,50},{10,51},{11,52},{11,53},{11,54},{11,55},{11,56},{12,57},{12,58},{12,59},{12,60},{12,61},{13,62},{13,63},{13,64},{13,65},{13,66},{14,67},{14,68},{14,69},{14,70},{14,71},{15,72},{15,73},{15,74},{15,75},{15,76},{16,77},{16,78},{16,79},{16,80},{16,81},{17,82},{17,83},{17,84},{17,85},{17,86},{18,87},{18,88},{18,89},{18,90},{18,91},{19,92},{19,93},{19,94},{19,95},{19,96},{20,97},{20,98},{20,99},{20,100},{20,101},{21,102},{21,103},{21,104},{21,105},{21,106},{22,107},{22,108},{22,109},{22,110},{22,111},{23,112},{23,113},{23,114},{23,115},{23,116},{24,117},{24,118},{24,119},{24,120},{24,121},{25,122},{25,123},{25,124},{25,125},{25,126},{26,127},{26,128},{26,129},{26,130},{26,131},{27,132},{27,133},{27,134},{27,135},{27,136},{28,137},{28,138},{28,139},{28,140},{28,141},{29,142},{29,143},{29,144},{29,145},{29,146},{30,147},{30,148},{30,149},{30,150},{30,151},{31,152},{31,153},{31,154},{31,155},{31,156}}'

def tc(num):
    # map a node number to a single index character
    return chr(num + 91)

def factor(first, second):
    return tc(int(first)) + tc(int(second))

scaling = defaultdict(int)
for i in range(100):
    factors = [factor(a, b) for a, b in re.findall(r"\{(\d+),(\d+)\}", graph)]
    # keep a random half of the edges as the subgraph
    factors = np.random.choice(factors, replace=False, size=int(len(factors) * .5))
    expr = ','.join(factors) + '->'
    path, info = oe.contract_path(expr, *[(100, 100)] * len(factors), shapes=True, optimize='auto')
    scaling[max(info.scale_list)] += 1

print(scaling)  # {4: 67, 3: 33}

dgasmith (Owner) commented Feb 2, 2020

@jcmgray Do you think we can fix this with the current algorithms, or should we look at adding FlowCutter/QuickBB for these highly structured graphs?

yaroslavvb (Author) commented:

The practical need is the following: if there is a schedule that computes an expression in reasonable time, discovering that schedule should also take reasonable time.

The assumption of "reasonable time" puts an upper bound on the treewidth of the expression graph. Perhaps the dp search strategy can be improved to take that into account? Currently it takes too long even for a tree-structured expression.

yaroslavvb (Author) commented:

BTW, I recently reran this comparison against np.einsum, and the issue still exists. However, opt_einsum's dp optimizer seems to be better than np.einsum's optimizer. NumPy seems to use this algorithm.
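The comparison is essentially this kind of thing (a minimal sketch with a toy expression; the colab below runs it on the larger graphs):

import numpy as np
import opt_einsum as oe

eq = 'ab,bc,cd,de->ae'
arrays = [np.random.rand(30, 30) for _ in range(4)]

np_path, np_report = np.einsum_path(eq, *arrays, optimize='greedy')
_, oe_info = oe.contract_path(eq, *arrays, optimize='dp')

print(np_report)                # NumPy's verbose path report
print(max(oe_info.scale_list))  # opt_einsum dp: largest contraction scaling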

Here's the colab comparing the methods

What motivated this is that @rgommers was working on including np.einsum in the array API, which requires deciding which optimization method to include in the API standard.

dgasmith (Owner) commented:

I added the greedy and optimal versions of the opt_einsum code directly to NumPy a number of years ago (oof, blame says 5!). Parsing through the blame, it hasn't been touched since. I'm happy to contribute more of opt_einsum to NumPy if we can define a feature set everyone is happy with. The diversity of use cases np.einsum serves made it a bit of a balancing act.

What we are looking for here is that the minimal scaling (but not necessarily the minimal FLOP count) is found within a reasonable time envelope for an arbitrary graph. Let me check if a few facts are true (a quick toy check is sketched after the list):

  • dp: minimal scaling 👍 (except for small treewidth), time envelope 👎
  • greedy: minimal scaling 👎, time envelope 👍
  • We need an algorithm that does not require dependencies beyond NumPy (e.g. NetworkX).
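To make the first two bullets concrete, a toy check along these lines (timings are machine-dependent; the star-shaped expression is just an example):

import time
import opt_einsum as oe

# A small star-shaped (tree-structured) expression: ab,ac,ad,...,ah->
eq = ','.join('a' + c for c in 'bcdefgh') + '->'
shapes = [(100, 100)] * 7

for opt in ('greedy', 'dp'):
    t0 = time.time()
    _, info = oe.contract_path(eq, *shapes, shapes=True, optimize=opt)
    print(f"{opt}: scaling {max(info.scale_list)}, took {time.time() - t0:.3f}s")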

There are a few ideas there which should be simple to implement if we can cover the use cases. Is there a working group we can join to find out more about the API standard?

yaroslavvb (Author) commented:

I learned about np.einsum from this discussion, @rgommers -- could you point us to where Python Array API discussions are happening?

rgommers commented:

I learned about np.einsum from this discussion, @rgommers -- could you point us to where Python Array API discussions are happening?

Sure, this is the main repo: https://github.com/data-apis/array-api/issues
