New and Improved MapFusion #1629
base: main
Conversation
Now using the 3.9 type hints.
But it is too restrictive.
When the function was fixing the interior of the second map, it did not remove the reading.
It almost passes all functions. However, the ones that need renaming are not yet done.
…t in the input and output set. However, it is very simple.
Before, it looked for the memlet of the consumer or producer. However, one should actually only look at the memlets that are adjacent to the scope node; at least this is how the original worked. I noticed this because of the `buffer_tiling_test.py::test_basic()` test. I had not yet focused on maps that were nested and not multidimensional. It seems that the transformation has some problems there.
When it now checks for covering (i.e., whether the information to exchange is enough), it no longer descends into the maps, but only inspects the first outgoing/incoming edges of the map entry and exit. I noticed that the other way was too restrictive, especially for map tiling.
Otherwise we can end up in recursion.
Before, it replaced the eliminated variables with zero, which actually worked pretty well, but I have now changed it such that `offset()` is used. I am not sure why I used `replace` in the first place; I think there was an issue, but I am not sure.
I have some trouble understanding the difference between -Serial and -Parallel; why are both necessary?
I agree that MapFusion has issues - but I am slightly confused by the current approach of providing multiple versions - it seems as though fixes to the original should be preferable. Why should we maintain many different implementations, rather than one maintained implementation, and deprecate the rest? @tbennun - what is your view?
dace/sdfg/state.py
Outdated
# Union all subgraphs, so an array that was excluded from the read
# set because it was written first is still included if it is read
# in another subgraph
for data, accesses in rs.items():
    read_set[data] += accesses
for data, accesses in ws.items():
    write_set[data] += accesses
return read_set, write_set
return copy.deepcopy((read_set, write_set))
Why is it necessary to make a copy here?
Because the subsets are still linked to the memlets, so if you modify them, you change them at the memlets as well.
This might be useful, but since it is not possible to determine which subset belongs to which memlet, it does not make sense to maintain this link.
You could potentially copy them above, because some of them are constructed on demand anyway; however:
- The on-demand construction is the minority case.
- Doing it here allows doing it in one big sweep.
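The aliasing issue can be sketched with plain Python objects (the `Memlet` class and subset representation below are simplified stand-ins, not the actual DaCe API): without a deep copy, the returned read/write sets share the very subset objects that the memlets still reference, so mutating an entry silently mutates the memlet.

```python
import copy

# Hypothetical stand-in for a DaCe memlet: its subset is a mutable list of ranges.
class Memlet:
    def __init__(self, subset):
        self.subset = subset  # e.g. [(begin, end)]

m = Memlet([(0, 9)])
read_set = {"A": [m.subset]}  # the read set aliases the memlet's subset

# "Offsetting" the entry in the read set mutates the memlet as well:
read_set["A"][0][0] = (5, 14)
assert m.subset[0] == (5, 14)  # the memlet changed too

# Deep-copying the whole result severs the link in one big sweep:
m2 = Memlet([(0, 9)])
safe_set = copy.deepcopy({"A": [m2.subset]})
safe_set["A"][0][0] = (5, 14)
assert m2.subset[0] == (0, 9)  # the memlet stays untouched
```

This is why copying once at the return site is simpler than copying each on-demand subset at its construction point.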
dace/sdfg/state.py
Outdated
for e in out_edges:
    # skip empty memlets
    if e.data.is_empty():
        if not isinstance(n, nd.AccessNode):
Can you please add more comments to make it easier to follow what is being done?
I added more comments.
I am not 100% sure about the naming solution - having 3/5 different kinds of MapFusion seems like a poor solution that will lead to confusion - between Serial, Parallel, OTF, and the original, MapFusion versions are proliferating. Would it not be preferable to fix/choose one?
I agree, but the original map fusion was unable to do parallel fusing. The available solutions are:
- Combining serial and parallel map fusion into one.
- Doing the above but making parallel an opt-in option.

The best solution is probably 2, because then everything will work as before. But I do not have a strong preference; what do you think, @tbennun @acalotoiu?
Now only the serial version is there. Also integrated the helper into the serial file.
This reverts commit 259d17c.
…f data can be removed. This is because the function is much less strict.
Before, the intermediate scalar was a Register, but not the array. I noticed that some redundant-array-removal transformations have problems with that. An alternative would be to make the intermediate array a register.
It is now made the default. Also added some tests to check that it is working properly. This mostly refactors the implementation. It should be the same as before; the only thing that I really removed, which could become a problem, is the check whether it is used in the same state. This was always a hack, and the check should have been that it is not used in the same dataflow. But I think it was redundant anyway.
It essentially checks if the intermediate that should be turned into a shared intermediate, i.e. a sink node of the map exit, is used in the dataflow downstream. This is needed because some DaCe transformations do not correctly check for its existence; it even seems that it is killed.
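The downstream-use check can be sketched as a plain reachability search (a minimal illustration with hypothetical names, not the actual DaCe implementation, which walks the SDFG state's edges): starting from the map exit, is any reachable access node referring to the candidate data?

```python
from collections import deque

def used_downstream(successors, start, data_name, node_data):
    """Return True if any node reachable from `start` accesses `data_name`.

    `successors` maps node -> list of successor nodes (the dataflow graph),
    `node_data` maps access nodes to the data container they refer to.
    """
    seen = set()
    queue = deque(successors.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if node_data.get(node) == data_name:
            return True  # the intermediate is read/written again downstream
        queue.extend(successors.get(node, []))
    return False

# Toy dataflow: map_exit -> tmp -> tasklet -> tmp2, where "T" reappears downstream.
succ = {"map_exit": ["tmp"], "tmp": ["tasklet"], "tasklet": ["tmp2"]}
data = {"tmp": "T", "tmp2": "T"}
assert used_downstream(succ, "map_exit", "T", data)
```

If such a downstream use exists, the intermediate must be kept as a shared output of the fused map rather than removed.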
This test failed due to numerical instabilities. It passed once I changed the arguments, which does not make sense to me, as I think if `a` is close to `b` then `b` should also be close to `a`. So I changed the test to an absolute check.
The reason is that everything goes through the intermediates, so it should not be a problem.
The [initial version](#1594) of the optimization pipeline only contained a rough draft. Currently this PR contains a copy of the map fusion transformations from DaCe that are currently under [review](spcl/dace#1629). As soon as that PR is merged and DaCe is updated in GT4Py, these files will be deleted. This PR collects some general improvements:
- [x] More liberal `LoopBlocking` transformation (with tests).
- [x] Incorporate `MapFusionParallel`.
- [x] Use of `can_be_applied_to()` as soon as DaCe is updated (`TrivialGPUMapElimination`, `SerialMapPromoter`).
- [x] Looking at the strides that the lowering generates. (Partly done)

However, it still uses the MapFusion implementation that ships with GT4Py and not the one in DaCe. Note: Because of commit 60e4226 this PR must be merged after [PR1768](#1768).
It mainly verifies that fusion stops at a certain invalid point.
Thank you for the work on this transformation. MapFusion is in need of some improvements. There are a few general concerns I need to have addressed or at least discussed first in addition to the specific review comments though.
Complete Change
You advertise this in the name and description of the PR (the latter of which seems to be outdated, if I am not mistaken?), but this is essentially a completely re-written MapFusion. I have nothing against that, but this naturally worries me for a transformation this central, often used, and consequently battle tested.
There are some things, such as checking for no WCR outputs of the first map that serve as inputs to the second map, that I can no longer detect in the new can_be_applied method. If these are all correctly handled by the new transformation now, that is great. But I believe this drastic change at least needs a very concise and clear discussion in the PR to explain conceptually exactly what changed that now allows for different situations for the transformation to apply. Currently this is missing to me, and I was unable to piece this together from the doc comments alone.
Difference to OTF Map Fusion and SubgraphFusion
This new version of the transformation seems to me like it contains components of both OTF map fusion and SubgraphFusion (or CompositeFusion). I am not categorically against bringing transformations together to reduce the size of the transformation repository if there is one transformation that can take the job of two. However, currently I don't believe this transformation is meant to replace either of the two. Can you elaborate not just on how this new MapFusion is different from the old one (see previous point), but also on how this is different from the other existing transformations in this category (OTFMapFusion and SubgraphFusion)?
Complexity
By seemingly being a much more capable fusion transformation, this can_be_applied method appears to perform a lot more checks in general, which is great and improves robustness if all previously identified issues are still covered. However, can_be_applied is also run a lot, particularly on large graphs. Can you elaborate on the performance difference between this and the old MapFusion transformation, or is there no measurable difference?
def annotates_memlets():
    return False

# Pattern Nodes
map_exit_1 = transformation.transformation.PatternNode(nodes.MapExit)
It appears that there is no need for changing anything about the pattern nodes, except explicitly matching the pattern to MapExit as opposed to ExitNode. What's the reason for this change? In my opinion the first_map_exit / second_map_entry naming, for instance, seemed clearer, and this would completely remove the need for changing anything about how one manually sets up matches with this transformation - which I believe is done quite frequently in user code (i.e., code not inside the DaCe repo).
I agree the naming is better.
I also agree that a transformation as central as this one should not change its public interface unless it is really needed.
I used search+replace; I think I got them all, but if you spot a stray one, please point it out.
if scope[map_entry_1] is not None:
    return False

# We will now check if there exists a remapping of the map parameters
I think this is incomplete, or am I missing something?
I do not fully understand your question.
Do you refer to the check that ensures that the maps are in the same scope, or is a particular kind of check missing altogether?
# are not resolved yet.
access_sets: List[Dict[str, nodes.AccessNode]] = []
for scope_node in [map_entry_1, map_exit_1, map_entry_2, map_exit_2]:
    access_set: Set[nodes.AccessNode] = self.get_access_set(scope_node, state)
This only seems to collect access nodes which are connected as outputs to the map exits and inputs to the map entries, right? If so, are map-internal writes / reads taken care of?
At first I had trouble understanding your question, because I thought you referred to the data that is used to exchange data between the maps, i.e. what self.array matches (which is not examined in this function, but in the partition function).
However, then I realized that you mean writes that are fully within the map, i.e. where the dataflow does not pass through the map entry/exit.
Then you are right, this is not checked, because until now I had not realized that this might be an issue, since I assumed that all data has to pass through the maps.
However, now I realize that this is wrong (for example, I know that CompositeFusion may put intermediates fully inside a map).
To handle this case I have made the following additions:
- Fusion is refused if both map bodies contain an access node that refers to the same data.
- Fusion is refused if a map body contains an access node that refers to global data (this restriction could be lifted).
- As an implementation detail, Views are ignored, because they either refer to data from the outside (has_read_write_dendency() takes care of that), or they refer to data that is inside the map (in which case it is handled).

As a side note, as far as I can tell the original implementation did not handle this case.
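The two refusal rules above can be sketched over plain sets of data names (the helper name and its arguments are hypothetical, not the actual DaCe API; in the real transformation these sets would be gathered from access nodes strictly inside each map body):

```python
def can_fuse_bodies(first_body_data, second_body_data, global_data):
    """Sketch of the added map-internal access-node checks.

    Each argument is a set of data-container names: the data referenced by
    access nodes strictly inside the first/second map body, and the set of
    global (non-transient) containers in the SDFG.
    """
    # Refuse if both bodies reference the same map-internal data.
    if first_body_data & second_body_data:
        return False
    # Refuse if a body references global data (restriction could be lifted).
    if (first_body_data | second_body_data) & global_data:
        return False
    return True

assert can_fuse_bodies({"tmp1"}, {"tmp2"}, {"A"})      # disjoint internals: ok
assert not can_fuse_bodies({"tmp"}, {"tmp"}, set())    # same internal data
assert not can_fuse_bodies({"A"}, set(), {"A"})        # global data inside a body
```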
# Compute the renaming for translating the parameters of the _second_
# map to the ones used by the first map.
repl_dict: Dict[str, str] = self.find_parameter_remapping(
It appears to me that this function is called multiple times in a single call to can_be_applied. Is that necessary, and if so, why? Can this not be computed once and reused during the entire can_be_applied check?
You are right, it is called multiple times.
I refactored the code in such a way that it is only called once during can_be_applied().
Furthermore, also in apply() it is now called only once; before, it was called twice, and the second time the dict was always empty.
In summary, the call is now performed only twice: once in can_be_applied() and once in apply().
A new and improved version of the map fusion transformation.
The transformation is implemented in a class named MapFusionSerial; furthermore, the MapFusionParallel transformation is added, which allows fusing parallel maps together. The new transformation analyses the graph more carefully when it checks if and how it should perform the fusing.
Special consideration was given to the correctness of memlets.
However, there are still some aspects that should be improved and handled.
The new transformation produces graphs that are slightly different from before, and certain (other) transformations cannot handle the resulting SDFG. For that reason a compatibility flag strict_dataflow was introduced. By default this flag is disabled; the only place where it is activated is inside the auto-optimization function.
Furthermore, the SDFGState._read_and_write_sets() function has been rewritten to handle the new SDFGs, because of some bugs. However, one bug has been kept, because other transformations would fail otherwise. But it is a bug, and tests were written to demonstrate this.
Collection of known issues in other transformations:
- RefineNestedAccess and SDFGState._read_and_write_sets()