text diffing code mutations produces unintended results #3710

jeysal · 2022-11-13T22:40:45Z


I encountered this while building noDupeKeys, but very likely this can also happen with some other lint rules including ones that are on main.

Linting

let x = { a: 1, a: 2 };

with noDupeKeys suggests removing a: 1,.

However, since record_diff does a plain text diff of the input code against the code after applying the suggested fix, the console output looks as if 1, a: were being removed, which is confusing.
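To see where 1, a: comes from, here is a minimal sketch. It is not the record_diff implementation; text_edit is a hypothetical helper that only strips the common prefix and suffix of the two strings, which is a simplified stand-in for a real text diff but is enough to reproduce the effect on this input.

```rust
/// Hypothetical helper (not part of Rome): the single text edit that turns
/// `before` into `after`, found by stripping the common prefix and suffix.
/// Returns (char offset of the edit, removed text, inserted text).
fn text_edit(before: &str, after: &str) -> (usize, String, String) {
    let b: Vec<char> = before.chars().collect();
    let a: Vec<char> = after.chars().collect();

    // Length of the shared prefix.
    let prefix = b.iter().zip(a.iter()).take_while(|(x, y)| x == y).count();

    // Length of the shared suffix, capped so it cannot overlap the prefix.
    let max_suffix = b.len().min(a.len()) - prefix;
    let suffix = b
        .iter()
        .rev()
        .zip(a.iter().rev())
        .take_while(|(x, y)| x == y)
        .take(max_suffix)
        .count();

    let removed: String = b[prefix..b.len() - suffix].iter().collect();
    let inserted: String = a[prefix..a.len() - suffix].iter().collect();
    (prefix, removed, inserted)
}
```

Calling text_edit("let x = { a: 1, a: 2 };", "let x = { a: 2 };") reports the removed text as "1, a: ", because the leading a: of the first member is indistinguishable from the a: of the second one at the text level.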

Solutions

Tree-based diffing

We could print diffs based on the CST instead of plain text. A CST diffing algorithm can understand that a: 2 is the same in both input and output code, and marks the a: 1 member as the difference.
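As a rough illustration of the idea, the sketch below diffs at the granularity of object members rather than characters. It assumes the members have already been extracted from the CST and, for simplicity, represents each one by its source text; removed_members is a hypothetical name, and a real tree diff would compare nodes recursively instead of flat strings.

```rust
/// Hypothetical sketch: indices of members in `before` that are not part of a
/// longest common subsequence with `after`, i.e. the members a member-level
/// diff would mark as removed.
fn removed_members(before: &[&str], after: &[&str]) -> Vec<usize> {
    let (n, m) = (before.len(), after.len());

    // lcs[i][j] = length of the LCS of before[i..] and after[j..].
    let mut lcs = vec![vec![0usize; m + 1]; n + 1];
    for i in (0..n).rev() {
        for j in (0..m).rev() {
            lcs[i][j] = if before[i] == after[j] {
                lcs[i + 1][j + 1] + 1
            } else {
                lcs[i + 1][j].max(lcs[i][j + 1])
            };
        }
    }

    // Walk the table and collect the members of `before` that are dropped.
    let (mut i, mut j) = (0, 0);
    let mut removed = Vec::new();
    while i < n {
        if j < m && before[i] == after[j] {
            i += 1;
            j += 1;
        } else if j < m && lcs[i][j + 1] > lcs[i + 1][j] {
            j += 1; // member only present in `after`; not a removal
        } else {
            removed.push(i);
            i += 1;
        }
    }
    removed
}
```

For the members ["a: 1", "a: 2"] and the fixed output ["a: 2"], this returns [0]: the whole first member is marked as removed, and a: 2 is recognized as unchanged.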

Disadvantage

Tree-based diffing cannot find out which member is removed in this object:

let x = { a: 1, a: 1 };

It will thus still print slightly confusing diffs in some cases. For example, for noDupeKeys it may show removal of the second member even though the first one is the one that the lint highlights as a problem.
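The ambiguity can be made concrete with a small sketch. removal_candidates is a hypothetical helper that lists every member whose removal would turn the input into the output, again treating members as plain strings; when it returns more than one index, a diff that only sees the two trees has no way to tell which member the fix actually removed.

```rust
/// Hypothetical sketch: every index `i` such that removing `before[i]`
/// yields exactly `after`.
fn removal_candidates(before: &[&str], after: &[&str]) -> Vec<usize> {
    (0..before.len())
        .filter(|&i| {
            let mut rest = before.to_vec();
            rest.remove(i);
            rest.as_slice() == after
        })
        .collect()
}
```

For ["a: 1", "a: 1"] and ["a: 1"] the candidates are [0, 1], so either member could be shown as removed. For ["a: 1", "a: 2"] and ["a: 2"] the only candidate is [0].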

Mutation tracking diffing

We could print a diff based on the previous tree and the mutation batch we apply to it, so that we know the exact node being mutated. The disadvantage of naive tree-based diffing does not apply to mutation tracking diffing for this reason.
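A minimal sketch of the idea, under the assumption that a mutation can be reduced to "replace this text range with that text". Mutation and diff_from_mutation are hypothetical names for illustration, not Rome's mutation batch API, and the range is assumed to fall on character boundaries.

```rust
/// Hypothetical stand-in for one entry of a mutation batch, reduced to a
/// byte range in the source text and its replacement.
struct Mutation {
    range: std::ops::Range<usize>,
    replacement: String,
}

/// Builds the diff directly from the mutation, without comparing the old and
/// new code. Returns (removed text, inserted text, resulting code).
fn diff_from_mutation(source: &str, mutation: &Mutation) -> (String, String, String) {
    let removed = source[mutation.range.clone()].to_string();

    let mut output = String::with_capacity(source.len());
    output.push_str(&source[..mutation.range.start]);
    output.push_str(&mutation.replacement);
    output.push_str(&source[mutation.range.end..]);

    (removed, mutation.replacement.clone(), output)
}
```

With the range 10..16 and an empty replacement, the removed text for let x = { a: 1, a: 2 }; is "a: 1, ". The same mutation applied to let x = { a: 1, a: 1 }; still reports the first member, because the removed range comes from the mutation rather than from comparing the two versions.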

Disadvantage

Mutation tracking diffing is less versatile, because it only works where there is a mutation batch to inspect.

Formatter diffs exhibit similar problems, although they are usually less confusing because formatter changes tend to be less invasive:

Tree-based diffing could improve formatter diffs as well, but we cannot trivially apply mutation tracking diffing to formatter diffs.

Notes

Both diffing algorithms are likely slower than text diffing. However, I wouldn't consider the lint/format check failure case, in which this code would run, as performance-sensitive as the success case.

This is probably a low-priority bug, since the impact on users isn't dramatic (the diff is still technically correct and users should be able to figure out what is happening), and fixing it by replacing plain text diffing with something better is not a quick fix.
So it's likely something for the future, when Rome has more users. But I thought it was worth tracking either way.

cc @leops @MichaReiser

MichaReiser · 2022-11-14T07:19:32Z


I'll defer to @leops and @xunilrj. I have little context on our text diff implementation.


leops · 2022-11-14T08:47:32Z


This part is actually quite important:

Both diffing algorithms are likely slower than text diffing. However, I wouldn't consider the lint/format check failure case, in which this code would run, as performance-sensitive as the success case.

It turns out diffing is actually one of the slower parts of the toolchain: it can take up as much as 50% of the overall runtime of the CLI in profiling runs (although this number might be biased as we run the profiling on large repos with many diagnostics).
The reason is that, although I explored some tree-based diffing algorithms and considered implementing the one used in difftastic as part of the diagnostics refactor, I ultimately ended up using the similar crate for the initial implementation. This is obviously not ideal, as we need to "commit" the mutation emitted by each code action in order to build a new syntax tree, traverse this tree from start to end to collect the text of each token into a string buffer, tokenize the string into Unicode words, and finally run the diffing algorithm over the resulting tokens.
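To give a feel for the tokenization step in that pipeline, here is a rough std-only approximation. tokenize_words is a hypothetical helper, not the code Rome uses: it groups runs of alphanumeric characters (and underscores) into one token and emits every other character as its own token, which only approximates real Unicode word segmentation.

```rust
/// Hypothetical sketch: splits `text` into word tokens (runs of alphanumeric
/// characters or `_`) and single-character tokens for everything else.
fn tokenize_words(text: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut word_start: Option<usize> = None;

    for (idx, ch) in text.char_indices() {
        if ch.is_alphanumeric() || ch == '_' {
            // Start a new word unless we are already inside one.
            word_start.get_or_insert(idx);
        } else {
            // A non-word character ends the current word, if any.
            if let Some(start) = word_start.take() {
                tokens.push(&text[start..idx]);
            }
            tokens.push(&text[idx..idx + ch.len_utf8()]);
        }
    }
    if let Some(start) = word_start {
        tokens.push(&text[start..]);
    }
    tokens
}
```

For "a: 1, a: 2" this yields the tokens a, :, space, 1, comma, space, a, :, space, 2. The two a: prefixes are identical token sequences, which is why a diff over these tokens cannot tell which member they belong to.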

For linter diagnostics we should be able to implement a form of mutation-based diffing, which could be supplemented with a local tree-based algorithm for finer-grained diffing. This is what the as_text_edits method is intended to do: it used to have a specialized diffing implementation that skipped the commit-and-stringify step, but that approach doesn't work anymore with the new internal representation of diffs.
For formatting diffs this is a bit more complex since the formatter directly emits a string, so we could either re-parse the output text and diff the resulting tree, or we may be able to make use of the sourcemapping information emitted by the printer to increase the accuracy of the existing algorithm in tracking which tokens were actually removed or inserted.
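One way the sourcemapping idea could work, sketched under loud assumptions: the printer is assumed to emit one marker per token start, pairing an offset in the source with the matching offset in the output. Marker and changed_segments are hypothetical names for illustration, not the formatter's actual API. The markers anchor the diff, so only the text between consecutive markers needs to be compared.

```rust
/// Hypothetical source map entry: byte offset of a token start in the
/// source and the offset of the same token in the formatted output.
struct Marker {
    source: usize,
    dest: usize,
}

/// Compares source and output segment by segment between consecutive
/// markers (assumed sorted) and returns the (before, after) pairs that differ.
fn changed_segments(source: &str, output: &str, markers: &[Marker]) -> Vec<(String, String)> {
    let mut edits = Vec::new();
    let (mut source_prev, mut dest_prev) = (0, 0);

    // Add a final boundary at the end of both texts.
    let boundaries = markers
        .iter()
        .map(|m| (m.source, m.dest))
        .chain(std::iter::once((source.len(), output.len())));

    for (s, d) in boundaries {
        let before = &source[source_prev..s];
        let after = &output[dest_prev..d];
        if before != after {
            edits.push((before.to_string(), after.to_string()));
        }
        source_prev = s;
        dest_prev = d;
    }
    edits
}
```

For the source let x=1; formatted as let x = 1; with markers at each token start, this reports two changed segments, "x" becoming "x " and "=" becoming "= ", without ever pairing a token with the wrong counterpart.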

2 replies

jeysal Nov 14, 2022
Author

Cool, so sounds like you'd advocate the mutation diffing algorithm for the linter, because it produces even higher-quality diffs and because it's probably faster than at least a whole tree diff.
And for the formatter, the sourcemapping information idea is awesome, hadn't thought of that.

These two together will probably be faster and produce better results, even though it may take longer to build them than building syntax tree diffing once and using it for both.

Given that this is probably low priority because of effort per value, let's refer back to this discussion when we create a concrete task to build it :)

jeysal Nov 14, 2022
Author

Some other forms of tree diffing will probably enter Rome eventually, for example if building a testing framework with equal assertions or so, but that's equally distant.
