adding more granular diff format for autoedits model training #6173

hitesh-1997 · 2024-11-21T17:31:12Z

Context

The PR makes the following high-level changes:

The current auto-edit model has trouble understanding the most recent diffs: it suggests deleting a recently added line, or suggests a change that was recently deleted. One reason is that it doesn't have a separate view of short-term and long-term diffs.
Introduces a more granular diff format for training the auto-edits model. Currently we use only a single diff format. The PR computes line-level diffs for the changes made in the editor and ensures that all continuous changes are grouped together as a single entity. It also defines several strategies for calculating the diff at different granularity levels. Refer to the class for the entry point.
Introduces a helper function for the diff format that simulates document changes using markers. Refer to helper function here
Refactors recent edits handling to separate long-term and short-term diffs.
Initially, the data is logged to telemetry so that it can be used for training and evaluating the model offline.
One final change is to log the user's 10-second diff data to analytics to capture the short-term diffs.
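
To make the grouping and the short-term/long-term split above more concrete, here is a minimal sketch. It is not the code from this PR: the `TextChange` and `ChangeGroup` shapes, the function names, and the adjacency rule are all assumptions made for illustration. The only detail taken from the description is the 10-second window.

```typescript
// Hypothetical shape of one recorded editor change (not the PR's actual type).
interface TextChange {
    timestamp: number // milliseconds
    startLine: number // first line touched by the change
    endLine: number // last line touched by the change (inclusive)
    text: string
}

// Hypothetical group of changes treated as a single entity.
interface ChangeGroup {
    changes: TextChange[]
    startLine: number
    endLine: number
    lastTimestamp: number
}

// Merge consecutive changes that touch overlapping or adjacent lines.
// Assumes `changes` is already sorted chronologically.
function groupContinuousChanges(changes: TextChange[]): ChangeGroup[] {
    const groups: ChangeGroup[] = []
    for (const change of changes) {
        const last = groups.length > 0 ? groups[groups.length - 1] : undefined
        if (
            last !== undefined &&
            change.startLine <= last.endLine + 1 &&
            change.endLine >= last.startLine - 1
        ) {
            last.changes.push(change)
            last.startLine = Math.min(last.startLine, change.startLine)
            last.endLine = Math.max(last.endLine, change.endLine)
            last.lastTimestamp = change.timestamp
        } else {
            groups.push({
                changes: [change],
                startLine: change.startLine,
                endLine: change.endLine,
                lastTimestamp: change.timestamp,
            })
        }
    }
    return groups
}

// Split groups into short-term (inside the window) and long-term (older).
// The 10-second default mirrors the window mentioned in the description.
function splitByRecency(
    groups: ChangeGroup[],
    now: number,
    windowMs: number = 10_000
): { shortTerm: ChangeGroup[]; longTerm: ChangeGroup[] } {
    const shortTerm: ChangeGroup[] = []
    const longTerm: ChangeGroup[] = []
    for (const group of groups) {
        if (now - group.lastTimestamp <= windowMs) {
            shortTerm.push(group)
        } else {
            longTerm.push(group)
        }
    }
    return { shortTerm, longTerm }
}
```

With this shape, each diff strategy could render `shortTerm` and `longTerm` groups separately, which is one way to give the model the two views described above.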

Test plan

Added unit tests for the various changes.
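
As an illustration of how marker-based test input for such unit tests might work, the sketch below parses annotated text into a sequence of simulated changes. The marker syntax (`<I>…</I>` for insertions, `<D>…</D>` for deletions), the `SimulatedChange` type, and the function name are hypothetical and are not taken from the helper referenced in the Context section.

```typescript
// Hypothetical change record produced from marker-annotated text.
interface SimulatedChange {
    kind: 'insert' | 'delete'
    offset: number // offset in the document just before this change is applied
    text: string
}

interface SimulationResult {
    original: string // document before any change
    updated: string // document after all changes
    changes: SimulatedChange[]
}

// Assumed marker syntax: <I>inserted text</I> and <D>deleted text</D>.
function simulateChangesFromMarkers(marked: string): SimulationResult {
    const markerPattern = /<(I|D)>([\s\S]*?)<\/\1>/g
    const changes: SimulatedChange[] = []
    let original = ''
    let updated = ''
    let cursor = 0
    let match: RegExpExecArray | null
    while ((match = markerPattern.exec(marked)) !== null) {
        // Unmarked text belongs to both versions of the document.
        const plain = marked.slice(cursor, match.index)
        original += plain
        updated += plain
        const body = match[2] ?? ''
        if (match[1] === 'I') {
            changes.push({ kind: 'insert', offset: updated.length, text: body })
            updated += body
        } else {
            changes.push({ kind: 'delete', offset: updated.length, text: body })
            original += body
        }
        cursor = match.index + match[0].length
    }
    const tail = marked.slice(cursor)
    return { original: original + tail, updated: updated + tail, changes }
}
```

A test can then write its input as a single annotated string, for example `'const a = <D>1</D><I>2</I>'`, and assert on the resulting `original`, `updated`, and `changes` values.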

valerybugakov

As we discussed last week, let's aim to keep PRs focused on a single key change. This will make them easier to review and help reduce the risk of regressions.

This PR could be split into a series of changes that build on each other. For example, we could start by updating the implementation of the UnifiedDiffStrategy with the relevant interfaces (in DefaultContextStrategyFactory and ContextRetrieverDataCollection). After that, we can introduce one new diff strategy per PR: first TwoStageUnifiedDiffStrategy, then LineLevelDiffStrategy. The changes to prompt utils and RecentViewPortRetriever seem relatively independent and could also be extracted into smaller, separate PRs out of the stack.

I know it's not fun to spend time on this "extra" work, but we need to make it a habit from the get-go. This PR could be a great place to start. Holding each other accountable will make it easier to iterate faster in the long run.

valerybugakov · 2024-11-25T05:11:25Z

vscode/src/autoedits/prompt-utils.ts

@@ -323,24 +327,61 @@ ${RECENT_COPY_TAG_CLOSE}
 `
 }

-export function getRecentEditsPrompt(contextItems: AutocompleteContextSnippet[]): PromptString {
+export function getRecentEditsPromptComponents(


Let's add tests to prompt-utils.test.ts to cover recent edits prompt structure changes.

hitesh-1997 · 2024-11-25T11:53:41Z

> As we discussed last week, let's aim to keep PRs focused on a single key change. This will make them easier to review and help reduce the risk of regressions.
>
> This PR could be split into a series of changes that build on each other. For example, we could start by updating the implementation of the UnifiedDiffStrategy with the relevant interfaces (in DefaultContextStrategyFactory and ContextRetrieverDataCollection). After that, we can introduce one new diff strategy per PR: first TwoStageUnifiedDiffStrategy, then LineLevelDiffStrategy. The changes to prompt utils and RecentViewPortRetriever seem relatively independent and could also be extracted into smaller, separate PRs out of the stack.
>
> I know it's not fun to spend time on this "extra" work, but we need to make it a habit from the get-go. This PR could be a great place to start. Holding each other accountable will make it easier to iterate faster in the long run.

Refactored the PR into several other PRs. Closing this one.

hitesh-1997 marked this pull request as ready for review November 24, 2024 00:12

hitesh-1997 requested review from valerybugakov and beyang November 24, 2024 00:12

hitesh-1997 added 15 commits November 24, 2024 18:52

adding more granular diff format for autoedits model training

fc4ef49

temp changes

7b9ccfa

some change

0e8c2c3

checkpoint

9979e4c

checkpoint

19a1295

basic test case structure

082fa9e

improve marker strategy

e50b49b

diff

3a1f1d1

diff

670e3c8

logic fix and test cases fix

b7f39a7

cleanup

9da6dc4

add long term and short term diff in autoedit prompt

b77a743

diff strategies

eeb7eee

augment test case

a51b77c

add line level strategies

be81dbc

hitesh-1997 force-pushed the hitesh/diff-format branch from ac66aad to be81dbc Compare November 24, 2024 13:23

hitesh-1997 added 3 commits November 24, 2024 23:46

use common timestamp for different context candidates

8ee99a2

use time based diff logic

20f3505

metadata for logging

1a91032

valerybugakov requested changes Nov 25, 2024

hitesh-1997 closed this Nov 25, 2024
