Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding more granular diff format for autoedits model training #6173

Closed
wants to merge 18 commits into from

Conversation

hitesh-1997
Copy link
Contributor

@hitesh-1997 hitesh-1997 commented Nov 21, 2024

Context

The PR makes the following high-level changes:

  1. Current auto-edit model have trouble understanding the most recent diffs, where it suggest deleting the recently added line or suggest the change recently deleted. One reason is that it doesn't have a seperate view of short term and long term diff.
  2. Introduces a more granular diff format for training the auto-edits model. Currently we only use a single diff format. The PR computes the line level for the changes made in the editor. In addition, ensures that all the continuous changes are groped together as a single entity. Additionally, it derives some strategies to calculate the diff at different granularity levels. Refer to the class for the entry point.
  3. Introduce a helper function to diff format, to simulate the document changes using markers. Refer to helper function here
  4. Refactors recent edits handling to separate long-term and short-term diffs.
  5. Initially the data is logged to the telemetry, to be used for training and evaluating the model offline.
  6. One final change is to log 10 sec diff data by the user in the analytics to capture the short term diffs.

Test plan

Added Unit tests for various changes

@hitesh-1997 hitesh-1997 marked this pull request as ready for review November 24, 2024 00:12
Copy link
Member

@valerybugakov valerybugakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed last week, let's aim to keep PRs focused on a single key change. This will make them easier to review and help reduce the risk of regressions.

This PR could be split into a series of changes that build on each other. For example, we could start by updating the implementation of the UnifiedDiffStrategy with the relevant interfaces (in DefaultContextStrategyFactory and ContextRetrieverDataCollection). After that, we can introduce one new diff strategy per PR: first TwoStageUnifiedDiffStrategy, then LineLevelDiffStrategy. The changes to prompt utils and RecentViewPortRetriever seem relatively independent and could also be extracted into smaller, separate PRs out of the stack.

I know it's not fun to spend time on this "extra" work, but we need to make it a habit from the get-go. This PR could be a great place to start. Holding each other accountable will make it easier to iterate faster in the long run.

@@ -323,24 +327,61 @@ ${RECENT_COPY_TAG_CLOSE}
`
}

export function getRecentEditsPrompt(contextItems: AutocompleteContextSnippet[]): PromptString {
export function getRecentEditsPromptComponents(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add tests to prompt-utils.test.ts to cover recent edits prompt structure changes.

@hitesh-1997
Copy link
Contributor Author

As we discussed last week, let's aim to keep PRs focused on a single key change. This will make them easier to review and help reduce the risk of regressions.

This PR could be split into a series of changes that build on each other. For example, we could start by updating the implementation of the UnifiedDiffStrategy with the relevant interfaces (in DefaultContextStrategyFactory and ContextRetrieverDataCollection). After that, we can introduce one new diff strategy per PR: first TwoStageUnifiedDiffStrategy, then LineLevelDiffStrategy. The changes to prompt utils and RecentViewPortRetriever seem relatively independent and could also be extracted into smaller, separate PRs out of the stack.

I know it's not fun to spend time on this "extra" work, but we need to make it a habit from the get-go. This PR could be a great place to start. Holding each other accountable will make it easier to iterate faster in the long run.

Refactored the PR into several other PRs. Closing this one.

  1. adding line level diff strategy for the recent edits diff calculation #6188
  2. Hitesh/add diff strategies logging #6189
  3. Hitesh/add diff stratagies #6190
  4. add 10 sec diff for autoedit experiments #6191

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants