[feat] - Support S3 Source Resumption #3570

Merged
merged 38 commits into from
Nov 22, 2024

Conversation

ahrav (Collaborator) commented Nov 7, 2024

Description:

This PR adds resumption to the S3 source.

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

@ahrav force-pushed the s3-progress-tracker branch 2 times, most recently from 5ed6639 to b29a571 November 7, 2024 22:01
@ahrav requested review from a team November 7, 2024 22:10
@ahrav marked this pull request as ready for review November 7, 2024 22:10
@ahrav requested a review from a team as a code owner November 7, 2024 22:10
rgmz (Contributor) commented Nov 8, 2024

> This PR adds resumption to the S3 source.

This would be a beneficial capability for other sources, e.g., resuming a large GitHub org scan.

@ahrav requested review from rosecodym and a team November 18, 2024 22:24
@ahrav requested a review from a team as a code owner November 19, 2024 18:11
rosecodym (Collaborator) left a comment

This seems pretty straightforward, although I would like to see a test case for (and handling of) the case where an in-progress bucket is ignored while the scan is stopped (described in an inline comment).
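One possible way to handle the ignored-bucket case is sketched below. It is a minimal illustration, not the PR's code: `startIndex` and its parameters are hypothetical names. The assumption is that if the bucket saved in the resume info is no longer in the list of buckets to scan, the saved position is meaningless, so the scan starts over from the first bucket and the saved object key is discarded.

```go
package main

import "fmt"

// startIndex (hypothetical helper) returns the index in bucketsToScan where
// scanning should begin. found reports whether the saved bucket is still in
// the list; when it is false, the caller should also discard any saved
// object key, because it belonged to a bucket that is no longer scanned.
func startIndex(bucketsToScan []string, resumeBucket string) (idx int, found bool) {
	for i, b := range bucketsToScan {
		if b == resumeBucket {
			return i, true
		}
	}
	// Saved bucket is missing (or no bucket was saved): start from the beginning.
	return 0, false
}

func main() {
	// "beta" was in progress when the scan stopped, then was added to the
	// ignore list before the scan was restarted.
	buckets := []string{"alpha", "gamma"}
	idx, found := startIndex(buckets, "beta")
	fmt.Println(idx, found) // prints "0 false"
}
```

Starting from index 0 rescans buckets that were already finished, which trades extra work for the guarantee that nothing is skipped.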

Comment on lines 245 to 246
ctx.Logger().Error(err, "failed to get resume point")
return
Collaborator

This seems drastic - what do you think of instead restarting the scan from the beginning when we can't get a resume point? I feel like we should err on the side of scanning too much rather than scanning too little.
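The fallback suggested here could look roughly like the following sketch. The `resumePoint` type, the `load` callback, and `resumeOrRestart` are hypothetical names chosen for illustration; the only idea taken from the comment is that a failure to read the resume point should lead to a full rescan rather than an early return.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errCorrupt is a sample error used by the demo in main.
var errCorrupt = errors.New("corrupt resume info")

// resumePoint is a hypothetical stand-in for the stored resume state.
// The zero value means "start from the first bucket".
type resumePoint struct {
	Bucket   string // bucket that was in progress
	StartKey string // last object key known to be fully processed
}

// resumeOrRestart loads the resume point. If loading fails, it logs the
// error and returns the zero value, so the caller scans everything again
// instead of aborting the scan.
func resumeOrRestart(load func() (resumePoint, error), logf func(format string, args ...any)) resumePoint {
	rp, err := load()
	if err != nil {
		logf("failed to get resume point, restarting from the beginning: %v", err)
		return resumePoint{}
	}
	return rp
}

func main() {
	broken := func() (resumePoint, error) { return resumePoint{}, errCorrupt }
	rp := resumeOrRestart(broken, log.Printf)
	fmt.Printf("resuming from bucket=%q key=%q\n", rp.Bucket, rp.StartKey)
}
```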

i,
len(bucketsToScan),
fmt.Sprintf("Bucket: %s", bucket),
s.Progress.EncodedResumeInfo,
Collaborator

Why are we re-using the existing resume info? I expected to see something from the progress tracker here.

Collaborator Author

I was essentially saying, "Don’t modify EncodedResumeInfo; it’s already been updated by progressTracker, so just use it as-is." Since progressTracker and s.Progress reference the same underlying object, would it be clearer if we explicitly used s.progressTracker.Progress instead?
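The aliasing described in this reply can be illustrated with a small sketch. All types here (`Progress`, `tracker`, `source`) are simplified, hypothetical stand-ins rather than the real trufflehog types; the point is only that a tracker holding a pointer to the source's progress struct makes both names read the same value.

```go
package main

import "fmt"

// Progress is a simplified stand-in for the source's progress struct.
type Progress struct {
	EncodedResumeInfo string
}

// tracker is a hypothetical helper that writes checkpoints through a
// pointer to the Progress value owned by the source.
type tracker struct {
	progress *Progress
}

// update stores new resume info in the shared Progress value.
func (t *tracker) update(info string) { t.progress.EncodedResumeInfo = info }

// source owns the Progress value and a tracker that points at it.
type source struct {
	Progress Progress
	tracker  *tracker
}

// newSource wires the tracker to the source's own Progress field.
func newSource() *source {
	s := &source{}
	s.tracker = &tracker{progress: &s.Progress}
	return s
}

func main() {
	s := newSource()
	s.tracker.update("checkpoint-1")
	// Both paths read the same underlying field.
	fmt.Println(s.Progress.EncodedResumeInfo == s.tracker.progress.EncodedResumeInfo) // prints "true"
}
```

Reading through the tracker makes the data flow visible at the call site, which is the readability point raised in the review.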

pkg/sources/s3/s3.go (outdated, resolved)
) {
for _, obj := range page.Contents {
s.progressTracker.Reset()
Collaborator

This doesn't reset entirely, right? It just resets for a new page? I wish I'd been well enough to leave that comment on #3568 :(

Collaborator Author

Yeah, this was my mistake. I'm going to fix the tracker so it's named more accurately.
It's no longer a progress tracker; it's pretty much just a checkpointer.
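To make the "resets only for a new page" behavior concrete, here is a minimal sketch of a per-page checkpointer. It is not the PR's implementation: `pageCheckpointer`, `ResetForPage`, and `Complete` are hypothetical names, and the sketch assumes objects within a page may finish out of order, so the checkpoint only advances over a contiguous run of completed objects.

```go
package main

import "fmt"

// pageCheckpointer is a hypothetical sketch of a per-page checkpointer.
type pageCheckpointer struct {
	keys      []string // object keys of the current page, in listing order
	completed []bool   // completed[i] is true once object i has been processed
	next      int      // first index in the current page not yet checkpointed
	lastKey   string   // last safely processed key; survives page resets
}

// ResetForPage clears per-page state only. The checkpoint (lastKey) is kept.
func (c *pageCheckpointer) ResetForPage(keys []string) {
	c.keys = keys
	c.completed = make([]bool, len(keys))
	c.next = 0
}

// Complete marks object i as done, then advances the checkpoint over every
// contiguous completed object starting at c.next.
func (c *pageCheckpointer) Complete(i int) {
	if i < 0 || i >= len(c.completed) {
		return // ignore indexes outside the current page
	}
	c.completed[i] = true
	for c.next < len(c.completed) && c.completed[c.next] {
		c.lastKey = c.keys[c.next]
		c.next++
	}
}

func main() {
	c := &pageCheckpointer{}
	c.ResetForPage([]string{"a", "b", "c"})
	c.Complete(1)
	fmt.Printf("%q\n", c.lastKey) // prints "": object 0 is still pending
	c.Complete(0)
	fmt.Printf("%q\n", c.lastKey) // prints "b"
	c.ResetForPage([]string{"d", "e"})
	fmt.Printf("%q\n", c.lastKey) // prints "b": the reset did not clear the checkpoint
}
```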

pkg/sources/s3/s3_integration_test.go (outdated, resolved)
@ahrav changed the base branch from main to refactor-s3-progress-tracker November 21, 2024 23:30
ahrav (Collaborator, Author) commented Nov 21, 2024

> This seems pretty straightforward, although I would like to see a test case for (and handling of) the case where an in-progress bucket is ignored while the scan is stopped (described in an inline comment)

done

@ahrav requested a review from rosecodym November 21, 2024 23:54
Base automatically changed from refactor-s3-progress-tracker to main November 22, 2024 17:27
@ahrav merged commit e495661 into main Nov 22, 2024
13 checks passed
@ahrav deleted the s3-progress-tracker branch November 22, 2024 21:33
5 participants