Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust): lazy commit, draft #2601

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

alexwilcoxson-rel
Copy link
Contributor

Description

This is just a draft to see what you all think about this way to achieve commits without fully loading table state. This is useful for append only, low/no concurrency workflows, since we would be unlikely to have a version conflict and need to check for Conflicts. Perhaps further the Conflict checker could lazily load the state if the type of conflict resolution would require it.

Putting this out as a draft as I've talked with @roeap about it and it could help a case that came up on Slack today.

Related Issue(s)

Documentation

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Jun 14, 2024
@alexwilcoxson-rel
Copy link
Contributor Author

@roeap @rtyler @Blajda any thoughts on this approach?

@roeap
Copy link
Collaborator

roeap commented Jul 29, 2024

@alexwilcoxson-rel, sorry for leaving you hanging so long.

Please help refresh my mind :). the use case would be that we want to write into a table where we MIGHT require additional actions? I.e. for append only we could use Snapshot, which does nothing but the P&M query. Since we currently prepare the snapshot before the commit.

The LazySnapshot would load actions as we need them?

I guess my main question - an bear in mind that i need a refresher on our discussion :) - can we serve your use case with Snapshot?

@alexwilcoxson-rel
Copy link
Contributor Author

Thanks @roeap !

In our use case we have the following:

  • A service committing append only to a table. This is the most often vector used for commits and thus this workflow benefits the most.
  • A spark job running on a cron to run optimize and vacuum.
  • At any time new M+P actions may be committed to perform a schema upgrade

Our implementation though is getting more requirements and complexity though.

I worry with Snapshot only we'll miss some edge case where we need to check for conflicts when version already exists.

@alexwilcoxson-rel
Copy link
Contributor Author

after much more digging, the primary issue we're seeing is the amount of memory used during checkpointing not commit

i tested with some of our larger (many small files pre optimize) and those tables use GBs of memory which is actually reported here: #2628

that said we've already done some work on our end to do some pre optimization of files prior to commit to reduce noise in the commit log and that is helping

so therefore i don't think we necessarily have a need to act on this for our use case right now. so i'm happy to close this

also landed this yesterday to help with larger checkpoints: #2717

@roeap
Copy link
Collaborator

roeap commented Aug 7, 2024

yeah, our checkpoint writing needs some updating as well. there is a lot of code we would no longer need since we are now dependent on arrow anyhow.

with the new arrow backend we "might" be able to do a more efficient replay / checkpoint writing, however we will always need a full replay :(.

so therefore i don't think we necessarily have a need to act on this for our use case right now. so i'm happy to close this

if there is no use case right now I would hold off, and just hope we can get our "reagular" handling as efficient as possible... there are some opportunities after all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants