Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Spark] Add FileMetadataMaterializationTracker for Row Tracking Backf…
…ill (delta-io#2926) <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description ### Background on Row Tracking Backfill In a future PR, we will introduce Row Tracking Backfill, a command in Delta Lake that assigns Row IDs and Row Commit Versions to each rows when we enable Row Tracking on an existing table. This is done by committing `addFile` actions to assign the `baseRowId` and the `defaultRowCommitVersion` for every files in the table. Due to the size of the table, doing it in one commit can be very large, unstable and causes a lot of concurrency conflicts, we propose doing it by batches (that is multiple commits, each commit handles a subset of the `addFile` actions of the table). ### Why we need this for Row Tracking Backfill? However, we could still hit stability issues when a single batch is large enough to OOM the driver. This would happen when individual tasks batched together would be huge. Think of tables where each `AddFile` is just 1-2 rows. ### What is the solution? We propose having a global file materialization limit that restricts the number of files that can be materialized at once on the driver, this limit will be added when Row Tracking Backfill is introduced in a future PR. A case we need to consider is what if the task size is more than the materialization limit. In that case we can allow the task to complete as we do not want to break the task boundary (breaks idempotence of Row Tracking Backfill). The `FileMetadataMaterializationTracker` is the component used in the Row Tracking Backfill process to ensure that a single batch is not large enough to OOM the driver. ### Design The driver holds a semaphore that can give out permits equal to the materialization limit of the driver. The driver also holds a overallocation lock that allows only one query to over allocate to complete materializing the task. Each `RowTrackingBackfillCommand` instance will maintain a `FileMetadataMaterializationTracker` that will keep track of how many files were materialized and this will acquire/release the over provisioning semaphore as well. A permit to materialize a file is acquired while iterating the files iterator while creating a task. The permits are released upon failure or when the batch completes executing. <!-- - Describe what this PR changes. - Describe why we need the change. If this PR resolves an issue be sure to include "Resolves #XXX" to correctly link and close the issue upon merge. --> ## How was this patch tested? Added UTs. <!-- If tests were added, say they were added here. Please make sure to test the changes thoroughly including negative and positive cases if possible. If the changes were tested in any way other than unit tests, please clarify how you tested step by step (ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future). If the changes were not tested, please explain why. --> ## Does this PR introduce _any_ user-facing changes? No. <!-- If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Delta Lake versions or within the unreleased branches such as master. If no, write 'No'. -->
- Loading branch information