Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write to the action cache in the execution server #7974

Closed
wants to merge 9 commits into from
Closed

Conversation

vanja-p
Copy link
Contributor

@vanja-p vanja-p commented Nov 27, 2024

This is done when the executor sends a completed action to PublishOperation, and only when the executor sends the ExecutionTask in auxiliary metadata. This allows us to avoid a period of time where both the executor and the app write to the cache.

I didn't include the executor changes in this PR. My plan was to let this roll out, and then send an executor PR that would start ExecutionAuxiliaryMetadata.execution_task and stop writing to the cache. This avoids double writes, but it does still leave a window where we have no writes, if the executor is updated before the app. Let me know what you think.

Comment on lines 1140 to 1144
resultDigest, err := digest.AddInvocationIDToDigest(actionResourceName.GetDigest(), actionResourceName.GetDigestFunction(), md.GetExecutionTask().GetInvocationId())
if err != nil {
return status.UnavailableErrorf("Error uploading action result: %s", err.Error())
}
cacheableResourceName = digest.NewResourceName(resultDigest, actionResourceName.GetInstanceName(), rspb.CacheType_AC, actionResourceName.GetDigestFunction())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these failed ActionResults are no longer used, so we can short-circuit in this case. The UI fetches the response that was stored with cacheExecuteResponse now, which includes the complete response and works for both failed and succeeded actions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sent #7985 to first remove these writes.

@@ -135,6 +135,9 @@ message Execution {
message ExecutionAuxiliaryMetadata {
// Platform overrides set via remote header.
build.bazel.remote.execution.v2.Platform platform_overrides = 1;

// The ExecutionTask that the executor received from the scheduler.
build.bazel.remote.execution.v2.ExecutionTask execution_task = 2;
Copy link
Member

@bduffany bduffany Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just as a note - once we start populating this, we probably want to keep an eye on the prometheus metrics for AC write size and make sure it looks OK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that be the metric on the right of this screenshot?

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I decided it's probably best not to write the auxiliary metadata in cacheExecuteResponse. It's not used and it can be large.

Copy link
Member

@bduffany bduffany Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UI uses the auxiliary metadata (if you were looking for references using a case-sensitive search for AuxiliaryMetadata you'll miss the UI references because they are lowerCamelCase)

Copy link
Member

@bduffany bduffany Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re. metrics, Upload throughput accounts for CAS blobs as well - you could either run a one-off query (Explore) to restrict to the AC label or update that dashboard to split out those charts by AC vs CAS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a good point about the UI. I decided to play it safe and not remove any existing data. Instead, the ExecutionTask will come as a separate auxiliary metadata, which can easily be removed without messing with the other metadata. PTAL.

Now that I don't need the invocation ID, another option is to not use the presence of the ExecutionTask to trigger cache writes from the app. All the necessary data is already available in PublishOperation. One downside of this is we might double write to the cache, especially in the case of self-hosted executors that aren't getting updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the double-write for older executor clients (and during the rollout) seems fine - if we notice it being a problem we could ask people to upgrade and/or gate the logic behind some sort of version check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the version without avoiding double-writes: #8009. Still waiting for Tyler's thoughts on which one he prefers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tyler's OK with the double writes, so ignore this PR in favor of #8009

@vanja-p vanja-p requested a review from bduffany December 3, 2024 20:33
@@ -1007,31 +1008,61 @@ func (s *ExecutionServer) PublishOperation(stream repb.Execution_PublishOperatio
return err
}

if op.GetName() == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change feels a little risky, is it needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could revert this change, but as far as I can tell, there's no way for PublishOperation to do anything useful without the name being an execution ID.

@vanja-p vanja-p marked this pull request as draft December 4, 2024 22:18
@vanja-p vanja-p closed this Dec 6, 2024
@vanja-p vanja-p deleted the vanja-cache branch December 6, 2024 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants