-
Notifications
You must be signed in to change notification settings - Fork 63
Cache eviction of past executions #504
base: master
Are you sure you want to change the base?
Cache eviction of past executions #504
Conversation
Codecov Report
@@ Coverage Diff @@
## master #504 +/- ##
==========================================
- Coverage 60.07% 59.90% -0.18%
==========================================
Files 168 169 +1
Lines 14997 15388 +391
==========================================
+ Hits 9010 9218 +208
- Misses 5196 5383 +187
+ Partials 791 787 -4
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
Allows for fields to be explicity set/updated to nil Signed-off-by: Nick Müller <nmueller@blackshark.ai>
Allows for re-use by cache manager Signed-off-by: Nick Müller <nmueller@blackshark.ai>
Added endpoint for evicting execution cache Added endpoint for evicting task execution cache Signed-off-by: Nick Müller <nmueller@blackshark.ai>
Signed-off-by: Nick Müller <nmueller@blackshark.ai>
…epropeller and flytestdlib Signed-off-by: Nick Müller <nmueller@blackshark.ai>
67e4383
to
650613d
Compare
Rebased onto current |
@MorpheusXAUT thanks for making the PR for this RFC. This PR makes multiple calls to datacatalog services to clear cache. Would be preferable if we could batch these calls by introducing a batch API in datacatalog to do the same. |
Can do if we want to reduce individual API requests. Would you rather keep the current (non-bulk) new endpoint around as well or have me replace it completely with a bulk one?
I was a bit afraid of timeouts as well, especially given large workflows, however as we've originally agreed on propagating results back to the user, that was the first I've come up with. |
Eventually using the bulk one would be what we should encourage the API users to use. If its maintainable we can keep the non-bulk version aswell but have a note in the API definition to use bulk version in most of the cases. Also do you think any reason someone would prefer to use bulk versus non-bulk
With multiple service calls being involved in single API, i am afraid that some client calls could timeout where some others would still be ongoing unless we cleanly handle context cancellation if the parent api call times out. i am assuming this is handled by context passing automatically but just checking what happens in such cases. eg : the api caller wait for 15 secs and timesout whereas the api implementations is still waiting on results from the downstream service may be with higher api timeout(datacatalog in our case for cache cleanup) Also I think we dont have an infra to setup async flyteadmin api tasks and hence what you are suggesting would be using such a solution . Adding @katrogan to see if we ever designed for such usecases |
Since both endpoints would be new, I think we could even remove the (not yet released) non-bulk version if we want to encourage users to use the bulk one anyways.
That's a good point. I believe everything we use should support cancellation properly, however I haven't tested that myself either. The request context is being passed to all relevant parts though, so if they support it, it should happen.
I didn't come across anything similar to what I described so far, but I'm also happy for suggestions on how to best implement that - the task ID/polling idea is just the first thing that came to mind. |
Sounds good @MorpheusXAUT . Its ok to keep both and have it internally use the new bulk api of datacatalog. We can check the behavior for timeouts through probably any golang client(flytectl by calling this new api) and setting context.WithTimeout and checking the behavior of the child rpc call if its getting cancelled too . It will probably leave the clearing of the cache half way but if the bulk api is idempotent then it would just continue from where it left off earlier in the next retry of the api Regarding the design on async flyteadmin api tasks , we can do that as a followup cc : @katrogan |
@pmahindrakar-oss Alright, sounds good to me, will add a bulk endpoint to datacatalog as a start, should hopefully be able to do so by the start of next week. Regarding the async implementation - happy to discuss further on how to best implement this. Is this something we want to absolutely have for the first version, or something we could improve in a future update to make the process more resilient? |
+1 for deferring the async implementation, I think this would be great to have down the road, especially if we can make it generalizable for other async actions like egress events (which are currently lossy) - having this functionality would be broadly useful for flyteadmin. Since the batch call should ideally be retryable in case of failures this seems like a good compromise on performance for now |
@pmahindrakar-oss @katrogan Wanted to check back with you real quick before I rewrite this to use the bulk implementation of datacatalog endpoints. flyteadmin would then continue walking the workflow and build a list of artifacts to:
Do we still want to delete the database models before (while traversing the workflow), or should this be handled in the bulk-operation at the end as well? |
TL;DR
This PR implements the second part of the cache eviction RFC, evicting stored data of a past (task) execution.
A new
CacheService
has been added for interacting with Flyte's cache via flyteadmin.Type
Are all requirements met?
Complete description
The new
CacheService
exposes two endpoints,EvictExecutionCache
andEvictTaskExecutionCache
. These two endpoints allow for the respective (already completed) execution's cached output data to be removed from datacatalog.flyteadmin retrieves the stored
NodeExecutions
and their associatedTaskExecutions
from its database, traversing the executions and evicting the cached data as indicated by the serializedTaskNodeMetadata
.flyteadmin acquires a reservation for the cached data from datacatalog before updating the
NodeExecution
's metadata/closure and removing the artifact data from datacatalog.Both cache eviction endpoints share mostly the same implementation and are developed to be idempotent - evicting the cache of an execution without any stored data still completes successfully. Partial failures are supported, if some task's cache can't be evicted, the rest of the deletion will still continue and an appropriate error is returned to the client. The client is expected to display any encountered errors to the user accordingly and prompt for a retry (already evicted data will be skipped when retrying).
I've tested cache eviction with some basic cached workflows available in the flytesnacks repository, a demo-workflow created in a flytepropeller PR as well as a slightly more complex company-internal workflow. Additionally, I've tried to cover as many aspects as possible with unit tests.
The PR was created as a draft until the flyteidl, flyteplugins, flytepropeller and flytestdlib PRs have been merged and respective versions have been published.
Tracking Issue
flyteorg/flyte#2867
Follow-up issue
NA