Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LoadBalancerCacheManager supports refresh cache, not just expire cache #1215

Open
jizhuozhi opened this issue Mar 14, 2023 · 16 comments
Open
Assignees

Comments

@jizhuozhi
Copy link
Contributor

jizhuozhi commented Mar 14, 2023

Is your feature request related to a problem? Please describe.
Spring Cloud LoadBalancer now only supports expired caching, that is, after a period of time, the cache item will become invalid, and when the same content is requested again, it will re-submit a request to the registry blockly to obtain the service instance. In smooth service there will be spikes.

In addition to causing spikes, this also poses a challenge to high availability. When the cache expires, all cache items are evicted. At this time, there are no related service instances in memory. If the registry goes down, the entire None of the client instances can submit a request to the service provider, and the service is completely unavailable. In a refresh cache, this will have no effect. If the registry goes down, we just don't get the latest data, and slightly older data is better than no data (in fact, the service instance changes very frequently low, the impact of old data is negligible).

Describe the solution you'd like
Based on this, I think it is necessary to provide refresh cache instead of just expire cache

A feasible solution is to support configuring the refresh interval and use DiscoveryClient to grab service instances, and at the same time support configuring the expiration time to avoid using expired data for a long time (exposing the problem)

setCaffeine(Caffeine.newBuilder()
    .initialCapacity(properties.getCapacity())
    .refreshAfterWrite(properties.getTtl().dividedBy(2))
    .expireAfterWrite(properties.getTtl())
    .softValues());
setCacheLoader((key) -> discoveryClient.getInstances((String) key));
@xinlunanxinlunan
Copy link

Hello, our calling relationship A->B, B has two instances (B1,B2). Set the ttl to 30s, and continue to send requests to A. If B2 is offline, A's cache is not updated. As A result, an error occurs when A calls B2. In this case, traffic must be disabled for the cache of A to be updated. I think this is a very big bug that needs to be updated or not.

@jizhuozhi
Copy link
Contributor Author

jizhuozhi commented Jul 12, 2023

Hello, @xinlunanxinlunan.

First of all, this is not a bug, what you need is a health check before invoking remote service instance.

Secondly, refreshing means that if there is a new list, we will replace it, but it is irresponsible to clear it if the new list cannot be obtained.

In your scenario, if 50% of the instances are unavailable, then use the history list when the new service list cannot be obtained, and the availability rate will only drop to 50%, but if all the instances are directly expired, then the service cannot be obtained, the availability rate is 0%.

Finally, this Issue is not for discussing when to update, you'd better create a new issue, I will not continue to discuss this topic in this.

@sobelek
Copy link

sobelek commented Sep 27, 2023

@jizhuozhi @OlgaMaciaszek
Hey. I have just stumbled upon this issue. K8s API got overloaded and started returning 429. In the same time cache expired and got cleared so for some time we were left with no instances for loadbalancer (Even tho there were healthy instances).

Is there any known walk-around for this?

I was thinking about using HealthCheck loadbalancer with:

  • spring.cloud.loadbalancer.health-check.refetch-instances-interval: true
  • spring.cloud.loadbalancer.health-check.refetch-instances-interval: 30s
  • spring.cloud.loadbalancer.health-check.repeat-health-check: false
  • spring.cloud.loadbalancer.health-check.update-results-list: false

This seems like that should behave just like 30s cache but it would not clear cache after refetch-interval. WDYT?

@jizhuozhi
Copy link
Contributor Author

Hello, @sobelek.

What your need in this scene is not scheduling HealthCheck api, but as same as the core of this Proposal

If the registry goes down, the entire None of the client instances can submit a request to the service provider, and the service is completely unavailable. In a refresh cache, this will have no effect.

And scheduling HealthCheck api could be implemented via composing Caching and HealthCheck suppliers.

@sobelek
Copy link

sobelek commented Oct 4, 2023

Hey.
I know I would prefer refreshWrite cache to be available but as it is not I am looking into other possible solution to achieve the same result

@jizhuozhi
Copy link
Contributor Author

jizhuozhi commented Oct 4, 2023

Hello, @sobelek . Thanks for your reply.

We may need to align a boundary issue. Generally speaking, health checks check the health status of an instance when given one from instance list, but it has no responsibility to change the instance list. If the logic of refetching instances is incorporated into health checks, the boundary between service discovery and health checks will be broken, and the responsibilities of health checks will no longer be single.

Of course, I tried to abstract your solution. In fact, what we need is still an event source that triggers an event regularly to notify the loadbalancer that the instance list needs to be refreshed, so our solution is the same.

@sobelek
Copy link

sobelek commented Oct 4, 2023

Fully agree with a boundary between discovery and loadbalancing. Thanks for clarifying this.

Event driven discovery or refreshing cache would both be great and would resolve to original issue👍

@jizhuozhi
Copy link
Contributor Author

jizhuozhi commented Oct 4, 2023

Hello, @sobelek.

As an aside, and in my experience, usually only application layer protocols that support multiplexing (such as Apache Dubbo) will perform health checks on the client side, because only a simple ping-pong request is required. But for HTTP/1 (Spring MVC or Webflux), because each request is a new TCP connection, when the number of instances is very large, the health check may issue a large number of on-the-fly requests and leave lots of CLOSE_WAIT connections after the request ends. As a result, normal requests cannot be processed. The same is true for the server side.

@sobelek
Copy link

sobelek commented Oct 9, 2023

@jizhuozhi

Yea, true that. We are actually using k8s so healthchecking is done by k8s out of the box.

Back to original issue. Do you have any idea if you guys are going to work on refreshAfter cache for spring loadbalancer?

@jizhuozhi
Copy link
Contributor Author

Hello, @sobelek . There is no out box solution for all CacheManager, but Caffeine provided refreshAfterWrite out of box. So you could custom a new CacheManager based on CaffeineBasedLoadBalancerCacheManager with your service discovery client.

@sobelek
Copy link

sobelek commented Oct 10, 2023

Thanks, I am trying to do this. Would you like me to also make a PR here when finished?

@jizhuozhi
Copy link
Contributor Author

Thanks, I am trying to do this. Would you like me to also make a PR here when finished?

Of course, welcome. This Issue has been assigned to @OlgaMaciaszek. Do you have any good suggestions @OlgaMaciaszek ?

@OlgaMaciaszek
Copy link
Collaborator

Thanks, @jizhuozhi, makes sense; however, I'd like to see a separate property for the refresh interval (can be the value you've proposed as default, but needs to be customisable).
@sobelek as indicated in the docs, the HealthCheckServiceInstanceListSupplier should not be used together with the CachingServiceInstanceListSupplier as it has it's own caching mechanism. Please feel free to provide a PR for this feature.

@OlgaMaciaszek
Copy link
Collaborator

@sobelek Please confirm if you will be submitting this PR.

@sobelek
Copy link

sobelek commented Nov 3, 2023

@OlgaMaciaszek
I would love to. Don't have much time currently but will try to do it in coming weeks.
I am not sure yet about how to set a loader for cache but will do PoC so we can discuss it

@OlgaMaciaszek
Copy link
Collaborator

Ok, please do and ping me here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants