Fix experimental_retry #6338

Open · wants to merge 5 commits into base `dev` · Changes from 3 commits
6 changes: 6 additions & 0 deletions .changesets/fix_bnjjj_fix_retry_metric.md
@@ -0,0 +1,6 @@
### Fix and test experimental_retry ([PR #6338](https://github.com/apollographql/router/pull/6338))

Fixes the behavior of `experimental_retry` and ensures that both the feature and its metrics work.
An entry is also added to the request context; it will be useful later for implementing a new standard attribute and selector for advanced telemetry.

By [@bnjjj](https://github.com/bnjjj) in https://github.com/apollographql/router/pull/6338
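For context, the feature being fixed is configured under traffic shaping. The following is a sketch only: the field names are inferred from the `RetryPolicy::new` signature in this PR (`duration`/TTL, `min_per_sec`, `retry_percent`, `retry_mutations`) and should be checked against the router's documented configuration schema.

```yaml
# Sketch only — option names inferred from RetryPolicy::new in this PR,
# not taken from the router's documented configuration reference.
traffic_shaping:
  all:
    experimental_retry:
      min_per_sec: 10        # minimum number of retries allowed per second
      retry_percent: 0.2     # extra retry budget as a fraction of recent requests
      retry_mutations: false # never replay mutations by default
      ttl: 10s               # how long successes count toward the retry budget
```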
20 changes: 20 additions & 0 deletions apollo-router/src/plugins/telemetry/config_new/attributes.rs
@@ -1131,6 +1131,26 @@ impl Selectors for SubgraphAttributes {
}
}

/// Key used in the context to store the number of retries for a subgraph HTTP request
pub(crate) struct SubgraphRequestResendCountKey<'a> {
subgraph_req: &'a subgraph::Request,
}

impl<'a> SubgraphRequestResendCountKey<'a> {
pub(crate) fn new(subgraph_req: &'a subgraph::Request) -> Self {
Self { subgraph_req }
}
}

impl<'a> From<SubgraphRequestResendCountKey<'a>> for String {
fn from(value: SubgraphRequestResendCountKey) -> Self {
format!(
"apollo::telemetry::http_request_resend_count_{}",
value.subgraph_req.id
)
}
}

#[cfg(test)]
mod test {
use std::net::SocketAddr;
1 change: 0 additions & 1 deletion apollo-router/src/plugins/traffic_shaping/mod.rs
@@ -396,7 +396,6 @@ impl TrafficShaping {
config.min_per_sec,
config.retry_percent,
config.retry_mutations,
name.to_string(),
);
tower::retry::RetryLayer::new(retry_policy)
});
202 changes: 185 additions & 17 deletions apollo-router/src/plugins/traffic_shaping/retry.rs
@@ -5,14 +5,14 @@ use std::time::Duration;
use tower::retry::budget::Budget;
use tower::retry::Policy;

use crate::plugins::telemetry::config_new::attributes::SubgraphRequestResendCountKey;
use crate::query_planner::OperationKind;
use crate::services::subgraph;

#[derive(Clone, Default)]
pub(crate) struct RetryPolicy {
budget: Arc<Budget>,
retry_mutations: bool,
subgraph_name: String,
}

impl RetryPolicy {
@@ -21,7 +21,6 @@ impl RetryPolicy {
min_per_sec: Option<u32>,
retry_percent: Option<f32>,
retry_mutations: Option<bool>,
subgraph_name: String,
) -> Self {
Self {
budget: Arc::new(Budget::new(
@@ -30,21 +29,57 @@
retry_percent.unwrap_or(0.2),
)),
retry_mutations: retry_mutations.unwrap_or(false),
subgraph_name,
}
}
}

impl<Res, E> Policy<subgraph::Request, Res, E> for RetryPolicy {
impl<E> Policy<subgraph::Request, subgraph::Response, E> for RetryPolicy {
type Future = future::Ready<Self>;

fn retry(&self, req: &subgraph::Request, result: Result<&Res, &E>) -> Option<Self::Future> {
fn retry(
&self,
req: &subgraph::Request,
result: Result<&subgraph::Response, &E>,
) -> Option<Self::Future> {
let subgraph_name = req.subgraph_name.clone().unwrap_or_default();
match result {
Ok(_) => {
// Treat all `Response`s as success,
// so deposit budget and don't retry...
self.budget.deposit();
None
Ok(resp) => {
if resp.response.status() >= http::StatusCode::BAD_REQUEST {
**Contributor:** Is this correct? I mean, what if the subgraph does something "weird" like returning a redirect? i.e., should this be `if resp.response.status() != http::StatusCode::OK`?

**Contributor (author):** I recently saw users treating a 302 status code as a correct status code, and they were doing things in Rhai based on it. So I prefer to be conservative and use the 400 status code as a threshold.

**Contributor:** It seems weird to me. If we go with this, then it means that we are encoding "weird" behaviour into the router, and that might not be what some customers want. My preference is to do the expected thing, rather than produce something which fits into a "weird" behaviour of a customer. See what other people think, since I may really be in the minority here.

**Member:** If a subgraph returns a redirect, I would not expect it to be retried.

**Contributor (author):** I agree, but I also agree with @goto-bus-stop — why would we retry on a 3xx?

**Contributor:** To be honest, I find the concept of the retry module a bit weird. Maybe the best thing to do is to define exactly which codes we will retry on and document that? @goto-bus-stop says definitely on a 429. I'd probably also want to retry a 503. Apart from those two, I probably wouldn't want to retry.

**Member:** I like starting with a small selection of codes. I'm second-guessing 429 because we don't support `Retry-After` in this implementation, and maybe that's something that clients should be doing rather than us? If we just retry instantly then there is a good chance that we get a 429 again.

**Contributor:** *(comment not captured)*

**Member (@goto-bus-stop, Nov 28, 2024):** I don't think it says that. I read that the user agent should wait `Retry-After` seconds before following the redirect, not that it should retry the original request: "When sent with any 3xx (Redirection) response, Retry-After indicates the minimum time that the user agent is asked to wait before issuing the redirected request."

**Contributor:** Would we want to do that? Retry the redirected request?

if req.operation_kind == OperationKind::Mutation && !self.retry_mutations {
return None;
}

let withdrew = self.budget.withdraw();
if withdrew.is_err() {
u64_counter!(
"apollo_router_http_request_retry_total",
"Number of retry for an http request to a subgraph",
1u64,
status = "aborted",
subgraph = subgraph_name
);

return None;
}

let _ = req
.context
.upsert::<_, usize>(SubgraphRequestResendCountKey::new(req), |val| val + 1);

u64_counter!(
"apollo_router_http_request_retry_total",
"Number of retry for an http request to a subgraph",
1u64,
subgraph = subgraph_name
);

Some(future::ready(self.clone()))
} else {
// Treat all `Response`s as success,
// so deposit budget and don't retry...
self.budget.deposit();
None
}
}
Err(_e) => {
if req.operation_kind == OperationKind::Mutation && !self.retry_mutations {
@@ -53,20 +88,27 @@ impl<Res, E> Policy<subgraph::Request, Res, E> for RetryPolicy {

let withdrew = self.budget.withdraw();
if withdrew.is_err() {
tracing::info!(
monotonic_counter.apollo_router_http_request_retry_total = 1u64,
u64_counter!(
"apollo_router_http_request_retry_total",
"Number of retry for an http request to a subgraph",
1u64,
status = "aborted",
subgraph = %self.subgraph_name,
subgraph = subgraph_name
);

return None;
}

tracing::info!(
monotonic_counter.apollo_router_http_request_retry_total = 1u64,
subgraph = %self.subgraph_name,
u64_counter!(
"apollo_router_http_request_retry_total",
"Number of retry for an http request to a subgraph",
1u64,
subgraph = subgraph_name
);

let _ = req
.context
.upsert::<_, usize>(SubgraphRequestResendCountKey::new(req), |val| val + 1);

Some(future::ready(self.clone()))
}
}
@@ -76,3 +118,129 @@ impl<Res, E> Policy<subgraph::Request, Res, E> for RetryPolicy {
Some(req.clone())
}
}

#[cfg(test)]
mod tests {
use http::StatusCode;
use tower::BoxError;

use super::*;
use crate::error::FetchError;
use crate::graphql;
use crate::http_ext;
use crate::metrics::FutureMetricsExt;

#[tokio::test]
async fn test_retry_with_error() {
async {
let retry = RetryPolicy::new(
Some(Duration::from_secs(10)),
Some(10),
Some(0.2),
Some(false),
);

let subgraph_req = subgraph::Request::fake_builder()
.subgraph_name("my_subgraph_name_error")
.subgraph_request(
http_ext::Request::fake_builder()
.header("test", "my_value_set")
.body(
graphql::Request::fake_builder()
.query(String::from("query { test }"))
.build(),
)
.build()
.unwrap(),
)
.build();

assert!(retry
.retry(
&subgraph_req,
Err(&Box::new(FetchError::SubrequestHttpError {
status_code: None,
service: String::from("my_subgraph_name_error"),
reason: String::from("cannot contact the subgraph"),
}))
)
.is_some());

assert!(retry
.retry(
&subgraph_req,
Err(&Box::new(FetchError::SubrequestHttpError {
status_code: None,
service: String::from("my_subgraph_name_error"),
reason: String::from("cannot contact the subgraph"),
}))
)
.is_some());

assert_counter!(
"apollo_router_http_request_retry_total",
2,
"subgraph" = "my_subgraph_name_error"
);
}
.with_metrics()
.await;
}

#[tokio::test]
async fn test_retry_with_http_status_code() {
async {
let retry = RetryPolicy::new(
Some(Duration::from_secs(10)),
Some(10),
Some(0.2),
Some(false),
);

let subgraph_req = subgraph::Request::fake_builder()
.subgraph_name("my_subgraph_name_error")
.subgraph_request(
http_ext::Request::fake_builder()
.header("test", "my_value_set")
.body(
graphql::Request::fake_builder()
.query(String::from("query { test }"))
.build(),
)
.build()
.unwrap(),
)
.build();

assert!(retry
.retry(
&subgraph_req,
Ok::<&subgraph::Response, &BoxError>(
&subgraph::Response::fake_builder()
.status_code(StatusCode::BAD_REQUEST)
.build()
)
)
.is_some());

assert!(retry
.retry(
&subgraph_req,
Ok::<&subgraph::Response, &BoxError>(
&subgraph::Response::fake_builder()
.status_code(StatusCode::BAD_REQUEST)
.build()
)
)
.is_some());
**Contributor:** Add a test for a redirect: `MOVED_PERMANENTLY`.


assert_counter!(
"apollo_router_http_request_retry_total",
2,
"subgraph" = "my_subgraph_name_error"
);
}
.with_metrics()
.await;
}
}