Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(anomaly-detection) Return meaningful errors on failure of save endpoint #1149

Merged
merged 2 commits into from
Sep 12, 2024

Conversation

ram-senth
Copy link
Member

No description provided.

@ram-senth ram-senth force-pushed the anomaly-detection/ram/return-meaningful-errors branch from 55c2bd2 to dc6ab89 Compare September 11, 2024 16:36
anomaly_algo_data={"window_size": anomalies.window_size},
)
return StoreDataResponse(success=True)
except Exception as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern is, reasonable, but my biggest concern here is having the Exception here being too broad.
My question: what does success=False signal to a client? I

f it's supposed to specifically mean that the save failed due to say, integrity error, it'd be wiser to catch the database exception (https://docs.sqlalchemy.org/en/20/core/exceptions.html#sqlalchemy.exc.SQLAlchemyError).

Here are some counter points for why letting errors pass through sometimes is important:

  1. Unit tests may silently pass that are failing if we catch exceptions too broadly. The most common way that a unit tests fails is that an exception is not handled during its execution. Are we certain that every unit test call site will check success=True in every case? Letting some exceptions pass through is healthy for simpler and robust testing.

  2. Related, but what if add an additional line of logic before the return on line 280 that always fails? In that case, success=False may look like a temporary database error, when in fact it would be a broad logical failure.

  3. Retry semantics. It is important that network access to services understand atleast somewhat when a failure is safe to retry. There are multiple ways you can encode the retry semantic -- for instance, maybe success=False means, do not retry at all, and maybe success=False means feel free to retry. But returning 500 is useful because it definitely implies to clients to at the least back off. If a totally unexpected issue is occuring in the system and we throw 500s because the endpoint isn't gracefully handling an issue, it is wise for clients to always back off from retrying the service.

Basically, I think that catching all exceptions and treating them the same way hides the fact that 'errors' come in different classes and warrant different responses from the client. I think it would be easier to narrow the exception class and be explicit about what success=False means for the caller (which part failed? is it safe to retry gain? etc, etc)

Copy link
Member Author

@ram-senth ram-senth Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good points. If we want to truly follow REST semantics then our API should use 4xx and 5xx response codes meaningfully. Right now all errors result in 5xx. Instead we should split our errors into two classes - user error and system error. User error means use can retry after fixing the error while 5xx means do not retry.

If I create two exception classes ClientException and ServerException, is it easy to return 400 for the first one and 500 for the second one, along with the error message?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's totally fair. And in my mind it doesn't matter to me about REST per say. I think my main concern is simply that we differentiate the error classes between "somewhat expected db failure" and "unexpected logical failure" and however we want to do that is indifferent to me.

I think it's ok to simply encode it this way:

  1. Return success=False when it's a "client exception" or "data exception", meaning the client cannot proceed with this request (ala 400)
  2. Simply let other exceptions pass through. They will be reported to sentry thanks to our flask integration (no need to add that explicitly) and turned into a 500 (unexpected failure), which clients should back off from and maybe retry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case the only change would be merely to narrow the exception clause a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But just to be totally 100% honest, it is 100% ok for you to go with your gut and fuck around and find out. If you feel strongly about any particular path, I am ok with proceeding that way, too. This is my best effort perspective, but you know your logic better than me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with returning False for client errors (aka 400). Will update this PR. Thanks.

@ram-senth ram-senth force-pushed the anomaly-detection/ram/return-meaningful-errors branch from 3017026 to bbf7461 Compare September 11, 2024 19:14
@ram-senth ram-senth merged commit dfb0159 into main Sep 12, 2024
10 of 11 checks passed
@ram-senth ram-senth deleted the anomaly-detection/ram/return-meaningful-errors branch September 12, 2024 00:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants