chore(anomaly-detection) Return meaningful errors on failure of save endpoint #1149

ram-senth · 2024-09-11T16:33:34Z

No description provided.

…endpoint

corps · 2024-09-11T17:09:59Z

src/seer/anomaly_detection/anomaly_detection.py

+ anomaly_algo_data={"window_size": anomalies.window_size},
+ )
+ return StoreDataResponse(success=True)
+ except Exception as e:


This pattern is, reasonable, but my biggest concern here is having the Exception here being too broad.
My question: what does success=False signal to a client? I

f it's supposed to specifically mean that the save failed due to say, integrity error, it'd be wiser to catch the database exception (https://docs.sqlalchemy.org/en/20/core/exceptions.html#sqlalchemy.exc.SQLAlchemyError).

Here are some counter points for why letting errors pass through sometimes is important:

Unit tests may silently pass that are failing if we catch exceptions too broadly. The most common way that a unit tests fails is that an exception is not handled during its execution. Are we certain that every unit test call site will check success=True in every case? Letting some exceptions pass through is healthy for simpler and robust testing.

Related, but what if add an additional line of logic before the return on line 280 that always fails? In that case, success=False may look like a temporary database error, when in fact it would be a broad logical failure.

Retry semantics. It is important that network access to services understand atleast somewhat when a failure is safe to retry. There are multiple ways you can encode the retry semantic -- for instance, maybe success=False means, do not retry at all, and maybe success=False means feel free to retry. But returning 500 is useful because it definitely implies to clients to at the least back off. If a totally unexpected issue is occuring in the system and we throw 500s because the endpoint isn't gracefully handling an issue, it is wise for clients to always back off from retrying the service.

Basically, I think that catching all exceptions and treating them the same way hides the fact that 'errors' come in different classes and warrant different responses from the client. I think it would be easier to narrow the exception class and be explicit about what success=False means for the caller (which part failed? is it safe to retry gain? etc, etc)

All good points. If we want to truly follow REST semantics then our API should use 4xx and 5xx response codes meaningfully. Right now all errors result in 5xx. Instead we should split our errors into two classes - user error and system error. User error means use can retry after fixing the error while 5xx means do not retry.

If I create two exception classes ClientException and ServerException, is it easy to return 400 for the first one and 500 for the second one, along with the error message?

That's totally fair. And in my mind it doesn't matter to me about REST per say. I think my main concern is simply that we differentiate the error classes between "somewhat expected db failure" and "unexpected logical failure" and however we want to do that is indifferent to me.

I think it's ok to simply encode it this way:

Return success=False when it's a "client exception" or "data exception", meaning the client cannot proceed with this request (ala 400)

Simply let other exceptions pass through. They will be reported to sentry thanks to our flask integration (no need to add that explicitly) and turned into a 500 (unexpected failure), which clients should back off from and maybe retry.

In this case the only change would be merely to narrow the exception clause a bit.

But just to be totally 100% honest, it is 100% ok for you to go with your gut and fuck around and find out. If you feel strongly about any particular path, I am ok with proceeding that way, too. This is my best effort perspective, but you know your logic better than me.

I'm fine with returning False for client errors (aka 400). Will update this PR. Thanks.

…rors

chore(anomaly-detection) Return meaningful errors on failure of save …

dc6ab89

…endpoint

ram-senth force-pushed the anomaly-detection/ram/return-meaningful-errors branch from 55c2bd2 to dc6ab89 Compare September 11, 2024 16:36

corps reviewed Sep 11, 2024

View reviewed changes

chore(anomaly-detection): Return meaningful errors for client side er…

bbf7461

…rors

ram-senth force-pushed the anomaly-detection/ram/return-meaningful-errors branch from 3017026 to bbf7461 Compare September 11, 2024 19:14

corps approved these changes Sep 11, 2024

View reviewed changes

ram-senth merged commit dfb0159 into main Sep 12, 2024
10 of 11 checks passed

ram-senth deleted the anomaly-detection/ram/return-meaningful-errors branch September 12, 2024 00:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(anomaly-detection) Return meaningful errors on failure of save endpoint #1149

chore(anomaly-detection) Return meaningful errors on failure of save endpoint #1149

ram-senth commented Sep 11, 2024

corps Sep 11, 2024

ram-senth Sep 11, 2024 •

edited

Loading

corps Sep 11, 2024

corps Sep 11, 2024

corps Sep 11, 2024

ram-senth Sep 11, 2024

chore(anomaly-detection) Return meaningful errors on failure of save endpoint #1149

chore(anomaly-detection) Return meaningful errors on failure of save endpoint #1149

Conversation

ram-senth commented Sep 11, 2024

corps Sep 11, 2024

Choose a reason for hiding this comment

ram-senth Sep 11, 2024 • edited Loading

Choose a reason for hiding this comment

corps Sep 11, 2024

Choose a reason for hiding this comment

corps Sep 11, 2024

Choose a reason for hiding this comment

corps Sep 11, 2024

Choose a reason for hiding this comment

ram-senth Sep 11, 2024

Choose a reason for hiding this comment

ram-senth Sep 11, 2024 •

edited

Loading