-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(anomaly-detection) Return meaningful errors on failure of save endpoint #1149
chore(anomaly-detection) Return meaningful errors on failure of save endpoint #1149
Conversation
55c2bd2
to
dc6ab89
Compare
anomaly_algo_data={"window_size": anomalies.window_size}, | ||
) | ||
return StoreDataResponse(success=True) | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern is, reasonable, but my biggest concern here is having the Exception
here being too broad.
My question: what does success=False
signal to a client? I
f it's supposed to specifically mean that the save failed due to say, integrity error, it'd be wiser to catch the database exception (https://docs.sqlalchemy.org/en/20/core/exceptions.html#sqlalchemy.exc.SQLAlchemyError).
Here are some counter points for why letting errors pass through sometimes is important:
-
Unit tests may silently pass that are failing if we catch exceptions too broadly. The most common way that a unit tests fails is that an exception is not handled during its execution. Are we certain that every unit test call site will check success=True in every case? Letting some exceptions pass through is healthy for simpler and robust testing.
-
Related, but what if add an additional line of logic before the return on line 280 that always fails? In that case,
success=False
may look like a temporary database error, when in fact it would be a broad logical failure. -
Retry semantics. It is important that network access to services understand atleast somewhat when a failure is safe to retry. There are multiple ways you can encode the retry semantic -- for instance, maybe success=False means, do not retry at all, and maybe success=False means feel free to retry. But returning 500 is useful because it definitely implies to clients to at the least back off. If a totally unexpected issue is occuring in the system and we throw 500s because the endpoint isn't gracefully handling an issue, it is wise for clients to always back off from retrying the service.
Basically, I think that catching all exceptions and treating them the same way hides the fact that 'errors' come in different classes and warrant different responses from the client. I think it would be easier to narrow the exception class and be explicit about what success=False
means for the caller (which part failed? is it safe to retry gain? etc, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good points. If we want to truly follow REST semantics then our API should use 4xx and 5xx response codes meaningfully. Right now all errors result in 5xx. Instead we should split our errors into two classes - user error and system error. User error means use can retry after fixing the error while 5xx means do not retry.
If I create two exception classes ClientException
and ServerException
, is it easy to return 400 for the first one and 500 for the second one, along with the error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's totally fair. And in my mind it doesn't matter to me about REST per say. I think my main concern is simply that we differentiate the error classes between "somewhat expected db failure" and "unexpected logical failure" and however we want to do that is indifferent to me.
I think it's ok to simply encode it this way:
- Return success=False when it's a "client exception" or "data exception", meaning the client cannot proceed with this request (ala 400)
- Simply let other exceptions pass through. They will be reported to sentry thanks to our flask integration (no need to add that explicitly) and turned into a 500 (unexpected failure), which clients should back off from and maybe retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case the only change would be merely to narrow the exception clause a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But just to be totally 100% honest, it is 100% ok for you to go with your gut and fuck around and find out. If you feel strongly about any particular path, I am ok with proceeding that way, too. This is my best effort perspective, but you know your logic better than me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with returning False
for client errors (aka 400). Will update this PR. Thanks.
3017026
to
bbf7461
Compare
No description provided.