Fix race condition during ACME order finalization #4562

Merged
edewata merged 2 commits into dogtagpki:master on Sep 14, 2023

Conversation

edewata
Contributor

@edewata edewata commented Sep 12, 2023

Previously, when the authorization for an order completed, the ACMEChallengeProcessor would update the authorization status and then the order status, after which the client would call the ACMEFinalizeOrderService to finalize the order.

However, since these updates are not atomic, there is a risk that the client will detect the new authorization status and immediately call the ACMEFinalizeOrderService before the order status can be updated by ACMEChallengeProcessor. Since both ACMEFinalizeOrderService and ACMEChallengeProcessor are trying to update the same order at the same time, the order can end up with a wrong status.

To avoid the problem, the code that updates the order status has been moved into ACMEFinalizeOrderService, so only one thread can update the order at a time.

Member

@fmarco76 fmarco76 left a comment

LGTM

Contributor

@ckelleyRH ckelleyRH left a comment

LGTM

@frasertweedale
Contributor

I am not sure this is the correct approach. Stand by for additional detail.

@frasertweedale
Contributor

frasertweedale commented Sep 12, 2023

From https://datatracker.ietf.org/doc/html/rfc8555#section-7.1.6 (see diagram reproduced below) it is clear that an Order must move to status ready when all Authz are valid. It should not remain in status pending.


    pending --------------+
       |                  |
       | All authz        |
       | "valid"          |
       V                  |
     ready ---------------+
       |                  |
       | Receive          |
       | finalize         |
       | request          |
       V                  |
   processing ------------+
       |                  |
       | Certificate      | Error or
       | issued           | Authorization failure
       V                  V
     valid             invalid

                    State Transitions for Order Objects
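
As a reading aid, the diagram can be written down as a small transition table. This is only a sketch of the RFC diagram, not code from the PKI source tree; the `OrderStatus` enum and its method names are made up for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

/** Order states from the RFC 8555 section 7.1.6 diagram (illustrative names). */
enum OrderStatus {
    PENDING, READY, PROCESSING, VALID, INVALID;

    /** States reachable in one step, following the arrows in the diagram. */
    Set<OrderStatus> successors() {
        switch (this) {
            case PENDING:
                return EnumSet.of(READY, INVALID);      // all authz "valid", or failure
            case READY:
                return EnumSet.of(PROCESSING, INVALID); // finalize request received, or failure
            case PROCESSING:
                return EnumSet.of(VALID, INVALID);      // certificate issued, or error
            default:
                return EnumSet.noneOf(OrderStatus.class); // VALID and INVALID are terminal
        }
    }

    /** True if the diagram has an arrow from this state to the given one. */
    boolean canMoveTo(OrderStatus next) {
        return successors().contains(next);
    }
}

class OrderStatusDemo {
    public static void main(String[] args) {
        System.out.println(OrderStatus.PENDING.canMoveTo(OrderStatus.READY));      // true
        System.out.println(OrderStatus.PENDING.canMoveTo(OrderStatus.PROCESSING)); // false
    }
}
```

Note that there is no arrow from pending straight to processing, which is the transition a finalize request against a pending order would need.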

From https://datatracker.ietf.org/doc/html/rfc8555#section-7.4:

   A request to finalize an order will result in error if the order is
   not in the "ready" state.  In such cases, the server MUST return a
   403 (Forbidden) error with a problem document of type
   "orderNotReady".  

This patch changes the behaviour such that an order would be successfully finalized
when its status is pending. Indeed, the status must be pending. This is non-conforming.
Operationally it seems OK if, and only if, the client proceeds to finalization when all authz are
valid, regardless of the status of the corresponding Order object.

Unfortunately, the description of expected client behaviour is vague:

   Once the client believes it has fulfilled the server's requirements,
   it should send a POST request to the order resource's finalize URL.

In particular, what does "believes it has fulfilled the server's requirements" mean?
One client could interpret this to mean "all authzs valid", and ignore the fact that the order is still pending
(this behaviour would be compatible with the proposed change). Another might ensure that the Order status
is ready before attempting to finalize it (this would not be compatible). It's hard to predict what a client
implementation would do. Both of these behaviours seem to conform to the RFC. So this change could
cause compatibility issues.

How to proceed? First I have a question: was this PR prompted by an observed issue in real world use or in tests? If so, can you provide the details, including what was the client implementation?

@frasertweedale
Contributor

Thinking more about the premise of the change (from PR description):

Since both ACMEFinalizeOrderService and ACMEChallengeProcessor are trying to update the same order at the same time, the order can end up with a wrong status.

I don't think this can happen (based on the existing code, not this PR). ACMEFinalizeOrderService has to see that the order is ready or it will fail. So the updates are serialised and the order cannot end up with the wrong status. However, the lack of a proper error response when the order status is not ready could cause problems (does the current code return a 500 in this scenario, due to throw Exception(...))?
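
To make the "proper error response" concrete, here is a minimal sketch of a guard that maps a non-ready order to the 403 problem document required by RFC 8555 section 7.4. It is not the actual ACMEFinalizeOrderService code: `FinalizeResponse` and `FinalizeGuard` are hypothetical names, and the JSON is assembled by hand only to keep the example dependency-free.

```java
/** Hypothetical response holder; a real service would use its REST framework's response type. */
final class FinalizeResponse {
    final int status;
    final String body;

    FinalizeResponse(int status, String body) {
        this.status = status;
        this.body = body;
    }
}

final class FinalizeGuard {
    static final String ORDER_NOT_READY = "urn:ietf:params:acme:error:orderNotReady";

    /**
     * Returns a 403 problem document when the order is not "ready",
     * or null when finalization may proceed.
     */
    static FinalizeResponse check(String orderID, String orderStatus) {
        if ("ready".equals(orderStatus)) {
            return null;
        }
        // Assumption: orderID is alphanumeric and needs no JSON escaping.
        String body = "{\"type\":\"" + ORDER_NOT_READY + "\","
                + "\"detail\":\"Order not ready: " + orderID + "\"}";
        return new FinalizeResponse(403, body);
    }
}
```

With a guard like this the not-ready case becomes a client error with a machine-readable type instead of an unhandled exception.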

@edewata
Contributor Author

edewata commented Sep 12, 2023

@fmarco76 @ckelleyRH @frasertweedale Thanks for looking!

These race condition issues were observed when I was testing ACME with minConns=0 (to allow PKI server to start when DS is down). They can be reproduced consistently using certbot. Like you said, it looks like certbot will only wait for the authz to become valid (instead of for the order to become ready) to initiate the order finalization. Let me create a patch to demonstrate the problem. I might also add more compatibility tests with different clients.

@edewata
Contributor Author

edewata commented Sep 12, 2023

Here's a patch that demonstrates the problem: edewata@ee43b23

          docker exec pki pki-server acme-database-mod \
              ... \
              -D minConns=0

As mentioned earlier, certbot seems to check only the authz status and ignore the order status before finalizing the order:
https://github.com/edewata/pki/actions/runs/6162315047/job/16723890467#step:28:128

Sending POST request to http://pki.example.com:8080/acme/rest/authz/7sptvS24Cm:

https://github.com/edewata/pki/actions/runs/6162315047/job/16723890467#step:28:145

{"status":"valid","expires":"2023-09-12T17:21:33Z","identifier":{"type":"dns","value":"client.example.com"},"challenges":[{"type":"http-01","url":"http://pki.example.com:8080/acme/rest/chall/e7W59c7DUO","token":"h9BRp6LHMXsgQc30f5gvBg","status":"valid"}]}

https://github.com/edewata/pki/actions/runs/6162315047/job/16723890467#step:28:154

Sending POST request to http://pki.example.com:8080/acme/rest/order/ROl0zDqknZ/finalize:

Then ACMEFinalizeOrderService failed because the order was not ready yet even though the authz was valid:
https://github.com/edewata/pki/actions/runs/6162315047/job/16723890467#step:49:345

2023-09-12 16:51:38 [http-nio-8080-exec-12] INFO: LDAPDatabase: Searching acmeOrderId=ROl0zDqknZ,ou=orders,dc=acme,dc=pki,dc=example,dc=com
2023-09-12 16:51:38 [Thread-8] INFO: Valid authorization: 7sptvS24Cm
2023-09-12 16:51:38 [http-nio-8080-exec-12] INFO: PKISocketFactory: Initializing PKISocketFactory
2023-09-12 16:51:38 [Thread-8] INFO: Order ROl0zDqknZ is ready
2023-09-12 16:51:38 [Thread-8] INFO: Valid order: ROl0zDqknZ
2023-09-12 16:51:38 [Thread-8] INFO: LDAPDatabase: Modifying acmeOrderId=ROl0zDqknZ,ou=orders,dc=acme,dc=pki,dc=example,dc=com
2023-09-12 16:51:38 [http-nio-8080-exec-12] INFO: PKISocketFactory: Creating socket for ds.example.com:3389
2023-09-12 16:51:38 [Thread-8] INFO: PKISocketFactory: Initializing PKISocketFactory
2023-09-12 16:51:38 [Thread-8] INFO: PKISocketFactory: Creating socket for ds.example.com:3389
2023-09-12 16:51:38 [http-nio-8080-exec-12] INFO: Valid order: ROl0zDqknZ
2023-09-12 16:51:38 [http-nio-8080-exec-12] SEVERE: Servlet.service() for servlet [org.dogtagpki.acme.server.ACMEApplication] in context with path [/acme] threw exception
org.jboss.resteasy.spi.UnhandledException: java.lang.Exception: Order not ready: ROl0zDqknZ
        ...
Caused by: java.lang.Exception: Order not ready: ROl0zDqknZ
	at org.dogtagpki.acme.server.ACMEFinalizeOrderService.handlePOST(ACMEFinalizeOrderService.java:70)
        ...

So you're right, the order didn't end up with a wrong status in the LDAP record, but from the client's perspective the order didn't have the expected status. ACMEFinalizeOrderService (http-nio-8080-exec-12) read the order status before ACMEChallengeProcessor (Thread-8) updated it. That update usually happens immediately after the authz is set to valid, but for some reason it is not immediate with minConns=0.

@edewata
Contributor Author

edewata commented Sep 12, 2023

I tried fixing the exception as described in the RFC: edewata@49103ac
but certbot doesn't seem to like it:
https://github.com/edewata/pki/actions/runs/6163037711/job/16726539917#step:28:154

Sending POST request to http://pki.example.com:8080/acme/rest/order/sNQrYviWAT/finalize:

https://github.com/edewata/pki/actions/runs/6163037711/job/16726539917#step:28:160

http://pki.example.com:8080/ "POST /acme/rest/order/sNQrYviWAT/finalize HTTP/1.1" 403 90

https://github.com/edewata/pki/actions/runs/6163037711/job/16726539917#step:28:169

{"type":"urn:ietf:params:acme:error:orderNotReady","detail":"Order not ready: sNQrYviWAT"}

certbot failed with this message instead of retrying the finalization:

acme.messages.Error: urn:ietf:params:acme:error:orderNotReady :: The request attempted to finalize an order that is not ready to be finalized :: Order not ready: sNQrYviWAT
An unexpected error occurred:
Order not ready: sNQrYviWAT

@edewata
Contributor Author

edewata commented Sep 12, 2023

I've created a new patch similar to this PR but probably more RFC-compliant: edewata@0ba6c70
As before, it basically has to reinterpret the order pending status (see inline comments).

Please let me know if this is OK, then I can update the PR. I'll also try to add compatibility tests using other clients in separate PRs. Thanks!

@frasertweedale
Contributor

I do think we ought to merge edewata@49103ac (perhaps in modified form to account for whatever other changes we want to make), because it makes us respond more properly in the "order status != ready" case.

I am still reviewing the other changes.

@frasertweedale
Contributor

frasertweedale commented Sep 13, 2023

I do think certbot's behaviour is non-conformant here. Per RFC 8555:

   A request to finalize an order will result in error if the order is
   not in the "ready" state.  In such cases, the server MUST return a
   403 (Forbidden) error with a problem document of type
   "orderNotReady".  The client should then send a POST-as-GET request
   to the order resource to obtain its current state.  The status of the
   order will indicate what action the client should take (see below).

Evidently certbot just bails out upon receiving the 403, which is wrong. But... we should still try and handle this better on our end.

edit: I filed a certbot issue: certbot/certbot#9766
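
For illustration, the client behaviour that the RFC passage above describes could look roughly like the sketch below. It is not certbot code. `OrderEndpoint` is a hypothetical stand-in for a real ACME client's HTTP layer, and the sketch assumes that any 403 from the finalize URL is an orderNotReady error, which a real client would need to verify from the problem document.

```java
/** Hypothetical client-side view of an ACME server; not a real library API. */
interface OrderEndpoint {
    /** POSTs the CSR to the order's finalize URL and returns the HTTP status code. */
    int finalizeOrder();

    /** Sends a POST-as-GET to the order URL and returns the order's "status" field. */
    String fetchStatus();
}

final class FinalizeWithRetry {
    /** Returns true once the server accepts finalization, false otherwise. */
    static boolean run(OrderEndpoint server, int maxPolls, long pollDelayMillis) {
        int code = server.finalizeOrder();
        if (code == 200) {
            return true;
        }
        if (code != 403) {
            return false; // some other failure; out of scope for this sketch
        }
        // Assumed orderNotReady: poll the order instead of giving up.
        for (int i = 0; i < maxPolls; i++) {
            String status = server.fetchStatus();
            if ("ready".equals(status)) {
                return server.finalizeOrder() == 200; // retry once the order is ready
            }
            if (!"pending".equals(status)) {
                return false; // e.g. "invalid"; handling of other states is omitted here
            }
            try {
                Thread.sleep(pollDelayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // order never became ready within maxPolls
    }
}
```

A client written this way would tolerate the window in which the authz is already valid but the order is still pending.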

@frasertweedale
Contributor

frasertweedale commented Sep 13, 2023

I reviewed edewata@0ba6c70.

I object to this approach. The RFC is clear that the order must transition to ready when all authz are valid, and failing to do this could break client compatibility if the client is waiting for the order to become ready before asking to finalize the order.

Let me suggest a different approach. In ACMEChallengeProcessor.finalizeValidAuthorization():

  1. Retrieve all pending orders for the current account and authzID
  2. For each pending order:
    1. If all authz except the current authz have status valid, set the order status to ready
  3. Set the authz status to valid.

This flips the order in which the state transitions occur. The Order transitions to ready first, then the last Authz transitions to valid.

For example (just a rough diff; I haven't built it, let alone tested):

diff --git a/base/acme/src/main/java/org/dogtagpki/acme/server/ACMEChallengeProcessor.java b/base/acme/src/main/jav>
index 15514bb39..c5dc1eb68 100644
--- a/base/acme/src/main/java/org/dogtagpki/acme/server/ACMEChallengeProcessor.java
+++ b/base/acme/src/main/java/org/dogtagpki/acme/server/ACMEChallengeProcessor.java
@@ -118,8 +118,6 @@ public class ACMEChallengeProcessor implements Runnable {
         Date expirationTime = engine.getPolicy().getValidAuthorizationExpirationTime(currentTime);
         authorization.setExpirationTime(expirationTime);
 
-        engine.updateAuthorization(account, authorization);
-
         logger.info("Updating pending orders");
 
         Collection<ACMEOrder> orders =
@@ -129,6 +127,10 @@ public class ACMEChallengeProcessor implements Runnable {
             boolean allAuthorizationsValid = true;
 
             for (String orderAuthzID : order.getAuthzIDs()) {
+                if (orderAuthzID.equals(authzID)) {
+                    // We're about to set it to valid, so treat it as such
+                    continue;
+                }
 
                 ACMEAuthorization authz = engine.getAuthorization(account, orderAuthzID);
                 if (authz.getStatus().equals("valid")) continue;
@@ -147,6 +149,13 @@ public class ACMEChallengeProcessor implements Runnable {
 
             engine.updateOrder(account, order);
         }
+
+        // We defer the LDAP update until AFTER any order transitions
+        // occur.  This avoids a race condition where clients that eagerly
+        // proceed to finalization when all Authorizations are "valid"
+        // experience finalization failure, because the __Order__ has not yet
+        // transitioned to "ready".
+        engine.updateAuthorization(account, authorization);
     }
 
     public void finalizeInvalidAuthorization(ACMEError err) throws Exception {
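
The effect of flipping the two writes can be checked with a tiny in-memory model. This is a sketch under simplifying assumptions: `Records` stands in for the LDAP entries, there is a single order with a single authz, and the "client" is modelled as a snapshot taken after each write rather than a real concurrent thread. None of these names come from the PKI code.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-in for the LDAP records; illustrative only. */
final class Records {
    String authzStatus = "pending";
    String orderStatus = "pending";
}

final class UpdateOrderingDemo {

    /**
     * Applies the two writes in the given order and records, as "authz/order",
     * what a client polling after each write would observe.
     */
    static List<String> observe(boolean orderFirst) {
        Records r = new Records();
        List<String> seen = new ArrayList<>();
        Runnable snapshot = () -> seen.add(r.authzStatus + "/" + r.orderStatus);

        if (orderFirst) {
            r.orderStatus = "ready";  // order transitions first
            snapshot.run();
            r.authzStatus = "valid";  // last authz becomes visible afterwards
        } else {
            r.authzStatus = "valid";  // original ordering
            snapshot.run();
            r.orderStatus = "ready";
        }
        snapshot.run();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(observe(false)); // [valid/pending, valid/ready]
        System.out.println(observe(true));  // [pending/ready, valid/ready]
    }
}
```

With the original ordering there is a visible `valid/pending` state, which is the window the failing finalize request hit. With the order written first, a client that waits for the authz to become valid never observes a pending order.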

@edewata
Contributor Author

edewata commented Sep 13, 2023

Thanks for the suggestion. I tried your patch as is with PR #4561 and the new exception, and it seems to be working: edewata@3a060d7
https://github.com/edewata/pki/actions/runs/6176665722/job/16766677819

If you prefer, feel free to create a new PR for your patch, or I can just update this PR with your patch (and make you the author).

@frasertweedale
Contributor

@edewata great! I'll let you take it from here. Cheers!

edewata and others added 2 commits September 14, 2023 13:26
Previously when the authorization for an order completed the
ACMEChallengeProcessor would update the authorization status
and the order status after that, then the client would call
the ACMEFinalizeOrderService to finalize the order.

However, since these updates are not atomic there is a risk
that the client will detect the new authorization status then
immediately call the ACMEFinalizeOrderService before the order
status can be updated by ACMEChallengeProcessor. Since both
ACMEFinalizeOrderService and ACMEChallengeProcessor will
read and write the same order at the same time, there could be
a race condition and the finalization could fail.

To avoid the problem the ACMEChallengeProcessor has been
modified to update the order status first before updating the
authorization status.
@sonarcloud

sonarcloud bot commented Sep 14, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

0.0% Coverage
0.0% Duplication

@edewata
Contributor Author

edewata commented Sep 14, 2023

@fmarco76 @ckelleyRH @frasertweedale Thanks! I've updated the PR.

@edewata edewata merged commit 2457fc3 into dogtagpki:master Sep 14, 2023
150 of 151 checks passed