Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: iOS app crash caused by the request operation canceling #48350

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

zhouzh1
Copy link

@zhouzh1 zhouzh1 commented Dec 20, 2024

Summary:

Currently we observed many iOS app crashes caused by the [RCTFileRequestHanlder invalidate] method, just as the below screenshot.
image

Changelog:

[IOS] [FIXED] - app crash caused by the [RCTFileRequestHanlder invalidate] method

Test Plan:

I am not able to reproduce this issue locally either, so the changes in this PR are totally from my inference, I am not sure if it really makes sense, so please help take a deeper look, thanks.

@facebook-github-bot facebook-github-bot added CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Shared with Meta Applied via automation to indicate that an Issue or Pull Request has been shared with the team. labels Dec 20, 2024
@zhouzh1
Copy link
Author

zhouzh1 commented Dec 20, 2024

@mojodna @jsierles @augustl Could you help take a look at this PR? Thanks.

@augustl
Copy link
Contributor

augustl commented Dec 20, 2024

@zhouzh1 hi! Out of curiosity, why did you tag me in this PR? I keep getting tagged in React Native PRs for some reason, and I'm not sure why :)

Copy link
Contributor

@cipolleschi cipolleschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is proper fix for the issue.
Could you share more about the crash?
Is there an error message?
Can you share the full crashlog?
Is it happening in production only or also in debug?
Is the app in background or in foreground?

Comment on lines 28 to 35
if (_fileQueue) {
for (NSOperation *operation in _fileQueue.operations) {
if ([operation isKindOfClass:[NSOperation class]] && !operation.isCancelled && !operation.isFinished) {
[operation cancel];
}
}
_fileQueue = nil;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary. You can only add NSOperation to an NSOperationQueue and cancellAllOperations runs the same code you are manually writing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cipolleschi Seems that the cancellAllOperations won't do the status checks for the operation, and in the crash report of my iOS app, the stack trace exactly tells me the crash point is just in the cancellAllOperations internal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the official Apple docs for cancelAllOperations.

This method calls the cancel method on all operations currently in the queue.
Canceling the operations does not automatically remove them from the queue or stop those that are currently executing. For operations that are queued and waiting execution, the queue must still attempt to execute the operation before recognizing that it is canceled and moving it to the finished state. For operations that are already executing, the operation object itself must check for cancellation and stop what it is doing so that it can move to the finished state. In both cases, a finished (or canceled) operation is still given a chance to execute its completion block before it is removed from the queue.

And this is the docs of NSOperation cancel.

In any case, calling cancel on an already cancelled or finished operation does not crash the app.

I believe that the crash is happening inside one of the operations that are being cancelled and that's why the crash reporter reports the crash there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that even though we already put assurance for the corresponding code to make it only executed on the main thread or another sole thread, e.g. the JS thread, but because of the nature of object pointer reference, the operation object could still be shared among multiple thread contexts, then the situation you said above happens.

if (_didInvalidate) {
return;
}
RCTUnsafeExecuteOnMainQueueSync(^{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could potentially deadlock. We should not run the unsafe variant of this method. Can you change it with RCTExecuteOnMainQueue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.


RCTAssertMainQueue();
RCTLogInfo(@"Invalidating %@ (parent: %@, executor: %@)", self, _parentBridge, [self executorClass]);
RCTAssertMainQueue();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that confuses me is that, in your stacktrace, the crash is happening in Thread 26... but this assert should force the app to be on the main thread, which is not the Thread 26... how's this possible?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cipolleschi Good question, that's because of actually the RCTAssertMainQueue only takes effects in dev build, when in the release build, it does nothing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I added the RCTUnsafeExecuteOnMainQueueSync wrapper to ensure the code to run on the main thread.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image As you can see, when the `NS_BLOCK_ASSERTIONS` is defined, the `RCTAssert` macro is actually empty, and generally the `NS_BLOCK_ASSERTIONS` is defined in release build.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good explanation, but then we should see crashes in development happening because of the assertion. And IIUC, the app does not crash in development, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By looking at the crash log, the JS thread is triggering the invalidation. I think that this is the root of the problem: after the JS thread detect the invalidation, we should jump on the UI thread to invalidate everything...

Copy link
Author

@zhouzh1 zhouzh1 Dec 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't encounter this crash in development, but I am not sure if it happens in it, you know, it's a occasional issue on itself.

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 20, 2024

@zhouzh1 hi! Out of curiosity, why did you tag me in this PR? I keep getting tagged in React Native PRs for some reason, and I'm not sure why :)

Hey @augustl, sorry, I thought you're a memeber of the react-native dev team, because your id was displayed in the hint list when I was typing the @ char in the comment box. 😸

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 20, 2024

@cipolleschi Put more information here for your reference.

  1. I only observed this crash in production by sentry, have not seen it in dev.
  2. According the sentry breadcrumbs, seems that most of the events happened after the app was pushed to the background state.
  3. You can see the relevant error messages that we collected by sentry on the below screenshot.
image 4. Posted a complete crash report. [79c34beca2e542da9ab6df2ad54ec60d-symbolicated.crash.zip](https://github.com/user-attachments/files/18211158/79c34beca2e542da9ab6df2ad54ec60d-symbolicated.crash.zip)

@RSNara
Copy link
Contributor

RSNara commented Dec 20, 2024

@zhouzh1, this diff assumes that the crash occurs because file reader invalidation should occur on the main thread. Have we validated that assumption? If not, we should! If so, then, why not just just surgically modify the file reader as opposed to the bridge? As it stands, your changes will cause a lot of code to execute on the main thread, which could have significant perf implications.

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 21, 2024

@zhouzh1, this diff assumes that the crash occurs because file reader invalidation should occur on the main thread. Have we validated that assumption? If not, we should! If so, then, why not just just surgically modify the file reader as opposed to the bridge? As it stands, your changes will cause a lot of code to execute on the main thread, which could have significant perf implications.

@RSNara As you can see from the code and the above conversation with @cipolleschi , there is already a RCTAssertMainQueue() invocation in the start of the invalidation, that means the original intention is exactly needing the corresponding code running on the main queue? Just exactly because of accidentally the official dev team didn't notice the RCTAssertMainQueue won't take effect in release build on most of the cases (it's just my guess and inference, if the dev team knew that and did it intentionally, please ignore).

@RSNara
Copy link
Contributor

RSNara commented Dec 23, 2024

@zhouzh1, this diff assumes that the crash occurs because file reader invalidation should occur on the main thread. Have we validated that assumption? If not, we should! If so, then, why not just just surgically modify the file reader as opposed to the bridge? As it stands, your changes will cause a lot of code to execute on the main thread, which could have significant perf implications.

@RSNara As you can see from the code and the above conversation with @cipolleschi , there is already a RCTAssertMainQueue() invocation in the start of the invalidation, that means the original intention is exactly needing the corresponding code running on the main queue? Just exactly because of accidentally the official dev team didn't notice the RCTAssertMainQueue won't take effect in release build on most of the cases (it's just my guess and inference, if the dev team knew that and did it intentionally, please ignore).

I just took a look!

RCTCxxBridge invalidate should only ever be called from RCTBridge. And RCTBridge invalidate and reload (but not dealloc) schedules RCTCxxBridge invalidate on the main thread:

https://github.com/facebook/react-native/blob/main/packages/react-native/React/Base/RCTBridge.mm?fbclid=IwZXh0bgNhZW0CMTEAAR0vNpP3fC5jmfsIbD3EGeU8ynLhumgdGU2JgStLpsMNIT8VNT9PqeT3N2I_aem__qaj5Hd0GAQmvSVLs7dQZA#L368

So, it's very curious that you're running into this issue. Could it be that in your code, you're relying on the dealloc method of RCTBridge? And that just synchronously deallocates the RCTCxxBridge on the current (i.e: potentially non-main) thread?

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 24, 2024

@zhouzh1, this diff assumes that the crash occurs because file reader invalidation should occur on the main thread. Have we validated that assumption? If not, we should! If so, then, why not just just surgically modify the file reader as opposed to the bridge? As it stands, your changes will cause a lot of code to execute on the main thread, which could have significant perf implications.

@RSNara As you can see from the code and the above conversation with @cipolleschi , there is already a RCTAssertMainQueue() invocation in the start of the invalidation, that means the original intention is exactly needing the corresponding code running on the main queue? Just exactly because of accidentally the official dev team didn't notice the RCTAssertMainQueue won't take effect in release build on most of the cases (it's just my guess and inference, if the dev team knew that and did it intentionally, please ignore).

I just took a look!

RCTCxxBridge invalidate should only ever be called from RCTBridge. And RCTBridge invalidate and reload (but not dealloc) schedules RCTCxxBridge invalidate on the main thread:

https://github.com/facebook/react-native/blob/main/packages/react-native/React/Base/RCTBridge.mm?fbclid=IwZXh0bgNhZW0CMTEAAR0vNpP3fC5jmfsIbD3EGeU8ynLhumgdGU2JgStLpsMNIT8VNT9PqeT3N2I_aem__qaj5Hd0GAQmvSVLs7dQZA#L368

So, it's very curious that you're running into this issue. Could it be that in your code, you're relying on the dealloc method of RCTBridge? And that just synchronously deallocates the RCTCxxBridge on the current (i.e: potentially non-main) thread?

Your curiosity is mine as well. Before I submitted this PR, I was also suspecting if there is a certain place in my code or 3rd-party library code where the RCTBridge or the RCTCxxBridge deallocation is invoked explicitly, but I didn't manage to find it. However anyway, we always need to ensure the RCTCxxBridge invalidation to be run on the main thread, is it right? If so, it makes sense to wrap it with the RCTExecuteOnMainQueue?

@cipolleschi
Copy link
Contributor

@zhouzh1 is your app using Expo? I wonder if Expo does something under the hood to try and manage the lifecycle of the Bridge. Similarly, there might be libraries that attempt to do the same. If they connect some private API like the reload one to some JS function, it might happen that the invalidation process starts from the JS Thread instead of from the main one. 🤔

This is an hypothesis that we need to validate, thought.. It would be helpful to know what dependencies are you using. Also, are you using any crash reporting solution like Sentry? those product usually allow you to leave breadcrumbs that can be used to investigate crashes. What was the user doing in the app when the crash occurred? What was the last action issued?

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 24, 2024

@zhouzh1 is your app using Expo? I wonder if Expo does something under the hood to try and manage the lifecycle of the Bridge. Similarly, there might be libraries that attempt to do the same. If they connect some private API like the reload one to some JS function, it might happen that the invalidation process starts from the JS Thread instead of from the main one. 🤔

This is an hypothesis that we need to validate, thought.. It would be helpful to know what dependencies are you using. Also, are you using any crash reporting solution like Sentry? those product usually allow you to leave breadcrumbs that can be used to investigate crashes. What was the user doing in the app when the crash occurred? What was the last action issued?

Yes, our app is using the expo and many of its associated libraries (e.g. the expo-updates, expo-camera, and so on), I think what you said above makes sense.
According to the Sentry breadcrumbs we collected, I found most of the crash events occurring after the app was pushed to the background state, and meanwhile, there were some requests in pending.
image

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 25, 2024

@cipolleschi Any ideas about the above information I provided?

@cipolleschi
Copy link
Contributor

It partially makes sense. Moving an app to background triggers the events on the main thread. So the whole invalidation should already happen in the right thread.
But from your stack trace, the invalidation seems to arrive from a different thread. That's what we do not understand.

Also, the purpose of the invalidation and requestCancelling, is exactly to avoid those kind of crashes... As soon as the app goes in background, the operation are cancelled and they should not be executed... :/


Any chance that you can try and prepare a repro using this template, so that we can investigate this problem with more ease?

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 30, 2024

@cipolleschi Unfortunately we haven't managed to reproduce it...😂, we just observed it on sentry and Xcode's crash statistics, but from what you said, seems that as long as we can ensure that the invalidation process only happens on the main thread, we can certainly avoid those crashes, is it right?

@cipolleschi
Copy link
Contributor

I think so, that's what we assumed when we wrote the code. However, we couldn't predict how frameworks like Expo might use that. :(
It is also a bit tricky to handle because invalidation should block everything else, but if it is not triggered by the main thread, we should jump asynchronously to the main queue, otherwise we risk to incur in deadlocks and this will not block everything else.
@RSNara do you have any idea on how to handle this case?

@cipolleschi
Copy link
Contributor

I was investigating this a little bit more, and I found that when the app is reloaded (pressing r on Metro):

  1. we are already on the main thread
  2. we call [RCTBridge invalidate]. This function jumps asynchronously to the main queue.
  3. That function then calls [RCTCxxBridge invalidate] that performs all the invalidation code.

In your crash, there could be some piece of code that calls directly [RCTCxxBridge invalidate] without passing for the RCTBridge instance. The problem might probably be in Expo, at this point. I looked at all the React Native call to invalidate, and aside those two, none other call directly go to the Bridge.

@cipolleschi
Copy link
Contributor

cipolleschi commented Dec 31, 2024

@zhouzh1 Any chance you can share the package.json with the dependencies you have in your project? This might help to trim down to the dependency that might cause the issue.

@Kudo
Copy link
Contributor

Kudo commented Dec 31, 2024

  1. Posted a complete crash report. 79c34beca2e542da9ab6df2ad54ec60d-symbolicated.crash.zip

also would be good to share all threads' stacktrace from sentry if available. this crash log does not have symbolicated stacktrace. it looks like the crash happens when reloading the app. do you use Updates.reloadAsync() or any other reload code in your app?

@zhouzh1
Copy link
Author

zhouzh1 commented Jan 1, 2025

@cipolleschi You can find all the package dependencies in this file. 👉 dependency.json

@zhouzh1
Copy link
Author

zhouzh1 commented Jan 1, 2025

  1. Posted a complete crash report. 79c34beca2e542da9ab6df2ad54ec60d-symbolicated.crash.zip

also would be good to share all threads' stacktrace from sentry if available. this crash log does not have symbolicated stacktrace. it looks like the crash happens when reloading the app. do you use Updates.reloadAsync() or any other reload code in your app?

@Kudo You're right, we're using the expo-updates and its reloadAsync method, I was suspecting it as well, but I didn't mamage to find any evidence, if you can help take a look, that would be very helpful.

@zhouzh1
Copy link
Author

zhouzh1 commented Jan 1, 2025

And post a crash report which has clearer symbolicated stack trace. @cipolleschi @Kudo
2024-12-31_21-35-22.0862_-0800-81ec277901cdb54faeb4f79131f4ec43b8824f00.crash.zip

@zhouzh1
Copy link
Author

zhouzh1 commented Jan 2, 2025

@cipolleschi Is it possible that the buggy point is in the onReload methods which are individually defined in the RCTHost.mm file and the RCTBridge.mm file? I saw they are using the RCTAssertMainQueue macro as well, but as you know, this macro won't take effect in release build.

@Kudo
Copy link
Contributor

Kudo commented Jan 2, 2025

And post a crash report which has clearer symbolicated stack trace.
2024-12-31_21-35-22.0862_-0800-81ec277901cdb54faeb4f79131f4ec43b8824f00.crash.zip

that is useful. thanks for sharing the crashlog.

Is it possible that the buggy point is in the onReload methods which are individually defined in the RCTHost.mm file and the RCTBridge.mm file?

it should be unrelated. RCTHost is for new architecture and RCTBridge more on legacy architecture.

@Kudo
Copy link
Contributor

Kudo commented Jan 2, 2025

my finding

  • expo-updates Updates.reloadAsync() uses RCTTriggerReloadCommandListeners to reload the app. the old bridge instance will be deallocated and [RCTCxxBridge invalidate] would be called on main thread. that is correct flow.
  • [RCTCxxBridge invalidate] will switch to different thread at different point:
  • RCTFileRequestHandler init on main thread and invalidate on worker thread. if you want to invalidate on main thread, you can add the code in RCTFileRequestHandler.mm
    - (dispatch_queue_t)methodQueue
    {
      return dispatch_get_main_queue();
    }
  • according to apple, NSOperationQueue should be thread-safe. it should be fine even RCTFileRequestHandler init/invalidate on different threads. i still have no idea how the crash happens. but it could be during reload and fetch('file:///...').
  • during invalidate, the RCTFileRequestHandler.cancelRequest will also be called:
    • from RCTNetworking.invalidate:
      - (void)invalidate
      {
      [super invalidate];
      std::lock_guard<std::mutex> lock(_handlersLock);
      for (NSNumber *requestID in _tasksByRequestID) {
      [_tasksByRequestID[requestID] cancel];
      }
    • then RCTNetworkTask.cancel:
      - (void)cancel
      {
      if (_status == RCTNetworkTaskFinished) {
      return;
      }
      _status = RCTNetworkTaskFinished;
      id token = _requestToken;
      if (token && [_handler respondsToSelector:@selector(cancelRequest:)]) {
      [_handler cancelRequest:token];
      }
    • and RCTFileRequestHandler.cancelRequest:
      - (void)cancelRequest:(NSOperation *)op
      {
      [op cancel];
      }
    • the other hypothesis is from race condition between NSOperation.cancel and NSOperationQueue.cancelAllOperations. in this case, maybe we can try revise the cancelRequest method as
      - (void)cancelRequest:(NSOperation *)op
      {
        if (!op.isCancelled && !op.isFinished) {
          [op cancel];
        }
      }
      that is actually similar to the pr's solution though.

@zhouzh1
Copy link
Author

zhouzh1 commented Jan 3, 2025

@Kudo Thanks for your debugging and the detaild analysis, that's very helpful, if we're not able to get the real root cause for this crash or reproduce it(I tried many ways to mock the crash scenario but didn't catch it), maybe let me patch the RN by this solution and to see what would happen.

@cipolleschi
Copy link
Contributor

@zhouzh1 That's sounds like a good approach. After you verified that the crash is solved, you can update the PR and we can import it.

Also, thank you @Kudo for spending your time on this issue!The analysis was brilliant.

PS: calling cancel on an already cancelled operation should be a noop...

@zhouzh1
Copy link
Author

zhouzh1 commented Jan 17, 2025

@cipolleschi Seems that I succeeded, I applied my patch (that is the latest PR change) in prod env and have been observing it for around a week, I didn't receive that crash report anymore in the latest version of my app, also, I think there is a similar bug as well in the RCTDataRequestHandler.mm module, so I fixed it alongside in this PR, please help take another look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Shared with Meta Applied via automation to indicate that an Issue or Pull Request has been shared with the team.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants