[BUG] OTA API and OTA_EventProcessingTask is not task/thread safe when it comes to accessing common state. #465
Comments
Hi, we are looking into this and will get back to you soon.
Thanks for this extremely detailed bug report @phelter. I spent a few hours digging through our code trying to prove thread safety to myself and I agree - it's sorely lacking. The application task, processing loop, and callbacks could easily clash. The OTA library predates me a bit, so I'd like to chat with a few people on my team who were here during its most recent iteration's development to understand if I'm missing some information on these functions' call patterns. At this moment I'm in agreement with the expected behavior you've described.
@kstribrnAmzn - Thanks for acknowledging this issue. Would it be possible to provide an estimated time for this to be corrected, so that I can plan accordingly? Unfortunately, I don't believe this is the only FreeRTOS or Amazon IoT C library to have this issue. It might be prudent to look at and confirm the following don't have similar issues as well.
Since the |
Sorry for the delay in responding to you @phelter. Checking our other libraries makes a lot of sense. We want consistency across the FreeRTOS software suite. I'll create stories in our backlog queue so that we don't lose track of this work. As for an estimate - I cannot provide one. Restructuring OTA is high on our priority list, however things have shifted a bit recently. I'm not sure exactly when this work will be started and by how many. Any date I provide would be inaccurate and likely out of date within weeks.
I've created both issues on the GitHub repositories mentioned, requesting help wanted (to get faster support from the community), as well as an internal tracking ticket for the FreeRTOS team.
I am new to developing IoT with AWS. Is there anything I can assist with as volunteer work? Please message me. Thanks in advance. I look forward to contributing in any way. Cheers 497115@protonmail.com
@eldruid Thanks for reaching out! Between all the repositories my team owns (FreeRTOS), there is plenty of work to do. What work or libraries interest you? In the meantime I'm reaching out to others on my team to see if we have any work which would be doable for someone not as familiar with the libraries. This issue you commented on could be a good place to start. There are essentially 2 parts to the work as of right now. The first part is auditing our various libraries, as @phelter had mentioned, to ensure there are no other variables which can be accessed in a thread-unsafe manner. Basically, you'd want to look for structures/objects which contain library state and are used by both the library and optionally the user. The issue with these is that the user could be accessing the structure/object from another FreeRTOS task (aka thread) while the library task/thread is updating the values, causing a race condition. The second part of the work is implementing a mutex preventing the race condition. In the case of this issue, the mutex would block access while the object is being updated. In our case, since this library is not tied to a specific OS, this may require further rework. A set of getter methods for the object which requires the OTA task to synchronize could be a better solution. I will say I haven't thought deeply about this, so I'm not sure either of these is the right solution.
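To make the getter idea above concrete, here is a minimal sketch of a getter that hands out a consistent copy of shared state instead of a pointer into it. Everything in it is an assumption for illustration: the `DemoStats` type, the `demoRecordPacket()` and `demoGetStats()` functions, and the use of POSIX threads are hypothetical and are not part of the OTA library, which is not tied to a specific OS.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for library state shared with the application. */
typedef struct {
    uint32_t packetsReceived;
    uint32_t packetsProcessed;
} DemoStats;

static DemoStats demoStats;
static pthread_mutex_t demoStatsLock = PTHREAD_MUTEX_INITIALIZER;

/* Called by the library task: both counters are updated as one unit. */
void demoRecordPacket(int processed)
{
    pthread_mutex_lock(&demoStatsLock);
    demoStats.packetsReceived++;
    if (processed) {
        demoStats.packetsProcessed++;
    }
    pthread_mutex_unlock(&demoStatsLock);
}

/* Getter for application tasks: returns a copy taken under the lock,
 * so the caller never observes a half-updated structure. */
DemoStats demoGetStats(void)
{
    DemoStats copy;

    pthread_mutex_lock(&demoStatsLock);
    copy = demoStats;
    pthread_mutex_unlock(&demoStatsLock);

    return copy;
}
```

An application task would call `demoGetStats()` and work on the returned copy. On a port without POSIX threads, the two lock calls would be swapped for whatever mutex primitive the OS abstraction layer provides.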
Not to discourage you from working on this issue - as it is important - but you may want to take on a slightly easier issue to start with like FreeRTOS/FreeRTOS-Kernel#617. |
I had an enlightening conversation with @AniruddhaKanhere on the thread safety aspect of this issue. From what I've been able to find, OTA is largely thread-safe but ultimately does not claim any thread safety. For this reason, I don't think I'd call this a bug like I would in other libraries, like coreMqtt-Agent for example, which promise thread safety. However, I think there is still value in auditing this repository in case there are any serious thread-safety issues. From a cursory glance through the APIs you've highlighted, the common issue is a reader-writer problem. While this is a problem, supplying a list of best practices for this library (I'll be creating a backlog item on our side) would mitigate most of these. The audit would really be to identify any place where the outside task (aka non-ota-lib) can influence the update or modification of the
@kstribrnAmzn that is a very unfortunate stance you are taking on this particular issue. First of all the API related to this is implying thread safety - the examples suggest calling So one of two things are wrong. The demos are grossly misrepresenting the way this API is to be consumed and are broken, OR the API is implied thread safe. Pick your poison - but at least one is true. If one were to assume it was not thread safe - one could only call any of the API functions inside the Also, when there are OTA based callbacks being provided to other services - such as a timer (in a different thread) or a different module (such as Mqtt which typically is operating in another thread), or an ISR (which is in interrupt context and not in application/code) - all of that particular code MUST ensure the thread-safety of the Instead of trying to justify why it's not a |
Hey @phelter I agree the current version of the OTA library has a number of shortcomings. Forgoing labels, there should be properly implemented thread safety for a stack like OTA. In practice, it's not far fetched that various subsystem threads may share an OTA thread, in an independent manner, to download files specific to their subsystem. OTA has been under a watchful eye, and we're planning on a major revision, but that's of course triaged/scheduled with consideration for other goals. Thank you for the feedback, it helps to bump OTA priority. |
Hey again @phelter, |
Thanks @Skptak, I understand it's a big change to incorporate that level of atomicity or memory barrier access into any of the loads/stores of internal data, and the verification to ensure the correct mutex is always locked when those values are accessed (read/modified/written). I appreciate the effort on fixing this.
Thanks for your patience. Having analyzed this further, we agree accessing a single structure from multiple threads is bad practice, but we haven’t found an occurrence within the library where it could be catastrophic and crash or brick a device. Please update the ticket again, with examples, if you disagree. This is not to say that we are not going to address these problems. In fact we took this as the opportunity to provide a more modular and flexible library. We are just trying to ensure that the people using the library today do not panic and can move to the modular version at their own pace. |
Yes it is not only bad practice but can lead to incorrect data being used and/or settings/configuration being overwritten. My preference would be:
Given the speed at which this serious issue is being resolved I don't believe I'll be using/consuming this or other related Amazon libraries. I don't see anything wrong with the API itself so replacing it with a different library means plenty of more work for all integrators of this library and no work for the owner (Amazon) of this particular library. Please Choose #1. If the users of this library are building If the user of this library is releasing a product with this - then expect customer calls that the OTA functionality doesn't work all the time - OR the operation of it isn't robust from one build to the next or under various loads or operational changes. |
Hi phelter – thanks for raising this issue and providing your feedback. I was satisfied there was no immediate risk to users following our examples after looking at the implementation myself, but your points are valid. Of course, not everybody uses the code in the same way. So then to the question of what to do about it. I wasn’t in favor of the mutex approach as 1) it required more analysis to ensure deadlock avoidance, and 2) It’s fixing the symptom rather than the cause. Additionally, I wanted to request some rework of the agent to make it less opinionated on how it can be used, and what it can connect to, anyway. That brings it into line with our current thinking on having no dependencies on anything other than the C library – making it more widely applicable, and effectively moving it from the right side to the left side of this diagram. Once complete, the rework will allow connection to any OTA source using any protocol, and if you want to get into the weeds of the library and take responsibility for code changes, then also change the state machine. Hence, given what I said above about no immediate risk to those following our examples, it was decided to put resources into the reworked version. i.e. rather than fixing the symptom (the race conditions on shared data) we focus on the cause (the design) so that we eliminate shared data altogether. |
Now that the new OTA library is GA and an example showing how to use it is available, I am closing this issue. If you want to discuss further, please re-open or create a thread on FreeRTOS Forums. Thanks. |
Thank you @aggarg for releasing a potential option to this particular library. My apologies, but how does the release of a new library and a demo using that library resolve the issue in this particular repo? The repos you have linked to are lab projects that support different APIs to this one and do not support:
Please leave this open until one of the following has occurred:
There is also an implication of code quality and a level of security to this library in the Readme which is great, but in my opinion is false advertising until the above issues have been resolved. I am unable to re-open this issue as you described - please re-open it on my behalf. Thank you. |
You are right that the release of our new coreOTA library and the corresponding demo examples do not fix the architectural issue in this repository. I will create a PR updating the Readme to provide the deprecation notice - as you mentioned. The new coreOTA is a collection of composable libraries:
We have provided the following 2 examples to help new and existing users transition to using coreOTA -
We will leave this issue open until the PR to update the README is merged. |
I've updated this library with a notice of deprecation pointing to the new core OTA library materials. Closing this out as this satisfies your first ask. |
Describe the bug
The OTA API and the task that is expected to be used use common data values without synchronization between tasks/threads. The OTA implementation is NOT Thread/Task safe.
There is a gross error in the way portions of the otaAgent internal state are being read/modified/written. Portions of it are assumed to be atomic across all tasks/threads, but there are no guarantees that this is the case.
There are 3 potential tasks/threads where actions can be performed and are currently in contention:
mqtt or http) executing the callbacks
For the state and/or callbacks there is no synchronization barrier (e.g. a semaphore or mutex) of the otaAgent information when any of these three tasks are accessing the otaAgent common control block.
These values MUST be either specified as atomic OR consumed within a semaphore/mutex lock so that actions performed upon them by either a task calling the OTA_*() API functions or the task running OTA_EventProcessingTask() will not inadvertently overwrite the values - especially within code portions that have a read - decision - write sequence.
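As an illustration of the "specified as atomic" option for a read - decision - write sequence, here is a minimal sketch using C11 atomics. The `DemoState` enum and the `demoTransition()` and `demoGetState()` helpers are hypothetical names invented for this example and do not exist in the OTA library. It also assumes a toolchain that provides `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical agent states, for illustration only. */
typedef enum {
    DEMO_STOPPED = 0,
    DEMO_READY = 1,
    DEMO_SUSPENDED = 2
} DemoState;

static _Atomic int demoState = DEMO_STOPPED;

/* Move from `expected` to `next` only if no other task changed the state
 * between the read and the write. The compare and the store happen as a
 * single atomic step. Returns true when this caller made the transition. */
bool demoTransition(DemoState expected, DemoState next)
{
    int seen = (int)expected;

    return atomic_compare_exchange_strong(&demoState, &seen, (int)next);
}

/* Atomic read of the current state. */
DemoState demoGetState(void)
{
    return (DemoState)atomic_load(&demoState);
}
```

If two tasks race to call `demoTransition(DEMO_STOPPED, DEMO_READY)`, exactly one of them gets `true`. This only covers a single word of state. Anything that must change several fields together still needs a lock.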
I'm only providing the examples pertaining to the API (App -> OTA_EventProcessingTask()) but there are most likely others between the Network registered callbacks and the OTA_EventProcessingTask() as well.
Eg: the OTA_Init()
This should have something along the lines of:
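The snippet that originally accompanied this point is not preserved in this copy of the thread. As a stand-in, the following is a minimal sketch of the general shape being asked for: the check of the agent state and the write that follows are performed under one mutex. It is not the library's code. `SketchAgent`, `sketchInit()`, `sketchShutdown()`, and the use of POSIX threads are all assumptions made for illustration.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical control block, standing in for the shared agent state. */
typedef struct {
    bool running;          /* stands in for the agent state */
    const char *thingName; /* stands in for configuration stored at init */
} SketchAgent;

static SketchAgent sketchAgent;
static pthread_mutex_t sketchAgentLock = PTHREAD_MUTEX_INITIALIZER;

/* Returns true if this call initialized the agent. Returns false if the
 * agent was already running or the argument was invalid. */
bool sketchInit(const char *thingName)
{
    bool initialized = false;

    if (thingName == NULL) {
        return false;
    }

    pthread_mutex_lock(&sketchAgentLock);
    if (!sketchAgent.running) {        /* read + decision */
        sketchAgent.thingName = thingName;
        sketchAgent.running = true;    /* write, still under the lock */
        initialized = true;
    }
    pthread_mutex_unlock(&sketchAgentLock);

    return initialized;
}

/* Marks the agent as stopped, under the same lock. */
void sketchShutdown(void)
{
    pthread_mutex_lock(&sketchAgentLock);
    sketchAgent.running = false;
    sketchAgent.thingName = NULL;
    pthread_mutex_unlock(&sketchAgentLock);
}
```

With this shape, two tasks calling `sketchInit()` at the same time cannot both see the agent as stopped and both overwrite its configuration. Any code running in the event-processing task would need to take the same lock before touching the same fields.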
Other APIs that require this type of change are:
?? activateFn = otaAgent.pOtaInterface->pal.activate and then using that if not null.
setImageStateWithReason() is used.
OtaAgentEventSuspend message being received by the OTA_EventProcessingTask
OTA_Resume - stopped here - you get the idea...
OTA_SignalEvent - for the statistics and read of state - the stats should probably have their own lock
API that looks to be okay:
OTA_CheckForUpdate()
OTA_Err_strerror()
OTA_JobParse_strerror
OTA_PalStatus_strerror
OTA_OsStatus_strerror
As mentioned, I did not check any of the handlers that are registered to the network, but I assume they most likely have the same level of issue.
Host
To Reproduce
Expected behavior
See above. For every API call that uses or modifies the otaAgent.* internal construct, which is also used by other tasks, access to those fields is expected to be protected by a semaphore and/or mutex.
Screenshots
N/A
Wireshark logs
N/A
Additional context
N/A