Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] OTA API and OTA_EventProcessingTask is not task/thread safe when it comes to accessing common state. #465

Closed
phelter opened this issue Dec 14, 2022 · 20 comments
Labels
bug Something isn't working

Comments

@phelter
Copy link

phelter commented Dec 14, 2022

Describe the bug
The OTA API and the task that is expected to be used use common data values without synchronization between tasks/threads. The OTA implementation is NOT Thread/Task safe.

There is a gross error in the way portions of the otaAgent internal state is being read/modified/written. Portions of it are assumed to be atomic across all tasks/threads but there are no guarantees that this is the case.

There are 3 Potential tasks/threads where actions can be performed and are currently in contention:

  • Application task - executing the OTA_* api - eg: OTA_Shutdown(), OTA_GetState(), OTA_SignalEvent(), OTA_ActivateNewImage(), etc.
  • OTA_EventProcessingTask()
  • Network task (mqtt or http) executing the callbacks
  • Timer task - for timers - This one is okay because all of the timers used are sending Events via a queue to the EventProcessingTask.

For the state and or callbacks there is no synchronization barrier (eg a semaphore or mutex) of the otaAgent information when any of these three tasks are accessing the otaAgent common control block.

These values MUST be either specified as atomic OR consumed within a semaphore/mutex lock so that actions performed upon them by either a task calling the OTA_*() API functions or the task running OTA_EventProcessingTask() will not inadvertently overwrite the values - especially within code portions that have - read - decision - write

I'm only providing the examples pertaining to the API (App -> OTA_EventProcessingTask()) but there are most likely others between the Network registered callbacks and the OTA_EventProcessingTask() as well.

Eg: the OTA_Init()

This should have something along the lines of:

    if (otaAgent.lock == NULL)
    {
        otaAgent.lock = xSemaphoreCreateMutex();
        assert(otaAgent.lock != NULL);
    }
    BaseType_t semRet = xSemaphoreTake(otaAgent.lock, portMAX_DELAY);
    assert(pdTRUE == semRet);
    (void)semRet;
    // All reads and/or modifications of otaAgent and it's associated values.
    //  Lines - https://github.com/aws/ota-for-aws-iot-embedded-sdk/blob/c3bd5840979cadfe1f9505e13e49cccb87333650/source/ota.c#L3264-L3347

    semRet            = xSemaphoreGive(otaAgent.lock);
    assert(pdTRUE == semRet);

Other API's that require this type of change are:

  • OTA_Shutdown - requires local copy of state and then return outside of semaphore/mutex lock.
  • OTA_GetState - requires local copy of state and then return outside of semaphore/mutex lock.
  • OTA_GetStatistics - otherwise portions of the stats may not be correct relative to each other. - might suggest a separate lock for this.
  • OTA_ActivateNewImage - requires creating a local copy of the ?? activateFn = otaAgent.pOtaInterface->pal.activate and then using that if not null.
  • OTA_SetImageState - required when setImageStateWithReason() is used.
  • OTA_GetImageState - requires creating a local copy of the imageState within a lock.
  • OTA_Suspend - should move that code into the action performed by the OtaAgentEventSuspend message being received by the OTA_EventProcessingTask
  • OTA_Resume - stopped here - you get the idea...
  • OTA_SignalEvent - for the statisitcs and read of state - the stats should probably have their own lock

API that looks to be okay:

  • OTA_CheckForUpdate()
  • OTA_Err_strerror()
  • OTA_JobParse_strerror
  • OTA_PalStatus_strerror
  • OTA_OsStatus_strerror

As mentioned, did not check any of the handlers that are registered to the network - but assuming there are most likely the same level of issue here.

Host

  • Host OS: Linux - but this is ANY OS including FreeRTOS
  • Version: Ubuntu 18.04

To Reproduce

  • N/A - done by inspection, but Could reproduce by running this through Thread Sanitizer (clang) and discovering the errors.

Expected behavior

See Above - expected all API calls that use or modify otaAgent.* internal construct - which is used by other tasks, the access of those fields are protected by a semaphore and/or mutex.

Screenshots

N/A

Wireshark logs

N/A

Additional context

N/A

@phelter phelter added the bug Something isn't working label Dec 14, 2022
@shubnil
Copy link
Member

shubnil commented Dec 15, 2022

Hi, We are looking into the same and will get back soon.

@kstribrnAmzn
Copy link
Member

Thanks for this extremely detailed bug report @phelter. I spent a few hours digging through our code trying to prove thread safety to myself and I agree - its sorely lacking. The application task, processing loop, and callbacks could easily clash. The OTA library predates me a bit so I'd like to chat with a few people on my team who were here during its most recent iterations development to understand if I'm missing some information on these functions call patterns.

At this moment I'm in agreement with the expected behavior you've described.

@phelter
Copy link
Author

phelter commented Dec 21, 2022

@kstribrnAmzn - Thanks for acknowledging this issue.

Would it be possible to provide an estimate time for this to be corrected? So that I can plan accordingly.

Unfortunately, I don't believe this is the only FreeRTOS or Amazon IoT C-library to have this issue.
I've found the following Bug 570 in the FreeRTOS-Plus-TCP as well which is due to state not being protected across threads.

It might be prudent to look at and confirm the following don't have similar issues as well.

Since the FreeRTOS/coreJSON and FreeRTOS/backoffAlgorithm don't have any task/thread based expectations, I believe they are okay.

@kstribrnAmzn
Copy link
Member

Sorry for the delay in responding to you @phelter. Checking our other libraries makes a lot of sense. We want consistency across the FreeRTOS software suite. I'll create stories in our backlog queue so that we don't loose track of this work.

As for an estimate - I cannot provide one. Restructuring OTA is high on our priority list however things have shifted a bit recently. I'm not sure exactly when this work will be started and by how many. Any date I provide would be inaccurate and likely out of date within weeks.

@kstribrnAmzn
Copy link
Member

I've created both issues on the Github repositories mentioned requesting help wanted (to get faster support from the community) as well as an internal tracking ticket for the FreeRTOS team.

@eldruid
Copy link

eldruid commented Feb 21, 2023

I am just new to devoloping IoT with AWS. is there anything I can assist with. volunteer work? please message me. thanks in advance. i look forward to contributing in any way. cheers 497115@protonmail.com

@kstribrnAmzn
Copy link
Member

@eldruid Thanks for reaching out! Between all the repositories my team own (FreeRTOS), there is plenty of work to do.

What work or libraries interest you?

In the meantime I'm reaching out to others on my team to see if we have any work which would be doable for someone not as familiar with the libraries.This issue you commented on could be a good place to start. There is essentially 2 parts to the work as of right now.

The first part is auditing our various libraries, as @phelter had mentioned, to ensure there are no other variables which can be accessed in a thread unsafe manner. Basically, you'd want to look for structures/objects which contain library state and are used by both the library and optionally the user. The issue with these is that the user could be accessing the structure/object from another FreeRTOS task (aka thread) while the library task/thread is updating the values, causing a race condition.

The second part of the work is implementing a mutex preventing the race condition. In the case of this issue, the mutex would block access in the case that the object is being updated. In our case, and since this library is not tied to a specific OS, may require further rework. A set of getter methods for the object which requires the OTA task to synchronize could be a better solution. I will say I haven't thought deeply about this so I'm not sure either of these are the right solutions.

@kstribrnAmzn
Copy link
Member

Not to discourage you from working on this issue - as it is important - but you may want to take on a slightly easier issue to start with like FreeRTOS/FreeRTOS-Kernel#617.

@kstribrnAmzn
Copy link
Member

I had an enlightening conversation with @AniruddhaKanhere on the thread safety aspect of this issue. From what I've been able to find, OTA is largely threadsafe but ultimately does not claim any thread safety. For this reason, I don't think I'd call this a bug like I would in other libraries, like coreMqtt-Agent for example. which promise thread safety.

However, I think there is still value in auditing this repository in case there are any serious threadsafety issues. From a cursory glance through the APIs you've highlighted, the common issue is a reader-writer problem. While this a problem, supplying a list of best practices for this library (I'll be creating a backlog item on our side) would mitigate most of these. The audit would really be to identify any place where the outside task (aka non-ota-lib) can influence the update or modification of the otaAgent structure. If any code segment satisfied this, I'd consider this a more serious bug as corruption could occur.

@phelter
Copy link
Author

phelter commented Feb 24, 2023

I had an enlightening conversation with @AniruddhaKanhere on the thread safety aspect of this issue. From what I've been able to find, OTA is largely threadsafe but ultimately does not claim any thread safety. For this reason, I don't think I'd call this a bug like I would in other libraries, like coreMqtt-Agent for example. which promise thread safety.

@kstribrnAmzn that is a very unfortunate stance you are taking on this particular issue.

First of all the API related to this is implying thread safety - the examples suggest calling OTA_EventProcessingTask() within it's own thread independent of all other API function calls, and those other API calls can be called outside of that particular thread. So to say it does not claim thread safety is an incorrect statement. All of the demo code eg: ota_demo_core_mqtt.c uses it as if the following function calls ARE thread safe because they are called from a different thread from EventProcessingTask:

So one of two things are wrong. The demos are grossly misrepresenting the way this API is to be consumed and are broken, OR the API is implied thread safe. Pick your poison - but at least one is true. If one were to assume it was not thread safe - one could only call any of the API functions inside the OtaAppCallback which is not explicitly stated anywhere. And that would also imply that the OTAJobEvent_t enumeration is missing Suspend/Resume/GetStatistics events - And there would be no way to query outside of the running OTA task the state of OTA - which renders this component quite limited from a user integration perspective.

Also, when there are OTA based callbacks being provided to other services - such as a timer (in a different thread) or a different module (such as Mqtt which typically is operating in another thread), or an ISR (which is in interrupt context and not in application/code) - all of that particular code MUST ensure the thread-safety of the module's internal state - in this case otaAgent. Otherwise, a user integrating this particular component is essentially setting themselves up for a lot of thread debugging and may as well write it themselves to mitigate all of the issues you do not consider a bug. Identification of the API being thread-safe is irrelevant to these types of changes.

Instead of trying to justify why it's not a high priority bug or a bug at all, I humbly request that the team spends the efforts on fixing it to make it thread safe and add in user unit tests (eg the example code operating with a happy path OTA update) are run-able using Thread/Address Sanitizer builds to ensure future changes do not revert and miss this type of error.

@dachalco
Copy link
Contributor

Hey @phelter

I agree the current version of the OTA library has a number of shortcomings. Forgoing labels, there should be properly implemented thread safety for a stack like OTA. In practice, it's not far fetched that various subsystem threads may share an OTA thread, in an independent manner, to download files specific to their subsystem.

OTA has been under a watchful eye, and we're planning on a major revision, but that's of course triaged/scheduled with consideration for other goals. Thank you for the feedback, it helps to bump OTA priority.

@Skptak
Copy link
Member

Skptak commented Mar 7, 2023

Hey again @phelter,
I just wanted to reach out and let you know that the team has been talking about this issue you've brought up quite a bit. After another lengthy conversation about it we came to the conclusion that you're correct that this is a bug, and not a feature request. As such we're going to be working on getting this fixed as soon as possible.
We'd all like to thank you for doing such an incredible write up of the issue, and doing such a deep dive into our code base so that you could provide comparisons against other libraries we have!

@phelter
Copy link
Author

phelter commented Mar 7, 2023

Thanks @Skptak ,

I understand it's a big change to incorporate that level of atomic'ness or memory barrier access into any of the load/stores of internal data, and the verification to ensure the correct mutex is always locked when those values are accessed (read/modified/written). I appreciate the effort on fixing this.

@aggarg
Copy link
Member

aggarg commented Aug 3, 2023

Thanks for your patience. Having analyzed this further, we agree accessing a single structure from multiple threads is bad practice, but we haven’t found an occurrence within the library where it could be catastrophic and crash or brick a device. Please update the ticket again, with examples, if you disagree.

This is not to say that we are not going to address these problems. In fact we took this as the opportunity to provide a more modular and flexible library. We are just trying to ensure that the people using the library today do not panic and can move to the modular version at their own pace.

@phelter
Copy link
Author

phelter commented Aug 18, 2023

Yes it is not only bad practice but can lead to incorrect data being used and/or settings/configuration being overwritten.
Sure it may not lead to a crash but it means the user of the API cannot guarantee that under all circumstances the functions will operate as expected or described in the documentation under the primary use case (a multi-threaded/or task based environment for Embedded processors).

My preference would be:

  1. Add an OTA semaphore lock and use that to lock other threads from accessing the shared single structure and release when complete.
  • For read-only access - you can locally copy everything used and then release right away.
  • For read/write/modify (load/store) access - you will need to unlock after all changes are made.
    At the very least this will mitigate the issues above. Moving API calls into using Events could be done at a later time or never.
  1. If you are not going to do 1, then the only recourse I see for existing customers is to mitigate the chances of this happening. This can be done by:
  • State it is highly recommended to have only ONE other thread calling the OTA API functions - this will mitigate the chances of contention but won't remove them.
  • Add caveats to all of your API's that these may fail to provide the appropriate results and it is up to the USER of the library to perform a RETRY mechanism for the various API calls (mentioned above) to ensure they work.
  • Note that these are only mitigation to reduce the chances of the API calls not working as expected and won't fix anything but reduce the likelihood of the issue happening.

Given the speed at which this serious issue is being resolved I don't believe I'll be using/consuming this or other related Amazon libraries. I don't see anything wrong with the API itself so replacing it with a different library means plenty of more work for all integrators of this library and no work for the owner (Amazon) of this particular library. Please Choose #1.

If the users of this library are building demo and maker products then I'd agree there's no reason to panic - the demo / maker project will work most of the time.

If the user of this library is releasing a product with this - then expect customer calls that the OTA functionality doesn't work all the time - OR the operation of it isn't robust from one build to the next or under various loads or operational changes.

@RichardBarry
Copy link

Hi phelter – thanks for raising this issue and providing your feedback. I was satisfied there was no immediate risk to users following our examples after looking at the implementation myself, but your points are valid. Of course, not everybody uses the code in the same way. So then to the question of what to do about it. I wasn’t in favor of the mutex approach as 1) it required more analysis to ensure deadlock avoidance, and 2) It’s fixing the symptom rather than the cause. Additionally, I wanted to request some rework of the agent to make it less opinionated on how it can be used, and what it can connect to, anyway. That brings it into line with our current thinking on having no dependencies on anything other than the C library – making it more widely applicable, and effectively moving it from the right side to the left side of this diagram. Once complete, the rework will allow connection to any OTA source using any protocol, and if you want to get into the weeds of the library and take responsibility for code changes, then also change the state machine. Hence, given what I said above about no immediate risk to those following our examples, it was decided to put resources into the reworked version. i.e. rather than fixing the symptom (the race conditions on shared data) we focus on the cause (the design) so that we eliminate shared data altogether.

@aggarg
Copy link
Member

aggarg commented Dec 5, 2023

Now that the new OTA library is GA and an example showing how to use it is available, I am closing this issue. If you want to discuss further, please re-open or create a thread on FreeRTOS Forums.

Thanks.

@aggarg aggarg closed this as completed Dec 5, 2023
@phelter
Copy link
Author

phelter commented Dec 5, 2023

Thank you @aggarg for releasing a potential option to this particular library.

My apologies, but how does the release of a new library and demo using that library resolve the issue in this particular repo.

The repos you have linked to are lab projects that support different API's to this one. and do not support:

  • Suspend/Resume
  • GetStatistics
  • ActivateNewImage / SetImageState

Please leave this open until one of the following has occurred:

  1. This repo has a Notice of Deprecation at the very beginning of the main readme and points to the same above ota libraries you mentioned.
  2. This library has been modified to use the new OTA library you mentioned and all integration work with the same or modified API is operational.

There is also an implication of code quality and a level of security to this library in the Readme which is great, but in my opinion is false advertising until the above issues have been resolved.

I am unable to re-open this issue as you described - please re-open it on my behalf.

Thank you.

@aggarg aggarg reopened this Dec 6, 2023
@kstribrnAmzn
Copy link
Member

kstribrnAmzn commented Dec 6, 2023

You are right that the release of our new coreOTA library and the corresponding demo examples do not fix the architectural issue in this repository. I will create a PR updating the Readme to provide the deprecation notice - as you mentioned.

The new coreOTA is a collection of composable libraries:

  1. Jobs Library which also contains additional support for AWS IoT OTA jobs
  2. MQTT Streaming Library for file block downloads over MQTT
  3. coreHTTP for file block downloads over HTTP

We have provided the following 2 examples to help new and existing users transition to using coreOTA -

  1. Simple OTA Orchestrator
  2. OTA Agent Orchestrator - This one is written to ease the transition of applications using this SDK.

We will leave this issue open until the PR to update the README is merged.

@kstribrnAmzn
Copy link
Member

I've updated this library with a notice of deprecation pointing to the new core OTA library materials. Closing this out as this satisfies your first ask.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

8 participants