-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
drivers: dma: intel_adsp_gpdma: fix issue with stop and PM refcounts #65829
drivers: dma: intel_adsp_gpdma: fix issue with stop and PM refcounts #65829
Conversation
I remember we've discussed this refcounting a few times with @tmleman , @ceolin and @teburd . Not sure if we've missed some patch, or SOF has changed its behaviour, but I now hit this 100% of the time in a very simple case on a newer platform (see linked bug). The duplicate call to dma_stop() might be a bit weird, but as far as I can tell, completely valid and allowed use of the DMA API, so the intel_adsp_gpdma driver should handle this case (we asked for this to begin with to avoid tracking DMA state on SOF side). What remains a mystery is why this only happens on one platform, but this seems to be the clear cause. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kv2019i do we need a similar fix for HDA DMA ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is that the runtime counting is flawed in the dw_dma_start/stop:
dw_dma_start() -> pm_device_runtime_get()
dw_dma_start() -> NOP
dw_dma_stop() -> pm_device_runtime_put()
dw_dma_stop() -> oops
??
given that both channel finishes before the stop is called?
drivers/dma/dma_intel_adsp_gpdma.c
Outdated
struct intel_adsp_gpdma_data *dev_data = dev->data; | ||
int ret; | ||
|
||
if (!dev_data->enabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in case of CONFIG_PM_DEVICE=n
we will never stop the DMA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ujfalusi Right, marking current version with DNM. This needs to be fixed.
drivers/dma/dma_intel_adsp_gpdma.c
Outdated
@@ -44,6 +44,7 @@ LOG_MODULE_REGISTER(dma_intel_adsp_gpdma); | |||
/* Device run time data */ | |||
struct intel_adsp_gpdma_data { | |||
struct dw_dma_dev_data dw_data; | |||
bool enabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguably this is a bug in the DMA client side if there is an unbalance between the start and stop.
The dw_dma_start()
and dw_dma_stop()
rely on correct user behavior as it is handing the pm_device_runtime get/put
Right, the bug might be in the dw_dma_start/stop() balancing?
is this fixes the issue?
diff --git a/drivers/dma/dma_dw_common.c b/drivers/dma/dma_dw_common.c
index bb000298385f..a893672d0922 100644
--- a/drivers/dma/dma_dw_common.c
+++ b/drivers/dma/dma_dw_common.c
@@ -446,7 +446,7 @@ int dw_dma_start(const struct device *dev, uint32_t channel)
/* validate channel */
if (channel >= DW_CHAN_COUNT) {
ret = -EINVAL;
- goto out;
+ goto pm_get;
}
if (dw_dma_is_enabled(dev, channel)) {
@@ -535,6 +535,8 @@ int dw_dma_start(const struct device *dev, uint32_t channel)
/* enable the channel */
dw_write(dev_cfg->base, DW_DMA_CHAN_EN, DW_CHAN_UNMASK(channel));
+
+pm_get:
ret = pm_device_runtime_get(dev);
out:
@@ -554,7 +556,7 @@ int dw_dma_stop(const struct device *dev, uint32_t channel)
}
if (!dw_dma_is_enabled(dev, channel) && chan_data->state != DW_DMA_SUSPENDED) {
- goto out;
+ goto pm_put;
}
#ifdef CONFIG_DMA_DW_HW_LLI
@@ -605,6 +607,8 @@ int dw_dma_stop(const struct device *dev, uint32_t channel)
}
#endif
chan_data->state = DW_DMA_IDLE;
+
+pm_put:
ret = pm_device_runtime_put(dev);
out:
return ret;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ujfalusi This alone won't fix as dw_dma_is_enabled() will still get called on a powered down device and lead to crash.
But ack, the idea seems to be to inc/dec refcount when a DMA channel is actually enabled. Unfortunately, this won't work as only tracking state we have is the hw registers and we can't safely query them to see whether we shoudl decrease or not.
This seems a bit complicated indeed. It would seem safer that we get a reference whenever we talk to hardware, and that is not done here, which does not seem right. But to allow multiple start/stop, we need some clean way to account for usage. That will potentially mean powering up the hardware just to do a duplicate stop, but this would probably be still cleaner code to read and maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, yes, my diff is not correct for the reason that multiple start w/o stop is a possibility or even multiple start w/o a single stop (one shot block copies for example).
But what about using pm_device_state_get()
and PM_DEVICE_STATE_ACTIVE instead of the local flag?
That would work even if CONFIG_PM_DEVICE
is not enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and it might be possible to have it in the generic code in drivers/dma/dma_dw_common.c
, not sure about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marking as DNM, we need to be correct when PM is disabled.
For @ujfalusi , the design to support multiple starts and stops is documented here https://github.com/zephyrproject-rtos/zephyr/blob/main/include/zephyr/drivers/dma.h#L454 .. whatever application (SOF) does, this driver needs to comply with Zephyr DMA interface.
drivers/dma/dma_intel_adsp_gpdma.c
Outdated
struct intel_adsp_gpdma_data *dev_data = dev->data; | ||
int ret; | ||
|
||
if (!dev_data->enabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ujfalusi Right, marking current version with DNM. This needs to be fixed.
drivers/dma/dma_intel_adsp_gpdma.c
Outdated
@@ -44,6 +44,7 @@ LOG_MODULE_REGISTER(dma_intel_adsp_gpdma); | |||
/* Device run time data */ | |||
struct intel_adsp_gpdma_data { | |||
struct dw_dma_dev_data dw_data; | |||
bool enabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ujfalusi This alone won't fix as dw_dma_is_enabled() will still get called on a powered down device and lead to crash.
But ack, the idea seems to be to inc/dec refcount when a DMA channel is actually enabled. Unfortunately, this won't work as only tracking state we have is the hw registers and we can't safely query them to see whether we shoudl decrease or not.
This seems a bit complicated indeed. It would seem safer that we get a reference whenever we talk to hardware, and that is not done here, which does not seem right. But to allow multiple start/stop, we need some clean way to account for usage. That will potentially mean powering up the hardware just to do a duplicate stop, but this would probably be still cleaner code to read and maintain.
I'd note that trying to keep a reference count in the driver continues to be a complicated endeavor. It seems to be a consistent source of bugs. This is especially true given the multiple start/stop (unbalanced calls) but it isn't the only issue. Would it be better to reference count in channel acquisition and release instead as those are definitively balanced calls? Or push the reference counting up a level where the state of the DMA is better understood? |
18cb200
to
b32afb7
Compare
V2:
|
@teburd wrote:
I gave my thumbs up to your comment, as I'm on the same boat this has been source of bugs. But at the same time, not sure how to actually do this. The acquire/release calls are currently too coarse on SOF side (when you pause audio playback, DMA channel is not released and would not be powered down). So if we have power gating (in DMA drivers) in the start/stop/start/stop cycles, this has a concrete benefit for applications like SOF. So it seems options are:
The current driver basicly takes a PM ref whenever a new channel is enabled, and puts the ref when a channel is released. In theory, this should work. One problem in the implementation is that this relies on HW registers to figure out whether a channel is already enabled. This PR helps with that, although code is not that easy to follow. So yeah, suggestions welcome. |
Forcing balanced start/stop calls I don't think provides a general solution over DMA as DMAs may be used (outside of sof) with one shot or linked list transfers that have completion at some point in the future. Meaning a dma_start() may imply a stop at some point in the future without software interaction either from completion or error. Without interrupts it wouldn't be possible to generalize the get/put operations in that case, and even then I think it'd make things more complicated than needed here. I'm saying that the power state of a DMA channel is, given the above, is basically unknowable inside the driver in a consistent manner. I'd really like a generalized idea of what PM should look like for DMA and document it as the expectation for DMA drivers. To me, this means pushing pm refcounts up a level. If we can't rely on channel get/release then its up to the caller to directly call pm ref counts and the driver to intelligently deal with pm callbacks as would be expected, with minimal logic involved. I think that's ok, that's basically the same expectation sensors have as well in Zephyr. |
The start/stop managed pm_runtime indeed is not a correct way for a DMA as you mentioned the oneshot copies which embed the stop without a formal stop call. I would argue that if a client can decide to request once and use the channel or do a request/transfer/release if it does not need the DMA too frequently (or do that in some more clever way). it is really up to the driver's use of the DMA to pick a fitting use. |
Btw, @ujfalusi @teburd please do review this PR still. Even if we rework the runtime-PM approach, in the very short term, we have a DSP crash in 100% of cases on certain product configurations, so the fix is rather urgently needed. We can still conclude that rework must be done first, not saying that, but some conclusion on next steps is needed. |
/* don't touch hw registers if not powered up */ | ||
ret = pm_device_state_get(dev, &pm_state); | ||
if (ret != -ENOSYS && pm_state != PM_DEVICE_STATE_ACTIVE) | ||
goto out; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if pm_device_state_get() returns with error then we assume that the device is active?
ret = pm_device_state_get(dev, &pm_state);
if (ret) { /* if (ret == -ENOSYS) { */
/* Assume the device is active on error */
pm_state = PM_DEVICE_STATE_ACTIVE;
}
if (pm_state != PM_DEVICE_STATE_ACTIVE ||
(!dw_dma_is_enabled(dev, channel) && chan_data->state != DW_DMA_SUSPENDED)) {
goto out;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but this begs for a question: why intel_adsp_gpdma_config()
does not need this sort of treatment? It also accesses registers and doing it without runtime pm calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ujfalusi @teburd I've run tests today and it seems in general accessing the config registers is not a problem. I've verified CPA=0 and I can still access registers. Even accessing the CHAN_EN register works in some configurations of the hw but not all. At a minimum I need to update the commit descriptions.
I'm OK with this fix but want to propose different approach. int dma_start(dev, channel)
{
if (atomic_test_bit(data->ctx.atomi, channel))
return 0;
if (data->ctx.atomic == 0)
pm_device_runtime_get(dev);
// other stuff
atomic_set_bit(data->ctx.atomi, channel);
return 0;
}
int dma_start(dev, channel)
{
if (!atomic_test_bit(data->ctx.atomi, channel))
return 0;
// other stuff
atomic_clear_bit(data->ctx.atomi, channel);
if (data->ctx.atomic == 0)
pm_device_runtime_put(dev);
return 0;
} This atomic value is already in driver data and I didn't see any use of it in driver. This code will look and work the same for GP DMA and HDA DMA. |
@teburd @tmleman I looked at a few options:
All in all, this is more complicated than I thought (and probably many things @tmleman you already checked with the previous modification to start/stop). I think I'll update the current PR as it is now so we can merge at least a version of the fix. |
b32afb7
to
830a281
Compare
V3 uploaded:
|
Someone else was needing this as well I think for an agilex part so I have #65390 that I believe would solve part of that.
Config is viewed as a retained state of the driver, once configured it always remains configured. At least with the current state diagram in the DMA zephyr docs that I concocted.
👍 |
@teburd wrote:
Ok, that looks just what the doctor ordered. I think we should try to move at least the intel gpdma variant on top of that and grab the runtime ref on channel allocate. But beyond this PR.
So indeed then the only practical option within the current framework is to get the runtime reference when acquiring the channel, and put it on channel release. Then all config-start-stop-config-start-start-stop actions are done with a the runtime ref held and tracking is easier to do (and follow). I still need to investigate a bit more what are the side-impacts of holding the power-domain ref for longer. On these targets, the power domain driver is actually setting/clearing a constraint (power-off-allowed/prohibited versus directly powering up/down the hardware). |
830a281
to
7493061
Compare
The DMA interface allows start and stop to be called multiple times and driver should ensure nothing bad happens if the calls are not balanced. Fix an issue where after a start-stop sequence the DMA would be powered down, and then a subsequent stop would result in a crash as driver accesses registers of a powered down hardware block. Fix the issue by handling stop without actually reading the hw registers to check channel status. Link: thesofproject/sof#8503 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
7493061
to
3f68025
Compare
The DMA interface allows start and stop to be called multiple times and driver should ensure nothing bad happens if the calls are not balanced.
Fix an issue with following sequence:
Fix the issue by tracking hardware status in the driver, and blocking the stop if hardware is already powered down. Multiple starts are already handled in dw_dma_start().
Link: thesofproject/sof#8503