Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drivers: dma: intel_adsp_gpdma: fix issue with stop and PM refcounts #65829

Merged

Conversation

kv2019i
Copy link
Collaborator

@kv2019i kv2019i commented Nov 27, 2023

The DMA interface allows start and stop to be called multiple times and driver should ensure nothing bad happens if the calls are not balanced.

Fix an issue with following sequence:

  • dma_start -> device powered up
  • dma_stop -> device powered down
  • dma_stop -> call to dw_dma_is_enabled() and crash

Fix the issue by tracking hardware status in the driver, and blocking the stop if hardware is already powered down. Multiple starts are already handled in dw_dma_start().

Link: thesofproject/sof#8503

@kv2019i kv2019i added the platform: Intel ADSP Intel Audio platforms label Nov 27, 2023
@kv2019i
Copy link
Collaborator Author

kv2019i commented Nov 27, 2023

I remember we've discussed this refcounting a few times with @tmleman , @ceolin and @teburd . Not sure if we've missed some patch, or SOF has changed its behaviour, but I now hit this 100% of the time in a very simple case on a newer platform (see linked bug).

The duplicate call to dma_stop() might be a bit weird, but as far as I can tell, completely valid and allowed use of the DMA API, so the intel_adsp_gpdma driver should handle this case (we asked for this to begin with to avoid tracking DMA state on SOF side).

What remains a mystery is why this only happens on one platform, but this seems to be the clear cause.

@zephyrbot zephyrbot added the area: DMA Direct Memory Access label Nov 27, 2023
lgirdwood
lgirdwood previously approved these changes Nov 27, 2023
Copy link
Collaborator

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kv2019i do we need a similar fix for HDA DMA ?

Copy link
Collaborator

@ujfalusi ujfalusi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is that the runtime counting is flawed in the dw_dma_start/stop:
dw_dma_start() -> pm_device_runtime_get()
dw_dma_start() -> NOP
dw_dma_stop() -> pm_device_runtime_put()
dw_dma_stop() -> oops
??

given that both channel finishes before the stop is called?

struct intel_adsp_gpdma_data *dev_data = dev->data;
int ret;

if (!dev_data->enabled) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case of CONFIG_PM_DEVICE=n we will never stop the DMA?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi Right, marking current version with DNM. This needs to be fixed.

@@ -44,6 +44,7 @@ LOG_MODULE_REGISTER(dma_intel_adsp_gpdma);
/* Device run time data */
struct intel_adsp_gpdma_data {
struct dw_dma_dev_data dw_data;
bool enabled;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably this is a bug in the DMA client side if there is an unbalance between the start and stop.
The dw_dma_start() and dw_dma_stop() rely on correct user behavior as it is handing the pm_device_runtime get/put

Right, the bug might be in the dw_dma_start/stop() balancing?
is this fixes the issue?

diff --git a/drivers/dma/dma_dw_common.c b/drivers/dma/dma_dw_common.c
index bb000298385f..a893672d0922 100644
--- a/drivers/dma/dma_dw_common.c
+++ b/drivers/dma/dma_dw_common.c
@@ -446,7 +446,7 @@ int dw_dma_start(const struct device *dev, uint32_t channel)
 	/* validate channel */
 	if (channel >= DW_CHAN_COUNT) {
 		ret = -EINVAL;
-		goto out;
+		goto pm_get;
 	}
 
 	if (dw_dma_is_enabled(dev, channel)) {
@@ -535,6 +535,8 @@ int dw_dma_start(const struct device *dev, uint32_t channel)
 
 	/* enable the channel */
 	dw_write(dev_cfg->base, DW_DMA_CHAN_EN, DW_CHAN_UNMASK(channel));
+
+pm_get:
 	ret = pm_device_runtime_get(dev);
 
 out:
@@ -554,7 +556,7 @@ int dw_dma_stop(const struct device *dev, uint32_t channel)
 	}
 
 	if (!dw_dma_is_enabled(dev, channel) && chan_data->state != DW_DMA_SUSPENDED) {
-		goto out;
+		goto pm_put;
 	}
 
 #ifdef CONFIG_DMA_DW_HW_LLI
@@ -605,6 +607,8 @@ int dw_dma_stop(const struct device *dev, uint32_t channel)
 	}
 #endif
 	chan_data->state = DW_DMA_IDLE;
+
+pm_put:
 	ret = pm_device_runtime_put(dev);
 out:
 	return ret;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi This alone won't fix as dw_dma_is_enabled() will still get called on a powered down device and lead to crash.

But ack, the idea seems to be to inc/dec refcount when a DMA channel is actually enabled. Unfortunately, this won't work as only tracking state we have is the hw registers and we can't safely query them to see whether we shoudl decrease or not.

This seems a bit complicated indeed. It would seem safer that we get a reference whenever we talk to hardware, and that is not done here, which does not seem right. But to allow multiple start/stop, we need some clean way to account for usage. That will potentially mean powering up the hardware just to do a duplicate stop, but this would probably be still cleaner code to read and maintain.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yes, my diff is not correct for the reason that multiple start w/o stop is a possibility or even multiple start w/o a single stop (one shot block copies for example).

But what about using pm_device_state_get() and PM_DEVICE_STATE_ACTIVE instead of the local flag?
That would work even if CONFIG_PM_DEVICE is not enabled?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and it might be possible to have it in the generic code in drivers/dma/dma_dw_common.c, not sure about that.

@kv2019i kv2019i changed the title drivers: dma: intel_adsp_gpdma: fix issue with stop and PM refcounts [DNM] drivers: dma: intel_adsp_gpdma: fix issue with stop and PM refcounts Nov 28, 2023
Copy link
Collaborator Author

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking as DNM, we need to be correct when PM is disabled.

For @ujfalusi , the design to support multiple starts and stops is documented here https://github.com/zephyrproject-rtos/zephyr/blob/main/include/zephyr/drivers/dma.h#L454 .. whatever application (SOF) does, this driver needs to comply with Zephyr DMA interface.

struct intel_adsp_gpdma_data *dev_data = dev->data;
int ret;

if (!dev_data->enabled) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi Right, marking current version with DNM. This needs to be fixed.

@@ -44,6 +44,7 @@ LOG_MODULE_REGISTER(dma_intel_adsp_gpdma);
/* Device run time data */
struct intel_adsp_gpdma_data {
struct dw_dma_dev_data dw_data;
bool enabled;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi This alone won't fix as dw_dma_is_enabled() will still get called on a powered down device and lead to crash.

But ack, the idea seems to be to inc/dec refcount when a DMA channel is actually enabled. Unfortunately, this won't work as only tracking state we have is the hw registers and we can't safely query them to see whether we shoudl decrease or not.

This seems a bit complicated indeed. It would seem safer that we get a reference whenever we talk to hardware, and that is not done here, which does not seem right. But to allow multiple start/stop, we need some clean way to account for usage. That will potentially mean powering up the hardware just to do a duplicate stop, but this would probably be still cleaner code to read and maintain.

@teburd
Copy link
Collaborator

teburd commented Nov 28, 2023

I'd note that trying to keep a reference count in the driver continues to be a complicated endeavor. It seems to be a consistent source of bugs.

This is especially true given the multiple start/stop (unbalanced calls) but it isn't the only issue.

Would it be better to reference count in channel acquisition and release instead as those are definitively balanced calls? Or push the reference counting up a level where the state of the DMA is better understood?

@kv2019i kv2019i changed the title [DNM] drivers: dma: intel_adsp_gpdma: fix issue with stop and PM refcounts drivers: dma: intel_adsp_gpdma: fix issue with stop and PM refcounts Nov 28, 2023
@kv2019i
Copy link
Collaborator Author

kv2019i commented Nov 28, 2023

V2:

  • I confirmed this flow of events happens on other ACE platforms, but memory exception for accessing powered-down device does NOT happen on all configurations -> this explains why this has not been seen before
  • implemented simpler logic proposed by @ujfalusi to use pm_device_state_get() to avoid the specific case of trying to access device registers when powered off
  • tested on system that triggers [BUG][ARL] DSP panic when freeing dmic capture pipeline thesofproject/sof#8503 and verified test passes with this PR

@kv2019i
Copy link
Collaborator Author

kv2019i commented Nov 28, 2023

@teburd wrote:

Would it be better to reference count in channel acquisition and release instead as those are definitively balanced calls? Or push > the reference counting up a level where the state of the DMA is better understood?

I gave my thumbs up to your comment, as I'm on the same boat this has been source of bugs. But at the same time, not sure how to actually do this. The acquire/release calls are currently too coarse on SOF side (when you pause audio playback, DMA channel is not released and would not be powered down). So if we have power gating (in DMA drivers) in the start/stop/start/stop cycles, this has a concrete benefit for applications like SOF. So it seems options are:

  • drop support for power-gating at start/stop granularity (application and driver change needed)
  • we go back to balanced start/stops and push book-keeping to app layer (app and driver changes needed)
  • improve current driver implementation

The current driver basicly takes a PM ref whenever a new channel is enabled, and puts the ref when a channel is released. In theory, this should work. One problem in the implementation is that this relies on HW registers to figure out whether a channel is already enabled. This PR helps with that, although code is not that easy to follow.

So yeah, suggestions welcome.

@teburd
Copy link
Collaborator

teburd commented Nov 28, 2023

@teburd wrote:

Would it be better to reference count in channel acquisition and release instead as those are definitively balanced calls? Or push > the reference counting up a level where the state of the DMA is better understood?

I gave my thumbs up to your comment, as I'm on the same boat this has been source of bugs. But at the same time, not sure how to actually do this. The acquire/release calls are currently too coarse on SOF side (when you pause audio playback, DMA channel is not released and would not be powered down). So if we have power gating (in DMA drivers) in the start/stop/start/stop cycles, this has a concrete benefit for applications like SOF. So it seems options are:

* drop support for power-gating at start/stop granularity (application and driver change needed)

* we go back to balanced start/stops and push book-keeping to app layer (app and driver changes needed)

* improve current driver implementation

The current driver basicly takes a PM ref whenever a new channel is enabled, and puts the ref when a channel is released. In theory, this should work. One problem in the implementation is that this relies on HW registers to figure out whether a channel is already enabled. This PR helps with that, although code is not that easy to follow.

So yeah, suggestions welcome.

Forcing balanced start/stop calls I don't think provides a general solution over DMA as DMAs may be used (outside of sof) with one shot or linked list transfers that have completion at some point in the future. Meaning a dma_start() may imply a stop at some point in the future without software interaction either from completion or error. Without interrupts it wouldn't be possible to generalize the get/put operations in that case, and even then I think it'd make things more complicated than needed here.

I'm saying that the power state of a DMA channel is, given the above, is basically unknowable inside the driver in a consistent manner.

I'd really like a generalized idea of what PM should look like for DMA and document it as the expectation for DMA drivers. To me, this means pushing pm refcounts up a level. If we can't rely on channel get/release then its up to the caller to directly call pm ref counts and the driver to intelligently deal with pm callbacks as would be expected, with minimal logic involved. I think that's ok, that's basically the same expectation sensors have as well in Zephyr.

@ujfalusi
Copy link
Collaborator

Forcing balanced start/stop calls I don't think provides a general solution over DMA as DMAs may be used (outside of sof) with one shot or linked list transfers that have completion at some point in the future. Meaning a dma_start() may imply a stop at some point in the future without software interaction either from completion or error. Without interrupts it wouldn't be possible to generalize the get/put operations in that case, and even then I think it'd make things more complicated than needed here.

I'm saying that the power state of a DMA channel is, given the above, is basically unknowable inside the driver in a consistent manner.

I'd really like a generalized idea of what PM should look like for DMA and document it as the expectation for DMA drivers. To me, this means pushing pm refcounts up a level. If we can't rely on channel get/release then its up to the caller to directly call pm ref counts and the driver to intelligently deal with pm callbacks as would be expected, with minimal logic involved. I think that's ok, that's basically the same expectation sensors have as well in Zephyr.

The start/stop managed pm_runtime indeed is not a correct way for a DMA as you mentioned the oneshot copies which embed the stop without a formal stop call.
Imho for a DMA (channel) should be kept powered between the the time it has been requested and released otherwise the client must assume that the DMA is powered down outside of start/stop or start/completion and it might not restore context of a set configuration which is put there for a bang-oneshot-copy.
Starting a DMA copy might or might not involve a DMA power up, reconfigure before the transfer starts.

I would argue that if a client can decide to request once and use the channel or do a request/transfer/release if it does not need the DMA too frequently (or do that in some more clever way).

it is really up to the driver's use of the DMA to pick a fitting use.

@kv2019i
Copy link
Collaborator Author

kv2019i commented Nov 29, 2023

Btw, @ujfalusi @teburd please do review this PR still. Even if we rework the runtime-PM approach, in the very short term, we have a DSP crash in 100% of cases on certain product configurations, so the fix is rather urgently needed. We can still conclude that rework must be done first, not saying that, but some conclusion on next steps is needed.

/* don't touch hw registers if not powered up */
ret = pm_device_state_get(dev, &pm_state);
if (ret != -ENOSYS && pm_state != PM_DEVICE_STATE_ACTIVE)
goto out;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if pm_device_state_get() returns with error then we assume that the device is active?

	ret = pm_device_state_get(dev, &pm_state);
	if (ret) { /* if (ret == -ENOSYS) { */
		/* Assume the device is active on error */
		pm_state = PM_DEVICE_STATE_ACTIVE;
	}

	if (pm_state != PM_DEVICE_STATE_ACTIVE ||
	    (!dw_dma_is_enabled(dev, channel) && chan_data->state != DW_DMA_SUSPENDED)) {
		goto out;
	}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this begs for a question: why intel_adsp_gpdma_config() does not need this sort of treatment? It also accesses registers and doing it without runtime pm calls.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi @teburd I've run tests today and it seems in general accessing the config registers is not a problem. I've verified CPA=0 and I can still access registers. Even accessing the CHAN_EN register works in some configurations of the hw but not all. At a minimum I need to update the commit descriptions.

@tmleman
Copy link
Collaborator

tmleman commented Nov 29, 2023

I'm OK with this fix but want to propose different approach.

int dma_start(dev, channel)
{
	if (atomic_test_bit(data->ctx.atomi, channel))
		return 0;

	if (data->ctx.atomic == 0)
		pm_device_runtime_get(dev);
	// other stuff
	atomic_set_bit(data->ctx.atomi, channel);
	return 0;
}

int dma_start(dev, channel)
{
	if (!atomic_test_bit(data->ctx.atomi, channel))
		return 0;

	// other stuff
	atomic_clear_bit(data->ctx.atomi, channel);
	if (data->ctx.atomic == 0)
		pm_device_runtime_put(dev);
	return 0;
}

This atomic value is already in driver data and I didn't see any use of it in driver. This code will look and work the same for GP DMA and HDA DMA.

teburd
teburd previously approved these changes Nov 29, 2023
@kv2019i
Copy link
Collaborator Author

kv2019i commented Dec 7, 2023

@teburd @tmleman I looked at a few options:

  1. use acquire/release channels
  • this could potentially work, but I didn't realize this is not exposed to drivers in current code, but is done in common code in dma.h
  • would require to extend dma.h
  1. use the ctx->atomic counter
  • so this is actually used but done in common code in dma.h , i.e. z_impl_dma_request_channel
  • by the time driver config/start/stop is called, the bit is already set/cleared so we can't really use this to drive logic
  1. taking runtime get already in config
  • strictly speaking we don't need this (the registers are available), but the code looks a bit odd if we don't call runtime get when configuring
  • but what's the counter-operation to config? if we take runtime ref in config, the channel is active so it we should keep the ref but when to release it then? stop is not really a 1:1 counterpart to config either

All in all, this is more complicated than I thought (and probably many things @tmleman you already checked with the previous modification to start/stop). I think I'll update the current PR as it is now so we can merge at least a version of the fix.

@kv2019i
Copy link
Collaborator Author

kv2019i commented Dec 7, 2023

V3 uploaded:

  • improve pm_device_state_get() return value handling. it currently only returns -ENOSYS on error, but ack on @ujfalusi 's review comment that driver code should not assume this -> changed this now. if pm_device_state_get() returns an unknown error, we shall not assume anything but use the normal flow to see if device needs to be stopped

@teburd
Copy link
Collaborator

teburd commented Dec 7, 2023

@teburd @tmleman I looked at a few options:

  1. use acquire/release channels
  • this could potentially work, but I didn't realize this is not exposed to drivers in current code, but is done in common code in dma.h
  • would require to extend dma.h

Someone else was needing this as well I think for an agilex part so I have #65390 that I believe would solve part of that.

  1. use the ctx->atomic counter
  • so this is actually used but done in common code in dma.h , i.e. z_impl_dma_request_channel
  • by the time driver config/start/stop is called, the bit is already set/cleared so we can't really use this to drive logic
  1. taking runtime get already in config
  • strictly speaking we don't need this (the registers are available), but the code looks a bit odd if we don't call runtime get when configuring
  • but what's the counter-operation to config? if we take runtime ref in config, the channel is active so it we should keep the ref but when to release it then? stop is not really a 1:1 counterpart to config either

Config is viewed as a retained state of the driver, once configured it always remains configured. At least with the current state diagram in the DMA zephyr docs that I concocted.

https://docs.zephyrproject.org/latest/hardware/peripherals/dma.html#channel-state-machine-expectations

All in all, this is more complicated than I thought (and probably many things @tmleman you already checked with the previous modification to start/stop). I think I'll update the current PR as it is now so we can merge at least a version of the fix.

👍

@kv2019i
Copy link
Collaborator Author

kv2019i commented Dec 8, 2023

@teburd wrote:

Someone else was needing this as well I think for an agilex part so I have #65390 that I believe would solve part of that.

Ok, that looks just what the doctor ordered. I think we should try to move at least the intel gpdma variant on top of that and grab the runtime ref on channel allocate. But beyond this PR.

  • but what's the counter-operation to config? if we take runtime ref in config, the channel is active so it we should keep the ref >
    Config is viewed as a retained state of the driver, once configured it always remains configured. At least with the current state diagram in the DMA zephyr docs that I concocted.

https://docs.zephyrproject.org/latest/hardware/peripherals/dma.html#channel-state-machine-expectations

So indeed then the only practical option within the current framework is to get the runtime reference when acquiring the channel, and put it on channel release. Then all config-start-stop-config-start-start-stop actions are done with a the runtime ref held and tracking is easier to do (and follow).

I still need to investigate a bit more what are the side-impacts of holding the power-domain ref for longer. On these targets, the power domain driver is actually setting/clearing a constraint (power-off-allowed/prohibited versus directly powering up/down the hardware).

The DMA interface allows start and stop to be called multiple
times and driver should ensure nothing bad happens if the calls
are not balanced.

Fix an issue where after a start-stop sequence the DMA would be
powered down, and then a subsequent stop would result in a crash
as driver accesses registers of a powered down hardware block.

Fix the issue by handling stop without actually reading the hw
registers to check channel status.

Link: thesofproject/sof#8503
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
@kv2019i
Copy link
Collaborator Author

kv2019i commented Dec 14, 2023

V4 uploaded:

  • adjusted the code comment based on @ujfalusi feedback to make it explicit that (it's on purpose that) if pm_device_state_get() returns an error, we will NOT skip but check the hardware registers for device state and stop if needed

Please @teburd @ujfalusi check.

@fabiobaltieri fabiobaltieri merged commit c7e3ccd into zephyrproject-rtos:main Dec 15, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: DMA Direct Memory Access platform: Intel ADSP Intel Audio platforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants