Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Google AEC rework: dynamic formats, legacy platform support #8919

Merged
merged 7 commits into from
Jun 21, 2024

Conversation

andyross
Copy link
Contributor

@andyross andyross commented Mar 7, 2024

This is almost exactly the code as it stands in PR #8571, but rebased on top of current main (which descends from mtl-008-drop-stable) and with as many individual changes split out as I could. Basically this unbreaks AEC for IPC3 targets (mt8195 in particular) with a unified implementation, supports dynamic stream sample formats, improves performance and code size significantly, and removes a few hundred lines of duplicated/unparametrized code. Tested on mt8195 and MTL, will pull out a ADL board and try that as soon as I get a chance.

See individual patch commit messages for more details.

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singalsu can you review.
Can someone from @thesofproject/google also review. Thanks.

* are single-core component and know we can safely use the cache for
* AEC work. XTOS SOF is cached by default, so stub the Zephyr API.
*/
#define arch_xtensa_cached_ptr(p) (p)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be automatically picked up in headers when building form non xtos targets. I know we removed for Zephyr target, but for xtos it should be around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is the Zephyr API, not the old SOF one. I'm translating in the opposite direction (basically writing Zephyr code in the module, with fakes for legacy Mediatek builds).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgirdwood we will be porting this to platforms still using XTOS so it is needed

@@ -399,9 +399,6 @@ static int google_rtc_audio_processing_init(struct processing_module *mod)
return -EINVAL;
}

cd->config.output_fmt = mod->priv.cfg.input_pins[SOF_AEC_DMIC_QUEUE_ID].audio_fmt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on the topology tagging, can you create a FEATURE# for it and map out what would be needed etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Really this is a special case of a general flaw, which is that the topology data isn't reliably exposed to the firmware. There's a big library of code in the kernel to interpret it and translate it to ipc4 commands, but 80% of the time all we really want is the tuple array down here.

Comment on lines +111 to +145
static ALWAYS_INLINE float clamp_rescale(float max_val, float x)
{
float min = -1.0f;
float max = 1.0f - 1.0f / max_val;

return max_val * (x < min ? min : (x > max ? max : x));
}

static ALWAYS_INLINE float s16_to_float(const char *ptr)
{
float scale = -(float)SHRT_MIN;
float x = *(int16_t *)ptr;

return (1.0f / scale) * x;
}

static ALWAYS_INLINE void float_to_s16(float x, char *dst)
{
*(int16_t *)dst = (int16_t)clamp_rescale(-(float)SHRT_MIN, x);
}

static ALWAYS_INLINE float s32_to_float(const char *ptr)
{
float scale = -(float)INT_MIN;
float x = *(int32_t *)ptr;

return (1.0f / scale) * x;
}

static ALWAYS_INLINE void float_to_s32(float x, char *dst)
{
*(int32_t *)dst = (int16_t)clamp_rescale(-(float)INT_MIN, x);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already have these conversions in a maths header. @singalsu where can @andyross find existing conversions ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually don't, not in the form needed here. There is a fixed point set of conversion macros, but those are in maximal/64-bit precision for accuracy (they're intended for creating constants) and a pcm_converter library that is a C function that takes as an argument a whole array. The idea here is to get optimized code that does the fully unrolled de/interleave, format conversion and buffer rollover in one go. There was some

Also, to be fair, this is replacing code that also doesn't use the existing converters. That one was based on xcc/xt-clang intrinsics where this one is C, but both generate the same HiFi/FPU output (actually xt-clang on -O3 will even autovectorize this, though we don't use that mode).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singalsu can we split these out into a common location after merge, it would save a lot of duplication. Also we should plan for SIMD versions of these especially since we are passing in arrays here.

Copy link
Contributor Author

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, came by to nag for reviews and realized I still had stale comments. But regardless: nag for reviews please? This still rebases fine and smokes OK on MTL and mt8195 for me.

Copy link
Contributor

@lkoenig lkoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look good to me. Did you try it on a device and checked the audio ?

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marcinszkudlinski pls review.

Copy link
Contributor

@marcinszkudlinski marcinszkudlinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accessing comp_buffer structures will stop working soon enough, it cannot be done this way.
@andyross, pipeline ID may be moved to struct sof_audio_stream_params and be accessible through sink/src API

@@ -602,18 +604,16 @@ static int google_rtc_audio_processing_prepare(struct processing_module *mod,
return -EINVAL;
}

struct comp_buffer *b0 = list_first_item(&dev->bsource_list, struct comp_buffer, sink_list);
Copy link
Contributor

@marcinszkudlinski marcinszkudlinski Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when using sink/source interface it is NOT guaranteed that comp_buffer structures are in use.
sink and source API may be provided by other entities.
At the moment there's a workaround in use - DpQueues do always work as a "shadow buffer" for comp_buffer, so this code it will work, but not for long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... I get that. It's relying on a stale/deprecated API scheme. But see the commit message. The problem is that the mechanism currently being used to answer the question "Is this source/sink connected to a component on the same pipeline or a different one?" is just incorrect as it stands. Those numbers are ephemeral values assigned at runtime in the kernel (e.g. they're like file descriptors), and the only reason they work here is because they've been hand-matched to the values chosen for this specific device with this topology.

So, I guess I'd argue we should pick the correct but to-be-replaced solution over the incorrect one. For sure a proper API is required. Basically this is a variant of a common SOF problem: the topology as it exists in the kernel is only incompletely exposed inside the firmware (everything gets cooked by the kernel, which drops a lot of information), so everywhere we're having to use trickery to recover it.

@@ -24,14 +24,6 @@ config COMP_GOOGLE_RTC_AUDIO_PROCESSING
This component takes raw microphones input and playback reference
and outputs an echo-free microphone signal.

Copy link
Contributor

@marcinszkudlinski marcinszkudlinski Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The internals of the AEC library are floating point already.

just a comment: from my measurements of the AEC binary I got - 16bit API used to be faster than 32bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that maybe polluted by conversion overhead? Or maybe it's a cache footprint issue? My understanding is that the behavior is identical once you get across the API boundary. @lkoenig might have input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internally the int16 are converted back to float32. If the int16 is faster than the float32, we might need to look into it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singalsu any comments here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lkoenig I might have had an obsolete version of AEC binary, so my observations may be inaccurate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From AEC quality point of view I think 32 bit float is a good choice over 16 bit integer. The raw Mic signal level is typically low since there is headroom for very loud ambient noise.

@@ -93,6 +108,130 @@ void GoogleRtcFree(void *ptr)
return rfree(ptr);
}

static ALWAYS_INLINE float clamp_rescale(float max_val, float x)
{
float min = -1.0f;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using HiFi coprocessor for this (peformance!!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler automatically emits HiFi instructions here (or FPU, for example on platforms that have HiFi3 w/o floating point support). There was some disassembly posted months ago in a previous iteration of this PR that showed what the inner loops look like. Basically the resulting code here is pretty much optimal. Actually if you build with -O3 (we don't) xt-clang is capable of autovectorizing the loop for another ~2x benefit on HiFi4/5 platforms.

struct comp_buffer *b0 = list_first_item(&dev->bsource_list, struct comp_buffer, sink_list);
struct comp_buffer *b1 = list_next_item(b0, sink_list);

cd->aec_reference_source = (is_ref_buffer(dev, b0) ? 0 : 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

superfluous parentheses

struct comp_buffer *b1 = list_next_item(b0, sink_list);

cd->aec_reference_source = (is_ref_buffer(dev, b0) ? 0 : 1);
cd->raw_microphone_source = (cd->aec_reference_source == 1) ? 0 : 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about 1 - cd->aec_reference_source

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets keep it simple, if we need to remap IDs then lets have a comment saying why/how.

Copy link
Collaborator

@singalsu singalsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK to me as far as I can understand this.

* are single-core component and know we can safely use the cache for
* AEC work. XTOS SOF is cached by default, so stub the Zephyr API.
*/
#define arch_xtensa_cached_ptr(p) (p)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgirdwood we will be porting this to platforms still using XTOS so it is needed

#endif

#ifndef __ZEPHYR__
#define ALWAYS_INLINE inline __attribute__((always_inline))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe put this in some rtos header for XTOS?

static __aligned(PLATFORM_DCACHE_ALIGN)
uint8_t aec_mem_blob[CONFIG_COMP_GOOGLE_RTC_AUDIO_PROCESSING_MEMORY_BUFFER_SIZE_KB * 1024];

#define NUM_FRAMES (CONFIG_COMP_GOOGLE_RTC_AUDIO_PROCESSING_SAMPLE_RATE_HZ \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be dynamic with IPC4 right?

return max_val * (x < min ? min : (x > max ? max : x));
}

static ALWAYS_INLINE float s16_to_float(const char *ptr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like these should exist somewhere more generic so other components can use them if they are more optimized than the existing macros.

@lgirdwood
Copy link
Member

accessing comp_buffer structures will stop working soon enough, it cannot be done this way. @andyross, pipeline ID may be moved to struct sof_audio_stream_params and be accessible through sink/src API

@andyross I think this old method is probably more and more in the untested territory today (before its fully deleted). Are you able to update ?

@andyross
Copy link
Contributor Author

I'll can take a look at what's needed to put bidirectional pipeline IDs into sink/src, sure. There's a little design worry though. Are we all sure that's the right spot though? Are all streams guaranteed to be connected on both sides to something "on a pipeline"? Are there lifecycle concerns (e.g. does a stream ever get reconnected)?

I guess if we're going to do that, can't we put a module backpointer into the stream instead? That would be more general IMHO, and preserve the old capability of traversing the component/stream graph via a known API.

@lgirdwood
Copy link
Member

I'll can take a look at what's needed to put bidirectional pipeline IDs into sink/src, sure. There's a little design worry though. Are we all sure that's the right spot though? Are all streams guaranteed to be connected on both sides to something "on a pipeline"? Are there lifecycle concerns (e.g. does a stream ever get reconnected)?

I guess if we're going to do that, can't we put a module backpointer into the stream instead? That would be more general IMHO, and preserve the old capability of traversing the component/stream graph via a known API.

@marcinszkudlinski @mwasko can you guys comment here. Thanks !

@cujomalainey
Copy link
Member

accessing comp_buffer structures will stop working soon enough, it cannot be done this way. @andyross, pipeline ID may be moved to struct sof_audio_stream_params and be accessible through sink/src API

@andyross I think this old method is probably more and more in the untested territory today (before its fully deleted). Are you able to update ?

Can this work be expedited as it is blocking us on a lot of devices. Also I would assume this is breaking DSM as well.

@andyross
Copy link
Contributor Author

andyross commented Jun 4, 2024

I think I'm seeing a path here. Give me a day or so and be prepared for a little refactoring.

@andyross
Copy link
Contributor Author

andyross commented Jun 7, 2024

OK, reworked to move the comp_buffer handling out of the source idenfitication path via some refactoring of the buffer fields. There's still some use to detect "is source active", which is something that the new scheme doesn't have an API for. But that's not needed for IPC4 builds as the pipelines are always synchronized at the kernel level, so I can hide it under and #ifdef.

This is now sitting on some slightly wordy refactoring patches, not all of which are strictly required. Will submit those separately; please review there and keep the AEC reviews as clean as possible.

@andyross
Copy link
Contributor Author

andyross commented Jun 7, 2024

See #9210 and #9211 for split-out review of the early patches in this series.

comp_data_blob_handler_free(cd->tuning_handler);
rfree(cd);
return 0;
}

#define PIPE_ID(srcsink) ((srcsink)->audio_stream_params->pipeline_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote also a note in other PR.
use

static inline uint32_t sink_get_pipleine_id(struct sof_sink *sink)
{
	return sink->audio_stream_params->pipeline_id;
}

and put it in sink_api.h

I would like to keep as many sink/source internals "private" as possible

@andyross
Copy link
Contributor Author

Rebase on current version of #9211, use the sink/source_get_pipeline_id() methods added there.

The operation of the AEC component uses a single buffer as an internal
heap.  This is very large, over half the available SRAM at component
creation time on MTL.  That's just a poor fit for the heap.  It would
be trivial to create a fragmentation scenario by creating/destroying
components (which happens under user control all the way out in Linux
userspace!) where AEC can't initialize and microphone input breaks.

Longer term we can look at moving this usage back to the heap by
integrating the component's internal allocations with the SOF/Zephyr
heap (which is quite performant), allowing it to make fine-grained
allocations which will work more robustly.

Signed-off-by: Andy Ross <andyross@google.com>
This technique doesn't work.  The ID returned by get_source_id() is
not fixed by topology.  Those numbers are derived from a component ID
allocated in the Linux kernel via ida_alloc().  The specific values
will depend on the state and history of the allocator, we can't
compare them via numerical identity here in the firmware even if they
happen to work right now due to topology ordering.

Fall back to the older technique of checking whether the input source
is on the same pipeline as the AEC component to determine if it's the
microphone input.

Signed-off-by: Andy Ross <andyross@google.com>
The internals of the AEC library are floating point already.  If we're
going to support using the float variant of the API, we should use it
always to avoid all the complexity.

Signed-off-by: Andy Ross <andyross@google.com>
In point of fact the AEC code has never worked on SOF main, and this
code doesn't either.  It got left here as new code got added to
support for MTL, and it's just a wart.

Really there's nothing "IPC4ish" at all about the new code at all,
there's no reason a single code path can't be used for both, the
process/source/sink APIs are identical.  The new code doesn't work
with legacy builds, mind you.  But it will.  Remove the stuff that
will never be used.

Signed-off-by: Andy Ross <andyross@google.com>
On IPC3 pipelines, triggers can arrive at this component due to
changes in the reference pipeline.  Those aren't for us, and have the
effect of incorrectly resetting the capture stream if someone stops
playback.  Earlier product branches handled this logic in the pipeline
layer, but that never reached SOF main, and it's easier to do here by
just ignoring the event.

On IPC4, triggers never propagate across pipelines (and in any case
dependent pipeline state management happens in the host kernel), so
this becomes a benign noop.

Signed-off-by: Andy Ross <andyross@google.com>
Put the AEC tunables inside an if COMP_GOOGLE_RTC_AUDIO_PROCESSING for
hygine.  This prevents them from appearing in .config files of SOF
builds where AEC was never enabled at all.

Set MOCK via a default instead of select.  Select is unoverridable, it
forces the MOCK to be used whenever COMP_STUBS=y, but it's more
flexible to allow the app to pick and choose which components get
stubbed (STUBS is often set at the platform layer).

Signed-off-by: Andy Ross <andyross@google.com>
@andyross
Copy link
Contributor Author

Final rebase now that #9211 has merged; still tests OK or me on mt8195 and mtl. Should be good to go now, no code changes since last version.

@kv2019i
Copy link
Collaborator

kv2019i commented Jun 19, 2024

@andyross The Zephyr LLEXT fail seem related to code added in the PR https://github.com/thesofproject/sof/actions/runs/9569679341/job/26382722247?pr=8919
Other test results still pending... will wait to see those.

Big rewrite of the core processing code of AEC:

Support both S32 and S16 input and output formats, dynamically
selected at prepare() time based on stream configuration.

Copy/convert data in maximally inlined/unrolled loops, using
cleanly-generated (no duplication!) custom conversion utilities for
each format variant.

Orthogonalize and elaborate the validation code in prepare().  Check
all state for all input/output streams.

Decouple AEC operation from the input stream, filling zeros on
underflow and allowing AEC to run in circumstances where no playback
data exists and to recover when it starts/stops.  IPC3 setups can
exploit this now, unfortunately IPC4 always starts connected pipelines
from the host kernel so sees no benefit.

Fix a latency bug with the original code where it would copy the
processed results to the output stream before the call to
ProcessCapture() instead of after, leading to a needless delay.  Copy
the results as soon as they are available, if the output buffer backs
up, we'll continue at the next call to process()

Signed-off-by: Andy Ross <andyross@google.com>
@andyross
Copy link
Contributor Author

Ah, indeed. Just the build warning, right? Fixed.

dst[i] = &dst_bufs[i][frame0];

err = source_get_data(src, bytes, (void *)&buf, (void *)&bufstart, &bufsz);
assert(err == 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not handle errors properly? Particularly because assert() is a NOP in release builds, or is Google building with them enabled always?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only documented error happens when the size passed is larger than available data, which is untrue by construction as we've already queried the buffer state up the stack. Thus the assert, to catch bugs with that logic.

Copy link
Collaborator

@lyakh lyakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say, that assert()s can be clarified / cleaned up in a follow-up

@andyross
Copy link
Contributor Author

Ping for merge? Any last reviews needed?

@cujomalainey cujomalainey merged commit 28a5265 into thesofproject:main Jun 21, 2024
40 of 46 checks passed
@lyakh lyakh mentioned this pull request Jun 26, 2024
@marc-hb
Copy link
Collaborator

marc-hb commented Jun 27, 2024

This was merged with sparse annotations broken:
https://github.com/thesofproject/sof/actions/runs/9584654607/job/26428767199

Tentative fixup from @andyross in #9265

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants