-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Google AEC rework: dynamic formats, legacy platform support #8919
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singalsu can you review.
Can someone from @thesofproject/google also review. Thanks.
* are single-core component and know we can safely use the cache for | ||
* AEC work. XTOS SOF is cached by default, so stub the Zephyr API. | ||
*/ | ||
#define arch_xtensa_cached_ptr(p) (p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be automatically picked up in headers when building form non xtos targets. I know we removed for Zephyr target, but for xtos it should be around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this is the Zephyr API, not the old SOF one. I'm translating in the opposite direction (basically writing Zephyr code in the module, with fakes for legacy Mediatek builds).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood we will be porting this to platforms still using XTOS so it is needed
@@ -399,9 +399,6 @@ static int google_rtc_audio_processing_init(struct processing_module *mod) | |||
return -EINVAL; | |||
} | |||
|
|||
cd->config.output_fmt = mod->priv.cfg.input_pins[SOF_AEC_DMIC_QUEUE_ID].audio_fmt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point on the topology tagging, can you create a FEATURE# for it and map out what would be needed etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. Really this is a special case of a general flaw, which is that the topology data isn't reliably exposed to the firmware. There's a big library of code in the kernel to interpret it and translate it to ipc4 commands, but 80% of the time all we really want is the tuple array down here.
static ALWAYS_INLINE float clamp_rescale(float max_val, float x) | ||
{ | ||
float min = -1.0f; | ||
float max = 1.0f - 1.0f / max_val; | ||
|
||
return max_val * (x < min ? min : (x > max ? max : x)); | ||
} | ||
|
||
static ALWAYS_INLINE float s16_to_float(const char *ptr) | ||
{ | ||
float scale = -(float)SHRT_MIN; | ||
float x = *(int16_t *)ptr; | ||
|
||
return (1.0f / scale) * x; | ||
} | ||
|
||
static ALWAYS_INLINE void float_to_s16(float x, char *dst) | ||
{ | ||
*(int16_t *)dst = (int16_t)clamp_rescale(-(float)SHRT_MIN, x); | ||
} | ||
|
||
static ALWAYS_INLINE float s32_to_float(const char *ptr) | ||
{ | ||
float scale = -(float)INT_MIN; | ||
float x = *(int32_t *)ptr; | ||
|
||
return (1.0f / scale) * x; | ||
} | ||
|
||
static ALWAYS_INLINE void float_to_s32(float x, char *dst) | ||
{ | ||
*(int32_t *)dst = (int16_t)clamp_rescale(-(float)INT_MIN, x); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually don't, not in the form needed here. There is a fixed point set of conversion macros, but those are in maximal/64-bit precision for accuracy (they're intended for creating constants) and a pcm_converter library that is a C function that takes as an argument a whole array. The idea here is to get optimized code that does the fully unrolled de/interleave, format conversion and buffer rollover in one go. There was some
Also, to be fair, this is replacing code that also doesn't use the existing converters. That one was based on xcc/xt-clang intrinsics where this one is C, but both generate the same HiFi/FPU output (actually xt-clang on -O3 will even autovectorize this, though we don't use that mode).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singalsu can we split these out into a common location after merge, it would save a lot of duplication. Also we should plan for SIMD versions of these especially since we are passing in arrays here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, came by to nag for reviews and realized I still had stale comments. But regardless: nag for reviews please? This still rebases fine and smokes OK on MTL and mt8195 for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This look good to me. Did you try it on a device and checked the audio ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcinszkudlinski pls review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accessing comp_buffer structures will stop working soon enough, it cannot be done this way.
@andyross, pipeline ID may be moved to struct sof_audio_stream_params and be accessible through sink/src API
@@ -602,18 +604,16 @@ static int google_rtc_audio_processing_prepare(struct processing_module *mod, | |||
return -EINVAL; | |||
} | |||
|
|||
struct comp_buffer *b0 = list_first_item(&dev->bsource_list, struct comp_buffer, sink_list); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when using sink/source interface it is NOT guaranteed that comp_buffer structures are in use.
sink and source API may be provided by other entities.
At the moment there's a workaround in use - DpQueues do always work as a "shadow buffer" for comp_buffer, so this code it will work, but not for long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... I get that. It's relying on a stale/deprecated API scheme. But see the commit message. The problem is that the mechanism currently being used to answer the question "Is this source/sink connected to a component on the same pipeline or a different one?" is just incorrect as it stands. Those numbers are ephemeral values assigned at runtime in the kernel (e.g. they're like file descriptors), and the only reason they work here is because they've been hand-matched to the values chosen for this specific device with this topology.
So, I guess I'd argue we should pick the correct but to-be-replaced solution over the incorrect one. For sure a proper API is required. Basically this is a variant of a common SOF problem: the topology as it exists in the kernel is only incompletely exposed inside the firmware (everything gets cooked by the kernel, which drops a lot of information), so everywhere we're having to use trickery to recover it.
@@ -24,14 +24,6 @@ config COMP_GOOGLE_RTC_AUDIO_PROCESSING | |||
This component takes raw microphones input and playback reference | |||
and outputs an echo-free microphone signal. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The internals of the AEC library are floating point already.
just a comment: from my measurements of the AEC binary I got - 16bit API used to be faster than 32bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that maybe polluted by conversion overhead? Or maybe it's a cache footprint issue? My understanding is that the behavior is identical once you get across the API boundary. @lkoenig might have input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Internally the int16 are converted back to float32. If the int16 is faster than the float32, we might need to look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singalsu any comments here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lkoenig I might have had an obsolete version of AEC binary, so my observations may be inaccurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From AEC quality point of view I think 32 bit float is a good choice over 16 bit integer. The raw Mic signal level is typically low since there is headroom for very loud ambient noise.
@@ -93,6 +108,130 @@ void GoogleRtcFree(void *ptr) | |||
return rfree(ptr); | |||
} | |||
|
|||
static ALWAYS_INLINE float clamp_rescale(float max_val, float x) | |||
{ | |||
float min = -1.0f; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest using HiFi coprocessor for this (peformance!!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compiler automatically emits HiFi instructions here (or FPU, for example on platforms that have HiFi3 w/o floating point support). There was some disassembly posted months ago in a previous iteration of this PR that showed what the inner loops look like. Basically the resulting code here is pretty much optimal. Actually if you build with -O3 (we don't) xt-clang is capable of autovectorizing the loop for another ~2x benefit on HiFi4/5 platforms.
struct comp_buffer *b0 = list_first_item(&dev->bsource_list, struct comp_buffer, sink_list); | ||
struct comp_buffer *b1 = list_next_item(b0, sink_list); | ||
|
||
cd->aec_reference_source = (is_ref_buffer(dev, b0) ? 0 : 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
superfluous parentheses
struct comp_buffer *b1 = list_next_item(b0, sink_list); | ||
|
||
cd->aec_reference_source = (is_ref_buffer(dev, b0) ? 0 : 1); | ||
cd->raw_microphone_source = (cd->aec_reference_source == 1) ? 0 : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about 1 - cd->aec_reference_source
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets keep it simple, if we need to remap IDs then lets have a comment saying why/how.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK to me as far as I can understand this.
* are single-core component and know we can safely use the cache for | ||
* AEC work. XTOS SOF is cached by default, so stub the Zephyr API. | ||
*/ | ||
#define arch_xtensa_cached_ptr(p) (p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood we will be porting this to platforms still using XTOS so it is needed
#endif | ||
|
||
#ifndef __ZEPHYR__ | ||
#define ALWAYS_INLINE inline __attribute__((always_inline)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe put this in some rtos header for XTOS?
static __aligned(PLATFORM_DCACHE_ALIGN) | ||
uint8_t aec_mem_blob[CONFIG_COMP_GOOGLE_RTC_AUDIO_PROCESSING_MEMORY_BUFFER_SIZE_KB * 1024]; | ||
|
||
#define NUM_FRAMES (CONFIG_COMP_GOOGLE_RTC_AUDIO_PROCESSING_SAMPLE_RATE_HZ \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be dynamic with IPC4 right?
return max_val * (x < min ? min : (x > max ? max : x)); | ||
} | ||
|
||
static ALWAYS_INLINE float s16_to_float(const char *ptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like these should exist somewhere more generic so other components can use them if they are more optimized than the existing macros.
@andyross I think this old method is probably more and more in the untested territory today (before its fully deleted). Are you able to update ? |
I'll can take a look at what's needed to put bidirectional pipeline IDs into sink/src, sure. There's a little design worry though. Are we all sure that's the right spot though? Are all streams guaranteed to be connected on both sides to something "on a pipeline"? Are there lifecycle concerns (e.g. does a stream ever get reconnected)? I guess if we're going to do that, can't we put a module backpointer into the stream instead? That would be more general IMHO, and preserve the old capability of traversing the component/stream graph via a known API. |
@marcinszkudlinski @mwasko can you guys comment here. Thanks ! |
Can this work be expedited as it is blocking us on a lot of devices. Also I would assume this is breaking DSM as well. |
I think I'm seeing a path here. Give me a day or so and be prepared for a little refactoring. |
OK, reworked to move the comp_buffer handling out of the source idenfitication path via some refactoring of the buffer fields. There's still some use to detect "is source active", which is something that the new scheme doesn't have an API for. But that's not needed for IPC4 builds as the pipelines are always synchronized at the kernel level, so I can hide it under and #ifdef. This is now sitting on some slightly wordy refactoring patches, not all of which are strictly required. Will submit those separately; please review there and keep the AEC reviews as clean as possible. |
comp_data_blob_handler_free(cd->tuning_handler); | ||
rfree(cd); | ||
return 0; | ||
} | ||
|
||
#define PIPE_ID(srcsink) ((srcsink)->audio_stream_params->pipeline_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote also a note in other PR.
use
static inline uint32_t sink_get_pipleine_id(struct sof_sink *sink)
{
return sink->audio_stream_params->pipeline_id;
}
and put it in sink_api.h
I would like to keep as many sink/source internals "private" as possible
changes addressed, just one tiny comment
Rebase on current version of #9211, use the sink/source_get_pipeline_id() methods added there. |
The operation of the AEC component uses a single buffer as an internal heap. This is very large, over half the available SRAM at component creation time on MTL. That's just a poor fit for the heap. It would be trivial to create a fragmentation scenario by creating/destroying components (which happens under user control all the way out in Linux userspace!) where AEC can't initialize and microphone input breaks. Longer term we can look at moving this usage back to the heap by integrating the component's internal allocations with the SOF/Zephyr heap (which is quite performant), allowing it to make fine-grained allocations which will work more robustly. Signed-off-by: Andy Ross <andyross@google.com>
This technique doesn't work. The ID returned by get_source_id() is not fixed by topology. Those numbers are derived from a component ID allocated in the Linux kernel via ida_alloc(). The specific values will depend on the state and history of the allocator, we can't compare them via numerical identity here in the firmware even if they happen to work right now due to topology ordering. Fall back to the older technique of checking whether the input source is on the same pipeline as the AEC component to determine if it's the microphone input. Signed-off-by: Andy Ross <andyross@google.com>
The internals of the AEC library are floating point already. If we're going to support using the float variant of the API, we should use it always to avoid all the complexity. Signed-off-by: Andy Ross <andyross@google.com>
In point of fact the AEC code has never worked on SOF main, and this code doesn't either. It got left here as new code got added to support for MTL, and it's just a wart. Really there's nothing "IPC4ish" at all about the new code at all, there's no reason a single code path can't be used for both, the process/source/sink APIs are identical. The new code doesn't work with legacy builds, mind you. But it will. Remove the stuff that will never be used. Signed-off-by: Andy Ross <andyross@google.com>
On IPC3 pipelines, triggers can arrive at this component due to changes in the reference pipeline. Those aren't for us, and have the effect of incorrectly resetting the capture stream if someone stops playback. Earlier product branches handled this logic in the pipeline layer, but that never reached SOF main, and it's easier to do here by just ignoring the event. On IPC4, triggers never propagate across pipelines (and in any case dependent pipeline state management happens in the host kernel), so this becomes a benign noop. Signed-off-by: Andy Ross <andyross@google.com>
Put the AEC tunables inside an if COMP_GOOGLE_RTC_AUDIO_PROCESSING for hygine. This prevents them from appearing in .config files of SOF builds where AEC was never enabled at all. Set MOCK via a default instead of select. Select is unoverridable, it forces the MOCK to be used whenever COMP_STUBS=y, but it's more flexible to allow the app to pick and choose which components get stubbed (STUBS is often set at the platform layer). Signed-off-by: Andy Ross <andyross@google.com>
Final rebase now that #9211 has merged; still tests OK or me on mt8195 and mtl. Should be good to go now, no code changes since last version. |
@andyross The Zephyr LLEXT fail seem related to code added in the PR https://github.com/thesofproject/sof/actions/runs/9569679341/job/26382722247?pr=8919 |
Big rewrite of the core processing code of AEC: Support both S32 and S16 input and output formats, dynamically selected at prepare() time based on stream configuration. Copy/convert data in maximally inlined/unrolled loops, using cleanly-generated (no duplication!) custom conversion utilities for each format variant. Orthogonalize and elaborate the validation code in prepare(). Check all state for all input/output streams. Decouple AEC operation from the input stream, filling zeros on underflow and allowing AEC to run in circumstances where no playback data exists and to recover when it starts/stops. IPC3 setups can exploit this now, unfortunately IPC4 always starts connected pipelines from the host kernel so sees no benefit. Fix a latency bug with the original code where it would copy the processed results to the output stream before the call to ProcessCapture() instead of after, leading to a needless delay. Copy the results as soon as they are available, if the output buffer backs up, we'll continue at the next call to process() Signed-off-by: Andy Ross <andyross@google.com>
Ah, indeed. Just the build warning, right? Fixed. |
dst[i] = &dst_bufs[i][frame0]; | ||
|
||
err = source_get_data(src, bytes, (void *)&buf, (void *)&bufstart, &bufsz); | ||
assert(err == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not handle errors properly? Particularly because assert()
is a NOP in release builds, or is Google building with them enabled always?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only documented error happens when the size passed is larger than available data, which is untrue by construction as we've already queried the buffer state up the stack. Thus the assert, to catch bugs with that logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say, that assert()
s can be clarified / cleaned up in a follow-up
Ping for merge? Any last reviews needed? |
This was merged with sparse annotations broken: |
This is almost exactly the code as it stands in PR #8571, but rebased on top of current main (which descends from mtl-008-drop-stable) and with as many individual changes split out as I could. Basically this unbreaks AEC for IPC3 targets (mt8195 in particular) with a unified implementation, supports dynamic stream sample formats, improves performance and code size significantly, and removes a few hundred lines of duplicated/unparametrized code. Tested on mt8195 and MTL, will pull out a ADL board and try that as soon as I get a chance.
See individual patch commit messages for more details.