Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug stream thread info pr #9398

Merged
merged 4 commits into from
Sep 6, 2024

Conversation

jsarha
Copy link
Contributor

@jsarha jsarha commented Aug 23, 2024

This PR contains following things:

  • debug-stream definitions
  • implementation for transporting the debug-stream records over debug window slot
  • thread info implementation
    • extracts per thread stack and cpu usage statistics from Zephyr core
    • sends the statistics as debug-stream records
  • python script for receiving the debug-stream records over debug window slot transport and decoding them to human readable form

This PR requires following PR to compile (if the feature is enabled in Kconfig):
zephyrproject-rtos/zephyr#77467

and following PR to work (for python script to access the debug window slot):
thesofproject/linux#5154

and this fix to be able to disable tlemetry performance measurements, that tries to use the same debug-window slot (a configuration that allows both is also possible, but I wanted to get this PR out):
#9392

If one wants to try the PR in practice on MTL, one only needs the linux PR, and this branch of mine that contains all the other necessary changes:
https://github.com/jsarha/sof/tree/debug_stream_thread_info_testing

src/debug/debug_stream/debug_stream.h Outdated Show resolved Hide resolved
src/debug/debug_stream/Kconfig Outdated Show resolved Hide resolved
LOG_MODULE_REGISTER(debug_strem_slot);

struct cpu_mutex {
struct k_mutex m;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I assume mutex is per core and not not lock for all core ? - if so we should inline comment this description.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact, if this is nothing else but a mutex, why embed it into another structure? Just use an array of struct k_mutex objects. And if you do insist of having a separate structure for it, maybe use the debug_ namespace with it too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are mutexes meant for different cores and AFAIK they should be on different cache-lines to avoid trouble. That is why they are inside a struct that I can force to be cache-aligned.

uint32_t record_size = rec->size;
uint32_t record_start, buf_remain;

LOG_DBG("Sending record %u id %u len %u\n", rec->serial, rec->id, rec->size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go after the buffer check below ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A matter or taste. I wanted to know for sure the function is at least called, when I could not see anything happening in the circular buffer.

src/debug/debug_stream/debug_stream_slot.c Show resolved Hide resolved
* Section descriptor defines core ID, offset and size of the circular
* buffer in the debug window slot.
*/
struct debug_stream_section_descriptor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again - is user API, needs to go in another directory and needs packed/aligned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Packed hardly does anything when everything is 32-bit words, and the descriptor is already aligned to cache-line size. The header itself must be written to beginning of slot so its automatically 4k page size aligned, which should be enough, and also if its not the idea is still to have it at the very beginning of the slot. I can put the packed aligned stuff to the header too, its not needed, and I feel it could even be misleading.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I was amazed when I realized that the C standard has nothing like __packed. C is really not a low-level language ;-)

It seems unlikely but I think 32 bits fields could be 64-bits aligned on some 64 bits architectures, depending on the compiler flags, the phase of the moon etc.

https://www.catb.org/esr/structure-packing/ (sorry for the invalid certificate)
https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html#index-packed-type-attribute
https://linux.die.net/man/1/pahole

Copy link
Member

@lgirdwood lgirdwood Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

best to be explicit here even though parent is packed and this is all 32bit words, i.e. we could have direct users that may reference this structure without the parent context (and hence not optimize for 32 but aligned/packed access).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsarha FWIW agree with @lgirdwood - please add packed. It's good for all ABI types to have it even if they "obviously" effectively would be packed by default by the compiler anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can of course create all kinds of constructs to invent the __aligned() again with variable sized paddings, and use __packed, but now I do not have time for it so, I leave this as it is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsarha Wouldn't just removing __aligned and using __packed already solve it? And just look through those types to verify that you arrange elements from larger to smaller - first all 64-bit ones, then 32-bit ones etc. Only if you embed structures, well, then yes, just pad the structure that you're embedding with some uint8_t reserved[N] array to align to the next 64 bits, that's it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is several ways to do this, and probably it would work even if the core specific circular-buffer descriptors were not 64-byte aligned, since after the initialization the contents is read only and the memory is accessed through uncached address-space. Still @lgirdwood suggested that also the static descriptor part should be 64-bit aligned, just to be safe. I am really fine with anything that works, but I would like to do it only once more, and the time is running little short if we want this to go to 2.11.

Copy link
Contributor Author

@jsarha jsarha Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgirdwood @lyakh , I am more and more starting to think the __aligned() should be dropped from "struct debug_stream_section_descriptor", and for that matter all the padding hacks too. We can not have the ABI depend on cache-line size, or at least we should then write the offsets explicitly out in the debug window, but that is pretty much what the "struct debug_stream_section_descriptor" is for: It is all static data that is anyway initialized by core 0 before the threads running on other cores have started and its in uncached memory, and it contains the offsets of the core specific circular buffers, and their sizes. So, there should not and must not be any cash-line size dependent padding in the header that describes where the core-specific circular buffers are. I've tested this works fine with sof-mtl-nocodec-ssp0-ssp2.tplg on MTL.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsarha I agree. The padding that I was proposing wasn't for cache-line size alignment, it was for "natural data type alignment." I.e. if you have an 8-bit member followed by a 16-bit value and you cannot swap them, add a padding like

struct abi_data {
    uint8_t type;
    uint8_t reserved0[3];
    uint32_t size;
    ....
};

tools/debugstream/thread-info.py Outdated Show resolved Hide resolved
src/debug/debug_stream/debug_stream_slot.c Outdated Show resolved Hide resolved
src/debug/debug_stream/debug_stream_slot.c Outdated Show resolved Hide resolved
src/debug/debug_stream/debug_stream_slot.c Outdated Show resolved Hide resolved
src/debug/debug_stream/debug_stream_slot.c Outdated Show resolved Hide resolved
tools/debugstream/thread-info.py Outdated Show resolved Hide resolved
src/debug/debug_stream/debug_stream_thread_info.c Outdated Show resolved Hide resolved
src/debug/debug_stream/debug_stream_thread_info.c Outdated Show resolved Hide resolved
Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No showstoppers, aside the assert usage.

src/debug/debug_stream/debug_stream.h Outdated Show resolved Hide resolved
src/debug/debug_stream/debug_stream_thread_info.c Outdated Show resolved Hide resolved
tools/debugstream/debug-stream.py Outdated Show resolved Hide resolved
tools/debugstream/debug-stream.py Outdated Show resolved Hide resolved
src/debug/debug_stream/debug_stream_slot.c Outdated Show resolved Hide resolved
Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug-stream.py has 352 pylint warnings: that's almost as many as the number of lines in the file! I'm NOT requesting "pylint perfection" but the number of warnings should be small enough to keep pylint usable. It's not usable right now.

The good news is: 256 out of these 352 are the same W0311 and you can get rid of ALL of them with a single black call:

apt install black
black tools/debugstream/debug-stream.py
reformatted tools/debugstream/debug-stream.py

Warning: black reformatting happens in place.

Another pro tip: sorted_pylint () { pylint "$@" | sort -t: -k2 -n ; }

tools/debugstream/debug-stream.py:370:0: W0311: Bad indentation. Found 5 spaces, expected 20 (bad-indentation)
tools/debugstream/debug-stream.py:371:0: W0311: Bad indentation. Found 5 spaces, expected 20 (bad-indentation)
tools/debugstream/debug-stream.py:372:0: W0311: Bad indentation. Found 6 spaces, expected 24 (bad-indentation)
tools/debugstream/debug-stream.py:374:0: W0311: Bad indentation. Found 2 spaces, expected 8 (bad-indentation)
tools/debugstream/debug-stream.py:375:0: W0311: Bad indentation. Found 3 spaces, expected 12 (bad-indentation)
tools/debugstream/debug-stream.py:376:0: W0311: Bad indentation. Found 3 spaces, expected 12 (bad-indentation)
tools/debugstream/debug-stream.py:377:0: W0311: Bad indentation. Found 2 spaces, expected 8 (bad-indentation)
tools/debugstream/debug-stream.py:378:0: W0311: Bad indentation. Found 3 spaces, expected 12 (bad-indentation)
tools/debugstream/debug-stream.py:379:0: W0311: Bad indentation. Found 4 spaces, expected 16 (bad-indentation)
tools/debugstream/debug-stream.py:380:0: W0311: Bad indentation. Found 4 spaces, expected 16 (bad-indentation)
tools/debugstream/debug-stream.py:381:0: W0311: Bad indentation. Found 2 spaces, expected 8 (bad-indentation)
tools/debugstream/debug-stream.py:382:0: W0311: Bad indentation. Found 2 spaces, expected 8 (bad-indentation)

tools/debugstream/debug-stream.py Outdated Show resolved Hide resolved
@marc-hb
Copy link
Collaborator

marc-hb commented Aug 26, 2024

but the number of warnings should be small enough to keep pylint usable.

Forgot this sorry: most warnings are easy to fix but in some rare cases pylint is wrong or you don't have the time to figure out how (feel free to ask for help privately). Then you can easily silence any warning with a # pylint: disable... comment. Either in the spot (preferred), or at in the whole file. Find examples in git -C sof[-test] grep pylint.*disable.

Or you can just leave some for someone else to fix later.

src/debug/debug_stream/Kconfig Outdated Show resolved Hide resolved
src/debug/debug_stream/Kconfig Show resolved Hide resolved
src/debug/debug_stream/Kconfig Outdated Show resolved Hide resolved
tools/debugstream/debug-stream.py Outdated Show resolved Hide resolved
tools/debugstream/debug-stream.py Outdated Show resolved Hide resolved
tools/debugstream/debug-stream.py Outdated Show resolved Hide resolved
@jsarha
Copy link
Contributor Author

jsarha commented Sep 4, 2024

The new version should address all the comment according to reply, except for "serial" > "seqnum" and getting rid off __aligned() and using __packed and padding members instead. Oh, and I am not working from python ctypes to construct change, not yet anyway.

@marc-hb marc-hb dismissed their stale review September 4, 2024 19:04

pylint mostly clean

@jsarha
Copy link
Contributor Author

jsarha commented Sep 5, 2024

Now also "serial" -> "seqnum" change is done. Aome idle debug prints were also removed.

Copy link
Collaborator

@lyakh lyakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as none of my questions actually uncover bugs and since this is new code, they can be addressed after merge, nothing seems to be critical

src/include/user/debug_stream.h Outdated Show resolved Hide resolved
} __aligned(CONFIG_DCACHE_LINE_SIZE);

/* CPU specific mutexes for each circular buffer */
static struct cpu_mutex cpu_mutex[CONFIG_MP_MAX_NUM_CPUS];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this puts cpu_mutex[] into .bss which is uncached in present builds. That means, that you don't need to cache-line align them and to flush caches. In general, even if it were cached, you would always access it as such, mixing cached and uncached access to a mutex would very likely break its functionality. So you wouldn't need to flush caches anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, then that is a problem. Is all static (and global) data uncached? If that is the case, then I guess I should reserve the table from heap, but are all heap allocations automatically cache-line aligned?

The mutex only to control access within one core so AFAIU the cache should not be a problem. This is for the situation where higher priority thread interrupts record sending and that higher priority thread tries to send a record of its own.

I'd still say its safer to keep the mutexes on different cache-lines, even if the memory is uncached, so I suggest not to do anything right now, but move the mutexes to cached memory later.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsarha I'd keep them uncached and not bother with cache lines. Should be both simpler and safer. Though a bit slower, but don't think this slow down would be significant for you. But if you do decide to go via cache, you can get cached aliases to that memory, but then yes, you'd need to reserve a whole cache line-sized buffer for each.

buf->next_seqno = 0;
buf->w_ptr = 0;
k_mutex_init(&cpu_mutex[i].m);
sys_cache_data_flush_range(&cpu_mutex[i], sizeof(cpu_mutex[i]));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove this, for now, but eventually I would like have these mutexes in cached memory.

LOG_MODULE_REGISTER(debug_strem_slot);

struct cpu_mutex {
struct k_mutex m;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact, if this is nothing else but a mutex, why embed it into another structure? Just use an array of struct k_mutex objects. And if you do insist of having a separate structure for it, maybe use the debug_ namespace with it too

if (!buf)
return -ENODEV;

if (rec->size >= desc.buf_words) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is .size in words? Should be clearly documented then... Or maybe use a different name or make it in bytes

Copy link
Contributor Author

@jsarha jsarha Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its documented here:

struct debug_stream_record {
	uint32_t id;		/* Record id of abstract data record */
	uint32_t seqno;		/* Increments after each record */
	uint32_t size;		/* Size of the whole record in words */
	uint32_t data[];
} __packed;

, but I change the data-member name.

int i;

/* Mark the thread as active */
ud->active_threads[ud->thread_count] = tid;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thread_count sounds like it's the number of threads, i.e. it begins with 1? Can this access outside of the array?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsarha but does it begin with 0 or with 1? I.e. does the element ud->active_threads[0] ever get used here?

* returns it.
*/

static uint32_t thread_info_get_cycles(void *tid, k_thread_runtime_stats_t *thread_stats,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use struct k_thread *?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but I am trying to hint that the pointer is used as an id only, its not accessed at all. I'll add a comment.


/* If there is more than THREAD_INFO_MAX_THREADS threads in this core */
if (i == ARRAY_SIZE(ud->previous->threads))
LOG_INF("No place found for %s %p", name, tid);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning?

/* If no cached value was found, look for an empty slot to store the recent value */
for (i = 0; i < ARRAY_SIZE(ud->previous->threads); i++) {
if (ud->previous->threads[i].tid == NULL) {
ud->previous->threads[i].tid = tid;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be a follow-up, but maybe we can have a better name than previous here. You don't store more than one snapshot, so isn't it just "current"?

Copy link
Contributor Author

@jsarha jsarha Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the cycles number from the previous round. After the calculations it will be "current", but when its used next time its again from "previous" round.

#define THREAD_STACK_SIZE (2048)
static K_THREAD_STACK_ARRAY_DEFINE(info_thread_stacks, CONFIG_MP_MAX_NUM_CPUS,
THREAD_STACK_SIZE);
static struct k_thread info_thread[CONFIG_MP_MAX_NUM_CPUS];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONFIG_MP_NUM_CPUS?

Copy link
Contributor Author

@jsarha jsarha Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONFIG_MP_NUM_CPUS is obsolete, MP_MAX_NUM_CPUS should be used:

config MP_NUM_CPUS
int "Number of CPUs/cores [DEPRECATED]"
default MP_MAX_NUM_CPUS
range 1 12
help
This is deprecated, please use MP_MAX_NUM_CPUS instead.

config MP_MAX_NUM_CPUS
int "Maximum number of CPUs/cores"
default 1
range 1 12
help
Maximum number of multiprocessing-capable cores available to the
multicpu API and SMP features.

Jyri Sarha added 4 commits September 5, 2024 20:34
…tion

This commit adds definition for abstract debug stream record and
stream header. The header file contains the necessary documentation.

Signed-off-by: Jyri Sarha <jyri.sarha@linux.intel.com>
Implementation for debug stream transportation over debug window
slot. The protocol details are documented in debug_stream_slot.h
header.

Signed-off-by: Jyri Sarha <jyri.sarha@linux.intel.com>
Implementation for collecting Thread stack and CPU usage statistic from
Zephyr core and send the info and debug_stream records. The records
encoding is defined and documented in debug_stream_thread_info.c
source file.

Signed-off-by: Jyri Sarha <jyri.sarha@linux.intel.com>
Python implementation for receiving and decoding debug-stream records
from debug window slot transportation. Opens SOF debugfs file
"debug_stream" reads and decodes the records from the circular buffer
documented in soc_debug_window_slot.h. This initial version only knows
of DEBUG_STREAM_RECORD_ID_THREAD_INFO records.

Signed-off-by: Jyri Sarha <jyri.sarha@linux.intel.com>
@jsarha
Copy link
Contributor Author

jsarha commented Sep 5, 2024

Following changes were made:

  • No cache-line size alingment for static data circular buffer descriptors
  • record member "size" to "size_words"
  • struct record_buf called bufd and not rec to avoid confusion
  • Typo-fixes
  • TODO comment for using python construct instead of ctypes stucts for data models
  • thread_count overflow check
  • INF to WRN logging promotion for thread_count overflow
  • Comment out sys_cache_data_flush_range()
  • Add comment for using "void *tid" in thread_info_get_cycles()

Copy link
Collaborator

@lyakh lyakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for addressing comments

buf->next_seqno = 0;
buf->w_ptr = 0;
k_mutex_init(&cpu_mutex[i].m);
/* The core specific mutexes are now .dss which is uncached so the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.bss

int i;

/* Mark the thread as active */
ud->active_threads[ud->thread_count] = tid;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsarha but does it begin with 0 or with 1? I.e. does the element ud->active_threads[0] ever get used here?

@kv2019i
Copy link
Collaborator

kv2019i commented Sep 6, 2024

@marcinszkudlinski ok for you to proceed? default for this is "n".

@kv2019i
Copy link
Collaborator

kv2019i commented Sep 6, 2024

@wszypelt Can you check? This seems to have failed to "Copy build to DUT" step.

@kv2019i kv2019i merged commit 9152b87 into thesofproject:main Sep 6, 2024
43 of 47 checks passed
@jsarha
Copy link
Contributor Author

jsarha commented Sep 6, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants