
ch4/ofi: refactor MPIDI_OFI_request_t #6895

Open
wants to merge 26 commits into base: main
Conversation

hzhou
Contributor
@hzhou hzhou commented Feb 6, 2024

Pull Request Description

Having multiple non-intersecting code paths share the same structure fields in the request makes the code difficult to read.
This PR separates the fields into distinct union members with more intuitive names in order to make the code more readable.

Before this PR, the fields were defined as:

typedef struct {
    struct fi_context context[MPIDI_OFI_CONTEXT_STRUCTS];       /* fixed field, do not move */
    int event_id;               /* fixed field, do not move */
    MPL_atomic_int_t util_id;
    MPI_Datatype datatype;
    int nic_num;                /* Store the nic number so we can use it to cancel a request later
                                 * if needed. */
    enum MPIDI_OFI_req_kind kind;
    union {
        struct fid_mr **send_mrs;
        void *remote_info;
    } huge;
    union {
        struct {
            void *buf;
            size_t count;
            MPI_Datatype datatype;
            char *pack_buffer;
        } pack;
        struct iovec *nopack;
    } noncontig;
    union {
        struct iovec iov;
        void *inject_buf;       /* Internal buffer for inject emulation */
    } util;
    ...

It is quite confusing which paths use which fields and which paths overlap. After this PR:

typedef struct {
    struct fi_context context[MPIDI_OFI_CONTEXT_STRUCTS];       /* fixed field, do not move */
    int event_id;               /* fixed field, do not move */
    MPL_atomic_int_t util_id;
    MPI_Datatype datatype;
    int nic_num;                /* Store the nic number so we can use it to cancel a request later
                                 * if needed. */
    enum MPIDI_OFI_req_kind kind;
    union {
        /* send path */
        struct {
            void *pack_buffer;
        } pack_send;
        struct {
            struct iovec *iovs;
        } nopack_send;
        struct {
            struct fid_mr **mrs;
            void *pack_buffer;
        } huge_send;
        struct {
            int vci_local;
            int ctx_idx;
            fi_addr_t remote_addr;
            uint64_t cq_data;
            uint64_t match_bits;
            int pipeline_tag;
            int num_remain;
        } pipeline_send;
        struct {
            void *inject_buf;
        } am_inject_emu;

        /* The recv path can vary depending on the actual send path,
         * so some fields are significant and need to be preset to NULL.
         */
        struct {
            void *remote_info;  /* huge path if not NULL */
            char *pack_buffer;  /* need unpack if not NULL */
            void *buf;
            MPI_Aint count;
            MPI_Datatype datatype;
            struct iovec msg_iov;       /* FI_CLAIM requires fi_trecvmsg, which requires the
                                         * use of an iov. We always set it to {recv_buf, data_sz}
                                         * since they are useful for the huge recv path as well.
                                         */
        } recv;
        struct {
            struct iovec *iovs;
        } nopack_recv;
        struct {
            int vci_local;
            int ctx_idx;
            fi_addr_t remote_addr;
            uint64_t match_bits;
            uint64_t mask_bits;
            MPI_Aint offset;
            int pipeline_tag;
            int num_inrecv;
            int num_remain;
            bool is_sync;
            void *buf;
            MPI_Aint count;
            MPI_Datatype datatype;
        } pipeline_recv;
    } u;
} MPIDI_OFI_request_t;

The union members do not overlap and clearly identify the code paths.
[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

MPIR_gpu_req is a union type holding either an MPL_gpu_request or a
MPIR_Typerep_req, so it is not just for GPUs. This type can potentially
be extended to include other internal async task handles. Thus we rename
it to MPIR_async_req.

We also establish the convention of naming the variable async_req.
Add an inline wrapper for testing MPIR_async_req.

Modify the order of header inclusion due to the dependency on
typerep_pre.h.
Refactor the async copy in receive events using MPIR_async facilities.
Refactor the async copy before sending a chunk.
Both gpu_send_task_queue and gpu_recv_task_queue have been ported to the
async facilities.
@hzhou hzhou force-pushed the 2402_ofi_req branch 2 times, most recently from ed2c544 to 0cf1638 Compare February 6, 2024 23:31
Pipeline send allocates chunk buffers and then spawns async copies. The
allocation may run out of genq buffers, thus it is designed as async
tasks.

The send copy is triggered upon completion of the buffer allocation, so
it is renamed to spawn_send_copy and turned into an internal static function.

This removes MPIDI_OFI_global.gpu_send_queue.
Pipeline recv allocates chunk buffers and then posts fi_trecv. The
allocation may run out of genq buffers, and we also control the number of
outstanding recvs, thus it is designed as async tasks.

The async recv copy is triggered in the recv event when data arrives.

This removes MPIDI_OFI_global.gpu_recv_queue.

All ofi-layer progress routines for gpu pipelining are now removed.
Consolidate the gpu pipeline code.

MPIDI_OFI_gpu_pipeline_request is now an internal struct in
ofi_gpu_pipeline.c; rename it to struct chunk_req.

MPIDI_OFI_gpu_pipeline_recv_copy is now an internal function; rename it
to start_recv_copy.
Move all gpu pipeline specific code into ofi_gpu_pipeline.c.

Make a new function MPIDI_OFI_gpu_pipeline_recv that fills rreq with
persistent pipeline_info data. Rename the original
MPIDI_OFI_gpu_pipeline_recv into static function start_recv_chunk.
Clean up the code by separating the pipeline_info type into a union of
send and recv.
Don't mix the usage of cc_ptr; use separate and explicit counters to
track the progress and completion of chunks.
Follow a similar approach as nonblocking collectives: internal pipeline
chunks use a separate tag space (MPIDI_OFI_GPU_PIPELINE_SEND) and
incrementing tags to avoid mismatches with regular messages.
@hzhou hzhou force-pushed the 2402_ofi_req branch 3 times, most recently from 951bf49 to 127a607 Compare February 7, 2024 21:16
@hzhou hzhou marked this pull request as ready for review February 7, 2024 21:16
@hzhou hzhou force-pushed the 2402_ofi_req branch 2 times, most recently from b7f89a7 to 4318961 Compare February 8, 2024 16:54
Separate the recv tasks between the initial header and the chunks, since
the paths clearly separate them.

Use a single async item for all chunk recvs rather than unnecessarily
enqueuing individual chunks since we can track the chunks in the state.
It is needed to compile under the noinline configuration.
Move these utility functions to ofi_impl.h since they are simple and
non-specific. It also simplifies figuring out which file to include
especially for .c files.
Separate non-intersecting paths with distinct union members to make the
code more explicit.
The pipeline code paths do not intersect with other paths, thus
pipeline_info can be part of the same union.
Rather than sharing the pack_buffer field in pack_send, which obfuscates
the code, use a separate union member for the huge send path.
It is part of the recv path, thus it needs to be in the recv struct.

The recv path may switch depending on the actual protocol used by the
sender, thus some fields from different paths need to live in the same
struct, using NULL sentinels to tell whether a certain path is in
effect. Currently this includes remote_info for the huge_send protocol
and pack_buffer for pack_recv. It's possible to have
huge_recv+pack_recv.
The am emulated inject path does not intersect with any native paths,
thus it should be part of the big union.
The util.iov field is used by fi_trecvmsg with the FI_CLAIM flag, or by
the huge recv path for threshold checking and fi_read.
The fast paths of MPIDI_OFI_EVENT_SEND and MPIDI_OFI_EVENT_RECV are
already handled in MPIDI_OFI_dispatch_optimized, thus it is not
critical to mark likely/unlikely branches in
MPIDI_OFI_dispatch_function. Just use a big switch for simplicity.
There is no way to recover a pipelined message in the native ofi path if
the receiver didn't expect it. We could always save the necessary
information in all recv paths, but I am not sure we are willing to
accept the overhead, which would harm small-message latency.

Assert failure for now.

The proper pipeline implementation needs to happen as active messages or
even at the MPIR layer.
Remove a redundant assignment of *request.
@hzhou
Contributor Author

hzhou commented Feb 9, 2024

test:mpich/ch4/ofi

@hzhou
Contributor Author

hzhou commented Feb 9, 2024

direct-nm had a weird signal 7 during init. Retesting to confirm -

test:mpich/ch4/ofi

@hzhou hzhou requested a review from raffenet February 9, 2024 21:29
@raffenet
Contributor

The request object changes look good. I think I had a few comments on the previous PR that I need to revisit before this can go in.
