Why do I get DML_STATUS_PAGE_FAULT_ERROR when I use the low-level API? #48

Open
shenghansen opened this issue Oct 10, 2024 · 7 comments
@shenghansen

I am using the low-level API for mem_move. When I allocate a buffer on the heap and increase its size, I get DML_STATUS_PAGE_FAULT_ERROR. I have already passed DML_FLAG_BLOCK_ON_FAULT at compile time and set block_on_fault=1 in the configuration file, so I don't understand why this error keeps occurring.
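A freshly allocated heap buffer is usually not backed by physical pages until it is first written, which is one plausible reason the error shows up only once the buffer grows. As a diagnostic sketch (not something proposed in this thread), the buffer can be touched page by page before it is handed to the accelerator. The helper name `prefault_buffer` is hypothetical, and the code assumes a POSIX system where `sysconf(_SC_PAGESIZE)` is available.

```cpp
#include <cstddef>
#include <unistd.h>  // sysconf, _SC_PAGESIZE (POSIX)

// Hypothetical helper: write one byte per page so the kernel backs every
// page of the buffer with physical memory. Existing contents are preserved.
// Returns false if the buffer is null, empty, or the page size is unknown.
bool prefault_buffer(void* buffer, std::size_t size) {
    const long page = sysconf(_SC_PAGESIZE);
    if (buffer == nullptr || size == 0 || page <= 0) {
        return false;
    }
    volatile unsigned char* bytes = static_cast<volatile unsigned char*>(buffer);
    const std::size_t page_size = static_cast<std::size_t>(page);
    for (std::size_t offset = 0; offset < size; offset += page_size) {
        bytes[offset] = bytes[offset];  // read, then write back the same value
    }
    // The buffer may not start on a page boundary, so touch the last byte too.
    bytes[size - 1] = bytes[size - 1];
    return true;
}
```

Calling `prefault_buffer(source, size)` and `prefault_buffer(destination, size)` before submitting the job would show whether untouched pages are involved. Note that this does not pin the pages, so they can still be reclaimed later.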

My config file: success.config.txt
Linux kernel: 5.19.0-051900-generic
CPU: Intel(R) Xeon(R) Gold 6448H
DSA device info:
0001:75:01.0 System peripheral: Intel Corporation Device 0b25
Subsystem: Device 1f24:6005
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
NUMA node: 2
IOMMU group: 305
Region 0: Memory at 253ffff40000 (64-bit, prefetchable) [size=64K]
Region 2: Memory at 253ffff00000 (64-bit, prefetchable) [size=128K]
Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0
ExtTag+ RBE+ FLReset+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 512 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+ LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
Capabilities: [80] MSI-X: Enable+ Count=9 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [90] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP+ FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [150 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [160 v1] Transaction Processing Hints
Device specific mode supported
Steering table in TPH capability structure
Capabilities: [170 v1] Virtual Channel
Caps: LPEVC=1 RefClk=100ns PATEntryBits=1
Arb: Fixed+ WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=fd
Status: NegoPending- InProgress-
VC1: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=1 ArbSelect=Fixed TC/VC=02
Status: NegoPending- InProgress-
Capabilities: [200 v1] Designated Vendor-Specific: Vendor=8086 ID=0005 Rev=0 Len=24 <?>
Capabilities: [220 v1] Address Translation Service (ATS)
ATSCap: Invalidate Queue Depth: 00
ATSCtl: Enable+, Smallest Translation Unit: 00
Capabilities: [230 v1] Process Address Space ID (PASID)
PASIDCap: Exec- Priv+, Max PASID Width: 14
PASIDCtl: Enable+ Exec- Priv+
Capabilities: [240 v1] Page Request Interface (PRI)
PRICtl: Enable+ Reset-
PRISta: RF- UPRGI- Stopped+
Page Request Capacity: 00000200, Page Request Allocation: 00000200
Kernel driver in use: idxd
Kernel modules: idxd

@shenghansen
Author

And CMake shows:
CMake Warning:
Manually-specified variables were not used by the project:

DML_FLAG_BLOCK_ON_FAULT

@abdelrahim-hentabli
Contributor

Hey @shenghansen, DML_FLAG_BLOCK_ON_FAULT is a job flag, so it needs to be set on the job itself rather than passed as a CMake variable.

So something like:

job->flags |= DML_FLAG_BLOCK_ON_FAULT;

@abdelrahim-hentabli self-assigned this Oct 23, 2024
@shenghansen
Author

When I try dml_job_ptr->flags = DML_FLAG_BLOCK_ON_FAULT; or dml_job_ptr->flags |= DML_FLAG_BLOCK_ON_FAULT; I get error 14:
DML_STATUS_JOB_FLAGS_ERROR = 14u, /**< Invalid flags field in dml_job_t. */
Could you please explain this in a little more detail?

@shenghansen
Author

What is the type of job?

@abdelrahim-hentabli
Contributor

Hey @shenghansen, if you want to use DML_FLAG_BLOCK_ON_FAULT, you need to make sure the accelerators are set up for it: check (or set) that the device is configured with "block_on_fault": 1 in accel-config for all enabled workqueues.

If that doesn't work, could you share the code that you are using to submit the job?
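To double-check what the driver actually applied, as opposed to what the saved config file says, the per-workqueue attribute can be read back at runtime. The sketch below assumes the idxd driver exposes it at `/sys/bus/dsa/devices/<wq>/block_on_fault`. That path and both helper names are assumptions made for illustration, not part of DML.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: build the assumed sysfs path for a workqueue such as "wq0.0".
std::string wq_block_on_fault_path(const std::string& wq_name) {
    return "/sys/bus/dsa/devices/" + wq_name + "/block_on_fault";
}

// Hypothetical helper: read a 0/1 attribute file.
// Returns 1 or 0 for the stored value, or -1 if the file cannot be read
// or does not contain 0 or 1.
int read_block_on_fault(const std::string& attr_path) {
    std::ifstream in(attr_path);
    int value = -1;
    if (!(in >> value)) {
        return -1;
    }
    return (value == 0 || value == 1) ? value : -1;
}
```

For example, `read_block_on_fault(wq_block_on_fault_path("wq0.0"))` should return 1 for every enabled workqueue if the setting took effect.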

@shenghansen
Author

Hi, thanks for your reply. I am sure I set "block_on_fault": 1 in my config, and my code is:

    dml_path_t execution_path = DML_PATH_HW;
    dml_job_t* dml_job_ptr = NULL;
    uint32_t job_size = 0u;

    // Query the job size for the hardware path, then allocate and initialize the job.
    dml_status_t status = dml_get_job_size(execution_path, &job_size);
    if (DML_STATUS_OK != status) {
        ERROR("An error {} occurred during getting job size.", static_cast<int>(status));
        return 1;
    }
    dml_job_ptr = (dml_job_t*)malloc(job_size);
    if (dml_job_ptr == NULL) {
        ERROR("Failed to allocate the job structure.");
        return 1;
    }
    status = dml_init_job(execution_path, dml_job_ptr);
    if (DML_STATUS_OK != status) {
        ERROR("An error {} occurred during job initialization.", static_cast<int>(status));
        free(dml_job_ptr);
        return 1;
    }

    // Allocate the buffer that holds the batched operations.
    uint32_t batch_buffer_length = 0u;
    status = dml_get_batch_size(dml_job_ptr, BATCH_SIZE, &batch_buffer_length);
    if (DML_STATUS_OK != status) {
        ERROR("An error {} occurred during getting batch size.", static_cast<int>(status));
        free(dml_job_ptr);
        return 1;
    }
    uint8_t* batch_buffer_ptr = (uint8_t*)malloc(batch_buffer_length);
    if (batch_buffer_ptr == NULL) {
        ERROR("Failed to allocate the batch buffer.");
        free(dml_job_ptr);
        return 1;
    }
    dml_job_ptr->operation = DML_OP_BATCH;
    dml_job_ptr->destination_first_ptr = batch_buffer_ptr;
    dml_job_ptr->destination_length = batch_buffer_length;
    dml_job_ptr->numa_id = numa;
    // With this flag set, the job fails with DML_STATUS_JOB_FLAGS_ERROR (14), as reported above.
    dml_job_ptr->flags = DML_FLAG_BLOCK_ON_FAULT;

    // Split the copy into BATCH_SIZE mem_move operations; the last one takes the remainder.
    size_t buffer_size = size / BATCH_SIZE;
    size_t remainder = size % BATCH_SIZE;
    for (size_t i = 0; i < BATCH_SIZE - 1; i++) {
        status = dml_batch_set_mem_move_by_index(dml_job_ptr,
                                                 i,
                                                 source + i * buffer_size,
                                                 destination + i * buffer_size,
                                                 buffer_size,
                                                 DML_FLAG_PREFETCH_CACHE);
        if (DML_STATUS_OK != status) {
            ERROR("An error {} occurred during setting of batch operation.",
                  static_cast<int>(status));
            free(batch_buffer_ptr);
            free(dml_job_ptr);
            return 1;
        }
    }
    status = dml_batch_set_mem_move_by_index(dml_job_ptr,
                                             BATCH_SIZE - 1,
                                             source + (BATCH_SIZE - 1) * buffer_size,
                                             destination + (BATCH_SIZE - 1) * buffer_size,
                                             buffer_size + remainder,
                                             DML_FLAG_PREFETCH_CACHE);
    if (DML_STATUS_OK != status) {
        ERROR("An error {} occurred during setting of batch operation.", static_cast<int>(status));
        free(batch_buffer_ptr);
        free(dml_job_ptr);
        return 1;
    }

    // Execute the batch and busy-poll until it completes.
    status = dml_execute_job(dml_job_ptr, DML_WAIT_MODE_BUSY_POLL);
    if (DML_STATUS_OK != status) {
        ERROR("An error {} occurred during job execution.", static_cast<int>(status));
        free(batch_buffer_ptr);
        free(dml_job_ptr);
        return 1;
    }
    if (dml_job_ptr->result != 0) {
        ERROR("Operation result is incorrect.");
        free(batch_buffer_ptr);
        free(dml_job_ptr);
        return 1;
    }

    // Release the job's internal resources, then the buffers allocated above.
    status = dml_finalize_job(dml_job_ptr);
    if (DML_STATUS_OK != status) {
        ERROR("An error {} occurred during job finalization.", static_cast<int>(status));
        free(batch_buffer_ptr);
        free(dml_job_ptr);
        return 1;
    }
    free(batch_buffer_ptr);
    free(dml_job_ptr);
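The splitting arithmetic in the snippet above (equal chunks of `size / BATCH_SIZE` bytes, with the remainder added to the last one) can be checked on its own, without a device. The following is a minimal sketch of that calculation only; `Chunk` and `split_into_chunks` are hypothetical names introduced for illustration and are not part of DML.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type: one batch entry, as a byte offset and length into the buffer.
struct Chunk {
    std::size_t offset;
    std::size_t length;
};

// Split `size` bytes into `batch_size` chunks the same way the snippet does:
// every chunk gets size / batch_size bytes and the last one also gets the remainder.
std::vector<Chunk> split_into_chunks(std::size_t size, std::size_t batch_size) {
    assert(batch_size > 0);
    const std::size_t base = size / batch_size;
    const std::size_t remainder = size % batch_size;
    std::vector<Chunk> chunks;
    chunks.reserve(batch_size);
    for (std::size_t i = 0; i < batch_size; ++i) {
        const bool is_last = (i + 1 == batch_size);
        chunks.push_back({i * base, is_last ? base + remainder : base});
    }
    return chunks;
}
```

One property worth noticing: when `size` is smaller than the batch size, `base` is 0, so every entry except the last has zero length and the last one carries the whole transfer.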

I hope you can help me.

@shenghansen
Author

And the config is:

[
  {
    "dev":"dsa0",
    "read_buffer_limit":0,
    "groups":[
      {
        "dev":"group0.0",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0
,
        "grouped_workqueues":[
          {
            "dev":"wq0.0",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq0.1",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq0.2",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq0.3",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq0.4",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq0.5",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq0.6",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq0.7",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.0",
            "group_id":0
          },
          {
            "dev":"engine0.1",
            "group_id":0
          },
          {
            "dev":"engine0.2",
            "group_id":0
          },
          {
            "dev":"engine0.3",
            "group_id":0
          }
        ]
      },
      {
        "dev":"group0.1",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      },
      {
        "dev":"group0.2",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      },
      {
        "dev":"group0.3",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      }
    ]
  },
  {
    "dev":"dsa2",
    "read_buffer_limit":0,
    "groups":[
      {
        "dev":"group2.0",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0
,
        "grouped_workqueues":[
          {
            "dev":"wq2.0",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq2.1",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq2.2",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq2.3",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq2.4",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq2.5",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq2.6",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq2.7",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine2.0",
            "group_id":0
          },
          {
            "dev":"engine2.1",
            "group_id":0
          },
          {
            "dev":"engine2.2",
            "group_id":0
          },
          {
            "dev":"engine2.3",
            "group_id":0
          }
        ]
      },
      {
        "dev":"group2.1",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      },
      {
        "dev":"group2.2",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      },
      {
        "dev":"group2.3",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      }
    ]
  },
  {
    "dev":"dsa4",
    "read_buffer_limit":0,
    "groups":[
      {
        "dev":"group4.0",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0
,
        "grouped_workqueues":[
          {
            "dev":"wq4.0",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq4.1",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq4.2",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq4.3",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq4.4",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq4.5",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq4.6",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq4.7",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine4.0",
            "group_id":0
          },
          {
            "dev":"engine4.1",
            "group_id":0
          },
          {
            "dev":"engine4.2",
            "group_id":0
          },
          {
            "dev":"engine4.3",
            "group_id":0
          }
        ]
      },
      {
        "dev":"group4.1",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      },
      {
        "dev":"group4.2",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      },
      {
        "dev":"group4.3",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      }
    ]
  },
  {
    "dev":"dsa6",
    "read_buffer_limit":0,
    "groups":[
      {
        "dev":"group6.0",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0
,
        "grouped_workqueues":[
          {
            "dev":"wq6.0",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq6.1",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq6.2",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq6.3",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq6.4",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq6.5",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq6.6",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          },
          {
            "dev":"wq6.7",
            "mode":"shared",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "max_batch_size":128,
            "max_transfer_size":2147483648,
            "type":"user",
            "name":"user_default_wq",
            "threshold":16
            
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine6.0",
            "group_id":0
          },
          {
            "dev":"engine6.1",
            "group_id":0
          },
          {
            "dev":"engine6.2",
            "group_id":0
          },
          {
            "dev":"engine6.3",
            "group_id":0
          }
        ]
      },
      {
        "dev":"group6.1",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      },
      {
        "dev":"group6.2",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      },
      {
        "dev":"group6.3",
        "read_buffers_reserved":0,
        "use_read_buffer_limit":0

      }
    ]
  }
]
