sysmon failed on B580 #85

Open
alanzhai219 opened this issue Mar 7, 2025 · 17 comments

@alanzhai219

alanzhai219 commented Mar 7, 2025

Reproduce

dpkg -l | grep intel
ii  intel-gsc                                     0.9.5-112~u24.04                         amd64        Intel(R) Graphics System Controller Firmware
ii  intel-igc-core-2                              2.7.11                                   amd64        Intel(R) Graphics Compiler for OpenCL(TM)
ii  intel-igc-opencl-2                            2.7.11                                   amd64        Intel(R) Graphics Compiler for OpenCL(TM)
ii  intel-level-zero-gpu                          1.6.32567.17                             amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-media-va-driver:amd64                   24.1.0+dfsg1-1                           amd64        VAAPI driver for the Intel GEN8+ Graphics family
ii  intel-metrics-discovery                       1.13.179-1077~24.04                      amd64        Intel(R) Metrics Discovery Application Programming Interface --
ii  intel-metrics-library                         1.0.182-1077~24.04                       amd64        Intel(R) Metrics Library for MDAPI (Intel(R) Metrics Discovery
ii  intel-microcode                               3.20250211.0ubuntu0.24.04.1              amd64        Processor microcode firmware for Intel CPUs
ii  intel-ocloc                                   24.52.32224.14-1077~24.04                amd64        Tool for managing Intel Compute GPU device binary format
ii  intel-opencl-icd                              25.05.32567.17                           amd64        Intel graphics compute runtime for OpenCL
ii  libchewing3:amd64                             0.6.0-1build1                            amd64        intelligent phonetic input method library
ii  libchewing3-data                              0.6.0-1build1                            all          intelligent phonetic input method library - data files
ii  libdrm-intel1:amd64                           2.4.124+git2501180500.a7eb2c~oibaf~n     amd64        Userspace interface to intel-specific kernel DRM services -- runtime
ii  xserver-xorg-video-intel                      2:2.99.917+git20210115-1build1           amd64        X.Org X server -- Intel i8xx, i9xx display driver

sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) B580 Graphics 20.1.0 [1.6.32567.170000]
[opencl:gpu][opencl:0] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) B580 Graphics OpenCL 3.0 NEO  [25.05.32567.17]

Build

cd pti-gpu/tools/sysmon
mkdir build && cd build
cmake ..
make

Log

./sysmon -p
=====================================================================================
sysmon: /home/az/workspace/pti-gpu/tools/sysmon/main.cc:117: void PrintShorInfo(ze_driver_handle_t, zes_device_handle_t, uint32_t): Assertion `status == ZE_RESULT_SUCCESS' failed.
[2]    14188 IOT instruction (core dumped)  ./sysmon -p
@mschilling0
Contributor

Thanks, I was able to reproduce on Ubuntu 24.10 too. Looking into it.

@alanzhai219
Author

@mschilling0 Any updates? I think the problem comes from Level Zero (L0) or the driver.

@mschilling0
Contributor

Yes, I filed an issue with their team, and they seem to have accepted it. I will follow up and ask for an update.

@mschilling0
Contributor

No updates yet, other than that they've done an initial triage and seem to have accepted it. I will post any updates here. It might have to work its way through their processes.

@pratikbariintel

The zesDeviceGetProperties API is expected to return ZE_RESULT_ERROR_UNINITIALIZED (error code 0x78000001) when it is given a core handle. A core handle is what gets created here, because the zeInit-based Sysman initialization is used.

On a BMG system, Sysman should be initialized with the zesInit-based initialization, which creates a separate Sysman handle. That Sysman handle should be passed to zesDeviceGetProperties to fetch the correct values.
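
The assertion in the log above does not show which result code came back. A minimal sketch of a helper that prints the code instead, so that a failure like this one is recognizable as ZE_RESULT_ERROR_UNINITIALIZED. The numeric values are the ones from `ze_result_t` in `ze_api.h`; the constant names and `DescribeResult` are hypothetical and not part of sysmon or Level Zero.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Values copied from ze_result_t in ze_api.h. Only a few codes are listed;
// the constant names below are made up for this sketch.
constexpr uint32_t kZeResultSuccess = 0x0;
constexpr uint32_t kZeResultErrorUninitialized = 0x78000001;
constexpr uint32_t kZeResultErrorUnsupportedFeature = 0x78000003;
constexpr uint32_t kZeResultErrorInvalidArgument = 0x78000004;

// Hypothetical helper: formats "<call> returned 0x<code> (<name>)".
std::string DescribeResult(const std::string& call, uint32_t status) {
  std::ostringstream out;
  out << call << " returned 0x" << std::hex << status;
  switch (status) {
    case kZeResultSuccess:
      out << " (ZE_RESULT_SUCCESS)";
      break;
    case kZeResultErrorUninitialized:
      out << " (ZE_RESULT_ERROR_UNINITIALIZED)";
      break;
    case kZeResultErrorUnsupportedFeature:
      out << " (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE)";
      break;
    case kZeResultErrorInvalidArgument:
      out << " (ZE_RESULT_ERROR_INVALID_ARGUMENT)";
      break;
    default:
      out << " (not in this table)";
      break;
  }
  return out.str();
}
```

Logging this string before asserting would have pointed at the initialization path rather than at the driver.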

@pratikbariintel

Regarding BMG: the Xe KMD is enabled on BMG, and on the Xe KMD only zesInit-based Sysman initialization is supported. Legacy Sysman initialization is not supported.
https://github.com/intel/compute-runtime/blob/master/programmers-guide/SYSMAN.md

@alanzhai219
Author

alanzhai219 commented Mar 19, 2025

@pratikbariintel Hi, the current initialization uses zeInit:

status = zeInit(ZE_INIT_FLAG_GPU_ONLY);

If I replace zesDeviceGetProperties with zeDeviceGetProperties, the call succeeds and returns the correct properties.
So the next question is why zesDeviceGetProperties and the other zes APIs fail.
Do you mean that the zes_device_handle_t should be obtained as described in the guide you linked above?

@pratikbariintel

zeDeviceGetProperties is a core API and expects a core handle (ze_device_handle_t), so it succeeds here.
zesDeviceGetProperties, however, is a Sysman API and expects a Sysman handle (zes_device_handle_t), because the core handle and the Sysman handle have been separated.
All zes APIs are Sysman APIs and therefore require Sysman handles.
The flow for using the zes APIs should be zesInit, zesDriverGet, and zesDeviceGet.
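
The zesInit, zesDriverGet, zesDeviceGet flow described above can be sketched as follows. To keep the example compilable without the Level Zero headers, the calls go through a hypothetical `Api` wrapper whose members mirror the parameter order of the real functions, and the handle and result types are plain stand-ins. `FakeApi` exists only to exercise the sketch. In real code the calls and types would come from `<level_zero/zes_api.h>`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins for ze_result_t, zes_driver_handle_t and zes_device_handle_t.
using Result = uint32_t;
constexpr Result kSuccess = 0;  // ZE_RESULT_SUCCESS
using DriverHandle = int;
using DeviceHandle = int;

// Collects the Sysman device handles of every driver into `devices`.
// Returns false as soon as any call fails.
template <typename Api>
bool GetSysmanDevices(Api& api, std::vector<DeviceHandle>& devices) {
  if (api.zesInit(0) != kSuccess) return false;

  // Two-call pattern: the first call reports the count, the second fills the array.
  uint32_t driver_count = 0;
  if (api.zesDriverGet(&driver_count, nullptr) != kSuccess) return false;
  std::vector<DriverHandle> drivers(driver_count);
  if (driver_count > 0 &&
      api.zesDriverGet(&driver_count, drivers.data()) != kSuccess) {
    return false;
  }

  for (DriverHandle driver : drivers) {
    uint32_t device_count = 0;
    if (api.zesDeviceGet(driver, &device_count, nullptr) != kSuccess) return false;
    std::vector<DeviceHandle> found(device_count);
    if (device_count > 0 &&
        api.zesDeviceGet(driver, &device_count, found.data()) != kSuccess) {
      return false;
    }
    devices.insert(devices.end(), found.begin(), found.end());
  }
  return true;
}

// Fake API for illustration only: one driver that exposes two devices.
struct FakeApi {
  Result zesInit(uint32_t) { return kSuccess; }
  Result zesDriverGet(uint32_t* count, DriverHandle* out) {
    if (out != nullptr) out[0] = 7;
    *count = 1;
    return kSuccess;
  }
  Result zesDeviceGet(DriverHandle, uint32_t* count, DeviceHandle* out) {
    if (out != nullptr) {
      out[0] = 100;
      out[1] = 101;
    }
    *count = 2;
    return kSuccess;
  }
};
```

The handles collected this way are the ones to pass to zesDeviceGetProperties and the other zes calls.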

@alanzhai219
Author

alanzhai219 commented Mar 19, 2025

@pratikbariintel Got it. I misunderstood the API based on this declaration in the spec:

/// @brief Handle of device object
typedef ze_device_handle_t zes_device_handle_t;

https://github.com/oneapi-src/level-zero/blob/3c938e21d827af014971d69dfd66759c2444e4d0/include/zes_api.h#L34C13-L34C48
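
The typedef above explains why the mix-up compiles cleanly: an alias does not create a new type, so the compiler cannot tell a core handle from a Sysman handle. A small sketch of that point, with the two declarations reproduced inside a `sketch` namespace so it builds without the Level Zero headers. The `CoreDevice` and `SysmanDevice` wrappers are hypothetical and only show one way a caller could get a compile-time distinction.

```cpp
#include <cassert>
#include <type_traits>

namespace sketch {

// Same shape as the declarations in ze_api.h and zes_api.h.
typedef struct _ze_device_handle_t* ze_device_handle_t;
typedef ze_device_handle_t zes_device_handle_t;

// A typedef is only an alias: both names refer to one and the same type.
static_assert(std::is_same<ze_device_handle_t, zes_device_handle_t>::value,
              "core and Sysman handle types are identical to the compiler");

// Hypothetical wrappers that make the two kinds of handle distinct types.
struct CoreDevice {
  ze_device_handle_t handle;
};
struct SysmanDevice {
  zes_device_handle_t handle;
};
static_assert(!std::is_same<CoreDevice, SysmanDevice>::value,
              "wrapped handles are distinct types");

// Accepts only the Sysman wrapper; passing a CoreDevice does not compile.
inline bool HasHandle(SysmanDevice device) { return device.handle != nullptr; }

}  // namespace sketch
```

With the raw typedefs, the only signal of a wrong handle is the runtime result code.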

@AshwinKumarKulkarni

To access Sysman functions on B580:
Now: call zesInit(), zesDriverGet(), and zesDeviceGet(). The Sysman device handles returned by zesDeviceGet() should be used for all subsequent Sysman APIs. This is the recommended option.

Later: we have also added support for using core device handles with the zesDevice*** APIs after a successful Sysman initialization through zesInit(). This support has not yet reached the public driver and may be available in approximately a month's time.

@alanzhai219
Author

zesDeviceEnumEngineGroups still does not return the correct engines on B580 when given a zes_device_handle_t.
@AshwinKumarKulkarni @pratikbariintel

@pratikbariintel

@alanzhai219 Support for enumerating engine handles on the Xe driver was added recently. It will be available with the new driver release in 3-4 weeks.

@mschilling0
Contributor

@pratikbariintel @AshwinKumarKulkarni So, from this discussion, I think sysman still needs to be fixed. Ideally, we need to keep compatibility with devices older than BMG.

Is there an API call where we can determine if the device should use legacy mode or zesInit / zes_handle mode?

Should we just use a failed return code? Check /proc for xe (Linux)?
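
For the "/proc" idea, one possible check on Linux is whether the `xe` kernel module appears in `/proc/modules`, which lists one loaded module per line with the module name as the first field. A minimal sketch; `ListsModule` and `IsXeModuleLoaded` are hypothetical names, not existing sysmon code. A loaded module does not prove that a given GPU is bound to it (i915 and xe can both be loaded), so this is at best a hint and not a per-device answer.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns true if `module` is the first whitespace-separated field of any
// line read from `in`. The comparison is exact, so "xen_blkfront" does not
// match "xe".
bool ListsModule(std::istream& in, const std::string& module) {
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string name;
    if ((fields >> name) && name == module) return true;
  }
  return false;
}

// Hypothetical convenience wrapper: checks the running Linux kernel.
// Returns false if /proc/modules cannot be opened.
bool IsXeModuleLoaded() {
  std::ifstream modules("/proc/modules");
  return modules.is_open() && ListsModule(modules, "xe");
}
```

Taking the stream as a parameter keeps the parsing testable without touching the real `/proc`.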

@pratikbariintel

At present there is no separate API to check whether legacy mode or the new mode applies. Please refer to the following for Sysman initialization:
https://github.com/intel/compute-runtime/blob/master/programmers-guide/SYSMAN.md#support-and-limitations

@AshwinKumarKulkarni

The pseudocode below may help decide between zesInit-based and legacy initialization; please check whether it is useful.

  // Core initialization
  zeInit(..)
  zeDriverGet(...)
  zeDeviceGet(...)

  // Check the GPU platform: chain the IP version extension into pNext
  ze_device_properties_t properties = {};
  ze_device_ip_version_ext_t ip_version_ext{};
  properties.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
  properties.pNext = &ip_version_ext;
  ip_version_ext.stype = ZE_STRUCTURE_TYPE_DEVICE_IP_VERSION_EXT;
  ip_version_ext.pNext = nullptr;
  result = zeDeviceGetProperties(ze_device, &properties);
  CHECK_RESULT_FOR_SUCCESS(result);

  // Decision
  if (properties.type == ZE_DEVICE_TYPE_GPU && properties.vendorId == 0x8086) {
    // properties.pNext still points at ip_version_ext, filled in by the driver
    ze_device_ip_version_ext_t *ip_version =
        static_cast<ze_device_ip_version_ext_t *>(properties.pNext);
    if (ip_version->ipVersion >= 0x05004000) { // BMG's IP version
      // Use zesInit-based Sysman init (recommended);
      // legacy init is not supported on Xe KMD
    } else {
      // Use legacy Sysman init
    }
  }
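
The decision step of the pseudocode above can be isolated into a small pure function, which makes the threshold easy to test without a device. This is a sketch: `UseZesInit`, `DecodeIpVersion` and the constant names are hypothetical, and non-Intel devices are sent to the legacy path here, a case the pseudocode leaves open. The bit layout used for decoding (architecture in bits 31..22, release in bits 21..14, revision in bits 5..0) is an assumption about how the Intel driver packs the value; the Level Zero spec itself treats `ipVersion` as implementation-defined. Under that assumed layout, 0x05004000 decodes to 20.1.0, which agrees with the "B580 Graphics 20.1.0" string in the sycl-ls output at the top of this issue.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kIntelVendorId = 0x8086;
constexpr uint32_t kBmgIpVersion = 0x05004000;  // threshold from the pseudocode

// True when zesInit-based Sysman init should be used, false for legacy init.
// Assumes the caller already checked that the device is a GPU.
constexpr bool UseZesInit(uint32_t vendor_id, uint32_t ip_version) {
  return vendor_id == kIntelVendorId && ip_version >= kBmgIpVersion;
}

struct IpVersion {
  uint32_t architecture;
  uint32_t release;
  uint32_t revision;
};

// ASSUMPTION: architecture[31:22], release[21:14], revision[5:0].
// This layout is not guaranteed by the Level Zero spec.
constexpr IpVersion DecodeIpVersion(uint32_t value) {
  return IpVersion{value >> 22, (value >> 14) & 0xFF, value & 0x3F};
}

static_assert(UseZesInit(kIntelVendorId, kBmgIpVersion),
              "BMG itself takes the zesInit path");
```

Comparing against a single packed value relies on newer platforms always having a numerically larger `ipVersion`, which is what the pseudocode assumes as well.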

@alanzhai219
Author

These infrastructure-related APIs should be ready before the new hardware is released. @pratikbariintel @AshwinKumarKulkarni

@mschilling0
Contributor

Thanks! I will try it out next week.
