OS Name: Microsoft Windows 11 Enterprise
OS Version: 10.0.22631 N/A Build 22631
CPU: 13th Gen Intel(R) Core(TM) i9-13900K
GPU.0: Intel(R) UHD Graphics 770
GPU.1: Intel(R) Arc(TM) A770 Graphics
OpenVINO version: 2024.6.0
Performance issue description
I used OpenVINO to accelerate Unimatch flow inference on a dGPU (Arc A770) and profiled the converted model using benchmark_app. The profiling report revealed that GridSample is the bottleneck, accounting for 80% of the total execution time.
To reduce latency, I replaced the PyTorch call F.grid_sample(input, grid, mode="bilinear", padding_mode="zeros", align_corners=True) with a decomposed version (from this implementation). After benchmarking, this change reduced the average latency from 458.70 ms to 215.41 ms without affecting the generated flows. I am curious why the original GridSample operator is slow on the Arc A770. Do you have any insights, or could you suggest other optimizations, such as a custom GridSample OpenCL kernel? I've attached the benchmark_app results and reports for reference (ori_unimatch for the original model, opt_unimatch for the modified one).
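For reference, the idea behind the decomposition can be sketched in pure NumPy (a hedged sketch, not the exact implementation linked above; it mirrors the semantics of F.grid_sample with mode="bilinear", padding_mode="zeros", align_corners=True using only gather and arithmetic ops, which is essentially what the decomposed graph does):

```python
import numpy as np

def grid_sample_bilinear(inp, grid):
    """Decomposed bilinear grid sample, "zeros" padding, align_corners=True.

    inp:  (N, C, H, W) feature map
    grid: (N, Ho, Wo, 2) sampling locations, x/y normalized to [-1, 1]
    """
    N, C, H, W = inp.shape
    # Unnormalize: with align_corners=True, -1 maps to 0 and +1 to size-1.
    x = (grid[..., 0] + 1.0) * (W - 1) / 2.0
    y = (grid[..., 1] + 1.0) * (H - 1) / 2.0
    x0 = np.floor(x).astype(np.int64)
    y0 = np.floor(y).astype(np.int64)
    x1, y1 = x0 + 1, y0 + 1

    def gather(ix, iy):
        # "zeros" padding: out-of-range taps contribute nothing.
        valid = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)
        ixc = np.clip(ix, 0, W - 1)
        iyc = np.clip(iy, 0, H - 1)
        b = np.arange(N)[:, None, None]        # broadcast batch index
        v = inp[b, :, iyc, ixc]                # -> (N, Ho, Wo, C)
        v = np.moveaxis(v, -1, 1)              # -> (N, C, Ho, Wo)
        return v * valid[:, None].astype(inp.dtype)

    # Bilinear weights for the four neighbouring taps.
    wa = ((x1 - x) * (y1 - y))[:, None]
    wb = ((x1 - x) * (y - y0))[:, None]
    wc = ((x - x0) * (y1 - y))[:, None]
    wd = ((x - x0) * (y - y0))[:, None]
    return (wa * gather(x0, y0) + wb * gather(x0, y1)
            + wc * gather(x1, y0) + wd * gather(x1, y1))
```

A decomposition along these lines, traced in PyTorch, exports as plain Gather/Multiply/Add nodes instead of a single GridSample op, which is why the GPU plugin's GridSample kernel is bypassed.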
ori_unimatch:
benchmark_app -m ori_unimatch.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Turn on performance counters for GPU.1 device since report type is detailed_counters.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 75.63 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ] img0 (node: img0) : f32 / [...] / [2,3,320,576]
[ INFO ] img1 (node: img1) : f32 / [...] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ] ***NO_NAME*** (node: aten::reshape/Reshape_7) : f32 / [...] / [2,2,320,576]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 2
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ] img0 (node: img0) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] img1 (node: img1) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ] ***NO_NAME*** (node: aten::reshape/Reshape_7) : f32 / [...] / [2,2,320,576]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 3059.41 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ] NETWORK_NAME: Model0
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ] PERF_COUNT: True
[ INFO ] ENABLE_CPU_PINNING: False
[ INFO ] MODEL_PRIORITY: Priority.MEDIUM
[ INFO ] GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ] GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ] GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ] GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ] GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ] CACHE_DIR:
[ INFO ] CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ] PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ] EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ] COMPILATION_NUM_THREADS: 32
[ INFO ] NUM_STREAMS: 2
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ] INFERENCE_PRECISION_HINT: f16
[ INFO ] DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ] ACTIVATIONS_SCALE_FACTOR: 0.0
[ INFO ] DEVICE_ID: 1
[ INFO ] EXECUTION_DEVICES: ['GPU.1']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'img0'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'img1'!. This input will be filled with random values!
[ INFO ] Fill input 'img0' with random values
[ INFO ] Fill input 'img1' with random values
[Step 10/11] Measuring performance (Start inference synchronously, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 460.74 ms
[Step 11/11] Dumping statistics report
[ INFO ] Performance counters report is stored to
[ INFO ] Statistics report is stored to
[ INFO ] Execution Devices:['GPU.1']
[ INFO ] Count: 131 iterations
[ INFO ] Duration: 60207.90 ms
[ INFO ] Latency:
[ INFO ] Median: 458.77 ms
[ INFO ] Average: 458.70 ms
[ INFO ] Min: 452.05 ms
[ INFO ] Max: 465.72 ms
[ INFO ] Throughput: 4.35 FPS
opt_unimatch:
benchmark_app -m opt_unimatch.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Turn on performance counters for GPU.1 device since report type is detailed_counters.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 80.84 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ] img0 (node: img0) : f32 / [...] / [2,3,320,576]
[ INFO ] img1 (node: img1) : f32 / [...] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ] ***NO_NAME*** (node: aten::reshape/Reshape_16) : f32 / [...] / [2,2,320,576]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 2
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ] img0 (node: img0) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] img1 (node: img1) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ] ***NO_NAME*** (node: aten::reshape/Reshape_16) : f32 / [...] / [2,2,320,576]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 8530.97 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ] NETWORK_NAME: Model0
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ] PERF_COUNT: True
[ INFO ] ENABLE_CPU_PINNING: False
[ INFO ] MODEL_PRIORITY: Priority.MEDIUM
[ INFO ] GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ] GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ] GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ] GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ] GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ] CACHE_DIR:
[ INFO ] CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ] PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ] EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ] COMPILATION_NUM_THREADS: 32
[ INFO ] NUM_STREAMS: 2
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ] INFERENCE_PRECISION_HINT: f16
[ INFO ] DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ] ACTIVATIONS_SCALE_FACTOR: 0.0
[ INFO ] DEVICE_ID: 1
[ INFO ] EXECUTION_DEVICES: ['GPU.1']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'img0'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'img1'!. This input will be filled with random values!
[ INFO ] Fill input 'img0' with random values
[ INFO ] Fill input 'img1' with random values
[Step 10/11] Measuring performance (Start inference synchronously, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 242.54 ms
[Step 11/11] Dumping statistics report
[ INFO ] Performance counters report is stored to
[ INFO ] Statistics report is stored to
[ INFO ] Execution Devices:['GPU.1']
[ INFO ] Count: 278 iterations
[ INFO ] Duration: 60109.22 ms
[ INFO ] Latency:
[ INFO ] Median: 215.37 ms
[ INFO ] Average: 215.41 ms
[ INFO ] Min: 205.85 ms
[ INFO ] Max: 229.31 ms
[ INFO ] Throughput: 9.25 FPS
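To quantify how much of the runtime each op type accounts for, the rows of the detailed_counters report can be aggregated by layer type. A minimal sketch (the column names "layerType" and "realTime (ms)" and the ';' separator are assumptions about the report layout; adjust them to match your CSV):

```python
import csv
from collections import defaultdict

def time_share_by_type(rows, type_key="layerType", time_key="realTime (ms)"):
    """Aggregate execution time per op type and return fractional shares."""
    totals = defaultdict(float)
    for row in rows:
        try:
            totals[row[type_key]] += float(row[time_key])
        except (KeyError, ValueError):
            continue  # skip malformed rows / entries without a numeric time
    grand = sum(totals.values()) or 1.0
    return {t: v / grand
            for t, v in sorted(totals.items(), key=lambda kv: -kv[1])}

# Example usage with a ';'-separated report file (path is illustrative):
# with open("benchmark_detailed_counters_report.csv", newline="") as f:
#     rows = list(csv.DictReader(f, delimiter=";"))
# print(time_share_by_type(rows))
```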
Hi, do you also see the same issue with the iGPU or CPU in your system?
It could be that the grid_sample kernel simply was never optimized, and you are running the slow reference version. Fixing this would probably involve writing an optimized kernel instead.
OpenVINO Version
Master Branch
Operating System
Windows System
Device used for inference
dGPU
OpenVINO installation
PyPi
Programming Language
Python
Hardware Architecture
x86 (64 bits)
Model used
https://github.com/autonomousvision/unimatch
Model quantization
No
Target Platform
OS Name: Microsoft Windows 11 Enterprise
OS Version: 10.0.22631 N/A Build 22631
CPU: 13th Gen Intel(R) Core(TM) i9-13900K
GPU.0: Intel(R) UHD Graphics 770
GPU.1: Intel(R) Arc(TM) A770 Graphics
OpenVINO version: 2024.6.0
Step-by-step reproduction
- Download GMFlow-scale2-regrefine6-mixdata from the Model_Zoo and save it in the pretrained folder.
- Use gmflow_demo.sh in Scripts to run the model.
- Change F.grid_sample in /unimatch/matching.py to this implementation, and redo steps 4 and 5.

Issue submission checklist