
[Performance]: GridSample in converted model runs very slowly on Arc770 dGPU #28448

Open
schrodingho opened this issue Jan 15, 2025 · 3 comments
Labels: performance (Performance related topics), support_request

Comments


schrodingho commented Jan 15, 2025

OpenVINO Version

Master Branch

Operating System

Windows System

Device used for inference

dGPU

OpenVINO installation

PyPi

Programming Language

Python

Hardware Architecture

x86 (64 bits)

Model used

https://github.com/autonomousvision/unimatch

Model quantization

No

Target Platform

OS Name: Microsoft Windows 11 Enterprise
OS Version: 10.0.22631 N/A Build 22631
CPU: 13th Gen Intel(R) Core(TM) i9-13900K
GPU.0: Intel(R) UHD Graphics 770
GPU.1: Intel(R) Arc(TM) A770 Graphics
OpenVINO version: 2024.6.0

Performance issue description

I used OpenVINO to accelerate Unimatch flow inference on a dGPU (Arc A770) and profiled the converted model using benchmark_app. The profiling report revealed that GridSample is the bottleneck, accounting for 80% of the total execution time.

To reduce latency, I replaced the PyTorch call F.grid_sample(input, grid, mode="bilinear", padding_mode="zeros", align_corners=True) with a decomposed version (from this implementation). After benchmarking, this modification reduced the latency from 458.70 ms to 215.41 ms without affecting the generated flows. I am curious why the original GridSample operator is slow on the Arc A770. Do you have any insights, or could you suggest other optimizations, such as a custom GridSample OpenCL kernel? I've attached the benchmark_app results and reports for reference (ori_unimatch is the original model and opt_unimatch is the modified one).
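As a side note for reproducing the per-operator breakdown, here is a minimal sketch for aggregating the counters CSV that benchmark_app writes with -report_type detailed_counters. The file name, the ';' separator, and the column names are assumptions that may differ between OpenVINO versions; adjust them to the actual file in %report_folder%.

import pandas as pd

# Assumed report name/layout inside %report_folder%; adjust to the actual file.
df = pd.read_csv("benchmark_detailed_counters_report.csv", sep=";")
df["realTime (ms)"] = pd.to_numeric(df["realTime (ms)"], errors="coerce")

# Total execution time per operator type, heaviest first
by_type = df.groupby("layerType")["realTime (ms)"].sum().sort_values(ascending=False)
print(by_type.head(10))
print("GridSample share: {:.0%}".format(by_type.get("GridSample", 0.0) / by_type.sum()))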

ori_unimatch:
(benchmark_app screenshot)

benchmark_app -m ori_unimatch.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"

[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Turn on performance counters for GPU.1 device since report type is detailed_counters.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 75.63 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : f32 / [...] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : f32 / [...] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_7) : f32 / [...] / [2,2,320,576]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 2
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_7) : f32 / [...] / [2,2,320,576]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 3059.41 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: Model0
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]   PERF_COUNT: True
[ INFO ]   ENABLE_CPU_PINNING: False
[ INFO ]   MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ]   GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ]   GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ]   CACHE_DIR:
[ INFO ]   CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ]   PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   COMPILATION_NUM_THREADS: 32
[ INFO ]   NUM_STREAMS: 2
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   INFERENCE_PRECISION_HINT: f16
[ INFO ]   DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ]   ACTIVATIONS_SCALE_FACTOR: 0.0
[ INFO ]   DEVICE_ID: 1
[ INFO ]   EXECUTION_DEVICES: ['GPU.1']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'img0'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'img1'!. This input will be filled with random values!
[ INFO ] Fill input 'img0' with random values
[ INFO ] Fill input 'img1' with random values 
[Step 10/11] Measuring performance (Start inference synchronously, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 460.74 ms
[Step 11/11] Dumping statistics report
[ INFO ] Performance counters report is stored to 
[ INFO ] Statistics report is stored to 
[ INFO ] Execution Devices:['GPU.1']
[ INFO ] Count:            131 iterations
[ INFO ] Duration:         60207.90 ms
[ INFO ] Latency:
[ INFO ]    Median:        458.77 ms
[ INFO ]    Average:       458.70 ms
[ INFO ]    Min:           452.05 ms
[ INFO ]    Max:           465.72 ms
[ INFO ] Throughput:   4.35 FPS

opt_unimatch:
(benchmark_app screenshot)

benchmark_app -m opt_unimatch.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"

[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ] Device info:
[ INFO ] GPU
[ INFO ] Build ................................. 2024.6.0-17404-4c0f47d2335-releases/2024/6
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Turn on performance counters for GPU.1 device since report type is detailed_counters.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 80.84 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : f32 / [...] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : f32 / [...] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_16) : f32 / [...] / [2,2,320,576]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 2
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     img0 (node: img0) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ]     img1 (node: img1) : u8 / [N,C,H,W] / [2,3,320,576]
[ INFO ] Model outputs:
[ INFO ]     ***NO_NAME*** (node: aten::reshape/Reshape_16) : f32 / [...] / [2,2,320,576]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 8530.97 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: Model0
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ]   PERF_COUNT: True
[ INFO ]   ENABLE_CPU_PINNING: False
[ INFO ]   MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ]   GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ]   GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ]   GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ]   CACHE_DIR:
[ INFO ]   CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ]   PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   COMPILATION_NUM_THREADS: 32
[ INFO ]   NUM_STREAMS: 2
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   INFERENCE_PRECISION_HINT: f16
[ INFO ]   DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ]   ACTIVATIONS_SCALE_FACTOR: 0.0
[ INFO ]   DEVICE_ID: 1
[ INFO ]   EXECUTION_DEVICES: ['GPU.1']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'img0'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'img1'!. This input will be filled with random values!
[ INFO ] Fill input 'img0' with random values
[ INFO ] Fill input 'img1' with random values 
[Step 10/11] Measuring performance (Start inference synchronously, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 242.54 ms
[Step 11/11] Dumping statistics report
[ INFO ] Performance counters report is stored to
[ INFO ] Statistics report is stored to 
[ INFO ] Execution Devices:['GPU.1']
[ INFO ] Count:            278 iterations
[ INFO ] Duration:         60109.22 ms
[ INFO ] Latency:
[ INFO ]    Median:        215.37 ms
[ INFO ]    Average:       215.41 ms
[ INFO ]    Min:           205.85 ms
[ INFO ]    Max:           229.31 ms
[ INFO ] Throughput:   9.25 FPS

Step-by-step reproduction

  1. Clone the Unimatch repository.
  2. Download the pretrained model GMFlow-scale2-regrefine6-mixdata from the Model_Zoo and save it in the pretrained folder.
  3. Follow the script gmflow_demo.sh in Scripts to run the model:
python main_flow.py \
--inference_dir demo/flow-davis \
--resume pretrained/gmflow-scale2-regrefine6-mixdata-train320x576-4e7b215d.pth \
--output_path output/gmflow-scale2-regrefine6-davis \
--padding_factor 16 \
--upsample_factor 4 \
--num_scales 2 \
--attn_splits_list 2 8 \
--corr_radius_list -1 4 \
--prop_radius_list -1 1 \
--reg_refine \
--num_reg_refine 2
  4. Add OpenVINO conversion code to it and compile the model (a compile sketch follows the conversion code below):
from pathlib import Path

import torch
import openvino as ov

# Run the conversion on CPU
ov_opt_device = "cpu"
model_without_ddp = model_without_ddp.to(ov_opt_device)

FIG_H = 320
FIG_W = 576

# Dummy inputs matching the 320x576 resolution, batch size 2
dummy_input1 = torch.randn(2, 3, FIG_H, FIG_W)
dummy_input2 = torch.randn(2, 3, FIG_H, FIG_W)

example_inputs = (
    dummy_input1,
    dummy_input2,
)
inputs = {
    "img0": dummy_input1,
    "img1": dummy_input2,
}
input_info = [(name, list(inp.shape)) for name, inp in inputs.items()]
UNIMATCH_OV_PATH = Path("opt_unimatch.xml")
model_without_ddp.eval()

with torch.no_grad():
    ov_model = ov.convert_model(model_without_ddp, input=input_info, example_input=example_inputs)
    ov.save_model(ov_model, UNIMATCH_OV_PATH, compress_to_fp16=True)
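For the "compile the model" part, a minimal sketch (GPU.1 is the Arc A770 on this machine; adjust the device name to your system):

core = ov.Core()
# Compile the saved IR for the dGPU; benchmark_app in the next step does this internally
compiled_model = core.compile_model(str(UNIMATCH_OV_PATH), "GPU.1")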
  5. Use benchmark_app to profile it:
benchmark_app -m %converted_model%.xml -d GPU.1 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"
  6. Change the F.grid_sample call in unimatch/matching.py to this implementation (a sketch of such a decomposition is shown below), and redo steps 4 and 5.
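The linked implementation is not inlined in this issue, so the following is only a minimal sketch of what a decomposed bilinear grid_sample with padding_mode="zeros" and align_corners=True can look like; the function name grid_sample_decomposed and the shapes in the sanity check are illustrative, not taken from the actual replacement:

import torch
import torch.nn.functional as F

def grid_sample_decomposed(img, grid):
    # img: [N, C, H, W]; grid: [N, Hg, Wg, 2] with coordinates in [-1, 1]
    N, C, H, W = img.shape
    x = grid[..., 0]
    y = grid[..., 1]

    # align_corners=True maps [-1, 1] to [0, W-1] and [0, H-1]
    x = (x + 1) * (W - 1) / 2
    y = (y + 1) * (H - 1) / 2

    x0, y0 = torch.floor(x), torch.floor(y)
    x1, y1 = x0 + 1, y0 + 1

    # Bilinear weights of the four surrounding pixels
    wa = (x1 - x) * (y1 - y)  # weight for (x0, y0)
    wb = (x1 - x) * (y - y0)  # weight for (x0, y1)
    wc = (x - x0) * (y1 - y)  # weight for (x1, y0)
    wd = (x - x0) * (y - y0)  # weight for (x1, y1)

    def gather(ix, iy):
        # padding_mode="zeros": zero out contributions from out-of-bounds corners
        mask = (ix >= 0) & (ix <= W - 1) & (iy >= 0) & (iy <= H - 1)
        idx = (iy.clamp(0, H - 1) * W + ix.clamp(0, W - 1)).long()
        idx = idx.view(N, 1, -1).expand(-1, C, -1)
        vals = img.reshape(N, C, H * W).gather(2, idx)
        return vals * mask.view(N, 1, -1)

    out = (gather(x0, y0) * wa.view(N, 1, -1)
           + gather(x0, y1) * wb.view(N, 1, -1)
           + gather(x1, y0) * wc.view(N, 1, -1)
           + gather(x1, y1) * wd.view(N, 1, -1))
    return out.view(N, C, grid.shape[1], grid.shape[2])

# Sanity check against the built-in op
img = torch.randn(2, 8, 32, 32)
grid = torch.rand(2, 16, 16, 2) * 2 - 1
ref = F.grid_sample(img, grid, mode="bilinear", padding_mode="zeros", align_corners=True)
assert torch.allclose(grid_sample_decomposed(img, grid), ref, atol=1e-4)

The decomposition trades one GridSample node for plain gather, multiply, and add nodes, for which the GPU plugin already has optimized kernels; that would be consistent with the latency drop reported above.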

Issue submission checklist

  • I'm reporting a performance issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.
schrodingho added the performance and support_request labels on Jan 15, 2025
dnkurek (Contributor) commented Jan 15, 2025

Hi, do you also have the same issue with the iGPU or CPU in your system?

It could be that the grid_sample kernel was simply never optimized, since you are running the slow reference version. Fixing this would probably involve writing an optimized version instead.

schrodingho (Author) commented

Hi, I just ran benchmarks on the iGPU (UHD 770) and the CPU (i9-13900K). The iGPU has the same issue (the grid_sample_ref kernel is slow):

ori_unimatch
(benchmark_app screenshot)

opt_unimatch
(benchmark_app screenshot)

The CPU seems to have no such issue (the original model is faster):

ori_unimatch
(benchmark_app screenshot)

opt_unimatch
(benchmark_app screenshot)
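(These runs presumably use the same benchmark_app invocation as above with only the device switched, e.g.:

benchmark_app -m ori_unimatch.xml -d GPU.0 -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"
benchmark_app -m ori_unimatch.xml -d CPU -api sync -infer_precision f16 -hint throughput -report_type detailed_counters -report_folder "%report_folder%"

and likewise for opt_unimatch.xml.)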

dnkurek (Contributor) commented Jan 16, 2025

Yeah, so it looks like grid_sample_ref needs to be optimized, perhaps by adding a grid_sample_opt version...
