From 5f39d0357234317a3a337c99c3edc67b51abffda Mon Sep 17 00:00:00 2001
From: Ruyman Reyes Castro
Date: Tue, 24 Oct 2023 20:01:39 +0100
Subject: [PATCH] Samsung SAIT presentation about SYCL PIM language extensions

---
 language/README.rst | 108 ++++++++++++++++++++++++++++++--------------
 1 file changed, 75 insertions(+), 33 deletions(-)

diff --git a/language/README.rst b/language/README.rst
index ccefa5c..b9aa0c4 100644
--- a/language/README.rst
+++ b/language/README.rst
@@ -71,61 +71,103 @@ Hyesun Hong,
 
 * PIM/PNM technology enables computation directly on memory
 * Prevents data movement improving performance and reducing consumption
-* PIM operates directly on memory banks by reading and storing on rows and columns
+* Operates directly on memory banks by reading and storing on rows and columns
 * Aquabolt-XL is the first demonstrator
 * Can be drop in on any memory controller
 * CXL-PNM is the CXL variant for PNM, can work with multiple PIM
 
 SYCL Extension for PIM/PNM
-    * Goals
-        * Seamlessly integrate PIM/PNM operation into SYCL
-        * Allow combination of xGPU and PIM/PNM in one device kernel
-        * Not specific to one hardware
-    * Design
-        * Vector operation seem like natural fit, but no convergence guarantee and vector size explicit
-    * Model as special function unit
-        * Aligns with trends to model special functional units inside accelerators
-        * Compiler automatic mapping often not possible
-        * joint_matrix
-    * Group functions
-        * Easy to use
-        * Can easily be combined with device code
-        * Give necessary convergence guarantees
-    * Recap of SYCL work-item, work-group and group functions
-        * Group functions must be encountered in converged control flow
+* Work in collaboration with Codeplay Software team
+* Goals
+
+  * Seamlessly integrate PIM/PNM operation into SYCL
+  * Allow combination of xGPU and PIM/PNM in one device kernel
+  * Not specific to one hardware
+
+* Design
+
+  * Vector operation seem like natural fit
+  * no convergence guarantee and vector size explicit
+
+* Model as special function unit
+
+  * Aligns with trends to model special functional units inside accelerators
+  * Compiler automatic mapping often not possible
+  * joint_matrix-like interface
+
+
+* Group functions
+
+  * Easy to use
+  * Can easily be combined with device code
+  * Give necessary convergence guarantees
+
+
+* Recap of SYCL work-item, work-group and group functions
+
+  * Group functions must be encountered in converged control flow
+
 * Extension
-    * Extended group functions with additional overload of joint_reduce and new joint_transform and joint_inner_product
-    * Block size as template parameter, number of blocks as runtime parameter -> allows calculation of number of elements to process
+
+  * Extended group functions with additional overload of joint_reduce
+  * and new joint_transform and joint_inner_product
+  * Block size as template parameter, number of blocks as runtime parameter
+  * allows calculation of number of elements to process
+
 * Extension for PNM
-    * Added new overloads of joint_exclusive_scan, joint_inclusive_scan, reduce_over_group
-* PNM standalone has less opportunity for parallelism, also limited by memory controller
-    * -> Combine PNM and PIM, PNM generates commands for PIM blocks
+
+  * Added new overloads of joint_exclusive_scan,
+  * joint_inclusive_scan, reduce_over_group
+
+* PNM standalone has less opportunity for parallelism
+
+  * limited by memory controller
+  * -> Combine PNM and PIM, PNM generates commands for PIM blocks
+
 * Two modes
+
   * PIM mode: PIM blocks can operate independently, can choose number of blocks
   * PNM mode: Synchronized execution on multiple PIM blocks
+
 * Mapping
+
   * Every PIM block is one work-item
   * PNM with attached PIM blocks forms one work-group
+
 * Execution
-    * Work-item operations map to PIM operation
-    * Group functions map to PNM operation
+
+  * Work-item operations map to PIM operation
+  * Group functions map to PNM operation
+
 * Example
+
   * work-item execution maps to PIM
   * group function maps to PNM
+
 * Conclusion
+
   * Integrate support for PIM/PNM into SYCL
 
 Q&A
-* Are the proposed functions specific to PIM or could also be used with other HW?
-    * Can also be used with other hardware. Semantics not PIM-specific, but translation of C++ to SYCL
-    * Can also map nicely to other types of hardware, for example vector processor
+* Are the proposed functions specific to PIM, could also be used with other HW?
+
+  * Can also be used with other hardware.
+  * Semantics not PIM-specific, but translation of C++ to SYCL
+  * Can also map nicely to other types of hardware, e.g. vector processor
+
 * Why have the user explicitly specify a block-size?
-    * Not a hardware detail
-    * Rather a promise by the user that data-blocks will always be at least that big
-    * Promise allows device compiler to perform optimizations, efficient looping inside PIM unit
-* Could num_blocks runtime parameter be replaced by iterator, requiring to be divisable by block-size
-    * Yes, that is possible, mainly a design question
-    * Current version might have additional implications regarding alignment
+
+  * Not a hardware detail
+  * Rather a promise by the user that data-blocks
+    will always be at least that big
+  * Promise allows device compiler to perform optimizations,
+    efficient looping inside PIM unit
+
+* Could num_blocks runtime parameter be replaced by iterator?
+
+  * requires to be divisable by block-size
+  * Yes, that is possible, mainly a design question
+  * Current version might have additional implications regarding alignment
 
 2023-06-05