This repository has been archived by the owner on Feb 5, 2024. It is now read-only.

Samsung SAIT presentation about SYCL PIM language extensions
Ruyk committed Oct 27, 2023
1 parent 7dcdbe9 commit 5f39d03
Showing 1 changed file with 75 additions and 33 deletions.
108 changes: 75 additions & 33 deletions language/README.rst
@@ -71,61 +71,103 @@ Hyesun Hong,

* PIM/PNM technology enables computation directly on memory
* Avoids data movement, improving performance and reducing energy consumption
* PIM operates directly on memory banks by reading and storing on rows and columns
* Operates directly on memory banks by reading and storing on rows and columns
* Aquabolt-XL is the first demonstrator
* Can be dropped in on any memory controller
* CXL-PNM is the CXL variant of PNM and can work with multiple PIM blocks

SYCL Extension for PIM/PNM
* Goals
* Seamlessly integrate PIM/PNM operation into SYCL
* Allow combination of xGPU and PIM/PNM in one device kernel
* Not specific to one piece of hardware
* Design
* Vector operations seem like a natural fit, but offer no convergence guarantee and make the vector size explicit
* Model as special function unit
* Aligns with trends to model special functional units inside accelerators
* Compiler automatic mapping often not possible
* joint_matrix
* Group functions
* Easy to use
* Can easily be combined with device code
* Give necessary convergence guarantees
* Recap of SYCL work-item, work-group and group functions
* Group functions must be encountered in converged control flow
* Work in collaboration with Codeplay Software team
* Goals

* Seamlessly integrate PIM/PNM operation into SYCL
* Allow combination of xGPU and PIM/PNM in one device kernel
* Not specific to one piece of hardware

* Design

* Vector operations seem like a natural fit
* but offer no convergence guarantee and make the vector size explicit

* Model as special function unit

* Aligns with trends to model special functional units inside accelerators
* Compiler automatic mapping often not possible
* joint_matrix-like interface


* Group functions

* Easy to use
* Can easily be combined with device code
* Give necessary convergence guarantees


* Recap of SYCL work-item, work-group and group functions

* Group functions must be encountered in converged control flow

* Extension
* Extended group functions with additional overload of joint_reduce and new joint_transform and joint_inner_product
* Block size as template parameter, number of blocks as runtime parameter -> allows calculation of number of elements to process

* Extended group functions with additional overload of joint_reduce
* and new joint_transform and joint_inner_product
* Block size as template parameter, number of blocks as runtime parameter
* allows calculation of number of elements to process

* Extension for PNM
* Added new overloads of joint_exclusive_scan, joint_inclusive_scan, reduce_over_group
* PNM standalone has less opportunity for parallelism, also limited by memory controller
* -> Combine PNM and PIM, PNM generates commands for PIM blocks

* Added new overloads of joint_exclusive_scan,
* joint_inclusive_scan, reduce_over_group

* PNM standalone has less opportunity for parallelism

* limited by memory controller
* -> Combine PNM and PIM, PNM generates commands for PIM blocks

* Two modes

* PIM mode: PIM blocks can operate independently, can choose number of blocks
* PNM mode: Synchronized execution on multiple PIM blocks

* Mapping

* Every PIM block is one work-item
* PNM with attached PIM blocks forms one work-group

* Execution
* Work-item operations map to PIM operation
* Group functions map to PNM operation

* Work-item operations map to PIM operation
* Group functions map to PNM operation

* Example

* work-item execution maps to PIM
* group function maps to PNM

* Conclusion

* Integrate support for PIM/PNM into SYCL
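
The mapping described in the notes (every PIM block is one work-item, a PNM unit with its attached PIM blocks forms one work-group, work-item operations run on PIM and group functions run on PNM) can be sketched as a small host-side model. This is plain C++ rather than SYCL, and the names ``PimBlock``, ``PnmGroup``, ``scale``, ``local_sum`` and ``reduce_over_blocks`` are hypothetical illustrations, not part of the presented extension.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical model: one PIM block plays the role of one work-item.
template <std::size_t BlockSize>
struct PimBlock {
    std::array<int, BlockSize> row{};

    // "Work-item operation": runs locally on this block's own data.
    void scale(int factor) {
        for (int& x : row) x *= factor;
    }

    // Local partial result, still computed inside the block.
    int local_sum() const {
        return std::accumulate(row.begin(), row.end(), 0);
    }
};

// Hypothetical model: a PNM unit with attached PIM blocks plays the
// role of one work-group.
template <std::size_t BlockSize>
struct PnmGroup {
    std::vector<PimBlock<BlockSize>> blocks;

    // "Group function": every block takes part, and the PNM level
    // combines the per-block partial results.
    int reduce_over_blocks() const {
        assert(!blocks.empty());
        int total = 0;
        for (const auto& b : blocks) total += b.local_sum();
        return total;
    }
};
```

In this model, calling ``scale`` on each block corresponds to per-work-item code, while ``reduce_over_blocks`` must see all blocks at once, which mirrors the requirement that group functions are encountered in converged control flow.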

Q&A
* Are the proposed functions specific to PIM or could also be used with other HW?
* Can also be used with other hardware. Semantics not PIM-specific, but translation of C++ to SYCL
* Can also map nicely to other types of hardware, for example vector processor
* Are the proposed functions specific to PIM, or could they also be used with other HW?

* Can also be used with other hardware.
* Semantics not PIM-specific, but translation of C++ to SYCL
* Can also map nicely to other types of hardware, e.g. vector processor

* Why have the user explicitly specify a block-size?
* Not a hardware detail
* Rather a promise by the user that data-blocks will always be at least that big
* Promise allows device compiler to perform optimizations, efficient looping inside PIM unit
* Could the num_blocks runtime parameter be replaced by an iterator, requiring the range to be divisible by block-size?
* Yes, that is possible, mainly a design question
* Current version might have additional implications regarding alignment

* Not a hardware detail
* Rather a promise by the user that data-blocks
will always be at least that big
* Promise allows device compiler to perform optimizations,
efficient looping inside PIM unit

* Could the num_blocks runtime parameter be replaced by an iterator?

* This would require the range to be divisible by block-size
* Yes, that is possible, mainly a design question
* Current version might have additional implications regarding alignment
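
The block-size discussion above can be illustrated with a minimal sketch in plain C++. It assumes the input holds exactly ``num_blocks`` full blocks; the names ``element_count`` and ``blocked_reduce`` are hypothetical and are not the signatures of the proposed ``joint_reduce`` overload. The block size is a template parameter, so the inner loop has a compile-time trip count, while the number of blocks stays a runtime value.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>  // std::plus, used in the usage example

// Number of elements follows from the compile-time block size and the
// runtime block count (hypothetical helper).
template <std::size_t BlockSize>
constexpr std::size_t element_count(std::size_t num_blocks) {
    return BlockSize * num_blocks;
}

// Hypothetical host-side model of a blocked reduction. BlockSize is the
// user's promise about the data layout; num_blocks is chosen at run time.
template <std::size_t BlockSize, typename T, typename BinaryOp>
T blocked_reduce(const T* data, std::size_t num_blocks, T init, BinaryOp op) {
    static_assert(BlockSize > 0, "BlockSize must be positive");
    assert(data != nullptr || num_blocks == 0);

    T result = init;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const T* block = data + b * BlockSize;
        // Fixed trip count: the compiler is free to unroll or vectorize.
        for (std::size_t i = 0; i < BlockSize; ++i) {
            result = op(result, block[i]);
        }
    }
    return result;
}
```

For an 8-element ``int`` array ``v``, ``blocked_reduce<4>(v, 2, 0, std::plus<int>())`` folds two blocks of four elements each. An iterator-pair interface, as raised in the question above, would instead have to check that the range length is a multiple of the block size.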


2023-06-05