diff --git a/README.md b/README.md
index 9b278ec..296d17b 100644
--- a/README.md
+++ b/README.md
@@ -52,7 +52,7 @@ git clone https://github.com/oprecomp/HLS_BLSTM.git /actions/hls_blst
 cd
 make snap_config (In the ncurses menu select HLS_BLSTM)
 ```
- * Latest supported [snap version:](https://github.com/open-power/snap/commit/2fb8fb85f9a6ec7bdbf837522c8ce839e87de281)
+ * Latest supported snap version: [2fb8fb85f9a6ec7bdbf837522c8ce839e87de281](https://github.com/open-power/snap/commit/2fb8fb85f9a6ec7bdbf837522c8ce839e87de281) (Updated 27-09-2019)
 ```Bash
 cd
 git checkout 2fb8fb85f9a6ec7bdbf837522c8ce839e87de281
@@ -81,6 +81,12 @@ xsim -gui hardware/six/xsim/latest/top.wdb &
 ```
 * Run the hardware version (bitstream preparation on x86, run on POWER8/9)
+ * Due to a bug in floating-point to fixed-point conversion in C++ synthesizable code in versions of Vivado > 2017.1, we had to cast the input pixel values from floating point to fixed point on the CPU.
+ * To do so, we need the equivalent Xilinx libraries in order to compile the executable on POWER.
+ * Since these libraries are copyrighted by Xilinx, we cannot include them in this repo.
+ * So, given that we are ready to execute the action on POWER, e.g. server `powerserver`, and we have already verified the cosim on an x86 development server, e.g. `devhostx86`, we need to copy the required Xilinx libraries from `devhostx86` to `powerserver`, as follows (assuming we are logged in to `powerserver`):
+ * `scp -r devhostx86:/actions/hls_blstm/sw/third-party/xilinx/* /actions/hls_blstm/sw/third-party/xilinx/third-party/xilinx/`
+ * The Xilinx libraries on `devhostx86` should have been copied from the Xilinx installation dir to `hls_blstm/sw/third-party/xilinx/` in an earlier step, when executing `make` in `hls_blstm/snap_modification_files`.
 ```Bash
 cd
 make image
@@ -89,18 +95,87 @@ scp hardware/build/Images/.bin user@remoteP8Server://path_to_bi
 (on the remote POWER8/9 server, given that you have cloned the repo and prepared the files as on x86)
 sudo capi-flash-script /path_to-bin/file.bin
 cd /actions/hls_blstm/sw
-SNAP_CONFIG=0x0 make
+SNAP_CONFIG=FPGA make
 sudo ../../../software/snap_maint -vvv
 sudo SNAP_CONFIG=FPGA ./snap_blstm -i ../data/samples_1/ -g ../data/gt_1/ -C0
 ```
+* For example this is sample output for 1 image (click to expand)
+

+
+ ##### SNAP_CONFIG=FPGA ./snap_blstm -i ../data/samples_1/ -g ../data/gt_1/ -o out.txt
+
+ ```bash
+ INFO: Read 1 files from path ../data/samples_1/
+ INFO: Read 1 files from path ../data/gt_1/
+ DEBUG: listOfImages.size() = 1
+ Start ...
+ DEBUG: numberOfColumnsVec[0] = 566, total_pixels_in_action = 0
+ INFO: numberOfColumnsVec[0] = 566
+ ACTION PARAMETERS:
+ input image 0: ../data/samples_1/fontane_brandenburg01_1862_0043_1600px_010001.raw.lnrm.png.txt, 566 columns, 28300 fw-bw pixels, 113200 bytes
+ output: out.txt
+ type_in: 0 HOST_DRAM
+ addr_in: 00007fff9adf0000
+ type_out: 0 HOST_DRAM
+ addr_out: 00000000359d0000
+ size_in: 113200 (0x0001ba30)
+ size_out: 332 (0x0000014c)
+ prepare blstm job of 80 bytes size
+ 00000000: 00 00 df 9a ff 7f 00 00 30 ba 01 00 00 00 12 00 | ........0.......
+ 00000010: 00 00 9d 35 00 00 00 00 4c 01 00 00 00 00 23 00 | ...5....L.......
+ 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
+ 00000030: 36 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | 6...............
+ 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
+
+ INFO: Accelerator returned code on MMIO (AXILite job struct field) : 126
+ INFO: AXI transactions registered on MMIO : In: 1769(0x6e9), Out: 5(0x5)
+ 00000000: 00 00 df 9a ff 7f 00 00 30 ba 01 00 00 00 12 00 | ........0.......
+ 00000010: 00 00 9d 35 00 00 00 00 4c 01 00 00 00 00 23 00 | ...5....L.......
+ 00000020: 7e 00 00 00 44 00 00 00 e9 06 00 00 05 00 00 00 | ....D...........
+ 00000030: 36 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | 6...............
+ 00000040: 44 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | D...............
+
+ DEBUG tb: vecPredictedStringLen[0] = 68
+ INFO: writing output data 0x359d0000 68 uintegers to out.txt
+ INFO: RETC=102
+ INFO: SNAP run 0 blstm took 96411 usec
+ 0 Expected: spruches nichts, daß eine leise Bitterkeit oder ein Wort der Resig-
+ Predicted: spruches nichts, daß eine leise Bitterkeit oder ein Wort der Resig- Accuracy: 100 %
+ Predicted id: 72 69 71 74 56 61 58 72 1 67 62 56 61 73 72 8 1 57 54 86 1 58 62 67 58 1 65 58 62 72 58 1 27 62 73 73 58 71 64 58 62 73 1 68 57 58 71 1 58 62 67 1 0 48 68 71 73 1 57 58 71 1 43 58 72 62 60 9
+ INFO: Accelerator status code on MMIO : 126 (0x7e)
+ INFO: AXI transaction registered on MMIO :
+ INFO: In: 1769 (0x6e9)
+ INFO: Out:5 (0x5)
+ Accuracy: 100%
+ Measured time ... 0 seconds (111596 us) for 1 images. Action time 96411 us (96411 us per action -> 1 images, ~96411 us / image)
+ ```
+

+
+
+* You can choose the level of verbosity in [`#define DEBUG_LEVEL LOG_CRITICAL`](https://github.ibm.com/DID/hls_blstm/blob/master/include/common_def.h) (when you measure timing, choose the `LOG_NOTHING` option).
+
+## Accelerator Scaling
+* The HLS_BLSTM action is designed so that the number of parallel processing engines scales through a single option, `HW_THREADS_PER_ACTION`, in [`common_def.h`](https://github.ibm.com/DID/hls_blstm/blob/master/include/common_def.h).
+* Please note that this option must be kept consistent with a second option, `ACC_CALLS_PER_ACTION`. The difference is as follows:
+ * `ACC_CALLS_PER_ACTION`: The number of accelerator calls in a single action execution. This number defines how many data streams are sent from host memory to the IP (and how many results come back). It should be less than or equal to `MAX_NUMBER_IMAGES_TEST_SET`. Valid for simulation and synthesis.
+ * `HW_THREADS_PER_ACTION`: The number of physical accelerator threads per action. It differs from `ACC_CALLS_PER_ACTION` in that it defines the number of physical accelerator instantiations, regardless of the input size. If `HW_THREADS_PER_ACTION < ACC_CALLS_PER_ACTION`, some of the physical accelerators are executed more than once (to serve the extra load); if `HW_THREADS_PER_ACTION == ACC_CALLS_PER_ACTION`, every physical accelerator is executed exactly once. It should be less than or equal to `ACC_CALLS_PER_ACTION`. Valid only for synthesis.
+ * In practice, `ACC_CALLS_PER_ACTION` controls the batching of input images per AFU, while `HW_THREADS_PER_ACTION` sets the number of physically parallel engines in the FPGA.
+ * On `AD8K5` and `ADKU3` it was difficult to achieve valid timing closure of less than -200ps with more than `HW_THREADS_PER_ACTION = ACC_CALLS_PER_ACTION = 2`.
+ * On `AD9V3` the best configuration tested is `HW_THREADS_PER_ACTION = ACC_CALLS_PER_ACTION = 4` at 250MHz (reaching ~95% BRAMs).
+
+ ![HLS_BLSTM scaling](./var/hls_blstm_scaling.png "Overview of hls_blstm architecture.")
+
 ## Dependencies
 ### i. FPGA Card selection
 As of now, the following FPGA card has been used with HLS_BLSTM:
-* [Alpha-Data ADM-PCIE-KU3](http://www.alpha-data.com/dcp/products.php?product=adm-pcie-ku3)
-* [Alpha-Data AAADM-PCIE-9V3](https://www.alpha-data.com/dcp/products.php?product=adm-pcie-9v3)
+* CAPI1.0
+ * [Alpha-Data ADM-PCIE-KU3](https://www.alpha-data.com/dcp/products.php?product=adm-pcie-ku3)
+ * [Alpha-Data ADM-PCIE-8K5](https://www.alpha-data.com/dcp/products.php?product=adm-pcie-8k5)
+* CAPI2.0
+ * [Alpha-Data ADM-PCIE-9V3](https://www.alpha-data.com/dcp/products.php?product=adm-pcie-9v3)
 ### ii. Development
 #### a) SNAP
@@ -131,12 +206,6 @@ The original software BLSTM code may be referred by the following citation: `V.
 ## Acknowledgement
 The original BLSTM software code of this example has been coded by Vladimir Rybalkin. It was provided by the Microelectronic Systems Design Research Group, University of Kaiserslautern, as part of the [OPRECOMP](http://oprecomp.eu/) microbenchmark suite.
-## Next steps
-The hls_blstm demonstrator has been already tested on the ADM-PCIE-KU3 device (FPGA XCKU060-FFVA1156), attached a POWER8 host, on IBM Zurich Heterogeneous Cloud (ZHC2) cloud. Future milestones are:
-
-- [ ] Porting to ADM-PCIE-8K5 (XCKU115-2-FLVA1517E) - almost double resources than KU3.
-- [ ] Porting to POWER9 + CAPI2.0
-
 ## License
 Copyright 2018 - The OPRECOMP Project Consortium, IBM Research GmbH. All rights reserved.
diff --git a/var/hls_blstm_scaling.png b/var/hls_blstm_scaling.png
new file mode 100644
index 0000000..3d038ac
Binary files /dev/null and b/var/hls_blstm_scaling.png differ
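
Note on the "Accelerator Scaling" text added by this diff: the relation between `ACC_CALLS_PER_ACTION` and `HW_THREADS_PER_ACTION` can be illustrated with a small shell sketch. It is not part of the repository. The function name `max_runs_per_engine` is hypothetical, and the sketch assumes that accelerator calls are spread as evenly as possible over the physical engines, which the README does not state explicitly.

```bash
# Hypothetical helper, not part of hls_blstm.
# Prints how many times the busiest physical engine runs in one action,
# assuming calls are distributed as evenly as possible (an assumption).
max_runs_per_engine() {
    acc_calls=$1     # ACC_CALLS_PER_ACTION
    hw_threads=$2    # HW_THREADS_PER_ACTION
    # Constraint from the README: HW_THREADS_PER_ACTION <= ACC_CALLS_PER_ACTION.
    if [ "$hw_threads" -lt 1 ] || [ "$hw_threads" -gt "$acc_calls" ]; then
        echo "need 1 <= HW_THREADS_PER_ACTION <= ACC_CALLS_PER_ACTION" >&2
        return 1
    fi
    # Integer ceiling of acc_calls / hw_threads.
    echo $(( (acc_calls + hw_threads - 1) / hw_threads ))
}

max_runs_per_engine 4 4   # prints 1: every engine runs exactly once
max_runs_per_engine 4 2   # prints 2: each engine serves two calls
```

With equal values the result is 1, matching the "executed exactly once" case; any larger result corresponds to engines being reused within a single action.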