Commit 3c8bd95 — Updated README.
diamantopoulos committed Sep 27, 2019 · 1 parent bfffcba
2 changed files: 79 additions and 10 deletions (README.md; binary file var/hls_blstm_scaling.png added)
```Bash
git clone https://github.com/oprecomp/HLS_BLSTM.git <SNAP_ROOT>/actions/hls_blst
cd <SNAP_ROOT>
make snap_config   # in the ncurses menu select HLS_BLSTM
```
* Latest supported snap version: [2fb8fb85f9a6ec7bdbf837522c8ce839e87de281](https://github.com/open-power/snap/commit/2fb8fb85f9a6ec7bdbf837522c8ce839e87de281) (Updated 27-09-2019)
```Bash
cd <SNAP_ROOT>
git checkout 2fb8fb85f9a6ec7bdbf837522c8ce839e87de281
```Bash
xsim -gui hardware/six/xsim/latest/top.wdb &
```
* Run the hardware version (bitstream preparation on x86, run on POWER8/9)
* Due to a bug in the floating-point-to-fixed-point conversion of C++ synthesizable code in Vivado versions > 2017.1, the casting of input pixel values from floating to fixed point has to be done on the CPU.
* To do so, the equivalent Xilinx libraries are needed in order to compile an executable on POWER against them.
* Since these libraries are copyrighted by Xilinx, we cannot include them in this repo.
* So, assuming we are ready to execute the action on a POWER server, e.g. `powerserver`, and have already verified the cosim on an x86 development server, e.g. `devhostx86`, we copy the required Xilinx libraries from `devhostx86` to `powerserver` as follows (logged in to `powerserver`):
* `scp -r devhostx86:<snap_root>/actions/hls_blstm/sw/third-party/xilinx/* <snap_root>/actions/hls_blstm/sw/third-party/xilinx/third-party/xilinx/`
* The Xilinx libraries on `devhostx86` should have been copied from the Xilinx installation dir to `hls_blstm/sw/third-party/xilinx/` in an earlier step, when executing `make` in `hls_blstm/snap_modification_files`.
```Bash
cd <SNAP_ROOT>
make image
scp <SNAP_ROOT>/hardware/build/Images/<file>.bin user@remoteP8Server://path_to_bi
# on the remote POWER8/9 server, given that you have cloned the repo and prepared the files as on x86
sudo capi-flash-script /path_to-bin/file.bin
cd <SNAP_ROOT>/actions/hls_blstm/sw
SNAP_CONFIG=0x0 make
SNAP_CONFIG=FPGA make
sudo ../../../software/snap_maint -vvv
sudo SNAP_CONFIG=FPGA ./snap_blstm -i ../data/samples_1/ -g ../data/gt_1/ -C0
```
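The host-side float-to-fixed cast described above can be pictured with a plain C++ sketch. This is only an illustration: the Q2.14 format, the rounding, and the saturating behaviour below are assumptions for the example — the real code uses the Xilinx `ap_fixed` types from the copied libraries.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative fixed-point format: 16-bit signed, 14 fractional bits (Q2.14).
// The actual format in hls_blstm comes from the Xilinx ap_fixed headers.
constexpr int FRAC_BITS = 14;

// Convert a float pixel value to fixed point, rounding to nearest and
// saturating instead of wrapping on overflow.
int16_t float_to_fixed(float x) {
    float scaled = std::round(x * (1 << FRAC_BITS));
    if (scaled >  32767.0f) return  32767;
    if (scaled < -32768.0f) return -32768;
    return static_cast<int16_t>(scaled);
}

// Inverse conversion, e.g. for checking the cast on the host side.
float fixed_to_float(int16_t q) {
    return static_cast<float>(q) / (1 << FRAC_BITS);
}
```

Values exactly representable in Q2.14 (such as 0.25) round-trip losslessly; out-of-range inputs clamp to the extremes.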
* <details><summary>For example, this is the sample output for 1 image (click to expand)</summary>
<p>
##### SNAP_CONFIG=FPGA ./snap_blstm -i ../data/samples_1/ -g ../data/gt_1/ -o out.txt
```bash
INFO: Read 1 files from path ../data/samples_1/
INFO: Read 1 files from path ../data/gt_1/
DEBUG: listOfImages.size() = 1
Start ...
DEBUG: numberOfColumnsVec[0] = 566, total_pixels_in_action = 0
INFO: numberOfColumnsVec[0] = 566
ACTION PARAMETERS:
input image 0: ../data/samples_1/fontane_brandenburg01_1862_0043_1600px_010001.raw.lnrm.png.txt, 566 columns, 28300 fw-bw pixels, 113200 bytes
output: out.txt
type_in: 0 HOST_DRAM
addr_in: 00007fff9adf0000
type_out: 0 HOST_DRAM
addr_out: 00000000359d0000
size_in: 113200 (0x0001ba30)
size_out: 332 (0x0000014c)
prepare blstm job of 80 bytes size
00000000: 00 00 df 9a ff 7f 00 00 30 ba 01 00 00 00 12 00 | ........0.......
00000010: 00 00 9d 35 00 00 00 00 4c 01 00 00 00 00 23 00 | ...5....L.......
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
00000030: 36 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | 6...............
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

INFO: Accelerator returned code on MMIO (AXILite job struct field) : 126
INFO: AXI transactions registered on MMIO : In: 1769(0x6e9), Out: 5(0x5)
00000000: 00 00 df 9a ff 7f 00 00 30 ba 01 00 00 00 12 00 | ........0.......
00000010: 00 00 9d 35 00 00 00 00 4c 01 00 00 00 00 23 00 | ...5....L.......
00000020: 7e 00 00 00 44 00 00 00 e9 06 00 00 05 00 00 00 | ....D...........
00000030: 36 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | 6...............
00000040: 44 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | D...............

DEBUG tb: vecPredictedStringLen[0] = 68
INFO: writing output data 0x359d0000 68 uintegers to out.txt
INFO: RETC=102
INFO: SNAP run 0 blstm took 96411 usec
0 Expected: spruches nichts, daß eine leise Bitterkeit oder ein Wort der Resig-
Predicted: spruches nichts, daß eine leise Bitterkeit oder ein Wort der Resig- Accuracy: 100 %
Predicted id: 72 69 71 74 56 61 58 72 1 67 62 56 61 73 72 8 1 57 54 86 1 58 62 67 58 1 65 58 62 72 58 1 27 62 73 73 58 71 64 58 62 73 1 68 57 58 71 1 58 62 67 1 0 48 68 71 73 1 57 58 71 1 43 58 72 62 60 9
INFO: Accelerator status code on MMIO : 126 (0x7e)
INFO: AXI transaction registered on MMIO :
INFO: In: 1769 (0x6e9)
INFO: Out:5 (0x5)
Accuracy: 100%
Measured time ... 0 seconds (111596 us) for 1 images. Action time 96411 us (96411 us per action -> 1 images, ~96411 us / image)
```
</p>
</details>
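For reference, the 80-byte job structure shown in the hexdumps above can be sketched as a C++ struct. The field names and the meaning of the two type/flags words are guesses inferred from the dump offsets, not the repo's actual header, which lives in the action sources.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout inferred from the 80-byte hexdump above.
struct blstm_job {
    uint64_t addr_in;    // 0x00: host address of input pixels (0x00007fff9adf0000)
    uint32_t size_in;    // 0x08: input size in bytes (0x0001ba30 = 113200)
    uint32_t flags_in;   // 0x0c: input type/flags word (exact encoding unknown)
    uint64_t addr_out;   // 0x10: host address of the output buffer
    uint32_t size_out;   // 0x18: output size in bytes (0x014c = 332)
    uint32_t flags_out;  // 0x1c: output type/flags word
    uint32_t retc;       // 0x20: accelerator return code (0x7e = 126 after the run)
    uint32_t out_len;    // 0x24: predicted string length (0x44 = 68)
    uint32_t axi_in;     // 0x28: AXI input transactions (0x6e9 = 1769)
    uint32_t axi_out;    // 0x2c: AXI output transactions (5)
    uint32_t columns;    // 0x30: image columns (0x236 = 566)
    uint32_t pad[7];     // 0x34: remaining words up to 80 bytes
};
static_assert(sizeof(blstm_job) == 80, "job struct must be 80 bytes");
```

The little-endian byte order in the dump matches this layout, e.g. `30 ba 01 00` at offset 0x08 is `size_in = 0x0001ba30`.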
* You can choose the verbosity level via [`#define DEBUG_LEVEL LOG_CRITICAL`](https://github.ibm.com/DID/hls_blstm/blob/master/include/common_def.h) (when you measure timing, choose the `LOG_NOTHING` option).
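A minimal sketch of how such a compile-time verbosity gate typically works — only the level names `LOG_CRITICAL` and `LOG_NOTHING` come from the README; the numeric ordering and the `LOG()` helper are assumptions for illustration:

```cpp
#include <cstdio>

// Illustrative log levels; include/common_def.h defines the real ones.
#define LOG_NOTHING  0
#define LOG_CRITICAL 1
#define LOG_DEBUG    2

// Set to LOG_NOTHING when measuring timing so no I/O skews the numbers.
#define DEBUG_LEVEL LOG_CRITICAL

// Messages above the compiled-in level are constant-folded away at run time.
#define LOG(level, ...) \
    do { if ((level) <= DEBUG_LEVEL) std::fprintf(stderr, __VA_ARGS__); } while (0)
```

With `DEBUG_LEVEL` set to `LOG_CRITICAL`, a call like `LOG(LOG_DEBUG, "cols=%d\n", 566);` produces no output and no run-time cost.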
## Accelerator Scaling
* The HLS_BLSTM action has been designed so that scaling the number of parallel processing engines is enabled by a single option: [`HW_THREADS_PER_ACTION`](https://github.ibm.com/DID/hls_blstm/blob/master/include/common_def.h).
* Please note that in order to enable this option, another option, `ACC_CALLS_PER_ACTION`, has to be aligned with it. The difference is as follows:
  * `ACC_CALLS_PER_ACTION`: the number of accelerator calls in a single action execution. It defines how many data streams are fed to the IP from host memory (and results back). It should be less than or equal to `MAX_NUMBER_IMAGES_TEST_SET`. Valid for simulation and synthesis.
  * `HW_THREADS_PER_ACTION`: the number of physical accelerator threads per action. It differs from `ACC_CALLS_PER_ACTION` in that it defines the number of physical accelerator instantiations, regardless of the input size: if `HW_THREADS_PER_ACTION < ACC_CALLS_PER_ACTION`, some physical accelerators are executed more than once (to serve the extra load), while if `HW_THREADS_PER_ACTION == ACC_CALLS_PER_ACTION`, every physical accelerator is executed exactly once. It should be less than or equal to `ACC_CALLS_PER_ACTION`. Valid only for synthesis.
* Practically, `ACC_CALLS_PER_ACTION` controls the batching of input images per AFU, while `HW_THREADS_PER_ACTION` controls the actual parallel engines in the FPGA.
* On `AD8K5` and `ADKU3` it was difficult to achieve valid timing closure (negative slack worse than -200 ps) with more than `HW_THREADS_PER_ACTION = ACC_CALLS_PER_ACTION = 2`.
* On `AD9V3` the best scenario tested is `HW_THREADS_PER_ACTION = ACC_CALLS_PER_ACTION = 4` at 250 MHz (reaching ~95% BRAM utilization).
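The relationship between the two knobs can be sketched as a round-robin mapping of accelerator calls onto physical engines. This is a hypothetical model for illustration, not the repo's actual scheduler:

```cpp
#include <vector>

// How ACC_CALLS_PER_ACTION accelerator calls could be served by
// HW_THREADS_PER_ACTION physical engines: call i goes to engine i % threads,
// so with fewer engines than calls some engines run more than once.
std::vector<int> runs_per_engine(int acc_calls, int hw_threads) {
    std::vector<int> runs(hw_threads, 0);
    for (int call = 0; call < acc_calls; ++call)
        runs[call % hw_threads] += 1;
    return runs;
}
```

With `acc_calls = 4, hw_threads = 2` each engine runs twice; with `acc_calls = hw_threads = 4` each engine runs exactly once, matching the rule stated above.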
![HLS_BLSTM scaling](./var/hls_blstm_scaling.png "Overview of hls_blstm architecture.")
## Dependencies
### i. FPGA Card selection
As of now, the following FPGA cards have been used with HLS_BLSTM:
* CAPI1.0
  * [Alpha-Data ADM-PCIE-KU3](https://www.alpha-data.com/dcp/products.php?product=adm-pcie-ku3)
  * [Alpha-Data ADM-PCIE-8K5](https://www.alpha-data.com/dcp/products.php?product=adm-pcie-8k5)
* CAPI2.0
  * [Alpha-Data ADM-PCIE-9V3](https://www.alpha-data.com/dcp/products.php?product=adm-pcie-9v3)
### ii. Development
#### a) SNAP
The original software BLSTM code may be referenced by the following citation: `V. …`
## Acknowledgement
The original BLSTM software code of this example has been coded by Vladimir Rybalkin. It was provided by the Microelectronic Systems Design Research Group, University of Kaiserslautern, as part of the [OPRECOMP](http://oprecomp.eu/) microbenchmark suite.
## Next steps
The hls_blstm demonstrator has already been tested on the ADM-PCIE-KU3 device (FPGA XCKU060-FFVA1156), attached to a POWER8 host, on the IBM Zurich Heterogeneous Cloud (ZHC2). Future milestones are:
- [ ] Porting to ADM-PCIE-8K5 (XCKU115-2-FLVA1517E) - almost double the resources of the KU3.
- [ ] Porting to POWER9 + CAPI2.0
## License
Copyright 2018 - The OPRECOMP Project Consortium,
IBM Research GmbH. All rights reserved.