# FPGA implementation of a memory-mapped coprocessor

Tutorial 11 on Dedicated systems

Teacher: Giuseppe Scollo

University of Catania, Department of Mathematics and Computer Science

Graduate Course in Computer Science, 2017-18

DMI - Graduate Course in Computer Science, Copyleft @ 2018 Giuseppe Scollo

### Table of Contents

1. FPGA implementation of a memory-mapped coprocessor
2. tutorial outline
3. project workflow
4. coprocessor hardware interface
5. coprocessor as a Qsys component (1)
6. coprocessor as a Qsys component (2)
7. coprocessor as a Qsys component (3)
8. Nios II system with coprocessor and Performance Counter
9. mapping to FPGA and compilation
10. software driver
11. test and performance measurement programs (1)
12. test and performance measurement programs (2)
13. test with blocking acceleration
14. test with nonblocking acceleration
15. references

## tutorial outline

this tutorial deals with:

- FPGA implementation of the project idea proposed in lecture 11
- coprocessor hardware interface
- coprocessor as a Qsys component
- Nios II system with coprocessor and Performance Counter
- software driver
- test and performance measurement with the Monitor Program
- test with blocking acceleration
- test with nonblocking acceleration

## project workflow

main development phases:

- VHDL description of the coprocessor with Avalon MM interface
- Qsys construction of a Nios II system with coprocessor and performance counter
- system mapping to FPGA and compilation
- TCL script production for HAL software driver generation
- production of the software application for testing and performance measurement, in two versions:
  - sequential: blocking execution of the coprocessor computation
  - pipelined: nonblocking execution of the coprocessor computation
- compilation and execution of the application
  under the Monitor Program, for two variants of each version: one with the default optimization level, the other with level O3
- saving of performance reports and project archiving

## coprocessor hardware interface

two VHDL sources implement the memory-mapped coprocessor:

- delay\_collatz.vhd, a modified version of the output of the fdlvhd translation of the Gezel source presented in the second lecture, according to the third lab experience
- delay\_collatz\_interface.vhd, which contains an instance of the computational component and accesses the following Avalon bus signals: clock, resetn, read, write, chipselect, waitrequest, writedata, readdata

both files are available in the vhdl folder of the attached archive, as well as in the VHDL/code/e11 folder of the reserved lab area

the folder also contains std\_logic\_arithext.vhd, which is needed to compile the computational component, delay\_collatz\_codesign.vhd, which will be explained next, and delay\_collatz.diff, the latter for documentation only, to show the changes made to the translation produced by fdlvhd

inspection of the delay\_collatz\_interface.vhd source shows the relationships between the I/O signals of the computational component and the Avalon interface signals

## coprocessor as a Qsys component (1)

folder codesign in the attached archive is preset to host the project development

after having copied the \*.vhd files from folder vhdl into it, the Qsys custom component construction proceeds much as in lab tutorial 10, with due differences for the present case

after creation of project delay\_collatz\_codesign, with top-level entity having the same name, the construction of the custom component delay\_collatz\_interface may proceed

in particular, the Conduit Avalon interface is not needed by the present component, since it makes no use of peripherals outside
the FPGA

the new component type definition is shown in the figure

## coprocessor as a Qsys component (2)

the next step is the assignment of the VHDL files that describe the component, and their analysis, as shown in the figure

N.B. for this project, it is not necessary to copy files for simulation

## coprocessor as a Qsys component (3)

finally, the new component definition ends with the definition of its Avalon interfaces and the placement of its signals under the appropriate interfaces, as shown in the figure

## mapping to FPGA and compilation

for the construction of the Nios II system shown in the previous figures it may be useful to consult the Qsys introduction tutorial, with a few differences, such as: memory size is 128 KB in the present case, all base addresses are assigned by the system, etc.

the final steps to map the system to the FPGA are as follows:

in Qsys:

- save the system with name embedded\_system by File > Save As...
- generate the VHDL code for it by Generate > Generate HDL...
exit Qsys, then in Quartus:

- assign the embedded\_system.qip file (in embedded\_system/synthesis) to the project
- import assignments from file DE1\_SoC.qsf in folder de1soc of the attached archive
- File > Save Project
- compile delay\_collatz\_codesign.vhd

## software driver

folder script in the attached archive contains two TCL scripts for the generation of the software driver in the BSP for the project

the two scripts differ by a single command, present in only one of them, that prescribes optimization level O3 rather than the default level O1

these two scripts are to be copied into folder codesign/ip/delay\_collatz\_avalon\_interface

in the same folder, the C sources delay\_collatz\_avalon\_interface.h and delay\_collatz\_avalon\_interface.c of the software driver, available in folder src of the attached archive, are to be copied under HAL/inc and HAL/src, respectively

the TCL scripts were written by analogy with the TCL script for the software driver of the Performance Counter, available in the Quartus Prime Lite 16.1 distribution under path \$SOPC\_KIT\_NIOS2/../ip/altera/sopc\_builder\_ip/altera\_avalon\_performance\_counter

similarly, the C sources of the software driver were written by (more limited) analogy with the C sources of the software driver of the same IP Core, in folder HAL under the aforementioned path

the motivation for this, perhaps unorthodox, way of producing the software driver lies in the twofold fact that

- the Avalon interface of the custom component does not fit into any of the HAL generic device model classes defined in Chapter 7 of the Nios II Classic Software Developer's Handbook
- neither does the Performance Counter Unit IP Core fit therein ...
together with a somewhat reasonable level of operational analogy between the two components

skimming through Chapter 7 of the handbook is recommended nonetheless, to get a better understanding of the software driver structure and contents

## test and performance measurement programs (1)

folder src in the attached archive contains the subject programs, which are to be copied into the provided folders for the creation of test and performance measurement projects under the Monitor Program, as follows:

- delay\_collatz\_sequential\_timing.c in codesign/amp\_s and in codesign/amp\_s\_o3
- delay\_collatz\_pipelined\_timing.c in codesign/amp\_p and in codesign/amp\_p\_o3

project creation parameters are summarized in the attached file Monitor Notes.txt

the DE1-SoC needs to be powered up and connected to the PC, to program the FPGA at the end of each project creation

main differences between the source of lab tutorial 09 and the present sequential version:

- #include and #define directives relating to the custom component
- replacement of the input from the switches device with a constant
- replacement of the body of function delay\_collatz with two instructions from the software driver of the custom component

## test and performance measurement programs (2)

the *pipelined* version of the program exhibits much stronger differences with respect to the program of lab tutorial 09: the interaction with the custom hardware is made nonblocking by replacing the delay\_collatz function call with an *inlining* of its body, where the software computation of the next trajectory start point is placed in between the two inlined instructions, which respectively start the hardware computation and read its result

the synchronization mechanism is very simple, thanks to properties of the custom component and of the waitrequest signal of the Avalon MM protocol:

- for
  trajectories faster than the software computation, the custom component keeps the result in its internal register while waiting for the read command
- for trajectories slower than the software computation, the read command is kept waiting by the Avalon interface by means of the waitrequest signal

## test with blocking acceleration

compilation, loading on the FPGA and execution of program delay\_collatz\_sequential\_timing.c, in the two projects codesign/amp\_s and codesign/amp\_s\_o3, produces the Performance Counter Reports in the figure

the remarkable reduction of the execution time of section delay\_collatz in the second variant may be explained by function *inlining* under compilation with O3

Performance Counter Report, Total Time: 0.499495 seconds (24974733 clock-cycles)

| Section | % | Time (sec) | Time (clocks) | Occurrences |
| --- | --- | --- | --- | --- |
| traject_start | 38 | 0.19005 | 9502720 | 65536 |
| delay_collatz | 44.4 | 0.22162 | 11081057 | 65536 |

Performance Counter Report, Total Time: 0.346551 seconds (17327… clock-cycles)

| Section | % | Time (sec) | Time (clocks) | Occurrences |
| --- | --- | --- | --- | --- |
| traject_start | 49.5 | 0.17170 | 8585216 | 65536 |
| delay_collatz | 35.7 | 0.12373 | 6186326 | 65536 |

a speed-up by an order of magnitude, w.r.t.
the software computation in lab tutorial 09, results from the performance data in that case, with the same optimization levels

Performance Counter Report, Total Time: 7.52118 seconds (376058945 clock-cycles)

| Section | % | Time (sec) | Time (clocks) | Occurrences |
| --- | --- | --- | --- | --- |
| traject_start | 2.53 | 0.19005 | 9502720 | 65536 |
| delay_collatz | 96.3 | 7.24331 | 362165262 | 65536 |

Performance Counter Report, Total Time: 4.55965 seconds (227982443 clock-cycles)

| Section | % | Time (sec) | Time (clocks) | Occurrences |
| --- | --- | --- | --- | --- |
| traject_start | 3.77 | 0.17170 | 8585216 | 65536 |
| delay_collatz | 95.1 | 4.33682 | 216841223 | 65536 |

## test with nonblocking acceleration

it is sensible to expect a further performance gain from the nonblocking execution of the computation by the custom hardware

the comparison of the following Performance Counter Reports with the corresponding data for the implementation with all computation done in software yields a 21x speed-up with default optimization O1 and a 16x speed-up with optimization O3; the corresponding speed-up values with blocking acceleration are 15x with O1 and 13x with O3

N.B. the speed-up is computed on the total time; section data are less significant with nonblocking acceleration because the execution threads of the two sections overlap in time

Performance Counter Report, Total Time: 0.359145 seconds (17957243 clock-cycles)

| Section | % | Time (sec) | Time (clocks) | Occurrences |
| --- | --- | --- | --- | --- |
| traject_start | 52.9 | 0.19005 | 9502720 | 65536 |
| delay_collatz | 86.5 | 0.31065 | 15532367 | 65536 |

Performance Counter Report, Total Time: 0.285762 seconds (1428… clock-cycles)

| Section | % | Time (sec) | Time (clocks) | Occurrences |
| --- | --- | --- | --- | --- |
| traject_start | 60.1 | 0.17170 | 8585216 | 65536 |
| delay_collatz | 89.4 | 0.25561 | 12780693 | 65536 |

## references

useful materials for the proposed lab experience:

- archive with source files for project reproduction
- Avalon® Interface Specifications, Ch. 1-3, MNL-AVABUSREF, Intel Corp., 2017.05.08
- Making Qsys Components - For Quartus Prime 16.1, Intel Corp. - FPGA University Program, November 2016
- Introduction to the Qsys System Integration Tool - For Quartus Prime 16.1, Intel Corp. - FPGA University Program, November 2016
- Nios II Classic Software Developer's Handbook, Ch. 7, NII5V2, Altera Corp., 2015.05.14
- Intel FPGA Monitor Program Tutorial for Nios II - For Quartus Prime 16.1, Intel Corp. - FPGA University Program, November 2016
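
the speed-up values quoted in the two test sections (15x and 13x with blocking acceleration, 21x and 16x with nonblocking acceleration) can be checked against the total times of the Performance Counter Reports; the following C sketch does so. It is not part of the attached archive: the type `total_times` and the functions `speedup`, `rounded_speedup` and `print_speedups` are hypothetical names introduced here for illustration only, and the total time 0.359145 s is taken as the value that corresponds to 17957243 clock-cycles at the clock rate implied by the other reports.

```c
#include <assert.h>
#include <stdio.h>

/* Total times in seconds, copied from the Performance Counter Reports.
 * Illustrative type, not taken from the tutorial sources. */
typedef struct {
    const char *level;   /* optimization level */
    double software;     /* all computation in software (lab tutorial 09) */
    double blocking;     /* sequential version, blocking acceleration */
    double nonblocking;  /* pipelined version, nonblocking acceleration */
} total_times;

static const total_times reports[] = {
    { "O1", 7.52118, 0.499495, 0.359145 },
    { "O3", 4.55965, 0.346551, 0.285762 },
};

/* Speed-up = baseline total time / accelerated total time. */
double speedup(double baseline, double accelerated)
{
    assert(accelerated > 0.0);
    return baseline / accelerated;
}

/* Speed-up rounded to the nearest integer, as quoted in the text. */
long rounded_speedup(double baseline, double accelerated)
{
    return (long)(speedup(baseline, accelerated) + 0.5);
}

/* Print one line per optimization level. */
void print_speedups(FILE *out)
{
    for (size_t i = 0; i < sizeof reports / sizeof reports[0]; i++) {
        const total_times *r = &reports[i];
        fprintf(out, "%s: blocking %ldx, nonblocking %ldx\n", r->level,
                rounded_speedup(r->software, r->blocking),
                rounded_speedup(r->software, r->nonblocking));
    }
}
```

calling `print_speedups(stdout)` from a `main` function prints `O1: blocking 15x, nonblocking 21x` and `O3: blocking 13x, nonblocking 16x`, in agreement with the values stated above; the ratios are computed on total times only, since section times overlap in the nonblocking case.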