Ex. 4: DMA / TL-UL host implementation
The previous exercises only used a “register bus” device interface to transfer data to and from your student module. In this exercise you will extend a simple TL-UL host, which will form the basis for your own application-specific bus master.
Objectives
Gain a working knowledge of TileLink Uncached Lightweight (TL-UL)
Develop architectures using DMA controlled by simple descriptors
Preparation
Study the TileLink Uncached Lightweight (TL-UL) specification linked on the Resources page. Read at least chapters 1-4 and 7. You only need to read the topics related to TL-UL, i.e. you can skip the sections dealing with TL-C and bursts.
Tasks
1. Explore the example
rvlab/src/rtl/student/student_dma.sv contains a TL-UL host implementing memset which fills a given memory area with a constant value. The module student_dma is already instantiated in rvlab/src/rtl/student/student.sv and connected as a host and device to the main TL-UL crossbar. Its CPU accessible registers are defined in rvlab/src/design/reggen/student_dma.json:
student_dma.status @ + 0x0
Status
Reset default = 0x0, mask 0x3
| Bits | Type | Reset | Name | Description |
|---|---|---|---|---|
| 1:0 | ro | 0x0 | val | 0 = idle, 1 = memset in progress, 2 = memcpy in progress |
student_dma.now_dadr @ + 0x4
Pointer to descriptor. Writing while idle will start the operation. Writes while not idle are ignored. Must be 32-bit aligned ([1:0] = 0).
Reset default = 0x0, mask 0xffffffff
| Bits | Type | Reset | Name | Description |
|---|---|---|---|---|
| 31:0 | rw | 0x0 | val | |
student_dma.cmd @ + 0x8
Command.
Reset default = 0x0, mask 0x0
| Bits | Type | Reset | Name | Description |
|---|---|---|---|---|
| 0 | wo | x | stop | 1 = abort current operation |
student_dma.length @ + 0xc
Number of bytes remaining to be set / copied.
Reset default = 0x0, mask 0xffffffff
| Bits | Type | Reset | Name | Description |
|---|---|---|---|---|
| 31:0 | ro | 0x0 | val | |
student_dma.src_adr @ + 0x10
memset: fill value. memcpy: current read address.
Reset default = 0x0, mask 0xffffffff
| Bits | Type | Reset | Name | Description |
|---|---|---|---|---|
| 31:0 | ro | 0x0 | val | |
student_dma.dst_adr @ + 0x14
Current write address.
Reset default = 0x0, mask 0xffffffff
| Bits | Type | Reset | Name | Description |
|---|---|---|---|---|
| 31:0 | ro | 0x0 | val | |
Note: To simplify the operation of the bus master module only 32-bit accesses aligned to 4-byte boundaries are used. Smaller accesses would involve more complex logic.
The descriptor is as follows:
| offset | size | name | description |
|---|---|---|---|
| 0x0 | 0x4 | operation | 0 = memset, 1 = memcpy |
| 0x4 | 0x4 | length | number of bytes to be set / copied (multiple of 4) |
| 0x8 | 0x4 | src_adr | memset: fill value. memcpy: first address of the source buffer (32-bit word aligned) |
| 0xc | 0x4 | dst_adr | first address of the destination buffer (32-bit word aligned) |
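
The descriptor layout can be mirrored in C so that software and hardware agree on the offsets. The following is only a sketch: the type name `dma_desc_t`, the `DMA_OP_*` constants and the helper `dma_desc_is_valid` are hypothetical and not part of the provided sources. The compile-time checks tie the struct to the offsets in the table, and the helper checks the alignment rules stated there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the descriptor; field names follow the table. */
typedef struct {
    uint32_t operation; /* 0x0: 0 = memset, 1 = memcpy */
    uint32_t length;    /* 0x4: number of bytes, multiple of 4 */
    uint32_t src_adr;   /* 0x8: fill value (memset) or source address (memcpy) */
    uint32_t dst_adr;   /* 0xc: destination address */
} dma_desc_t;

/* Fail the build if the compiler lays the struct out differently. */
static_assert(offsetof(dma_desc_t, operation) == 0x0, "operation offset");
static_assert(offsetof(dma_desc_t, length) == 0x4, "length offset");
static_assert(offsetof(dma_desc_t, src_adr) == 0x8, "src_adr offset");
static_assert(offsetof(dma_desc_t, dst_adr) == 0xc, "dst_adr offset");
static_assert(sizeof(dma_desc_t) == 0x10, "descriptor size");

enum { DMA_OP_MEMSET = 0, DMA_OP_MEMCPY = 1 }; /* assumed names */

/* Returns true if the descriptor respects the rules from the table. */
static bool dma_desc_is_valid(const dma_desc_t *d)
{
    if (d->operation != DMA_OP_MEMSET && d->operation != DMA_OP_MEMCPY)
        return false;
    if (d->length % 4u != 0u || d->dst_adr % 4u != 0u)
        return false;
    /* For memset, src_adr holds the fill value and needs no alignment. */
    if (d->operation == DMA_OP_MEMCPY && d->src_adr % 4u != 0u)
        return false;
    return true;
}
```

Running such a check in the software driver before starting a transfer catches misaligned buffers early, since the hardware only issues 32-bit accesses aligned to 4-byte boundaries.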
Simulate and test by running:
flow systb_dma.sim_rtl_questa
2. Hardware implementation
Extend the module to implement a function called memcpy, which copies one memory area to another. The register and descriptor definitions remain unchanged. The module should use the maximum bandwidth available from the TL-UL interface, i.e. read and write transactions should always be pending simultaneously. To achieve this, place a short FIFO (e.g. src/rtl/prim/prim_fifo_sync.sv) between the read and write processes.
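
To see why the FIFO helps, the following toy model in C counts cycles for a copy with and without overlapping reads and writes. It is only an illustration under strong assumptions: every access completes in one cycle, the bus is always ready, and the FIFO depth of 4 is an arbitrary choice. Real TL-UL transfers use ready/valid handshakes with variable latency, and all names in the sketch are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 4u /* assumed depth; the task only asks for a short FIFO */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    unsigned head;  /* index of the oldest entry */
    unsigned count; /* number of stored entries */
} fifo_t;

/* Read and write sides run in the same cycle, decoupled by the FIFO. */
static unsigned copy_pipelined(const uint32_t *src, uint32_t *dst, unsigned nwords)
{
    fifo_t f = { {0}, 0, 0 };
    unsigned rd = 0, wr = 0, cycles = 0;

    while (wr < nwords) {
        /* Sample both conditions from the state at the start of the cycle. */
        bool can_pop = f.count > 0;
        bool can_push = rd < nwords && f.count < FIFO_DEPTH;

        if (can_pop) { /* write process: drain the oldest word */
            dst[wr++] = f.data[f.head];
            f.head = (f.head + 1u) % FIFO_DEPTH;
            f.count--;
        }
        if (can_push) { /* read process: fetch the next word */
            f.data[(f.head + f.count) % FIFO_DEPTH] = src[rd++];
            f.count++;
        }
        assert(f.count <= FIFO_DEPTH);
        cycles++;
    }
    return cycles;
}

/* Baseline without overlap: one read cycle, then one write cycle per word. */
static unsigned copy_sequential(const uint32_t *src, uint32_t *dst, unsigned nwords)
{
    unsigned cycles = 0;
    for (unsigned i = 0; i < nwords; i++) {
        uint32_t tmp = src[i];
        cycles++;
        dst[i] = tmp;
        cycles++;
    }
    return cycles;
}
```

In this idealized model the overlapped version needs one cycle per word plus one cycle to fill the pipeline, while the sequential version needs two cycles per word.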
3. Software implementation
Sample implementations of the memset function are available in rvlab/src/sw/dma/memset.c/h. The function memset_soft is a complete software implementation of memset, while memset_dma uses the hardware implementation to fill a memory area. rvlab/src/sw/dma/memcpy.c/h provides stubs for memcpy_soft and memcpy_dma. Extend memcpy_soft into a complete software implementation of memcpy with the same functionality as your hardware implementation. Complete memcpy_dma so that it invokes your hardware implementation.
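
The register-level sequence behind a DMA-driven memset can be sketched as follows. This is not the code from memset.c: the function names, the `DMA_REG_*` indices (derived from the register offsets listed in task 1) and the `regs` base pointer parameter are assumptions made for illustration. The sketch also assumes the descriptor lives in memory that the DMA host can read.

```c
#include <assert.h>
#include <stdint.h>

/* Word indices derived from the register offsets (assumed names). */
enum {
    DMA_REG_STATUS   = 0x0 / 4,
    DMA_REG_NOW_DADR = 0x4 / 4
};

typedef struct {
    uint32_t operation; /* 0 = memset, 1 = memcpy */
    uint32_t length;    /* bytes, multiple of 4 */
    uint32_t src_adr;   /* memset: fill value */
    uint32_t dst_adr;   /* destination address */
} dma_desc_t;

/* Busy-wait until the status field reads 0 (idle). */
static void dma_wait_idle(volatile uint32_t *regs)
{
    while ((regs[DMA_REG_STATUS] & 0x3u) != 0u) {
    }
}

/* Writing the descriptor address to now_dadr starts the operation. */
static void dma_start(volatile uint32_t *regs, volatile dma_desc_t *desc)
{
    uintptr_t addr = (uintptr_t)desc;
    assert((addr & 0x3u) == 0u); /* now_dadr must be 32-bit aligned */
    regs[DMA_REG_NOW_DADR] = (uint32_t)addr;
}

/* The descriptor is volatile so its stores are emitted before the start write. */
static void memset_dma_sketch(volatile uint32_t *regs, volatile dma_desc_t *desc,
                              uint32_t dst, uint32_t value, uint32_t nbytes)
{
    dma_wait_idle(regs); /* writes to now_dadr while busy are ignored */
    desc->operation = 0u;
    desc->length = nbytes;
    desc->src_adr = value;
    desc->dst_adr = dst;
    dma_start(regs, desc);
    dma_wait_idle(regs);
}
```

Passing the register base as a parameter keeps the sketch independent of the actual address map and allows a plain array to stand in for the registers when trying it off-target.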
4. Test cases
Small test cases for memset_soft, memset_dma, memcpy_soft and memcpy_dma are provided in dma/main.c.
If you encounter problems, feel free to modify the test cases or add your own tests. Make sure you do not overwrite any essential data, such as the stack, data or code of your mini application.
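
One way to notice a copy that writes outside its destination buffer is to surround the buffer with guard words and check them afterwards. The sketch below is a possible shape for such a test, not the contents of dma/main.c: the `copy_fn` signature is an assumption about how a copy routine might be called, `copy_reference` uses the standard `memcpy` as a stand-in for the implementation under test, and `copy_overrun` is a deliberately faulty variant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GUARD  0xC0DEC0DEu /* arbitrary marker value */
#define NWORDS 8u

/* Assumed signature of the routine under test. */
typedef void (*copy_fn)(uint32_t *dst, const uint32_t *src, uint32_t nbytes);

/* Stand-in for a correct implementation. */
static void copy_reference(uint32_t *dst, const uint32_t *src, uint32_t nbytes)
{
    memcpy(dst, src, nbytes);
}

/* Faulty on purpose: writes one extra word behind the destination. */
static void copy_overrun(uint32_t *dst, const uint32_t *src, uint32_t nbytes)
{
    memcpy(dst, src, nbytes);
    dst[nbytes / 4u] = 0u;
}

/* Returns true if the data arrived intact and both guards are untouched. */
static bool copy_test_with_guards(copy_fn copy)
{
    uint32_t src[NWORDS];
    uint32_t dst[NWORDS + 2u]; /* one guard word on each side */

    assert(copy != 0);
    for (uint32_t i = 0; i < NWORDS; i++)
        src[i] = 0x1000u + i;
    for (uint32_t i = 0; i < NWORDS + 2u; i++)
        dst[i] = GUARD;

    copy(&dst[1], src, NWORDS * 4u);

    if (dst[0] != GUARD || dst[NWORDS + 1u] != GUARD)
        return false;
    return memcmp(&dst[1], src, NWORDS * 4u) == 0;
}
```

The guards live inside the test's own array, so even the faulty variant only corrupts memory that belongs to the test.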
5. Benchmarking
Compare the speed of the software implementation of memcpy and memset with the speed of the hardware component for the scenarios listed in the Deliverables. Measure the cycles from the write of the now_dadr register until the status register becomes 0 again.
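
A measurement can be wrapped as sketched below. The sketch assumes that a free-running 32-bit cycle counter is available through some function (on a RISC-V core this is typically the cycle CSR, but how it is read in this lab is not specified here), so the counter is passed in as a function pointer. `fake_now` and `fake_op` are host-side stand-ins, and interpreting the ratio column as software cycles divided by hardware cycles is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*cycle_fn)(void); /* hypothetical hook for the cycle counter */
typedef void (*op_fn)(void *ctx);   /* operation to be measured */

/* Returns the number of cycles spent inside op(ctx). */
static uint32_t measure_cycles(cycle_fn now, op_fn op, void *ctx)
{
    assert(now != 0 && op != 0);
    uint32_t start = now();
    op(ctx);
    /* Unsigned subtraction stays correct across a single counter wrap. */
    return now() - start;
}

/* Speed-up as fixed point with two decimals, e.g. 750 means 7.50x. */
static uint32_t speedup_x100(uint32_t sw_cycles, uint32_t hw_cycles)
{
    assert(hw_cycles != 0u);
    return (uint32_t)(((uint64_t)sw_cycles * 100u) / hw_cycles);
}

/* Stand-in counter: starts close to the wrap point, 7 ticks per read. */
static uint32_t fake_now_value = 0xfffffff0u;

static uint32_t fake_now(void)
{
    uint32_t v = fake_now_value;
    fake_now_value += 7u;
    return v;
}

/* Stand-in operation: takes 100 ticks and counts its invocations. */
static void fake_op(void *ctx)
{
    fake_now_value += 100u;
    (*(unsigned *)ctx)++;
}
```

For the hardware measurement, the measured operation would consist of the write to now_dadr followed by polling the status register until it reads 0, matching the interval described above.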
Deliverables
All deliverables should be submitted in a single PDF file.
1. Source texts
SystemVerilog source of your TL-UL host (excluding any generated code)
C source of memcpy_soft and memcpy_dma
C source of your memcpy test cases
2. Wave views
The wave views should be zoomed in as much as possible to only show the sections specified below. They should contain at least the clk signal, all CPU readable registers (status, length, src_adr, dst_adr) and the TL-UL host interface.
memcpy_dma: zoomed-in view showing the transfer of at least 3 words
memcpy_soft: zoomed-in view showing the transfer of at least 1 word
3. Benchmarking results
| operation | software [cycles] | hardware [cycles] | ratio |
|---|---|---|---|
| memset of 1kB in SRAM | | | |
| memset of 1kB in DDR3 | | | |
| memcpy of 1kB SRAM to SRAM | | | |
| memcpy of 1kB SRAM to DDR3 | | | |
| memcpy of 1kB DDR3 to SRAM | | | |