Ex. 4: DMA / TL-UL host implementation

The previous exercises used only a “register bus” device interface to transfer data to and from your student module. In this exercise you will extend a simple TL-UL host, which will form the basis for your own application-specific bus master.

Objectives

  • working knowledge of TileLink Uncached Lightweight (TL-UL)

  • develop architectures using DMA controlled by simple descriptors

Preparation

  • Study the TileLink Uncached Lightweight (TL-UL) specification linked on the Resources page. Read at least chapters 1-4 and 7. Only the topics related to TL-UL are required, i.e. you can skip the sections dealing with TL-C and bursts.

Tasks

1. Explore the example

rvlab/src/rtl/student/student_dma.sv contains a TL-UL host implementing memset, which fills a given memory area with a constant value. The module student_dma is already instantiated in rvlab/src/rtl/student/student.sv and connected to the main TL-UL crossbar both as a host and as a device. Its CPU-accessible registers are defined in rvlab/src/design/reggen/student_dma.json:

student_dma.status @ + 0x0
Status
Reset default = 0x0, mask 0x3
Bits  Type  Reset  Name  Description
1:0   ro    0x0    val   0 = idle, 1 = memset in progress, 2 = memcpy in progress


student_dma.now_dadr @ + 0x4
Pointer to descriptor. Writing while idle will start the operation. Writes while not idle are ignored. Must be 32-bit aligned ([1:0] = 0).
Reset default = 0x0, mask 0xffffffff
Bits  Type  Reset  Name  Description
31:0  rw    0x0    val


student_dma.cmd @ + 0x8
Command.
Reset default = 0x0, mask 0x0
Bits  Type  Reset  Name  Description
0     wo    x      stop  1 = abort current operation


student_dma.length @ + 0xc
number of bytes remaining to be set / copied
Reset default = 0x0, mask 0xffffffff
Bits  Type  Reset  Name  Description
31:0  ro    0x0    val


student_dma.src_adr @ + 0x10
memset: fill value. memcpy: current read address
Reset default = 0x0, mask 0xffffffff
Bits  Type  Reset  Name  Description
31:0  ro    0x0    val


student_dma.dst_adr @ + 0x14
current write address
Reset default = 0x0, mask 0xffffffff
Bits  Type  Reset  Name  Description
31:0  ro    0x0    val


Note: To simplify the bus master module, only 32-bit accesses aligned to 4-byte boundaries are used. Smaller accesses would require more complex logic.
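The register offsets and the alignment rule can be captured in a few lines of C. This is only a sketch: the enum names and both helper functions are made up for illustration and are not the names produced by reggen or used in the provided sources.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offsets of the student_dma registers, taken from the register list above.
 * The enum names are illustrative, not generated by reggen. */
enum {
    DMA_STATUS_OFFSET   = 0x00, /* ro: 0 = idle, non-zero = operation in progress */
    DMA_NOW_DADR_OFFSET = 0x04, /* rw: descriptor pointer, a write while idle starts the operation */
    DMA_CMD_OFFSET      = 0x08, /* wo: bit 0 = stop */
    DMA_LENGTH_OFFSET   = 0x0c, /* ro: bytes remaining */
    DMA_SRC_ADR_OFFSET  = 0x10, /* ro: fill value or current read address */
    DMA_DST_ADR_OFFSET  = 0x14  /* ro: current write address */
};

/* True if addr lies on a 4-byte boundary, i.e. addr[1:0] == 0. */
static int dma_is_word_aligned(uint32_t addr)
{
    return (addr & 0x3u) == 0;
}

/* Word index of a register inside a uint32_t array that starts at the block base. */
static unsigned dma_reg_index(unsigned byte_offset)
{
    assert(dma_is_word_aligned(byte_offset));
    return byte_offset / 4u;
}
```

With a base pointer of type volatile uint32_t *, regs[dma_reg_index(DMA_STATUS_OFFSET)] would then read the status register; the base address itself depends on the crossbar configuration and is not given here.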

The descriptor is as follows:

offset  size  name       description
0x0     0x4   operation  0 = memset, 1 = memcpy
0x4     0x4   length     number of bytes to be set / copied (multiple of 4)
0x8     0x4   src_adr    memset: fill value. memcpy: 1st address of the source buffer (32 bit word aligned)
0xc     0x4   dst_adr    1st address of the destination buffer (32 bit word aligned)
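In C, the descriptor layout can be mirrored by a struct of four 32-bit words. The following is a minimal sketch: the type name dma_descriptor_t, the DMA_OP_* constants and the helper dma_descriptor_valid are hypothetical and may differ from what the provided memset.c uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Operation codes from the descriptor table (names are illustrative). */
enum { DMA_OP_MEMSET = 0, DMA_OP_MEMCPY = 1 };

/* Hypothetical C view of the 16-byte descriptor. */
typedef struct {
    uint32_t operation; /* 0x0: 0 = memset, 1 = memcpy */
    uint32_t length;    /* 0x4: number of bytes, multiple of 4 */
    uint32_t src_adr;   /* 0x8: fill value (memset) or source address (memcpy) */
    uint32_t dst_adr;   /* 0xc: destination address */
} dma_descriptor_t;

/* Compile-time checks that the struct matches the documented offsets. */
_Static_assert(offsetof(dma_descriptor_t, length)  == 0x4, "length offset");
_Static_assert(offsetof(dma_descriptor_t, src_adr) == 0x8, "src_adr offset");
_Static_assert(offsetof(dma_descriptor_t, dst_adr) == 0xc, "dst_adr offset");

/* Returns 1 if the descriptor respects the alignment rules stated above. */
static int dma_descriptor_valid(const dma_descriptor_t *d)
{
    if (d->operation != DMA_OP_MEMSET && d->operation != DMA_OP_MEMCPY)
        return 0;
    if ((d->length & 0x3u) != 0 || (d->dst_adr & 0x3u) != 0)
        return 0;
    /* For memset, src_adr holds the fill value and needs no alignment. */
    if (d->operation == DMA_OP_MEMCPY && (d->src_adr & 0x3u) != 0)
        return 0;
    return 1;
}
```

Checking a descriptor in software before handing it to the hardware is optional, but it can catch misaligned buffers early while debugging the test cases.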

Simulate and test by running:

flow systb_dma.sim_rtl_questa

2. Hardware implementation

Extend the module to implement a function called memcpy, which copies one memory area to another. The register and descriptor definitions remain unchanged. The module should use the maximum bandwidth available from the TL-UL interface, i.e. read and write transactions should always be pending simultaneously. To achieve this, use a short FIFO (e.g. src/rtl/prim/prim_fifo_sync.sv) between the read and write processes.
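The effect of the FIFO can be illustrated with a rough behavioral model in C. This is not RTL and not a TL-UL model: it ignores the valid/ready handshakes and memory latency, and it assumes that the read side can deliver one word and the write side can consume one word per cycle. The depth of 4 and all names are arbitrary choices for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4u /* assumed depth, chosen only for this sketch */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    unsigned head;  /* index of the oldest entry */
    unsigned count; /* number of valid entries */
} fifo_t;

/* Push one word; returns 0 if the FIFO is full. */
static int fifo_push(fifo_t *f, uint32_t v)
{
    if (f->count == FIFO_DEPTH) return 0;
    f->data[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return 1;
}

/* Pop the oldest word; returns 0 if the FIFO is empty. */
static int fifo_pop(fifo_t *f, uint32_t *v)
{
    if (f->count == 0) return 0;
    *v = f->data[f->head];
    f->head = (f->head + 1u) % FIFO_DEPTH;
    f->count--;
    return 1;
}

/* Copies `words` words and returns the number of modeled cycles.
 * In each cycle the write side drains one word (if available) and the
 * read side fetches one word (if there is room), so both sides stay busy. */
static unsigned model_memcpy(const uint32_t *src, uint32_t *dst, size_t words)
{
    fifo_t f = { {0}, 0, 0 };
    size_t rd = 0, wr = 0;
    unsigned cycles = 0;
    while (wr < words) {
        uint32_t v;
        if (fifo_pop(&f, &v)) dst[wr++] = v;            /* write process */
        if (rd < words && fifo_push(&f, src[rd])) rd++; /* read process */
        cycles++;
    }
    return cycles;
}
```

In this idealized model, copying N words takes N + 1 cycles: one cycle to fill the pipeline, then one word per cycle. A real implementation will see additional latency from the crossbar and the memories, which is where a slightly deeper FIFO helps.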

3. Software implementation

Sample implementations of the memset function are available in rvlab/src/sw/dma/memset.c/h. The function memset_soft is a complete software implementation of memset, while memset_dma uses the hardware implementation to fill a memory area. rvlab/src/sw/dma/memcpy.c/h provides stubs for memcpy_soft and memcpy_dma. Extend memcpy_soft into a complete software implementation of memcpy with the same functionality as your hardware implementation. Complete memcpy_dma so that it invokes your hardware implementation.
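The general sequence for driving the DMA from software is: fill a descriptor, write its address to now_dadr, then poll status until it returns to 0. The sketch below shows this for memset. It is not the code from memset.c: the function name dma_memset_sketch, the register base passed as a pointer, and the caller-provided descriptor storage are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Word indices of the registers at byte offsets 0x0 and 0x4. */
enum { REG_STATUS = 0, REG_NOW_DADR = 1 };

/* regs: base of the student_dma register block (address not specified here).
 * desc: four 32-bit words of memory that the DMA can read. */
static void dma_memset_sketch(volatile uint32_t *regs, volatile uint32_t *desc,
                              uint32_t dst, uint32_t value, uint32_t length)
{
    /* Only word-aligned destinations and lengths are supported. */
    assert((dst & 0x3u) == 0 && (length & 0x3u) == 0);

    desc[0] = 0;      /* operation: 0 = memset */
    desc[1] = length; /* number of bytes */
    desc[2] = value;  /* src_adr carries the fill value for memset */
    desc[3] = dst;    /* first address of the destination buffer */

    /* Writes to now_dadr are ignored while busy, so wait for idle first. */
    while (regs[REG_STATUS] != 0) { }

    /* Writing the descriptor address starts the operation. */
    regs[REG_NOW_DADR] = (uint32_t)(uintptr_t)desc;

    /* Assumes status is already non-zero when the first read arrives. */
    while (regs[REG_STATUS] != 0) { }
}
```

A memcpy variant would differ only in the operation code and in src_adr holding the source buffer address instead of the fill value.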

4. Test cases

Small test cases for memset_soft, memset_dma, memcpy_soft and memcpy_dma are provided in dma/main.c.

If you encounter problems, feel free to modify the test cases or add your own tests. Make sure you do not overwrite anything essential, such as the stack, data or code of your mini application.

5. Benchmarking

Compare the speed of the software implementation of memcpy and memset with the speed of the hardware component for the scenarios listed in the Deliverables. Measure the cycles from the write of the now_dadr register until the status register becomes 0 again.
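The arithmetic for the results table is simple, but counter wraparound is easy to get wrong. The following sketch assumes a free-running 32-bit cycle counter sampled before and after each operation, and it assumes the ratio column means software cycles divided by hardware cycles. How the counter is read on the target is not shown, and both function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles elapsed between two samples of a free-running 32-bit counter.
 * Unsigned subtraction gives the right result across a single wraparound. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Speed-up of the hardware over the software implementation
 * (assumed definition: software cycles / hardware cycles). */
static double speedup_ratio(uint32_t sw_cycles, uint32_t hw_cycles)
{
    assert(hw_cycles != 0);
    return (double)sw_cycles / (double)hw_cycles;
}
```

For the hardware measurement, the first sample would be taken immediately before the write to now_dadr and the second immediately after status reads 0 again, as described above.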

Deliverables

All deliverables should be submitted in a single PDF file.

1. Source texts

  1. Verilog of your TL-UL host (excluding any generated code)

  2. C source of memcpy_soft and memcpy_dma

  3. C of your memcpy test cases

2. Wave views

The wave views should be zoomed in as much as possible to only show the sections specified below. They should contain at least the clk signal, all CPU readable registers (status, length, src_adr, dst_adr) and the TL-UL host interface.

  1. memcpy_dma: zoomed-in view showing the transfer of at least 3 words

  2. memcpy_soft: zoomed-in view showing the transfer of at least 1 word

3. Benchmarking results

operation                    software [cycles]  hardware [cycles]  ratio
memset of 1kB in SRAM
memset of 1kB in DDR3
memcpy of 1kB SRAM to SRAM
memcpy of 1kB SRAM to DDR3
memcpy of 1kB DDR3 to SRAM