The design includes a Linux software driver to set up the DMA transfers. The read DMA moves data from the system memory to the on-chip or external memory. The write DMA moves data from the on-chip or external memory to the system memory. The Linux software driver also measures the system performance. This reference design allows you to evaluate the performance of the PCIe® protocol in using the Avalon® -MM interface with an embedded, high-performance DMA.
The reference design runs on the following development kits:
- Intel® Arria® 10 GX FPGA Development Kit
- Intel® Cyclone® 10 GX FPGA Development Kit
- Intel® Stratix® 10 FPGA Development Kit
The reference design requires two computers:
- A computer with a PCIe® Gen3 x8 or x16 slot running Linux. This computer is computer number 1.
- A second computer with the Intel® Quartus® Prime software version 18.0 installed. This computer downloads the FPGA SRAM Object File (.sof) to the FPGA on the development kit. This computer is computer number 2.
- The reference design software installed on computer number 1. The reference designs are available in the Intel FPGA Design Store. The Intel® Quartus® Prime Pro Edition Platform Archive File (.par) includes the recommended synthesis, fitter, and timing analysis settings for the parameters specified in the reference designs.
- The Intel® Quartus® Prime software installed on computer number 2. You can download this software from Intel® Quartus® Prime Pro Edition Software Features/Download web page.
- The Linux driver configured specifically for these reference designs.
The Avalon-MM interface with DMA includes the following modules:
Read Data Mover
The Read Data Mover sends memory read Transaction Layer Packet (TLPs) upstream. After the Read Data Mover receives the Completion, the Read Data Mover writes the received data to the on-chip or external memory.
Write Data Mover
The Write Data Mover reads data from the on-chip or external memory and sends the data upstream using memory write TLPs on the PCIe® link.
DMA Descriptor Controller
The Descriptor Controller module manages the DMA read and write operations.
Host software programs internal registers in the Descriptor Controller with the location and size of the descriptor table residing in the host system memory through the Avalon® -MM RX master port. Based on this information, the Descriptor Controller directs the Read Data Mover to copy the entire table to local FIFOs for execution. The Descriptor Controller sends completion status upstream via the Avalon® TX slave (TXS) port.
You can also use your own external descriptor controller to manage the Read and Write Data Movers. However, you cannot change the interface between your own external controller and the Read and Write Data Movers embedded in the reference design.
The TX Slave module propagates Avalon® -MM reads and writes upstream. External Avalon® -MM masters, including the DMA control master, can access system memory using the TX Slave. The DMA Controller uses this path to update the DMA status upstream, using Message Signaled Interrupt (MSI) TLPs.
RX Master (Internal Port for BAR0 Control)
The RX Master module propagates single dword read and write TLPs from the Root Port to the Avalon® -MM domain via a 32-bit Avalon® -MM master port. Software instructs the RX Master to send control, status, and descriptor information to Avalon® -MM slaves, including the DMA control slave. The RX Master port is an internal port that not visible in Platform Designer.
The reference design uses the following directory structure:
- top—The top-level module.
- top_hw— Platform Designer top-level files. If you modify the design using Platform Designer, you must regenerate the system for the changes to take effect.
This reference design supports a 256-byte maximum payload size. The following tables list the values for all the parameters.
|Number of lanes||
Intel® Cyclone® 10 GX: x4
Intel® Arria® 10, Intel® Stratix® 10: x8
Intel® Cyclone® 10 GX: Gen2 (5.0 Gbps)
Intel® Arria® 10 Intel® Stratix® 10: Gen3 (8.0 Gbps)
|RX buffer credit allocation – performance for received request||
Intel® Arria® 10, Intel® Cyclone® 10 GX: Low
Intel® Stratix® 10: Not available
|BAR0||64-bit prefetchable memory|
64-bit prefetchable memory
BAR2 is disabled for Intel® Stratix® 10
64-bit prefetchable memory
BAR4 is disabled for Intel® Arria® 10 and Intel® Cyclone® 10 GX
|Subsystem Vendor ID||0x00000000|
|Subsystem Device ID||0x00000000|
|Maximum payload size||256 Bytes|
|Completion timeout range||None|
|Implement Completion Timeout Disable||Enabled|
|Advanced Error Reporting (AER)||Enabled|
|Link port number||1|
|Slot clock configuration||Enabled|
|Number of MSI messages requested||4|
|Table BAR indicator||0|
|Pending bit array (PBA) offset||0x0000000000000000|
|PBA BAR Indicator||0|
|Endpoint L0s acceptable latency||Maximum of 64 ns|
|Endpoint L1 acceptable latency||Maximum of 1 us|
|Address width of accessible PCIe memory space||40|
|Avalon® -MM port||
This is an Avalon® -MM master port. The PCIe® host accesses the memory through PCIe® BAR2 for Intel® Arria® 10 and Intel® Cyclone® 10 GX devices. The host accesses the memory through PCIe® BAR4 for Intel® Stratix® 10 devices. These BARs connect to both on-chip and external memory.
In a typical application, system software controls this BAR to initialize random data in the external memory. Software also reads the data back to verify correct operation.
|txs||TX Avalon® -MM Slave||
This is an Avalon® -MM slave port. In a typical application, an Avalon® -MM master controls this port to send memory reads and writes to the PCIe® domain.
When the DMA completes operation, the Descriptor Controller uses this port to write DMA status back to the descriptor table in the PCIe® domain. The Descriptor Controller also uses this port to send MSI interrupts upstream.
|dma_rd_master||Read Data Mover||
This is an Avalon® -MM master port. The Read Data Mover uses this Avalon® -MM master to move data from the PCIe® domain to either the on-chip or external memory. The Read Data Mover also uses this port to fetch descriptors from the PCIe® domain and write them to the FIFO in the Descriptor Controller.
The design includes separate descriptor tables for read and write descriptors. Consequently, the dma_rd_master port connects to wr_dts_slave for the write DMA descriptor FIFO and rd_dts_slave for the read DMA descriptor FIFO.
|dma_wr_master||Write Data Mover||
This is an Avalon® -MM master port. The Write Data Mover uses this Avalon® -MM master to read data from either the on-chip or external memory and then write data to the PCIe® domain.
The external memory controller is a single-port RAM. Consequently, the Write Data Mover and the Read Data Mover must share this port to assess external memory.
|FIFO in Descriptor Controller||
These are Avalon® -MM slave ports for the FIFOs in the Descriptor Controller. When the Read Data Mover fetches the descriptors from system memory, the Read Data Mover writes the descriptors to the FIFO using the wr_dts_slave and rd_dts_slave ports.
|Control module in Descriptor Controller||
The Descriptor Controller control module includes one transmit and one receive port for the read and write DMAs. The receive port connects to RXM_BAR0. The transmit port connects to the txs.
The receive path from the RXM_BAR0 connects internally. RXM_BAR0 is not shown in the Platform Designer connections panel. For the transmit path, both read and write DMA ports connect to the txs externally. These ports are visible in the Platform Designer connections panel.
|Internal connection, not shown||Avalon® -MM port||This Avalon® -MM master port passes the memory access from the PCIe® host to PCIe® BAR0. The host uses this port to program the Descriptor Controller. Because this reference design includes the Descriptor Controller as an internal module, Platform Designer does not display this port on the top-level connections panel.|
|onchip_memory2_0||64 KB Dual Port RAM||
This is a 64-KB dual-port on-chip memory. The address range is 0x0800_0000-0x0800_FFFF on the Avalon® -MM bus. This address is the source address for write DMAs or destination address for read DMAs.
To prevent data corruption, software divides the memory into separate regions for reads and writes. The regions do not overlap.
|Intel DDR3 or DDR4 controller||
DDR3 or DDR4 Controller
|This is a single-port DDR3 or DDR4 controller.|
Software running on the host completes the following steps to initiate the DMA and verify the results:
- Software allocates system memory for the descriptor table.
- Software allocates system memory for the DMA data transfers.
- Software writes the descriptors to the descriptor table in the
system memory. The DMA supports up to 128 read and 128 write descriptors. The
descriptor table records the following information:
- Descriptor ID, ranging from 0-127
- Source address
- Destination address
- For the read DMA, the software initializes the system memory space with random data. The Read Data Mover moves this data from the system memory to either the on-chip or external memory. For the write DMA, the software initializes the on-chip or external memory with random data. The Write Data Mover moves the data from the on-chip or external memory to the system memory.
- Software programs the registers in the Descriptor Controller's control module through BAR0. Programming specifies the base address of the descriptor table in system memory and the base address of the FIFO that stores the descriptors in the FPGA.
- To initiate the DMA, software writes the ID of the last descriptor to the Descriptor Controller's control logic. The DMA begins fetching descriptors. The DMA starts with descriptor ID 0 and finishes with the ID of the last descriptor.
- After data transfers for the last descriptor complete, the Descriptor Controller writes 1'b1 to the Done bit in the descriptor table entry corresponding to the last descriptor in the PCIe® domain using the txs port.
- Software polls the Done bit in the descriptor table entry corresponding to the last descriptor. After the DMA Controller writes the Done bit, the DMA Controller calculates throughput. Software compares the data in the system memory to the on-chip or external memory. The test passes if there are no errors.
- For simultaneous read and writes, the software begins the read DMA operation before the write DMA operation. The DMA completes when all the read and write DMAs finish.
- Power down computer number 1.
- Plug the FPGA Development Kit card into a PCIe® slot that supports Gen2 x4 or Gen3 x8.
- For the Intel® Stratix® 10 10 FPGA Development Kit, connectors J26 and J27 power the card. After inserting the card into an available PCIe slot, connect 2x4- and 2x3-pin PCIe® power cables from the power supply of computer number 1 to the J26 and J27 of the PCIe® card, respectively.
Connect a USB cable from computer number 2 to the FPGA
The Development Kit includes an on-board Intel® FPGA Download Cable for FPGA programming.
- To power up FPGA Development Kit via the PCIe® slot, power on computer number 1. Alternatively, you can power up FPGA Development Kit using the external power adapter that ships with the kit.
- For the Intel® Cyclone® 10 GX FPGA Development Kit, an on-board programmable oscillator is the clock source for hardware components. Follow the instructions in Setting Up the Intel® Cyclone® 10 GX FPGA Programmable Oscillator to program this oscillator.
On computer number 2, bring up the
Quartus® Prime programmer and configure the FPGA through an
Intel® FPGA Download Cable.
Note: You must reconfigure the FPGA whenever the FPGA Development Kit loses power.
To force system enumeration to discover the
device, restart computer 1.
If you are using the Intel® Stratix® 10 GX FPGA Development Kit, you might get the following error message during BIOS initialization if the memory mapped I/O is only 4 GB: Insufficient PCI Resources Detected. To work around this issue, enable Above 4G Decoding in the BIOS Boot menu.
- Locate the Kit Collateral (zip) link in the Documentation area of the Intel® Cyclone® 10 GX FPGA Development Kit web page.
- Use this link to download cyclone-10-gx-kit-collateral.zip
- Unzip cyclone-10-gx-kit-collateral.zip to a working directory on computer number 2.
To bring up the Clock Controller
dialog box, type the following commands:
% cd <install_dir>/cyclone-10-gx-collateral/examples/board_test_system/ % ./ClockController.shFigure 6. Clock Controller GUI in Initial State
- In the Clock Controller GUI, click Import.
- Browse to the <install_dir>/cyclone-10-gx-collateral/examples/board_test_system/ directory and select U64-Registers.txt.
To import the register settings, click Open.
The message, Si5332 Register Map is imported successfully displays. You should see the clock settings shown below.Figure 7. Clock Settings for Intel® Cyclone® 10 GX FPGA Development Kit
In a terminal window on computer 1, change to the DMA driver
directory and extract AN829_driver.tar by
typing the following commands:
cd % <install_dir>/<device>/_PCIe<GenxN>DMA_<QuartusVer>_project/driver % tar -xvf AN829_driver.tar
To install the Linux driver for the appropriate device family,
type the command:
% sudo./install <device_family>Valid values for <device_family> are arria10, cyclone10, and stratix10.
To run the DMA application, type the following command:
The application prints the commands available to specify the DMA traffic. By default, the software enables DMA reads, DMA writes, and Simultaneous DMA reads and writes. The following table lists the available commands:
Table 11. DMA Test Commands Command Number Function 1 Start the DMA. 2 Enable or disable read DMA. 3 Enable or disable write DMA. 4 Enable or disable simultaneous read and write DMA. 5 Set the number of dwords per descriptor. The legal range is 256-4096 dwords. 6 Set the number of descriptors. The legal range is 1-127 descriptors. 7 By default, the reference design selects on-chip memory. If select this command consecutive runs switch between on-chip and external memory. 8 Run a the DMA in a continuous loop. 10 Exit
For example, type the following commands to specify 4096 dwords per descriptor and 127 descriptors:
% 5 4096 % 6 127 % 8The following figures show the throughput for DMA reads, DMA writes, and simultaneous DMA reads and writes:Figure 8. Intel® Arria® 10 DMA ThroughputFigure 9. Intel® Cyclone® 10 GX DMA ThroughputFigure 10. Intel® Stratix® 10 DMA Throughput
The throughput in a PCI Express system depends on the following factors:
- Protocol overhead
- Payload size
- Completion latency
- Flow control update latency
- Devices forming the link
Protocol overhead includes the following three components:
- 128b/130b Encoding and Decoding—Gen3 links use 128b/130b encoding. This encoding adds two synchronization (sync) bits to each 128-bit data transfer. Consequently, the encoding and decoding overhead is very small at 1.56%. The effective data rate of a Gen3 x8 link is about 8 gigabytes per second (GBps).
- Data Link Layer Packets (DLLPs) and Physical Layer Packets (PLPs)—An active link also transmits DLLPs and PLPs. The PLPs consist of SKP ordered sets which are 16-24 bytes. The DLLPs are two dwords. The DLLPs implement flow control and the ACK/NAK protocol.
- TLP Packet Overhead—The
overhead associated with a single TLP ranges from 5-7 dwords if the optional
ECRC is not included. The overhead includes the following fields:
- The Start and End Framing Symbols
- The Sequence ID
- A 3- or 4-dword TLP header
- The Link Cyclic Redundancy Check (LCRC)
- 0-1024 dwords of data payload
The theoretical maximum throughput calculation uses the following formula:
Throughput = payload size / (payload size + overhead) * link data rate
Maximum throughput = 8 cycles/9 cycles = 88.88% * 8 GBps = 7.2 GBps
The Device Control register, bits [7:5], specifies the maximum TLP payload size of the current system. The Maximum Payload Size field of the Device Capabilities register, bits [2:0], specifies the maximum permissible value for the payload. You specify this read-only parameter, called Maximum Payload Size, using the parameter editor. After determining the maximum TLP payload for the current system, software records that value in the Device Control register. This value must be less than the maximum payload specified in the Maximum Payload Size field of the Device Capabilities register.
Understanding Flow Control for PCI Express
Flow control guarantees that a TLP is not transmitted unless the receiver has enough buffer space to accept the TLP. There are separate credits for headers and payload data. A device needs sufficient header and payload credits before sending a TLP. When the Application Layer in the completer accepts the TLP, it frees up the RX buffer space in the completer’s Transaction Layer. The completer sends a flow control update packet (FC Update DLLP) to replenish the consumed credits to the initiator. When a device consumes all its credits, the rate of FC Update DLLPs to replenish header and payload credits limits throughput. The flow control updates depend on the maximum payload size and the latencies of two connected devices.
PCI Express uses a split transaction model for reads. The read transaction includes the following steps:
- The requester sends a Memory Read Request.
- The completer sends out the ACK DLLP to acknowledge the Memory Read Request.
- The completer returns a Completion with Data. The completer can split the Completion into multiple completion packets.
Read throughput is typically lower than write throughput because reads require two transactions instead of a single write for the same amount of data. The read throughput also depends on the round trip delay between the time when the Application Layer issues a Memory Read Request and the time when the requested data returns. To maximize the throughput, the application must issue enough outstanding read requests to cover this delay.
The figures below show the timing for Memory Read Requests (MRd) and Completions with Data (CplD). The first figure shows the requester waiting for the completion before issuing the subsequent requests. Waiting results in lower throughput. The second figure shows the requester making multiple outstanding read requests to eliminate the delay after the first data returns. Eliminating delays results in higher throughput.
To maintain maximum throughput for the completion data packets, the requester must optimize the following settings:
- The number of completions in the RX buffer
- The rate at which the Application Layer issues read requests and processes the completion data
Read Request Size
Another factor that affects throughput is the read request size. If a requester requires 4 KB data, the requester can issue four, 1 KB read requests or a single 4 KB read request. The 4 KB request results in higher throughput than the four, 1 KB reads. The Maximum Read Request Size value in Device Control register, bits [14:12], specifies the read request size.
Outstanding Read Requests
A final factor that can affect the throughput is the number of outstanding read requests. If the requester sends multiple read requests to improve throughput, the number of available header tags limits the number of outstanding read requests. To achieve higher performance, Intel® Arria® 10 and Intel® Cyclone® 10 GX read DMA can use up to 16 header tags. The Intel® Stratix® 10 read DMA can use up to 32 header tags.
To measure throughput, the software driver takes two timestamps. Software takes the first timestamp shortly after the you type the ./run command. Software takes the second timestamp after the DMA completes and returns the required completion status, EPLAST. If read DMA, write DMA and simultaneous read and write DMAs are all enabled, the driver takes six timestamps to make the three measurements.
This reference design provides a choice between on-chip memory implemented in the FPGA fabric and external memory available on the PCB. The on-chip memory supports separate read and write ports. Consequently, this memory supports simultaneous read and the write DMAs.
The external memory supports a single port. Consequently, the external memory does not support simultaneous read DMA and write DMA accesses. In addition, the latency of external memory is higher than the latency of on-chip memory. These two differences between the on-chip and external memory result in lower throughput for the external memory implementation.
To compare the throughput for on-chip and external memory, select command 7 for consecutive runs to switch between on-chip and external memory.
|Document Version||Intel® Quartus® Prime Version||Changes|