FFT IP Core User Guide
About This IP Core
Altera DSP IP Core Features
 Avalon^{®} Streaming (Avalon-ST) interfaces
 DSP Builder ready
 Testbenches to verify the IP core
 IP functional simulation models for use in Altera-supported VHDL and Verilog HDL simulators
FFT IP Core Features
 Bit-accurate MATLAB models
 Variable streaming FFT:
 Single-precision floating-point or fixed-point representation
 Radix-4, mixed radix-4/2 implementations (for floating-point FFT), and radix-2^{2} single delay feedback implementation (for fixed-point FFT)
 Input and output orders: natural order, bit-reversed or digit-reversed, and DC-centered (-N/2 to N/2)
 Reduced memory requirements
 Support for 8- to 32-bit data and twiddle width (fixed-point FFTs)
 Fixed transform size FFT that implements block floating-point FFTs and maintains the maximum dynamic range of data during processing (not for variable streaming FFTs)
 Multiple I/O data flow options: streaming, buffered burst, and burst
 Uses embedded memory
 Maximum system clock frequency more than 300 MHz
 Optimized to use Stratix series DSP blocks and TriMatrix memory
 High-throughput quad-output radix-4 FFT engine
 Support for multiple single-output and quad-output engines in parallel
 User control over optimization in DSP blocks or in speed in Stratix V devices, for streaming, buffered burst, burst, and variable streaming fixed-point FFTs
 Avalon Streaming (Avalon-ST) compliant input and output interfaces
 Parameterization-specific VHDL and Verilog HDL testbench generation
 Transform direction (FFT/IFFT) specifiable on a per-block basis
General Description
The FFT MegaCore function implements:
 Fixed transform size FFT
 Variable streaming FFT
Fixed Transform Size FFT
The fixed transform FFT accepts as input a two's complement format complex data vector of length N in natural order, where N is the desired transform length. The function outputs the transform-domain complex vector in natural order. The FFT produces an accumulated block exponent to indicate any data scaling that has occurred during the transform to maintain precision and maximize the internal signal-to-noise ratio. You can specify the transform direction on a per-block basis using an input port.
Variable Streaming FFT
The fixed-point representation grows the data widths naturally from input through to output, thereby maintaining a high SNR at the output. The single-precision floating-point representation allows a large dynamic range of values to be represented while maintaining a high SNR at the output.
The order of the input data vector of size N can be natural, bit- or digit-reversed, or -N/2 to N/2 (DC-centered). The fixed-point representation supports a natural, bit-reversed, or DC-centered order, and the floating-point representation supports a natural or digit-reversed order. The FFT outputs the transform-domain complex vector in natural, bit-reversed, or digit-reversed order. You can specify the transform direction on a per-block basis using an input port.
DSP IP Core Device Family Support
Altera^{®} offers the following device support levels for Altera^{®} IP cores:
 Preliminary support—Altera^{®} verifies the IP core with preliminary timing models for this device family. The IP core meets all functional requirements, but might still be undergoing timing analysis for the device family. You can use it in production designs with caution.
 Final support—Altera^{®} verifies the IP core with final timing models for this device family. The IP core meets all functional and timing requirements for the device family. You can use it in production designs.
Device Family  Support 

Arria^{®} II GX  Final 
Arria II GZ  Final 
Arria V  Final 
Arria 10  Final 
Cyclone^{®} IV  Final 
Cyclone V  Final 
MAX^{®} 10 FPGA  Final 
Stratix^{®} IV GT  Final 
Stratix IV GX/E  Final 
Stratix V  Final 
Other device families  No support 
DSP IP Core Verification
FFT IP Core Release Information
Item  Description 

Version  16.0 
Release Date  May 2016 
Ordering Code  IPFFT 
Product ID  0034 
Vendor ID  6AF7 
Performance and Resource Utilization
Device  Type  Length  Engines  ALM  DSP Blocks  Memory (M10K)  Memory (M20K)  Registers (Primary)  Registers (Secondary)  f_{MAX} (MHz) 

Arria V  Buffered Burst  1,024  1  1,572  6  16    3,903  143  275 
Arria V  Buffered Burst  1,024  2  2,512  12  30    6,027  272  274 
Arria V  Buffered Burst  1,024  4  4,485  24  59    10,765  426  262 
Arria V  Buffered Burst  256  1  1,532  6  16    3,713  136  275 
Arria V  Buffered Burst  256  2  2,459  12  30    5,829  246  245 
Arria V  Buffered Burst  256  4  4,405  24  59    10,539  389  260 
Arria V  Buffered Burst  4,096  1  1,627  6  59    4,085  130  275 
Arria V  Buffered Burst  4,096  2  2,555  12  59    6,244  252  275 
Arria V  Buffered Burst  4,096  4  4,526  24  59    10,986  438  265 
Arria V  Burst Quad Output  1,024  1  1,565  6  8    3,807  147  273 
Arria V  Burst Quad Output  1,024  2  2,497  12  14    5,952  225  275 
Arria V  Burst Quad Output  1,024  4  4,461  24  27    10,677  347  257 
Arria V  Burst Quad Output  256  1  1,527  6  8    3,610  153  272 
Arria V  Burst Quad Output  256  2  2,474  12  14    5,768  233  275 
Arria V  Burst Quad Output  256  4  4,403  24  27    10,443  437  257 
Arria V  Burst Quad Output  4,096  1  1,597  6  27    3,949  151  275 
Arria V  Burst Quad Output  4,096  2  2,551  12  27    6,119  223  275 
Arria V  Burst Quad Output  4,096  4  4,494  24  27    10,844  392  256 
Arria V  Burst Single Output  1,024  1  672  2  6    1,488  101  275 
Arria V  Burst Single Output  1,024  2  994  4  10    2,433  182  275 
Arria V  Burst Single Output  256  1  636  2  3    1,442  95  275 
Arria V  Burst Single Output  256  2  969  4  8    2,375  152  275 
Arria V  Burst Single Output  4,096  1  702  2  19    1,522  126  270 
Arria V  Burst Single Output  4,096  2  1,001  4  25    2,521  156  275 
Arria V  Streaming  1,024  —  1,880  6  20    4,565  167  275 
Arria V  Streaming  256  —  1,647  6  20    3,838  137  275 
Arria V  Streaming  4,096  —  1,819  6  71    4,655  137  275 
Arria V  Variable Streaming Floating Point  1,024  —  11,195  48  89    18,843  748  163 
Arria V  Variable Streaming Floating Point  256  —  8,639  36  62    15,127  609  161 
Arria V  Variable Streaming Floating Point  4,096  —  13,947  60  138    22,598  854  162 
Arria V  Variable Streaming  1,024  —  2,535  11  14    6,269  179  223 
Arria V  Variable Streaming  256  —  1,913  8  8    4,798  148  229 
Arria V  Variable Streaming  4,096  —  3,232  15  31    7,762  285  210 
Cyclone V  Buffered Burst  1,024  1  1,599  6  16    3,912  114  226 
Cyclone V  Buffered Burst  1,024  2  2,506  12  30    6,078  199  219 
Cyclone V  Buffered Burst  1,024  4  4,505  24  59    10,700  421  207 
Cyclone V  Buffered Burst  256  1  1,528  6  16    3,713  115  227 
Cyclone V  Buffered Burst  256  2  2,452  12  30    5,833  211  232 
Cyclone V  Buffered Burst  256  4  4,487  24  59    10,483  424  221 
Cyclone V  Buffered Burst  4,096  1  1,649  6  59    4,060  138  223 
Cyclone V  Buffered Burst  4,096  2  2,555  12  59    6,254  199  227 
Cyclone V  Buffered Burst  4,096  4  4,576  24  59    10,980  377  214 
Cyclone V  Burst Quad Output  1,024  1  1,562  6  8    3,810  122  225 
Cyclone V  Burst Quad Output  1,024  2  2,501  12  14    5,972  196  231 
Cyclone V  Burst Quad Output  1,024  4  4,480  24  27    10,643  372  216 
Cyclone V  Burst Quad Output  256  1  1,534  6  8    3,617  120  226 
Cyclone V  Burst Quad Output  256  2  2,444  12  14    5,793  153  224 
Cyclone V  Burst Quad Output  256  4  4,443  24  27    10,402  379  223 
Cyclone V  Burst Quad Output  4,096  1  1,590  6  27    3,968  120  237 
Cyclone V  Burst Quad Output  4,096  2  2,547  12  27    6,135  209  227 
Cyclone V  Burst Quad Output  4,096  4  4,512  24  27    10,798  388  210 
Cyclone V  Burst Single Output  1,024  1  673  2  6    1,508  83  222 
Cyclone V  Burst Single Output  1,024  2  984  4  10    2,475  126  231 
Cyclone V  Burst Single Output  256  1  639  2  3    1,382  159  229 
Cyclone V  Burst Single Output  256  2  967  4  8    2,353  169  240 
Cyclone V  Burst Single Output  4,096  1  695  2  19    1,540  105  237 
Cyclone V  Burst Single Output  4,096  2  1,009  4  25    2,536  116  240 
Cyclone V  Streaming  1,024  —  1,869  6  20    4,573  132  211 
Cyclone V  Streaming  256  —  1,651  6  20    3,878  85  226 
Cyclone V  Streaming  4,096  —  1,822  6  71    4,673  124  199 
Cyclone V  Variable Streaming Floating Point  1,024  —  11,184  48  89    18,830  628  133 
Cyclone V  Variable Streaming Floating Point  256  —  8,611  36  62    15,156  467  133 
Cyclone V  Variable Streaming Floating Point  4,096  —  13,945  60  138    22,615  701  132 
Cyclone V  Variable Streaming  1,024  —  2,533  11  14    6,254  240  179 
Cyclone V  Variable Streaming  256  —  1,911  8  8    4,786  176  180 
Cyclone V  Variable Streaming  4,096  —  3,226  15  31    7,761  320  176 
Stratix V  Buffered Burst  1,024  1  1,610  6    16  4,141  107  424 
Stratix V  Buffered Burst  1,024  2  2,545  12    30  6,517  170  427 
Stratix V  Buffered Burst  1,024  4  4,554  24    59  11,687  250  366 
Stratix V  Buffered Burst  256  1  1,546  6    16  3,959  110  493 
Stratix V  Buffered Burst  256  2  2,475  12    30  6,314  134  440 
Stratix V  Buffered Burst  256  4  4,480  24    59  11,477  281  383 
Stratix V  Buffered Burst  4,096  1  1,668  6    30  4,312  122  432 
Stratix V  Buffered Burst  4,096  2  2,602  12    30  6,718  176  416 
Stratix V  Buffered Burst  4,096  4  4,623  24    59  11,876  249  392 
Stratix V  Burst Quad Output  1,024  1  1,550  6    8  4,037  115  455 
Stratix V  Burst Quad Output  1,024  2  2,444  12    14  6,417  164  433 
Stratix V  Burst Quad Output  1,024  4  4,397  24    27  11,548  330  416 
Stratix V  Burst Quad Output  256  1  1,487  6    8  3,868  83  477 
Stratix V  Burst Quad Output  256  2  2,387  12    14  6,211  164  458 
Stratix V  Burst Quad Output  256  4  4,338  24    27  11,360  307  409 
Stratix V  Burst Quad Output  4,096  1  1,593  6    14  4,222  93  448 
Stratix V  Burst Quad Output  4,096  2  2,512  12    14  6,588  154  470 
Stratix V  Burst Quad Output  4,096  4  4,468  24    27  11,773  267  403 
Stratix V  Burst Single Output  1,024  1  652  2    4  1,553  111  500 
Stratix V  Burst Single Output  1,024  2  1,011  4    8  2,687  149  476 
Stratix V  Burst Single Output  256  1  621  2    3  1,502  132  500 
Stratix V  Burst Single Output  256  2  978  4    8  2,555  173  500 
Stratix V  Burst Single Output  4,096  1  681  2    9  1,589  149  500 
Stratix V  Burst Single Output  4,096  2  1,039  4    14  2,755  161  476 
Stratix V  Streaming  1,024  —  1,896  6    20  4,814  144  490 
Stratix V  Streaming  256  —  1,604  6    20  4,062  99  449 
Stratix V  Streaming  4,096  —  1,866  6    38  4,889  118  461 
Stratix V  Variable Streaming Floating Point  1,024  —  11,607  32    87  19,031  974  355 
Stratix V  Variable Streaming Floating Point  256  —  8,850  24    59  15,297  820  374 
Stratix V  Variable Streaming Floating Point  4,096  —  14,335  40    115  22,839  1,047  325 
Stratix V  Variable Streaming  1,024  —  2,334  14    13  5,623  201  382 
Stratix V  Variable Streaming  256  —  1,801  10    8  4,443  174  365 
Stratix V  Variable Streaming  4,096  —  2,924  18    23  6,818  238  355 
FFT IP Core Getting Started
Licensing IP Cores
OpenCore Plus IP Evaluation
 Simulate the behavior of a licensed IP core in your system.
 Verify the functionality, size, and speed of the IP core quickly and easily.
 Generate time-limited device programming files for designs that include IP cores.
 Program a device with your IP core and verify your design in hardware.
OpenCore Plus evaluation supports the following two operation modes:
 Untethered—run the design containing the licensed IP for a limited time.
 Tethered—run the design containing the licensed IP for a longer time or indefinitely. This requires a connection between your board and the host computer.
FFT IP Core OpenCore Plus Timeout Behavior
For IP cores, the untethered timeout is 1 hour; the tethered timeout value is indefinite. Your design stops working after the hardware evaluation time expires. The Quartus Prime software uses OpenCore Plus Files (.ocp) in your project directory to identify your use of the OpenCore Plus evaluation program. After you activate the feature, do not delete these files.
When the evaluation time expires, the source_real, source_imag, and source_exp signals go low.
IP Catalog and Parameter Editor
 Filter IP Catalog to Show IP for active device family or Show IP for all device families. If you have no project open, select the Device Family in IP Catalog.
 Type in the Search field to locate any full or partial IP core name in IP Catalog.
 Right-click an IP core name in IP Catalog to display details about supported devices, open the IP core's installation folder, and click links to IP documentation.
 Click Search for Partner IP to access partner IP information on the Altera website.
The parameter editor prompts you to specify an IP variation name, optional ports, and output file generation options. The parameter editor generates a top-level Qsys system file (.qsys) or Quartus^{®} Prime IP file (.qip) representing the IP core in your project. You can also parameterize an IP variation without an open project.
The IP Catalog is also available in Qsys (View > IP Catalog). The Qsys IP Catalog includes exclusive system interconnect, video and image processing, and other system-level IP that are not available in the Quartus^{®} Prime IP Catalog. For more information about using the Qsys IP Catalog, refer to Creating a System with Qsys in Volume 1 of the Quartus^{®} Prime Handbook.
Generating IP Cores
 In the IP Catalog (Tools > IP Catalog), locate and double-click the name of the IP core to customize. The parameter editor appears.
 Specify a top-level name for your custom IP variation. The parameter editor saves the IP variation settings in a file named <your_ip>.qsys. Click OK. Do not include spaces in IP variation names or paths.
 Specify the parameters and options for your IP variation in the parameter editor, including one or more of the following:
 Optionally select preset parameter values if provided for your IP core. Presets specify initial parameter values for specific applications.
 Specify parameters defining the IP core functionality, port configurations, and device-specific features.
 Specify options for processing the IP core files in other EDA tools.
Note: Refer to your IP core user guide for information about specific IP core parameters.
 Click Generate HDL. The Generation dialog box appears.
 Specify output file generation options, and then click Generate. The synthesis and/or simulation files for the IP variation generate according to your specifications.
 To generate a simulation testbench, click Generate > Generate Testbench System. Specify testbench generation options, and then click Generate.
 To generate an HDL instantiation template that you can copy and paste into your text editor, click Generate > Show Instantiation Template.
 Click Finish. Click Yes if prompted to add files representing the IP variation to your project. Optionally turn on the option to Automatically add Quartus Prime IP Files to All Projects. Click Project > Add/Remove Files in Project to add IP files at any time.
Figure 2. Adding IP Files to Project
Note:For Arria 10 devices, the generated .qsys file must be added to your project to represent IP and Qsys systems. For devices released prior to Arria 10 devices, the generated .qip and .sip files must be added to your project for IP and Qsys systems.

 After generating and instantiating your IP variation, make appropriate pin assignments to connect ports.
Note: Some IP cores generate different HDL implementations according to the IP core parameters. The underlying RTL of these IP cores contains a unique hash code that prevents module name collisions between different variations of the IP core. This unique code remains consistent, given the same IP settings and software version during IP generation. This unique code can change if you edit the IP core's parameters or upgrade the IP core version. To avoid dependency on these unique codes in your simulation environment, refer to Generating a Combined Simulator Setup Script.
Files Generated for Altera IP Cores and Qsys Systems
File Name  Description 

<my_ip>.qsys  The Qsys system or top-level IP variation file. 
<system>.sopcinfo  Describes the connections and IP component parameterizations in your Qsys system. You can parse the contents of this file to get requirements when you develop software drivers for IP components. Downstream tools such as the Nios II tool chain use this file. The .sopcinfo file and the system.h file generated for the Nios II tool chain include address map information for each slave relative to each master that accesses the slave. Different masters may have a different address map to access a particular slave component. 
<my_ip>.cmp  The VHDL Component Declaration (.cmp) file is a text file that contains local generic and port definitions that you can use in VHDL design files. 
<my_ip>.html  A report that contains connection information, a memory map showing the slave address with respect to each master that the slave connects to, and parameter assignments. 
<my_ip>_generation.rpt  IP or Qsys generation log file. A summary of the messages during IP generation. 
<my_ip>.debuginfo  Contains post-generation information. Passes System Console and Bus Analyzer Toolkit information about the Qsys interconnect. The Bus Analysis Toolkit uses this file to identify debug components in the Qsys interconnect. 
<my_ip>.qip  Contains all the required information about the IP component to integrate and compile the IP component in the Quartus^{®} Prime software. 
<my_ip>.csv  Contains information about the upgrade status of the IP component. 
<my_ip>.bsf  A Block Symbol File (.bsf) representation of the IP variation for use in Quartus^{®} Prime Block Diagram Files (.bdf). 
<my_ip>.spd  Required input file for ip-make-simscript to generate simulation scripts for supported simulators. The .spd file contains a list of files generated for simulation, along with information about memories that you can initialize. 
<my_ip>.ppf  The Pin Planner File (.ppf) stores the port and node assignments for IP components created for use with the Pin Planner. 
<my_ip>_bb.v  You can use the Verilog black-box (_bb.v) file as an empty module declaration for use as a black box. 
<my_ip>.sip  Contains information required for NativeLink simulation of IP components. You must add the .sip file to your Quartus project to enable NativeLink for Arria II, Arria V, Cyclone IV, Cyclone V, MAX 10, MAX II, MAX V, Stratix IV, and Stratix V devices. The Quartus^{®} Prime Pro Edition does not support NativeLink simulation. 
<my_ip>_inst.v or _inst.vhd  HDL example instantiation template. You can copy and paste the contents of this file into your HDL file to instantiate the IP variation. 
<my_ip>.regmap  If the IP contains register information, the Quartus^{®} Prime software generates the .regmap file. The .regmap file describes the register map information of master and slave interfaces. This file complements the .sopcinfo file by providing more detailed register information about the system. This file enables register display views and user customizable statistics in System Console. 
<my_ip>.svd  Allows HPS System Debug tools to view the register maps of peripherals connected to HPS within a Qsys system. During synthesis, the Quartus^{®} Prime software stores the .svd files for slave interface visible to the System Console masters in the .sof file in the debug session. System Console reads this section, which Qsys can query for register map information. For system slaves, Qsys can access the registers by name. 
<my_ip>.v <my_ip>.vhd  HDL files that instantiate each submodule or child IP core for synthesis or simulation. 
mentor/  Contains a ModelSim^{®} script msim_setup.tcl to set up and run a simulation. 
aldec/  Contains a Riviera-PRO script rivierapro_setup.tcl to set up and run a simulation. 
/synopsys/vcs /synopsys/vcsmx  Contains a shell script vcs_setup.sh to set up and run a VCS^{®} simulation. Contains a shell script vcsmx_setup.sh and synopsys_sim.setup file to set up and run a VCS MX^{®} simulation. 
/cadence  Contains a shell script ncsim_setup.sh and other setup files to set up and run an NCSIM simulation. 
/submodules  Contains HDL files for the IP core submodule. 
<IP submodule>/  For each generated IP submodule directory, Qsys generates /synth and /sim subdirectories. 
Generating IP Cores (Legacy Editors)
 In the IP Catalog (Tools > IP Catalog), locate and double-click the name of the IP core to customize. The parameter editor appears.
 Specify a top-level name and output HDL file type for your IP variation. This name identifies the IP core variation files in your project. Click OK. Do not include spaces in IP variation names or paths.
 Specify the parameters and options for your IP variation in the parameter editor. Refer to your IP core user guide for information about specific IP core parameters.
 Click Finish or Generate (depending on the parameter editor version). The parameter editor generates the files for your IP variation according to your specifications. Click Exit if prompted when generation is complete. The parameter editor adds the top-level .qip file to the current project automatically.
Note: For devices released prior to Arria 10 devices, the generated .qip and .sip files must be added to your project to represent IP and Qsys systems. To manually add an IP variation generated with legacy parameter editor to a project, click Project > Add/Remove Files in Project and add the IP variation .qip file.
Note: Some IP cores generate different HDL implementations according to the IP core parameters. The underlying RTL of these IP cores contains a unique code that prevents module name collisions between different variations of the IP core. This unique code remains consistent, given the same IP settings and software version during IP generation. This unique code can change if you edit the IP core's parameters or upgrade the IP core version. To avoid dependency on these unique codes in your simulation environment, refer to Generating a Combined Simulator Setup Script.
Files Generated for Altera IP Cores (Legacy Parameter Editors)
Simulating Altera IP Cores
The Quartus^{®} Prime software provides integration with your simulator and supports multiple simulation flows, including your own scripted and custom simulation flows. Whichever flow you choose, IP core simulation involves the following steps:
 Generate simulation model, testbench (or example design), and simulator setup script files.
 Set up your simulator environment and any simulation script(s).
 Compile simulation model libraries.
 Run your simulator.
The Quartus^{®} Prime software integrates with your preferred simulation environment. This section describes how to set up and run typical scripted and NativeLink simulation flows. The Quartus^{®} Prime Pro Edition software does not support NativeLink simulation.
Simulating the Fixed-Transform FFT IP Core in the MATLAB Software
The model takes a complex vector as input and outputs the transform-domain complex vector and the corresponding block exponent values. The length and direction of the transform (FFT/IFFT) are also passed as inputs to the model. If the input vector length is an integral multiple of the transform length N, the output vector(s) have the same length as the input vector. Otherwise, the model zero-pads the input vector up to the next integral multiple of N. The wizard also creates the MATLAB testbench file <variation name>_tb.m. This file creates the stimuli for the MATLAB model by reading the input complex random data from generated files. If you selected Floating point data representation, the IP core generates the input data in hexadecimal format.
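The zero-padding rule can be sketched in a few lines of Python. This is a minimal illustration, not the generated MATLAB model; the helper name pad_to_blocks is hypothetical.

```python
def pad_to_blocks(x, n):
    """Zero-pad x to a whole number of length-n blocks, then split it."""
    if n <= 0:
        raise ValueError("transform length must be positive")
    remainder = len(x) % n
    # Append just enough zeros to reach the next integral multiple of n.
    padded = list(x) + [0j] * ((n - remainder) % n)
    return [padded[i:i + n] for i in range(0, len(padded), n)]


# Ten samples with transform length 4 become three blocks of four samples.
blocks = pad_to_blocks([1 + 1j] * 10, 4)
```

An input whose length is already a multiple of N passes through unpadded.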
 Run the MATLAB software.

 Simulate the design:
 Type help <variation name>_model at the command prompt to view the input and output vectors that are required to run the MATLAB model as a standalone M-function. Create your input vector and make a function call to <variation name>_model. For example:
N = 2048;
INVERSE = 0; % 0 => FFT, 1 => IFFT
x = (2^12)*rand(1,N) + j*(2^12)*rand(1,N);
[y,e] = <variation name>_model(x,N,INVERSE);
 Alternatively, run the provided testbench by typing the name of the testbench, <variation name>_tb, at the command prompt.

Simulating the Variable Streaming FFT IP Core in the MATLAB Software
The model takes a complex vector as input and outputs the transform-domain complex vector. The lengths and direction of the transforms (FFT/IFFT) (specified as one entry per block) are also passed as an input to the model. You must ensure that the length of the input vector is at least as large as the sum of the transform sizes for the model to function correctly. The wizard also creates the MATLAB testbench file <variation name>_tb.m. This file creates the stimuli for the MATLAB model by reading the input complex random data from the generated files.
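The length requirement can be checked before calling the model. The Python sketch below is an illustration only; split_by_lengths is a hypothetical helper, and how the generated model treats samples beyond the sum of the transform sizes is not stated here.

```python
def split_by_lengths(x, nps):
    """Split x into consecutive blocks, one per transform size in nps."""
    total = sum(nps)
    if len(x) < total:
        raise ValueError(
            f"input has {len(x)} samples, but the transforms need {total}"
        )
    blocks, start = [], 0
    for n in nps:
        blocks.append(x[start:start + n])
        start += n
    return blocks


# Two back-to-back transforms of sizes 2 and 8 consume 10 samples.
blocks = split_by_lengths(list(range(10)), [2, 8])
```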
 Run the MATLAB software.
 In the MATLAB command window, change to the working directory for your project.

 Simulate the design:
 Type help <variation name>_model at the command prompt to view the input and output vectors that are required to run the MATLAB model as a standalone M-function. Create your input vector and make a function call to <variation name>_model. For example:
nps = [256,2048];
inverse = [0,1]; % 0 => FFT, 1 => IFFT
x = (2^12)*rand(1,sum(nps)) + j*(2^12)*rand(1,sum(nps));
[y] = <variation name>_model(x,nps,inverse);
 Alternatively, run the provided testbench by typing the name of the testbench, <variation name>_tb, at the command prompt.
Note: If you select bit-reversed output order, you can reorder the data with the following MATLAB code:
y = y(bit_reverse(0:(FFTSIZE-1), log2(FFTSIZE)) + 1);
where bit_reverse is:
function y = bit_reverse(x, n_bits)
y = bin2dec(fliplr(dec2bin(x, n_bits)));
Note: If you select digit-reversed output order, you can reorder the data with the following MATLAB code:
y = y(digit_reverse(0:(FFTSIZE-1), log2(FFTSIZE)) + 1);
where digit_reverse is:
function y = digit_reverse(x, n_bits)
if mod(n_bits,2)
    z = dec2bin(x, n_bits);
    for i=1:2:n_bits-1
        p(:,i) = z(:,n_bits-i);
        p(:,i+1) = z(:,n_bits-i+1);
    end
    p(:,n_bits) = z(:,1);
    y=bin2dec(p);
else
    y=digitrevorder(x,4);
end
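For readers working outside MATLAB, the same two index permutations can be sketched in Python. This is an illustrative port that uses integer arithmetic instead of string conversion. The function names mirror the MATLAB helpers, and the odd-width branch follows the MATLAB loop (base-4 digits reversed, leftover most significant bit moved to the least significant position).

```python
def bit_reverse(index, n_bits):
    """Reverse the n_bits-wide binary representation of index."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def digit_reverse(index, n_bits):
    """Reverse the base-4 digits of index; an odd leftover bit goes last."""
    result = 0
    remaining = n_bits
    while remaining >= 2:
        result = (result << 2) | (index & 3)
        index >>= 2
        remaining -= 2
    if remaining == 1:
        result = (result << 1) | (index & 1)
    return result


def reorder(y, permute):
    """Return y reordered to natural order using the given index permutation."""
    n_bits = len(y).bit_length() - 1  # log2 of a power-of-two length
    return [y[permute(k, n_bits)] for k in range(len(y))]
```

For example, reorder(y, bit_reverse) plays the role of the MATLAB expression y(bit_reverse(0:(FFTSIZE-1), log2(FFTSIZE)) + 1).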

DSP Builder Design Flow
This IP core supports DSP Builder. Use the DSP Builder flow if you want to create a DSP Builder model that includes an IP core variation; use IP Catalog if you want to create an IP core variation that you can instantiate manually in your design.
FFT IP Core Functional Description
Fixed Transform FFTs
To maintain a high signal-to-noise ratio throughout the transform computation, the fixed transform FFTs use a block-floating-point architecture, which is a trade-off point between fixed-point and full-floating-point architectures.
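The idea behind block floating point is that a whole block of fixed-point values shares one exponent. The Python sketch below illustrates that idea under stated assumptions: it uses truncating shifts, the helper name to_block_floating_point is hypothetical, and it defines value ≈ mantissa × 2^exponent. The rounding and the sign convention of the exponent in the IP core are not described in this section and may differ.

```python
def to_block_floating_point(block, mantissa_bits):
    """Scale a non-empty block of integers by one shared power of two.

    Returns (mantissas, exponent) such that every mantissa fits in
    mantissa_bits signed bits and value is approximately
    mantissa * 2**exponent.
    """
    limit = 1 << (mantissa_bits - 1)  # mantissas must lie in [-limit, limit - 1]
    exponent = 0
    while (max(v >> exponent for v in block) > limit - 1
           or min(v >> exponent for v in block) < -limit):
        exponent += 1
    # Arithmetic right shift truncates toward negative infinity (assumption).
    return [v >> exponent for v in block], exponent


mantissas, exponent = to_block_floating_point([1000, -3, 250, -700], 8)
```

The largest sample sets the exponent for the whole block, so small samples in the same block lose low-order bits. A full floating-point format avoids that loss by giving each value its own exponent, at the cost of more hardware.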
Variable Streaming FFTs
If you select the fixed-point data representation, the FFT variation uses a radix-2^{2} single delay feedback, which is fully pipelined. If you select the floating-point representation, the FFT variation uses a mixed radix-4/2. For a length N transform, log_{4}(N) stages are concatenated together. The radix-2^{2} algorithm has the same multiplicative complexity as a fully pipelined radix-4 FFT, but the butterfly unit retains a radix-2 FFT. The radix-4/2 algorithm combines radix-4 and radix-2 FFTs to achieve the computational advantage of the radix-4 algorithm while supporting FFT computation with a wider range of transform lengths. The butterfly units use the decimation-in-frequency (DIF) decomposition.
Fixed-point representation allows for natural word growth through the pipeline. The maximum growth of each stage is 2 bits. After the complex multiplication, the data is rounded down to the expanded data size using convergent rounding. The overall bit growth is less than or equal to log_{2}(N)+1.
The floating-point internal data representation is single-precision floating-point (32-bit, IEEE 754 representation). Floating-point operations provide more precise computation results but are costly in hardware resources. To reduce the amount of logic required for floating-point operations, the variable streaming FFT uses fused floating-point kernels. The reduction in logic occurs by fusing together several floating-point operations and reducing the number of normalizations that need to occur.
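As a worked example of the fixed-point word-growth bound, the Python sketch below computes the largest output width that the log_{2}(N)+1 bound allows. The helper name is hypothetical. The result is an upper bound derived from the text, not a statement of the exact port width the parameter editor generates.

```python
def max_output_width(input_width, n):
    """Upper bound on the output data width for a length-n fixed-point FFT."""
    if n < 2 or n & (n - 1):
        raise ValueError("transform length must be a power of two")
    log2_n = n.bit_length() - 1
    return input_width + log2_n + 1


# A 1,024-point transform on 16-bit data grows by at most 10 + 1 = 11 bits.
width = max_output_width(16, 1024)
```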
Fixed-Point Variable Streaming FFTs
The pipeline consists of log_{2}(N) stages, each containing a single butterfly unit and a feedback delay unit that delays the incoming data by a specified number of cycles, halved at every stage. These delays effectively align the correct samples at the input of the butterfly unit for the butterfly calculations. Every second stage contains a modified radix-2 butterfly whereby a trivial multiplication by -j is performed before the radix-2 butterfly operations. The output of the pipeline is in bit-reversed order.
The following scheduled operations occur in the pipeline for an FFT of length N = 16.
 For the first 8 clock cycles, the samples are fed unmodified through the butterfly unit to the delay feedback unit.
 The next 8 clock cycles perform the butterfly calculation using the data from the delay feedback unit and the incoming data. The higher order calculations are sent through to the delay feedback unit while the lower order calculations are sent to the next stage.
 The next 8 clock cycles feed the higher order calculations stored in the delay feedback unit unmodified through the butterfly unit to the next stage.
Subsequent data stages use the same principles. However, the delays in the feedback path are adjusted accordingly.
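The schedule above can be related to a software model. The Python sketch below is not the single delay feedback pipeline itself. It is a plain in-place radix-2 DIF FFT, used here because it shares two properties with the description: the feedback delay halves at each stage (8, 4, 2, 1 for N = 16), and the result comes out in bit-reversed order. All function names are illustrative.

```python
import cmath


def feedback_delays(n):
    """Feedback delay length of each stage: N/2, N/4, ..., 1."""
    stages = n.bit_length() - 1  # log2(n) for a power-of-two n
    return [n >> (s + 1) for s in range(stages)]


def bit_reverse(index, n_bits):
    """Reverse the n_bits-wide binary representation of index."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def dif_fft(x):
    """Radix-2 DIF FFT; returns the spectrum in bit-reversed order."""
    data = list(x)
    n = len(data)
    span = n // 2  # distance between butterfly inputs, halved per stage
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                a = data[start + k]
                b = data[start + k + span]
                twiddle = cmath.exp(-2j * cmath.pi * k / (2 * span))
                data[start + k] = a + b                     # sum branch
                data[start + k + span] = (a - b) * twiddle  # difference branch
        span //= 2
    return data


def naive_dft(x):
    """Direct O(N^2) DFT in natural order, used as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

Reading dif_fft(x) at index bit_reverse(k, log2(N)) gives bin k of the natural-order spectrum.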
Floating-Point Variable Streaming FFTs
The FFT has ceiling(log_{4}(N)) stages. If the transform length is an integral power of four, a radix-4 FFT implements all of the log_{4}(N) stages. If the transform length is not an integral power of four, the FFT implements ceiling(log_{4}(N)) - 1 of the stages in a radix-4, and implements the remaining stage using a radix-2.
Each stage contains a single butterfly unit and a feedback delay unit. The feedback delay unit delays the incoming data by a specified number of cycles; in each stage the number of cycles of delay is one quarter of the number of cycles of delay in the previous stage. The delays align the butterfly input samples correctly for the butterfly calculations. The output of the pipeline is in index-reversed order.
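The stage count rule can be checked with a short Python sketch. It assumes a power-of-two transform length, and stage_plan is a hypothetical helper name.

```python
def stage_plan(n):
    """Return (radix4_stages, radix2_stages) for a power-of-two length n."""
    if n < 2 or n & (n - 1):
        raise ValueError("transform length must be a power of two")
    bits = n.bit_length() - 1   # log2(n)
    total = (bits + 1) // 2     # ceiling(log4(n))
    if bits % 2 == 0:
        return total, 0         # integral power of four: all radix-4
    return total - 1, 1         # otherwise one stage is radix-2


# 1,024 = 4^5 needs five radix-4 stages; 2,048 = 4^5 * 2 adds one radix-2 stage.
plan_1024 = stage_plan(1024)
plan_2048 = stage_plan(2048)
```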
Input and Output Orders
You can select input and output orders generated by the FFT.
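Input Order  Output Order  Mode  Comments 

Natural  Bit-reversed  Engine-only  Requires minimum memory and minimum latency. 
Bit-reversed  Natural  Engine-only  Requires minimum memory and minimum latency. 
DC-centered  Bit-reversed  Engine-only  Requires minimum memory and minimum latency. 
Natural  Natural  Engine with bit-reversal  At the output, requires an extra N complex memory words and an additional N clock cycles latency, where N is the size of the transform. 
Bit-reversed  Bit-reversed  Engine with bit-reversal  At the output, requires an extra N complex memory words and an additional N clock cycles latency, where N is the size of the transform. 
DC-centered  Natural  Engine with bit-reversal  At the output, requires an extra N complex memory words and an additional N clock cycles latency, where N is the size of the transform. 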
Some applications for the FFT require an FFT > user operation > IFFT chain. In this case, choosing the input order and output order carefully can lead to significant memory and latency savings. For example, consider where the input to the first FFT is in natural order and the output is in bit-reversed order (FFT is operating in engine-only mode). In this example, if the IFFT operation is configured to accept bit-reversed inputs and produce natural order outputs (IFFT is operating in engine-only mode), only the minimum amount of memory is required, which provides a saving of N complex memory words, and a latency saving of N clock cycles, where N is the size of the current transform.
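The order combinations and their cost can be captured in a small lookup. The Python sketch below only restates the input and output order table in code. The names and the returned tuple layout are illustrative, and the costs are the extra N words and N cycles quoted in the table, not measured figures.

```python
ENGINE_ONLY = {
    ("natural", "bit-reversed"),
    ("bit-reversed", "natural"),
    ("dc-centered", "bit-reversed"),
}
ENGINE_WITH_BIT_REVERSAL = {
    ("natural", "natural"),
    ("bit-reversed", "bit-reversed"),
    ("dc-centered", "natural"),
}


def order_cost(input_order, output_order, n):
    """Return (mode, extra complex memory words, extra latency cycles)."""
    pair = (input_order, output_order)
    if pair in ENGINE_ONLY:
        return "engine-only", 0, 0
    if pair in ENGINE_WITH_BIT_REVERSAL:
        return "engine with bit-reversal", n, n
    raise ValueError(f"unsupported order combination: {pair}")


# FFT > user operation > IFFT chain kept entirely in engine-only mode.
fft_cost = order_cost("natural", "bit-reversed", 1024)
ifft_cost = order_cost("bit-reversed", "natural", 1024)
```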
FFT Processor Engines
QuadOutput FFT Engine
The FFT reads complex data samples x[k,m] from internal memory in parallel and reorders by switch (SW). Next, the radix4 butterfly processor processes the ordered samples to form the complex outputs G[k,m]. Because of the inherent mathematics of the radix4 DIF decomposition, only three complex multipliers perform the three nontrivial twiddlefactor multiplications on the outputs of the butterfly processor. To discern the maximum dynamic range of the samples, the blockfloating point units (BFPU) evaluate the four outputs in parallel. The FFT discards the appropriate LSBs and rounds and reorders the complex values before writing them back to internal memory.
SingleOutput FFT Engine
I/O Data Flow
Streaming FFT
The streaming FFT generates a design with a quad output FFT engine and the minimum number of parallel FFT engines for the required throughput.
A single FFT engine provides enough performance for up to a 1,024point streaming I/O data flow FFT.
Using the Streaming FFT
 Deassert the system reset. The data source asserts sink_valid to indicate to the FFT function that valid data is available for input.
 Assert both sink_valid and sink_ready for a successful data transfer.
When the final sample loads, the source asserts sink_eop and sink_valid for the last data transfer.
Changing the Direction on a BlockbyBlock Basis
When the FFT completes the transform of the input block, it asserts source_valid and outputs the complex transform domain data block in natural order. The FFT function asserts source_sop to indicate the first output sample.
After N data transfers, the FFT asserts source_eop to indicate the end of the output data block.
Enabling the Streaming FFT
 You must assert the sink_valid signal for the FFT to assert source_valid (and a valid data output).
 To extract the final frames of data from the FFT, you need to provide several frames where the sink_valid signal is asserted and apply the sink_sop and sink_eop signals in accordance with the AvalonST specification.
Variable Streaming
Changing Block Size
To change the size of the FFT on a block-by-block basis, change the value of fftpts simultaneously with the application of the sink_sop pulse (concurrent with the first input data sample of the block). fftpts uses a binary representation of the size of the transform. For a block with a maximum transform size of 1,024, Table 3–2 shows the value of the fftpts signal and the equivalent transform size.
fftpts  Transform Size 

10000000000  1,024 
01000000000  512 
00100000000  256 
00010000000  128 
00001000000  64 
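The encoding in the table is the plain binary value of the transform size, zero-padded to the width needed for the maximum size. A minimal Python sketch follows; fftpts_code is a hypothetical helper name, and the 11-bit width for a maximum size of 1,024 is taken from the table rows above rather than from a stated port width.

```python
def fftpts_code(size, max_size=1024):
    """Return the fftpts bit string for a transform of `size` points.

    Hypothetical helper. Assumes both sizes are powers of two and that
    the signal is wide enough to hold max_size itself (11 bits for 1,024,
    matching the table).
    """
    for value in (size, max_size):
        if value < 1 or value & (value - 1):
            raise ValueError("sizes must be powers of two")
    if size > max_size:
        raise ValueError("size exceeds the maximum transform size")
    width = max_size.bit_length()
    return format(size, "0{}b".format(width))


for size in (1024, 512, 256, 128, 64):
    print(fftpts_code(size), size)
```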
Changing Direction
When the FFT completes the transform of the input block, it asserts source_valid and outputs the complex transform domain data block. The FFT function asserts the source_sop to indicate the first output sample. The order of the output data depends on the output order that you select in IP Toolbench. The output of the FFT may be in natural order or bitreversed order. Figure 3–6 shows the output flow control when the output order is bitreversed. If the output order is natural order, data flow control remains the same, but the order of samples at the output is in sequential order 1..N.
I/O Order
Order  Description 

Natural order  The FFT requires the order of the input samples to be sequential (1, 2 …, n – 1, n) where n is the size of the current transform. 
Bit reverse order  The FFT requires the input samples to be in bitreversed order. 
Digit Reverse Order  The FFT requires the input samples to be in digitreversed order. 
–N/2 to N/2  The FFT requires the input samples to be in the order –N/2 to (N/2) – 1 (also known as DCcentered order) 
Similarly, the output order specifies the order in which the FFT generates the output. Whether you can select Bit Reverse Order or Digit Reverse Order depends on your Data Representation (Fixed Point or Floating Point). If you select Fixed Point, the FFT variation implements the radix-2^{2} algorithm and the reverse I/O order option is Bit Reverse Order. If you select Floating Point, the FFT variation implements the mixed radix-4/2 algorithm and the reverse I/O order option is Digit Reverse Order.
For sample digit-reversed order, if n is a power of four, the order is radix-4 digit-reversed order, in which two-bit digits in the sample number are units in the reverse ordering. For example, if n = 16, sample number 4 becomes the second sample in the sample stream (by reversal of the digits in 0001, the location in the sample stream, to 0100). However, in the mixed radix-4/2 algorithm, n need not be a power of four. If n is not a power of four, the two-bit digits are grouped from the least significant bit, and the most significant bit becomes the least significant bit in the digit-reversed order. For example, if n = 32, the sample number 18 (10010) in the natural ordering becomes sample number 17 (10001) in the digit-reversed order.
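The digit-reversal rule can be checked with a short Python model. This is a sketch under the grouping rule stated above (two-bit digits taken from the least significant bit, with a leftover most significant bit treated as a one-bit digit); digit_reverse is a hypothetical name, not a function shipped with the IP core.

```python
def digit_reverse(index, n):
    """Map a sample index to its digit-reversed index for length n.

    Hypothetical helper: n must be a power of two. Two-bit digits are
    peeled off from the LSB and appended to the result, so the lowest
    digit ends up in the highest position. If log2(n) is odd, the
    remaining MSB becomes the LSB of the result.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    remaining = n.bit_length() - 1     # number of index bits
    result = 0
    while remaining >= 2:
        result = (result << 2) | (index & 0b11)
        index >>= 2
        remaining -= 2
    if remaining == 1:                 # leftover single-bit digit (the MSB)
        result = (result << 1) | (index & 1)
    return result


print(digit_reverse(1, 16))    # stream location 1 for n = 16
print(digit_reverse(18, 32))   # sample number 18 for n = 32
```

Both examples from the text (n = 16 and n = 32) can be reproduced with this model, and applying it to every index of a block yields a permutation of the sample numbers.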
Enabling the Variable Streaming FFT
 Assert sink_valid.
 Transfer valid data to the FFT. The FFT processes data.
FFT Behavior When sink_valid is Deasserted
Dynamically Changing the FFT Size
I/O Order
If the FFT operates in engine-only mode, the output data is available approximately N + latency clock cycles after the first sample was input to the FFT. Latency represents a small latency through the FFT core and depends on the transform size. For engine with bit-reversal mode, the output is available after approximately 2N + latency clock cycles.
Buffered Burst
Enabling the Buffered Burst FFT
When the FFT completes the transform of the input block, it asserts source_valid and outputs the complex transform domain data block in natural order.
Signals source_sop and source_eop indicate the startofpacket and endofpacket for the output block data respectively.
 Deassert the system reset.
 Assert sink_valid to indicate to the FFT function that valid data is available for input. A successful data transfer occurs when both sink_valid and sink_ready are asserted.
 Load the first complex data sample into the FFT function and simultaneously assert sink_sop to indicate the start of the input block.
 On the next clock cycle, deassert sink_sop and load the following N – 1 complex input data samples in natural order.
 On the last complex data sample, assert sink_eop.
 When you load the input block, the FFT function begins computing the transform on the stored input block. The FFT holds the sink_ready signal high so that you can transfer the first few samples of the subsequent frame into the small FIFO buffer at the input. If this FIFO buffer is filled, the FFT deasserts the sink_ready signal. It is not mandatory to transfer samples during sink_ready cycles.
FFT Buffered Burst Data Flow Simulation Waveform
Burst
In a burst I/O data flow FFT, the FFT can process a single input block only. There is a small FIFO buffer at the sink of the block, and sink_ready is not deasserted until this FIFO buffer is full. You can therefore provide a small number of additional input samples associated with the subsequent input block. You do not have to provide data to the FFT during sink_ready cycles. The burst FFT can load the rest of the subsequent FFT frame only when the previous transform is fully unloaded.
FFT IP Core Parameters
Parameter  Value  Description 

Transform Length  64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, or 65536. Variable streaming also allows 8, 16, 32, 131072, and 262144.  The transform length. For variable streaming, this value is the maximum FFT length. 
Transform Direction  Forward, reverse, bidirectional  The transform direction. 
I/O Data Flow  Streaming Variable Streaming Buffered Burst Burst  If you select Variable Streaming and Floating Point, the precision is automatically set to 32, and the reverse I/O order options are Digit Reverse Order. 
I/O Order  Bit Reverse Order, Digit Reverse Order, Natural Order, –N/2 to N/2  The input and output order for data entering and leaving the FFT (variable streaming FFT only). The Digit Reverse Order option replaces the Bit Reverse Order in variable streaming floating-point variations. 
Data Representation  Fixed point or single floating point, or block floating point  The internal data representation type (variable streaming FFT only), either fixed point with natural bitgrowth or single precision floating point. Floatingpoint bidirectional IP cores expect input in natural order for forward transforms and digit reverse order for reverse transforms. The output order is digit reverse order for forward transforms and natural order for reverse transforms. 
Data Width  8, 10, 12, 14, 16, 18, 20, 24, 28, 32  The data precision. The values 28 and 32 are available for variable streaming only. 
Twiddle Width  8, 10, 12, 14, 16, 18, 20, 24, 28, 32  The twiddle precision. The values 28 and 32 are available for variable streaming only. Twiddle factor precision must be less than or equal to data precision. 
The FFT IP core's advanced parameters.
Parameter  Value  Description 

FFT Engine Architecture  Quad Output, Single Output  Choose between one, two, and four quad-output FFT engines working in parallel. Alternatively, if you have selected a single-output FFT engine architecture, you may choose to implement one or two engines in parallel. Multiple parallel engines reduce transform time at the expense of device resources, which allows you to select the desired area and throughput trade-off point. Not available for variable streaming or streaming FFTs. 
Number of Parallel FFT Engines  1, 2, 4  
DSP Block Resource Optimization  On or Off  Turn on for multiplier structure optimizations. These optimizations use different DSP block configurations to pack multiply operations and reduce DSP resource requirements. This optimization may reduce F_{MAX} because of the structure of the specific configurations of the DSP blocks when compared to the basic operation. Specifically, on Stratix V devices, this optimization may also come at the expense of accuracy. You can evaluate it using the MATLAB model provided and bit wise accurate simulation models. If you turn on DSP Block Resource Optimization and your variation has data precision between 18 and 25 bits, inclusive, and twiddle precision less than or equal to 18 bits, the FFT MegaCore function configures the DSP blocks in complex 18 x 25 multiplication mode. 
Enable Hard Floating Point Blocks  On or off  For Arria 10 devices and singlefloatingpoint FFTs only. 
FFT IP Core Interfaces and Signals
The FFT MegaCore function has a READY_LATENCY value of zero.
AvalonST Interfaces in DSP IP Cores
The input interface is an AvalonST sink and the output interface is an AvalonST source. The AvalonST interface supports packet transfers with packets interleaved across multiple channels.
AvalonST interface signals can describe traditional streaming interfaces supporting a single stream of data without knowledge of channels or packet boundaries. Such interfaces typically contain data, ready, and valid signals. AvalonST interfaces can also support more complex protocols for burst and packet transfers with packets interleaved across multiple channels. The AvalonST interface inherently synchronizes multichannel designs, which allows you to achieve efficient, timemultiplexed implementations without having to implement complex control logic.
AvalonST interfaces support backpressure, which is a flow control mechanism where a sink can signal to a source to stop sending data. The sink typically uses backpressure to stop the flow of data when its FIFO buffers are full or when it has congestion on its output.
FFT IP Core AvalonST Signals
Signal Name  Direction  AvalonST Type  Size  Description 

clk  Input  clk  1  Clock signal that clocks all internal FFT engine components. 
reset_n  Input  reset_n  1  Active-low asynchronous reset signal. This signal can be asserted asynchronously, but must remain asserted for at least one clk clock cycle and must be deasserted synchronously with clk. 
sink_eop  Input  endofpacket  1  Indicates the end of the incoming FFT frame. 
sink_error  Input  error  2  Indicates an error has occurred in an upstream module because of illegal usage of the Avalon-ST protocol. The following errors are defined: 01 = missing SOP, 10 = missing EOP, 11 = unexpected EOP. 
sink_imag  Input  data  data precision width  Imaginary input data, which represents a signed number of data precision bits. 
sink_ready  Output  ready  1  Asserted by the FFT engine when it can accept data. It is not mandatory to provide data to the FFT during ready cycles. 
sink_real  Input  data  data precision width  Real input data, which represents a signed number of data precision bits. 
sink_sop  Input  startofpacket  1  Indicates the start of the incoming FFT frame. 
sink_valid  Input  valid  1  Asserted when data on the data bus is valid. When sink_valid and sink_ready are asserted, a data transfer takes place. 
sink_data  Input  data  Variable 
In Qsys systems, this AvalonSTcompliant data bus includes all the AvalonST input data signals from MSB to LSB:

source_eop  Output  endofpacket  1  Marks the end of the outgoing FFT frame. Only valid when source_valid is asserted. 
source_error  Output  error  2  Indicates an error has occurred either in an upstream module or within the FFT module (logical OR of sink_error with errors generated in the FFT). 
source_exp  Output  data  6  Streaming, burst, and buffered burst FFTs only. Signed block exponent: Accounts for scaling of internal signal values during FFT computation. 
source_imag  Output  data  (data precision width + growth)  Imaginary output data. For burst, buffered burst, streaming, and variable streaming floating point FFTs, the output data width is equal to the input data width. For variable streaming fixed point FFTs, the size of the output data is dependent on the number of stages defined for the FFT and is 2 bits per radix 2^{2} stage. 
source_ready  Input  ready  1  Asserted by the downstream module if it is able to accept data. 
source_real  Output  data  (data precision width + growth)  Real output data. For burst, buffered burst, streaming, and variable streaming floating point FFTs, the output data width is equal to the input data width. For variable streaming fixed point FFTs, the size of the output data is dependent on the number of stages defined for the FFT and is 2 bits per radix 2^{2} stage. Variable streaming fixed point FFT only. Growth is log_{2}(N)+1. 
source_sop  Output  startofpacket  1  Marks the start of the outgoing FFT frame. Only valid when source_valid is asserted. 
source_valid  Output  valid  1  Asserted by the FFT when there is valid data to output. 
source_data  Output  data  Variable 
In Qsys systems, this AvalonSTcompliant data bus includes all the AvalonST output data signals from MSB to LSB:

Component Specific Signals
Signal Name  Direction  Size  Description 

fftpts_in  Input  log_{2}(maximum number of points)  The number of points in this FFT frame. If this value is not specified, the FFT can not be a variable length. The default behavior is for the FFT to have fixed length of maximum points. Only sampled at SOP. 
fftpts_out  Output  log_{2}(maximum number of points)  The number of points in this FFT frame synchronized to the AvalonST source interface. Variable streaming only. 
inverse  Input  1  Inverse FFT calculated if asserted. Only sampled at SOP. 
Incorrect usage of the Avalon-ST interface protocol on the sink interface results in an error on source_error. Table 3–8 defines the behavior of the FFT when an incorrect Avalon-ST transfer is detected. If an error occurs, the behavior of the FFT is undefined and you must reset the FFT with reset_n.
Error  source_error  Description 

Missing SOP  01  Asserted when valid goes high, but there is no start of frame. 
Missing EOP  10  Asserted if the FFT accepts N valid samples of an FFT frame, but there is no EOP signal. 
Unexpected EOP  11  Asserted if EOP is asserted before N valid samples are accepted. 
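The three error conditions in the table can be illustrated with a small behavioral checker in Python. This is a sketch for testbench-style reasoning, not a description of the FFT's internal logic; the names check_sink_frames and make_frame are hypothetical, and the value 00 for "no error" is an assumption, since the table lists only the three error codes.

```python
NO_ERROR = 0b00        # assumption: not listed in the table above
MISSING_SOP = 0b01
MISSING_EOP = 0b10
UNEXPECTED_EOP = 0b11


def check_sink_frames(beats, n):
    """Classify a stream of accepted sink transfers.

    beats: iterable of (sop, eop) flags, one entry per clock cycle in
           which both sink_valid and sink_ready are asserted.
    n:     transform length (samples per frame).
    Returns the first error code found, or NO_ERROR.
    """
    count = 0                          # samples accepted in the current frame
    for sop, eop in beats:
        if count == 0 and not sop:
            return MISSING_SOP         # valid data, but no start of frame
        count += 1
        if eop and count < n:
            return UNEXPECTED_EOP      # EOP before N samples were accepted
        if count == n:
            if not eop:
                return MISSING_EOP     # N samples accepted, but no EOP
            count = 0                  # frame complete; expect SOP next
    return NO_ERROR


def make_frame(n):
    """Build a well-formed frame: SOP on the first beat, EOP on the last."""
    return [(i == 0, i == n - 1) for i in range(n)]


print(check_sink_frames(make_frame(4) * 2, 4))
```

Because the guide states that FFT behavior is undefined after an error, the checker stops at the first violation instead of attempting to resynchronize.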
Block Floating Point Scaling
In fixedpoint FFTs, the data precision needs to be large enough to adequately represent all intermediate values throughout the transform computation. For large FFT transform sizes, an FFT fixedpoint implementation that allows for word growth can make either the data width excessive or can lead to a loss of precision.
Floating-point FFTs represent each number as a mantissa with an individual exponent. The improved precision is offset by demand for increased device resources.
In a blockfloating point FFT, all of the values have an independent mantissa but share a common exponent in each data block. Data is input to the FFT function as fixed point complex numbers (even though the exponent is effectively 0, you do not enter an exponent).
The block-floating-point FFT ensures full use of the data width within the FFT function and throughout the transform. After every pass through a radix-4 FFT, the data width may grow by up to log_{2}(4√2) = 2.5 bits. The data scales according to a measure of the block dynamic range on the output of the previous pass. The FFT accumulates the number of shifts and then outputs them as an exponent for the entire block. This shifting ensures that the minimum number of least significant bits (LSBs) are discarded prior to the rounding of the post-multiplication output. In effect, the block-floating-point representation acts as a digital automatic gain control. To yield uniform scaling across successive output blocks, you must scale the FFT function output by the final exponent.
In comparing the block-floating-point output of the Altera FFT MegaCore function to the output of a full-precision FFT from a tool like MATLAB, you must scale the output by 2^{–exponent_out} to account for the discarded LSBs during the transform.
Unlike an FFT block that uses floating point arithmetic, a blockfloatingpoint FFT block does not provide an input for exponents. Internally, a complex value integer pair is represented with a single scale factor that is typically shared among other complex value integer pairs. After each stage of the FFT, the largest output value is detected and the intermediate result is scaled to improve the precision. The exponent records the number of left or right shifts used to perform the scaling. As a result, the output magnitude relative to the input level is:
output × 2^{–exponent}
For example, if exponent = –3, the input samples are shifted right by three bits, and hence the magnitude of the output is output × 2^{3}.
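When comparing block-floating-point results against a full-precision reference, the same relationship can be applied in software. The Python sketch below is illustrative only; bfp_to_full_precision is a hypothetical name, and the samples are assumed to be already converted to Python numbers (real or complex).

```python
def bfp_to_full_precision(samples, exponent):
    """Scale block-floating-point outputs by 2**(-exponent).

    Hypothetical helper: `samples` holds the FFT output values of one
    block and `exponent` is the signed block exponent reported with
    that block.
    """
    scale = 2.0 ** (-exponent)
    return [sample * scale for sample in samples]


# With exponent = -3, every output value is multiplied by 2**3.
print(bfp_to_full_precision([1, -2, 3 + 4j], -3))
```

Each output block carries its own exponent, so the scale factor has to be recomputed per block rather than once per run.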
After every pass through a radix2 or radix4 engine in the FFT core, the addition and multiplication operations cause the data bits width to grow. In other words, the total data bits width from the FFT operation grows proportionally to the number of passes. The number of passes of the FFT/IFFT computation depends on the logarithm of the number of points.
A fixedpoint FFT needs a huge multiplier and memory block to accommodate the large bit width growth to represent the high dynamic range. Though floatingpoint is powerful in arithmetic operations, its power comes at the cost of higher design complexity such as a floatingpoint multiplier and a floatingpoint adder. BFP arithmetic combines the advantages of floatingpoint and fixedpoint arithmetic. BFP arithmetic offers a better signaltonoise ratio (SNR) and dynamic range than does floatingpoint and fixedpoint arithmetic with the same number of bits in the hardware implementation.
In a blockfloatingpoint FFT, the radix2 or radix4 computation of each pass shares the same hardware, with the data being read from memory, passed through the core engine, and written back to memory. Before entering the next pass, each data sample is shifted right (an operation called "scaling") if there is a carryout bit from the addition and multiplication operations. The number of bits shifted is based on the difference in bit growth between the data sample and the maximum data sample detected in the previous stage. The maximum bit growth is recorded in the exponent register. Each data sample now shares the same exponent value and data bit width to go to the next core engine. The same core engine can be reused without incurring the expense of a larger engine to accommodate the bit growth.
The output SNR depends on how many bits of right shift occur and at what stages of the radix core computation they occur. In other words, the signaltonoise ratio is data dependent and you need to know the input signal to compute the SNR.
Possible Exponent Values
P = ceil{log_{4}N}, where N is the transform length
R = 0 if log_{2}N is even, otherwise R = 1
Single output range = (–3P+R, P+R–4)
Quad output range = (–3P+R+1, P+R–7)
These equations translate to the values in Table A–1.
N  P  Single Output Engine Max (2)  Single Output Engine Min (2)  Quad Output Engine Max (2)  Quad Output Engine Min (2)

64  3  –9  –1  –8  –4 
128  4  –11  1  –10  –2 
256  4  –12  0  –11  –3 
512  5  –14  2  –13  –1 
1,024  5  –15  1  –14  –2 
2,048  6  –17  3  –16  0 
4,096  6  –18  2  –17  –1 
8,192  7  –20  4  –19  1 
16,384  7  –21  3  –20  0 
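The rows of the table follow directly from the range equations. The Python sketch below recomputes them; exponent_range is a hypothetical helper name, the transform length is assumed to be a power of two, and the returned pair is ordered as in the equations, which is the order of the table's Max and Min columns.

```python
def exponent_range(n, engine="quad"):
    """Return the exponent range pair for transform length n.

    Hypothetical helper implementing:
      P = ceil(log4(N)); R = 0 if log2(N) is even, otherwise 1
      single output range = (-3P + R,     P + R - 4)
      quad output range   = (-3P + R + 1, P + R - 7)
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    log2_n = n.bit_length() - 1
    p = (log2_n + 1) // 2              # ceil(log4(n)) with integers
    r = log2_n % 2
    if engine == "single":
        return (-3 * p + r, p + r - 4)
    if engine == "quad":
        return (-3 * p + r + 1, p + r - 7)
    raise ValueError("engine must be 'single' or 'quad'")


for n in (64, 128, 256, 512, 1024):
    print(n, exponent_range(n, "single"), exponent_range(n, "quad"))
```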
Note to Table A–1:
Implementing Scaling
 Determine the length of the resulting full-scale dynamic range storage register. To get the length, add the width of the data to the number of times the data is shifted. For example, consider 16-bit data and a 256-point Quad Output FFT/IFFT with Max = –11 and Min = –3. The Max value indicates 11 shifts to the left, so the resulting full-scale data width is 16 + 11, or 27 bits.
 Map the output data to the appropriate location within the expanded dynamic range register based upon the exponent value. To continue the above example, the 16bit output data [15..0] from the FFT/IFFT is mapped to [26..11] for an exponent of –11, to [25..10] for an exponent of –10, to [24..9] for an exponent of –9, and so on.
 Sign extend the data within the full scale register.
Example of Scaling
case (exp)
  6'b110101 : // -11: set data equal to MSBs
  begin
    full_range_real_out[26:0] <= {real_in[15:0],11'b0};
    full_range_imag_out[26:0] <= {imag_in[15:0],11'b0};
  end
  6'b110110 : // -10: equals left shift by 10 with sign extension
  begin
    full_range_real_out[26] <= {real_in[15]};
    full_range_real_out[25:0] <= {real_in[15:0],10'b0};
    full_range_imag_out[26] <= {imag_in[15]};
    full_range_imag_out[25:0] <= {imag_in[15:0],10'b0};
  end
  6'b110111 : // -9: equals left shift by 9 with sign extension
  begin
    full_range_real_out[26:25] <= {real_in[15],real_in[15]};
    full_range_real_out[24:0] <= {real_in[15:0],9'b0};
    full_range_imag_out[26:25] <= {imag_in[15],imag_in[15]};
    full_range_imag_out[24:0] <= {imag_in[15:0],9'b0};
  end
  .
  .
  .
endcase
In this example, the output provides a full scale 27bit word. You must choose how many and which bits must be carried forward in the processing chain. The choice of bits determines the absolute gain relative to the input sample level.
Figure A–1 on page A–5 demonstrates the effect of scaling for all possible values for the 256point quad output FFT with an input signal level of 0x5000. The output of the FFT is 0x280 when the exponent = –5. The figure illustrates all cases of valid exponent values of scaling to the full scale storage register [26..0]. Because the exponent is –5, you must check the register values for that column. This data is shown in the last two columns in the figure. Note that the last column represents the gain compensated data after the scaling (0x0005000), which agrees with the input data as expected. If you want to keep 16 bits for subsequent processing, you can choose the bottom 16 bits that result in 0x5000. However, if you choose a different bit range, such as the top 16 bits, the result is 0x000A. Therefore, the choice of bits affects the relative gain through the processing chain.
Because this example has 27 bits of full scale resolution and 16 bits of output resolution, choose the bottom 16 bits to maintain unity gain relative to the input signal. Choosing the LSBs is not the only solution or the correct one for all cases. The choice depends on which signal levels are important. One way to empirically select the proper range is by simulating test cases that implement expected system data. The output of the simulations must tell what range of bits to use as the output register. If the full scale data is not used (or just the MSBs), you must saturate the data to avoid wraparound problems.
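The mapping and the bit-selection trade-off can be reproduced with a short Python model of the 27-bit register. This is a sketch for checking numbers, not the recommended hardware implementation; to_full_scale is a hypothetical name, and the defaults (16-bit data, 11 maximum shifts) are taken from the 256-point quad-output example.

```python
def to_full_scale(sample, exponent, data_width=16, max_shift=11):
    """Place a data_width-bit two's complement sample in the full-scale register.

    Hypothetical helper: shifts left by -exponent with sign extension and
    returns the register contents as an unsigned integer of
    data_width + max_shift bits.
    """
    shift = -exponent
    if not 0 <= shift <= max_shift:
        raise ValueError("exponent outside the modelled range")
    if sample >= 1 << (data_width - 1):        # interpret as signed
        sample -= 1 << data_width
    full_width = data_width + max_shift
    return (sample << shift) & ((1 << full_width) - 1)


full = to_full_scale(0x280, -5)    # FFT output 0x280 with exponent -5
bottom_16 = full & 0xFFFF          # keeps unity gain relative to the input
top_16 = full >> 11                # a different bit range changes the gain
print(hex(full), hex(bottom_16), hex(top_16))
```

The model covers only the shift and sign extension; saturation for the case where a narrower bit range is carried forward is left out and would need to be added separately.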
Unity Gain in an IFFT+FFT Pair
BFP arithmetic does not provide an input for the exponent, so you must keep track of the exponent from the IFFT block if you are feeding the output to the FFT block immediately thereafter and divide by N at the end to acquire the original signal magnitude.
where:
x0 = Input data to IFFT
X0 = Output data from IFFT
N = number of points
data1 = IFFT output data and FFT input data
data2 = FFT output data
exp1 = IFFT output exponent
exp2 = FFT output exponent
IFFTa = IFFT
FFTa = FFT
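The relationship between these quantities can be written out as follows. This is a reconstruction from the definitions in this list and from the scaling rule output × 2^{–exponent}, not the guide's original equations; it assumes that IFFTa and FFTa denote the transforms as the core computes them, with block-floating-point scaling and without the 1/N normalization.

```latex
\begin{aligned}
X_0 &= \mathrm{IFFT}(x_0)
     = \frac{1}{N}\,\mathrm{IFFT}_a(x_0)
     = \frac{1}{N}\,\mathit{data1}\cdot 2^{-\mathit{exp1}} \\
x_0 &= \mathrm{FFT}(X_0)
     = \frac{1}{N}\,2^{-\mathit{exp1}}\,\mathrm{FFT}_a(\mathit{data1})
     = \frac{1}{N}\,2^{-\mathit{exp1}}\,\mathit{data2}\cdot 2^{-\mathit{exp2}}
     = \frac{1}{N}\,2^{-(\mathit{exp1}+\mathit{exp2})}\,\mathit{data2}
\end{aligned}
```

Under these assumptions, unity gain requires carrying both exponents to the end of the chain and applying the division by N once.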
Any scaling operation on X0 followed by truncation loses the value of exp1 and does not result in unity gain at x0. Any scaling operation must be done on X0 only when it is the final result. If the intermediate result X0 is first padded with exp1 number of zeros and then truncated or if the data bits of X0 are truncated, the scaling information is lost.
One way to keep unity gain is by passing the exp1 value to the output of the FFT block. The other way is to preserve the full precision of data1 × 2^{–exp1} and use this value as input to the FFT block. The disadvantage of the second method is a large size requirement for the FFT to accept the input with growing bit width from IFFT operations. The resolution required to accommodate this bit width will, in most cases, exceed the maximum data width supported by the core.
For more information, refer to the Achieving Unity Gain in Block Floating Point IFFT+FFT Pair design example under DSP Design Examples at www.altera.com.
Document Revision History
Date  Version  Changes Made 

2016.05.01  16.0  Added MATLAB simulation flow. 
2015.10.01  15.1  Added more info to sink_data and source_data signals 
2014.12.15  14.1 

August 2014  14.0 Arria 10 Edition 

June 2014  14.0 

November 2013  13.1 

November 2012  12.1  Added support for Arria V GZ devices. 
FFT IP Core User Guide Document Archive
IP Core Version  User Guide 

15.1  FFT IP Core User Guide 
15.0  FFT IP Core User Guide 
14.1  FFT IP Core User Guide 