See Vitis™ Development Environment on xilinx.com See Vitis™ AI Development Environment on xilinx.com |
Version: Vitis 2024.1
The Prime Factor Algorithm (PFA) [1] is a Fast Fourier Transform (FFT) algorithm [2] discovered by Good & Thomas before the more popular Cooley-Tukey algorithm with some interesting properties. The PFA is another "divide and conquer" approach for computing a Discrete Fourier Transform (DFT) of size
A second advantage of the PFA approach is that unlike the popular Cooley-Tukey FFT, no extra multiplications by "twiddle factors" need be performed between stages. This fact falls out of the DFT factorization when
This tutorial illustrates these concepts by designing a PFA-1008 transform in Versal using both AI Engine and PL elements working cooperatively. The PFA approach may be applied here since
The figure below shows a block diagram of a 2D PFA transform. It's corresponding Matlab model is shown immediately below. The algorithm consists of the following five steps:
- An input permutation is applied to the input data. The specific input permutation depends on the values of the relative prime factors
$N_1$ and$N_2$ as outlined below. - The data is organized into an
$N_2 \times N_1$ matrix and 1D FFT's are performed along the rows of that matrix. - A "matrix transpose" operation converts the data flow from "row-wise" to "column-wise" ordering for the subsequent column-based transforms. This amounts to delivering the data using a "strided" addressing pattern.
- A second set of 1D FFT's are performed along the columns of the original matrix.
- An output permutation is applied to the output data after being read column-wise out of its 2D matrix form. The specific output permutation depends on the values of
$N_1$ and$N_2$ as outlined below.
The Matlab code below shows all of these five steps clearly. The routine compute_perm_2d()
computes the input permutation P_i
and output permutation P_o
applied based on the values of
function [sig_o] = fft_pfa_2d( sig_i, N1, N2 )
[P_i,P_o,N] = compute_perm_2d(N1,N2);
% Input permutation
data = reshape(sig_i(1+P_i),N2,N1);
% Row transform
for rr = 1 : N2
data(rr,:) = fft(data(rr,:));
end
% Col transform
for cc = 1 : N1
data(:,cc) = fft(data(:,cc));
end
% Output permutation
data_o = reshape(data,1,[]);
sig_o(1+P_o) = data_o;
end
The figure below shows a block diagram of a 3D PFA transform. It's corresponding Matlab model is shown immediately below. The algorithm consists of the same steps similar to the 2D case above with the following differences:
- The I/O permutations now depend on three relatively prime factors
$N_1$ ,$N_2$ and$N_3$ . The mathematics specific to these 2D and 3D permutations is given in detail below. - The data is now organized in a
$N_1 \times N_2 \times N_3$ cube instead of an$N_1 \times N_2$ rectangle. Transforms are taken in 1D along each of these dimensions in order, first along$N_1$ , then along$N_2$ and finally along$N_3$ . - The "matrix transpose" operations required to extract data in the
$N_2$ and$N_3$ dimensions involve slightly more complicated "stride" patterns. These patterns are computed by the Matlab routinecompute_addr_3d.m
.
function [sig_o] = fft_pfa_3d( sig_i, N1, N2, N3 )
[P_i,P_o,N] = compute_perm_3d(N1,N2,N3);
% Input permutation:
data = reshape(sig_i(1+P_i),N1,N2,N3);
% Transforms
for cc = 1 : N2
for dd = 1 : N3
data(:,cc,dd) = fft(data(:,cc,dd));
end
end
for rr = 1 : N1
for dd = 1 : N3
data(rr,:,dd) = fft(data(rr,:,dd));
end
end
for rr = 1 : N1
for cc = 1 : N2
data(rr,cc,:) = fft(data(rr,cc,:));
end
end
% Output permutation:
data_o = reshape(data,1,[]);
sig_o(1+P_o) = data_o;
end
The full suite of Matlab models illustrating the operation of the PFA transforms is given in the matlab
folder of the repo.
The figure below illustrates how to compute the I/O permutations required for a 2D PFA solution. The input permutation relies on a simple modulo computation with the relatively prime factors
The output permutation mapping relies on a similar modulo computation but with factors
The I/O permutations for a 3D PFA solution are solved in a manner similar to the 2D case as shown in the figure below. Essentially, the modulo computations are extended to include a third dimension, and three modulo-inverses are required to solve for the output permutation. These permutations may be expressed in a 3D matrix form, or as a 1D address permutation
The figure below shows a block diagram of a 3D PFA-1008 hardware design implemented in Versal using AI Engines and PL. The design targets a 2 Gsps throughput (SSR=2). AI Engines implement the three DFT kernels, specifically DFT-7, DFT-9 and DFT-16, using a vector-matrix multiplication approach. The design implements the I/O permutation and matrix transpose kernels using Vitis HLS targeting in PL.
Some details on each kernel design is given in the sections below.
This PL kernel is implemented in HLS @ 312.5 MHz (SSR=8). Samples arriving in polyphase order on two 128-bit streams are written are written into a ping/pong buffer in 4X duplicate fashion. This is required since each buffer supports two R/W ports and we must read or write 8 samples per cycle. The input permutation
This AI Engine kernel implements the DFT-7 using a vector-matrix approach based on mac8() intrinsics to compute two [1x1] x [1x4] OPs in parallel in a single cycle. Two compute tiles and 1 combiner tile are required. The first compute tile is 100% efficient whereas the second compute tile is 75% efficient. Input samples for two separate transforms arrive over two streams, one transform on each stream, over a period of seven cycles. The output samples are produced alternately for each transform on the two output streams over the same period. The graph for this AI Engine kernel is shown below.
This PL kernel implements in PL the matrix transpose operation required to feed the proper 9-point input samples to the DFT-9 on the second dimension of the 3D cube. The design uses HLS @ 312.5 MHz (SSR=8). The input 7-pt transforms arrive over two streams as outlined above. Samples are written linearly into a ping-pong buffer and read back using a strided order (seven points apart) producing output 9-pt transforms alternately on each of two output streams. The design has a latency of 1008/8+2 cycles.
This AI Engine kernel implements the DFT-9 using a similar approach to the DFT-7 kernel above. This kernel requires 3 compute tiles and 1 combiner tiles and operates in a manner very similar to the DFT-7 kernel. I/O transforms are transmitted in alternating fashion over two streams as above. The first two compute kernels are 100% efficient, whereas the third tile is only 25% efficient. The graph for this AI Engine kernel is shown below.
This PL kernel implements in PL the matrix transpose operation required to feed the proper 16-point input samples to the DFT-16 on the third dimension of the 3D cube. The design uses HLS @ 312.5 MHz (SSR=8). The 9-pt transforms arrive on alternate streams. Samples are written into a ping/pong buffer. Samples are read back in a transposed (stride-63) order. The kernel produces 4 samples alternately on two output streams. The design has a latency of 1008/8+1 cycles.
This AI Engine kernel implements the DFT-16 using a similar approach to the DFT-7 and DFT-9 kernels above. Unlike those kernels, the I/O samples for the DFT-16 are transmitted four consecutive samples each on alternate streams (as compared to full transforms transmitted on alternate streams in the earlier kernels). The graph for this AI Engine kernel is shown below.
This PL kernel is implemented in HLS @ 312.5 MHz (SSR=8). Samples arriving in 4 samples alternately on two 128-bit streams are written are written into a ping/pong buffer in 4X duplicate fashion in a manner similar to the INPUT PERMUTE kernel. The output permutation
The figure below summarizes the AI Engine and PL resources required to implement the design in the VC1902 device on the VCK190 eval board. The design is using 12 AI Engine tiles for compute and 19 tiles for buffering. The PL portion requires ~100 BRAMs to provide the I/O permutes and matrix transpose operations. This design illustrates how Versal AI Engine and PL may be crafted to create a high performance tightly coupled custom data path tailored directly to the algorithm of interest.
IMPORTANT: Before beginning the tutorial ensure you have installed Vitis™ 2024.1 software. Ensure you have downloaded the Common Images for Embedded Vitis Platforms from this link.
Set the environment variable COMMON_IMAGE_VERSAL
to the full path where you have downloaded the Common Images. The remaining environment variables are configured in the top level Makefile <path-to-design>/05-Prime-Factor-FFT/Makefile
file. The tutorial will build its own custom platform
[shell]% cd <path-to-design>/05-Prime-Factor-FFT
[shell]% make all TARGET=hw_emu
This will take about 90 minutes to run. The build process will generate a folder 05-Prime-Factor-FFT/package
containing all the files required for hardware emulation. This can be run as shown below. An optional -g
can be applied to the launch_hw_emu.sh
command to launch the Vivado waveform GUI to observe the top-level AXI signal ports in the design.
[shell]% cd <path-to-design>/05-Prime-Factor-FFT/package
[shell]% ./launch_hw_emu.sh -run-app embedded_exec.sh
The design can be built for the VCK190 board using the Makefile as follows:
[shell]% cd <path-to-design>/05-Prime-Factor-FFT
[shell]% make all TARGET=hw
The build process will generate the SD card image in the <path-to-design>/05-Prime-Factor-FFT/package/sd_card
folder.
[1] Wikipedia, "Prime Factor FFT Algorithm"
[2] C. Sidney Burrus, "Fast Fourier Transforms"
GitHub issues will be used for tracking requests and bugs. For questions, go to support.xilinx.com.
Copyright © 2023-2024 Advanced Micro Devices, Inc