

# Two Parallel Pipelined FFT Architecture After Third Stage for Low Complexity and Latency

# G. Prasanna Kumar, Pushpa Kotipalli, B.T.Krishna



Abstract: Complexity is the major issue in the development of pipelined Fast Fourier Transform (FFT) structure. A novel parallel pipelined FFT design constructed for low complexity and Latency. The proposed structure efficiently uses pipeline and parallel pipeline in the design. The first three stages of the structure follow the pipelining and from fourth stage there are two parallel paths simultaneously compute four complex additions, the first three single path stages that reduces complexity, parallel pipeline stages at the end of structure also reduces the Latency. The proposed structure is an optimized solution to reduce the trade-off between Latency and complexity. The comparative performance of the proposed structure with several other existing structures shows that it has Low latency with small cost of hardware.

Index Terms: complexity, FFT, parallelism, pipelined

#### I. INTRODUCTION

Major transformations in signal processing, image processing and other communication applications are possible because of the advancements in FFT. Pipelined FFT are very important and efficient architectures in the implementation of FFT. Discrete Fourier transform is defined with exponential kernel function in (1)

$$F[k] = \sum_{n=0}^{N-1} f[n] w_N^{kn}$$
 (1)

$$\sum_{k}^{k} n = a^{\frac{-j2\pi k}{N}}$$

Where  $w_N^{kn} = e$ Cooley and Tukey [1] who behind the invention of FFT, the

efficient transformation to enumerate DFT in an operative way in 1965. Pipelined and parallel pipelined FFT structure proposed in[2] is the evolution in VLSI implementation of FFT. CORDIC iteration method proposed in[3] is organized method for implementation of FFT. Pipelined FFT architectures for speech processing or

applications are proposed in[4], FFT for EEG[5] is introduced, Split-Radix FFT for signal processing applications implemented in [6], low power variable length FFT proposed in[7]. Several pipelined and parallel pipelined architectures are proposed in the resent scenario [8],[9]. FFT design when inputs are is RFFT is introduced in [10]. Pipelined architecture for RFFT in [11] has an advantage of power and delay efficiency.

#### Manuscript published on 30 September 2019.

\*Correspondence Author(s)

Prasanna Kumar G Associate professor in Vishnu Institute of Technology Bhimavaram,

Krishna B.T Ph.D. degree in Electronics and Communication Engineering from Andhra University, Vishakhapatnam, Andhra Pradesh, India

Pushpa Kotipalli, Professor in ECE Department of Shri Vishnu Engineering College for Women, Bhimavaram, Andhra Pradesh, India.

© The Authors. Published by Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license <a href="http://creativecommons.org/licenses/by-nc-nd/4.0/">http://creativecommons.org/licenses/by-nc-nd/4.0/</a>

FPGA implementation of pipelined FFT architectures are proposed in [12]. several other parallel pipeline designs in recent years proposed in [13],[14],[15].

Here is a new pipelined parallel FFT architecture, where initial three stages follow pipelined structure and from fourth stage onwards it follows the parallel pipelined structure.A novel structure of pipelined Fast Fourier Transform is implemented for two radix structures those are radix-2 and radix-2<sup>2</sup>.

The five sub sections of the paper as follows, Section I gives the introduction of the Pipelined architecture for Fast Fourier Transform. Also, several existing works in the recent years on FFT optimization are described in this section. Section II gives the detailed analysis of proposed structure with diagrammatical explanation, section III describes the sub module used in the proposed structure, and section IV elaborates comparative results. Practical observations are compared with traditional techniques and the conclusions of the proposed work are mentioned in section V.

#### II. PROPOSED ARCHITECTURE

Radix-2<sup>2</sup> DIF Fast Fourier Transform architecture for 16 points illustrated in Figure 1, in which twiddle factor multipliers are required one time for two stages. The decomposition of Radix-2<sup>2</sup> is first F[k] decomposed as odd and even segments F[2k] as well as F[2k+1] shown in equations (2) and (3). F[2k+1] further dived into even and odd parts according to equations (4) and (5) then the resultant twiddle factors are distributed as shown in Figure 1

$$F[2k] = \sum_{n=0}^{\frac{N}{2}-1} \left\{ f[n] + f[\frac{N}{2} + n] \right\} w_N^{2kn}$$
 (2)

$$F[2k+1] = \sum_{n=0}^{\frac{N}{2}-1} \left\{ f[n] - f[n+\frac{N}{2}] \right\} w^n w_N^{2kn}$$
(3)

$$F[4k+1] = \sum_{n=0}^{N} \begin{cases} f[n] - f\left[n + \frac{N}{2}\right] - \\ f\left[n + \frac{N}{4}\right] \\ - f\left[n + \frac{3N}{4}\right] \end{cases} \} w^n w_N^{4kn}$$

$$(4)$$



$$F[4k+3] = \sum_{n=0}^{\frac{N}{4}-1} \begin{cases} f[n] - f\left[n + \frac{N}{2}\right] + \\ j\left[f\left[n + \frac{N}{4}\right]\right] \\ - f\left[n + \frac{3N}{4}\right] \end{cases}$$
 (5)

The proposed structure in FFT data flow illustrated in Figure 2, in which initial three parts are modeled on pipeline basis, from stage 4 onwards designed on pipeline parallel fashion shown as circles in stage 4 of Figure 2.

The proposed design for 16 values are displayed in **Figure 3**. 16-point FFT requires four stages of butterfly units. The structure follows the pattern as, stage-1, stage-2 and stage-3 are constructed with one butterfly unit which are BU1, BU2 and BU3 respectively. Stage-4 is constructed with two butterfly units BU41 and BU42. Each butterfly unit has two complex inputs and two complex outputs. This structure requires Intermediate elements between one stage to another stage, those are FIFO's to store the intermediate values, multiplexers for selecting two-point butterfly in the overall sixteen-point butterfly unit. The data flow from left to right, at first the initial 8 samples of f(n) that is f(0) to f(7) stored in FIFO F0, the timing plot of the proposed design is displayed in Figure 6 which is control timing for inputs, MUX's FIFO's and multiplier coefficients, in stage-2 three FIFO's F1 and two multiplexers with selection input 0 and 1 used. Stage 3 is like stage-2, in stage-4 there are two butterfly units and two FIFO's for data processing.

Proposed architecture for N=32 is shown in Figure 4, in that there is one more additional parallel stage used to process the data coming from stage-4. The throughput of the proposed architecture is two. The proposed design for N=64 is shown in Figure 5The progression of information of the proposed structure is portrayed as, FIFO F0 loads data from f(0) to f(7)when the first eight clock pulses data is available and in 8pulse onwards control engages read r0 it scrutinizes initial info estimation of f(0) and opposite way moving information is f(7) before f(8), so f(8) as well as f(0) handled through butterfly structure which comprises of dual summation circuits and creates two yield esteems. These two dual summation circuits are put away in two different First-in First-out blocks. The selection lines of multiplexers S0, S1 is chosen based on butterfly structure. After eight clock cycles from 9 to 12 the multiplexer choice line S0 can be made logic 0



Figure 1: 16 point-FFT usingRadix-2<sup>2</sup> DIF

One end of the butterfly is composed in FIFO F1 and the other edge can be composed in FIFO F2. R1 is the read signals for both the FIFOs.

R1 enables the read operation in both the FIFOs. The information in initial FIFO is scrutinizes in the butterfly structure around then s1 signal logic zero chooses moving information prepared through next part of the butterfly creates two yield esteems put away in another two FIFO's. These two FIFOs are denoted as F3 and F4. FIFOs F3 and F4 are empowered by w2 for two clock pulses. Similarly, from 13 to 16 span, signal S0 logic level one indicates that the information in F2 is moved to FIFO unit F1 of stage-1.First butterfly stage can free up additional eight extra rooms. Maximum of four yields are possible in one clock pulse. During the sixteenth clock pulse butterfly five creates F (0), F (8). In the same way butterfly six produces F (2), F (10). Successive clock pulse creates a different four yield possibilities. Remaining eight yield possibilities, empower S1 ascertain. At that point the last eight points of initial stage butterfly handles with in comparative design starting at now showed up for getting another eight esteems, five clock cycles are required fit will takes further five clock pulses.

The Latency for 16-point FFT is 9 and for 32 is 19 and of 39 for 64-point FFT. Regularly in an N point Fast Fourier Transform it is 5N/8-1. This procedure can be extended to higher order FFT structures also, Figure 4 shows the proposed FFT structure for N=32, which takes additional one stage that is two butterfly units. Figure 5 shows the proposed structure for N=64 in which there are one more additional stage. The subunits in the proposed architecture are Butterfly Unit (BU), multiplexers and twiddle factor multiplier unit. Each individual butterfly structure has two complex adders illustrated in Figure 7. Each real addition is performed by the carry save adder for low power



Figure 2:. Proposed structure in FFT Data flow







Figure 3: DIF-FFT of proposed Architecture for N=16

## III. DESIGN OF SUB-SECTIONS

### A. Multiplier Unit

The proposed structure uses multiplier unit shown in Figure 8, this multiplier unit is fractional multiplier unit, because of multiplication with twiddle factor value which is fractional value. Multiplication with value 0.8 is shown in Figure 8, the input X is initially multiply by 8 and the resultant is divided by 10 gives our required result. Division with factor 10 is performed by shift registers arranged in parallel form



Figure 4. Proposed Architecture for N=32



Figure 5.Proposed Architecture for N=64





Figure 6: Timing Diagram of proposed structure

## **B. Butterfly Unit (BU):**

The pipelined designs are composed of butterfly units, the Number of butterfly units the design with full parallel that has no latency is 2N-1. Total BU units of the proposed design evaluated mathematically First three stages only single BU units, total of 3 units from stage-3 there are 2 BU units for each stage, BU units from stage three with order N is  $2(\log_2 N - 3)$  .three units for first three stages

$$T_{BU} = 3 + 2(\log_2 N - 3)$$
  
 $T_{BU} = 2\log_2 N - 3$ 



Figure 7 Two-point Butterfly unit



Figure 8:. Multiplier Unit for twiddle factor multiplication

## IV. RESULTS AND DISCUSSION

The performance of the proposed structure is tested by taking three important parameters into consideration those are multipliers, adders and Latency by the recent work done in [10],[13],[14] and [15] shown in Table 1, as multipliers are concern the proposed structure requires lower multipliers than [10],[13] and radix-2 [14], slightly higher than radix-2<sup>3</sup> [14] and [15]. Adders are lower than [10], [13] and slightly higher than [14] and [15]. The latency of the proposed architecture is lower than [10], [13], [14] and [15]. The comparative results show that the proposed structure is reduces the latency with a little cost of hardware, so that it is low complexity high speed architecture.

Table 1: HARDWARE UTILIZATION COMPARISON TABLE OF THE PROPOSED STRUCTURE

| Type of<br>Design             | Muls              | Adds            | Latency             |
|-------------------------------|-------------------|-----------------|---------------------|
| Rd-2 [10]                     | $\log_2 N - 3$    | $4\log_2 N$     | N                   |
| Rd-2 <sup>3</sup> [13]        | $\log_8 N - 1$    | $2\log_2 N$     | $\frac{3N}{2}-1$    |
| Rd-2 [14]                     | $\log_2 N - 3$    | $2\log_2 N$     | $\frac{3N}{2}$ -6   |
| Rd-2 <sup>3</sup> [14]        | $2(\log_8 N - 1)$ | $2\log_2 N$     | $\frac{3N}{2}$ -6   |
| Rd-2 [15]                     | $4\log_4 N - 4$   | $4\log_4 N + 4$ | $\frac{3N}{4}$ -1   |
| Proposed<br>Rd-2              | $2\log_2 N - 1$   | $4\log_2 N - 6$ | $\frac{5N}{8}$ -1   |
| Proposed<br>Rd-2 <sup>2</sup> | $4\log_4 N - 2$   | $4\log_2 N - 6$ | $\frac{5N}{8}$ $-1$ |

Table 2: Hardware Utilization of proposed structure on vertix-6 FPGA for N=16

| Type              | Registers | Luts | DSP48E | Power(uw) |
|-------------------|-----------|------|--------|-----------|
| Rd-2              | 110       | 301  | 6      | 242       |
| Rd-2 <sup>2</sup> | 85        | 311  | 14     | 114       |

The proposed structure is synthesized on vertex-6 XC6VLX75T device, Table 2shows the Number resources utilized on the device for Radix-2 and Radix-2<sup>2</sup> structures at 100Mhz clock and power in micro watts.

#### V. CONCLUSIONS

A new structure proposed for implementation of pipelined FFT, which is optimized both in complexity as well as latency. The proposed structure has single stages up to first three stages of data flow, from fourth stage onwards parallel pipeline structures are used to compute parallelly. This structure uses FIFO's to store the intermediate values, a low power twiddle factor multiplication generator and multiplexers and low power carry save adder for addition operation. The comparative results with existing architecture show that proposed structure is optimized structure and has low latency and low complexity. The proposed designs best suited for the applications of OFDM and other Low latency module requirement applications with power optimization.



Published By:



#### REFERENCES

- J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, no. 90, pp. 297-301, 1965.
- E. H. Wold and A. M. Despain, "Pipeline and parallel-pipeline FFT processors for VLSI implementations," IEEE Trans. Comput., no. 5, pp. 414-426, 1984.
- A. M. Despain, "Fourier transform computers using CORDIC iterations," IEEE Trans. Comput., vol. 100, no. 10, pp. 993-1001, 1974.
- G. Bi and E. V Jones, "A pipelined FFT processor for word-sequential data," IEEE Trans. Acoust., vol. 37, no. 12, pp. 1982-1985, 1989.
- R. F. Yazicioglu, P. Merken, R. Puers, and C. Van Hoof, "Low-power low-noise 8-channel EEG front-end ASIC for ambulatory acquisition systems," in 2006 Proceedings of the 32nd European Solid-State Circuits Conference, 2006, pp. 247-250.
- W.-C. Yeh and C.-W. Jen, "High-speed and low-power split-radix FFT," IEEE Trans. Signal Process., vol. 51, no. 3, pp. 864-874, 2003.
- 7. Y.-T. Lin, P.-Y. Tsai, and T.-D. Chiueh, "Low-power variable-length fast Fourier transform processor," IEE Proceedings-Computers Digit. Tech., vol. 152, no. 4, pp. 499-506, 2005.
- L. Yang, K. Zhang, H. Liu, J. Huang, and S. Huang, "An efficient locally pipelined FFT processor," IEEE Trans. circuits Syst. II Express Briefs, vol. 53, no. 7, pp. 585-589, 2006.
- M. Shin and H. Lee, "A high-speed four-parallel radix-2 4 FFT/IFFT processor for UWB applications," in 2008 IEEE International Symposium on Circuits and Systems, 2008, pp. 960–963.
- M. Garrido, K. K. Parhi, and J. Grajal, "A pipelined FFT architecture for real-valued signals," IEEE Trans. Circuits Syst. I Regul. Pap., vol. 56, no. 12, pp. 2634–2643, 2009.
- S. A. Salehi, R. Amirfattahi, and K. K. Parhi, "Pipelined architectures for real-valued FFT and hermitian-symmetric IFFT with real datapaths," IEEE Trans. Circuits Syst. II Express Briefs, vol. 60, no. 8, pp. 507-511,
- S. Langemeyer, P. Pirsch, and H. Blume, "A FPGA architecture for real-time processing of variable-length FFTs," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 1705-1708.
- M. Ayinala, M. Brown, and K. K. Parhi, "Pipelined parallel FFT architectures via folding transformation," IEEE Trans. Very Large Scale Integr. Syst., vol. 20, no. 6, pp. 1068–1081, 2012.

  M. Ayinala and K. K. Parhi, "FFT architectures for real-valued signals
- based on radix-\$2^{3} \$ and radix-\$2^{4} \$ algorithms," IEEE Trans. Circuits Syst. I Regul. Pap., vol. 60, no. 9, pp. 2422-2430, 2013.
- A. X. Glittas, M. Sellathurai, and G. Lakshminarayanan, "A normal I/O order radix-2 FFT architecture to process twin data streams for MIMO," IEEE Trans. Very Large Scale Integr. Syst., vol. 24, no. 6, pp. 2402-2406, 2016.

# **AUTHORS PROFILE**



Prasanna Kumar G Research scholar in University College of Engineering JNTUK, Kakinada and Associate professor in Vishnu Institute of Technology Bhimavaram, The current research area is signal processing and presently research on low complexity and power optimized FFT designs.



Krishna B.T gotten Ph.D. degree in Electronics and Communication Engineering from Andhra University, Vishakhapatnam, Andhra Pradesh, India in 2010. He has been with the college school of Engineering, JNTUK, Kakinada since 2012. His exploration advantages incorporate



Pushpa. Kgotten Ph.D. degree in Electronics and Communication Engineering from Osmania University, Hyderabad, Andhra Pradesh in 2010. She filled in as Engineer in Astra Microwave Products Limited, Hyderabad during 1992 to 1997. From 1997 onwards, she served in various Engineering Colleges, wound up Associate

Professor in 2000 and Professor in 2007. She is currently a Professor in ECE Department of Shri Vishnu Engineering College for Women, Bhimavaram, Andhra Pradesh, India.

Retrieval Number: K15580981119/19©BEIESP DOI: 10.35940/ijitee.K1558.0981119 Journal Website: www.ijitee.org



