Algorithm-Hardware Co-design Methodology and the Case Study in Cryptographic Applications
Date:
Lattice-based cryptography (LBC) has been established as a prominent research field, with particular attention on post-quantum cryptography (PQC) and fully homomorphic encryption (FHE). As the implementing bottleneck of PQC and FHE, polynomial multiplication over cyclonic ring has been extensively studied. While schoolbook implementation of the polynomial multiplication takes the computation complexity of O(n^2) for length-n polynomials, the number theoretic transform (NTT) based polynomial multiplication reduces the complexity to O(nlogn). Although the complexity of the NTT algorithm is much lower than the schoolbook algorithm, the efficiency of the NTT algorithm in parallel computing has yet to be effectively improved. Besides, current works struggled with scalability, hindering their adaptation to various parameters, such as bit-width and polynomial length.
This thesis proposes algorithmic and architectural optimizations to improve the efficiency of NTT-based LBC implementations in parallel computing. Firstly, we propose the novel Discrete Galois Transformation (DGT) algorithms utilizing the split-radix variant and radix-4 variant, respectively.DGT is a variant of NTT that reduces transform length to half, but DGT requires more multiplication operations than the latest NTT algorithm in theoretical analysis. The proposal of DGT utilizing the split-radix variant could improve the number of multiplication operations compared to the existing NTT while achieving a higher level of parallelism to the existing NTT. Besides, the proposal of DGT utilizing the radix-4 variant could further improve the level of parallelism and reduce the memory throughput to the existing NTT.
We proposed the hardware implementations for the proposed DGT algorithms utilizing the split-radix variant and radix-4 variant. We proposed a unified split-radix DGT processor with the dedicated stream permutation network and implemented it on the Xilinx Artix-7 FPGA. The processor could unify the DGT, the inverse DGT, and component-wise multiplication and achieves at least 49.4% faster transformation and 65.3% faster component-wise multiplication, with at most 87% and 32% LUT-NTT area-time product and LUT-CWM area-time product, compared with the state-of-the-art polynomial multipliers with the same processing element settings on similar platforms. Besides, as a proof-of-concept, we designed a highly efficient PQC accelerator using the proposed split-radix DGT processor. We implemented a PQC scheme named KyberKEM, one of the final round key encapsulation mechanisms in the NIST post-quantum cryptography competition. The implementation results on Artix-7 FPGA show significant performance improvements over the state-of-the-art KyberKEM designs, demonstrating the performance advantage of the algorithm-hardware co-design methodology.
Lastly, to implement the efficient radix-4 DGT adapting more LBCs, we propose a set of scalable building blocks, including a modified Barrett modular multiplier accepting arbitrary modulus with only one integer multiplier, a radix-4 DGT butterfly unit, and a stream permutation network. The proposed modules are implemented on the Xilinx Virtex-7 and U250 FPGA to evaluate resource utilization and performance. Lastly, a design space exploration framework is proposed to generate optimized radix-4 DGT hardware constrained by polynomial and platform parameters. The sensitivity analysis showcases the generated hardware’s performance and scalability. The implementation results on the Xilinx Virtex-7 and U250 FPGA show significant performance improvements over the state-of-the-art works, which reached at least 35%, 192%, and 68% area-time product improvements in terms of LUTs, BRAMs, and DSPs, respectively.