Talks and presentations

Algorithm-Hardware Co-design Methodology and the Case Study in Cryptographic Applications

January 06, 2025

Talk, Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR, China

Lattice-based cryptography (LBC) has been established as a prominent research field, with particular attention on post-quantum cryptography (PQC) and fully homomorphic encryption (FHE). As the implementing bottleneck of PQC and FHE, polynomial multiplication over cyclonic ring has been extensively studied. While schoolbook implementation of the polynomial multiplication takes the computation complexity of O(n^2) for length-n polynomials, the number theoretic transform (NTT) based polynomial multiplication reduces the complexity to O(nlogn). Although the complexity of the NTT algorithm is much lower than the schoolbook algorithm, the efficiency of the NTT algorithm in parallel computing has yet to be effectively improved. Besides, current works struggled with scalability, hindering their adaptation to various parameters, such as bit-width and polynomial length.

This thesis proposes algorithmic and architectural optimizations to improve the efficiency of NTT-based LBC implementations in parallel computing. Firstly, we propose the novel Discrete Galois Transformation (DGT) algorithms utilizing the split-radix variant and radix-4 variant, respectively.DGT is a variant of NTT that reduces transform length to half, but DGT requires more multiplication operations than the latest NTT algorithm in theoretical analysis. The proposal of DGT utilizing the split-radix variant could improve the number of multiplication operations compared to the existing NTT while achieving a higher level of parallelism to the existing NTT. Besides, the proposal of DGT utilizing the radix-4 variant could further improve the level of parallelism and reduce the memory throughput to the existing NTT.

ProgramGalois: A Programmable Generator of Radix-4 Discrete Galois Transformation Architecture for Lattice-based Cryptography

November 06, 2023

Talk, Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR, China

Lattice-based cryptography (LBC) has been established as a prominent research field, with particular attention on post-quantum cryptography (PQC) and fully homomorphic encryption (FHE). As the implementing bottleneck of PQC and FHE, number theoretic transform (NTT) has been extensively studied. However, current works struggled with scalability, hindering their adaptation to various parameters, such as bit-width and polynomial length. In this work, we proposed a novel DGT algorithm utilizing the radix-4 variant to achieve a higher level of parallelism to the existing NTT. Furthermore, to implement the efficient radix-4 DGT adapting more LBCs, we proposed a set of scalable building blocks, including a modified Barrett modular multiplier accepting arbitrary modulus utilizing only one integer multiplier, a radix-4 DGT butterfly unit, and a stream permutation network. The proposed modules are implemented on the Xilinx Virtex-7 and U250 FPGA to evaluate resource utilization and performance. Lastly, a design space exploration framework is proposed to generate optimized radix-4 DGT hardware constrained by polynomial and platform parameters. The sensitivity analysis showcases the generated hardware’s performance and scalability. The implementation results on the Xilinx Virtex-7 and U250 FPGA show significant performance improvements over the state-of-the-art works, which reached at least 35%, 192%, and 68% ATP improvements in terms of LUTs, BRAMs, and DSPs, respectively.

Algorithm-Hardware Co-design of Split-Radix Discrete Galois Transformation for KyberKEM

November 25, 2022

Talk, Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR, China

KyberKEM is one of the finalized key encapsulation mechanisms (KEMs) in the NIST postquantum cryptography (PQC) competition. In the NIST competition, the performance of the implementation of KyberKEM on the field-programmable gate array (FPGA) is one of the essential issues. Related works proposed plenty of improvements on negative wrapped convolution (NWC) via number theoretic transform (NTT), which could be considered as the bottleneck of KyberKEM. Discrete Galois Transformation (DGT) is a variant of NTT working on GF(q^2), reducing the transform length in half. But DGT requires more multiplication computation operations than NTT. In this work, we first proposed the novel DGT variant by introducing the “split radix” method, which is named split-radix DGT. When the polynomial size N = 1024, the proposed split-radix DGT requires 93% multiplication operations of the state-of-the-art NTT without compromising the reduction of the transform length. Secondly, a dedicated stream permutation network and the corresponding scheduling plan for the split-radix DGT in KyberKEM are proposed, which could reduce 25% clock cycle counts compared to the state-of-the-art NTT architecture in KyberKEM. Then, the pipelined unified split-radix DGT processor is proposed and implemented on Xilinx Zynq-7000 platform at 240 MHz, which could achieve at least 1.33 times faster NTT, and 1.94 times faster PWM with 0.91 times smaller area-time product (ATP) compared with the results of the state-of-the-art designs of NWC in KyberKEM on similar platforms. Finally, the highly efficient architecture of KyberKEM on FPGA using split-radix DGT is proposed. The implementation results on an FPGA show that our design could outperform the state-of-the-art designs of KyberKEM on similar platforms.

Multiplication Algorithms in Lightweight PQC - Binary Ring-LWE

June 23, 2020

Talk, Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR, China

The most time-consuming function in most Post-Quantum Cryptography (PQC) is the polynomial multiplication modulo q. Based on the fact, researchers have developed many polynomial multiplication methods, such as Toom-Cook, Karatsuba, and NTT. One of the key strategies to implement and realize the multiplication algorithm with high performance is in the parallelized nature and relies on the compromise between the device area and the performance. However, in some lightweight scenarios, such as IoT applications, such a compromise is not acceptable because of the limited device area. In this sharing, I would introduce a lightweight PQC algorithm named the Binary Ring-LWE, with a sparse convolution product method.