- Published on
Post-Quantum Cryptography Journey
- Authors
- Name
- Jayden Koh
Description
I studied post-quantum cryptography for a research project. I even gave a guest lecture about my project at my university under the mentorship of Prof. JV. The lecture slides for that presentation can be found here.
Below are my notes from researching the topic and how I created the slides.
Post-Quantum Cryptography
What is Post-Quantum Cryptography?
NIST has chosen CRYSTALS (Cryptographic Suite for Algebraic Lattices) to be the de facto standard for Post-Quantum Computing which includes digital signatures (Dilithium) and key encapsulation (Kyber).
Adoption of PQC by IBM and Background Knowledge
Integration of PQC by the US government
Integration and Background of PQC by OffSec
https://research.ibm.com/blog/quantum-safe-roadmap
What is Lattice-Based Cryptography?
Lattice-Based Cryptography is the new standard of mathematical cryptography that CRYSTALS relies on compared to Integer Factorization used by Modern Cryptography as it proves equally difficult to break for quantum computers as for digital computers.
Lattice-Based Cryptography Primer by RedHat
Lattice Cryptography Beginner: https://eprint.iacr.org/2015/938.pdf
What is Learning With Errors?
Learning with Errors is the idea of hiding a secret vector “s” within a public vector “t” and public matrix “A” that is irreversible when a small amount of error “e” is introduced such that t = As + e.
https://medium.com/@aditya.koranga/learning-with-errors-quantum-proof-algorithm-f43a6599fca7
What is Module Learning With Errors?
This is a variant of LWE that utilizes polynomials instead of integers to increase efficiency while maintaining security. This is what CRYSTALS relies on for security.
What is CRYSTALS-Dilithium?
Dilithium is the new digital signature scheme standard.
The implementation used is by Beckwith et al. from George Mason University.
Digital Signature Background
Deriving Schnorr's from first principles.
What is Schnorr’s Identification Protocol?
https://www.zkdocs.com/docs/zkdocs/zero-knowledge-protocols/schnorr/
https://crypto.stanford.edu/cs355/19sp/lec6.pdf
Schnorr is a Zero Knowledge Proof in the family of Sigma Protocols that follows this verification scheme.
What is Fiat-Shamir heuristic?
Transforms an interactive proof (Sigma) into noninteractive (Signature) by having the prover choose the challenge. To ensure that the prover cannot falsify a challenge transcript, the challenge needs to be generated after commitment with no way for the challenge to be predicted. This can be done with by hashing the public key with the commitment and letting the Verifier verify it.
Dilithium Process
https://eprint.iacr.org/2017/633.pdf
Key Generation
Standard numbers:
- Generate random matrix A
- Randomly sample secret vectors s1 and s2 with a coefficient upper bound of η
Result: The public key is t and A
Key Insights
- The matrix A is used to mix the private key s1 into the public key. Without the private key s1, it's very difficult to compute the vector t because A is random and s1 is secret.
- The smallness of s1 and s2 (controlled by η) is crucial for ensuring that the private key is not too large or unwieldy, but still difficult for an attacker to guess or reverse-engineer.
- The public key is made up of the matrix A and the vector t, and these are used by anyone wishing to verify a signature without being able to easily deduce the private key.
Signing
Generate random masking vector y from a discrete Gaussian distribution with a standard deviation proportional to γ1
Split each number in the matrix into high and low bits
A challenge c is created with the hash of the message (M) and their higher bits (w1). Most of the coefficients are -1, 1, or 0 because c has a small norm.
Potential signature is z = y + c * s1
If the coefficients of z are within a certain bound of γ1 - β, or z doesn’t align with the challenge, the mask is resampled.
The output, z the signature and c the challenge, is distributed so that anyone can use the challenge to check the signature, essentially verifying its validity.
The Main Idea in Simple Terms:
- The masking vector y "masks" the secret key, preventing it from being exposed.
- The challenge c is based on the message and helps in forming the signature, but it's done in such a way that it doesn't reveal the private key.
- Rejection sampling ensures that the signature is safe and doesn't leak any information, by rejecting signatures that don't meet certain criteria (like being too large).
Transmission
Challenge vector c = hash(m + w1)
Signature vector z = y + c * s1
Public Key matrix A and vector t
Verification
This is the main proof, if this can’t be met then the signature is invalid. This works since
c*s2 is really small, the higher bits aren’t affected.
Why This is Secure
- Prevents leakage of private key: Since the high-order bits are the same, the private key is not exposed in the signature verification process.
- Ensures correctness: The checks guarantee that the signature corresponds to the message and was created using the valid private key.
- No forgery: Even if an attacker tries to forge a signature, they will not be able to correctly generate a challenge c that matches the message and the high-order bits, because they don't know the secret key.
FPGA GMU Implementation
Steps to Reproduce on ZYNQ UltraScale+
Implementation fails to synthesize
Change file path names for KAT files to be absolute in all testbench files,
zetas.txt
also needs to be changed inoperation_module.v
The files were originally written on a Windows system so the file paths are broken.
keccak_round.v
has to be added manually as a design source otherwise, synthesis fails.Generating Bitstream directly fails
Because a ZYNQ is used there is a SoC which allows for a Processing System to work alongside the Programmable Logic
https://adaptivesupport.amd.com/s/article/418935?language=en_US
https://www.youtube.com/watch?v=AOy5l36DroY
Gemini Solution
1. HDL Design (PL Side):
- Interfaces: Define clear interfaces between your HDL and the PS. Common methods include:
- AXI4-Lite: For control and status signals. This is good for infrequent data transfers and configuration.
- AXI4-Stream: For high-throughput data transfers. This is ideal for streaming data in and out of the PL.
- Direct Signals (less common for complex data): You could use individual GPIOs, but AXI is generally preferred for its standardized and more manageable interface.
- Input Logic: Design the HDL logic to receive data from the AXI interface (or whatever interface you choose). This might involve buffering, data formatting, or other processing.
- Output Logic: Design the HDL logic to send data to the AXI interface. Again, you might need buffering or formatting.
2. PS Software (C/C++):
- Xilinx SDK/Vitis: Use the Xilinx SDK (older) or Vitis (newer) to develop your PS application.
- Device Drivers: You'll need device drivers to interact with the AXI interfaces connected to your PL design. Xilinx provides these. Look for drivers related to AXI4-Lite and AXI4-Stream.
- SD Card Access: Use standard C/C++ file I/O functions (e.g.,
fopen
,fread
,fclose
) along with appropriate SD card libraries (if needed) to read the expected output data from the file on the SD card. - AXI Communication:
- Memory Mapping (AXI4-Lite): For AXI4-Lite, you'll typically memory map the AXI registers into the PS address space. Then, you can use pointers to read and write to these registers to control the PL and exchange data.
- DMA (AXI4-Stream - Recommended for Data): For AXI4-Stream, Direct Memory Access (DMA) is strongly recommended for performance. The DMA will handle the data transfers between the PS memory and the PL's AXI4-Stream interface, freeing up the CPU. You'll need to configure the DMA appropriately.
- Data Transfer:
- Input: Read data from your input source (if any) and send it to the PL via the AXI interface (either through memory-mapped registers or the DMA).
- Output: Receive data from the PL via the AXI interface (again, likely using DMA for AXI4-Stream).
- Verification: Compare the data received from the PL with the data read from the SD card. Implement a suitable comparison algorithm (e.g., byte-by-byte comparison, checksums, or more sophisticated methods depending on your data). Report any errors or discrepancies.
3. Building and Running:
- Hardware Design (Vivado): Create your Vivado project, add your HDL code, and connect the AXI interfaces to the PS. Generate the bitstream.
- Software Design (SDK/Vitis): Create your PS application in the SDK/Vitis. Link the necessary libraries (drivers, SD card access, etc.).
- Boot and Run: Boot the Zynq with the generated bitstream and your PS application. The PS application will then control the PL, send input data, receive output data, and perform the verification.
Example Code Snippet (Conceptual - DMA Transfer - AXI4-Stream):
C
`// ... DMA setup and configuration ...
// Start the DMA transfer to send data to the PL Xil_DMA_Start(&dma_instance);
// ... Wait for DMA transfer to complete ...
// Read data from the PL using DMA Xil_DMA_Read(&dma_instance, destination_buffer, data_size);`
Key Considerations:
- Synchronization: Ensure proper synchronization between the PS and PL. Use control signals and status registers to coordinate data transfers.
- Error Handling: Implement robust error handling in both the PL and PS code.
- Performance: For high-throughput applications, optimize your data transfer methods and consider using DMA.
- Debugging: Use the Xilinx debugging tools (e.g., ILA in Vivado, software debugger in SDK/Vitis) to troubleshoot any issues.
create_bd_cell
(Most Common and Recommended):The most common and recommended way to connect AXI4-Stream interfaces is to use the
create_bd_cell
command within your XDC. This command instantiates a block in the Block Design (BD) and automatically handles the connections. This is much easier and less error-prone than manual connections.- Interfaces: Define clear interfaces between your HDL and the PS. Common methods include:
A7 test results:
KG2[ 0] completed in 4875 clock cycles KG2[ 1] completed in 4873 clock cycles KG2[ 2] completed in 4865 clock cycles KG2[ 3] completed in 4867 clock cycles KG2[ 4] completed in 4872 clock cycles Moving to KG3 KG3[ 0] completed in 8291 clock cycles KG3[ 1] completed in 8290 clock cycles KG3[ 2] completed in 8296 clock cycles KG3[ 3] completed in 8292 clock cycles KG3[ 4] completed in 8290 clock cycles Moving to KG5 KG5[ 0] completed in 14037 clock cycles KG5[ 1] completed in 14030 clock cycles KG5[ 2] completed in 14042 clock cycles KG5[ 3] completed in 14030 clock cycles KG5[ 4] completed in 14034 clock cycles SG2[ 0] completed in 10945 clock cycles SG2[ 1] completed in 22581 clock cycles SG2[ 2] completed in 22584 clock cycles SG2[ 3] completed in 16789 clock cycles SG2[ 4] completed in 22609 clock cycles Moving to SG3 SG3[ 0] completed in 24293 clock cycles SG3[ 1] completed in 16178 clock cycles SG3[ 2] completed in 32400 clock cycles SG3[ 3] completed in 113482 clock cycles SG3[ 4] completed in 40503 clock cycles Moving to SG5 SG5[ 0] completed in 268203 clock cycles SG5[ 1] completed in 24846 clock cycles SG5[ 2] completed in 24860 clock cycles SG5[ 3] completed in 135480 clock cycles SG5[ 4] completed in 46982 clock cycles VY2[ 0] completed in 6582 clock cycles VY2[ 1] completed in 6582 clock cycles VY2[ 2] completed in 6582 clock cycles VY2[ 3] completed in 6582 clock cycles VY2[ 4] completed in 6582 clock cycles Moving to VY3 VY3[ 0] completed in 9724 clock cycles VY3[ 1] completed in 9724 clock cycles VY3[ 2] completed in 9724 clock cycles VY3[ 3] completed in 9724 clock cycles VY3[ 4] completed in 9724 clock cycles Moving to VY5 VY5[ 0] completed in 14642 clock cycles VY5[ 1] completed in 14635 clock cycles VY5[ 2] completed in 14647 clock cycles VY5[ 3] completed in 14635 clock cycles VY5[ 4] completed in 14639 clock cycles
Details
In Algorithm 2, signature generation is described. This is the most complex and costly operation of Dilithium.
We present a new signature generation implementation approach based on splitting the signature generation rejection loop into a two-stage pipeline to balance latency and area utilization.
Algorithm Family Keygen (μs) Verify (μs) Sign (μs) Dilithium-II VUS+ 19 26 43/117 Dilithium-III VUS+ 32 39 63/193 Dilithium-V VUS+ 55 57 95/215 Dilithium-V K7 81 85 141/318 Dilithium-V A7 121 126 210/475
KAT Files
The extra _2
, _3
, _5
correspond to the NIST security levels.
- c
- c_s1
- h
- k
- m
- mlen
- rho
- s1
- s2
- t0
- t1
- tr
- z
- zs
Dilithium Hardware Acceleration
Recent Papers
- Beckwith et al. - 2021, GMU, implementation
- Zhao et al. - 2021, Tsinghua University,
- Aikata et al. - 2022, INTL,
- Li et al. - 2023, Huazhong University, PRNG, Karatsuba
What is CRYSTALS-Kyber?
Kyber is a Key Encapsulation Mechanism
What is Number Theoretic Transform?
The Fast-Number Theoretic Transform is a transformation for integers in a finite field that is used to reduce polynomial multiplication from O(n^2) to O(nlog_2n) which is one of the bottlenecks of Lattice-Based Cryptography. Using a method similar to the Fast Fourier Transform, the Fast Number Theoretic Transform can reduce the time complexity to quasi-linear which is important when performance is restricted by time, energy, and space.
NTT Beginner: https://eprint.iacr.org/2024/585.pdf
NTT Visual: https://codeforces.com/blog/entry/43499
https://www.cs.cmu.edu/afs/cs/academic/class/15451-s10/www/lectures/lect0423.txt
Textbook Polynomial Multiplication
Standard polynomial multiplication takes O(n^2) time. Additionally, taking both the positively wrapped convolution (dividing by x^n-1) and negatively wrapped convolution (dividing by x^n+1) takes O(n^2) time.
Polynomial multiplication in CRYSTALS is negatively wrapped which is more complicated than positively wrapped.
Multiplying a polynomial g by a polynomial h results in a polynomial y.
Getting the positively wrapped convolution of the polynomial y is necessary for module Learning With Errors which can be done with a long division strategy. This is still slow.
Using the NTT method still runs in O(n^2) time:
The positively wrapped convolution can be calculated with NTT and the n-th root of unity. Instead of multiplying and dividing the polynomials directly, a transform is employed with the matrix M and inverse matrix M^-1
n-th root of unity:
Transformation matrices:
An example of M is when n = 4, likewise M^-1 will be very similar:
This is still O(n^2) time complexity because a matrix is still multiplied by a vector.
Negatively Wrapped Convolutions are the same idea except the formulas are different, namely, the 2n-th root of unity is added.
What is the Number Theoretic Transform?
NTT (Number Theoretic Transform) converts polynomials from coefficient form to evaluation (point-value) form. This transformation still takes O(n^2) time to multiply the new matrix by a vector.
Note NTT evaluates the numbers not as literal representations but as values analogous to their literal values. In reality, the NTT transformation of a polynomial has no physical meaning.
What is Convolution in the NTT domain?
What is Fast-NTT?
To make NTT more efficient than raw multiplication, since it runs on O(n^2), Fast-NTT converts coefficient form to evaluation form in O(n*logn) using the properties of symmetry and periodicity and CT and GS butterflies. The butterflies reduce the repetitive multiplications by multiplying similar factors simultaneously.