Published on

Post-Quantum Cryptography Journey

Authors
  • avatar
    Name
    Jayden Koh
    Twitter

Description

I studied post-quantum cryptography for a research project. I even gave a guest lecture about my project at my university under the mentorship of Prof. JV. The lecture slides for that presentation can be found here.

Below are my notes from researching the topic and how I created the slides.

Post-Quantum Cryptography

What is Post-Quantum Cryptography?

NIST has chosen CRYSTALS (Cryptographic Suite for Algebraic Lattices) to be the de facto standard for Post-Quantum Computing which includes digital signatures (Dilithium) and key encapsulation (Kyber).

NIST PQC Link

Adoption of PQC by IBM and Background Knowledge

Integration of PQC by the US government

Integration and Background of PQC by OffSec

https://research.ibm.com/blog/quantum-safe-roadmap

What is Lattice-Based Cryptography?

Lattice-Based Cryptography is the new standard of mathematical cryptography that CRYSTALS relies on compared to Integer Factorization used by Modern Cryptography as it proves equally difficult to break for quantum computers as for digital computers.

Lattice-Based Cryptography Primer by RedHat

Lattice Cryptography Beginner: https://eprint.iacr.org/2015/938.pdf

What is Learning With Errors?

Learning with Errors is the idea of hiding a secret vector “s” within a public vector “t” and public matrix “A” that is irreversible when a small amount of error “e” is introduced such that t = As + e.

https://medium.com/@aditya.koranga/learning-with-errors-quantum-proof-algorithm-f43a6599fca7

What is Module Learning With Errors?

This is a variant of LWE that utilizes polynomials instead of integers to increase efficiency while maintaining security. This is what CRYSTALS relies on for security.

https://medium.com/asecuritysite-when-bob-met-alice/learning-with-errors-and-ring-learning-with-errors-23516a502406

What is CRYSTALS-Dilithium?

Dilithium is the new digital signature scheme standard.

The implementation used is by Beckwith et al. from George Mason University.

Digital Signature Background

Deriving Schnorr's from first principles.

What is Schnorr’s Identification Protocol?

https://www.zkdocs.com/docs/zkdocs/zero-knowledge-protocols/schnorr/

https://crypto.stanford.edu/cs355/19sp/lec6.pdf

Schnorr is a Zero Knowledge Proof in the family of Sigma Protocols that follows this verification scheme.

What is Fiat-Shamir heuristic?

Transforms an interactive proof (Sigma) into noninteractive (Signature) by having the prover choose the challenge. To ensure that the prover cannot falsify a challenge transcript, the challenge needs to be generated after commitment with no way for the challenge to be predicted. This can be done with by hashing the public key with the commitment and letting the Verifier verify it.

Dilithium Process

https://eprint.iacr.org/2017/633.pdf

Key Generation

Standard numbers:

q=223213+1,n=256q = 2^{23}-2^{13}+1, n=256
  1. Generate random matrix A
  2. Randomly sample secret vectors s1 and s2 with a coefficient upper bound of η
t=As1+s2t = As_1 + s_2

Result: The public key is t and A

Key Insights

  • The matrix A is used to mix the private key s1 into the public key. Without the private key s1, it's very difficult to compute the vector t because A is random and s1 is secret.
  • The smallness of s1 and s2 (controlled by η) is crucial for ensuring that the private key is not too large or unwieldy, but still difficult for an attacker to guess or reverse-engineer.
  • The public key is made up of the matrix A and the vector t, and these are used by anyone wishing to verify a signature without being able to easily deduce the private key.

Signing

  1. Generate random masking vector y from a discrete Gaussian distribution with a standard deviation proportional to γ1

  2. Split each number in the matrix into high and low bits

    w=w1×2γ2+w0w = w_1 \times 2^{\gamma_2} + w_0
  3. A challenge c is created with the hash of the message (M) and their higher bits (w1). Most of the coefficients are -1, 1, or 0 because c has a small norm.

  4. Potential signature is z = y + c * s1

  5. If the coefficients of z are within a certain bound of γ1 - β, or z doesn’t align with the challenge, the mask is resampled.

  6. The output, z the signature and c the challenge, is distributed so that anyone can use the challenge to check the signature, essentially verifying its validity.

    The Main Idea in Simple Terms:

    • The masking vector y "masks" the secret key, preventing it from being exposed.
    • The challenge c is based on the message and helps in forming the signature, but it's done in such a way that it doesn't reveal the private key.
    • Rejection sampling ensures that the signature is safe and doesn't leak any information, by rejecting signatures that don't meet certain criteria (like being too large).

Transmission

Challenge vector c = hash(m + w1)

Signature vector z = y + c * s1

Public Key matrix A and vector t

Verification

This is the main proof, if this can’t be met then the signature is invalid. This works since

c*s2 is really small, the higher bits aren’t affected.

Why This is Secure

  • Prevents leakage of private key: Since the high-order bits are the same, the private key is not exposed in the signature verification process.
  • Ensures correctness: The checks guarantee that the signature corresponds to the message and was created using the valid private key.
  • No forgery: Even if an attacker tries to forge a signature, they will not be able to correctly generate a challenge c that matches the message and the high-order bits, because they don't know the secret key.

FPGA GMU Implementation

  • Steps to Reproduce on ZYNQ UltraScale+

    Implementation fails to synthesize

    Change file path names for KAT files to be absolute in all testbench files, zetas.txt also needs to be changed in operation_module.v

    The files were originally written on a Windows system so the file paths are broken.

    keccak_round.v has to be added manually as a design source otherwise, synthesis fails.

    Generating Bitstream directly fails

    Because a ZYNQ is used there is a SoC which allows for a Processing System to work alongside the Programmable Logic

    https://adaptivesupport.amd.com/s/article/418935?language=en_US

    https://www.youtube.com/watch?v=AOy5l36DroY

    • Gemini Solution

      1. HDL Design (PL Side):

      • Interfaces: Define clear interfaces between your HDL and the PS. Common methods include:
        • AXI4-Lite: For control and status signals. This is good for infrequent data transfers and configuration.
        • AXI4-Stream: For high-throughput data transfers. This is ideal for streaming data in and out of the PL.
        • Direct Signals (less common for complex data): You could use individual GPIOs, but AXI is generally preferred for its standardized and more manageable interface.
      • Input Logic: Design the HDL logic to receive data from the AXI interface (or whatever interface you choose). This might involve buffering, data formatting, or other processing.
      • Output Logic: Design the HDL logic to send data to the AXI interface. Again, you might need buffering or formatting.

      2. PS Software (C/C++):

      • Xilinx SDK/Vitis: Use the Xilinx SDK (older) or Vitis (newer) to develop your PS application.
      • Device Drivers: You'll need device drivers to interact with the AXI interfaces connected to your PL design. Xilinx provides these. Look for drivers related to AXI4-Lite and AXI4-Stream.
      • SD Card Access: Use standard C/C++ file I/O functions (e.g., fopen, fread, fclose) along with appropriate SD card libraries (if needed) to read the expected output data from the file on the SD card.
      • AXI Communication:
        • Memory Mapping (AXI4-Lite): For AXI4-Lite, you'll typically memory map the AXI registers into the PS address space. Then, you can use pointers to read and write to these registers to control the PL and exchange data.
        • DMA (AXI4-Stream - Recommended for Data): For AXI4-Stream, Direct Memory Access (DMA) is strongly recommended for performance. The DMA will handle the data transfers between the PS memory and the PL's AXI4-Stream interface, freeing up the CPU. You'll need to configure the DMA appropriately.
      • Data Transfer:
        • Input: Read data from your input source (if any) and send it to the PL via the AXI interface (either through memory-mapped registers or the DMA).
        • Output: Receive data from the PL via the AXI interface (again, likely using DMA for AXI4-Stream).
      • Verification: Compare the data received from the PL with the data read from the SD card. Implement a suitable comparison algorithm (e.g., byte-by-byte comparison, checksums, or more sophisticated methods depending on your data). Report any errors or discrepancies.

      3. Building and Running:

      • Hardware Design (Vivado): Create your Vivado project, add your HDL code, and connect the AXI interfaces to the PS. Generate the bitstream.
      • Software Design (SDK/Vitis): Create your PS application in the SDK/Vitis. Link the necessary libraries (drivers, SD card access, etc.).
      • Boot and Run: Boot the Zynq with the generated bitstream and your PS application. The PS application will then control the PL, send input data, receive output data, and perform the verification.

      Example Code Snippet (Conceptual - DMA Transfer - AXI4-Stream):

      C

      `// ... DMA setup and configuration ...

      // Start the DMA transfer to send data to the PL Xil_DMA_Start(&dma_instance);

      // ... Wait for DMA transfer to complete ...

      // Read data from the PL using DMA Xil_DMA_Read(&dma_instance, destination_buffer, data_size);`

      Key Considerations:

      • Synchronization: Ensure proper synchronization between the PS and PL. Use control signals and status registers to coordinate data transfers.
      • Error Handling: Implement robust error handling in both the PL and PS code.
      • Performance: For high-throughput applications, optimize your data transfer methods and consider using DMA.
      • Debugging: Use the Xilinx debugging tools (e.g., ILA in Vivado, software debugger in SDK/Vitis) to troubleshoot any issues.

      create_bd_cell (Most Common and Recommended):

      The most common and recommended way to connect AXI4-Stream interfaces is to use the create_bd_cell command within your XDC. This command instantiates a block in the Block Design (BD) and automatically handles the connections. This is much easier and less error-prone than manual connections.

  • A7 test results:

    KG2[   0] completed in                 4875 clock cycles
    KG2[   1] completed in                 4873 clock cycles
    KG2[   2] completed in                 4865 clock cycles
    KG2[   3] completed in                 4867 clock cycles
    KG2[   4] completed in                 4872 clock cycles
    Moving to KG3
    KG3[   0] completed in                 8291 clock cycles
    KG3[   1] completed in                 8290 clock cycles
    KG3[   2] completed in                 8296 clock cycles
    KG3[   3] completed in                 8292 clock cycles
    KG3[   4] completed in                 8290 clock cycles
    Moving to KG5
    KG5[   0] completed in                14037 clock cycles
    KG5[   1] completed in                14030 clock cycles
    KG5[   2] completed in                14042 clock cycles
    KG5[   3] completed in                14030 clock cycles
    KG5[   4] completed in                14034 clock cycles
    
    SG2[          0] completed in                10945 clock cycles
    SG2[          1] completed in                22581 clock cycles
    SG2[          2] completed in                22584 clock cycles
    SG2[          3] completed in                16789 clock cycles
    SG2[          4] completed in                22609 clock cycles
    Moving to SG3
    SG3[          0] completed in                24293 clock cycles
    SG3[          1] completed in                16178 clock cycles
    SG3[          2] completed in                32400 clock cycles
    SG3[          3] completed in               113482 clock cycles
    SG3[          4] completed in                40503 clock cycles
    Moving to SG5
    SG5[          0] completed in               268203 clock cycles
    SG5[          1] completed in                24846 clock cycles
    SG5[          2] completed in                24860 clock cycles
    SG5[          3] completed in               135480 clock cycles
    SG5[          4] completed in                46982 clock cycles
    
    VY2[          0] completed in                 6582 clock cycles
    VY2[          1] completed in                 6582 clock cycles
    VY2[          2] completed in                 6582 clock cycles
    VY2[          3] completed in                 6582 clock cycles
    VY2[          4] completed in                 6582 clock cycles
    Moving to VY3
    VY3[          0] completed in                 9724 clock cycles
    VY3[          1] completed in                 9724 clock cycles
    VY3[          2] completed in                 9724 clock cycles
    VY3[          3] completed in                 9724 clock cycles
    VY3[          4] completed in                 9724 clock cycles
    Moving to VY5
    VY5[          0] completed in                14642 clock cycles
    VY5[          1] completed in                14635 clock cycles
    VY5[          2] completed in                14647 clock cycles
    VY5[          3] completed in                14635 clock cycles
    VY5[          4] completed in                14639 clock cycles
    

GitHub - GMUCERG/Dilithium: High-speed full CRYSTALS-DILITHIUM implementation on FPGA: Keygen, Sign, Verify.

  • Details

    In Algorithm 2, signature generation is described. This is the most complex and costly operation of Dilithium.

    We present a new signature generation implementation approach based on splitting the signature generation rejection loop into a two-stage pipeline to balance latency and area utilization.

    AlgorithmFamilyKeygen (μs)Verify (μs)Sign (μs)
    Dilithium-IIVUS+192643/117
    Dilithium-IIIVUS+323963/193
    Dilithium-VVUS+555795/215
    Dilithium-VK78185141/318
    Dilithium-VA7121126210/475

KAT Files

The extra _2, _3, _5 correspond to the NIST security levels.

  • c
  • c_s1
  • h
  • k
  • m
  • mlen
  • rho
  • s1
  • s2
  • t0
  • t1
  • tr
  • z
  • zs

Round 3 Benchmarks

Dilithium Hardware Acceleration

Recent Papers

What is CRYSTALS-Kyber?

Kyber is a Key Encapsulation Mechanism

What is Number Theoretic Transform?

The Fast-Number Theoretic Transform is a transformation for integers in a finite field that is used to reduce polynomial multiplication from O(n^2) to O(nlog_2n) which is one of the bottlenecks of Lattice-Based Cryptography. Using a method similar to the Fast Fourier Transform, the Fast Number Theoretic Transform can reduce the time complexity to quasi-linear which is important when performance is restricted by time, energy, and space.

NTT Beginner: https://eprint.iacr.org/2024/585.pdf

NTT Visual: https://codeforces.com/blog/entry/43499

https://www.cs.cmu.edu/afs/cs/academic/class/15451-s10/www/lectures/lect0423.txt

https://www.researchgate.net/publication/372303847_Conceptual_Review_on_Number_Theoretic_Transform_and_Comprehensive_Review_on_Its_Implementations

Textbook Polynomial Multiplication

Standard polynomial multiplication takes O(n^2) time. Additionally, taking both the positively wrapped convolution (dividing by x^n-1) and negatively wrapped convolution (dividing by x^n+1) takes O(n^2) time.

Polynomial multiplication in CRYSTALS is negatively wrapped which is more complicated than positively wrapped.

Multiplying a polynomial g by a polynomial h results in a polynomial y.

y=ghy = g*h

Getting the positively wrapped convolution of the polynomial y is necessary for module Learning With Errors which can be done with a long division strategy. This is still slow.

PWC(y)=y/(xn1)PWC(y) = y/(x^n - 1)

Using the NTT method still runs in O(n^2) time:

The positively wrapped convolution can be calculated with NTT and the n-th root of unity. Instead of multiplying and dividing the polynomials directly, a transform is employed with the matrix M and inverse matrix M^-1

g^=Mgh^=Mhy^=g^h^PWC(y)=M1y^\hat{g} = M*g\\ \hat{h} = M*h\\ \hat{y} = \hat{g}*\hat{h}\\ PWC(y) = M^{-1}*\hat{y}

n-th root of unity:

ωn1 mod q\omega^n \equiv 1 \text{ mod } q

Transformation matrices:

M^i=i=0n1ωijai mod qM^1j=n1j=0n1ωija^j mod q\hat{M}_i = \sum^{n-1}_{i=0} \omega^{ij}a_i \text{ mod } q \\ {\hat{M}^{-1}}_j = n^{-1}\sum^{n-1}_{j=0} \omega^{-ij}\hat{a}_j \text{ mod } q

An example of M is when n = 4, likewise M^-1 will be very similar:

This is still O(n^2) time complexity because a matrix is still multiplied by a vector.

Negatively Wrapped Convolutions are the same idea except the formulas are different, namely, the 2n-th root of unity is added.

ψ2=ω mod qM^i=i=0n1ψ2ij+iai mod qM^1j=n1j=0n1ψ2ijja^j mod q\psi^2 = \omega \text{ mod } q\\ \hat{M}_i = \sum^{n-1}_{i=0} \psi^{2ij+i}a_i \text{ mod } q \\ {\hat{M}^{-1}}_j = n^{-1}\sum^{n-1}_{j=0} \psi^{-2ij-j}\hat{a}_j \text{ mod } q

What is the Number Theoretic Transform?

NTT (Number Theoretic Transform) converts polynomials from coefficient form to evaluation (point-value) form. This transformation still takes O(n^2) time to multiply the new matrix by a vector.

Note NTT evaluates the numbers not as literal representations but as values analogous to their literal values. In reality, the NTT transformation of a polynomial has no physical meaning.

What is Convolution in the NTT domain?

What is Fast-NTT?

To make NTT more efficient than raw multiplication, since it runs on O(n^2), Fast-NTT converts coefficient form to evaluation form in O(n*logn) using the properties of symmetry and periodicity and CT and GS butterflies. The butterflies reduce the repetitive multiplications by multiplying similar factors simultaneously.