Since the last blog post a little more than a year ago, the field of FHE (fully homomorphic encryption) acceleration has seen significant progress, including the first public releases by companies participating in DARPA’s Data Protection in Virtual Environments (DPRIVE) program, such as TREBUCHET, BASALISC, and HERACLES. In this post, we briefly summarize these developments and look into the future.

#### A Brief Recap of FHE

As cloud-based everything-as-a-service (XaaS) becomes more prevalent, users are increasingly sharing sensitive data, such as financial and medical information, with cloud servers. Homomorphic encryption (HE) is a cryptographic technique that allows users to share encrypted data without compromising its confidentiality, because HE allows computations to be performed on encrypted data without decrypting it first.

One of the most popular HE schemes is CKKS, which supports the encryption of real or complex vectors. CKKS has been used for a variety of privacy-preserving applications, including private machine learning (ML) workloads.

There are two main ways to apply CKKS to workloads: leveled HE (LHE) and fully homomorphic encryption (FHE). LHE limits the number of HE operations that can be performed on an encrypted message, or ciphertext. Once this budget is exhausted, the user must decrypt the ciphertext and re-encrypt it as a fresh ciphertext to continue computing. FHE, by contrast, does not have this limitation. However, FHE is far more computationally expensive than LHE, which made it impractical for real-world applications until recently.

To reduce the overhead of FHE, various hardware solutions have been proposed, ranging from general-purpose CPU/GPU systems to custom accelerators on FPGAs and ASICs. In particular, ASIC solutions have shown performance improvements of over two orders of magnitude over state-of-the-art GPU implementations, making them a promising candidate for future FHE adoption.

#### CKKS Encryption and HE Operations

State-of-the-art HE schemes (e.g., CKKS, BFV, BGV, FHEW, and TFHE) are based on the ring learning with errors (RLWE) problem, which is believed to be secure against cryptographic attacks, including those utilizing quantum computers. In the CKKS scheme, a user initially packs *n* (*n* ≤ *N*/2) real or complex numbers into a vector message (m). m is transformed into a degree-(*N* − 1) integer polynomial (P_{m}), also referred to as plaintext. The transformation multiplies m by a large scale Δ and performs rounding to obtain integers with small rounding errors. P_{m} is then encrypted into a ciphertext ([[m]]), which consists of a pair of polynomials in the cyclotomic polynomial ring (R_{Q}) with extremely large integer coefficients (e.g., ≃ 2^{1200}).
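The scale-and-round step of the transformation can be sketched as follows. This is a deliberately minimal illustration (it omits the canonical-embedding transform that full CKKS encoding also applies), and the scale value is a hypothetical choice, not one mandated by the scheme:

```python
# Toy sketch of the CKKS scale-and-round step only; real CKKS encoding
# also applies a variant of the inverse FFT (the canonical embedding).
DELTA = 2**40  # scale; an illustrative value, not a required parameter

def to_plaintext(message):
    """Scale each real entry by DELTA and round to the nearest integer."""
    return [round(x * DELTA) for x in message]

def from_plaintext(coeffs):
    """Invert the scaling; a small rounding error remains."""
    return [c / DELTA for c in coeffs]

m = [3.14159, -2.71828]
decoded = from_plaintext(to_plaintext(m))
# decoded approximates m; the per-entry error is bounded by 1/(2 * DELTA)
```

The larger Δ is, the smaller the relative rounding error, which is why CKKS is well suited to approximate real-number arithmetic.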

[[m]] can be sent to a server to offload computation on m. When offloading the calculation of *f*(m), the server performs a homomorphic evaluation of *f* on [[m]]. This requires a series of primitive HE ops to be performed on [[m]], which include evaluating an addition or multiplication (mult) op between [[m]] and another ciphertext (HAdd and HMult), a plaintext, or a constant, cyclic rotation of [[m]] (HRot), and more. Primitive HE ops can be combined to construct more complex HE ops, such as bootstrapping. Primitive HE ops are further subdivided into primary polynomial functions, including number-theoretic transform (NTT), base conversion (BConv), automorphism, and other element-wise functions (e.g., add and mult).

#### Residue Number System, Number Theoretic Transform, and Base Conversion

To efficiently handle large integers, CKKS employs the residue number system (RNS). In R_{Q}, all integer ops are modular ops with respect to the polynomial modulus *Q*. We can set *Q* to be the product of *L* RNS primes, each small enough to fit in a machine word, and decompose each large coefficient of a polynomial. Then, expensive large-integer ops can be replaced by a set of *L* parallel modulo ops natively supported by machines (e.g., 32b or 64b). RNS transforms a degree-(*N* − 1) polynomial into an *L* × *N* matrix, where each row, referred to as a limb of the polynomial, corresponds to a unique prime.
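The idea can be sketched in a few lines of Python. The prime values below are toy choices for illustration; real RNS primes are selected to satisfy NTT-friendliness constraints as well:

```python
# Sketch of RNS: a large integer coefficient is stored as residues modulo
# L machine-word-sized primes, and arithmetic proceeds independently (and
# hence in parallel) per residue. Prime values are toy choices.
from math import prod

primes = [1007681, 1032193, 1043969]   # toy "RNS primes" (L = 3)
Q = prod(primes)

def to_rns(x):
    return [x % q for q in primes]

def rns_mul(a, b):
    # One independent modular multiply per prime -- trivially parallel.
    return [(ai * bi) % q for ai, bi, q in zip(a, b, primes)]

def from_rns(residues):
    # CRT reconstruction, only needed when leaving the RNS domain.
    x = 0
    for r, q in zip(residues, primes):
        Qi = Q // q
        x += r * Qi * pow(Qi, -1, q)
    return x % Q

x, y = 123456789012, 98765432109
assert from_rns(rns_mul(to_rns(x), to_rns(y))) == (x * y) % Q
```

In hardware, each of the *L* residue channels maps naturally onto an independent lane or cluster, which is one reason RNS is universal in FHE accelerators.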

In R_{Q}, polynomial mult is equivalent to a negacyclic convolution between two vectors of coefficients, having O(*N*^{2}) complexity when naïvely calculated. NTT, an integer version of the Fourier transform, can be applied to each coefficient vector, converting the convolution into a simple element-wise mult; with FFT-style algorithms, this reduces the overall complexity to O(*N* log *N*). Inverse NTT (iNTT) must be performed to obtain the final result, but polynomials are usually kept in their NTT-applied form, termed evaluation representation, as more element-wise functions can be evaluated in this representation. However, some functions require the polynomial to be restored to its original form, termed coefficient representation, by performing iNTT. When used with RNS, (i)NTT is applied to each limb of the *L* × *N* matrix.
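The equivalence can be demonstrated with toy parameters. For brevity, the transform below is a direct O(*N*^{2}) DFT rather than a butterfly-based O(*N* log *N*) NTT, but the twisting by a 2*N*-th root of unity ψ (which turns cyclic into negacyclic convolution) is the same trick production NTTs use:

```python
# Polynomial mult in Z_q[X]/(X^N + 1): naive negacyclic convolution vs.
# the NTT route (twist by psi, transform, multiply element-wise, invert).
# Toy parameters; real CKKS uses N around 2^16 and word-sized primes.
N, q = 4, 257

# psi: a 2N-th root of unity with psi^N = -1 (mod q); omega = psi^2.
psi = next(x for x in range(2, q) if pow(x, N, q) == q - 1)
omega = pow(psi, 2, q)

def negacyclic_naive(a, b):          # O(N^2) schoolbook version
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k, sign = (i + j) % N, (-1 if i + j >= N else 1)
            c[k] = (c[k] + sign * a[i] * b[j]) % q
    return c

def dft(a, w):                       # direct transform (O(N^2) for brevity)
    return [sum(a[i] * pow(w, i * k, q) for i in range(N)) % q
            for k in range(N)]

def negacyclic_ntt(a, b):
    at = dft([a[i] * pow(psi, i, q) % q for i in range(N)], omega)
    bt = dft([b[i] * pow(psi, i, q) % q for i in range(N)], omega)
    ct = dft([x * y % q for x, y in zip(at, bt)], pow(omega, -1, q))
    n_inv = pow(N, -1, q)
    return [ct[i] * n_inv * pow(psi, -i, q) % q for i in range(N)]

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
assert negacyclic_ntt(a, b) == negacyclic_naive(a, b)
```

The element-wise mult in the middle is exactly why accelerators keep polynomials in the evaluation representation whenever possible.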

When two polynomials have different polynomial moduli, BConv matches the modulus. This is needed during key-switching, a subroutine shared by HMult and HRot, in which a polynomial with modulus *Q* is multiplied with an evaluation key, evk (dnum pairs of polynomials with a bigger modulus *PQ*). HMult uses a single evk, whereas HRot uses a separate evk for each rotation amount *r*. The evks must use an auxiliary modulus *P* along with *Q*. With RNS, an evk is expressed as dnum = ⌈*L*/*K*⌉ (decomposition number) pairs of (*L* + *K*) × *N* matrices. BConv mostly involves computing a matrix-matrix mult between an *L* × *N* matrix of R_{Q} and a predefined *K* × *L* matrix, producing a *K* × *N* matrix of R_{P}. BConv requires the input to be in the coefficient representation before computation. Thus, the sequence or pattern of iNTT → BConv → NTT is frequently observed.
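For a single coefficient, the fast base conversion at the heart of BConv can be sketched as below. The primes are toy values; the key point is that the residues modulo the new base *P* are obtained by a matrix-style accumulation over the input limbs, without ever reconstructing the big integer:

```python
# Sketch of fast base conversion (BConv) for one coefficient: residues of
# x modulo the primes of Q become residues modulo the primes of P via a
# K x L matrix-style accumulation. Prime values are toy choices.
from math import prod

q_primes = [97, 101, 103]   # base Q (L = 3)
p_primes = [107, 109]       # base P (K = 2)
Q = prod(q_primes)

def bconv(x_residues):
    # y_i = x_i * (Q/q_i)^{-1} mod q_i: one scalar fix-up per input limb.
    y = [r * pow(Q // q, -1, q) % q for r, q in zip(x_residues, q_primes)]
    # x~_j = sum_i y_i * ((Q/q_i) mod p_j) mod p_j: the matrix step, with
    # the K x L matrix ((Q/q_i) mod p_j) precomputed in practice.
    return [sum(yi * ((Q // q) % p) for yi, q in zip(y, q_primes)) % p
            for p in p_primes]

x = 123456
out = bconv([x % q for q in q_primes])
# out[j] equals (x + alpha * Q) mod p_j for a small integer 0 <= alpha < L;
# this "fast" (approximate) conversion is either tolerated or corrected.
```

Applied across all *N* coefficients of all limbs, this per-coefficient dot product becomes the *K* × *L* by *L* × *N* matrix-matrix mult described above.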

#### Distribution Strategies for Ciphertexts and Evaluation Keys and Their Tradeoffs

Because HE ops are innately massively parallel, FHE accelerators populate a large number of computing units (e.g., vector lanes, processing elements, and clusters). One of the critical design decisions is then how to distribute (scatter) a ciphertext ([[m]]) or an evaluation key (evk) across these computing units. A coefficient-wise distribution scatters the N coefficients of each limb equally across all computing units. By contrast, a limb-wise distribution assigns each prime and the corresponding limb to a specific computing unit. The former performs (i)NTT with inter-unit communication but BConv without it, whereas the latter performs (i)NTT without inter-unit communication but BConv with it. Because (i)NTT and BConv are the dominant operations of FHE in terms of computational complexity, these distribution strategies heavily influence the microarchitecture of an FHE accelerator.
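The two strategies amount to slicing the *L* × *N* limb matrix along different axes. A minimal sketch with toy sizes (the function names are illustrative, not taken from any of the papers below):

```python
# Two ways to scatter an L x N limb matrix across U computing units.
# Toy sizes; real designs use L in the tens and N around 2^16.
L_limbs, N_coeffs, U = 4, 16, 4
poly = [[l * N_coeffs + c for c in range(N_coeffs)] for l in range(L_limbs)]

def coefficient_wise(mat):
    # Each unit holds N/U columns of EVERY limb: an (i)NTT along a row
    # spans units (communication), but BConv's per-column work stays local.
    w = N_coeffs // U
    return [[row[u * w:(u + 1) * w] for row in mat] for u in range(U)]

def limb_wise(mat):
    # Each unit holds L/U whole limbs: (i)NTT on a limb is fully local,
    # but BConv must combine limbs that live on different units.
    h = L_limbs // U
    return [mat[u * h:(u + 1) * h] for u in range(U)]

cw, lw = coefficient_wise(poly), limb_wise(poly)
# cw[u]: L_limbs rows of N_coeffs/U coefficients each
# lw[u]: L_limbs/U complete rows of N_coeffs coefficients
```

Which slicing wins depends on whether the design spends more of its interconnect budget on (i)NTT or on BConv, which is exactly where the accelerators below diverge.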

#### F1, the First ASIC Accelerator Proposal for FHE

F1 (MICRO 2021) is the first ASIC HE accelerator to evaluate the performance of a single-slot CKKS bootstrapping. It is a programmable accelerator supporting multiple FHE schemes, including CKKS and BGV. F1 uses relatively small parameters, (N,L) = (2^{14}, 15), and deploys a massive amount of computation logic to utilize the parallelism in HE ops. In particular, F1 utilizes a dedicated NTT unit (NTTU) that implements a pipelined 2D-FFT using 896 modular multipliers. F1 consists of 16 vector clusters, each with 2^{7} lanes that feed data to an NTTU and other FUs. Every HE op is a sequence of modular mults and modular adds, and a modular mult requires a costly modular reduction op.
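The cost of that modular reduction is why NTTUs dedicate so much logic to it. A common hardware-friendly approach, sketched below with an illustrative (not F1-specific) 32-bit modulus, is Barrett reduction, which replaces the divide with multiplies and shifts:

```python
# Sketch of modular mult via Barrett reduction: the division by q is
# replaced with a multiply by a precomputed constant and a shift.
# The modulus is an arbitrary toy value, not one used by F1.
WORD = 32
q = 0xFFFFEE01                      # toy 32-bit modulus
mu = (1 << (2 * WORD)) // q         # precomputed Barrett constant

def barrett_mulmod(a, b):
    """Return (a * b) mod q without a hardware divide."""
    t = a * b                       # double-width product
    est = (t * mu) >> (2 * WORD)    # underestimate of floor(t / q)
    r = t - est * q                 # within a couple of q of the answer
    while r >= q:                   # at most two conditional subtractions
        r -= q
    return r

assert barrett_mulmod(123456789, 987654321) == (123456789 * 987654321) % q
```

Alternatives such as Montgomery reduction make a similar trade; either way, the reduction logic dominates the area of each modular multiplier.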

Although F1 provides high speedup compared to conventional CPU/GPU/FPGA-based systems, it is primarily applied to small problem sizes and relatively simple workloads. F1 uses limb-wise data distribution, adopts the machine word size of 32b, is 151.4mm^{2} in size at a 12/14nm technology node, and shows a TDP of 180.4W excluding the HBM power.

#### CraterLake and BTS

CraterLake (ISCA 2022) is a successor of F1 featuring 256MB of on-chip memory (4× larger than F1), massive BConv units, and pseudo-random number generators for evks, which halve the memory traffic and capacity requirements for evks. CraterLake builds upon F1’s design of placing N^{0.5} parallel lanes in a lane group (cluster) that feeds data into massive vector NTTUs. A vector NTTU contains 0.5×N×logN butterfly units arranged in an organization directly reflecting the four-step 2D FFT dataflow, and CraterLake deploys 16 of them. CraterLake uses coefficient-wise data distribution, adopts a machine word size of 28b, is 472.3mm^{2} in size at a 12/14nm technology node (same as F1), and has an estimated TDP of over 317W.

BTS (ISCA 2022) has a tiled architecture composed of 2,048 processing elements (PEs) that communicate through a network-on-chip at the center. BTS performs an extensive search for optimal HE parameter sets for FHE accelerators under the observation that the loading time of evks becomes the performance bottleneck. Then, BTS optimizes the dataflow of HE ops so that they finish within that loading time. BTS features 512MB of on-chip memory and uses coefficient-wise data distribution. It adopts a machine word size of 64b, is 373.6mm^{2} in size at a 7nm technology node, and has an estimated TDP of 163.2W.

#### ARK, State-of-the-art CKKS Acceleration Architecture

ARK (MICRO 2022) aims to combine the best of BTS and CraterLake. To reduce off-chip memory access, which was identified as the key performance bottleneck by BTS, ARK minimizes the required number of evks for bootstrapping and homomorphic convolution. It also generates the limbs of plaintexts used in PMult and PAdd on the fly to save off-chip memory access.

The microarchitecture of ARK targets the computation and data movement patterns of FHE that emerge after applying the aforementioned optimizations, minimizing on-chip data movement. This enables ARK to provide real-time encrypted CNN inference, showing 0.125 seconds of inference time for the ResNet-20 model, more than 18,000 times faster than a single-core CPU implementation, both with a batch size of one and reported in the same year of 2022.

ARK mostly uses limb-wise data distribution but temporarily switches to coefficient-wise data distribution during BConv. Similar to BTS, ARK features 512MB of on-chip memory and adopts the machine word size of 64b. It is 418.3mm^{2} in size at a 7nm technology node, estimating 281.3W of TDP. F1, CraterLake, BTS, and ARK all target a system equipped with a memory system supporting 1TB/s.

#### Recent Demonstrations of CKKS Acceleration Using FPGAs and GPUs

FAB (HPCA 2023) is one of the first FPGA-based accelerators to support FHE CKKS; it optimizes the key-switching subroutine to maximize the reuse of ciphertexts given the limited on-chip memory capacity. Poseidon (HPCA 2023) also supports FHE CKKS and deploys dedicated compute cores, such as a high-radix NTT core. FAB and Poseidon offer performance comparable to that of GPUs (and of ASIC accelerators for some FHE workloads), but bootstrapping on FPGA accelerators is still orders of magnitude slower than on more recent ASIC proposals, such as CraterLake, BTS, and ARK. Among the GPU implementations, TensorFHE (HPCA 2023) proposed the use of tensor cores in recent NVIDIA GPUs for NTT acceleration, breaking down each of its 32-bit words into 8-bit integers to feed into tensor cores. However, the limited on-chip memory capacity of GPUs (dozens of MBs) still incurs frequent off-chip memory accesses, making FHE workloads memory-bound.

**About the author:** Jung Ho Ahn is the Dean of Seoul National University’s Graduate School of Convergence Science and Technology. His research interests include bridging the gap between the performance demand of emerging applications and the performance potential of modern and future massively parallel systems.

**Disclaimer:** *These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.*