Poly/Math Dialects #55
-
I can start by collecting a list of operations needed for the CGGI scheme. Thankfully it is relatively short:
The last one is there to limit noise growth; it corresponds to a naive application of base-B decomposition to integers, but mapped across a tensor of polynomials and "reshaped" to be consistent with the intended layout of the polynomial's coefficients. Then, of course, polynomial multiplication is often implemented using the FFT (not the NTT for CGGI, because CGGI needs power-of-two moduli), so it seems likely that we might want an FFT-based implementation of multiplication as well.

Then there is the other question of how (I believe) some CGGI implementations, like Zama's, actually keep the polynomials in FFT form so that repeated conversions to and from the frequency domain are not required. Perhaps someone from Zama can chime in here about how that impacts the list of operations above, since I'm not clear on how, say, sample extraction works with polynomials in FFT form.
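For concreteness, here is a small numpy sketch of those two pieces, with my assumptions flagged: the decomposition is the unsigned variant (real CGGI implementations typically use a signed/balanced one), and all names and parameters are illustrative, not taken from any particular codebase:

```python
# Illustrative sketches, not from any particular CGGI implementation.
# decompose(): unsigned base-B digit decomposition applied coefficient-wise.
# negacyclic_mul(): multiplication in Z[X]/(X^n + 1) via the FFT "twist";
# caching fa/fb instead of converting back is the "stay in FFT form" idea.
import numpy as np

def decompose(poly: np.ndarray, B: int, levels: int) -> np.ndarray:
    """Split each coefficient into `levels` base-B digits; shape (levels, n)."""
    digits = np.empty((levels, poly.shape[0]), dtype=poly.dtype)
    rest = poly.copy()
    for k in range(levels):
        digits[k] = rest % B
        rest //= B
    return digits

def negacyclic_mul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multiply two integer polynomials mod X^n + 1 using a length-n FFT."""
    n = a.shape[0]
    zeta = np.exp(1j * np.pi * np.arange(n) / n)  # 2n-th roots of unity
    fa, fb = np.fft.fft(a * zeta), np.fft.fft(b * zeta)
    c = np.fft.ifft(fa * fb) / zeta               # untwist
    return np.rint(c.real).astype(np.int64)

# Sanity checks: digits recompose, and X * X == -1 in Z[X]/(X^2 + 1).
p = np.array([7, 123, 4095], dtype=np.int64)
assert np.array_equal(sum(decompose(p, 4, 6)[k] * 4**k for k in range(6)), p)
assert np.array_equal(negacyclic_mul(np.array([0, 1]), np.array([0, 1])),
                      np.array([-1, 0]))
```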
-
Jumping in to see how we can help from Galois. Our ISA for hardware acceleration most closely matches this level of abstraction. What's on the agenda?
-
@AlexanderViand-Intel roughly how big are the parameters in the HECO parameter sets? I'm working on the Poly dialect right now and I want to know whether I should use 64-bit integers or arbitrary-precision integers for the polynomial's coefficients and exponents. I think it will need to be arbitrary precision, but I just want to explicitly confirm that the polynomial schemes need >> 64 bits. I noticed the CKKS paper uses a large ciphertext coefficient modulus and reports the modulus in a table using its bit length.
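For intuition, here is a quick back-of-the-envelope in Python (the prime values below are placeholders, not HECO's actual parameters) showing why the full modulus overflows 64 bits even when every RNS limb fits in a machine word:

```python
# Illustrative numbers, not HECO's parameter sets: typical RLWE ciphertext
# moduli are products of several word-sized primes, so the full modulus Q
# far exceeds 64 bits even though each RNS residue fits in a machine word.
primes = [2**60 - 2**k + 1 for k in (14, 16, 18, 20)]  # placeholders, not
                                                        # actual NTT primes
Q = 1
for p in primes:
    Q *= p
print(Q.bit_length())   # ~240 bits: coefficients mod Q need arbitrary precision
print(all(p.bit_length() <= 60 for p in primes))  # True: each limb fits in i64
```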
-
Including the Batteries: Poly-to-LLVM
I think it makes sense that, rather than re-implementing a bunch of polynomial math in C++ and going via library calls, we follow the HEaaN.MLIR approach and implement a lowering to the MLIR LLVM-IR dialect via other MLIR dialects such as linalg, memref, etc., which in turn provide their own lowering to LLVM IR. The reason I bring this up already now is that we should make sure our design for the poly dialect stays compatible with such a lowering.
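As a strawman for the semantics such a lowering would have to implement, here is a plain-Python reference for multiplication in Z_q[X]/(X^n + 1) (my sketch, assuming a power-of-two negacyclic ring; this is not HEaaN.MLIR's actual lowering). The two explicit loops mirror the kind of linalg/affine loop nest a lowering might emit:

```python
# Reference semantics for polynomial multiplication in Z_q[X]/(X^n + 1);
# a loop-based lowering via linalg/memref would compute exactly this.
import numpy as np

def poly_mul_mod(a: np.ndarray, b: np.ndarray, q: int) -> list[int]:
    n = a.shape[0]
    out = [0] * n
    for i in range(n):
        for j in range(n):
            k = (i + j) % n
            sign = 1 if i + j < n else -1   # wraparound uses X^n == -1
            out[k] = (out[k] + sign * int(a[i]) * int(b[j])) % q
    return out

# X * X == -1 == q - 1 (mod q) in Z_q[X]/(X^2 + 1)
assert poly_mul_mod(np.array([0, 1]), np.array([0, 1]), q=97) == [96, 0]
```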
-
As discussed in the MLIR Dialects Session, I'm setting up three discussion topics, corresponding to the three streams identified.
This is the Poly/Math Dialects discussion; you can find the High-Level FHE Dialect and Scheme Dialects discussions at the links.
The goal of this abstraction/dialect is to allow the underlying math operations of modern FHE schemes to be efficiently represented, and computations at this level to be easily retargeted to different backends.
Technical/MLIR Challenges
As this is a relatively low-level dialect, FHE programs will most likely consist of many thousands, if not millions, of instructions at this level, so efficiency considerations become significantly more important than for higher-level dialects. As a result, we should closely coordinate with MLIR experts to ensure we are designing the dialect and its implementation/batteries in keeping with "MLIR performance folklore". One of the main challenges will be the parameters, including moduli/primes, and how to track them. HECO (in an internal PoC) and HEaaN.MLIR (closed source) already propose poly dialects, and snippets of these have been shared and can act as a starting point for the design of a dialect.
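To make the parameter-tracking question concrete, here is a minimal Python stand-in (purely illustrative; none of these names come from HECO or HEaaN.MLIR) for an MLIR poly type whose attributes pin down the ring a value lives in:

```python
# Illustrative only: a Python stand-in for an MLIR `poly` type whose
# attributes fix the ring. The point is just what a lowering needs to know.
from dataclasses import dataclass

@dataclass(frozen=True)
class PolyRingType:
    degree: int                       # n, for the quotient by X^n + 1
    coeff_modulus: int                # q, possibly a product of RNS primes
    rns_primes: tuple[int, ...] = ()  # empty if q is tracked monolithically

    def __post_init__(self):
        if self.rns_primes:
            prod = 1
            for p in self.rns_primes:
                prod *= p
            assert prod == self.coeff_modulus, "RNS primes must multiply to q"
```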
Conceptual Challenges
FHE schemes require a wide range of operations, and even "RLWE-only" schemes such as CKKS, B/FV, and BGV require a variety of operations that are not technically "ring operations", e.g.:
- rounding
- arbitrary permutation of coefficients
- coefficient extraction
- NTT (not strictly within the ring)
- polynomial evaluation
- CRT/RNS-style decompositions of ring coefficients (see the sketch below)

In addition, DM/CGGI-style schemes, which also use a variety of other ciphertext types, introduce even more variety in their operations. As a result, it is not clear how to (a) combine both into a single dialect and (b) ensure that this dialect is generic enough to have uses besides expressing FHE computations.
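Since the CRT/RNS decomposition comes up repeatedly, here is a small self-contained sketch of its semantics (the moduli are toy values and the function names are mine, not from any dialect proposal):

```python
# Sketch of CRT/RNS decomposition: each coefficient mod Q = q0*q1*... is
# represented by its residues mod each qi, and recovered via the CRT.
from math import prod

def to_rns(coeffs, moduli):
    return [[c % qi for c in coeffs] for qi in moduli]

def from_rns(residues, moduli):
    Q = prod(moduli)
    coeffs = []
    for limbs in zip(*residues):            # one tuple of limbs per coefficient
        x = 0
        for r, qi in zip(limbs, moduli):
            Qi = Q // qi
            x += r * Qi * pow(Qi, -1, qi)   # CRT reconstruction term
        coeffs.append(x % Q)
    return coeffs

moduli = [97, 193, 769]                     # toy pairwise-coprime primes
coeffs = [5, 123456, 7_000_000]
assert from_rns(to_rns(coeffs, moduli), moduli) == [c % prod(moduli) for c in coeffs]
```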
Next Steps
We need to collect a full list of "necessary" operations in order to cover all relevant variants of the mainstream schemes. In addition, we should try to define clear semantics for these operations, to avoid ambiguities when they need to be lowered to different backends. The timeline for this is roughly one month, after which progress should be reported back to the main Working Group. Looking beyond this, a draft RFC to MLIR might follow roughly three months after that.
Tasks