Miden IR concrete syntax #15

jjcnn · 2023-07-20T14:55:34Z

jjcnn
Jul 20, 2023

Below is an initial proposal for a concrete syntax for Miden IR. There are a couple of issues we need to resolve before we can finalize the proposal, but I think the basics are there.

<name>               ::= [a-zA-Z][a-zA-Z0-9_]*


<constant>           ::= _bool true
                       | _bool false
                       | _i8 <integer>
                       | _i16 <integer>
                       | _i32 <integer>
                       | _i64 <integer>
                       | _i128 <integer>
                       | _isize <integer>
                       | _f64 <float>
                       | _felt <integer>

<type>               ::= _bool
                       | _i8
                       | _i16
                       | _i32
                       | _i64
                       | _i128
                       | _isize
                       | _f64
                       | _felt

<module>             ::= module <name> { <function>* }
                
<function>           ::= function <visibility> <name> ( <formal_params> ) -> ( <return_types> ) { <statement>* }
                
<visibility>         ::= _private
                       | _public
                       | _extern

<formal_param>       ::= <variable> : <type>

<more_formals>       ::= , <formal_param>
                       | , <formal_param> <more_formals>

<formal_params>      ::=  // Nothing
                       | <formal_param>
                       | <formal_param> <more_formals>

<more_return_types>  ::= , <type>
                       | , <type> <more_returns>
                    
<return_types>       ::=  // Nothing
                       | <type>
                       | <type> <more_return_types>
                
<actual_params>      ::=  // Nothing
                       | <variable>
                       | <variable>, <actual_params>

<constant>           ::= <integer>
                       | <float>
                       | _null

<unop>               ::= _ref
                       | _deref
                       | _load
                       | _asint
                       | _asptr
                       | _neg // Always unchecked
                       | _inv // Always unchecked
                       | _pow2 // Always unchecked
                       | _not <check>
                       | _popcnt <check>
                       | _isodd // Always unchecked

<binop>              ::= _store
                       | _add <overflow>
                       | _sub <overflow>
                       | _mul <overflow>
                       | _muladd <overflow>
                       | _div <check>
                       | _mod <check>
                       | _divmod <check>
                       | _exp // Always unchecked
                       | _and <check>
                       | _or <check>
                       | _xor <check>
                       | _shl <check>
                       | _shr <check>
                       | _rotl <check>
                       | _rotr <check>
                       | _eq <check>
                       | _neq <check>
                       | _gt <check>
                       | _gte <check>
                       | _lt <check>
                       | _lte <check>
                       | _min <check>
                       | _max <check>
                  
<typeop>             ::= _cast
                       | _trunc
                       | _zext
                       | _sext
                       | _test

<check>              ::= _unchecked
                       | _checked
                        
<overflow>           ::= <checked>
                       | _wrapping
                       | _overflowing

<match_branch>       ::= <u32> => { <statement>* }

<more_return_values> ::= <variable>
                       | <variable>, <more_return_values>

<return_values>      ::= //Nothing
                       | <variable>
                       | <variable> <more_return_values>

<statement>          ::= <variable> := <binop> <variable> <variable>;
                       | <variable> := <binop> <variable> <constant>;
                       | <variable> := <unop> <overflow> <variable>;
                       | <variable> := <unop> <overflow> <constant>;
                       | <variable> := _call <name> ( <actual_params> );
                       | <variable> := <typeop> <type> <variable>;
                       | <variable> := <typeop> <type> <constant>;
                       | if <variable> then { <statement>* } else { <statement>* }
                       | match <variable> with { <match_branch>* }
                       | while <variable> { <statement>* }
                       | return ( <return_values> );
                       | _memcpy (<variable>, <variable>, <variable>, <type>); // (#bytes, from_loc, to_loc, type)
                       | _assert (<variable>);
                       | _assert (<constant>);
                       | _assertz (<variable>);
                       | _assertz (<constant>);
                       | _asserteq (<variable>, <variable>);
                       | _asserteq (<constant>, <variable>);
                       | _asserttest (<type, <variable>);
                       | _asserttest (<type, <constant>);
                       | assembly { <assembly_instruction>* }

<assembly_instruction> ::= // TODO

I have tried to create a syntax that is easy to parse, because I don't want to spend a huge amount of time writing and maintaining a parser.

There are a handful of keywords (if, match, etc.) that are pretty standard. For everything remotely non-standard I have used a naming convention where operations all start with an _, and disallowed user-defined variable names that start with _. This will make it easier to maintain backward compatibility.

There are a couple of outstanding issues that I haven't thought into this proposal:

Structured data: There is no syntax for arrays or structs, neither as type declarations nor for creating or accessing them as values. This means that memory layout must be handled by whichever compiler phases precede Miden IR. I'm not sure this is a good idea in general, but we currently don't have the abstract syntax to deal with these issues either, so I've left it as an open question for the time being.
Floating point data: I have added floats as a datatype because our abstract syntax allows it, but I don't know how we want to compile them to Miden.
Unstructured control flow: I have opted for if-then-else, while and match statements to manage control flow, because that makes it easier to parse. However, the abstract syntax assumes that we have a CFG, which means that we will parse structed control flow into unstructured control flow, and then restructure it. We can add labels and jumps to the syntax, but I think we should keep the structured constructs as well.
Assembly: I haven't added any assembly instructions, since it's still an open question how we want to support this semantically. I have just reserved assembly as a keyword for future use.
Type safety: I am not planning to do any typechecking of the programs we parse. I think that should be dealt with by earlier compiler phases. We still need to specify which types we expect for each operation, though.
Use of variables: I am not planning to verify that programs are in SSA form, as that should also be dealt with by earlier compiler phases. However, we do need to specify what SSA means in the presence of while loops. It shouldn't be too hard to specify, but it needs to be made clear and exact.
Library imports: There is currently no syntax (neither abstract nor concrete) for importing libraries, which we will definitely need.

bobbinth · 2023-07-20T15:57:06Z

bobbinth
Jul 20, 2023
Maintainer

Thank you! One comment so far: I would probably get rid of float types - in my mind this would be handled by a library.

3 replies

jjcnn Jul 21, 2023
Author

True, and floats are probably not a high priority in the first place. I only added them because we can represent them in abstract syntax - we can't compile them.

bitwalker Jul 21, 2023
Maintainer

I think we should leave them in the IR, since many languages will have float types and we may at least want to represent that, even if initially we simply raise an error during codegen until we can make use of the library that provides software support for them.

bobbinth Jul 21, 2023
Maintainer

I'm fine with keeping them in if you have a strong preference - just want to make sure we don't spend resources on them until most other things are working as floats are fairly low on the priority list.

bitwalker · 2023-07-21T16:37:50Z

bitwalker
Jul 21, 2023
Maintainer

I had thought we were going to parse the same format we emit for display, since that would allow for the full generality of the IR to be expressed by someone emitting the textual form, this form has no notion of blocks or the control flow instructions the IR supports, and so if someone wants to compile to Miden IR, they will almost certainly not want to emit the textual form. Most compiler backends are going to lower to an SSA form, which is what our IR provides, but this textual form doesn't seem to represent that fully (or enforce the SSA property).

This is going to be a particular issue with converging control flow paths - how do you represent a variable that takes on two different values depending on a condition? In the IR, you represent this using phi nodes in the form of block arguments, but there is no equivalent in this textual representation. As a result, we'd have to permit the syntax to be in non-SSA form, and then we're basically in a situation where we're implementing a full compiler frontend for a very low-level language. If we're going to go that route, we may as well make the language higher level and treat it as a separate thing.

Setting those concerns aside for a moment, the syntax looks good to me, just a couple notes:

I would call match, switch instead (as that is the name of the instruction, and the semantics are more like a C switch than the more general pattern matching implied by match)j
I would use . to append modifiers to instructions, e.g. add.overflowing rather than separating them with spaces.
I would prefer to prefix variables rather than keywords, as there are a number of places where keywords can't conflict with user-controlled identifiers (e.g. visibility modifiers for a function declaration, instruction modifiers as described above), and other IRs usually take that approach as well, e.g. LLVM IR. I would use % or $ for that purpose.
We should allow :: in function names, so that extern functions can be declared fully-qualified. We should also make sure that function names are valid MASM names, or we'll run into issues (that may already be the case, just mentioning it here)

Structured data:

We should add the ability to add type declarations to a module which accept struct definitions, e.g. type Foo = { u32, felt }. We can add getelement/setelement instructions that work like so:

getelement(ptr, index), fetches the field at offset index from ptr, the value of which is given by the pointee type of ptr
setelement(ptr, index, value), stores value in the field at offset index from ptr, where the types of the field and value must agree
getelementptr(ptr, index), calculates a new pointer that directly addresses the field at offset index of ptr, based on the struct type described by ptr

That will be sufficient to support anything we care to do with structs. We can add allocation intrinsics once I add support for globals to the IR, since we'll have to manage memory the same way one does in Wasm (with a global representing the current heap pointer), but in the meantime just having the syntax for describing them is enough IMO.

Unstructured control flow

As I mentioned at the top, if we're assuming the text must be in SSA form, then it isn't possible to describe very simple things like convergent control flow, which isn't unstructured, but can't be described with just if when you can't redefine variables due to the SSA property. If we allow the text to not be in SSA form, then we have to implement a much more complicated compiler frontend for it, since we would need to restore the SSA property before lowering to the IR. So ultimately, it isn't that we need unstructured control flow, but we need to be able to represent certain types of control flow that would also permit unstructured control flow. Just to clarify: unstructured control flow isn't given by goto itself, it depends on the control flow graph of the resulting program. You can describe structured control flow in terms of goto just fine.

Assembly

We have support for this in the IR, but yeah, not a priority at all. If we can reuse the MASM parser to parse inline-assembly blocks, then representing them the same way is done in Rust/LLVM IR would be best (i.e. the block is parsed as a string containing the assembly language, and isn't processed as part of the host language grammar).

Type safety:

Yeah we don't need to do a type check phase, but we can do really basic type checking when verifying the IR, since everything will have a type associated with it, all we have to do is make sure that operands have the correct types based on the semantics of a given instruction (e.g. we can verify that add is given two numeric values of the same type).

Use of variables: I am not planning to verify that programs are in SSA form, as that should also be dealt with by earlier compiler phases.

The IR requires SSA form, and we can't assume that someone is emitting the textual IR in SSA form correctly, so we have to verify this when parsing. This is simple to verify though, simply keep track of variables which have been defined as they are encountered in the IR

I think we should stick to a textual representation that matches the IR, rather than trying to lift it to a higher level. It will only complicate things, and if we want a higher level language of our own design, then we should do that separately, rather than thinking of it as representing Miden IR directly. That means removing the high-level control flow syntax, introducing blocks and block arguments, and using br and brz/brnz for intra-procedural control flow. Everything else looks good to me syntax-wise, it is really just the level of abstraction that is the issue IMO.

0 replies

jjcnn · 2023-07-24T15:48:35Z

jjcnn
Jul 24, 2023
Author

I had thought we were going to parse the same format we emit for display,

I have only just now discovered that we have a display format for the IR in ir/src/hir/write.rs. A quick look at that file suggests a number of changes I would like to make to the output, but I would need to spend a bit more time studying it.

this form has no notion of blocks or the control flow instructions the IR supports

The problem with a concrete syntax based on basic blocks (which I believe is what you are hinting at) is that it will be harder to parse the control structure. The basic blocks can come in any order, and the logic required to construct the CFG is more complex than what can reasonably be described as parsing, and as I mentioned in the beginning the goal of this syntax is to create something that is easy to parse.

I'm ok with a more complex parser/"frontend", but that needs to be a conscious decision.

(Side note: If we keep the high-level constructs, then we should add break and continue to the syntax).

Most compiler backends are going to lower to an SSA form, which is what our IR provides, but this textual form doesn't seem to represent that fully (or enforce the SSA property).

SSA is not a syntactic property, so there is no way for a syntax to enforce SSA. Hence my comment that SSA should be enforced by earlier compiler phases, and can therefore be assumed by the parser.

we have to verify this [SSA form] when parsing. This is simple to verify though, simply keep track of variables which have been defined as they are encountered in the IR

Except in the case of converging control flow, though.

It gets even more difficult if we change the syntax to use jumps and labels, because the control flow will be impossible for the parser to determine.

...a particular issue with converging control flow paths - how do you represent a variable that takes on two different values depending on a condition?

There is an implicit assumption (which I agree should be made explicit) that variable names are globally unique and implicitly declared at the top of a function. Then this problem goes away:

function f (...) {
  // The variable name a is only used in this function, and is implicitly declared at this point.
  ...
  if condition {
    a := <value 1>;
  }
  else {
    a := <value 2>;
  }
  ...
}

This seems like a reasonable assumption to place on a compiler frontend that generates Miden IR.

I would call match, switch instead
I would use . to append modifiers to instructions
I would prefer to prefix variables rather than keywords

I'm fine with these changes.

We should allow :: in function names

This would be part of adding library imports to the syntax, which as I said I haven't considered yet.

I didn't put a huge amount of effort into defining which strings are allowed as names, but I consider that a detail that can be worked out as we go along. This is a first draft after all.

we can do really basic type checking when verifying the IR, since everything will have a type associated with it

Well, I suppose. We would need to keep track of the inferred type of each variable, and that the inferred type is the same in each branch that assigns to that variable, but I guess that's doable.

I think we should stick to a textual representation that matches the IR, rather than trying to lift it to a higher level. It will only complicate things, and if we want a higher level language of our own design, then we should do that separately, rather than thinking of it as representing Miden IR directly.

The problem of high-level vs. low-level control flow won't go away, though. We had the discussion when we were looking at Sway and Move, and when we changed focus to WASM. We are now having the same discussion again for the concrete syntax for Miden IR, and if/when we start working on our own high-level language we will have it again because it will seem futile to lower the code to Miden IR only to raise it back to MASM (in the same way that it already seems for the high-level languages we have considered so far).

Did we not at some point have a plan to have two IRs, one for low-level one for high-level control flow? Did we end up scrapping that idea?

0 replies

bitwalker · 2023-07-25T04:20:15Z

bitwalker
Jul 25, 2023
Maintainer

The problem with a concrete syntax based on basic blocks (which I believe is what you are hinting at) is that it will be harder to parse the control structure. The basic blocks can come in any order, and the logic required to construct the CFG is more complex than what can reasonably be described as parsing, and as I mentioned in the beginning the goal of this syntax is to create something that is easy to parse.

The order of the basic blocks doesn't matter though, we have a "sea of nodes" style IR, the CFG is implicit in the edges between blocks (given by the control flow instructions terminating those blocks).

When parsing, if a reference to a block that hasn't been parsed yet is seen, it can be stubbed out until the actual block definition is reached (at which point things like block arguments can be amended, and the content of the block fleshed out). If a reference to a value whose definition hasn't been parsed yet is encountered, the value definition can be stubbed out as well, and then amended when the actual definition is reached. The only auxilary data structure you would need is a mapping between the parsed block/value IDs and the ones generated by the DataFlowGraph API, since the PrimaryMap/SecondaryMap data structures don't allow arbitrary keys. A single traversal of the CFG after parsing can validate that all value definitions dominate their uses.

SSA is not a syntactic property, so there is no way for a syntax to enforce SSA. Hence my comment that SSA should be enforced by earlier compiler phases, and can therefore be assumed by the parser.

It absolutely is a syntactic property of the IR, "static single assignment", i.e. values are immutable (static) and only have a single definition (single assignment).

Except in the case of converging control flow, though.

This is of course the problem with having a textual IR that is at a different level of abstraction than the actual IR. If the textual format is at the same level of abstraction, then we have phi nodes (block arguments) to handle this while preserving the SSA property.

There is an implicit assumption (which I agree should be made explicit) that variable names are globally unique and implicitly declared at the top of a function. Then this problem goes away:

Unfortunately it doesn't, because it wouldn't be possible to ensure during parsing that those variables have definitions that dominate all uses, nor that there is only a single definition along any given control flow path. These are properties easily checked at the level of abstraction our IR is at, which is why having the textual representation match that is so beneficial. We can trivially validate the SSA property is preserved, as a value can only ever have a single definition in the entire function, either as a block argument, or an instruction result - if you ever observe a second definition while parsing, the IR is invalid. The validation pass I mentioned earlier verifies the dominance property.

As soon as you kick SSA to the curb in the syntax for the IR, you have to perform the same kind of analysis and transformation as you would from a higher-level language to get it back, which defeats the point.

This would be part of adding library imports to the syntax, which as I said I haven't considered yet.

I didn't put a huge amount of effort into defining which strings are allowed as names, but I consider that a detail that can be worked out as we go along. This is a first draft after all.

Sure, makes sense!

Well, I suppose. We would need to keep track of the inferred type of each variable, and that the inferred type is the same in each branch that assigns to that variable, but I guess that's doable.

Since the IR already requires function and block signatures to be typed, and all instructions produce results which are typed as a function of their inputs, all we actually need to do is make a pass through to verify that nothing disagrees (e.g. that a value passed as an argument to a block matches the type signature of the block). This could be done in the same validation pass that verifies that value definitions dominate uses. We'll want to do that anyway when working directly with the in-memory representation in order to catch errors made during lowering or optimizations/transformations.

The problem of high-level vs. low-level control flow won't go away, though. ..because it will seem futile to lower the code to Miden IR only to raise it back to MASM.

It's not a problem, not really. The representation of "higher-level" control flow instructions won't be part of the current IR anyway, it will be represented in the stack IR used for codegen (it's possible we can translate direct to MASM, but my current experiments are based on a separate IR for convenience, I'll revisit it when implementation is done). That IR and the associated compiler pass are what I've been working on. At the point where translation into the stack IR is performed, the SSA IR has been transformed into a tree, rather than a DAG, with loops handled specially during translation. I'll omit the details for brevity here, but there is a straightforward set of rules for converting arbitrarily complex CFGs into an optimal sequence of MASM instructions from this representation (along with a few other properties provided by earlier passes). It remains to be seen if there are any edge cases that I've missed, but I've run a number of interesting CFGs through the algorithm (by hand), with success.

In the SSA IR, the low-level CFG representation is desirable, as there are a bunch of graph-theoretic foundations on which various well-known analyses and optimizations are based that rely on it. It's well worth a bit of extra work on the backend to keep that as the representation where we do all of our heavy lifting.

As an aside, since you mentioned it - the choice to go for Wasm rather than Move/Sway as a frontend language/IR wasn't based on any issues of control flow, but rather that both Move and Sway have a baked-in impedance mismatch with the Miden rollup. Trying to represent the rollup in either of those languages in a clean way was proving to be a real challenge. Doing so in Rust, and having rustc emit Wasm that we then translate to MASM, gives us a lot more freedom. Of course, there are other problems that come with Rust/Wasm, but on the whole they felt much more solvable, or at the very least could be worked around. Whether that pays off remains to be seen. In any case, figured I would elaborate a bit on the rationale there.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Miden IR concrete syntax #15

{{title}}

Replies: 4 comments 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Miden IR concrete syntax #15

jjcnn Jul 20, 2023

Replies: 4 comments · 3 replies

bobbinth Jul 20, 2023 Maintainer

jjcnn Jul 21, 2023 Author

bitwalker Jul 21, 2023 Maintainer

bobbinth Jul 21, 2023 Maintainer

bitwalker Jul 21, 2023 Maintainer

jjcnn Jul 24, 2023 Author

bitwalker Jul 25, 2023 Maintainer

jjcnn
Jul 20, 2023

Replies: 4 comments 3 replies

bobbinth
Jul 20, 2023
Maintainer

jjcnn Jul 21, 2023
Author

bitwalker Jul 21, 2023
Maintainer

bobbinth Jul 21, 2023
Maintainer

bitwalker
Jul 21, 2023
Maintainer

jjcnn
Jul 24, 2023
Author

bitwalker
Jul 25, 2023
Maintainer