Case Desugaring Proposal 2.0
Motivation
In this document I will outline a proposal for Purus case desugaring.
Some necessary context items:
We have already decided to remove nested patterns from the AST. That decision means that, prior to any of the changes suggested here, the only possibilities for patterns are:
Variable patterns, e.g. x -> (...)
As-patterns, e.g. nm@(...) -> (...)
Literal patterns, e.g. 1 -> (...). Specifically: Literal patterns for booleans, integers, characters, and strings. While PureScript supports array literal patterns, it is trivial to desugar those to constructor patterns (the current working branch already does this). Objects are desugared to tupled constructors, and object patterns become constructor patterns as well.
Constructor patterns, e.g. Just x -> (...)
Wildcard patterns, e.g. _ -> (...)
Due to PIR limitations, we are forced to use compiler-generated destructor functions to perform case analysis on constructor patterns. (The reason for this is that we use PIR datatype declarations to define our types, and even though we enable the SOP compilation flag when compiling PIR to PLC, at the PIR level case expressions will not typecheck if applied to a datatype).
At the moment, we (more or less) preserve the structure of case expressions that PureScript defines in the CoreFn AST, and only do the translation to an appropriate PIR construct during the final compiler pass. That is: our current implementation forces us to construct the PIR terms that correspond to case analysis using ad-hoc helper functions. This is extremely error-prone and fragile (not to mention confusing to work on).
A few things about the PIR generated destructor functions that are worth keeping in mind:
These functions have a regular form. E.g. for data Maybe a = Nothing | Just a, the destructor (let's say match_Maybe) has the type forall y. Maybe y -> forall z. z -> (y -> z) -> z.
PIR requires explicit type instantiation for quantified type variables. Given the previous type, if we had case (x :: Maybe Int) of {Nothing -> True, Just _ -> False}, we have to perform distinct phases of type instantiation. If tyInst represents a type instantiation and # represents expression application, then properly assigning a type to match_Maybe for this example requires a few distinct steps:
First, we have to instantiate the type arguments to the type constructor (if the type constructor is non-nullary). Here, this would look like tyInst Int match_Maybe, or in PIR pretty syntax, match_Maybe {Int}.
Next, we have to apply the instantiated destructor function to the scrutinee expression, e.g. let match_Maybe_Int_Applied = match_Maybe {Int} # x (where x :: Maybe Int)
Finally, we have to instantiate the return or output type of the destructor function. Here, that would look like tyInst Bool match_Maybe_Int_Applied, or in PIR pretty syntax, match_Maybe_Int_Applied {Bool}. Or, explicitly, (match_Maybe {Int} # x) {Bool}, which has the type Bool -> (Int -> Bool) -> Bool.
We have to do all of this instantiation & application in PIR itself, which, again, is awkward and confusing to work with.
To summarize these preliminaries: We are forced by the structure of our current IR AST to perform a complex and error-prone transformation of case expressions, all at once, during the final stage of compilation into PIR. Because our AST lacks a notion of type instantiation, we are forced to reconstruct the type arguments to be instantiated in an ad-hoc manner.
This is not ideal. In the next section I will outline some changes - changes which I believe will be relatively quick to implement - to both the AST and our case desugaring strategy that will help us separate the case compilation process into multiple phases and reduce the chance for subtle mistakes.
Proposed Changes
Add a TyInst node to our IR AST
PIR relies heavily upon type instantiations. As far as I am able to tell, the PIR typechecker does not perform implicit type instantiation (e.g. applying (f :: forall x. x -> Bool) to 2 does not instantiate the x to Int).
We did not appreciate this fact earlier, and working around it is quite cumbersome. We are forced to deduce both the types that need to be applied and the order in which to apply them, using imperfect information during the final pass in which we convert to PIR.
The fact that we perform monomorphization does not actually help us here! Or, to be more specific, it does not help us with respect to PIR-generated function identifiers or polymorphic builtins.
The reason for this is ultimately that our AST contains type annotations in places that the PIR AST does not. In our AST, Builtins and Constructor/Destructor functions are represented as free variables (i.e. "absolutely free variables", free with respect to any context you might choose), and free variables in our IR carry type annotations. In PIR, there are no type annotations for either variables or builtins:
Suppose we have a "PIR-primitive" function, that is, something that is either a builtin, constructor, or destructor. Suppose it has the type forall x y. y -> x -> (...). We can, of course, monomorphize that type when it is applied to concrete arguments. Assuming that x := Bool, y := Int, our monomorphizer will produce a monomorphic type Bool -> Int -> (...). This is exactly what we want it to do when it comes to non-primitive functions (i.e. every function that is not in one of the three categories enumerated above): A non-primitive function identifier will invariably get inlined to a lambda expression that accepts a monomorphic argument.
But this process does not help us very much at all when it comes to PIR-primitive functions. In order to determine what needs to be instantiated, we have to look up the original polymorphic signature and then check that the variables are instantiated by concrete types before instantiating those types in the right order.
This is duplicated work: The monomorphizer has access to this information directly during the monomorphization phase.
I propose that we add a constructor to our IR AST for TyInst, along the lines of:
data Exp x t a
= TyInst t (Exp x t a)
| (...)
And modify the monomorphizer (and object desugarer) to add the necessary type instantiations while monomorphizing (or desugaring objects). This yields an AST much closer to PIR and, I believe, will substantially cut down the complexity of the final PIR pass.
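To make the shape of this change concrete, here is a minimal sketch under some simplifying assumptions. The Ty type, the substitution representation, and instantiateAll are illustrative stand-ins for this sketch, not the actual Purus definitions:

```haskell
-- Minimal sketch: once the monomorphizer has solved a primitive's type
-- variables, it can record the solution as explicit TyInst nodes in quantifier
-- order, rather than leaving the final PIR pass to rediscover it. `Ty`, `Exp`,
-- and `instantiateAll` are illustrative stand-ins, not the real Purus types.
data Ty
  = TyVar String
  | TyCon String [Ty]
  | TyFun Ty Ty
  | Forall String Ty

data Exp x t a
  = Var a t                -- free variables carry their (possibly polymorphic) type
  | TyInst t (Exp x t a)   -- the proposed node: explicit type instantiation
  | App (Exp x t a) (Exp x t a)
  -- | (...)

-- Wrap an expression in TyInst nodes for each solved quantifier of its type.
-- E.g. for `forall x y. y -> x -> (...)` with x := Bool, y := Int this
-- produces `TyInst Int (TyInst Bool e)`, i.e. e {Bool} {Int}.
instantiateAll :: [(String, Ty)] -> Ty -> Exp x Ty a -> Exp x Ty a
instantiateAll subst (Forall v rest) e =
  case lookup v subst of
    Just ty -> instantiateAll subst rest (TyInst ty e)
    Nothing -> e            -- unsolved variable: leave it for a later pass
instantiateAll _ _ e = e
```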
Remove As-Patterns
While As-patterns are not strictly nested in the sense we had previously discussed (a@b yields a product [typeOf a, typeOf b] in our Temporary SOP plan), I do not think they are very useful, if useful at all.
But mainly we should remove them because it makes the thing I'm about to propose in the next section easier :p
Break up case expression compilation into multiple passes over the IR
If we have TyInst in the AST and neither nested patterns nor As-patterns, we can eliminate case expressions entirely from the IR before the final compilation phase.
For the purpose of this example, assume we have:
Constructor Patterns, Literal Patterns for Int/Char/String, Variable Patterns, and Wildcard patterns.
A function isComplete that tells us whether a case expression is total or partial. (This should be fairly easy to write with the above restrictions; a sketch follows this list.)
Functions eqInt, eqChar, eqString which return an ADT Boolean (these have to be handled specially, but we already do something similar in the final PIR step)
For an arbitrary type T and its constructors C1...CN, a constructor function for each constructor and a destructor function for T (following PIR, we will just use the name of the constructor for each constructor function)
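To illustrate why isComplete should be easy to write under these restrictions, here is a minimal sketch. The Pattern and Lit types and every name below are assumptions for illustration, not the real Purus definitions:

```haskell
-- Minimal sketch of `isComplete`, assuming a flat (non-nested) pattern type.
import qualified Data.Set as Set

data Lit = LitInt Integer | LitChar Char | LitString String

data Pattern
  = VarP String            -- matches everything
  | WildP                  -- matches everything
  | LitP Lit
  | ConP String [Pattern]  -- arguments are Var/Wild only, since nesting is gone

isInfallible :: Pattern -> Bool
isInfallible (VarP _) = True
isInfallible WildP    = True
isInfallible _        = False

-- `allCtors` is the constructor set of the scrutinee's type (empty for primitives).
-- This covers the two shapes described below: an infallible final pattern, or a
-- full enumeration of a datatype's constructors. Exhaustive literal enumerations
-- (realistically only Boolean) are ignored here for brevity.
isComplete :: Set.Set String -> [Pattern] -> Bool
isComplete allCtors pats =
  any isInfallible pats
    || (not (Set.null allCtors)
          && Set.fromList [c | ConP c _ <- pats] == allCtors)
```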
As a first pass, we can eliminate literal patterns in the following way. Consider
case (x :: Int) of
  0 -> 0
  1 -> 1
  _ -> -1
We can transform this "flat" case analysis into nested case analysis on equality tests:
case eqInt x 0 of
  True  -> 0
  False -> case eqInt x 1 of
    True  -> 1
    False -> (-1) -- wildcard patterns bind no variables, this should always be a safe optimization for wildcard patterns in the "final" alternative position
A similar procedure can be employed for the other primitive equalities for each literal pattern. (Recall that array and object literals are also converted into constructor patterns).
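For instance, a string scrutinee would be rewritten in the same shape (illustrative only; eqString is the assumed string equality mentioned above):

```
case (s :: String) of
  "yes" -> 1
  _     -> 0

-- becomes

case eqString s "yes" of
  True  -> 1
  False -> 0
```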
Because Boolean is a datatype with two nullary constructors, we know that the type of the PIR destructor function match_Boolean is forall r. Boolean -> r -> r -> r. We can then transform to:
match_Boolean
  (eqInt x 0)
  0
  (match_Boolean (eqInt x 1)
    1
    (-1)
  )
All literal patterns can be eliminated by the first phase, and all constructor patterns can be eliminated by the second phase, provided that the case expression is total. (I believe PureScript throws an error if the set of alternatives isn't total, but I am certain it at least throws a warning which we could upgrade to an error.)
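For concreteness, here is a minimal sketch of this first phase over a toy IR. Exp' is an illustrative stand-in for the real expression type, and it reuses the Pattern/Lit types assumed in the isComplete sketch above; none of these names are the actual Purus definitions:

```haskell
-- Sketch of the literal-elimination phase. `Exp'` is a toy expression type,
-- and `eqFor` hard-codes the assumed primitive equality functions.
data Exp'
  = V String               -- variables (including free "primitive" names)
  | L Lit                  -- literals
  | A Exp' Exp'            -- application
  | Lam String Exp'        -- lambda
  | Case Exp' [(Pattern, Exp')]

eqFor :: Lit -> Exp'
eqFor LitInt{}    = V "eqInt"
eqFor LitChar{}   = V "eqChar"
eqFor LitString{} = V "eqString"

-- Rewrite a flat match on literals into nested case analysis on an ADT Boolean.
-- The resulting True/False constructor patterns are removed by the second phase.
elimLits :: Exp' -> [(Pattern, Exp')] -> Exp'
elimLits _     [(WildP, rhs)]  = rhs                    -- lone wildcard: just the RHS
elimLits scrut [(VarP x, rhs)] = A (Lam x rhs) scrut    -- lone variable: bind and beta-reduce later
elimLits scrut ((LitP l, rhs) : rest) =
  Case (A (A (eqFor l) scrut) (L l))
    [ (ConP "True"  [], rhs)
    , (ConP "False" [], elimLits scrut rest)
    ]
elimLits scrut alts = Case scrut alts                   -- no literal patterns: leave unchanged
```

Running elimLits on the example above produces exactly the nested eqInt tests shown, with the True/False constructor patterns left for the second phase to eliminate.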
More generally, assume that we have only:
Literal Patterns (1,'c',"hello")
Constructor Patterns (Just x, Nothing)
Variable Patterns (x,y,z)
Wildcard Patterns (_)
For an arbitrary case expression with a scrutinee x :: T, T will either be:
A primitive type
A datatype (with constructors)
If T is a primitive type, the following holds:
The patterns in the case expression must be either Literal, Variable, or Wildcard patterns
Variable and wildcard patterns match everything
Therefore, for primitive types, the patterns in a well-formed (complete) set of case alternatives will consist of either:
An exhaustive enumeration of literal patterns in the alternatives (in any order)
A partial (possibly empty) enumeration of literal patterns in the alternatives followed by either a single Wildcard Pattern or Variable Pattern
If T is a datatype (a non-primitive type), the following holds:
The patterns in a case expression must be either Constructor, Variable, or Wildcard patterns
Variable and wildcard patterns match everything
Therefore, if T is a non-primitive type (i.e. a datatype), the patterns in a well-formed (complete) set of case alternatives will consist of either:
An exhaustive enumeration of the constructor patterns for the constructors of T (in any order)
A partial (possibly empty) enumeration of the constructor patterns of T followed by either a single Wildcard pattern or Variable pattern
The general procedure for compiling case expressions is: First we transform literal patterns, then we eliminate constructor patterns. The following examples should make the procedure clear:
case (x :: Maybe Int) of
  Just y -> case y of
    1 -> 1
    2 -> 2
    _ -> 3
  Nothing -> 0
{- First, we eliminate the literal patterns if necessary. To check whether this is necessary, it should suffice to examine the
pattern in the first alternative if the scrutinee has a primitive type.
To perform the transformation, we only need to look at two patterns in each step: The first pattern and the next pattern. If a case expression
is well-formed (we can easily check this if PS doesn't), then we will always encounter either a single VarP/WildP, or a set of
literal patterns which is either exhaustive or followed by an infallible pattern (VarP/WildP).
-}
case (x :: Maybe Int) of
  Just y -> case eqInt y 1 of
    True -> 1
    False -> case y of
      2 -> 2
      _ -> 3
  Nothing -> 0
-- Step: We continue translating and "push the case expression inward"
case (x :: Maybe Int) of
  Just y -> case eqInt y 1 of
    True -> 1
    False -> case eqInt y 2 of
      True -> 2
      False -> case y of
        _ -> 3
  Nothing -> 0
-- A case expression with a single alternative w/ a WildP pattern is equivalent to the RHS of that alternative
case (x :: Maybe Int) of
  Just y -> case eqInt y 1 of
    True -> 1
    False -> case eqInt y 2 of
      True -> 2
      False -> 3
  Nothing -> 0
{- Next, we eliminate constructor patterns. Assume that we have
- `match_Maybe :: forall a. Maybe a -> (forall out. out -> (a -> out) -> out)`
- `match_Boolean :: forall out. Boolean -> out -> out -> out`
Note that, when dealing with constructor patterns, we may have to reorder the alternatives to use the destructor function. This is simple, since we
can easily determine the constructor index from constructor patterns.
The order in which we do this (inside out vs outside in) should not matter. Starting from the outside, and omitting the type instantiations:
-}
match_Maybe
  x
  0
  (\y -> case eqInt y 1 of
    True -> 1
    False -> case eqInt y 2 of
      True -> 2
      False -> 3
  )
-- Step: We apply the same procedure to the inner case expression:
match_Maybe
  x
  0
  (\y -> match_Boolean
    (eqInt y 1)
    1
    (case eqInt y 2 of
      True -> 2
      False -> 3
    )
  )
-- Final Step: Same thing on the innermost case expression
match_Maybe
  x
  0
  (\y -> match_Boolean
    (eqInt y 1)
    1
    (match_Boolean
      (eqInt y 2)
      2
      3
    )
  )
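Under the same toy types, the constructor-elimination phase itself can be sketched as follows. The match_<TyName> naming convention, ctorIndex, and the assumption that the alternatives exhaustively enumerate the constructors are all illustrative; type instantiations are omitted, as in the example above:

```haskell
-- Sketch of the constructor-elimination phase, reusing the toy `Exp'` and
-- `Pattern` types from the sketches above. Assumes the alternatives are an
-- exhaustive enumeration of constructor patterns; the mixed infallible-pattern
-- case is handled in the next section. Type instantiations are omitted here.
import Data.List (sortOn)

-- Each alternative becomes a lambda whose binders are the pattern's variables.
altToLambda :: (Pattern, Exp') -> Exp'
altToLambda (ConP _ args, rhs) = foldr lam rhs args
  where
    lam (VarP v) body = Lam v body
    lam WildP    body = Lam "_unused" body
    lam _        body = body            -- nested patterns are already gone
altToLambda (_, rhs) = rhs

-- Apply the type's destructor to the scrutinee and the (reordered) branches.
elimCtors :: String -> (String -> Int) -> Exp' -> [(Pattern, Exp')] -> Exp'
elimCtors tyName ctorIndex scrut alts =
  foldl A (A (V ("match_" ++ tyName)) scrut) (map altToLambda ordered)
  where
    ordered = sortOn idx alts
    idx (ConP c _, _) = ctorIndex c
    idx _             = maxBound        -- should not occur under the assumption above
```

For the Maybe example this yields the match_Maybe application shown above, modulo the elided type instantiations.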
The only case not covered by the above example is a mixture of irrefutable patterns with constructor patterns. For example:
data ABCD a b c d = A a | B a b | C a b c | D d
match_ABCD :: forall tA tB tC tD
. ABCD tA tB tC tD
-> forall out
. (tA -> out)
-> (tA -> tB -> out)
-> (tA -> tB -> tC -> out)
-> (tD -> out)
-> out
If we have a partial set of constructors (but the case expression is well-formed & complete), e.g.
-- Example 1
case (x :: ABCD Int Bool Int Int) of
C n1 _ n2 -> n1 + n2
other -> f other
-- Example 2
case (x :: ABCD Int Bool Int Int) of
C n1 _ n2 -> n1 + n2
_ -> 0
The translation is mechanically more complex but conceptually simple. For clarity, it is useful to state a few rules & axioms that have been implicit until now:
1. "Bare" variable patterns, i.e. those that do not occur as arguments to a constructor pattern, can be directly replaced with the scrutinee expression.
2. For a complete & well-formed case expression where the scrutinee is a datatype and there is exactly one alternative with an infallible pattern on the LHS, the RHS of that alternative can only be:
   - a. An expression that does not mention any variables bound in the pattern (i.e. if the pattern is a Wildcard pattern and binds no variables)
   - b. An expression that applies some function to the single variable bound in the pattern (i.e. if the pattern is a Variable pattern)
Rule 2 justifies Rule 1: in a non-nested context, variable patterns can only bind a single variable, yielding a one-argument lambda which will be immediately applied to the scrutinee expression. Rule 1 simply states that we are licensed to perform beta reduction directly in this context.
-- Example 1. Here, I show the explicit beta reduction explained above. (In the actual implementation we can skip it and substitute the scrutinee directly)
match_ABCD
x
(\ _a -> (\other -> f other) x)
(\ _a _b -> (\other -> f other) x)
(\n1 _unused n2 -> n1 + n2)
(\ _d -> (\other -> f other) x)
-- NOTE: In this particular example, beta and eta reduction would both give us the final form.
match_ABCD
x
(\ _a -> f x)
(\ _a _b -> f x)
(\n1 _unused n2 -> n1 + n2)
(\ _d -> f x)
-- Example 2. Here I elide the beta/eta reductions
match_ABCD
x
(\ _a -> 0)
(\ _a _b -> 0)
(\ n1 _b n2 -> n1 + n2)
(\ _d -> 0)
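To tie the mixed case together, here is a sketch of how an infallible alternative could be expanded into the missing constructor branches before the constructor-elimination step runs. The arity and allCtors lookups, the substitute helper, and the toy Exp'/Pattern types are assumptions carried over from the earlier sketches, not Purus definitions:

```haskell
-- Sketch: expand a trailing infallible alternative into one branch per missing
-- constructor, substituting the scrutinee for the bound variable (Rule 1 above).
-- Reuses the toy `Exp'`, `Pattern`, and `isInfallible` from the earlier sketches.
expandInfallible
  :: [String]            -- all constructors of the scrutinee's type, in declaration order
  -> (String -> Int)     -- arity of each constructor
  -> Exp'                -- the scrutinee
  -> [(Pattern, Exp')]   -- alternatives, possibly ending in a VarP/WildP alternative
  -> [(Pattern, Exp')]
expandInfallible allCtors arity scrut alts = map branchFor allCtors
  where
    explicit = [ (c, (p, rhs)) | (p@(ConP c _), rhs) <- alts ]
    fallback = case filter (isInfallible . fst) alts of
      ((VarP x, rhs) : _) -> substitute x scrut rhs   -- beta-reduce per Rule 1
      ((WildP,  rhs) : _) -> rhs
      _                   -> error "expandInfallible: incomplete case expression"
    branchFor c = case lookup c explicit of
      Just alt -> alt
      Nothing  -> (ConP c (replicate (arity c) WildP), fallback)

-- Naive substitution over the toy Exp'; adequate here because we only ever
-- substitute for the single variable bound by a bare variable pattern.
substitute :: String -> Exp' -> Exp' -> Exp'
substitute x e (V y) | x == y  = e
substitute x e (A f a)         = A (substitute x e f) (substitute x e a)
substitute x e (Lam y b)
  | x == y                     = Lam y b
  | otherwise                  = Lam y (substitute x e b)
substitute x e (Case s as)     = Case (substitute x e s)
                                      [ (p, substitute x e r) | (p, r) <- as ]
substitute _ _ other           = other
```

Applied to Example 2, this produces (modulo binder names and the elided type instantiations) the match_ABCD application shown above.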