Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Constraint based type inference #219

Merged
merged 33 commits into from
Oct 26, 2023

Conversation

yihozhang
Copy link
Collaborator

This PR proposes a constraint-based global type inference algorithm for egglog. More importantly, it proposes a core IR framework for writing rule analysis and transformation passes, which facilitates future development.

In terms of type inference:

  • The type inference algorithm is inspired by Souffle and other Datalog compilers' design: it goes through each query and action atom, generates respective type constraints, and uses a simple fixpoint-based solver to solve type constraints. This design is declarative (one can inspect generated constraints) and scalable (one can add new type constraints to handle new features).
  • This PR fixes a number of issues with the current type checker, e.g., Query atoms are order-sensitive #196, Duplicate literals in query cause egglog to panic #133, Created map type is always first defined map type #113, and potentially Type Inference Failure #80.
  • The type inference algorithm supports both polymorphism and ad hoc overloading: for example, it supports both polymorphic functions like vec-add : vec<a> -> vec<a> -> vec<a> and overloaded functions like add, which has several definitions such as f64 -> f64 -> f64 and String -> String -> String.

In terms of the core IR:

  • The old typecheck.rs is refactored into several orthogonal passes:
    • A lowering pass that transforms a Rule into CoreRule,
    • A congruence pass that gets rid of redundant atoms,
    • A canonicalize pass that removes equality check and "canonicalizes" variables,
    • A type checking pass.
    • Another lowering pass to Program (not currently implemented).
  • A CoreRule has two parts: Query and Actions. Query uses a conjunctive query--based IR, while Actions uses an imperative SSA-based NormAction, which is currently used in desugaring. The goal is to gradually phase out NormExpr once desugar.rs is refactored into several compiler passes over core IR, so that we have a unified internal IR for query analysis and transformation.

@yihozhang
Copy link
Collaborator Author

yihozhang commented Aug 29, 2023

I had a discussion with @oflatt about the differences between the proposed conjunctive query-based IR and the existing NormFact IR. The goal of the proposed IR is to provide a flexible representation where we can develop a series of nano-passes and analyses that progressively transform the query into an efficient form.
However, this is lacking in NormFact.
To summarize my ideas, here are some advantages of the proposed IR in my opinion:

It is declarative and unordered.

The existing NormFact IR enforces an order between atoms, as is shown in #196. For example, consider the rule

(rule 
  ((= x y)
   (= y 1)) 
  ((R x)))

The query corresponds to NormFact

AssignVar x y
AssignLit y 1

which is not executable and causes egglog to panic.

In the proposed IR, it will first be lowered to

(rule 
  ((value-eq x y)
   (value-eq y 1)) 
  ((R x)))

After that, a canonicalization pass will substitute x with y in the rule, and then substitute y for 1:

(rule 
  () 
  ((R 1))

One way around this is to eliminate the ordered-ness of the NormFact and use ConstrainEqLit, etc. to model an equality check similar to value-eq. But even with this, NormFact is still not a good target for query optimizations. See the second point.

The proposed IR is freer in the sense that it gives you a bigger space of expressible programs.

value-eq, which is a primitive for equal checks, subsumes all of AssignVar, AssignLit, and ConstrainEq and is more general. Consider a simple rule

(rule 
  ((= x (Num 0))
   (= z (Mul x y)))
  ())

In NormFact it looks like this

(AssignLit __var 0)
(Assign x (Num __var))
(Assign z (Mul x y))

This indirection introduced by variable __var complicates query evaluation, since now without looking at the context we don't know if (Num __var) is a look-up (which can be further optimized) or scan. The point is that because of the lack of flexibility of NormAction, there is no way of inlining 0 in place.

Consider another example:

(rule 
  ((= (Num y) (Num x)))
  ())

In NormFact it looks like this

(Assign z1 (Num y))
(Assign z2 (Num x))
(ConstrainEq z1 z2)

Again, there is no way of expressing the same query efficiently in NormFact, because the IR disallows "inlining" z1 and z2.
In contrast, in the proposed IR, the above NormFact looks like this:

(Num y z1)
(Num x z2)
(value-eq z1 z2)

A canonicalizing pass will substitute z1 with z2 and give you

(Num y z1)
(Num x z1)

which directly corresponds to an efficient join.

Oliver had a good point that for proof generation, you may need to handle literals specially, so with NormAction one can assume all the arguments to functions in Assign is a variable, which can be handled uniformly. However, because the proposed IR is more expressive, it captures programs expressible in NormAction, so one can have a "de-canonicalization" pass that replaces every argument with a variable and generates a bunch of (value-eq var val) atoms. Then, after proof generation is done, another canonicalization pass can be used to eliminate these unnecessary value-eqs

A uniform view of atoms is useful for query compilation.

(This point is less relevant to NormFact IR but I think it's very cool.)

In the proposed IR, a query is just a vec of atoms, which abstracts over both functions and primitives. This uniform view is useful for further optimizations in query compilation. For example, consider

(rule 
  ((R x y z)
   (= (+ x y) z))
  ())

In the proposed IR, it corresponds to two atoms

(R x y z unit)
(+ x y z)

The only difference between + and R is that + enforces a functional dependency. One can therefore generate a query plan as

for x in R.x
	for y in R[x].y
		for z in Isect(R[x,y].z, +[x, y].z):
			output [x, y, z]

The current query compiler, however, will go through each z in R[x, y] and do the check afterward.

Query compilation

Finally, compilation from NormFact to a query plan is more complex than doing it from the proposed IR. Basically, you still need to convert NormFact to a conjunctive query-like IR to figure out query plans, variable orderings, etc.

@yihozhang
Copy link
Collaborator Author

yihozhang commented Aug 31, 2023

Update: All the tests are passing! Some big TODOs left (maybe for future PR):

  • Query atoms are order-sensitive #196, Duplicate literals in query cause egglog to panic #133, Type Inference Failure #80: Since the errors are thrown by NormFact, we have to refactor or remove NormFact to fix this issue.
  • Refactor type inference for actions: Currently this is done by reusing the existing action checker and is not integrated into the proposed framework.
  • Refactor query compilation: The proposed IR does not split up atoms into primitives and functional atoms, while the query compilation and evaluation pass does. To be compatible with query compilation I introduced func() and filters() methods that simulate what was the atoms and filters fields. This is however inefficient. The principled way to do this is to also refactor query compilation to reflect this unified view.

@oflatt
Copy link
Member

oflatt commented Aug 31, 2023

As far as I can tell, this PR doesn't replace typechecking.rs- which means that there will still be two typecheckers if we merge this. Is that right?

@yihozhang
Copy link
Collaborator Author

Is the type information from typechecking.rs used in other places? If this is true then you are right.

Are you okay with merging it while two typecheckers are present? If not, I'm happy to put more work into this PR, but this might make code review more complex than splitting into several PRs of smaller scopes.

@oflatt
Copy link
Member

oflatt commented Aug 31, 2023

Typechecking information is used by my terms and proofs encoding unfortunately.
I think ideally, there would be just one type checker and it would happen just after desugaring but before the program reaches the back-end. In my mind sending a type-annotated AST to the back end would be best.

@oflatt
Copy link
Member

oflatt commented Aug 31, 2023

Merging with two typecheckers would raise the question- which one are we going to keep in the future?

@yihozhang
Copy link
Collaborator Author

Typechecking information is used by my terms and proofs encoding unfortunately.

I would say it is very hard to avoid conflicts with something not in the main... (especially since the discussions about terms are not yet conclusive.) My proposal for terms and proofs is that we keep it in a branch and move on. When later we have time (e.g., after eggcc), we can revisit this together, and I can work with you to revive it and have it merged- I'm also very excited about proofs and terms.

Merging with two typecheckers would raise the question- which one are we going to keep in the future?

It is clear to me that the type checker implemented in this PR will subsume both of the existing type checkers (typecheck.rs and typechecking.rs). That's one of the reasons why this PR exists. It is just that having everything in one PR can potentially cause issues with code review. But if you think this is manageable, I can get rid of typechecking.rs.

@oflatt
Copy link
Member

oflatt commented Aug 31, 2023

Yeah, I understand it's quite difficult to avoid conflicts with my branch.
However, I still believe that will some small tweaks we can merge it much sooner than after the eggcc project, which will take a lot of time.

Regardless, as you said this typechecker seems better and it makes the backend cleaner. I think your suggestion of keeping both typecheckers for now is a good compromise. I would just ask that you don't remove the old typechecker and that you do not remove NormCommand/NormFact.

@yihozhang
Copy link
Collaborator Author

Update: This PR is self-contained and ready for review. The refactoring is not 100% done, but it is passing all the tests which is a milestone. So I propose to consider this PR as is for merging, and I plan to make follow-up PRs. More details:

  • Constraint based type inference #219 (comment) is still true.
    • Compilation of actions is still done using the existing code, not the constraint-based type inference algorithm
    • The mentioned issues still exist- need to remove the front-end type check to do this
    • gj.rs and other places still use query.funcs() which is slow.
  • Right now type error generated by the constraint system is a bit hacky. This is because constraint solving is implemented in a very general way (which is good) while the error information we need is very specific. I plan to refactor this by having a generic error type parameter in the constraint system.

TODOs for follow-up PRs (will open a tracking issue for them once this is merged).

  • Add a renaming pass to resolve name conflicts/shadowing once and for all
  • Improve error information for constraint solving.
  • Add a congruence pass for optimization (so that (= c1 (add a b)) (= c2 (add a b)) will be compiled into one atom)
  • Refactor type inference for actions.
  • Refactor the execution for actions to work over NormAction
  • Use an opaque type instead of AtomTerm for fn get_constraints in the primitive function interface.

cc @mwillsey @oflatt

@yihozhang yihozhang marked this pull request as ready for review September 18, 2023 18:58
@yihozhang yihozhang requested a review from a team as a code owner September 18, 2023 18:58
@yihozhang yihozhang requested review from oflatt and removed request for a team September 18, 2023 18:58
@oflatt
Copy link
Member

oflatt commented Sep 18, 2023

Are the new queries any slower now that we are not doing congruence?

@yihozhang
Copy link
Collaborator Author

yihozhang commented Sep 18, 2023 via email

Copy link
Member

@oflatt oflatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, there are two big changes in here. (1) It adds a typechecker that is contraint-based. (2) It changes how query compilation works significantly- getting rid of the complex egraph stuff and just flattening the query.

In theory, I support both of these changes. But as they are implemented in this PR, I don't think it's worth merging.

(1) The contraint-based typechecker is good, but happens too late to matter. I'm starting to think there is no point in merging it now and forcing you to replace the current typechecker with it. I think it will be easier for both of us if you replace the existing typechecking.rs with this as a separate PR.

(2) I like that this simplifies and improves query compilation, but it seems to re-invent the wheel. Most of what query compilation in this PR does is flatten things into atoms, which is exactly what desugaring already does. In fact, if you reuse the NormFact struct it basically already captures what you are doing here.

Again, great work on this PR! I think with more work this will be a big improvement to egglog.
Let me know what you think of my suggestions

@@ -424,14 +424,16 @@ pub struct Desugar {

impl Default for Desugar {
fn default() -> Self {
let mut type_info = TypeInfo::default();
type_info.add_primitive(ValueEq {});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add this primitive in the implementation of default for TypeInfo

src/typecheck.rs Outdated Show resolved Hide resolved
src/typecheck.rs Outdated
.iter()
.skip(1)
.map(|va| Atom {
head: "value-eq".into(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we using value-eq here? Value-eq is a primitive, not a equality constrain that gj ,knows how to handle

src/ast/desugar.rs Outdated Show resolved Hide resolved
src/constraint.rs Show resolved Hide resolved
src/typecheck.rs Show resolved Hide resolved
src/typecheck.rs Outdated
fn flatten_fact(&mut self, fact: &Fact) -> Query<Symbol> {
match fact {
Fact::Eq(exprs) => {
// TODO: currently we require exprs.len() to be 2 and Eq atom has the form (= e1 e2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make this requirement explicit in the long term, it just helps simplify the code

src/typecheck.rs Outdated
}
},
_ => continue,
fn flatten_fact(&mut self, fact: &Fact) -> Query<Symbol> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this basically what desugaring already does? It seems like if you just use the existing NormFact you will be much closer to what you want.

src/typecheck.rs Outdated
loop {
let mut to_subst = None;
for atom in result_rule.body.atoms.iter() {
if atom.head == "value-eq".into() && atom.args[0] != atom.args[1] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this special case- instead of using the primitive value-eq, equality constraints should be specially represented in the data structure. For that reason, I have equality constraints in the NormFact data structure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you still insist on having equality constraints as part of the query definition for CoreRule after our discussion. I'm open to changing this, although having value-eq as a regular primitive does simplify things a bit (e.g., fewer cases to consider, no special handling in codegen)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still think we shouldn't have value-eq, see my comment above.

src/typecheck.rs Outdated
head: actions.to_vec(),
body: facts.to_vec(),
};
// dbg!(&rule);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stray comments

Copy link
Collaborator Author

@yihozhang yihozhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great review @oflatt and sorry for getting back late. Besides addressing some of your comments, I also made the following changes according to our discussion:

  • Typechecking (for queries at least) happens only once now (in typechecking.rs), the backend CoreRule directly uses the type-checking result from it instead of redoing it.
  • The backend takes a NormRule instead of a pair of Query and Action now, so we don't need to convert a NormRule first to an un-desugared Rule and then to CoreRule.
  • Many codes are removed: We no longer need the old type-checking algorithm, we don't have to convert from un-desugared Rule to CoreRule, etc.

I am very happy about all of these changes

src/ast/desugar.rs Outdated Show resolved Hide resolved
src/constraint.rs Show resolved Hide resolved
src/gj.rs Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/typecheck.rs Outdated Show resolved Hide resolved
src/typecheck.rs Outdated
}

#[derive(Debug, Clone)]
pub enum ResolvedSymbol {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like ResolvedSymbol as its use in ResolvedCoreRule contrasts with Symbol used in UnresolvedCoreRule. But I'm open to change

src/typecheck.rs Show resolved Hide resolved
src/typecheck.rs Outdated
loop {
let mut to_subst = None;
for atom in result_rule.body.atoms.iter() {
if atom.head == "value-eq".into() && atom.args[0] != atom.args[1] {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you still insist on having equality constraints as part of the query definition for CoreRule after our discussion. I'm open to changing this, although having value-eq as a regular primitive does simplify things a bit (e.g., fewer cases to consider, no special handling in codegen)

src/constraint.rs Outdated Show resolved Hide resolved
@yihozhang yihozhang requested a review from oflatt October 9, 2023 21:58
@oflatt
Copy link
Member

oflatt commented Oct 11, 2023

Before I do my review, I just want to say amazing work @yihozhang and we make a great team! This is a big moment for egglog.

Copy link
Member

@oflatt oflatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an initial review, plan to make another pass after you respond.
Looks like great progress.

@@ -515,6 +515,7 @@ impl ToSexp for RunConfig {

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct NormRunConfig {
pub ctx: CommandId,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain why we have a ctx here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because of #236. Every run-schedule is not sufficient. We need to make sure each run has a context (for typechecking its until field.

Err(error) => errors.push(error),
}
}
// If update is successful for only one sub constraint, then we have nailed down the only true constraint.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought Xor was "exclusive or" so it needs to succeed for exactly one constraint?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right about the semantics of exclusive or. However, here, Xor succeeds with no updates basically means "I'm not gonna throw an error and will wait for more updates from other constraints until it's clear which sub-constraint is the right one".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks

&self,
type_info: &TypeInfo,
) -> Result<Vec<Constraint<AtomTerm, ArcSort>>, TypeError> {
let mut xor_constraints: Vec<Vec<Constraint<AtomTerm, ArcSort>>> = vec![];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you leave more comments in complex code like this?
I'm not sure why xor_constraints has this type or why

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

1 => Ok(literal_constraints
.chain(xor_constraints.pop().unwrap().into_iter())
.collect()),
_ => Ok(literal_constraints
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does this case happen and why is is Xor?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments. Basically, xor_constraints collects all possible instantiations of the atom. Each instantiation is a vector of constraints. Among all instantiations, only one should hold, thus exclusive or.

src/typecheck.rs Outdated
}

#[derive(Debug, Clone)]
pub enum ResolvedSymbol {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Symbol is used in a lot of places, where here you really mean a function call or primitive. I still think we should change it.

Is this supposed to include type information in the future?

src/typecheck.rs Outdated
loop {
let mut to_subst = None;
for atom in result_rule.body.atoms.iter() {
if atom.head == "value-eq".into() && atom.args[0] != atom.args[1] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still think we shouldn't have value-eq, see my comment above.

src/typecheck.rs Outdated
}
}

impl Query<Symbol> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename typecheck.rs to something like query-compile.rs?
We should also move all the typechecking related stuff like this into typechecking.rs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made an issue for this (#268). I'll do it after the PR is merged but not now to make the diff clean.

src/typecheck.rs Outdated
v.insert(Expr::Var(*var));
}
// TODO: should this be in ResolvedCoreRule
pub fn canonicalize(&self, rule: UnresolvedCoreRule) -> UnresolvedCoreRule {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another place where it would be good to add documentation- what's the purpose of canonicalize?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I have documented its purpose.


    /// Transformed a UnresolvedCoreRule into a CanonicalizedCoreRule.
    /// In particular, it removes equality checks between variables and
    /// other arguments, and turns equality checks between non-variable arguments
    /// into a primitive equality check `value-eq`.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, value-eq cannot be completely removed from our codebase since queries may contain equality checks between e.g., global variables, which cannot be compiled away.

if let ENode::Func(_, children) | ENode::Prim(_, children) = &mut node {
for child in children {
*child = self.unionfind.find(*child);
pub(crate) fn resolve_rule(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More typechecking stuff that probably should be in typechecking.rs

@@ -259,8 +266,25 @@ impl TypeInfo {
}

fn typecheck_facts(&mut self, ctx: CommandId, facts: &Vec<NormFact>) -> Result<(), TypeError> {
for fact in facts {
self.typecheck_fact(ctx, fact)?;
// ROUND TRIP TO CORE RULE AND BACK
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good place for a TODO

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@oflatt oflatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work Yihong!

Err(error) => errors.push(error),
}
}
// If update is successful for only one sub constraint, then we have nailed down the only true constraint.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks

@oflatt oflatt merged commit 82ca4b6 into egraphs-good:main Oct 26, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants