Redesigning canonicalization: the fundamental (internal) overhaul we need #104

Open
Zac-HD opened this issue Oct 20, 2023 · 0 comments

Zac-HD commented Oct 20, 2023

Background

hypothesis-jsonschema is basically just a function which maps a JSON schema to a Hypothesis strategy for generating instances which conform to that schema... and a few internal helpers.

The basic problem is that there are very many equivalent ways to express the same set of allowed objects, including many where the obvious translation to a strategy is terribly inefficient. We therefore start by "canonicalizing" the schema: taking the intersection or union of overlapping parts (as appropriate), and generally transforming the schema so that it expresses the same constraints but is as easy to convert to an efficient strategy as possible.
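
As a small illustration of "equivalent ways to express the same set", the sketch below compares a verbose `allOf` schema with a flattened form. The `accepts` function is a hypothetical toy validator covering only `type: integer`, `minimum`, `maximum` and `allOf`; it is not part of hypothesis-jsonschema or jsonschema. A naive strategy for the verbose form would presumably generate from one branch and filter by the others, while the flattened form maps directly onto a bounded integer strategy.

```python
def is_number(value: object) -> bool:
    # JSON numbers only; Python bools are ints but are not JSON numbers.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def accepts(schema: dict, value: object) -> bool:
    """Toy validator for a tiny, assumed subset of JSON Schema keywords."""
    if schema.get("type") == "integer":
        # Simplification: 1.0 is not treated as an integer here.
        if isinstance(value, bool) or not isinstance(value, int):
            return False
    # Numeric keywords only constrain numbers, as in JSON Schema.
    if is_number(value):
        if "minimum" in schema and value < schema["minimum"]:
            return False
        if "maximum" in schema and value > schema["maximum"]:
            return False
    return all(accepts(sub, value) for sub in schema.get("allOf", []))


# Two spellings of "an integer between 0 and 10 inclusive".
verbose = {
    "allOf": [
        {"type": "integer", "minimum": 0},
        {"type": "integer", "maximum": 10},
        {"minimum": -5},  # redundant: implied by minimum 0
    ]
}
canonical = {"type": "integer", "minimum": 0, "maximum": 10}

samples = list(range(-10, 21)) + [None, "3", 2.5]
same = all(accepts(verbose, v) == accepts(canonical, v) for v in samples)
```

Agreement on a handful of samples is only a sanity check, not a proof of equivalence.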

So what needs to change?

I wrote most of hypothesis-jsonschema about four years ago, and it never graduated from beta. There are some fundamental design flaws in how we deal with both recursive references and schema versioning, as well as some implementation issues where the organic growth of the code has left it slower and harder to understand than it ought to be. We'll also want to support schema versions newer than draft-07, which are now in common use.

I think this basically requires a from-scratch rewrite of the canonicalization logic. Happily, I learned a lot about what (not) to do last time around, so the next version can be substantially cleaner and I don't expect that we'd need to do this again. The rewrite could be in Python; or it could easily be extracted to Rust - performance challenges led to the omission of several useful rewrite passes, and the interface is "(serialized?) json schema in, (serialized?) json schema out" with no complicated control-flow or handoff.

A sketch of the current design

https://github.com/python-jsonschema/hypothesis-jsonschema/blob/master/src/hypothesis_jsonschema/_canonicalise.py contains:

  • canonicalish(), which takes a schema and runs many imperative modifications to handle various subtypes of schema
  • merged(), which takes n schemas and returns their intersection (or None, if infeasible)
  • some helpers to compute numeric bounds expressed by a schema
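
To make the last two bullets concrete, here is a toy sketch of computing integer bounds and intersecting several integer schemas. It is not the code in `_canonicalise.py`: the names `toy_integer_bounds` and `toy_merged_integers` are hypothetical, only the four draft-07 numeric bound keywords are handled, and every input is assumed to describe integers.

```python
import math
from typing import Optional, Tuple


def toy_integer_bounds(schema: dict) -> Tuple[Optional[int], Optional[int]]:
    """Return inclusive (lower, upper) integer bounds; None means unbounded."""
    lower = upper = None
    if "minimum" in schema:
        lower = math.ceil(schema["minimum"])
    if "exclusiveMinimum" in schema:
        # Smallest integer strictly greater than the exclusive bound.
        excl = math.floor(schema["exclusiveMinimum"]) + 1
        lower = excl if lower is None else max(lower, excl)
    if "maximum" in schema:
        upper = math.floor(schema["maximum"])
    if "exclusiveMaximum" in schema:
        # Largest integer strictly less than the exclusive bound.
        excl = math.ceil(schema["exclusiveMaximum"]) - 1
        upper = excl if upper is None else min(upper, excl)
    return lower, upper


def toy_merged_integers(*schemas: dict) -> Optional[dict]:
    """Intersect n integer schemas; return None if nothing can satisfy all."""
    lower = upper = None
    for schema in schemas:
        lo, hi = toy_integer_bounds(schema)
        if lo is not None:
            lower = lo if lower is None else max(lower, lo)
        if hi is not None:
            upper = hi if upper is None else min(upper, hi)
    if lower is not None and upper is not None and lower > upper:
        return None  # infeasible intersection
    out: dict = {"type": "integer"}
    if lower is not None:
        out["minimum"] = lower
    if upper is not None:
        out["maximum"] = upper
    return out
```

For example, `toy_merged_integers({"minimum": 0}, {"exclusiveMaximum": 10})` gives `{"type": "integer", "minimum": 0, "maximum": 9}`, and `toy_merged_integers({"minimum": 5}, {"maximum": 4})` gives `None`.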

The design I want

Represent everything with objects!

  • A schema (or subschema) is represented as an immutable object. Along with the contents of the (sub)schema, this should contain a reference to the top-level schema (possibly self) to allow for resolution of references. The pair of (root schema, json pointer) stably and uniquely identifies each Schema object - use memoization for improved performance.
  • Create subclasses of Schema for each major subtype of schemas - types, all_of, any_of, etc.; and version-specific variants where needed. Parsing a schema which permits multiple types should automatically convert to anyOf over those types and allOf over any other constraints. This "splitting" step is really important!
  • Identify all referenced subschemas, i.e. locations which are pointed-to from elsewhere. After a first pass at canonicalization, inline any pointed-to subschemas which do not themselves contain references. Repeat until there are no such subschemas - now, any reference left must be recursive and we can convert to a recursive strategy with st.deferred().
  • Give our schema objects explicit .intersection() and .union() methods. Also implement the other set methods based on these; we often want (e.g.) subtraction in practice.
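
A minimal sketch of how the immutable objects, the "splitting" step and the set methods might fit together, under loud assumptions: every class and function name here (`Schema`, `AnyOf`, `Null`, `Integer`, `parse`, `NOTHING`) is hypothetical, only the `null` and `integer` types with integer `minimum`/`maximum` are handled, and root-schema references, memoization, reference inlining and subtraction are left out entirely.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Schema:
    """Base class; frozen dataclasses are immutable and hashable."""

    def intersection(self, other: Schema) -> Schema:
        # Distribute over unions first, so leaves only see other leaves.
        if isinstance(other, AnyOf):
            return other.intersection(self)
        return self._intersect_leaf(other)

    def _intersect_leaf(self, other: Schema) -> Schema:
        return self if self == other else NOTHING

    def union(self, other: Schema) -> Schema:
        return AnyOf.of(self, other)


@dataclass(frozen=True)
class AnyOf(Schema):
    options: Tuple[Schema, ...] = ()

    @staticmethod
    def of(*schemas: Schema) -> Schema:
        # Flatten nested unions and drop duplicates; empty unions vanish.
        flat = []
        for schema in schemas:
            leaves = schema.options if isinstance(schema, AnyOf) else (schema,)
            for leaf in leaves:
                if leaf not in flat:
                    flat.append(leaf)
        return flat[0] if len(flat) == 1 else AnyOf(tuple(flat))

    def intersection(self, other: Schema) -> Schema:
        return AnyOf.of(*(opt.intersection(other) for opt in self.options))


NOTHING = AnyOf(())  # the empty union matches no instance at all


@dataclass(frozen=True)
class Null(Schema):
    pass


@dataclass(frozen=True)
class Integer(Schema):
    minimum: Optional[int] = None
    maximum: Optional[int] = None

    def _intersect_leaf(self, other: Schema) -> Schema:
        if not isinstance(other, Integer):
            return NOTHING
        lows = [b for b in (self.minimum, other.minimum) if b is not None]
        highs = [b for b in (self.maximum, other.maximum) if b is not None]
        lo, hi = max(lows, default=None), min(highs, default=None)
        if lo is not None and hi is not None and lo > hi:
            return NOTHING
        return Integer(lo, hi)


_LEAVES = {
    "null": lambda s: Null(),
    "integer": lambda s: Integer(s.get("minimum"), s.get("maximum")),
}


def parse(schema: dict) -> Schema:
    """Split a multi-type schema into a union of single-type schemas."""
    types = schema["type"]  # assumption: "type" is always present
    if isinstance(types, str):
        types = [types]
    return AnyOf.of(*(_LEAVES[t](schema) for t in types))
```

With this toy, `parse({"type": ["integer", "null"], "minimum": 0})` becomes a union of `Integer(0, None)` and `Null()`, and intersecting it with `parse({"type": "integer", "maximum": 10})` yields `Integer(0, 10)`. Because splitting happens at parse time, each leaf class only needs to know how to intersect with its own kind.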