Redesigning canonicalization: the fundamental (internal) overhaul we need #104

Zac-HD · 2023-10-20T21:55:55Z

Background

hypothesis-jsonschema is basically just a function which maps a json schema to a Hypothesis strategy for generating instances which conform to that schema... and a few internal helpers.

The basic problem is that there are very many equivalent ways to express the same set of allowed objects, including many where the obvious translation to a strategy is terribly inefficient. We therefore start by "canonicalizing" the schema: taking the intersection or union of overlapping parts (as appropriate), and generally transforming the schema so that it expresses the same constraints but is as easy to convert to an efficient strategy as possible.

So what needs to change?

I wrote most of hypothesis-jsonschema about four years ago, and it never graduated from beta. There are some fundamental design flaws in how we deal with both recursive references and schema versioning, as well as some implementation issues where the organic growth of the code has left it slower and harder to understand than it ought to be. We'll also want to support schema versions newer than draft-07, which are now in common use.

I think this basically requires a from-scratch rewrite of the canonicalization logic. Happily, I learned a lot about what (not) to do last time around, so the next version can be substantially cleaner and I don't expect that we'd need to do this again. The rewrite could be in Python, or it could easily be extracted to Rust: performance challenges led to the omission of several useful rewrite passes, and the interface is "(serialized?) json schema in, (serialized?) json schema out" with no complicated control flow or handoff.

A sketch of the current design

https://github.com/python-jsonschema/hypothesis-jsonschema/blob/master/src/hypothesis_jsonschema/_canonicalise.py contains:

canonicalish(), which takes a schema and runs many imperative modifications to handle various subtypes of schema
merged(), which takes n schemas and returns their intersection (or None, if infeasible)
some helpers to compute numeric bounds expressed by a schema
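
To make the shape of those helpers concrete, here is a rough sketch of bounds computation and an n-way intersection for integer schemas only. The names integer_bounds and merged_integers are hypothetical and are not the functions in _canonicalise.py; the sketch assumes the draft-06+ numeric form of exclusiveMinimum / exclusiveMaximum and ignores every other keyword (multipleOf, enum, and so on).

```python
import math
from typing import Optional, Tuple


def integer_bounds(schema: dict) -> Tuple[Optional[int], Optional[int]]:
    """Return inclusive (lower, upper) integer bounds; None means unbounded."""
    lower = upper = None
    if "minimum" in schema:
        lower = math.ceil(schema["minimum"])
    if "exclusiveMinimum" in schema:
        # Smallest integer strictly greater than the exclusive bound.
        excl = math.floor(schema["exclusiveMinimum"]) + 1
        lower = excl if lower is None else max(lower, excl)
    if "maximum" in schema:
        upper = math.floor(schema["maximum"])
    if "exclusiveMaximum" in schema:
        # Largest integer strictly less than the exclusive bound.
        excl = math.ceil(schema["exclusiveMaximum"]) - 1
        upper = excl if upper is None else min(upper, excl)
    return lower, upper


def merged_integers(*schemas: dict) -> Optional[dict]:
    """Intersect integer schemas; return None if no integer satisfies all."""
    lower = upper = None
    for schema in schemas:
        lo, hi = integer_bounds(schema)
        if lo is not None:
            lower = lo if lower is None else max(lower, lo)
        if hi is not None:
            upper = hi if upper is None else min(upper, hi)
    if lower is not None and upper is not None and lower > upper:
        return None  # infeasible: the ranges do not overlap
    out = {"type": "integer"}
    if lower is not None:
        out["minimum"] = lower
    if upper is not None:
        out["maximum"] = upper
    return out
```

Normalizing everything to inclusive integer bounds first is what keeps the intersection step down to a max and a min; the real helpers have many more keywords and types to deal with.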

The design I want

Represent everything with objects!

A schema (or subschema) is represented as an immutable object. Along with the contents of the (sub)schema, this should contain a reference to the top-level schema (possibly self) to allow for resolution of references. The pair of (root schema, json pointer) stably and uniquely identifies each Schema object - use memoization for improved performance.
Create subclasses of Schema for each major subtype of schemas - types, all_of, any_of, etc.; and version-specific variants where needed. Parsing a schema which permits multiple types should automatically convert to anyOf over those types and allOf over any other constraints. This "splitting" step is really important!
Identify all referenced subschemas, i.e. locations which are pointed-to from elsewhere. After a first pass at canonicalization, inline any pointed-to subschemas which do not themselves contain references. Repeat until there are no such subschemas - now, any reference left must be recursive and we can convert to a recursive strategy with st.deferred().
Give our schema objects explicit .intersection() and .union() methods. Also implement the other set methods based on these; we often want (e.g.) subtraction in practice.
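
As a minimal sketch of how immutable schema objects, the splitting step, and explicit set methods might fit together: TypeSchema, AnyOf, parse and TYPE_KEYWORDS below are all hypothetical names, only a handful of draft-07 keywords are covered, and the root-schema reference, memoization, reference inlining and the allOf wrapper for type-independent constraints (enum, const, ...) are left out to keep it short.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional, Union

# Abridged map from JSON type to the keywords that only constrain that type.
TYPE_KEYWORDS = {
    "integer": ("minimum", "maximum"),
    "string": ("minLength", "maxLength"),
}


@dataclass(frozen=True)
class TypeSchema:
    """Hypothetical immutable node constraining exactly one JSON type."""

    type: str
    constraints: tuple[tuple[str, Any], ...] = ()

    def intersection(self, other: TypeSchema) -> Optional[TypeSchema]:
        if self.type != other.type:
            return None  # disjoint types: no instance satisfies both
        merged = dict(self.constraints)
        for key, value in other.constraints:
            if key not in merged:
                merged[key] = value
            elif key.startswith("min"):
                merged[key] = max(merged[key], value)  # tighter lower bound
            else:
                merged[key] = min(merged[key], value)  # tighter upper bound
        for lo, hi in TYPE_KEYWORDS.values():
            if lo in merged and hi in merged and merged[lo] > merged[hi]:
                return None  # empty range
        return TypeSchema(self.type, tuple(sorted(merged.items())))

    def union(self, other: TypeSchema) -> AnyOf:
        # No simplification here; a fuller version could merge overlaps.
        return AnyOf((self, other))


@dataclass(frozen=True)
class AnyOf:
    options: tuple[TypeSchema, ...]


def parse(schema: dict) -> Union[TypeSchema, AnyOf]:
    """Split a possibly multi-type schema into one node per type."""
    types = schema.get("type", list(TYPE_KEYWORDS))
    if isinstance(types, str):
        types = [types]
    options = tuple(
        TypeSchema(
            t,
            # Keep only the keywords that are meaningful for this type.
            tuple(sorted((k, schema[k]) for k in TYPE_KEYWORDS.get(t, ()) if k in schema)),
        )
        for t in types
    )
    return options[0] if len(options) == 1 else AnyOf(options)
```

For example, parse({"type": ["integer", "string"], "minimum": 0, "maxLength": 5}) yields an AnyOf with one integer node carrying minimum and one string node carrying maxLength, so each branch only ever has to reason about a single type. Because the nodes are frozen and hashable, they could also serve as cache keys for memoization.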

Zac-HD mentioned this issue May 18, 2024

PyCon US 2024 sprints! HypothesisWorks/hypothesis#3994
