Definitions, closedness, embedding, abstractions etc #3653

myitcv · 2025-01-05T09:04:27Z

myitcv
Jan 5, 2025
Maintainer

#1576 contains discussion about the fact that embedding a definition recursively opens the definition at the site of the embedding to allow for additional fields and definition (play):

#def1: {
	name?: string
}

#def2: {
	#def1
	age?: int
}

d1: #def1 & {
	name: "test"
	// age: 100 // not allowed
}

d2: #def2 & {
	name: "test"
	age:  100
}

This behaviour is per the spec:

An embedded value of type struct is unified with the struct in which it is embedded, but disregarding the restrictions imposed by closed structs. So if an embedding resolves to a closed struct, the corresponding enclosing struct will also be closed, but may have fields that are not allowed if normal rules for closed structs were observed.

However, #1576 makes an interesting observation that definitions referenced by the embedded definition are also opened (play):

#Bar: {
	bar?: string
}

#Foo1: {
	bars?: [...#Bar]
}

#Foo2: {
	#Foo1
	bars?: [...{
		baz?: string
	}]
}

f1: #Foo1 & {
	bars: [{
		bar: "test"
        // baz: "something" // not allowed
	}]
}

f2: #Foo2 & {
	bars: [{
		bar: "test"
		baz: "something"
	}]
}

The perhaps surprising thing here is that #Foo2 allows the addition of the field bar?: string to the list element of type #Bar: it was #Foo that was embedded, #Bar was simply originally referenced by #Foo. Per @mpvl, this is per the spec:

Definitions are indeed recursively opened according to the spec

Indeed if definitions were not opened recursively as this behaviour demonstrates, then it would be impossible (without additional machinery) to allow #Foo2 to widen (with respect to closedness) the type of element allowed in the bars field.

That said, #1576 appears (from my initial understanding, and indeed re-reading) to be more oriented towards a problem that would be solved by a proposed must() builtin (also mentioned in #943).

However the "recursive opening of a definition referenced by the embedded definition" aspect caused me to revisit this #1576 in the context of the following problem, and in doing so touch on some of the benefits/issues with respect to definitions, closedness, embedding and the like.

My "problem"

All the CUE repos use GitHub Actions for CI, and correspondingly we use the GitHub Actions workflow schema to validate our CI declarations. At a very much simplified level, a workflow schema (and example workflow instance) looks like this (play):

#workflow: {
	jobs!: [string]: {
		steps!: [...{
			run?: string
		}]
	}
}

w1: #workflow & {
	jobs: test: steps: [{
		run: "echo hello world"
	}]
}

In the course of a "Friday hack" exploration of a solution for #3603, I considered doing the following (play) to template in the setting of a bash option:

#cueworkflow: {
	#workflow
	jobs!: [string]: steps!: [...{
		#run?: string
		run?:  =~"^set -o nounset\n"
		if #run != _|_ {
			run: """
				 set -o nounset
				 \(#run)
				 """
		}
	}]
}

w2: #cueworkflow & {
	jobs: test: steps: [{
		#run: "hello world"
	}]
}

This works. The embedding of #workflow, per the spec, recursively opens all definitions, allowing additional fields to be declared. Indeed because we have not yet enforced #543 we could add a #run field without needing to embed #workflow, but given #543 feels like the right thing to do, it's appropriate to present this example as if it were implemented.

The addition of the #run field is in effect widening the type of #cueworkflow with respect to #workflow: the set of possible values for a value of #cueworkflow is greater than those possible with #workflow.

(However this widening specifically affects the set of fields allowed, not the values of existing fields. i.e. it is impossible via embedding to widen a field f?: string to f?: _.)

Ok, so what's the problem?

Given that I appear to have a solution here, we are "job done" through one lens. We have been able to reuse the #workflow definition, and in doing so present a neat abstraction on top of the existing structure. The data transformation from the #run field to the target run field is neatly described in terms of regular CUE (regular CUE that would be greatly improved readability-wise with the additions proposed in #943) in a way that is clear to the user. The user of #cueworkflow does not need to think about an entirely different structure to that with which they are familiar: they simply have to use #run instead of run (because the latter is set for them). This approach is also relatively forward compatible: if the authors of #workflow make a change that causes our abstraction augmentation to break in some way, we will know about it. That, as the authors of #cueworkflow, is a risk we accept in reusing the structure of #workflow in this way.

But is this the best/right approach?

Definitions are recursively closed on reference. Should we instead shift to a more explicit approach through some additional syntax/builtin? e.g. rclose(workflow)? This would require users of such schemas to have to add additional syntax at each "call" site in order to validate that structure does not declare additional fields, catch typos. Such an approach would, however, make much more explicit and precise the locations at which "typo checking" is expected.

We are nearly 100% reusing the structure of #workflow here, which gives a familiar feel to the end user. If we were to choose a different interface, what would it look like? At what point does the benefit of providing a "clean" interface surpass the cost of having to learn a new abstraction?

Should we instead be treating this as a data transformation problem? i.e. we continue to use #workflow as is, but somehow specify that the exported concrete configuration value is a transformed version of the input, where each run field value is prefixed by the line set -o nounset? That would likely require some version of #165, https://github.com/myitcv/cuetransform or equivalent, to allow us to express the data transformation in a clean, path-oriented way. Such an approach could however "bury" the transformation from the user in a way that might cause confusion: "where did this set -o nounset line come from?".

Conclusion?

No real hard conclusions, just some observations:

Good: definitions conveniently allowing declaring values (types) that are used as the basis for validation most of the time, and hence the recursive closing is what users of the definition wants and they get that via a reference to such a definition. Contrast a situation where I would need to add some additional syntax beyond the reference to do that recursive closing/equivalent "manually". This would undoubtedly provide a huge opportunity for users of the definition to forget such syntax, and accidentally therefore allow invalid values. Noting that we still have room for recursively closing literal values like #{ x?: string }, rather than relying on definitions.
Good: definitions existing in a non-data namespace is also convenient, for example when it comes to augmenting an existing type in the way presented here. I note that per CUE field types: what's available & what's missing #2709 there is also merit in a field type that is not part of the data model, not recursively closed, but is accessible across package boundaries.
Good: embedding recursively opening closed structures. Given the current behaviour of definitions, this allows those looking to reuse structure (like our reuse of the #workflow structure) to do so trivially. Perhaps a more explicit mechanism of recursively "removing" closedness would be more appropriate, but this feels like a relatively minor detail.
Good: embedding only allows the widening of structure/shape, not field types. i.e. we can can widen the type of elements of the steps list by adding the #run field, but cannot change run?: string to run?: any for example.
Bad?: embedding is quite a blunt tool. I could any fields/structure to #cueworkflow unless careful. Placing #cueworkflow in a separate package from its usage sites and adding some tests would help to mitigate any risks of "other" fields accidentally being unified.
Acceptable: the risks, as authors of #cueworkflow, with our additions being "broken" by changes from authors upstream to #workflow.
Unsure: should we be treating this as a data transformation?

References

Would very much welcome thoughts/etc from others in this space. I started this discussion really to briefly flesh out my thoughts in this space. But I fully acknowledge this space is much bigger than the scope of what I cover here. I make no claim to have covered that entire space; indeed I can't hold the entire space in my head! So this first post is an attempt to iteratively tackle some aspects.

cc @mpvl @verdverm @rogpeppe @cuematthew based on (recent) conversations/interactions

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Definitions, closedness, embedding, abstractions etc #3653

{{title}}

Replies: 0 comments

Select a reply

Definitions, closedness, embedding, abstractions etc #3653

myitcv Jan 5, 2025 Maintainer

My "problem"

Ok, so what's the problem?

Conclusion?

References

Replies: 0 comments

myitcv
Jan 5, 2025
Maintainer