Datatype system design choices #185

mkroetzsch · 2023-04-19T15:26:43Z

mkroetzsch
Apr 19, 2023
Maintainer

This discussion is to gather some thoughts about Nemo's logical datatype system, i.e., the types that a user would be allowed to select on the rules level.

What's in a datatype?

Datatypes have at least the following tasks:

Define the value space (e.g., unicode strings, signed 64bit numbers, RDF resources)
Define the syntax for such values:
- Lexical space (which strings are acceptable representations of values)
- Lexical-value mapping (how do we get values from strings)
- Value-lexical mapping (how to get a unique lexical representation for each value, i.e., a normal form)
Representation of such values as physical data (bijective mapping)
Semantics of "overloaded" built-in predicates and functions (e.g., "<=" may have a different meaning for a string type than for a dateTime type).
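To make the list above more concrete, here is a minimal Rust sketch of how these tasks could be expressed as a trait. The names `LogicalDatatype`, `parse`, `canonical`, and `Int64` are assumptions made for illustration only and are not taken from Nemo's code base; the physical representation and the overloaded built-ins are left out.

```rust
/// Hypothetical trait for a logical datatype (illustration only, not Nemo's API).
pub trait LogicalDatatype {
    /// Elements of the value space.
    type Value: PartialEq + std::fmt::Debug;

    /// Lexical-value mapping: returns `None` if `lexical` is not in the lexical space.
    fn parse(lexical: &str) -> Option<Self::Value>;

    /// Value-lexical mapping: the unique normal form of `value`.
    fn canonical(value: &Self::Value) -> String;
}

/// Illustrative signed 64-bit integer type.
pub struct Int64;

impl LogicalDatatype for Int64 {
    type Value = i64;

    fn parse(lexical: &str) -> Option<i64> {
        // Rust's integer parser accepts an optional sign and leading zeros,
        // so "+042" and "42" are two lexical forms of the same value.
        lexical.parse::<i64>().ok()
    }

    fn canonical(value: &i64) -> String {
        // Decimal digits without leading zeros, with '-' for negative values.
        value.to_string()
    }
}
```

With this sketch, `Int64::parse("+042")` and `Int64::parse("42")` yield the same value, and `Int64::canonical` maps that value back to the single normal form `"42"`.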

Note that logical values can have indirect or composite representations on the physical layer. For example, a date might be stored in the form of several physical values (in several columns) and a nested function term might be represented through tuples in several relations. Built-in predicates and functions need to undergo a similar translation into physical built-ins to realise their semantics.
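As a rough illustration of such a composite representation, the following sketch stores a logical date in three physical columns and realises the logical "<=" as a comparison on the physical components. The struct `DateColumns` and its methods are hypothetical, and the column layout is just one possible choice.

```rust
/// Hypothetical columnar storage for a logical date type:
/// one logical value is spread over three physical columns.
#[derive(Default)]
pub struct DateColumns {
    years: Vec<i32>,
    months: Vec<u8>,
    days: Vec<u8>,
}

impl DateColumns {
    /// Appends a date and returns its row index (no calendar validation here).
    pub fn push(&mut self, year: i32, month: u8, day: u8) -> usize {
        self.years.push(year);
        self.months.push(month);
        self.days.push(day);
        self.years.len() - 1
    }

    /// Logical `<=` on dates, translated into a lexicographic comparison
    /// of the physical (year, month, day) components of two rows.
    pub fn less_or_equal(&self, a: usize, b: usize) -> bool {
        (self.years[a], self.months[a], self.days[a])
            <= (self.years[b], self.months[b], self.days[b])
    }
}
```

A string type would translate the same logical "<=" into a different physical operation, which is the overloading mentioned in the list of tasks.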

Note that "lexical" means "Unicode string" for us. The low-level encoding of Unicode characters (e.g., UTF-8 vs. UTF-16) is not a concern of the datatype, but of the I/O routines that have to turn characters into bytes. In memory, we work with characters in all cases.

Also note that our lexical value -- in contrast to RDF and XSD -- does not need to encode our "internal" logical datatype. For example, "42"^^xsd:int might be a lexical value of a value in a logical integer datatype but also in a general RDF term type.

What kind of datatypes are there?

One can imagine datatypes of several basic forms:

Primitive datatypes: base types such as "double" or "IRI"
Structured datatypes: things like nested function terms or sets, where values are naturally "composite" objects rather than "atomic" values; their definition would always have to be "composite" as well (e.g., "set of double")
Union datatypes: types formed as unions of other types (could be restricted to predefined unions)
Restricted types: types formed by restricting existing (primitive) types, e.g., "strings of length <=32" (could be restricted to predefined restrictions)

Types have a natural hierarchical relation based on set inclusion of their value spaces. Moreover, there could be built-in functions to convert values in different ways (in the style of "toString: int -> string" or "langTag: LanguageTaggedString -> string").
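The four forms and the inclusion hierarchy could be modelled roughly as follows. This is only a sketch under simplifying assumptions: the enum `LogicalType`, the single structured form `SetOf`, and the function `is_subtype` are hypothetical, constraints are opaque strings, and the check is purely structural, so it is sound but will miss inclusions that only follow from comparing constraints.

```rust
/// Hypothetical representation of the type forms discussed above.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    /// Primitive datatype, identified by name (e.g., "double").
    Primitive(&'static str),
    /// One example of a structured datatype: sets over an element type.
    SetOf(Box<LogicalType>),
    /// Union of other types.
    Union(Vec<LogicalType>),
    /// Restriction of an existing type; the constraint is an opaque label here.
    Restricted { base: Box<LogicalType>, constraint: &'static str },
}

/// Conservative structural check that the value space of `a`
/// is included in the value space of `b`.
pub fn is_subtype(a: &LogicalType, b: &LogicalType) -> bool {
    use LogicalType::*;
    if a == b {
        return true;
    }
    match (a, b) {
        // A union is included in `b` if all of its members are.
        (Union(members), _) => members.iter().all(|m| is_subtype(m, b)),
        // A non-union type is included in a union if some member includes it.
        (_, Union(members)) => members.iter().any(|m| is_subtype(a, m)),
        // A restricted type is included in whatever includes its base type.
        (Restricted { base, .. }, _) => is_subtype(base, b),
        // Sets are treated covariantly in their element type.
        (SetOf(x), SetOf(y)) => is_subtype(x, y),
        _ => false,
    }
}
```

For example, "strings of length <=32" is recognised as a subtype of "string" and of any union containing "string", but not the other way around.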

Which datatypes do we need?

This needs discussion, but we should be open to future extensions. One can derive some possible demands from related systems and technologies that we support:

Primitive types:

Basic computing types are certainly needed: unicode string, some integer type, double
- it is not clear how "technical" these need to be, e.g., if we would need specific integer types like i32 and u16, or if a general "int" or "integer" type is enough
RDF and SPARQL compatibility: IRIs, blank nodes, and standard XSD and RDF types
- as in Rulewerk, we will view "abstract logical constants" as specific relative IRIs
- A specific case here is the union of all RDF terms (IRIs, bnodes, literals), which would be the most general primitive type and our current default when no type is given

The basic types are subsumed by the RDF types. Many XSD types could be practically realised as restrictions of general types.

Structured types:

Function terms: as in logic programming, but also with any other type
Other complex-value types: sets, lists, boolean functions (over a base set), etc.
Object terms: representations for JSON objects or XML trees (schemaless frames)

The final part ("object terms") overlaps with the more general notion of "frame-like atoms". As terms, these would correspond to "frame-like functions", but the storage issues are very similar.

How should our datatypes be denoted?

Many RDF datatypes already have names. These names are IRIs.
Some more general "types" may correspond to RDFS/OWL classes:
- rdfs:Resource: "anything" that can be a node in RDF graphs
  - in our setting, this would presumably also cover composite values
- xsd:anyURI and owl:Thing are types for IRI-based values, but with subtly different semantics that may not fit our needs
- xsd:anyType is another general type, but meant as a placeholder for unknown XSD types rather than a real class
IRIs are quite long and programmers are used to shorter names
We might need additional primitive types that are not in XSD, for accurate type inference (e.g., there is no type to represent values that are in xsd:byte and in xsd:positiveInteger at the same time, but such data might emerge when joining values)
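The xsd:byte and xsd:positiveInteger case can be worked through with a small sketch that treats integer types as ranges. The struct `IntRange` and the two constants are assumptions for illustration; finite bounds are limited to i64 here, which is a simplification, and `None` stands for "unbounded".

```rust
/// Hypothetical description of an integer type by its value range.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct IntRange {
    pub min: Option<i64>, // None = unbounded below
    pub max: Option<i64>, // None = unbounded above
}

/// xsd:byte covers -128..=127.
pub const XSD_BYTE: IntRange = IntRange { min: Some(-128), max: Some(127) };
/// xsd:positiveInteger covers 1 and above, without an upper bound.
pub const XSD_POSITIVE_INTEGER: IntRange = IntRange { min: Some(1), max: None };

impl IntRange {
    /// Range of values that lie in both `self` and `other`,
    /// e.g., the type of a variable that is joined over both.
    pub fn intersect(self, other: IntRange) -> IntRange {
        let min = match (self.min, other.min) {
            (Some(a), Some(b)) => Some(a.max(b)),
            (a, b) => a.or(b),
        };
        let max = match (self.max, other.max) {
            (Some(a), Some(b)) => Some(a.min(b)),
            (a, b) => a.or(b),
        };
        IntRange { min, max }
    }
}
```

Intersecting the two constants gives the range 1..=127, for which XSD has no named type, so an inferred type of this kind would need a name or representation of our own.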

Specific basic questions to answer early on

Should we use IRIs, short strings, or both (e.g., as IRIs + short aliases) to denote logical types?
Should we use RDF-like names for number types ("float", "double", "long", "int", ...) or more technical names ("float32", "float64", "int64", "int32", ...)?
What is our name for the "any" type?
How are blank nodes treated? (we need them to read RDF data, but they must be "document local" in their meaning, which requires renaming in the right places; this mechanism might violate the idea of a fixed "lexical-value mapping" and canonical "normal form" for datatypes, but maybe this is ok)
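To illustrate the blank node question, here is one possible renaming scheme, sketched under the assumption that documents are loaded one after another. The type `BlankNodeRenamer` and its methods are hypothetical. The sketch also shows why the lexical-value mapping is no longer fixed: the same label maps to different values depending on the document it came from.

```rust
use std::collections::HashMap;

/// Hypothetical helper that maps document-local blank node labels
/// to identifiers that are unique across all loaded documents.
#[derive(Default)]
pub struct BlankNodeRenamer {
    next_id: u64,
    current_document: HashMap<String, u64>,
}

impl BlankNodeRenamer {
    /// Forgets the labels of the previous document; ids are never reused.
    pub fn start_document(&mut self) {
        self.current_document.clear();
    }

    /// Returns the global id for `label` within the current document,
    /// allocating a fresh id on first use.
    pub fn rename(&mut self, label: &str) -> u64 {
        if let Some(&id) = self.current_document.get(label) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.current_document.insert(label.to_string(), id);
        id
    }
}
```

Within one document, repeated calls with the same label return the same id; after `start_document`, the same label receives a new id.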

mkroetzsch · 2023-04-20T07:22:11Z

mkroetzsch
Apr 20, 2023
Maintainer Author

Further reading: The documentation of supported SPARQL types in Oxigraph might be instructive to get an idea of how the RDF-side of types roughly looks, and how they can be implemented in a DBMS context.


mkroetzsch · 2023-04-20T15:40:29Z

mkroetzsch
Apr 20, 2023
Maintainer Author

After the above technical musings, some notes on usability:

Although the two are different, it would help users if our logical types and the native RDF types overlapped to a large extent. So we should probably allow a type like xsd:long that behaves as expected (for us, this could be a "derived type" that merely restricts some other number type, or an alias for some other type).
Aliases can be nice but reduce readability. Most programming languages have exactly one name per type.
User-defined aliases can be useful to declare the "sort" of a parameter to be different from another, even if the value space is the same. However, for this to work like a constraint, one would need to have the type inference behave accordingly (not based on value spaces but based on name? -- this is another discussion).
Broader types are easier to use. For example, xsd:integer is convenient in that it does not require users to decide how large their numbers might become. However, some clever mechanism would be needed to maintain performance in the face of such uncertainty. Can we have an "adaptive" integer type that uses little memory as long as values are small? This should probably be a signed type, since losing half of the positive value space is not a big deal for "data" (not talking about dictionary values here), whereas adaptively "signing" unsigned integers takes extra work.
"Adaptive" internal storage would also be useful for other types. E.g., one can store many real-world xsd:decimal values as fixed point representations using some int type, but some real-world values can be beyond this (example case).
- (Side remark: There are also "adaptive" schemes for IRI storage, e.g., to have a special dictionary scheme for IRIs that are based on numbers; a typical case is Wikidata.)
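One way to picture the "adaptive" integer idea is a column that starts with a narrow signed representation and widens only when a value does not fit. The following is a sketch, not a proposal for the actual physical layer: the enum `AdaptiveIntColumn`, the choice of widths (8, 32, 64 bits), and whole-column widening are all assumptions.

```rust
use std::convert::TryFrom;

/// Hypothetical signed integer column that widens on demand.
#[derive(Debug)]
pub enum AdaptiveIntColumn {
    I8(Vec<i8>),
    I32(Vec<i32>),
    I64(Vec<i64>),
}

impl AdaptiveIntColumn {
    /// Starts with the narrowest representation.
    pub fn new() -> Self {
        AdaptiveIntColumn::I8(Vec::new())
    }

    pub fn push(&mut self, value: i64) {
        // Step 1: widen the whole column if `value` does not fit.
        if let AdaptiveIntColumn::I8(v) = self {
            if i8::try_from(value).is_err() {
                *self = AdaptiveIntColumn::I32(v.iter().map(|&x| i32::from(x)).collect());
            }
        }
        if let AdaptiveIntColumn::I32(v) = self {
            if i32::try_from(value).is_err() {
                *self = AdaptiveIntColumn::I64(v.iter().map(|&x| i64::from(x)).collect());
            }
        }
        // Step 2: the casts below are lossless because of step 1.
        match self {
            AdaptiveIntColumn::I8(v) => v.push(value as i8),
            AdaptiveIntColumn::I32(v) => v.push(value as i32),
            AdaptiveIntColumn::I64(v) => v.push(value),
        }
    }

    /// Reads a value back at full width.
    pub fn get(&self, index: usize) -> Option<i64> {
        match self {
            AdaptiveIntColumn::I8(v) => v.get(index).map(|&x| i64::from(x)),
            AdaptiveIntColumn::I32(v) => v.get(index).map(|&x| i64::from(x)),
            AdaptiveIntColumn::I64(v) => v.get(index).copied(),
        }
    }

    /// Current storage cost per value in bytes.
    pub fn bytes_per_value(&self) -> usize {
        match self {
            AdaptiveIntColumn::I8(_) => 1,
            AdaptiveIntColumn::I32(_) => 4,
            AdaptiveIntColumn::I64(_) => 8,
        }
    }
}
```

Widening copies the whole column, so a real implementation would probably want to adapt per block rather than per column; the sketch only shows that a single logical "integer" type can hide several physical widths.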

Having good adaptivity would suggest a system with fewer primitive types, where specific types use the same internal handling with merely some extra constraints on the values. However:

Floating point encodings of different lengths do have incompatible value spaces and different lexical-value mappings, so that an adaptive processing is not possible.
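The difference in lexical-value mappings can be checked directly. In the sketch below, the helper `f32_and_f64_agree` (a made-up name) parses the same lexical form once as a 32-bit and once as a 64-bit float and compares the results after widening.

```rust
/// Parses `lexical` as f32 and as f64 and reports whether both
/// denote the same number. Returns `None` if parsing fails.
pub fn f32_and_f64_agree(lexical: &str) -> Option<bool> {
    let single: f32 = lexical.parse().ok()?;
    let double: f64 = lexical.parse().ok()?;
    // Widening f32 to f64 is exact, so any difference stems from
    // the two types rounding the decimal literal differently.
    Some(f64::from(single) == double)
}
```

For "0.5" both types agree, since 0.5 is exactly representable in binary. For "0.1" they do not: the nearest f32 and the nearest f64 are different numbers, so a value cannot be silently moved between the two encodings without changing what the lexical form denotes.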

So one possible initial approach might be:

Support datatypes for "any", "double", "integer", and "string" (proper names still to be determined, but aliases that are local names of RDF/XSD types should also use their semantics)
For each, support two names: an IRI and a shortcut alias
Start with i64 for integer and generalize this to an adaptive integer type
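A name table for this initial approach might look as follows. The alias spellings are placeholders, since proper names are still to be determined; the struct `TypeName` and the function `resolve` are hypothetical. The IRIs are the standard XSD datatype IRIs, and the IRI for "any" is deliberately left empty because that question is still open.

```rust
/// Hypothetical entry pairing a short alias with an optional IRI.
pub struct TypeName {
    pub alias: &'static str,
    pub iri: Option<&'static str>,
}

/// Placeholder names for the four initial types.
pub static TYPE_NAMES: [TypeName; 4] = [
    // No IRI yet: which IRI suits "any" is an open question.
    TypeName { alias: "any", iri: None },
    TypeName { alias: "double", iri: Some("http://www.w3.org/2001/XMLSchema#double") },
    TypeName { alias: "integer", iri: Some("http://www.w3.org/2001/XMLSchema#integer") },
    TypeName { alias: "string", iri: Some("http://www.w3.org/2001/XMLSchema#string") },
];

/// Looks up a type by its alias or by its full IRI.
pub fn resolve(name: &str) -> Option<&'static TypeName> {
    TYPE_NAMES
        .iter()
        .find(|t| t.alias == name || t.iri == Some(name))
}
```

Both `resolve("integer")` and `resolve("http://www.w3.org/2001/XMLSchema#integer")` return the same entry, which is what "two names per type" would mean in practice.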

Still to be clarified:

Which IRI is suitable for "any" (in our sense)? Especially: how do complex values relate to RDF concepts?
Are there other IRI sources for datatypes that are not in RDF/XSD (e.g., i128 or f16), or do we need our own IRIs for that?
