ArrayOfStructElem can't contain null #90

Open
bdoubrov opened this issue Jun 30, 2023 · 13 comments

Comments

@bdoubrov
Collaborator

The current definition of ArrayOfStructElem specifies the element type as dictionary, but an element can also be null.

See veraPDF/veraPDF-library#1355

@petervwyatt
Member

This raises a larger question about the handling of null, especially in arrays. In some cases a null array element might be acceptable; in others it isn't. The current Arlington data model only explicitly defines a Type of null if ISO 32000-2 mentions it.

Dictionary handling is covered by 7.3.7 "A dictionary entry whose value is null (see 7.3.9, "Null object") shall be treated the same as if the entry does not exist." so dictionaries will never have a null unless ISO 32000-2 explicitly mentions it or there is a glitch in the matrix (e.g. Table 207 for Mac and Unix entries).
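
The 7.3.7 rule quoted above can be sketched in a few lines. This is only an illustration, not Arlington or veraPDF code: `PDF_NULL`, `effective_entries`, and the example keys are hypothetical names, and a PDF dictionary is modelled as a plain Python dict.

```python
# Hypothetical sentinel standing in for the PDF null object.
PDF_NULL = object()


def effective_entries(pdf_dict):
    """Return the entries a reader should see.

    An entry whose value is null is treated as if the entry
    did not exist (ISO 32000-2, 7.3.7).
    """
    return {key: value for key, value in pdf_dict.items()
            if value is not PDF_NULL}


# Illustrative dictionary; the keys are made up for this sketch.
entry = {"Type": "Example", "Count": 3, "Optional": PDF_NULL}
print(sorted(effective_entries(entry)))  # ['Count', 'Type']
```

Arrays have no equivalent rule, which is why a null array element needs separate consideration.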

@petervwyatt
Member

In the case of ArrayOfStructElem, this applies to StructTreeRoot K and ParentTree (number tree) and StructElem K and Ref entries. Consideration for null is not mentioned anywhere, so we need to clarify ISO 32000-2 before any change is made to the Arlington data model. We also need to be very careful about generalizing to number trees or name trees...

This is what existing Errata #157 on PDF 2.0 seeks to clarify.

@bdoubrov
Collaborator Author

bdoubrov commented Jul 3, 2023

I'd still try to go for an argument for allowing null in name-tree and number-tree based on the following text from 7.9.6:

A name tree serves a similar purpose to a dictionary — associating keys and values
[...]
The values associated with the keys may be objects of any type. Stream objects shall be specified by indirect object references (7.3.8, "Stream objects"). The dictionary, array, and string objects should be specified by indirect object references, and other PDF objects (null objects, numbers, booleans, and names) should be specified as direct objects.

The second part of this text explicitly allows null objects as values. And the first part implies that null as a value in the name-tree shall be treated in the same way as a null value in the dictionary.

The same applies to number trees. The ParentTree in the StructTreeRoot is a perfect example of an important use case, where null values are almost unavoidable if some structure elements are deleted from the document.
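
To make this reading concrete, here is a minimal sketch of a number-tree leaf lookup in which a null value is treated like a missing entry, by analogy with dictionaries. ISO 32000-2 does not state this rule for trees, so it is an assumption of the sketch; `PDF_NULL` and `number_tree_lookup` are hypothetical names, and the leaf is modelled as the flat `[key1, value1, key2, value2, ...]` array.

```python
from bisect import bisect_left

# Hypothetical sentinel standing in for the PDF null object.
PDF_NULL = object()


def number_tree_lookup(nums, key):
    """Look up an integer key in a flat, key-sorted leaf array.

    Returns None both when the key is absent and when its value
    is the null object (the assumed, dictionary-like reading).
    """
    keys = nums[0::2]
    index = bisect_left(keys, key)
    if index == len(keys) or keys[index] != key:
        return None  # key not present at all
    value = nums[2 * index + 1]
    return None if value is PDF_NULL else value


nums = [0, "10 0 R", 3, PDF_NULL, 7, "12 0 R"]
print(number_tree_lookup(nums, 0))  # 10 0 R
print(number_tree_lookup(nums, 3))  # None
```

Under this reading a consumer cannot tell a deleted entry from one that was never written, which is the behaviour an editing application would want after removing structure elements.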

@petervwyatt
Member

One of the issues in veraPDF/veraPDF-library#1355 is that the key in the name-tree is null, while name-tree keys are defined as "where each key_i shall be a string" and "The keys shall be sorted in lexical order". Number trees use slightly different wording for the key type ("each key_i is an integer": "is" vs "shall") but the same wording for the sorting requirement ("The keys shall be sorted in numerical order").

So name-tree/number-tree keys cannot be null (that fails "shall be a string"/"is an integer"), and neither the lexical nor the numeric sorting algorithm makes any allowance for null keys: for example, whether they come first, come last, or are skipped over entirely. And what would it mean to have a null key with a non-null value? Is that meaningful? (I note that in veraPDF/veraPDF-library#1355 the key and value are both null.)
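
The two key requirements can be sketched as a small check. This is a hypothetical helper, not how Arlington or veraPDF implement it: PDF strings are modelled as Python `bytes`, byte-wise comparison is assumed for "lexical order", and non-string keys are left out of the ordering check precisely because the specification gives no rule for where they would sort.

```python
# Hypothetical sentinel standing in for the PDF null object.
PDF_NULL = object()


def check_name_tree_keys(names):
    """Check a flat [key1, value1, key2, value2, ...] leaf array.

    Returns a list of problem descriptions (empty if none found).
    """
    problems = []
    keys = names[0::2]
    for position, key in enumerate(keys):
        if not isinstance(key, bytes):
            problems.append(f"key {position} is not a string")
    # Only string keys can be compared; a null key has no defined position.
    string_keys = [key for key in keys if isinstance(key, bytes)]
    if string_keys != sorted(string_keys):
        problems.append("keys are not in lexical order")
    return problems


print(check_name_tree_keys([b"alpha", 1, b"beta", 2]))  # []
print(check_name_tree_keys([b"beta", 1, PDF_NULL, 2, b"alpha", 3]))
```

The second call reports both a non-string key and an ordering failure.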

ISO 32000-2 definitely needs improvement in this area, and this is worthy of a separate Errata.

@u-fischer

@petervwyatt The keys in my example in the verapdf issue are not null. The key is the zero at the start, the value of this key is an array and inside this array are then the null values.

@petervwyatt
Member

Oh! Sorry - I didn't check the file and misunderstood... I just saw a long array of null!

Certainly, name-/number-trees can have any object (including null) as a value, as @bdoubrov noted above, but precisely which objects are valid for each name-/number-tree depends on the specific requirements defined for the key whose value is that name-/number-tree.

In the case of logical structure, there are plenty of "shall" statements saying what the value needs to be but null is not mentioned anywhere that I can see. And clause 7.3.9 starts with a very clear "The null object has a type and value that are unequal to those of any other object." so I think a heavy dose of sprinkling "... or null object" in Tables 354 and 355 is required.

@u-fischer

so I think a heavy dose of sprinkling "... or null object" in Tables 354 and 355 is required.

I don't think so. This is about the parent tree, and the relevant section is 14.7.5.4, "Finding structure elements from content items" which says

  • For a content stream containing marked-content sequences that are content items, the value shall be an array of indirect references to the sequences’ parent structure elements. The array element corresponding to each sequence shall be found by using the sequence’s marked-content identifier as a zero-based index into the array.

NOTE Because marked-content identifiers serve as indices into an array in the structural parent tree, their assigned values need to be as small as possible to conserve space in the array

I pondered this a lot when implementing the tagging. The question is whether every value in the array must be an indirect reference to a structure element, which would imply that the MCIDs on a page must be numbered consecutively starting from 0, or whether the "as small as possible" in the note means that gaps in the numbering are allowed and, if so, how to fill the gaps (the null value is then a quite logical candidate).

In LaTeX I implemented a strict numbering, but as shown by the axes4 example this is not the case in other implementations. A strict numbering would probably be a large problem for applications which edit a PDF as they would have to renumber the marked content sequences when content is deleted.
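
The two options can be sketched side by side. This is only an illustration under stated assumptions: `PDF_NULL`, `build_parent_array`, and `renumber_mcids` are hypothetical names, indirect references are modelled as strings such as "12 0 R", and whether null is actually permitted in the array is exactly the open question in this thread.

```python
# Hypothetical sentinel standing in for the PDF null object.
PDF_NULL = object()


def build_parent_array(mcid_to_parent):
    """Gap-tolerant option: index the array by MCID and fill
    every unused index with null."""
    if not mcid_to_parent:
        return []
    size = max(mcid_to_parent) + 1
    return [mcid_to_parent.get(mcid, PDF_NULL) for mcid in range(size)]


def renumber_mcids(mcid_to_parent):
    """Strict option: renumber MCIDs consecutively from 0.

    Returns (old-to-new MCID mapping, dense parent array). The
    content stream would also have to be rewritten with the new
    MCIDs, which this sketch does not do.
    """
    old_mcids = sorted(mcid_to_parent)
    mapping = {old: new for new, old in enumerate(old_mcids)}
    return mapping, [mcid_to_parent[old] for old in old_mcids]


# MCID 2 was deleted from the page content.
parents = {0: "12 0 R", 1: "13 0 R", 3: "15 0 R"}
print(len(build_parent_array(parents)))  # 4
print(renumber_mcids(parents)[0])        # {0: 0, 1: 1, 3: 2}
```

The gap-tolerant version leaves the content stream untouched, at the cost of a null at index 2; the strict version keeps the array dense but forces the editor to renumber.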

@petervwyatt
Member

petervwyatt commented Jul 4, 2023

Exactly. The issue is with the statement "... the value shall be an array of indirect references to the sequences’ parent structure elements." which does not permit null as that is clearly not an IR to a parent struct elem.

It is good that it is raised and we can seek clarification - such as simply appending "or null" to that sentence.

Do you happen to know if any popular implementations can be made to create a similar scenario, or do they renumber MCIDs?

@u-fischer

Do you happen to know if any popular implementations can be made to create a similar scenario, or do they renumber MCIDs?

No, sorry, I don't know much about other implementations; I only came across this problem because I ran the checker on the new version of the best practice guide.

@bdoubrov
Collaborator Author

bdoubrov commented Jul 5, 2023

Here is the result of merging two P elements into one in a popular implementation and doing an incremental save. I see the same null elements in the array inside the ParentTree.

test-merge-P.pdf

@petervwyatt
Member

@bdoubrov
Collaborator Author

@petervwyatt generally this is fine. I would only suggest removing the key argument, so that fn:AllowNull() applies only to the current key.

@petervwyatt
Member

The reason I'm drifting towards explicit keys rather than implicit ones is that this:
a) makes finding specific failures in the model itself easier, since the predicates are likely NOT mostly the same everywhere (the mention of the key can help pinpoint them);
b) reads more like the spec when read aloud.


3 participants