Percent-encoding in fragment identifier syntax #443

Treora · 2020-04-02T20:36:00Z

In the note about selectors and states, section 5:

The values SHOULD be percent encoded rfc3986; the encoding is a MUST for characters that may make the fragment ambiguous, namely:
character   code
space       %20
=           %3D
,           %2C
#           %23

The referenced RFC3986 defines the following grammar for a fragment identifier:

fragment    = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded   = "%" HEXDIG HEXDIG
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

In some of the note’s examples, some characters are not percent-encoded that are not actually valid in a fragment identifier.

Square brackets [ ] are are found in e.g. example 18:

http://example.org/page1.html
    #selector(type=XPathSelector,value=/html/body/p[2]/table/tr[2]/td[3]/span)

Angular brackets < > (which delimit URIs, so cannot be used inside them) are found in e.g. example 17:

http://example.org/page1.html
    #selector(type=CssSelector,value=%23elemid%20>%20.elemclass%20+%20p)

May this be worth a thorough review?

Parentheses

Moreover (feel free to factor out into a separate issue): Regarding the list of characters that must be percent-encoded, I wonder if it has been considered to include ( and ) here. Their presence may not strictly make the output ambiguous, but does make it more complex to parse.

For example, I might quote (but not crazy) in the text this is an artificial (but not crazy) example with a RangeSelector: (line breaks inserted for readability)

#selector(
  type=RangeSelector,
  startSelector=selector(type=TextQuoteSelector,exact=(but),
  endSelector=selector(type=TextQuoteSelector,exact=crazy))
)

Note that the total of three closing parentheses after crazy. I think one cannot decide how many of these are part of the cited string (one), and how many are part of the selector(…) syntax (two), without the parser either backtracking or keeping track of its recursion depth.

The proof of concept converter tool is based on PEG.js, which does not support backtracking, so if I am not mistaken cannot parse this. In fact, that tool does not allow parentheses in the values at all — see the last line of the source, where it defines validchar as any of a-zA-Z0-9<>/[]:%+@.-!$&;*_ (is this list based on a particular spec?).

The text was updated successfully, but these errors were encountered:

Due to unresolved questions about the fragment identifier format and difficulties around parsing the fragment identifier correctly, remove the fragment-identifier package entirely for the time being. See also [w3c/web-annotation#443]. Close #66. [w3c/web-annotation#443]: w3c/web-annotation#443

tilgovi mentioned this issue Apr 3, 2020

Remove fragment-identifier package apache/incubator-annotator#71

Merged

azaroth42 added selector note ErratumRaised labels Apr 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Percent-encoding in fragment identifier syntax #443

Percent-encoding in fragment identifier syntax #443

Treora commented Apr 2, 2020

Percent-encoding in fragment identifier syntax #443

Percent-encoding in fragment identifier syntax #443

Comments

Treora commented Apr 2, 2020

Parentheses