Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Percent-encoding in fragment identifier syntax #443

Open
Treora opened this issue Apr 2, 2020 · 0 comments
Open

Percent-encoding in fragment identifier syntax #443

Treora opened this issue Apr 2, 2020 · 0 comments

Comments

@Treora
Copy link

Treora commented Apr 2, 2020

In the note about selectors and states, section 5:

The values SHOULD be percent encoded rfc3986; the encoding is a MUST for characters that may make the fragment ambiguous, namely:

character   code
space       %20
=           %3D
,           %2C
#           %23

The referenced RFC3986 defines the following grammar for a fragment identifier:

fragment    = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded   = "%" HEXDIG HEXDIG
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

In some of the note’s examples, some characters are not percent-encoded that are not actually valid in a fragment identifier.

Square brackets [ ] are are found in e.g. example 18:

http://example.org/page1.html
    #selector(type=XPathSelector,value=/html/body/p[2]/table/tr[2]/td[3]/span)

Angular brackets < > (which delimit URIs, so cannot be used inside them) are found in e.g. example 17:

http://example.org/page1.html
    #selector(type=CssSelector,value=%23elemid%20>%20.elemclass%20+%20p)

May this be worth a thorough review?

Parentheses

Moreover (feel free to factor out into a separate issue): Regarding the list of characters that must be percent-encoded, I wonder if it has been considered to include ( and ) here. Their presence may not strictly make the output ambiguous, but does make it more complex to parse.

For example, I might quote (but not crazy) in the text this is an artificial (but not crazy) example with a RangeSelector: (line breaks inserted for readability)

#selector(
  type=RangeSelector,
  startSelector=selector(type=TextQuoteSelector,exact=(but),
  endSelector=selector(type=TextQuoteSelector,exact=crazy))
)

Note that the total of three closing parentheses after crazy. I think one cannot decide how many of these are part of the cited string (one), and how many are part of the selector(…) syntax (two), without the parser either backtracking or keeping track of its recursion depth.

The proof of concept converter tool is based on PEG.js, which does not support backtracking, so if I am not mistaken cannot parse this. In fact, that tool does not allow parentheses in the values at all — see the last line of the source, where it defines validchar as any of a-zA-Z0-9<>/[]:%+@.-!$&;*_ (is this list based on a particular spec?).

tilgovi added a commit to apache/incubator-annotator that referenced this issue Apr 3, 2020
Due to unresolved questions about the fragment identifier format and
difficulties around parsing the fragment identifier correctly, remove
the fragment-identifier package entirely for the time being.

See also [w3c/web-annotation#443].

Close #66.

[w3c/web-annotation#443]: w3c/web-annotation#443
tilgovi added a commit to apache/incubator-annotator that referenced this issue Apr 3, 2020
Due to unresolved questions about the fragment identifier format and
difficulties around parsing the fragment identifier correctly, remove
the fragment-identifier package entirely for the time being.

See also [w3c/web-annotation#443].

Close #66.

[w3c/web-annotation#443]: w3c/web-annotation#443
Treora pushed a commit to apache/incubator-annotator that referenced this issue Apr 3, 2020
Due to unresolved questions about the fragment identifier format and
difficulties around parsing the fragment identifier correctly, remove
the fragment-identifier package entirely for the time being.

See also [w3c/web-annotation#443].

Close #66.

[w3c/web-annotation#443]: w3c/web-annotation#443
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants