Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsimmon.index yields incorrect position for tokens right before a newline ('\n') #331

Open
hillin opened this issue Nov 30, 2021 · 10 comments

Comments

@hillin
Copy link

hillin commented Nov 30, 2021

const parser = P.seqMap(P.index, P.digits, P.index, function (start, value, end) {
  console.log(start, value, end);
  return value;
});

parser.parse('1234\n');
parser.parse('1234');

outputs:

{offset: 0, line: 1, column: 1} '1234' {offset: 4, line: 2, column: 0}
{offset: 0, line: 1, column: 1} '1234' {offset: 4, line: 1, column: 5}

I suppose these two outputs should be identical.

@wavebeem
Copy link
Collaborator

Hi @hillin,

This is actually worked as intended. Since Parsimmon has already finished parsing 1234, the "cursor position" is on the \n

1 2 3 4 \n
        ^^

After successfully parsing a string, Parsimmon advances to the next character. I understand why this is confusing, but this is both internally easier and what a lot of APIs expect for source code ranges.

The second parse is actually past the end of the document for this reason as well.

@hillin
Copy link
Author

hillin commented Dec 1, 2021

Thanks @wavebeem .
Let's extend the example a little bit:

const parser = P.seqMap(P.index, P.digits, P.index, P.newline, P.index, function (start, value, middle, newline, end) {
  console.log(start, value, middle, newline, end);
  return value;
});

parser.parse('1234\n');

The output is:

{offset: 0, line: 1, column: 1} '1234' {offset: 4, line: 2, column: 0} '\n' {offset: 5, line: 2, column: 1}

This shows that Parsimmon treats the newline character \n as the beginning of a new line, rather than the end of the previous line, which is the counterintuitive part to me. Technically \n is the control character that instructs the device (typewriter, or a text program) to start a new line, the new line must happen after it, so \n has to be with the previous line. That's why it's also called the end of line (EOL).

In reality, we are using Parsimmon to create a parser for a DSL, which then will be used in the monaco editor. Parsimmon's interpretation of newline has created a little difficulty to us when working with monaco's text range API (e.g. https://microsoft.github.io/monaco-editor/api/interfaces/monaco.editor.IMarkerData.html), which does not work as expected if the end of a range is, as P.index says, at column 0 of the next line, rather than at the last column (+1) of the current line. This makes P.index less ideal for marking text ranges, but we don't have a better alternative. All we can do is to fix those indices manually, which has to be done in another pass after parsing because it's hard to know the length of the previous line while parsing.

@wavebeem
Copy link
Collaborator

wavebeem commented Dec 1, 2021

I can see the argument both ways and I'm pretty sure I've used software that expects the end of the token to exclusive rather than inclusive.

var index = Parsimmon(function(input, i) {
  return makeSuccess(i, makeLineColumnIndex(input, i));
});
///////////////////////////////////////////////////////////////////////////
var lastIndex = Parsimmon(function(input, i) {
  return makeSuccess(i, makeLineColumnIndex(input, i - 1));
});
///////////////////////////////////////////////////////////////////////////

At a glance this seems to do what you want. You'll have to add it directly into parsimmon since makeLineColumnIndex is not exposed. With this approach, you can use index at the start of a token and lastIndex after the token to get the full inclusive range, instead of exclusive.

Due to the ridiculous complexity of this function, it's not something I'm interested in exposing at this time, but I could maybe see the case for a P.lastIndex. There are entirely too many nearly useless functions in Parsimmon, but this one would be extremely tricky to do efficiently outside of core. So I would consider the feature if you want to make a pull request with tests and documentation for it.

P.node and P.mark would still maintain their existing behavior for backwards compatibility though, of course.

@wavebeem
Copy link
Collaborator

wavebeem commented Dec 1, 2021

Strictly speaking my suggestion still won't use your preferred logic for newline line/column numbering, but it will skip the issue as long as you don't actually need to include the newline in your range.

Otherwise, if you really really want the P.index numbering to change and not just add a new mode, that's not something I would entertain at this time since it's a breaking change.

@hillin
Copy link
Author

hillin commented Dec 1, 2021

Well I think here is the misunderstanding: I'm totally OK to use an exclusive end index for a token, and I consent it's the right way to do this. My point is, the end index should not span into the next line if the token does not.

So for the example in the original post, for both input I'd expect the end index of 1234 to be at {line: 1, column: 5} (which is exclusive), rather than {line: 2, column: 0}, because again, the token 1234 did not have any part of it laid at line 2.

@hillin
Copy link
Author

hillin commented Dec 1, 2021

Here is another example: for the input 1234\n5678, I'd expect it to be tokenized as:

Token Expected Range Current Range
1234 L1C1 - L1C5 L1C1 - L2C0
\n L1C5 - (L2C0 or L1C6) L2C0 - L2C1
5678 L2C1 - L2C5 L2C1 - L2C5
  • The current end indexing of 1234 is not ideal, because the token never touches line 2, yet we still say it ends at the beginning of line 2, which may be true but not really helpful. I'd prefer it ends at the end of line 1.
  • The \n is also less than ideal, because although I can't find a specification, intuitively I feel it should be the last character of line 1, rather than the first (actually, 0th as all the indices are 1-based) character of line 2. I'll leave this one debatable.
  • 5678 is indexed as expected, but only if it's not followed by a \n.

@wavebeem
Copy link
Collaborator

wavebeem commented Dec 4, 2021

the end index should not span into the next line if the token does not

This still seems to me like the ideal solution is to use inclusive ranges.

I feel it should be the last character of line 1

I see your point more and more. In UNIX, a line is terminated with a newline, not separated by it. it definitely feels like it's part of the line, like you said. This change won't happen within Parsimmon v1 since it would be a breaking change, but I'll definitely think about this going forward.

I'm going to reference this issue in #230 because it's important to think about for v2.

I may end up implementing a change for this in my sister library wavebeem/bread-n-butter#34 first to see how I feel about it. But I don't plan on working on it any time soon.

@hillin
Copy link
Author

hillin commented Dec 5, 2021

This still seems to me like the ideal solution is to use inclusive ranges

To me exclusive ending index is still the legitimate way to go. We just need some kind of virtual index, which does not actually map to a character in the source. This is the standard design of almost all range constructs, the Range in javascript selection API, System.Range in .net and range() in python to name a few.

One thought is, we don't have to change the behavior of Parsimmon.index, which semantically means the current text pointer in the scanner and should reflect how Parsimmon interprets \n (which, as you said, is breaking to change thus not desirable in v1); but we can have another function to correct mark the range of a parsed result. E.g.

function mark<T>(parser: Parser<T>): Parser<{value: T; begin: Index; end: Index}>; 
// in which the end index should be the ideal result mentioned above

Actually this could be very useful (in my use cases) because I find myself always using the Parsimmon.index in pair to mark ranges.

@wavebeem
Copy link
Collaborator

wavebeem commented Dec 5, 2021

The mark method already does this. But .mark doesn't take any options right now. You could add an options object to that method that implements the alternative line number behavior you suggested.

I would certainly review the PR if you want to work on it.

https://github.com/jneen/parsimmon/blob/master/API.md#parsermark

@wavebeem wavebeem reopened this Dec 5, 2021
@hillin
Copy link
Author

hillin commented Dec 5, 2021

The mark method already does this

Oops, I've missed that one.

I would certainly review the PR if you want to work on it.

I'll see what I can do with it.

BTW for now we've switched to line-based parsing, as a workaround of this issue; as well as better supporting for partially tokenization (as expected by the monaco editor).

Daru13 added a commit to exsitu-projects/ilatex that referenced this issue Apr 14, 2022
In a recent commit, various packages were updated to more recent versions. Parsimmon, in particular, was bumped from 1.16.0 to 1.18.1.

However, starting from version 1.17, the semantic of getting the position of the parser's head after reading a newline character (\n) seems to have changed, as suggested by jneen/parsimmon#331 (instead of returning a position at the end of line N where \n was read, it returns a position at the beginning of line N+1). This was a breaking change, given how i-LaTeX's parser is implemented. For this reason, the package was reverted to version 1.16.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants