Portable line-oriented parsing #126

ruv · 2022-10-16T17:50:59Z

ruv
Oct 16, 2022
Maintainer

Concerning line-oriented parsing

When you implement parsing of a line-oriented syntax or DSL, you sometimes need to read exactly the next line of the input source, or to calculate the indentation of a line.

A common approach to get the next line or the current line from the input source is to use refill and source.

A common expectation is that refill loads the next line into the input buffer and that source returns a single line.
But this is wrong in the general case.

If the input source is a string passed to evaluate, the input buffer may contain multiple lines, and each of them (except possibly the last) includes a line-terminator sequence. In this case source may return multiple lines as well.
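
As a small illustration of the multi-line case, the following sketch evaluates a two-line string and reports the length of the input buffer as seen from the second line. The words source-length and demo are hypothetical helpers, not part of the code in this post, and the sketch assumes that the system treats line-terminator characters as delimiters when parsing names (the standard permits this but does not require it).

```forth
\ Hypothetical helper: the length of the current input buffer.
: source-length ( -- u )  source nip ;

: demo ( -- u )
  \ The string holds two lines, "1 drop" and "source-length",
  \ separated by the implementation-defined \n sequence.
  s\" 1 drop\nsource-length" evaluate
;
```

Executing demo leaves the length of the whole evaluated string (both lines plus the line terminator), not the length of the second line alone.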

If the input source is a block, then refill makes the whole next block the input buffer (see 7.6.2.2125 REFILL (BLOCK EXT)). This also means that parse-name can return a lexeme from the next line of the current block, and that Backslash "\" discards only the part of the input buffer that belongs to the current line (see 7.6.2.2535 \ (BLOCK EXT)).

Blocks are rarely used nowadays. But a program that handles blocks correctly in this regard will also work correctly when we load a whole file into memory and make it the input buffer, or when we evaluate a multi-line string.

Some excerpts from Forth-2012 2.1 Definitions of terms:
input buffer: A region of memory containing the sequence of characters from the input source that is currently accessible to a program.
input source: The device, file, block, or other entity that supplies characters to refill the input buffer.
parse area: The portion of the input buffer that has not yet been parsed, and is thus available to the system for subsequent processing by the text interpreter and other parsing operations.

Comments in evaluated strings

It's very inconvenient that Backslash "\" discards all characters up to the end of the input buffer (instead of the end of the line) when you evaluate a string. It can be fixed as follows:

: is-input-string ( -- flag )
  \ Return a flag: is the input source a string (being evaluated).
  [defined] blk [if] blk @ 0<> if false exit then [then]
  source-id -1 =
;
: source-following ( -- sd )
  \ Return the parse area (a string).
  \ NB: the returned string may contain a line-terminator sequence in any position.
  source >in @ /string
;
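: skip-source-line ( -- )
  \ Discard the part of the parse area that belongs to the current line.
  is-input-string 0= if ['] \ execute exit then   \ not a string: the standard Backslash suffices
  source-following  over >r                       \ ( c-addr u ) R: c-addr
  s\" \n"  dup >r  search                         \ look for the line terminator; R: c-addr u-nl
  if drop r@ then  +                              \ address just past the terminator, or the end of the buffer
  r> drop  r> -  >in +!                           \ advance >in by the distance covered
;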
: \ ( -- )
  \ This Backslash works as expected in evaluated strings too
  skip-source-line
; immediate

Moving to the next line

A portable way to move the parse area to the next line, which works even in an evaluated string:

: flip-source-line ( -- flag )
  \ Make the parse area start with the next line of the input source.
  \ The flag is false if the input source is exhausted.
  skip-source-line source-following nip if true else refill then
;

It should be noted that standard parsing works slightly differently when the input buffer contains multiple lines (for example, during evaluation of a multi-line string).

When the Forth text interpreter has extracted the last lexeme in the line, which is not followed by blanks before the end of the line:

if the input buffer contains a single line, then the parse area is empty, and you need to refill it (flip the line) to parse the next line;
if the input buffer contains a next line, then the parse area already starts with the next line (or a part of a line-terminator sequence), and you don't need to flip the line.

It's because the Forth text interpreter discards one delimiter that follows the extracted lexeme (see 3.4.1 Parsing: "the number in >IN is changed to index immediately past that delimiter, thus removing the parsed characters and the delimiter from the parse area").

So, if a word itself needs to read subsequent lexemes, it is better not to refill the input buffer unconditionally, but only when parse-name returns an empty string.
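
A minimal sketch of this advice, using only standard words. The name parse-lexeme is hypothetical (it is not defined elsewhere in this post), and the sketch assumes that parse-name skips line-terminator characters as leading delimiters, as most systems do.

```forth
: parse-lexeme ( -- c-addr u | 0 0 )
  \ Return the next lexeme, crossing line boundaries if needed.
  \ Return 0 0 when the input source is exhausted.
  begin
    parse-name dup if exit then   \ a non-empty lexeme: done
    2drop                         \ the parse area is exhausted
    refill 0=                     \ only now try to get more input
  until
  0 0
;
```

When the input buffer already contains the next line (for example, in an evaluated multi-line string), parse-name finds the lexeme there and refill is never called. When the buffer holds a single line, refill is called only after the line has really been consumed.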

Indentations

If the input buffer may contain multiple lines, we should take it into account when calculating indentation in the current line:

: source-following-line ( -- c-addr u )
  \ Return the part of the parse area that belongs to the current line only.
  \ NB: the returned string may include a line-terminator sequence at the tail only.
  >in @ >r
  source-following drop
  skip-source-line
  source-following drop over -
  r> >in !
;
: source-following-indent ( -- u )
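  \ Return the number of blank characters at the current position
  \ in the input buffer (within the current line only).
  \ NB: any character with a code up to $20 counts as blank,
  \ including the characters of a line-terminator sequence.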
  0 source-following-line over + swap ?do i c@ $20 u> if unloop exit then 1+ loop
;

Applications

One example that is sensitive to this topic is a module system that allows defining multiple modules in one file. The body of a module has to be saved as a string so that it can later be translated (via evaluate) in the different contexts where it is required. Hence, a module body should work the same whether the input buffer contains a single line or multiple lines.

Another example is more efficient file inclusion: load a whole file into memory (or create a memory mapping of the file) and then translate it with evaluate. With this approach we don't need to scan the file's content twice (once to break it into lines, and once to extract lexemes). Instead we scan it only once: some parts are scanned for lexemes, other parts for a line terminator, but no part is scanned twice.
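
A rough sketch of such file inclusion, built from the standard FILE, MEMORY, DOUBLE and EXCEPTION word sets. The names read-whole-file and include-by-evaluate are hypothetical. The sketch assumes that the file size fits in a single cell, that nothing keeps pointers into the buffer after translation, and it does not clean up the file handle if reading fails.

```forth
: read-whole-file ( c-addr u -- c-addr2 u2 )
  \ Read the file named by c-addr u into freshly allocated memory.
  r/o bin open-file throw >r        \ R: fileid
  r@ file-size throw d>s            \ ( size )
  dup allocate throw swap           \ ( buf size )
  2dup r@ read-file throw           \ ( buf size u2 )
  nip                               \ ( buf u2 )
  r> close-file throw
;
: include-by-evaluate ( i*x c-addr u -- j*x )
  \ Translate the file named by c-addr u as one multi-line string.
  read-whole-file over >r           \ R: buf
  ['] evaluate catch                \ keep the error code so the buffer can be freed first
  r> free throw
  throw                             \ re-raise the error from evaluate, if any
;
```

While such a file is being translated, source-id is -1 and the input buffer is the whole file, so a Backslash that discards the rest of the parse area would discard the rest of the file. This way of including files therefore relies on a line-aware Backslash like the one defined above.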
