Portable line-oriented parsing #126
ruv started this conversation in Show and tell
Concerning line-oriented parsing
When you implement parsing of a line-oriented syntax or DSL, you sometimes need to read exactly the next line of the input source, or to calculate line indentation.
A common approach to getting the next line or the current line from the input source is to use `refill` and `source`. A common expectation is that `refill` loads the next line into the input buffer, and `source` returns a single line. But this is wrong in the general case.
If the input source is a string from `evaluate`, then the input buffer may contain multiple lines, and each of them (except perhaps the last) includes a line-terminator sequence. This means that `source` may return multiple lines too in this case.
If the input source is a block, then `refill` makes the whole next block the input buffer (see 7.6.2.2125 `REFILL` (BLOCK EXT)). This also means that `parse-name` can return a lexeme from the next line of the current block, and Backslash "`\`" discards only the part of the input buffer that belongs to the current line (see 7.6.2.2535 `\` (BLOCK EXT)).
Admittedly, blocks are rarely used nowadays. But if a program is block-compliant in this regard, it will also work correctly if we load a whole file into memory and make it the input buffer, or use `evaluate` on a string of multiple lines.
Some excerpts from Forth-2012 2.1 Definitions of terms:
input buffer: A region of memory containing the sequence of characters from the input source that is currently accessible to a program.
input source: The device, file, block, or other entity that supplies characters to refill the input buffer.
parse area: The portion of the input buffer that has not yet been parsed, and is thus available to the system for subsequent processing by the text interpreter and other parsing operations.
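To see the point about `evaluate` in practice, here is a small sketch. The word `buffer-length` is a hypothetical helper, not something from the standard, and the example assumes a Forth-2012 system that provides `s\"` and treats LF as whitespace while interpreting.

```forth
\ Hypothetical helper: length of the input buffer as reported by SOURCE.
: buffer-length ( -- u )  source nip ;

\ The evaluated string holds two lines separated by LF (\l inside s\").
\ The word runs while the first line is being interpreted, yet the
\ reported length covers both lines and the terminator between them.
s\" buffer-length\l1 drop" evaluate .
```

If `source` returned a single line here, the printed length would be that of the first line only.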
Comments in evaluated strings
It's very inconvenient that Backslash "`\`" discards all characters up to the end of the input buffer (instead of the end of the line) when you evaluate a string. This can be fixed by redefining Backslash so that it stops at the end of the current line.
Moving to the next line
A portable way to set the parse area on the next line, which works even in an evaluated string:
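The original listings are not reproduced here, so what follows is only a minimal sketch of one way to do it, not necessarily the author's definitions. All word names except `\` are hypothetical. It assumes that lines end with LF (a CR LF pair also works, since LF comes last) and a Forth-2012 system with `parse` and the `#` decimal prefix. The same `skip-line` also gives a Backslash that stops at the end of the line.

```forth
\ Hypothetical helper: true if the character just before >IN is LF, that is,
\ the line terminator was already consumed as the delimiter of the last lexeme.
: eol-consumed? ( -- flag )
  >in @ dup if  1- source drop + c@ #10 =  then ;

\ Hypothetical name: drop the rest of the current line from the parse area.
\ PARSE stops just past the next LF, or at the end of the input buffer.
: skip-line ( -- )  #10 parse 2drop ;

\ Hypothetical name: move to the next line, refilling only when the input
\ buffer holds no further line. The flag is false at the end of input.
: next-line ( -- flag )
  skip-line  source nip >in @ u> if true else refill then ;

\ Backslash that discards only up to the end of the current line.
: \ ( "ccc<eol>" -- )  eol-consumed? 0= if skip-line then ; immediate
```

The `eol-consumed?` guard matters for a line that holds nothing but `\`: without it, the redefined Backslash would discard the following line as well.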
It should be noted that standard parsing works slightly differently when the input buffer contains multiple lines (for example, during evaluation of a multi-line string).
When the Forth text interpreter has extracted the last lexeme in a line, and that lexeme is not followed by blanks before the end of the line, the parse area already starts at the beginning of the next line.
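A small experiment can make this visible. It is only a sketch: `rest` and `starts-with-lf?` are hypothetical helpers, and the result assumes a system whose text interpreter treats LF as a delimiter, which the standard permits but does not require.

```forth
\ Hypothetical helper: the part of the input buffer not yet parsed.
: rest ( -- c-addr u )  source >in @ /string ;

\ Hypothetical helper: does the parse area still begin with the LF?
: starts-with-lf? ( -- flag )  rest if c@ #10 = else drop false then ;

\ No blank after the lexeme: the LF is expected to be gone already.
s\" starts-with-lf?\l" evaluate .

\ One space after the lexeme: the LF is expected to be still there.
s\" starts-with-lf? \l" evaluate .
```

On such a system the first call should report false and the second true, so a word that moves to the next line unconditionally would skip one line too many in the first case.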
This is because the Forth text interpreter discards one delimiter that follows the extracted lexeme (see 3.4.1 Parsing: "the number in `>IN` is changed to index immediately past that delimiter, thus removing the parsed characters and the delimiter from the parse area"). So, if a word needs to read the following lexemes itself, it is better not to refill the input buffer unconditionally, but only if `parse-name` returns an empty string.
Indentations
If the input buffer may contain multiple lines, we should take it into account when calculating indentation in the current line:
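As a rough sketch of such a calculation (the names `line-start` and `line-indent` are hypothetical, only spaces are counted, and LF is assumed as the line terminator): the current line is taken to be the one that holds the most recently parsed character, so a terminator already consumed as a delimiter still belongs to it.

```forth
\ Hypothetical name: address of the first character of the current line.
: line-start ( -- c-addr )
  source drop dup >in @ +      ( buf p )  \ p is just past the last parsed char
  2dup u< if 1- then           \ step back onto that char, if there is one
  begin 2dup u< while          \ scan backwards while buf < p
    dup 1- c@ #10 = if nip exit then  \ the previous char is LF: line starts at p
    1-
  repeat
  nip ;                        \ reached the start of the input buffer

\ Hypothetical name: number of leading spaces in the current line.
: line-indent ( -- u )
  line-start dup source + >r   ( start p ) ( R: end )
  begin dup r@ u< while        \ stop at the end of the input buffer
    dup c@ bl <> if  r> drop swap - exit  then
    char+
  repeat
  r> drop swap - ;
```

A word called right after `skip-line`, with nothing parsed on the new line yet, would get the indentation of the previous line from this sketch, so the convention has to match how the word is used.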
Applications
One example that is sensitive to this topic is a module system that allows multiple modules to be defined in one file. The body of a module has to be saved as a string so that it can later be translated (via `evaluate`) in the different contexts where it is required. Hence, a module body should be indifferent to whether the input buffer contains a single line or multiple lines.
Another example is more efficient file inclusion: load a whole file into memory (or create a memory mapping of the file) and then translate it with `evaluate`. With this approach we don't need to scan the file's content twice, once to break it into lines and once to extract lexemes. We scan it only once: some parts are scanned for lexemes, other parts are scanned for a line terminator, but no part is scanned twice.
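The second application could look roughly like the sketch below. The names `slurp` and `include-whole` are hypothetical, and the code assumes the File, Memory-Allocation and Exception word sets, a file whose size fits in one cell, and source text that keeps no pointers into the buffer after translation. The included text must itself cope with a multi-line input buffer, for example by using a Backslash that stops at the line terminator.

```forth
\ Hypothetical name: read a whole file into freshly allocated memory.
: slurp ( c-addr u -- addr u2 )
  r/o open-file throw >r       ( ) ( R: fid )
  r@ file-size throw drop      ( u )       \ assumes the size fits in one cell
  dup allocate throw swap      ( addr u )
  2dup r@ read-file throw nip  ( addr u2 ) \ u2: characters actually read
  r> close-file throw ;

\ Hypothetical name: translate a file as one multi-line input buffer.
: include-whole ( i*x c-addr u -- j*x )
  slurp over >r                ( addr u ) ( R: addr )
  ['] evaluate catch           \ keep control if translation fails
  r> free throw                \ release the buffer in either case
  throw ;                      \ re-raise a translation error, if any
```

Note that `source-id` is -1 during such translation, so words in the included text that rely on `refill` or on a file identifier behave differently than under `included`.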