-
I'm parsing Pro*C source code; it's ANSI C with embedded blocks of PL/SQL that start with EXEC SQL.
My custom grammar re-uses rules from the existing C and PlSql grammars.
My tests show [...]. Just looking for some pointers to try out other options; my limited knowledge of ANTLR prevents me from seeing what those options are. -R
-
So, if you want to only test the EXEC SQL stuff, I would make your lexer start out with just something like this:

EXECSQL: 'EXEC' ' '* 'SQL' ; // Push to SQL mode here - you have consumed EXEC SQL
IGNORE: . -> skip ;

// Lexer rules for SQL mode here
...
SEMI: ';' ; // Pop mode here (it's been a while, but I think ';' is always the end of an EXEC SQL?)

Also, though ANTLR4 will accept just about anything as a grammar, it is good practice to combine common prefixes anyway; when there are a ton of these and the grammar gets complex, you will not be able to use SLL mode parsing and performance will be tragic.

exec_sql: EXEC SQL statements ;
statements: data_mani... // etc.

Or your company could just contract with me and save yourselves time and effort ;)
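Pulling those fragments together, a minimal sketch of how the comments above might translate into actual lexer commands; the grammar, mode, and rule names here are illustrative, not code from this thread:

lexer grammar ProCIslands;

EXECSQL : 'EXEC' ' '* 'SQL' -> pushMode(SQLMODE) ; // EXEC SQL consumed; enter SQL mode
IGNORE  : . -> skip ;                              // discard everything outside EXEC SQL blocks

mode SQLMODE;

SQL_TEXT : ~[;]+ ;          // raw statement text up to, but not including, the terminator
SQL_SEMI : ';' -> popMode ; // end of the EXEC SQL statement; back to the default mode

With only these rules, the token stream contains just the EXEC SQL islands, and everything else in the C source is skipped.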
-
Antlr doesn't have a good way to compose two grammars, because the lexer grammars would likely collide, not only in symbol names but in token values as well, and would likely have unexpected interactions. Jim's solution is probably easiest. If you then require a parse tree of the entire input, down to both C and PlSql, you will need to do tree and token-stream surgery. An alternative would be to wrap the lexer grammars in modes, starting out in "C" mode, then switching to PlSql mode when "EXEC" is found. Once finished with the SQL statement, popMode() would go back to "C" mode.
-
I've been telling people for 17 years not to put literals in the parser rules. Which C grammar are you using? I'll take a look. I think Terence wrote one once, and I wrote one commercially on a contract, but that was ANTLR3 and not open source. But, yes, free the parser grammar from literals.
Also, your EXEC and SQL literals may look separate, but they are just EXECSQL to the lexer. You probably want something like a whitespace fragment between them.
.*? will eat the entire file. You need to say "anything that isn't a semicolon, followed by a semicolon."
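A rough sketch of what that advice could look like in the lexer; the WS fragment and the PROC mode name are carried over from the quoted message below as assumptions, not code from this thread:

ExecSql : 'EXEC' WS+ 'SQL' -> pushMode(PROC) ; // allow real whitespace between EXEC and SQL

fragment WS : [ \t\r\n] ;

mode PROC;

// "anything that isn't a semicolon, followed by a semicolon"
ProcBody : ~[;]* ';' -> popMode ;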
This isn't an easy thing to do. You will likely make many mistakes before getting there. Are you sure that this is the correct path?
…On Thu, Apr 27, 2023 at 01:38 Raffi Basmajian ***@***.***> wrote:
I went with your suggestion using the C grammar as the starting point, but first had to separate it into CParser and CLexer grammar components, since modes are only permitted in lexers. Mode PROC gets activated when tokens EXEC and SQL are found together. Once in PROC mode, I use .*? to grab the entire command, as I'm not interested in parsing the contents right now. I switch back to C lexer mode when the termination delimiter is found: either ; or END-EXEC followed by ;.
lexer grammar CLexer;
//C lexer rules omitted for brevity
ExecSql
: 'EXEC' 'SQL' -> pushMode(PROC) ;
// ----------- PROC Mode ----------------
mode PROC;
ProcStatement
: ProcCommand -> popMode
;
ProcCommand
: ExecuteEndExec
| ( 'SELECT' | 'INSERT' | 'UPDATE' ) .*? ';'
;
ExecuteEndExec
: 'EXECUTE' .*? 'END-EXEC' ';'
;
Unfortunately I was not able to test this. ANTLR complained 600+ times, lol, with basically the same error for each C literal, a common problem I've seen before and frankly a side effect I should have expected after splitting the C grammar into separate parser/lexer files:

cannot create implicit token for string literal in non-combined grammar: __asm
My ultimate goal is to perform static analysis of Pro*C code using standard gcc utilities like cflow, cxref, etc., but I can't do that until I parse the Pro*C code, identify EXEC SQL blocks individually, then rewrite the original source while excluding, or at least commenting out, all EXEC SQL blocks, resulting in standard C source code.
- Is it worth the effort converting all C literals into token rules to
address this error? I remember reading that lexer *literals* take
precedence over lexer rules, regardless of appearance order. I suspect
there might be unknown side effects here.
- Is my implementation for mode PROC at least on the right path?
-
You want to be explicit. The lexer is just a DFA, but you are correct - it is just notation really.
The lexer does not need a parser, but the parser needs something that provides tokens; the TokenStream asks the lexer for all the tokens and then provides them when asked by the parser. I often use a lexer + token stream on their own for things like pre-processors. But you cannot direct the lexer to change modes or look for a particular token based on context.
Modes are the right thing here, unless the tokens are common between the two languages, but then you have to have two grammars in one specification, which isn't going to work in this case, I think.
You are on the right path.
…On Thu, Apr 27, 2023 at 11:54 AM Raffi Basmajian ***@***.***> wrote:
Hi Jim,
I'm using the public C grammar:
https://github.com/antlr/grammars-v4/blob/master/c/C.g4
I'm definitely using a WS fragment; I didn't post it for brevity.
.*? will eat the entire file. You need to say "anything that isn't a semicolon, followed by a semicolon."
My understanding is that .*? is non-greedy and will match the fewest number of characters until the surrounding lexical rules match, not unlike how comments work, right? The following works without exclusions for */, but maybe I'm missing something.
COMMENT : '/*' .*? '*/' -> skip ;
And I just realized something. Parsers need lexers, but lexers could be
standalone, right? I can use modes with just the C lexer plus my
customizations - to hell with the parser grammar. :-)
-
Folding string literals and splitting combined grammars is easily done in seconds with Trash's trfoldlit and trsplit, iff there are lexer rules that already declare the string literals. The main problem is that many combined grammars don't declare lexer rules for the literals, so the Antlr4 tool will fail on the split grammar. The key is to somehow declare lexer rules for the string literals that are meaningfully named. Fixing these in a general way is not easy, because one needs to create a meaningfully-named lexer symbol for each such string literal.
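Purely as an illustration (the rule name below is an assumption, not output from the toolkit), giving the '__asm' literal from the earlier error message a meaningfully-named lexer rule might look like this:

// In the split CLexer grammar, declare a meaningfully-named rule for the literal.
// Place it above the Identifier rule so it wins the tie for the exact keyword.
Asm : '__asm' ;

// A parser rule can then reference Asm by name (or keep the '__asm' literal, which
// now resolves to this token), and the "cannot create implicit token for string
// literal in non-combined grammar" error for '__asm' goes away.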
-
I got Trash set up on Windows 10 / .NET 7 / git bash, and reproduced the same output and unique values with [...].
If I'm understanding your advice accurately, and despite never having used Trash before, my goal being to split [...].
How does that look?
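For reference, a split of a combined grammar typically ends up with a pair of files along these lines (the file and rule names are assumptions based on the earlier messages, not actual trsplit output):

// CParser.g4
parser grammar CParser;
options { tokenVocab = CLexer; } // bind the parser to the tokens of the split lexer
// ... parser rules carried over from the combined C.g4 ...

// CLexer.g4
lexer grammar CLexer;
// ... lexer rules, including meaningfully-named rules for any literals
//     the parser grammar still references ...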
-
Looks ok.
-
Can Trash extract rules from a parser grammar if given a starting rule? For example, from the PlSql parser grammar I want all parser rules essential to a given starting rule. I browsed through Trash; perhaps I missed it, but does it offer this type of functionality?
Essentially, the C.g4 grammar itself has multiple occurrences of the string literal '__extension__'. That's why you see multiple strings output by the command.
Trash is a command-line toolkit that parses input, passes parse trees around, and analyzes parse trees. An Antlr4 grammar is parsed just like any other input, but with the antlr/antlr4/ grammar. The output of commands in the toolkit are parse …