-
I'm parsing Pro*C source code; it's ANSI C with embedded blocks of PL/SQL that start with EXEC SQL.
My custom grammar re-uses rules from the existing C and PlSql grammars.
My tests show [...]. Just looking for some pointers to try out other options; my limited knowledge of ANTLR prevents me from seeing what those options are. -R
-
So, if you want to only test the EXEC SQL stuff, I would make your lexer start out with just something like this:

EXECSQL: 'EXEC' ' '* 'SQL' ; // Push to SQL mode here - you have consumed EXEC SQL
IGNORE: . -> skip ;

// Lexer rules for SQL mode here
...
SEMI: ';' ; // Pop mode here (it's been a while, but I think ';' is always the end of an EXEC SQL?)

Also, though ANTLR4 will accept just about anything as a grammar, it is good practice to combine common prefixes anyway; when there are a ton of these and the grammar gets complex, you will not be able to use SLL mode parsing and performance will be tragic.

exec_sql: EXEC SQL statements ;
statements: data_mani... // etc.

Or your company could just contract with me and save yourselves time and effort ;)
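Pulling those fragments together, a minimal sketch of how the comments above might translate into actual lexer commands; the grammar, mode, and rule names here are illustrative, not code from this thread:

lexer grammar ProCIslands;

EXECSQL : 'EXEC' ' '* 'SQL' -> pushMode(SQLMODE) ; // EXEC SQL consumed; enter SQL mode
IGNORE  : . -> skip ;                              // discard everything outside EXEC SQL blocks

mode SQLMODE;

SQL_TEXT : ~[;]+ ;          // raw statement text up to, but not including, the terminator
SQL_SEMI : ';' -> popMode ; // end of the EXEC SQL statement; back to the default mode

With only these rules, the token stream contains just the EXEC SQL islands, and everything else in the C source is skipped.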
-
Antlr doesn't have a good way to compose two grammars, because the lexer grammars would likely collide, not only in symbol names but in token values as well, and would likely have unexpected interactions. Jim's solution is probably easiest. If you then require a parse tree of the entire input, down to both C and PlSql, you will need to do tree and token-stream surgery. An alternative would be to wrap the lexer grammars in modes, starting out in "C" mode, then switching to PlSql mode when "EXEC" is found. Once finished with the SQL statement, popMode() would go back to "C" mode.
-
I've been telling people for 17 years not to put literals in the parser rules. Which C grammar are you using? I'll take a look. I think Terence wrote one once, and I wrote one commercially on a contract, but that was ANTLR3 and not open source. But, yes, free the parser grammar from literals.
Also, your EXEC and SQL literals may look separate, but they are just EXECSQL to the lexer. You probably want something like a whitespace fragment between them.
.*? will eat the entire file. You need to say "anything that isn't a semicolon, followed by a semicolon."
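A rough sketch of what that advice could look like in the lexer; the WS fragment and the PROC mode name are carried over from the quoted message below as assumptions, not code from this thread:

ExecSql : 'EXEC' WS+ 'SQL' -> pushMode(PROC) ; // allow real whitespace between EXEC and SQL

fragment WS : [ \t\r\n] ;

mode PROC;

// "anything that isn't a semicolon, followed by a semicolon"
ProcBody : ~[;]* ';' -> popMode ;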
This isn't an easy thing to do. You will likely make many mistakes before getting there. Are you sure that this is the correct path?
…On Thu, Apr 27, 2023 at 01:38 Raffi Basmajian ***@***.***> wrote:
I went with your suggestion using the C grammar as the starting point, but first had to separate it into CParser and CLexer grammar components, since modes are only permitted in lexers. Mode PROC gets activated when tokens EXEC and SQL are found together. Once in PROC mode, I use .*? to grab the entire command, as I'm not interested in parsing the contents right now. I switch back to C lexer mode when the termination delimiter is found: either ; or END-EXEC followed by ;.
lexer grammar CLexer;
//C lexer rules omitted for brevity
ExecSql
: 'EXEC' 'SQL' -> pushMode(PROC) ;
// ----------- PROC Mode ----------------
mode PROC;
ProcStatement
: ProcCommand -> popMode
;
ProcCommand
: ExecuteEndExec
| ( 'SELECT' | 'INSERT' | 'UPDATE' ) .*? ';'
;
ExecuteEndExec
: 'EXECUTE' .*? 'END-EXEC' ';'
;
Unfortunately I was not able to test this. ANTLR complained 600+ times, lol, with basically the same error for each C literal, a common problem I've seen before and frankly a side effect I should have expected after splitting the C grammar into separate parser/lexer files:

cannot create implicit token for string literal in non-combined grammar: __asm
My ultimate goal is to perform static analysis of Pro*C code using standard gcc utilities like cflow, cxref, etc., but I can't do that until I parse the Pro*C code, identify EXEC SQL blocks individually, then rewrite the original source while excluding, or at least commenting out, all EXEC SQL blocks, resulting in standard C source code.
- Is it worth the effort converting all C literals into token rules to
address this error? I remember reading that lexer *literals* take
precedence over lexer rules, regardless of appearance order. I suspect
there might be unknown side effects here.
- Is my implementation for mode PROC at least on the right path?
-
You want to be explicit. The lexer is just a DFA, but you are correct - it is just notation really.
The lexer does not need a parser, but the parser needs something that provides tokens; the TokenStream asks the lexer for all the tokens and then provides them when asked by the parser. I often use a lexer + token stream on their own for things like pre-processors. But you cannot direct the lexer to change modes or look for a particular token based on context.
Modes are the right thing here, unless the tokens are common between the two languages, but then you have to have two grammars in one specification, which isn't going to work in this case, I think.
You are on the right path.
…On Thu, Apr 27, 2023 at 11:54 AM Raffi Basmajian ***@***.***> wrote:
Hi Jim,
I'm using the public C grammar:
https://github.com/antlr/grammars-v4/blob/master/c/C.g4
I'm definitely using a WS fragment; I didn't post it for brevity.
.*? will eat the entire file. You need to say "anything that isn't a semicolon, followed by a semicolon."
My understanding is that .*? is non-greedy and will match the fewest number of characters until the surrounding lexical rules match, not unlike how comments work, right? The following works without exclusions for */, but maybe I'm missing something.
COMMENT : '/*' .*? '*/' -> skip ;
And I just realized something. Parsers need lexers, but lexers could be
standalone, right? I can use modes with just the C lexer plus my
customizations - to hell with the parser grammar. :-)
-
Folding string literals and splitting combined grammars is easily done in seconds with Trash's trfoldlit and trsplit, iff there are lexer rules that already declare the string literals. The main problem is that many combined grammars don't declare lexer rules for the literals, so the Antlr4 tool will fail on the split grammar. The key is to somehow declare lexer rules for the string literals that are meaningfully named. Fixing these in a general way is not easy, because one needs to create a meaningfully-named lexer symbol for each such string literal.
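Purely as an illustration (the rule name below is an assumption, not output from the toolkit), giving the '__asm' literal from the earlier error message a meaningfully-named lexer rule might look like this:

// In the split CLexer grammar, declare a meaningfully-named rule for the literal.
// Place it above the Identifier rule so it wins the tie for the exact keyword.
Asm : '__asm' ;

// A parser rule can then reference Asm by name (or keep the '__asm' literal, which
// now resolves to this token), and the "cannot create implicit token for string
// literal in non-combined grammar" error for '__asm' goes away.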
-
I got Trash set up on Windows 10 / .NET 7 / git bash, and reproduced the same output and unique values with [...].
If I'm understanding your advice accurately, and despite never having used Trash before, my goal being to split [...].
How does that look?
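For reference, a split of a combined grammar typically ends up with a pair of files along these lines (the file and rule names are assumptions based on the earlier messages, not actual trsplit output):

// CParser.g4
parser grammar CParser;
options { tokenVocab = CLexer; } // bind the parser to the tokens of the split lexer
// ... parser rules carried over from the combined C.g4 ...

// CLexer.g4
lexer grammar CLexer;
// ... lexer rules, including meaningfully-named rules for any literals
//     the parser grammar still references ...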
-
Looks ok.
-
Can Trash extract rules from a parser grammar if given a starting rule? For example, from the PlSql parser grammar I want all parser rules essential to a given starting rule. I browsed through Trash; perhaps I missed it, but does it offer this type of functionality?
Essentially, the C.g4 grammar itself has multiple occurrences of the string literal '__extension__'. That's why you see multiple strings output by the command.
Trash is a command-line toolkit that parses input, passes parse trees around, and analyzes parse trees. An Antlr4 grammar is parsed just like any other input, but with the antlr/antlr4/ grammar. The output of commands in the toolkit are parse …