Skip to content
coonsta edited this page Sep 13, 2010 · 19 revisions

Guruguru is a parsing library for Emacs.

ぐるぐる (guru-guru) is a Japanese onomatopoeiac word meaning “around and around.” PEG parsers memoize the result of parsing a given expression at a given input position. This means parsers operate in linear time, even if the parser has to backtrack, because given a grammar with a fixed number of expressions N and an input of length M the parser will compute at most N·M results.

Guruguru uses a modified PEG algorithm that can parse left-recursive rules by repeatedly evaluating left-recursive expressions. This means Guruguru parsers aren’t guaranteed to operate in linear time; instead they may spend time going “around and around.”

Example

The following example demonstrates evaluating simple arithmetic expressions with Guruguru, once you have put guruguru.el somewhere on your load path:

(require 'guruguru)

(defvar expressions
  (gg-grammar
    (start x := term eof -> x)
    (term
       x := term ?+ y := fact -> (+ x y)
     | x := term ?- y := fact -> (- x y)
     | fact)
    (fact
       x := fact ?* y := num -> (* x y)
     | x := fact ?/ y := num -> (/ x y)
     | num)
    (num
       ?0 -> 0 | ?1 -> 1 | ?2 -> 2 | ?3 -> 3 | ?4 -> 4
     | ?5 -> 5 | ?6 -> 6 | ?7 -> 7 | ?8 -> 8 | ?9 -> 9)))

(defun expression-eval ()
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (let ((result (gg-parse expressions)))
      (message "%s" (if result (cdr result) "parse failed")))))

gg-grammar lets you define a grammar in lispy-syntax-BNF. gg-parse interprets the grammar against the current buffer and returns (t . result) if the parse succeeded or nil if it failed.

After you evaluate the above definitions (say, by putting them in the scratch buffer and typing M-x eval-buffer) you can evaluate simple arithmetic expressions by opening an empty buffer, entering 2*3+7, and then typing M-x expression-eval.

Describing Grammars

gg-grammar defines a grammar from a list of rules. Each rule is list surrounded by parentheses. The first symbol of each rule is its name. (There are four rules in the above example, start, term, fact, and num written (gg-grammar (start ...) (term ...) (fact ...) (num ...)).) The first rule in the grammar is the start rule. When you call gg-parse the first rule is interpreted against the contents of the current buffer at the current point. In the above example the rule is called start, but you can call it anything you like.

The rest of the symbols in a rule define one or more alternatives. If there is more than one alternative, the alternatives are separated by |. In the above example start has one alternative, term and fact have three, and num has ten—one for each digit. A PEG parser, unlike typical BNF, tries the alternatives in order until one of them succeeds. This means the order of alternatives is significant and you must pay close attention that an earlier alternative is not a prefix of a later alternative, or the later alternative will never be activated. An alternative has two parts, a parsing expression and an action, separated by an arrow.

The parsing expression is a sequence of things that must be matched for the alternative to succeed. These can be:

  • A character literal. ?a matches the letter a. The result of parsing a character literal is the character itself.
  • A regular expression. "[aeiou]+" matches one or more vowels. The result of parsing a regular expression is the string that the expression matched, but if a regular expression is the last expression evaluated, the usual functions like match-beginning, match-end, and match-string let you extract groups.
  • The end of the buffer, eof. Use this at the end of your start rule to ensure your parser consumed all of the input in the buffer. The result is t.
  • The name of a rule. The result of a rule is whatever its actions return.
  • A filter, ... if s-exp. s-exp should be a function that takes one argument. The function is passed the result of the expression to the left of the if; if the function returns true, then the parse succeeds and the result is whatever the result of the expression on the left is.
  • A binding, x := .... This remembers the result of the expression on the right. Filters and actions can refer to the variable.
  • Lookahead & .... This runs ahead in the input. If the expression on the right matches, then the parse succeeds. No input is consumed past the &.
  • Negative lookahead ~ .... Like lookahead this runs ahead in the input, but the parse succeeds only if the expression on the right isn’t matched. Negative lookahead is typically used in PEG parses for implementing “longest match” rules by checking that whatever comes next couldn’t be parsed too.

The action is an s-expression written after ->. Actions compute the result of parsing the expression. In the example above the actions do arithmetic operations; typically actions build an abstract-syntax tree. When your action runs two variables, gg-start and gg-end, are set to the positions in the buffer at the start and end of the expression being parsed. You can use these annotate your abstract syntax tree with position information. Actions are optional. If you don’t write an action, the result of the alternative is the result of the rightmost expression.

Because a PEG parser may memoize results and reuse them after back-tracking, actions should be pure functions. For example, if you want to use Guruguru to colorize a buffer, you should build a list of attribute runs in your actions and then walk over them when the parser completes to colorize the buffer, instead of colorizing the buffer directly in the actions.

Debugging Grammars

Guruguru doesn’t have a grammar debugger yet. The following function is useful when debugging a grammar by hand; use this in an action and it prints the rule and the yield (the text that was parsed.)

(defun gg-debug ()
  (message "%s: %s"
           rule
           (buffer-substring-no-properties gg-start gg-end)))

Limitations

Guruguru has no built-in support for error reporting or error recovery. You can add error recovery to your parsers by adding low-priority alternatives that consume arbitrary input up to some resynchronization token.

Clone this wiki locally