Skip to content
coonsta edited this page Sep 13, 2010 · 19 revisions

Guruguru is a parsing library for Emacs.

ぐるぐる (guru-guru) is a Japanese onomatopoeiac word meaning “around and around.” PEG parsers memoize the result of parsing a given expression at a given input position. This means parsers operate in linear time, even if the parser has to backtrack, because given a grammar with a fixed number of expressions N and an input of length M the parser will compute at most N·M results.

Guruguru uses a modified PEG algorithm that can parse left-recursive rules by repeatedly evaluating left-recursive expressions. This means Guruguru parsers aren’t guaranteed to operate in linear time; instead they may spend time going “around and around” hence the name.

Example

The following example demonstrates evaluating simple arithmetic expressions with Guruguru, once you have put guruguru.el somewhere on your load path:

(require 'guruguru)

(defvar expressions
  (gg-grammar
    (start x := term eof -> x)
    (term
       x := term ?+ y := fact -> (+ x y)
     | x := term ?- y := fact -> (- x y)
     | fact)
    (fact
       x := fact ?* y := num -> (* x y)
     | x := fact ?/ y := num -> (/ x y)
     | num)
    (num
       ?0 -> 0 | ?1 -> 1 | ?2 -> 2 | ?3 -> 3 | ?4 -> 4
     | ?5 -> 5 | ?6 -> 6 | ?7 -> 7 | ?8 -> 8 | ?9 -> 9))))

(defun expression-eval ()
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (let ((result (gg-parse expressions)))
      (message "%s" (if result (cdr result) "parse failed")))))

After you evaluate the above definitions (say, by putting them in the scratch buffer and typing M-x eval-buffer) you can evaluate simple arithmetic expressions by opening an empty buffer, entering 2*3+7, and then typing M-x expression-eval.

Describing Grammars

gg-grammar defines a grammar from a list of rules. Each rule is list surrounded by parentheses. The first symbol of each rule is its name. (There are four rules in the above example, start, term, fact, and num written (gg-grammar (start ...) (term ...) (fact ...) (num ...)).) The first rule in the grammar is the start rule. When you call gg-parse the first rule is interpreted against the contents of the current buffer at the current point. In the above example the rule is called start, but you can call it anything you like.

The rest of the symbols in a rule define one or more alternatives. If there is more than one alternative, the alternatives are separated by |. In the above example start has one alternative, term and fact have three, and num has ten—one for each digit. An alternative has two parts, a parsing expression and an action, separated by an arrow.

The parsing expression is a sequence of things that must be matched for the alternative to succeed. These can be:

  • A character literal. ?a matches the letter a. The result of parsing a character literal is the character itself.
  • The end of the buffer, eof. Use this to ensure your parser consumed all of the input in the buffer. The result of eof is t.
  • The name of a rule. The result of a rule is whatever its actions return.
  • A filter, ... if s-exp. s-exp should be a function that takes one argument. The function is passed the result of the expression to the left of the if; if the function returns true, then the parse succeeds and the result is whatever the result of the expression on the left is.
  • A binding, x := .... This remembers the result of the expression on the right. Filters and actions can refer to the variable.
  • Lookahead & .... This runs ahead in the input. If the expression on the right matches, then the parse succeeds. No input is consumed past the &.
  • Negative lookahead ~ .... Like lookahead this runs ahead in the input, but the parse succeeds only if the expression on the right isn’t matched. Negative lookahead is typically used in PEG parses for implementing “longest match” rules by checking that whatever comes next couldn’t be parsed too.

The action is an s-expression written after ->. Actions compute the result of parsing the expression. In the example above the actions do arithmetic operations; typically actions build an abstract-syntax tree. When your action runs two variables, gg-start and gg-end, are set to the positions in the buffer at the start and end of the expression being parsed. You can use these annotate your abstract syntax tree with position information. Actions are optional. If you don’t write an action, the result of the alternative is the result of the rightmost expression.

Because a PEG parser may memoize results and reuse them after back-tracking, actions should be pure functions. For example, if you want to use Guruguru to colorize a buffer, you should build a list of attribute runs in your actions and then walk over them when the parser completes to colorize the buffer, instead of colorizing the buffer directly in the actions.

Limitations

Guruguru has no built-in support for error reporting or error recovery. You can add error recovery to your parsers by adding low-priority alternatives that consume arbitrary input up to some resynchronization token.

Clone this wiki locally