4. A Technical Look at LRGrep

Compiling specifications

For efficient recognition, LRGrep translates a specification to a finite automaton.

The standard operators are handled as in a lexer generator.

The main challenge is the handling of reductions \([e]\).
To reason about them, I introduce the reduction automaton.

A finite automaton:

Derived from the LR automaton (alone).
Recognizing LR stacks.
Such that any reduce-filter pattern maps to some accepting states.

It comes with many useful properties. Let's illustrate on our arithmetic:

A stack (suffix) maps to a path.

A reduce-filter pattern translates to some accepting states.

If a stack reaches an accepting state, it is covered.

Conversely, a path not going through a target state represents the suffix of an uncovered stack.

This automaton almost answers all the questions we had about reduce-filter patterns: recognition, enumeration, and coverage*.

Static Analysis of Patterns

To compute coverage and enumeration, we need to characterize failing configurations.

The reduction automaton is promising...

Problems:

Practical parsers use variants of LR(1): need to account for lookahead symbols.
Worse: conflict resolution. Parsing actions can be removed to resolve ambiguities.

If we ignore this, our analyses are over approximations.
They can produce false positives (pointing out errors that are not).

Reachability in LR(1) Automaton

Fast reachability analysis for LR(1) parsers (Bour and Pottier, 2021).

Question: How to reach state \(s\) while looking ahead at terminal \(t\)?

Brute force solution: Shortest-paths matrix

Construct a square matrix \(M\) indexed by pairs \((s,t)\).
\(M_{i, j}\) is the minimum number of symbols to get from \(i\) to \(j\).
Computed as a fixed point using the \((\min,+)\) semiring.

Untractable in practice:

A medium-sized automaton has \(~10^3\) states and \(~10^2\) terminals.
Its matrix has \((10^3 \times 10^2)^2 = 10^{10}\) entries (tens of GBs!).

The best solution was to exploit sparsity of the matrix (Pottier, 2016).
Usable but not satisfying (taking 5GB and 5 minutes for OCaml grammar).

Key improvements:

The matrix has a very regular structure (many entries are redundant).
Fuse terminals with identical behaviors.
Decompose into sub-matrices to enable more fusion.

Two to three orders of magnitude faster!
(OCaml grammar processed in 0.45 seconds using 47MB)

Back to Coverage and Enumeration:

With the reachability matrix, we can enumerate the reachable stacks.
Intersecting them with the reduction automaton gives an exact solution.

Conclusion

Introduced Reduce-filter patterns to characterize erroneous situations.
Generalized them to the LRGrep domain-specific language.
Formalized these two languages (on CFGs and on LR automata).
Devised constructions and algorithms for deciding properties:
- The reduction automaton.
- An efficient reachability analysis for LR parsers.
To confirm the viability of this approach:
- Implemented a suite of tools (compiler, interpreter, coverage checker).
- Applied it to three programming languages; in particular to OCaml.
LRGrep prototype available at github.com/let-def/lrgrep

The reduction automaton provides insights on other problems.
(Error-resilient parsing, Syntactic completion, Conflict resolution)
LRGrep can be applied to GLR (Tree-Sitter).
LRGrep assumes a non-destructive use of the stack of LR parsers:
- Incompatible with optimized imperative implementations.
- Workarounds exist but can impact performance.

Thanks for Your Attention.

Sample Rules

OCaml:

  outer=constr_longident; inner=constr_longident;
  [_ / pattern_gen: constr_longident pattern .]

Detect sequences of identifiers in a pattern (incorrect use of curried syntax).

  LPAREN; typs=comma_type_list; rp=RPAREN
  /atomic_type: LPAREN _ RPAREN . _*

Detect an invalid list of comma-separated types, like (int, string), in contexts where a tuple like (int * string) would make sense.

Catala:

| /definition_parameters: 
    OF . separated_nonempty_list(COMMA,lident)
  { "expected the name of the parameter for this dependent \
     variable" }

Informations in suffix?

Analysis is done from left-to-right and stops at the first error: the suffix cannot be used to explain errors.
This can be problematic for brace-based languages:

for (int i = 0; i < scorers.length; i++) {
  if (scorers[i].nextDoc() == NO_MORE_DOCS)
    lastDoc = NO_MORE_DOCS;
    return;
  }
}

Syntax errors just aren't natural: improving error reporting with language models, Campbell et al.

(Exception: the current lookahead symbol)

How to design good error messages?

Compiler Error Messages Considered Unhelpful: The Landscape of Text-Based Programming Error Message Research, Becker et al.
How should compilers explain problems to developers?, Barik et al.
Error messages are classifiers, Wrenn and Krishnamurti

Extra features

Allowing multiple rules to match
- Decide in the semantic action whether a match is relevant or not
- Communicate information across rules
Restricting matches based on the lookahead symbol