2. LRGrep for Grammar Authors















If item \(T\to\bullet n\) is active or reachable

report "Expecting a number"

If item \(T\to(E\bullet)\) is active or reachable

report "A parenthesis has not been closed"

In LRGrep, this is written as two clauses with reduce-filter patterns: \[ \begin{array}{rcccll} [& \_* &/& T\to\bullet n &] & \texttt{ \{ "Expecting a number" \}} \\ [& \_* &/& T\to(E\bullet) &] & \texttt{ \{ "A parenthesis has not been closed" \}} \end{array} \]

\(\begin{array}{ll} \qquad \uparrow & \quad \uparrow \\ \text{Reduction target} & \text{Item filter} \\ \end{array}\)

  • More declarative than Menhir (no need to explicitly reduce \(T\) and \(E\)).
  • Semantics depends only on the grammar, not on the automaton.
    \(\Rightarrow\) More robust to changes in the grammar.

LRGrep Domain-Specific Language (DSL)


A dialect of regular expressions denoring sets of LR stacks:


\[ \begin{align*} &&e::=&\ X && \text{— Atom (terminal or non-terminal)} \\ &&|&\ \_ && \text{— Any symbol} \\ &&|&\ e_1;e_2 && \text{— Concatenation} \\ &&|&\ e_1|e_2 && \text{— Disjunction} \\ &&|&\ e^* && \text{— Kleene star} \\ &&|&\ x=e && \text{— Named capture (variable binding)} \\ \\ &&|&\ [e] && \text{— Matching up to reduction} \\ &&|&\ /A\to\beta\bullet\gamma && \text{— Filter for item} \\ \end{align*} \]

Revisiting the Example

We now have the tools to tackle the starting example.

¹ let x = 5;
² let y = 6
³ let z = 7

The ; changes the interpretation of the next let from a global definition to a local one.
It is likely the culprit.

We would like to warn the user in the error message:

File "example.ml", line 3, characters 0-3:
3 | let z = 7
    ^^^
Error: unterminated local let-binding, expecting "in".
Suggestion: remove the ";" at line 1, characters 9-10:
1 | let x = 5;
             ^

To recognize this situation, we decompose the input into three parts:

 let x = 5 – the prefix that forms a valid definition
 ; – the problematic semicolon
 let y = 6– the unterminated let-binding

Which translates to:

| [structure_item]; semi=SEMI; [let_binding]
  { 
    report "unterminated local let-binding, expecting \"in\".";
    hint "remove the \";\" at " $loc(semi) 
  }

Authoring Tools

The DSL has many decidable properties.

Besides a compiler, I have implemented tools to help grammar authors write error specifications.

Given a grammar and an invalid sentence, list the reduce-filter patterns by annotating stack trace.

Given a grammar, enumerate enough sentences to cover all reduce-filter patterns (stronger than just covering all states).

Given a grammar and an error specification, check that all failures have an explanation and that all explanations are reachable. Produce counter-examples if not.















3. Applications















OCaml

The first application, it heavily influenced the design of LRGrep.

Errors fall into two categories.

Cover all cases with reduce-filter patterns.

Automatically discovered using enumeration and coverage.

Refine confusing cases that come from unexpected interactions between grammatical constructions.

Use advanced features of the DSL.

Reported by OCaml users, specified using the interpreter.

  • I did not find a way to automate the discovery of complex errors.
  • But all reported problems could be described by patterns.

Catala

A language for formalizing tax law.
Error messages are even more important: it targets lawyers and not just developers.

The parser is implemented using Menhir with an exhaustive specification of errors.

Goal: Adapt the Menhir specification to an LRGrep specification.

  • Translate the specification automatically.
  • Often too specific, generalize and rank errors manually.
  • Enumeration-based testing prevent regressions.

The menhir specification has 3604 lines and 235 examples.
The LRGrep specification has 589 lines and 177 clauses.

Results:

  • A confirmation that LRGrep subsumes Menhir error specifications.
  • Translation could benefit from more assistance.

Elm

Elm is a functional programming language for web applications.
The compiler is renowned for the quality of its error messages.

Completely different technology:

  • it has an indentation-sensitive syntax
  • its parser is written using combinators with error-handling support

Approach:

  • Write an LR grammar for a representative subset of Elm
  • Use enumeration to fuzz the reference parser
  • Infer an error specification that mimics its behavior

Results:

  • Specification takes 2137 lines and contains 163 clauses.
  • Reduce-filter patterns can reproduce Elm's error messages.
  • Derive specification from a black-box automatically!