LRGrep

Selecting Error Messages for LR Parsers

Python:

from calendar import weekday, SUNDAY

[year for year in range(2008, 2122)
    if weekday(year, 12, 25) == SUNDAY]

Lua:

require("date")

for year=2008,2121 do
   if date(year, 12, 25):getweekday() == 1 then
      print(year)
   end
end

Javascript:

for (var year = 2008; year <= 2121; year++) {
    var xmas = new Date(year, 11, 25)
    if ( xmas.getDay() === 0 )
        console.log(year)
}

If the rules are broken?

example.ml

¹ let x = 5;
² let y = 6
³ let z = 7

ocamlc example.ml

File "example.ml", line 3, characters 0-3:
3 | let z = 7
    ^^^
Error: Syntax error

A frustrating experience for developers!

Plan

Problems with (LR) parsing and errors

LRGrep for grammar authors

Applications

A technical look at LRGrep

1. Problems with (LR) parsing and errors

Context-Free Grammars

\[ Sentence\quad \to\quad Subject\quad Verb\quad Complement \\ \]

\[ \begin{align*} Subject \quad &\to& \texttt{Jerry}\quad &|\quad \texttt{Mary} \\ Verb \quad &\to& \texttt{eats}\quad &|\quad \texttt{drinks} \\ Complement \quad &\to& \texttt{pizza}\quad &|\quad \texttt{coffee} \\ \end{align*} \]

Terminals: $\texttt{Jerry}$, $\texttt{Mary}$, $\texttt{eats}$, $\texttt{drinks}$, $\texttt{pizza}$, $\texttt{coffee}$

Nonterminals: $Sentence, Subject, Verb, Complement$

\[ \newcommand\leaf[2]{\dfrac{\texttt{#2}}{#1}} \begin{align*} \texttt{Jerry} \quad \texttt{eats} \quad \texttt{pizza} \end{align*} \]

\[ \begin{align*} \newcommand\leaf[2]{\dfrac{\texttt{#2}}{#1}} \leaf{Subject}{Jerry} \quad \texttt{eats} \quad \texttt{pizza} \end{align*} \]

\[ \begin{align*} \newcommand\leaf[2]{\dfrac{\texttt{#2}}{#1}} \leaf{Subject}{Jerry} \quad \leaf{Verb}{eats} \quad \texttt{pizza} \end{align*} \]

\[ \begin{align*} \leaf{Subject}{Jerry} \quad \leaf{Verb}{eats} \quad \leaf{Complement}{pizza} \end{align*} \]

\[ \begin{align*} \newcommand\leaf[2]{\dfrac{\texttt{#2}}{#1}} \dfrac{ \leaf{Subject}{Jerry} \quad \leaf{Verb}{eats} \quad \leaf{Complement}{pizza} }{Sentence} \end{align*} \]

Second example: minimal arithmetic

How could we recognize an expression like: $1+2$

\[ \begin{align*} E &\to T \\ E &\to T + E \\ T &\to n \\ T &\to (E) \\ \end{align*} \] Terminals: $+$, $($, $)$, $n$

Nonterminals: $E$ (expression), $T$ (term)

\[ %\newcommand\node[3]{\dfrac{#1}{#2}}{\small(#2 \to #3)} \begin{align*} \newcommand\leaf[2]{\dfrac{\texttt{#2}}{#1}} \dfrac{ \dfrac{1}{T} \quad {\small+} \quad \dfrac{ \dfrac{2}{T} }{E} }{E} \end{align*} \]

$\Rightarrow$ Constructing these trees is the job of a parser

Generating parsers from grammars

Hand-writting parsers quickly appeared to be problematic: this process was tedious and error-prone.
This motivated a long line of research on parsing algorithms.

In this thesis, I worked on the family of LR parsers:

They are generated automatically from a grammar.
They are good at processing programming languages.

Yet, mainstream compilers tend to abandon generated parsers:

GCC moved away from generated parsers in 2006 for C and in 2004 for C++
Go in 2015
Ruby in 2023
...

Why?

"Error tolerance ..." (Ruby)
"Better error reporting with a hand-written parser" (C#)

Parsing 1/2: Shift-Reduce Parsing

We can recognize any sentence by repeating two actions:

Shift reads the next symbol
Reduce recognizes a rule

Let us write $\mathrm{stack} \blacktriangleright \mathrm{input}$ for a step, where:

$\mathrm{stack}$ is a partially recognized prefix
$\mathrm{input}$ is the suffix of the input that has not been read yet

\[ \newcommand\SR[2]{{#1}\quad\blacktriangleright\quad{#2}} \newcommand\SRD[1]{\phantom{#1}\quad\Downarrow\quad{#1}} \newcommand\SRS{\SRD{\text{Shift}}} \newcommand\SRR[1]{\SRD{\text{Reduce } #1}} \newcommand\?{{\!\!}} \]

\[ \SR{}{1+2} \]

\[ \SRS \]

\[ \SR{n}{\?+2} \]

\[ \SRR{T\to n} \]

\[ \SR{T}{\?+2} \]

\[ \SRS \]

\[ \SR{T+\?}{2} \]

\[ \SRS \]

\[ \SR{T+n}{} \]

\[ \SRR{T\to n} \]

\[ \SR{T+T}{} \]

\[ \SRR{E\to T} \]

\[ \SR{T+E}{} \]

\[ \SRR{E\to T+E} \]

\[ \SR{E}{} \]

Parsing 2/2: LR(0) states and automaton

Question: How do we know which actions are applicable?

Idea: Keep track of the grammar rules have been partially recognized.

An item mark a position in a rule with a $\bullet$.

For instance, $E \to \bullet T + E$ is an item.

\[ \newcommand\SRSt[1]{\boxed{\begin{align*}#1\end{align*}}} \]

When starting, we want to recognize an expression.

\[ \SRSt{ E &\to \bullet T \\ E &\to \bullet T + E \\ } \]

\[ \SRSt{ E &\to \bullet T \\ E &\to \bullet T + E \\ T &\to \bullet n \\ T &\to \bullet (E) \\ } \]

This is the initial state.

A state is a set of items, representing the candidates applicable to a situation.

Let's repeat the parsing process while tracking states:

\[ \newcommand\SRB[3]{ \begin{align*} &\SRSt{#3} \\[1em] &\SR{#1}{#2} \end{align*} } \] $$ \SRSt{ E &\to \bullet T \\ E &\to \bullet T + E \\ T &\to \bullet n \\ T &\to \bullet (E) \\ } $$

\[ \SR{}{ 1+2} \]

\[ \SRS \]

$$ \SRSt{ T &\to n \bullet \\ } $$

\[ \SR{n}{\?+2} \]

\[ \SRR{T\to n} \]

$$ \SRB{T}{\?+2}{ E &\to {T \bullet} \\ E &\to {T \bullet} + E \\ } $$

\[ \SRS \]

\[ \ldots \]

Two kinds of actions are sufficient to recognize every sentence:

$\text{Shift}$ consumes one input symbol
$\text{Reduce } A\to\beta$ replaces suffix $\beta$ by $A$ in the stack

An item, written $A\to\beta\bullet\gamma$, represents a position in a rule.
A state tracks active items and gives the applicable actions.

What happens on invalid input?

Let us look at examples of incomplete input.

First example: "$1+\?$"

We would like to report: "Expecting a number".

\[ \SR{}{1+\?} \]

\[ \SRS \]

\[ \SR{n}{+\?} \]

\[ \SRR{T\to n} \]

\[ \SR{T}{+} \]

\[ \SRS \]

\[ \SR{T+\?}{} \]

$$ \SR{T+\?}{} $$ $$ \SRSt{ E &\to T+\bullet E \\ E &\to \bullet T \\ E &\to \bullet T + E \\ T &\to \bullet n \\ T &\to \bullet (E) \\ } $$ $$ \SR{T+\?}{} $$

We get stuck with a stack and a state.

Second Example: "$(1$"

We would like to report: "Unclosed parenthesis".

\[ \SR{}{(1} \]

\[ \SRS \]

\[ \SR{(}{1} \]

\[ \SRS \]

\[ \SR{(n}{} \]

$$ \SR{(n}{} $$

$$ \SRB{(n}{}{ T &\to n \bullet \\ } $$

\[ \SRR{T\to n} \]

$$ \SRB{(T}{}{ E &\to {T \bullet} \\ E &\to {T \bullet} + E \\ } $$

\[ \SRR{E\to T} \]

$$ \SRB{(E}{}{ T &\to ( E \bullet ) \\ } $$

The input is invalid if actions do not permit reaching the goal.

The position of failure is well-defined (after the last shift).

The concept of "failing configuration" is slippery because of reductions.

Negative Space of Grammars

Could we also specify the handling of errors?

Connecting a Failure to an Error Message

How to specify the association between a failure and a message?

How to realize this specification efficiently?

How to help author this specification?

We focus on the technical aspect and not the human aspect:

LRGrep extract information from failures.
It is up to the grammar author to make use of this information.

State-of-the-Art

If error looks like "1+"

report "Expecting a number"

If error looks like "(1"

report "A parenthesis has not been closed"

\[ \begin{array}{rlcll} \texttt{"1+"} &\to& \SRSt{ E &\to T+\bullet E \\ E &\to \bullet T \\ E &\to \bullet T + E \\ T &\to \bullet n \\ T &\to \bullet (E) \\ } &\to& \texttt{"Expecting a number"} \\ \phantom{\texttt{"(1"}} && \phantom{ \small \left\{ \begin{array}{c} \SRSt{ T &\to n \bullet \\ } \ ? \\ \SRSt{ E &\to {T \bullet} \\ E &\to {T \bullet} + E \\ } \ ? \\ \SRSt{ T &\to ( E \bullet ) \\ } \ ? \end{array} \right\} } &\phantom{\to}& \phantom{\texttt{"A parenthesis has not been closed"}} \end{array} \]

\[ \begin{array}{rlcll} \texttt{"1+"} &\to& \SRSt{ E &\to T+\bullet E \\ E &\to \bullet T \\ E &\to \bullet T + E \\ T &\to \bullet n \\ T &\to \bullet (E) \\ } &\to& \texttt{"Expecting a number"} \\ \texttt{"(1"} &\to& { \small \left\{ \begin{array}{c} \SRSt{ T &\to n \bullet \\ } \ ? \\ \SRSt{ E &\to {T \bullet} \\ E &\to {T \bullet} + E \\ } \ ? \\ \SRSt{ T &\to ( E \bullet ) \\ } \ ? \end{array} \right\} } &\to& \texttt{"A parenthesis has not been closed"} \end{array} \]

State-based identification using examples
No control over reductions: undefined behaviors

Used in earlier versions of Go.

In case of error, reduce T and E (if possible)

If error looks like "1+"

report "Expecting a number"

If error looks like "(1"

report "A parenthesis has not been closed"

Coarse-grained control of reductions
Exhaustivity check
State-based enumeration of counter-examples

Used in Compcert C, Catala, ...

LRGrep

Selecting Error Messages for LR Parsers

If the rules are broken?

Plan

1. Problems with (LR) parsing and errors

Context-Free Grammars

Second example: minimal arithmetic

Generating parsers from grammars

Parsing 1/2: Shift-Reduce Parsing

Parsing 2/2: LR(0) states and automaton

What happens on invalid input?

First example: "\(1+\?\)"

Second Example: "\((1\)"

Negative Space of Grammars

Connecting a Failure to an Error Message

State-of-the-Art