LRGrep

Selecting Error Messages for LR Parsers

Python:

from calendar import weekday, SUNDAY

[year for year in range(2008, 2122)
    if weekday(year, 12, 25) == SUNDAY]

Lua:

require("date")

for year=2008,2121 do
   if date(year, 12, 25):getweekday() == 1 then
      print(year)
   end
end

Javascript:

for (var year = 2008; year <= 2121; year++) {
    var xmas = new Date(year, 11, 25)
    if ( xmas.getDay() === 0 )
        console.log(year)
}

If the rules are broken?

example.ml

¹ let x = 5;
² let y = 6
³ let z = 7

ocamlc example.ml

File "example.ml", line 3, characters 0-3:
3 | let z = 7
    ^^^
Error: Syntax error

A frustrating experience for developers!

Plan




  1. Problems with (LR) parsing and errors

  1. LRGrep for grammar authors

  1. Applications

  1. A technical look at LRGrep















1. Problems with (LR) parsing and errors















Context-Free Grammars

\[ Sentence\quad \to\quad Subject\quad Verb\quad Complement \\ \]


\[ \begin{align*} Subject \quad &\to& \texttt{Jerry}\quad &|\quad \texttt{Mary} \\ Verb \quad &\to& \texttt{eats}\quad &|\quad \texttt{drinks} \\ Complement \quad &\to& \texttt{pizza}\quad &|\quad \texttt{coffee} \\ \end{align*} \]


Terminals: \(\texttt{Jerry}\), \(\texttt{Mary}\), \(\texttt{eats}\), \(\texttt{drinks}\), \(\texttt{pizza}\), \(\texttt{coffee}\)

Nonterminals: \(Sentence, Subject, Verb, Complement\)


\[ \newcommand\leaf[2]{\dfrac{\texttt{#2}}{#1}} \begin{align*} \texttt{Jerry} \quad \texttt{eats} \quad \texttt{pizza} \end{align*} \]

\[ \begin{align*} \newcommand\leaf[2]{\dfrac{\texttt{#2}}{#1}} \leaf{Subject}{Jerry} \quad \texttt{eats} \quad \texttt{pizza} \end{align*} \]

\[ \begin{align*} \newcommand\leaf[2]{\dfrac{\texttt{#2}}{#1}} \leaf{Subject}{Jerry} \quad \leaf{Verb}{eats} \quad \texttt{pizza} \end{align*} \]

\[ \begin{align*} \leaf{Subject}{Jerry} \quad \leaf{Verb}{eats} \quad \leaf{Complement}{pizza} \end{align*} \]

\[ \begin{align*} \newcommand\leaf[2]{\dfrac{\texttt{#2}}{#1}} \dfrac{ \leaf{Subject}{Jerry} \quad \leaf{Verb}{eats} \quad \leaf{Complement}{pizza} }{Sentence} \end{align*} \]



Second example: minimal arithmetic

How could we recognize an expression like: \(1+2\)

\[ \begin{align*} E &\to T \\ E &\to T + E \\ T &\to n \\ T &\to (E) \\ \end{align*} \] Terminals: \(+\), \((\), \()\), \(n\)

Nonterminals: \(E\) (expression), \(T\) (term)

\[ %\newcommand\node[3]{\dfrac{#1}{#2}}{\small(#2 \to #3)} \begin{align*} \newcommand\leaf[2]{\dfrac{\texttt{#2}}{#1}} \dfrac{ \dfrac{1}{T} \quad {\small+} \quad \dfrac{ \dfrac{2}{T} }{E} }{E} \end{align*} \]


\(\Rightarrow\) Constructing these trees is the job of a parser

Generating parsers from grammars

  • Hand-writting parsers quickly appeared to be problematic: this process was tedious and error-prone.

  • This motivated a long line of research on parsing algorithms.

In this thesis, I worked on the family of LR parsers:

  • They are generated automatically from a grammar.
  • They are good at processing programming languages.

Yet, mainstream compilers tend to abandon generated parsers:

  • GCC moved away from generated parsers in 2006 for C and in 2004 for C++
  • Go in 2015
  • Ruby in 2023
  • ...

Why?

  • "Error tolerance ..." (Ruby)
  • "Better error reporting with a hand-written parser" (C#)

Parsing 1/2: Shift-Reduce Parsing

We can recognize any sentence by repeating two actions:

  • Shift reads the next symbol
  • Reduce recognizes a rule

Let us write \(\mathrm{stack} \blacktriangleright \mathrm{input}\) for a step, where:

  • \(\mathrm{stack}\) is a partially recognized prefix
  • \(\mathrm{input}\) is the suffix of the input that has not been read yet
\[ \newcommand\SR[2]{{#1}\quad\blacktriangleright\quad{#2}} \newcommand\SRD[1]{\phantom{#1}\quad\Downarrow\quad{#1}} \newcommand\SRS{\SRD{\text{Shift}}} \newcommand\SRR[1]{\SRD{\text{Reduce } #1}} \newcommand\?{{\!\!}} \]

\[ \SR{}{1+2} \]

\[ \SRS \]

\[ \SR{n}{\?+2} \]

\[ \SRR{T\to n} \]

\[ \SR{T}{\?+2} \]

\[ \SRS \]

\[ \SR{T+\?}{2} \]

\[ \SRS \]

\[ \SR{T+n}{} \]

\[ \SRR{T\to n} \]

\[ \SR{T+T}{} \]

\[ \SRR{E\to T} \]

\[ \SR{T+E}{} \]

\[ \SRR{E\to T+E} \]

\[ \SR{E}{} \]

Parsing 2/2: LR(0) states and automaton

Question: How do we know which actions are applicable?

Idea: Keep track of the grammar rules have been partially recognized.

An item mark a position in a rule with a \(\bullet\).

For instance, \(E \to \bullet T + E\) is an item.

\[ \newcommand\SRSt[1]{\boxed{\begin{align*}#1\end{align*}}} \]

When starting, we want to recognize an expression.

\[ \SRSt{ E &\to \bullet T \\ E &\to \bullet T + E \\ } \]

\[ \SRSt{ E &\to \bullet T \\ E &\to \bullet T + E \\ T &\to \bullet n \\ T &\to \bullet (E) \\ } \]

This is the initial state.

A state is a set of items, representing the candidates applicable to a situation.

Let's repeat the parsing process while tracking states:

\[ \newcommand\SRB[3]{ \begin{align*} &\SRSt{#3} \\[1em] &\SR{#1}{#2} \end{align*} } \] $$ \SRSt{ E &\to \bullet T \\ E &\to \bullet T + E \\ T &\to \bullet n \\ T &\to \bullet (E) \\ } $$

\[ \SR{}{ 1+2} \]

\[ \SRS \]

$$ \SRSt{ T &\to n \bullet \\ } $$

\[ \SR{n}{\?+2} \]

\[ \SRR{T\to n} \]

$$ \SRB{T}{\?+2}{ E &\to {T \bullet} \\ E &\to {T \bullet} + E \\ } $$

\[ \SRS \]

\[ \ldots \]

Two kinds of actions are sufficient to recognize every sentence:

  • \(\text{Shift}\) consumes one input symbol
  • \(\text{Reduce } A\to\beta\) replaces suffix \(\beta\) by \(A\) in the stack

An item, written \(A\to\beta\bullet\gamma\), represents a position in a rule.
A state tracks active items and gives the applicable actions.

What happens on invalid input?

Let us look at examples of incomplete input.

First example: "\(1+\?\)"

We would like to report: "Expecting a number".

\[ \SR{}{1+\?} \]

\[ \SRS \]

\[ \SR{n}{+\?} \]

\[ \SRR{T\to n} \]

\[ \SR{T}{+} \]

\[ \SRS \]

\[ \SR{T+\?}{} \]

$$ \SR{T+\?}{} $$ $$ \SRSt{ E &\to T+\bullet E \\ E &\to \bullet T \\ E &\to \bullet T + E \\ T &\to \bullet n \\ T &\to \bullet (E) \\ } $$ $$ \SR{T+\?}{} $$

We get stuck with a stack and a state.

Second Example: "\((1\)"

We would like to report: "Unclosed parenthesis".

\[ \SR{}{(1} \]

\[ \SRS \]

\[ \SR{(}{1} \]

\[ \SRS \]

\[ \SR{(n}{} \]

$$ \SR{(n}{} $$
$$ \SRB{(n}{}{ T &\to n \bullet \\ } $$

\[ \SRR{T\to n} \]

$$ \SRB{(T}{}{ E &\to {T \bullet} \\ E &\to {T \bullet} + E \\ } $$

\[ \SRR{E\to T} \]

$$ \SRB{(E}{}{ T &\to ( E \bullet ) \\ } $$

The input is invalid if actions do not permit reaching the goal.

The position of failure is well-defined (after the last shift).

The concept of "failing configuration" is slippery because of reductions.

Negative Space of Grammars

Could we also specify the handling of errors?

Connecting a Failure to an Error Message



How to specify the association between a failure and a message?

How to realize this specification efficiently?

How to help author this specification?


We focus on the technical aspect and not the human aspect:

  • LRGrep extract information from failures.
  • It is up to the grammar author to make use of this information.

State-of-the-Art

If error looks like "1+"

report "Expecting a number"

If error looks like "(1"

report "A parenthesis has not been closed"

\[ \begin{array}{rlcll} \texttt{"1+"} &\to& \SRSt{ E &\to T+\bullet E \\ E &\to \bullet T \\ E &\to \bullet T + E \\ T &\to \bullet n \\ T &\to \bullet (E) \\ } &\to& \texttt{"Expecting a number"} \\ \phantom{\texttt{"(1"}} && \phantom{ \small \left\{ \begin{array}{c} \SRSt{ T &\to n \bullet \\ } \ ? \\ \SRSt{ E &\to {T \bullet} \\ E &\to {T \bullet} + E \\ } \ ? \\ \SRSt{ T &\to ( E \bullet ) \\ } \ ? \end{array} \right\} } &\phantom{\to}& \phantom{\texttt{"A parenthesis has not been closed"}} \end{array} \]
\[ \begin{array}{rlcll} \texttt{"1+"} &\to& \SRSt{ E &\to T+\bullet E \\ E &\to \bullet T \\ E &\to \bullet T + E \\ T &\to \bullet n \\ T &\to \bullet (E) \\ } &\to& \texttt{"Expecting a number"} \\ \texttt{"(1"} &\to& { \small \left\{ \begin{array}{c} \SRSt{ T &\to n \bullet \\ } \ ? \\ \SRSt{ E &\to {T \bullet} \\ E &\to {T \bullet} + E \\ } \ ? \\ \SRSt{ T &\to ( E \bullet ) \\ } \ ? \end{array} \right\} } &\to& \texttt{"A parenthesis has not been closed"} \end{array} \]
  • State-based identification using examples
  • No control over reductions: undefined behaviors

Used in earlier versions of Go.


In case of error, reduce T and E (if possible)

If error looks like "1+"

report "Expecting a number"

If error looks like "(1"

report "A parenthesis has not been closed"

\[ \begin{array}{rlcll} \texttt{"1+"} &\to& \SRSt{ E &\to T+\bullet E \\ E &\to \bullet T \\ E &\to \bullet T + E \\ T &\to \bullet n \\ T &\to \bullet (E) \\ } &\to& \texttt{"Expecting a number"} \\ \texttt{"(1"} &\to& \SRSt{ T &\to ( E \bullet ) \\ } &\to& \texttt{"A parenthesis has not been closed"} \end{array} \]
  • Coarse-grained control of reductions
  • Exhaustivity check
  • State-based enumeration of counter-examples

Used in Compcert C, Catala, ...