I.1

LRGrep

Selecting Error Messages for LR Parsers

Python:

from calendar import weekday, SUNDAY

[year for year in range(2008, 2122)
    if weekday(year, 12, 25) == SUNDAY]

Lua:

require("date")

for year=2008,2121 do
   if date(year, 12, 25):getweekday() == 1 then
      print(year)
   end
end

Javascript:

for (var year = 2008; year <= 2121; year++) {
    var xmas = new Date(year, 11, 25)
    if ( xmas.getDay() === 0 )
        console.log(year)
}

If the rules are broken?

example.ml

¹ let x = 5;
² let y = 6
³ let z = 7

ocamlc example.ml

File "example.ml", line 3, characters 0-3:
3 | let z = 7
    ^^^
Error: Syntax error

A frustrating experience for developers!

Plan




  1. Problems with (LR) parsing and errors

  1. LRGrep for grammar authors

  1. Applications

  1. A technical look at LRGrep















1. Problems with (LR) parsing and errors















Context-Free Grammars

SentenceSubjectVerbComplement


SubjectJerry|MaryVerbeats|drinksComplementpizza|coffee


Terminals: Jerry, Mary, eats, drinks, pizza, coffee

Nonterminals: Sentence,Subject,Verb,Complement


Jerryeatspizza

JerrySubjecteatspizza

JerrySubjecteatsVerbpizza

JerrySubjecteatsVerbpizzaComplement

JerrySubjecteatsVerbpizzaComplementSentence



Second example: minimal arithmetic

How could we recognize an expression like: 1+2

ETET+ETnT(E) Terminals: +, (, ), n

Nonterminals: E (expression), T (term)

1T+2TEE


Constructing these trees is the job of a parser

Generating parsers from grammars

  • Hand-writting parsers quickly appeared to be problematic: this process was tedious and error-prone.

  • This motivated a long line of research on parsing algorithms.

In this thesis, I worked on the family of LR parsers:

  • They are generated automatically from a grammar.
  • They are good at processing programming languages.

Yet, mainstream compilers tend to abandon generated parsers:

  • GCC moved away from generated parsers in 2006 for C and in 2004 for C++
  • Go in 2015
  • Ruby in 2023
  • ...

Why?

  • "Error tolerance ..." (Ruby)
  • "Better error reporting with a hand-written parser" (C#)

Parsing 1/2: Shift-Reduce Parsing

We can recognize any sentence by repeating two actions:

  • Shift reads the next symbol
  • Reduce recognizes a rule

Let us write stackinput for a step, where:

  • stack is a partially recognized prefix
  • input is the suffix of the input that has not been read yet

1+2

ShiftShift

n+2

Reduce TnReduce Tn

T+2

ShiftShift

T+2

ShiftShift

T+n

Reduce TnReduce Tn

T+T

Reduce ETReduce ET

T+E

Reduce ET+EReduce ET+E

E

Parsing 2/2: LR(0) states and automaton

Question: How do we know which actions are applicable?

Idea: Keep track of the grammar rules have been partially recognized.

An item mark a position in a rule with a .

For instance, ET+E is an item.

When starting, we want to recognize an expression.

ETET+E

ETET+ETnT(E)

This is the initial state.

A state is a set of items, representing the candidates applicable to a situation.

Let's repeat the parsing process while tracking states:

ETET+ETnT(E)

1+2

ShiftShift

Tn

n+2

Reduce TnReduce Tn

ETET+ET+2

ShiftShift

Two kinds of actions are sufficient to recognize every sentence:

  • Shift consumes one input symbol
  • Reduce Aβ replaces suffix β by A in the stack

An item, written Aβγ, represents a position in a rule.
A state tracks active items and gives the applicable actions.

What happens on invalid input?

Let us look at examples of incomplete input.

First example: "1+"

We would like to report: "Expecting a number".

1+

ShiftShift

n+

Reduce TnReduce Tn

T+

ShiftShift

T+

T+ ET+EETET+ETnT(E) T+

We get stuck with a stack and a state.

Second Example: "(1"

We would like to report: "Unclosed parenthesis".

(1

ShiftShift

(1

ShiftShift

(n

(n
Tn(n

Reduce TnReduce Tn

ETET+E(T

Reduce ETReduce ET

T(E)(E

The input is invalid if actions do not permit reaching the goal.

The position of failure is well-defined (after the last shift).

The concept of "failing configuration" is slippery because of reductions.

Negative Space of Grammars

Could we also specify the handling of errors?

Connecting a Failure to an Error Message



How to specify the association between a failure and a message?

How to realize this specification efficiently?

How to help author this specification?


We focus on the technical aspect and not the human aspect:

  • LRGrep extract information from failures.
  • It is up to the grammar author to make use of this information.

State-of-the-Art

If error looks like "1+"

report "Expecting a number"

If error looks like "(1"

report "A parenthesis has not been closed"

"1+"ET+EETET+ETnT(E)"Expecting a number""(1"{Tn ?ETET+E ?T(E) ?}"A parenthesis has not been closed"
"1+"ET+EETET+ETnT(E)"Expecting a number""(1"{Tn ?ETET+E ?T(E) ?}"A parenthesis has not been closed"
  • State-based identification using examples
  • No control over reductions: undefined behaviors

Used in earlier versions of Go.


In case of error, reduce T and E (if possible)

If error looks like "1+"

report "Expecting a number"

If error looks like "(1"

report "A parenthesis has not been closed"

"1+"ET+EETET+ETnT(E)"Expecting a number""(1"T(E)"A parenthesis has not been closed"
  • Coarse-grained control of reductions
  • Exhaustivity check
  • State-based enumeration of counter-examples

Used in Compcert C, Catala, ...