Even more compact lexer table

Some time ago, I blogged about the representation of lexer tables. That post introduced a common scheme, originally described in the Dragon Book, together with a practical implementation.

In a footnote, I mentioned that I thought that the pseudo-code of the Dragon Book was wrong:

But recently, I worked on a project that exhibits a variant of the problem, one for which the Dragon Book typo might actually be beneficial.

Namely:

Given these constraints, it is worth spending some time on making the table more compact, at the cost of slightly worse lookups.

This opens up an interesting question: how to construct a compact table with this encoding? How to pick good fallback states?

Finding common transitions and fallback states

Let’s assume we have a set of states $Q$ and an alphabet $\Sigma$ and a transition function $\delta : Q \times \Sigma \rightarrow Q$.

For each state $q$, we want to decide whether to represent the function $\delta(q,\_)$ using either:

- a complete vector of transitions, with one entry per symbol of $\Sigma$, or
- a partial function $\delta_q$, defined only on a subset of $\Sigma$, together with a fallback state $f_q$.

The original function $\delta$ can be recovered from this decomposition using:

$$ \delta(q,a) = \begin{cases} \delta_q(a) & \text{if } a \in \mathrm{dom}(\delta_q) \\ \delta(f_q,a) & \text{otherwise} \end{cases} $$

Let’s define a distance function $d(q_1,q_2)$ on states:

$$ d(q_1,q_2) = \lvert \left\{ a \in \Sigma \mid \delta(q_1,a) \neq \delta(q_2,a) \right\} \rvert $$

The distance function measures the number of transitions in which two states disagree.
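
A direct computation of this distance, on a toy transition function (hypothetical data, represented as nested dictionaries):

```python
# Toy transition function over states {0, 1, 2} and alphabet {"a", "b", "c"}.
delta = {
    0: {"a": 1, "b": 2, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},  # identical to state 0
    2: {"a": 1, "b": 0, "c": 2},  # differs from state 0 on "b" and "c"
}

def distance(q1, q2, sigma=("a", "b", "c")):
    """Number of symbols on which the two states' transitions disagree."""
    return sum(1 for a in sigma if delta[q1][a] != delta[q2][a])

assert distance(0, 1) == 0
assert distance(0, 2) == 2
```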

We now consider the complete graph on $Q$ with edge weights given by $d$. A minimum spanning tree of this graph relates each state to one of its closest neighbors, in terms of shared transitions.

Let's pick a root $q_r$ and call $\mathrm{parent}: Q\rightarrow Q$ the function that associates each state to its parent in the minimum spanning tree rooted at $q_r$.
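
Since the graph is complete, Prim's algorithm is a natural fit for building the spanning tree. A sketch (the function and its names are my own, not taken from the original implementation):

```python
def spanning_tree_parents(states, dist, root):
    """Prim's algorithm on the complete graph over `states`; returns the
    parent of every non-root state in a minimum spanning tree rooted at
    `root`. Runs in O(n^2), which is fine for a complete graph."""
    parent = {}
    # best[q] = (cost, closest tree node found so far for q)
    best = {q: (dist(root, q), root) for q in states if q != root}
    while best:
        q = min(best, key=lambda s: best[s][0])  # cheapest state to attach
        parent[q] = best.pop(q)[1]
        for s in best:  # relax the remaining states against the new node
            c = dist(q, s)
            if c < best[s][0]:
                best[s] = (c, q)
    return parent

# With |i - j| as a stand-in distance, the tree is a path from the root.
assert spanning_tree_parents([0, 1, 2, 3], lambda a, b: abs(a - b), 0) \
    == {1: 0, 2: 1, 3: 2}
```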

For each state $q$ we can now define:

- the fallback state $f_q = \mathrm{parent}(q)$, and
- the partial function $\delta_q$, restricted to the symbols $a$ for which $\delta(q,a) \neq \delta(\mathrm{parent}(q),a)$.
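
Putting these definitions together, the decomposition can be sketched as follows (names are hypothetical; `delta` is a nested dictionary mapping states to rows):

```python
def decompose(delta, sigma, parent, root):
    """Keep, for each state, only the transitions that differ from its
    parent's full row; the parent becomes the fallback state."""
    partial = {root: dict(delta[root])}  # the root keeps a complete row
    fallback = {}
    for q, p in parent.items():
        partial[q] = {a: delta[q][a] for a in sigma
                      if delta[q][a] != delta[p][a]}
        fallback[q] = p
    return partial, fallback
```

Note that each state is compared against its parent's original row, not against the parent's partial function: lookups resolve recursively through the fallback chain, so comparing against the full row is exactly what the recovery equation requires.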

Picking a root and splitting trees

The method above already maximizes sharing between transition functions. However, it does nothing to minimize lookup length.

In the worst case, lookup might start from one leaf of the tree and try all nodes on the branch to the root before succeeding. So we would like to minimize the length of the branches.

Various heuristics can be useful to choose the root, for instance:

- minimizing the height of the resulting tree (the maximum distance from the root to a leaf), or
- minimizing the sum of the distances from the root to all other nodes.

Both of these metrics can be computed easily. When evaluated successively along a branch of the tree, they form a parabola: decreasing as we get closer to the optimal node, increasing afterwards. So a simple traversal lets us pick the best candidate root.
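
As an illustration of the idea, here is a brute-force variant that evaluates the height at every node and keeps the best one (the clever single-traversal optimization is omitted; names are mine):

```python
from collections import defaultdict

def best_root(parent):
    """Return a node minimizing the height of the tree rooted there,
    i.e. a center of the (undirected) spanning tree."""
    adj = defaultdict(list)
    for q, p in parent.items():
        adj[q].append(p)
        adj[p].append(q)

    def height(root):
        # Breadth-first search: the number of levels below `root`.
        seen, frontier, h = {root}, [root], -1
        while frontier:
            h += 1
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in seen:
                        seen.add(v)
                        nxt.append(v)
            frontier = nxt
        return h

    return min(adj, key=height)

# On a path 0-1-2-3-4, the middle node minimizes the height.
assert best_root({1: 0, 2: 1, 3: 2, 4: 3}) == 2
```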

Finally, after picking a root, we might still be unhappy with the height of the tree. In this case, we can simply split into two trees (at the cost of losing some sharing), and repeat the process on each tree.

These are all greedy heuristics. They give a good but not optimal decomposition. Computing a minimum-diameter minimum spanning tree seems to be NP-hard (and some variants are APX-hard), so I did not spend much time on this, and the results were good enough.

Practical implementation

My final implementation uses both a default state and fallback states: the root of the tree has a default state and the children fall back to their parent.

Furthermore, the sharing algorithm is not applied to all states at once. Rather, we group states by their default state (the most common target of their transition function). Then we compute a tree for each group of states with the same default.
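
Grouping by default state can be sketched as follows (a hypothetical helper; `delta` is a nested dictionary as before):

```python
from collections import Counter

def group_by_default(delta):
    """Group states by the most common target of their transition row;
    that target becomes the group's shared default state."""
    groups = {}
    for q, row in delta.items():
        default = Counter(row.values()).most_common(1)[0][0]
        groups.setdefault(default, []).append(q)
    return groups

delta = {
    0: {"a": 1, "b": 1, "c": 2},  # most transitions go to state 1
    1: {"a": 2, "b": 2, "c": 0},  # most transitions go to state 2
    2: {"a": 2, "b": 0, "c": 2},  # most transitions go to state 2
}
assert group_by_default(delta) == {1: [0], 2: [1, 2]}
```

The spanning-tree construction from the previous section then runs independently inside each group.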

In my tests, fallback states led to 60% fewer transitions than a default state alone. And since the vectors were much sparser, the packing heuristic performed better (though I did not measure the overhead precisely).