Even more compact lexer table
Some time ago, I blogged about the representation of lexer table. This post introduced a common scheme originally described in the Dragon Book together with a practical implementation.
In a footnote, I mentioned that I thought that the pseudocode of the Dragon Book was wrong:
 The traditional compacting scheme encodes transitions as a pair of default state and a sparse vector; the default state is the most common target, and transitions that target it are not explicitly represented.
 By recursively calling
nextState
, the code did not interpret the default state as a target but rather as a fallback state: if the transition is not part of the sparse vector, look again for the same transition in the default state.
But recently, I worked on a project which exhibits a variant of the problem for which the Dragon Book typo might be beneficial.
Namely:
 many states have a lot of transitions in common (for most of the alphabet, they have the same target states, only a few cases differ)
 lookup performance is not critical
 the alphabet and the automaton are quite large
Given these constraints, it is worth spending some time on making the table more compact, at the cost of slightly worse lookups.
This opens up an interesting question: how to construct a compact table with this encoding? How to pick good fallback states?
Finding common transitions and fallback states
Let’s assume we have a set of states $Q$ and an alphabet $\Sigma$ and a transition function $\delta : Q \times \Sigma \rightarrow Q$.
For each state $q$, we want to decide whether to represent the function $\delta(q,\_)$ using either:
 a total function $\delta_q : \Sigma \rightarrow Q$
 a pair of a partial function $\delta_q : \Sigma \rightharpoonup Q$ and a fallback state $f_q : Q$
The original function $\delta$ can be recovered from this decomposition using:
Let’s define a distance function $d(q_1,q_2)$ on states:
The distance function measures the number of transitions in which two states disagree.
We now consider the dense graph on $Q$ with weights defined by $d$. A minimum spanning tree of this graph relates each state to one of its closest neighbors, in terms of shared transitions.
Lets pick a root $q_r$ and call $\mathrm{parent}: Q\rightarrow Q$ the function that associates a state to its parent in the minimum spanning tree rooted at $q_r$.
For each state $q$ we can now define:

$\delta_q(a) = \delta(q, a)$, if $q$ = $q_r$ (the transition function of the root is total)

$\delta_q(a) = \delta(q,a)$, if $\delta(q,a) \neq \delta(\mathrm{parent}(q), a)$ (the transition function of a child is defined when it differs from the parent)

$f_q = \mathrm{parent}(q)$ (the fallback state is the parent)
Picking a root and splitting trees
The method above already maximize sharing between transition functions. However it does not try to do anything to minimize lookup length.
In the worst case, lookup might start from one leaf of the tree and try all nodes on the branch to the root before succeeding. So we would like to minimize the length of the branches.
Various heuristics can be useful to choose the root:
 A first approximation is to pick a node that is in the middle of a tree. This will minimize the length of the maximum branch (worst lookup will be half the diameter of the tree).
 Assuming that all states are equally likely to be looked up, we can also pick a root that minimizes the depth of nodes.
Both of these metrics can be computed easily. When successively evaluated along the branches of the tree, they form a parabola: decreasing as we get closer to the optimal node, increasing after. So a simple traversal lets us pick the candidate root.
Finally, after picking a root, we might still be unhappy with the height of the tree. In this case, we can simply split into two trees (at the cost of losing some sharing), and repeat the process on each tree.
These are all greedy heuristics. They give a good but not optimal decomposition. The minimum diameter minimum spanning tree seems to be NPhard in practice (and some variants APXhard), so I did not spend much time on that (and the results were good enough).
Practical implementation
My final implementation uses both a default state and fallback states: the root of the tree has a default state and the children fall back to their parent.
Furthermore, the sharing algorithm is not applied to all states at once. Rather, we group states by their default state (the most common target of their transition function). Then we compute a tree for each group of states with the same default.
On my test, fallback states lead to 60% less transitions than default state alone. And as the vector where much more sparse, the packing heuristic performed better (though I did not measure the overhead precisely).