Enumerating Context-Free Languages and Minimizing Regular Expressions

As I work on applying machine learning to combinatorial-optimization problems such as compiler optimization, one vastly simplified instance of this class of problems is learning to minimize regular expressions. The problem is to learn a function that takes a regular expression as input and outputs a minimal (shortest) equivalent regular expression describing the same language. Since this is machine learning, a good starting point is a dataset of regular expressions paired with their minimal equivalents that can be used directly for supervised learning. To that end, this post describes my approach to generating such a dataset.

The key steps are:

  1. Enumerating all regular expressions up to a certain length $n$.
  2. Finding the minimal equivalent for each regular expression by leveraging DFA minimization and hashing.

Background on Regular Expressions and Equivalence

A regular expression over an alphabet $\Sigma$ is a symbolic representation of a regular language, using operations like concatenation, union, and Kleene star. For example, the expression $(a|b)^*$ represents the language of all strings consisting of any number of $a$’s and $b$’s.

Mathematically, the set of all regular expressions over $\Sigma$ can be recursively defined as follows:

  • The empty set $\emptyset$, the empty string $\epsilon$, and any single character $a \in \Sigma$ are regular expressions.
  • If $r_1$ and $r_2$ are regular expressions, then $r_1r_2$ (concatenation), $r_1 | r_2$ (union), and $r_1^*$ (Kleene star) are also regular expressions.
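
This recursive definition maps naturally onto a small abstract syntax tree. As a rough sketch of how it could be represented in Python (the class names here are illustrative only, not the ones used in my actual code):

import dataclasses
from typing import Union

@dataclasses.dataclass(frozen=True)
class Empty:        # the empty set ∅
    pass

@dataclasses.dataclass(frozen=True)
class Epsilon:      # the empty string ε
    pass

@dataclasses.dataclass(frozen=True)
class Char:         # a single character a ∈ Σ
    char: str

@dataclasses.dataclass(frozen=True)
class Concat:       # r1 r2
    left: "Regex"
    right: "Regex"

@dataclasses.dataclass(frozen=True)
class Alt:          # r1 | r2
    left: "Regex"
    right: "Regex"

@dataclasses.dataclass(frozen=True)
class Star:         # r*
    inner: "Regex"

Regex = Union[Empty, Epsilon, Char, Concat, Alt, Star]

any_ab = Star(Alt(Char("a"), Char("b")))  # the expression (a|b)*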

The equivalence of two regular expressions $r_1$ and $r_2$ means that they describe the same language:

\[L(r_1) = L(r_2)\]

That is, the set of strings accepted by $r_1$ is identical to that accepted by $r_2$. Unlike general program equivalence (which is undecidable), regular expression equivalence is decidable, making it a good candidate for minimization tasks.
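For example, $(a|b)^*$ and $(a^*b^*)^*$ both denote the set of all strings over $\{a, b\}$, so the two expressions are equivalent even though they are written quite differently.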

The remainder of this post walks through these two steps in turn.


Step 1: Enumerating Regular Expressions

The first step is to systematically generate all regular expressions up to a certain length $n$. This is a challenging combinatorial problem because the space of regular expressions grows exponentially with length.

To efficiently enumerate these expressions, we can represent them using a context-free grammar (CFG). A CFG provides a formal mechanism to define the structure of regular expressions through production rules. For instance, a simplified CFG for regular expressions could look like this:

\[S \rightarrow S + S \, | \, SS \, | \, S^* \, | \, (S) \, | \, a \, | \, b\]

where $S$ is a non-terminal symbol representing a regular expression, and $a, b$ are terminal symbols (characters from the alphabet). Note that $+$ denotes regex union here (written $r_1 | r_2$ above); this avoids confusion with the $|$ that separates the grammar's production alternatives.
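
For example, the expression $(a + b)^*$ can be derived from this grammar as follows:

\[S \Rightarrow S^* \Rightarrow (S)^* \Rightarrow (S + S)^* \Rightarrow (a + S)^* \Rightarrow (a + b)^*\]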

The key insight is that we can enumerate all regular expressions up to a fixed length by expanding these CFG rules recursively. This technique is based on Berstel and Brzozowski (2012), which treats the set of regular expressions as a context-free language and provides a framework for enumerating it.

Formally, let $G = (V, \Sigma, P, S)$ be a CFG, where:

  • $V$ is the set of non-terminal symbols,
  • $\Sigma$ is the set of terminal symbols (our alphabet),
  • $P$ is the set of production rules,
  • $S$ is the start symbol.
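
For the simplified regular-expression grammar above, this instantiates as $V = \{S\}$, with the terminal symbols being the alphabet characters $a, b$ together with the operator and parenthesis symbols $+, *, (, )$; $P$ is the set of productions listed earlier, and $S$ is the start symbol.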

The goal is to generate all strings in $L(G)$ (the language of the grammar) that have a length $\leq n$. This is essentially done by recursively applying the production rules until we reach strings of terminal symbols.

Here’s the Python code implementing this recursive expansion:

import dataclasses
import itertools
from typing import Iterator, List, Tuple

NonTerminal = str                                 # non-terminals are named by strings
Production = Tuple[NonTerminal, Tuple[str, ...]]  # (head, right-hand-side symbols)
String = Tuple[str, ...]                          # a derived string of terminal symbols
# `EnumerateCFGInfo` (a preprocessed view of the grammar, whose productions are stored
# as (rule, weight, num_non_terminals) tuples with `weight` counting the rule's terminal
# symbols) and `partition(k, m)` (all splits of k into m non-negative parts) live elsewhere.

@dataclasses.dataclass(frozen=True)
class CFG:
    start: NonTerminal
    productions: List[Production]

def enumerate_cfg(cfg_info: "EnumerateCFGInfo", size: int) -> Iterator[String]:
    """Enumerates all strings of the grammar with exactly `size` terminal symbols."""
    def expand_rec(symb: NonTerminal, size: int) -> Iterator[String]:
        if size == 0:
            if symb in cfg_info.empty_non_terminals:  # symb can derive the empty string
                yield tuple()
            return
        for rule, weight, num_non_terminals in cfg_info.productions[symb]:
            rem_size = size - weight  # size left over for the rule's non-terminals
            if rem_size >= 0:
                # Distribute the remaining size over the rule's non-terminals in all ways.
                for ns in partition(rem_size, num_non_terminals):
                    sizes = iter(ns)
                    factors = [expand_rec(s, next(sizes)) if s in cfg_info.productions
                               else ((s,),)  # a terminal derives only itself
                               for s in rule]
                    for parts in itertools.product(*factors):
                        # Concatenate the pieces derived from each symbol of the rule.
                        yield tuple(itertools.chain.from_iterable(parts))
    return expand_rec(cfg_info.grammar.start, size)

This code enumerates all strings of exactly a given size; iterating the size from $0$ to $n$ yields all regular expressions of length up to $n$ generated by the grammar $G$.
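
The `partition(k, m)` helper used above yields every way of splitting a size budget $k$ across $m$ non-terminals. Its implementation is not shown in the snippet; one straightforward version (a sketch of the behavior I assume, not necessarily the exact code I used) is:

from typing import Iterator, Tuple

def partition(total: int, num_parts: int) -> Iterator[Tuple[int, ...]]:
    """Yields every tuple of `num_parts` non-negative integers that sums to `total`."""
    if num_parts == 0:
        if total == 0:
            yield tuple()
        return
    for first in range(total + 1):
        for rest in partition(total - first, num_parts - 1):
            yield (first,) + rest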


Step 2: Minimizing Regular Expressions Using DFA and Hashing

Once we can enumerate regular expressions, the next step is to find the minimal equivalent for each expression. This is done by:

  1. Converting the regular expression into a DFA (deterministic finite automaton), which is a formal model for recognizing regular languages.
  2. Minimizing the DFA to obtain the smallest automaton that recognizes the same language. DFA minimization merges indistinguishable states, and by the Myhill-Nerode theorem the resulting minimal DFA is unique (up to isomorphism) for each regular language.
  3. Hashing the minimized DFA to uniquely identify the language of the expression.

Because the minimal DFA is unique for each language, hashing it gives equivalent regular expressions (those describing the same language) the same hash. In practice, I used the PADS library for handling DFAs and regular languages; my fork of the library adds support for hashing the DFAs.
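
To make the hashing step concrete: since the minimal DFA for a language is unique up to isomorphism, hashing a canonical form of the minimal DFA identifies the language. Below is a rough sketch of one way such a hash could be computed; the DFA interface used here (`start`, `alphabet`, `transition`, `is_accepting`) is assumed for illustration and is not the actual PADS API.

from collections import deque

def canonical_dfa_hash(dfa) -> int:
    """Hashes a complete, minimal DFA up to isomorphism."""
    # Renumber states in BFS order from the start state, visiting outgoing edges in
    # sorted alphabet order; isomorphic DFAs then produce identical transition tables.
    order = {dfa.start: 0}
    queue = deque([dfa.start])
    table = []
    while queue:
        state = queue.popleft()
        row = []
        for symbol in sorted(dfa.alphabet):
            nxt = dfa.transition(state, symbol)
            if nxt not in order:
                order[nxt] = len(order)
                queue.append(nxt)
            row.append(order[nxt])
        table.append((tuple(row), dfa.is_accepting(state)))
    return hash(tuple(table))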

Here is the Python-style pseudocode for this algorithm:

def generate_dataset(max_length, alphabet):
    minimals = {}  # Minimal expressions, keyed by the hash of their minimized DFA.
    # Expressions are enumerated in order of non-decreasing length, so the first
    # expression seen for each language is a minimal one.
    for expr in enumerate_regular_expressions(max_length):
        M = convert_to_dfa(expr)   # Convert the regular expression to a DFA.
        h = hash(minimize_dfa(M))  # Hash of the minimized DFA; identifies the
                                   # language (assuming no hash collisions).
        if h in minimals:
            m = minimals[h]        # Retrieve the already stored minimal expression.
        else:
            minimals[h] = expr     # First, hence minimal, expression for this language.
            m = expr

        yield expr, m  # Return the pair of original and minimal expression.

Conclusion

By combining CFG-based enumeration with DFA minimization, we can generate a dataset of regular expressions (up to a specified length limit) and their minimal equivalents. Of course, the size of the dataset grows exponentially with the maximum length, so a machine learning approach that relies on such a dataset is likely only applicable to toy problems involving short regular expressions. Nonetheless, I found this approach to be a fun exercise in formal language theory and automata theory, and it provides a good starting point for exploring more complex problems in the future.




If you found this useful, please cite this as:

Yang, Cambridge (Dec 2021). Enumerating Context-Free Languages and Minimizing Regular Expressions. https://people.csail.mit.edu/camyang.

or as a BibTeX entry:

@article{yang2021enumerating-context-free-languages-and-minimizing-regular-expressions,
  title   = {Enumerating Context-Free Languages and Minimizing Regular Expressions},
  author  = {Yang, Cambridge},
  year    = {2021},
  month   = {Dec},
  url     = {https://people.csail.mit.edu/camyang/blog/2021/enumerating-cfgs-and-minimizing-regular-expressions/}
}


