Basic Theory
Theoretical stuff gives everyone a headache, so I'll keep it minimal and just cover a few points.
Regular Expressions
Regular expressions are similar to the ones we commonly use, but for simplicity we only use three operators: $\cdot$, $*$ and $|$. In fact, most other operators' functionality can be replaced by these three operators1.
- $\cdot$ represents concatenation of two regular strings, like $a\cdot b$ matches ab, $\cdot$ are usually omitted by convention.
- $|$ represents union, like $a|b$ can match either $a$ or $b$
- $*$ represents zero or more repetitions, like $a*$ can match $\epsilon$, $a$, $aa$, $aaa$ etc., where $\epsilon$ represents empty string.
By convention, $*$ has highest precedence, followed by $\cdot$, and $|$ has lowest precedence. Of course we also have $()$ to indicate priority. Obviously:
- $\cdot$ and $|$ are associative
- $\cdot$ is distributive over $|$, like $a(b|c)\leftrightarrow ab|ac$
- $|$ is commutative
- $*$ is idempotent, $(a*)* \leftrightarrow a*$
- $\epsilon|a\leftrightarrow a$,$\epsilon \cdot a\leftrightarrow a$,$a \cdot \epsilon \leftrightarrow a$
Regular Grammar Productions
A production is something like this:
$$ A \rightarrow \epsilon $$ or $$ A \rightarrow abc $$ or $$ B \rightarrow aA $$ or $$ aC \rightarrow aAbBc $$ or $$ aAb \rightarrow Bac $$ or $$ A \rightarrow a|b* $$
The meaning is that the thing on the left represents the thing on the right. By convention we use lowercase letters to represent actual letters (called terminals) and uppercase letters to represent symbols that stand for other letters (called non-terminals).
Just having an intuitive understanding is enough - the formal definition just gives headaches without much value.
Regular Grammar
A regular grammar2 is:
A regular grammar is a set of productions3 that satisfy these conditions:
- The left side has exactly one non-terminal
- The right side is:
- $\epsilon$
- A terminal
- A terminal $\cdot$ a non-terminal
These productions are clearly convenient for substitution and later converting to NFA.
Practice
In practice, the algorithms in textbooks are too theoretical and have many issues in actual implementation. A lot of considerations are needed when implementing yourself.
Let's analyze with my code. The code is written in Kotlin, project at here.
Regular Expression
First we abstract the elements in regular expressions as an interface4:
interface RegexComponent
So what should implement this interface?
Letters
abstract class Character(private val name: String) : RegexComponent, Comparable<Character>
class NonTerminalCharacter(name: String) : Character(name)
open class TerminalCharacter(name: String) : Character(name)
object nullCharacter : TerminalCharacter("ε")
Results of Operations Between Regular Expression Elements
class Concated(val components: List<RegexComponent>) : RegexComponent
class Optioned(val components: Set<RegexComponent>) : RegexComponent
class Repeated(val toRepeat: RegexComponent) : RegexComponent
Initially I was uncertain about where to put the business logic for various operations, because while I wanted to maximize use of distributive and associative laws to simplify later work, this inevitably required runtime type checking, which feels like an anti-pattern since Replace Conditional with Polymorphism is typically recommended. However polymorphism actually makes this problem harder and would cause Duplicated Code, like when applying distributive law over $|$ we need to check for Optioned type separately in Character, Repeated and Concated. If we try using Polymorphism to add a property to all RegexComponents to return what should be distributed out, the code would unnecessarily increase and I couldn't even name this property properly (which usually indicates introducing it is not a good idea).
I finally decided to just put all business logic in one place, since this kind of academic problem won't have changing requirements anyway.
So I made a factory:
object RegexComponentFactory {
fun concated(component1: RegexComponent, component2: RegexComponent): RegexComponent
fun optioned(component1: RegexComponent, component2: RegexComponent): RegexComponent
fun repeated(component: RegexComponent): RegexComponent
// ...
}
Since it's a factory, might as well add methods for constructing RegexComponent from strings:
/**
* Add omitted '.' operators
*/
private fun addOmittedDotOperator(str: String): String {
val theString = StringBuilder()
for ((index, ch) in str.withIndex()) {
theString.append(ch)
if (index != str.length - 1) {
if ((ch.isLowerCase() || ch.isDigit() || ch == ')') &&
(str[index + 1].isLowerCase() || str[index + 1].isDigit() || str[index + 1] == '(') ||
(ch == '*' && !isRegexOperator(str[index + 1]) && str[index + 1] != ')')) {
theString.append('.')
}
}
}
return theString.toString()
}
/**
* Actual construction process
*/
private fun fromFormalString(string: String): RegexComponent {
if (string == "")
return nullCharacter
try {
return when {
string.startsWith('(') && pairedRightBracketIndex(string) == string.length - 1 ->
fromFormalString(eraseUselessBracketPairs(string))
string.length == 1 ->
TerminalCharacter(string)
string.firstLayerContain('|') ->
string.splitFirstLayerBy('|').map { fromFormalString(it) }.reduce { acc, regexPart -> acc or regexPart }
string.firstLayerContain('.') ->
string.splitFirstLayerBy('.').map { fromFormalString(it) }.reduce { acc, regexPart -> acc concat regexPart }
string.endsWith('*') ->
if (string.startsWith('(')) {
fromFormalString(string.slice(1 until string.length - 2)).repeat()
} else {
fromFormalString(string.slice(0 until string.length - 1)).repeat()
}
else ->
throw IllegalArgumentException("Can not construct from string $string")
}
} catch (_: StringIndexOutOfBoundsException) {
throw IllegalArgumentException("Can not construct from string $string")
} catch (_: IllegalArgumentException) {
throw IllegalArgumentException("Can not construct from string $string")
}
}
/**
* Construct RegexComponent from string
*/
fun fromString(string: String): RegexComponent {
return fromFormalString(addOmittedDotOperator(string))
}
Some helper functions used above can be found in Tools.kt
:
/**
* Check if @arg char is a regex operator
*/
fun isRegexOperator(char: Char) = char == '|' || char == '*' || char == '.'
/**
* Find matching ')' for a '('
*/
fun pairedRightBracketIndex(str: String, leftBracketIndex: Int = 0)
/**
* Remove useless brackets pairs from string sides
*/
fun eraseUselessBracketPairs(str: String): String
/**
* Check if a string contains char at first layer
* "First layer" means outside all brackets
*/
fun String.firstLayerContain(char: Char): Boolean
/**
* Split first layer by @arg char
*/
fun String.splitFirstLayerBy(char: Char): List<String>
At this point, we can make the computer "understand" a regex from our input string.
Now let's look at the grammar side.
Regular Grammar
For regular grammar, first we need productions:
class Generator(
val from: NonTerminalCharacter,
val to: RegexComponent
)
Our grammar follows the regular grammar from the textbook:
class Grammar(
val nonTerminals: Set<NonTerminalCharacter>,
val terminals: Set<TerminalCharacter>,
val rules: Set<Generator>,
val start: NonTerminalCharacter
) {
// We know we can initialize with some Generators
constructor(rles: Collection<Generator>) : this(
rles.map { it.from }.toSet(),
rles.map { it.alphabet }.reduce { acc, set -> acc + set },
rles.toSet(),
rles.minBy { it.from }!!.from
)
}
Conversion
Regular Expression $\rightarrow$ Regular Grammar
First we construct a production with a new non-terminal on the left and the regex to convert on the right.
For any production, we may face these situations:
-
The production's right side has outermost $\cdot$ operation In this case, we need to consider: - If the right side ends with a non-terminal, like $A → a ⋅ b ⋅ B$ or $A → ( a | b ) ⋅ B$ or $A → a ∗ ⋅ B$, we need to analyze the type of expression left of this non-terminal: - If this expression has outermost $\cdot$ operation, we can directly combine the rightmost two letters: $$ C → b ⋅ B $$ Then add the front part: $$ A → a ⋅ C $$ - We won't encounter cases where this expression has outermost $|$ operation because we already distributed $|$ operator using distributive law5 - If this expression has outermost $*$ operation, we can construct: $$ A → a A $$ and $$ A → B $$ - Otherwise, it's an expression like $$ A → a ⋅ b ⋅ c $$ We can convert it to $$ A → a ⋅ B $$ $$ B → b c $$
-
The production's right side has outermost $|$ operation, like $$ A → a | b | c $$ Just convert to $$ A → a $$ and $$ A → b $$ and $$ A → c $$
-
The production's right side has outermost $*$ operation, like $$ A → a ∗ $$ This is simplest, just convert to: $$ A → ϵ $$ and $$ A → a ⋅ A $$
We just need to repeatedly apply these rules to productions that don't yet conform to regular grammar until no production can be further simplified.
At this point the result is not completely regular grammar yet, as there are productions like $A\rightarrow B$ that directly derive to another non-terminal.
We can replace all $A$ in other productions with $B$ and remove this production.
Also remove any unused productions.
After this we've completed converting regex to regular grammar.
Regular Grammar $\rightarrow$ Regular Expression
To convert a regular grammar to regex, just follow this algorithm:
- Combine productions with same left side but different right sides using $|$ Like $B\rightarrow b$ and $B → b ⋅ A$ and $B → b ⋅ B$ becomes $B\rightarrow a|b\cdot A|b\cdot B$
- Convert productions where left side appears in right side to ones where it doesn't, like: $A → a ⋅ A$ becomes $A\rightarrow a*$ $B → a | b ⋅ A | b ⋅ B$ becomes $B \rightarrow b* \cdot a|b* \cdot b \cdot A$
- Take a production whose left side is not the grammar start symbol, replace its left side non-terminal in other productions with its corresponding string
- Repeat 3. until only one production remains.
- The right side of this production is what we want.
See code for details.
After defining these three operators, empty string $\epsilon$ and empty set $\varnothing$ on a finite alphabet, we get something magical: Kleene algebra
We're actually talking about right-linear grammar here, there's also left-linear grammar, both called regular grammars. We use right-linear grammar as example for simplicity
Technically regular grammar also includes alphabet, terminals, non-terminals etc., but these can all be derived from productions so they're not that important
This interface isn't just for regex parts, a better name would be "implements Kleene algebra" but that looks too deep...
But I still kept this branch in the code