Context-Free Grammar

What Regular Languages Cannot Do

For the following grammar:

A → 0A1
A → B
B → #

It is impossible to write a regular expression (or construct a finite automaton) that can recognize the language of this grammar.

This grammar is a context-free grammar that cannot be reduced to a regular grammar.

Formal Definition of Context-Free Grammar

A context-free grammar is defined by the following 4-tuple $(V, T, R, S)$ where:

  • V is the set of variables
  • T is the set of terminals
  • R is the set of grammar rules (A → 0A1 etc.)
  • S is the start variable, representing the whole language

Pushdown Automata

Similar to how finite state machines can recognize regular grammars, context-free grammars can be recognized by a type of state machine.

Looking at the grammar above, we can see that it generates a language where "there are equal numbers of 0s and 1s on either side of #".

We can push a 0 onto a stack when we "see" a 0, and pop when we see a 1. If the stack can be emptied at the end, then the string conforms to this grammar.

Formal Definition of Pushdown Automata

A pushdown automaton is defined by the following 6-tuple $(Q, \Sigma, S, \delta, q_0, F)$ where:

  • Q is the set of states
  • $\Sigma$ is the input alphabet
  • S is the stack alphabet
  • $\delta$ is the transition function, which takes a state, an input symbol and a stack top symbol (which can be empty) as parameters, outputs a new state and a new stack top symbol, representing the new state and stack operation when the machine receives certain input in a certain state with a certain stack top symbol. A (state, input symbol, stack top symbol) may correspond to multiple outputs
  • $q_0$ is the start state
  • F is the set of final or accepting states

Constructing a Pushdown Automaton to Recognize a Context-Free Grammar

The simplest method is to ignore states and keep only one state $q_0$, with all transitions happening on the stack.

  • For all variables A and their derivations A → X, regardless of whether X is a non-terminal (variable), terminal or any formula, we add a stack symbol representing X and define the value of $\delta$ at point A: $\delta(q, ε, A) := (q, X)$1.
  • For all terminals b, we define the value of $\delta$ at point b: $\delta(q, ε, b) := (q, ε)$.

If the resulting automaton can empty its stack, it means it can accept this language.

Deterministic Pushdown Automata and Deterministic Context-Free Grammar

There exists a subset of pushdown automata where in their transition functions, a (state, input symbol, stack top symbol) corresponds to only one output - these are called deterministic pushdown automata, and the grammars they can recognize are called deterministic context-free grammars.

This type of pushdown automata only needs to scan the input string once to determine if it can be accepted, indicating that parsing such grammars has linear complexity, which makes these grammars (especially several of their subsets) very important in the practical engineering of various parsers.

Determining if a Grammar is Deterministic

The determination method is called the DK-test. When studying LR(0) analysis, we learned how to construct LR(0) state machines, which should actually be called DK state machines.

If in this DK state machine, the states corresponding to accepting states in the original grammar:

  • Have one and only one production representing a completed match (i.e., ⋅ is at the end of the production)
  • None of the productions are of the form $B → X ⋅ a Y$, where a is a terminal

Then the grammar is deterministic, and any deterministic grammar can pass this DK-test.

1

X may not be a "letter", but you can assign a letter from the stack alphabet as a substitute for each X.