Skip to main content

Context Free Grammar

Context free grammar allows rules in the form of,

AxA \rightarrow x

Where ANA \in \mathcal{N}, x(V+N)x \in (\mathcal{V} + \mathcal{N})^*.

Leftmost and Rightmost Derivations

Because a derivation operation allows us to replace any substring that is applicable to a rule, which results in uncertainty, we must have a deterministic way to derivate a sentence.

The leftmost derivation lm\Rightarrow_{lm} force us to replace the leftmost non-terminal symbol in the sentence. The rightmost derivation rm\Rightarrow_{rm} force us to replace the rightmost non-terminal symbol in the sentence.

Because we only change the order of the derivation, if only leftmost derivation or only rightmost derivation is allowed, the grammar generates the same language as the original grammar without restriction.

For example,

S(S)SS()S \rightarrow (S) | SS | ()

Is a context free grammar that can generate paired parentheses.

Let's consider the sequence,

(()()())(()()())

There are many ways to generate this sequence, for example,

S(S)(SS)(S())(SS())(()S())(()()())S \Rightarrow (S) \Rightarrow (SS) \Rightarrow (S()) \Rightarrow (SS()) \Rightarrow (()S()) \Rightarrow (()()())

We didn't choose the non-terminal to replace on the leftmost side nor the rightmost side. Thus, it is neither leftmost nor rightmost derivation.

But the following sequence,

Slm(S)lm(SS)lm(()S)lm(()SS)lm(()()S)lm(()()())S \Rightarrow_{lm} (S) \Rightarrow_{lm} (SS) \Rightarrow_{lm} (()S) \Rightarrow_{lm} (()SS) \Rightarrow_{lm} (()()S) \Rightarrow_{lm} (()()())

Is a leftmost derivation sequence.

Of course, the leftmost derivation sequence isn't always unique, consider,

Slm(S)lm(SS)lm(SSS)lm(()SS)lm(()()S)lm(()()())S \Rightarrow_{lm} (S) \Rightarrow_{lm} (SS) \Rightarrow_{lm} (SSS) \Rightarrow_{lm} (()SS) \Rightarrow_{lm} (()()S) \Rightarrow_{lm} (()()())

However, since leftmost or rightmost derivation doesn't change the yielded language, we usually choose one so that it makes the analysis easier.

Parse Tree

Parse tree is another way to represent the derivation of a sentence. We can use parse tree to prove that leftmost and rightmost derivation has the same result as the original derivation.

The root of a parse tree is always the starting symbol.

The leaves of a parse tree is always the terminal symbols.

The internal nodes of a parse tree is always the non-terminal symbols.

If a non-terminal generates sequence vv, then each letter in vv is a child of the non-terminal node (in oreder).

For,

Slm(S)lm(SS)lm(SSS)lm(()SS)lm(()()S)lm(()()())S \Rightarrow_{lm} (S) \Rightarrow_{lm} (SS) \Rightarrow_{lm} (SSS) \Rightarrow_{lm} (()SS) \Rightarrow_{lm} (()()S) \Rightarrow_{lm} (()()())

We have the following parse tree,

Because the we can traverse through the tree with DFS in any order (left first or right first), so the leftmost and rightmost derivation has the same result, so is for derivation without restriction.

Ambiguity

If a sentence can be generated in more than one way using leftmost or rightmost derivation, then the grammar is ambiguous.

For sequence (()()())(()()()), we listed two different leftmost derivation,

Slm(S)lm(SS)lm(()S)lm(()SS)lm(()()S)lm(()()())S \Rightarrow_{lm} (S) \Rightarrow_{lm} (SS) \Rightarrow_{lm} (()S) \Rightarrow_{lm} (()SS) \Rightarrow_{lm} (()()S) \Rightarrow_{lm} (()()()) Slm(S)lm(SS)lm(SSS)lm(()SS)lm(()()S)lm(()()())S \Rightarrow_{lm} (S) \Rightarrow_{lm} (SS) \Rightarrow_{lm} (SSS) \Rightarrow_{lm} (()SS) \Rightarrow_{lm} (()()S) \Rightarrow_{lm} (()()())

Thus, the grammar,

S(S)SS()S \rightarrow (S) | SS | ()

Is ambiguous.

note

Ambiguous grammar can be defined in any of the following ways,

  • Existing a sequence that has more than one ways to derive with leftmost or rightmost derivation.
  • Existing a sequence that can be generated by more than one parse tree.

Please note that ambiguous grammar doesn't mean ambiguous language. The same language can be generated with an unambiguous grammar,

S(S)SϵS → ( S ) S | \epsilon

This is an unambiguous grammar.

tip

After the next chapter, you will see that this is actually an LL(1)LL(1) grammar that generates valid parentheses.

However, things aren't always so nice- some languages has it's inherent ambiguity. That all grammar yields it is ambiguous. For example,

0i1j2k  s.t.  i=jj=k0^i1^j2^k \; s.t. \; i = j \lor j = k

Is an inherently ambiguous language.

Chomsky Normal Form

CNF and Nullable CNF

The context free grammar has too little restriction, that it is hard to study. Instead, we use an equal form called Chomsky Normal Form (CNF).

CNF grammar enforce rules to be in either of

ABCAaA \rightarrow BC \\ A \rightarrow a

For CFG G\mathcal{G}, there always exist an CNF Gnf\mathcal{G}_{nf} that,

L(G){λ}=L(Gnf)\mathcal{L}(\mathcal{G}) - \{ \lambda \} = \mathcal{L}(\mathcal{G}_{nf})

We can also use nullable CNF, that is, we specifically allow,

SλS \rightarrow \lambda

And enforce no other rule has SS as the target. Then,

L(G)=L(Gnf  nullable)\mathcal{L}(\mathcal{G}) = \mathcal{L}(\mathcal{G}_{nf\;\text{nullable}})

Reduction into Chomsky Normal Form

To convert a CFG to CNF, we follow the given steps. The following step is logically valid and can be proven to have the listed properties on the generated language, but proving them is too trivial and grinding- so this note only shows you how to do it.

The reduction will yield a CNF that generates the same language but may be without the λ\lambda.

Please notice that the following steps must be done in order. Or else some non-CNF rules may exist after the steps.

Removing Empty Production

We find all,

AiλA_i \rightarrow \lambda

Then we remove all of them.

Removing Unit Production

A unit production is a rule of the form,

ABA \rightarrow B

That a non-terminal only generates a new non-terminal.

If,

ABBbA \rightarrow B \\ B \rightarrow b

We abridge ABA \rightarrow B to,

AbA \rightarrow b

This should be done recursively, until there is no more unit production.

Removing Dead Non-Terminal

Some non-terminal may never yield a terminal character. For example,

BbBB \rightarrow bB

And there are no rule,

BbB \rightarrow b'

Then BB is a dead non-terminal. We should just remove all rules that yields string that contains BB.

Removing Unreachable Non-Terminal

After our previous steps, some non-terminals may become out of reach from the start symbol. We simply remove them and any rules that contains them.

Decomposing Non-Terminal

For every rule that generates more than two symbols, for example,

AawA \rightarrow aw

Where aa can either be terminal or non-terminal, w(N+V)+w \in (\mathcal{N} + \mathcal{V})^+.

We decompose it to,

AA1A2A1aA2wA \rightarrow A_1A_2 \\ A_1 \rightarrow a \\ A_2 \rightarrow w

We repeat this process until the grammar is a CNF grammar.

Example

That was a hell lot of things to memorize. Let's look at an example,

SABAaAaλBCbCDDdDEdDEeES \rightarrow A | B \\ A \rightarrow aA | a | \lambda \\ B \rightarrow C | b \\ C \rightarrow D \\ D \rightarrow dDE | dD \\ E \rightarrow eE \\

Removing Empty Production

The only empty production is,

AλA \rightarrow \lambda

We simply remove it, so that,

SABAaAaBCbCDDdDEdDEeES \rightarrow A | B \\ A \rightarrow aA | a \\ B \rightarrow C | b \\ C \rightarrow D \\ D \rightarrow dDE | dD \\ E \rightarrow eE \\

Removing Unit Production

The unit production here includes,

SABBCCDS \rightarrow A | B \\ B \rightarrow C \\ C \rightarrow D \\

We can draw a tree based on the generated-by relation,

After abbreviating C,DC,D,

SABAaAaBCbCdDEdDDdDEdDEeES \rightarrow A | B \\ A \rightarrow aA | a \\ B \rightarrow C | b \\ C \rightarrow dDE | dD \\ D \rightarrow dDE | dD \\ E \rightarrow eE \\

Then B,CB,C,

SABAaAaBdDEdDbCdDEdDDdDEdDEeES \rightarrow A | B \\ A \rightarrow aA | a \\ B \rightarrow dDE | dD | b \\ C \rightarrow dDE | dD \\ D \rightarrow dDE | dD \\ E \rightarrow eE \\

Then S,BS,B,

SAdDEdDbAaAaBdDEdDbCdDEdDDdDEdDEeES \rightarrow A | dDE | dD | b \\ A \rightarrow aA | a \\ B \rightarrow dDE | dD | b \\ C \rightarrow dDE | dD \\ D \rightarrow dDE | dD \\ E \rightarrow eE \\

Then S,AS,A,

SaAadDEdDbAaAaBdDEdDbCdDEdDDdDEdDEeES \rightarrow aA | a | dDE | dD | b \\ A \rightarrow aA | a \\ B \rightarrow dDE | dD | b \\ C \rightarrow dDE | dD \\ D \rightarrow dDE | dD \\ E \rightarrow eE \\

Removing Dead Non-Terminal

Here, D,ED,E never yields a fully terminal string, so we remove all rules that contains D,ED,E.

SaAadDbAaAaBbS \rightarrow aA | a | dD | b \\ A \rightarrow aA | a \\ B \rightarrow b \\

Removing Unreachable Non-Terminal

Now we draw a graph based on the generated-by relation of all non-terminals.

We can see there is no path from SS to BB, so we remove BB and all the rules that contains it.

SaAadDbAaAaS \rightarrow aA | a | dD | b \\ A \rightarrow aA | a \\

Decomposing Non-Terminal

Now we look at all the rules that go against CNF, that is,

SaAdDAaAS \rightarrow aA | dD \\ A \rightarrow aA \\

We break it by adding intermediate non-terminals,

SSaASdDAAaAAaaSaaSddS \rightarrow S_aA | S_dD \\ A \rightarrow A_aA \\ A_a \rightarrow a \\ S_a \rightarrow a \\ S_d \rightarrow d \\

Now this is the CNF of the original grammar.

Later, our analysis will be basedon the CNF form of the CFG.