Skip to main content

LL(1) Grammar

For some unambiguous language, there is a way that we can scan the sentence from left to right one by one, and find the only possible derivation based on the next one character (if there exists two, it's ambiguous). This is called an LL(1)LL(1) grammar. The first LL stands for left starting. The second LL stands for leftmost derivation. 11 means that the grammar only need to look ahead of one symbol to know which rule to apply.

This part, we will focus on how we can test wether a CFG is LL(1)LL(1), and how we can build a parse tree for an LL(1)LL(1) language.

Testing a CFG for LL(1)LL(1)

Eliminate Left Recursion

Left recursion refers to the rules in form of,

AAaA \rightarrow Aa

Or there may exists indirect left recursion,

ABbBAaA \rightarrow Bb \\ B \rightarrow Aa

If AAxA \Rightarrow Ax can be inferred from the given rule set, the given rule set exists left recursion.

All CFG that has left recursion can't be LL(1)LL(1). However, we can eliminate left recursion by rewriting. That is,

Rewrite Left Recursion

AAαβA \rightarrow A \alpha | \beta

To eliminate the left recursion,

  1. Create a new non-terminal (let’s call it AA') to handle the recursive part.
  2. Reformulate the rules so that the recursive calls happen in the right-hand side and are postponed.

That is, making it, into:

AβAA \rightarrow \beta A’ AαAλA’ \rightarrow \alpha A’ | \lambda

So that it is right recursion instead of left recursion.

Flatten Indirect Left Recursion

For indirect left recursion, we can flat them into direct left recursion. That is, for,

EiAixiziA0A1x1y1A1A2x2y2AnA0x0y0E_i \rightarrow A_ix'_i | z_i \\ A_0 \rightarrow A_1x_1 | y_1 \\ A_1 \rightarrow A_2x_2 | y_2 \\ \ldots \\ A_{n} \rightarrow A_0x_0 | y_0

We should flatten this by iteratively substitution. That is, we can replace,

E1A1x1z1A0A1x1y1A1A2x2y2E_1 \rightarrow A_1x'_1 | z_1 \\ A_0 \rightarrow A_1x_1 | y_1 \\ A_1 \rightarrow A_2x_2 | y_2 \\

With,

A0A2x2x1y2y1y1E1A2x2x1y2x1z1A_0 \rightarrow A_2x_2x_1 | y_2y_1 | y_1 \\ E_1 \rightarrow A_2x_2x'_1 | y_2x'_1 | z_1 \\

This can be done iteratively until,

A0A0x0xnx1A_0 \rightarrow A_0 x_0 x_n \ldots x_1 | \ldots

So that there is no indirect left recursion.

Example

SACAAbCaaCAefS \rightarrow A | C \\ A \rightarrow A b | C a | a \\ C \rightarrow A e | f

We first flatten the indirect left recursion caused by,

ACaCAeA \rightarrow Ca \\ C \rightarrow A e

So, we can eliminate CC,

SAAefAAbAeafaS \rightarrow A | Ae | f \\ A \rightarrow A b | A e a | f | a \\

Now there is no indirect left recursion.

Then we eliminate left recursion,

SAAefAfAaAfAaAAbAλAeAλS \rightarrow A | Ae | f \\ A \rightarrow fA' | aA' | fA'' | aA'' \\ A' \rightarrow bA' | \lambda \\ A'' \rightarrow eA'' | \lambda \\

This is the final grammar without left recursion.

Factor Common Prefixes

Sometimes we may have,

AxBxCA \rightarrow xB | xC

There are two targets with same prefix. We should factor them out. That is,

AxAABCA \rightarrow xA' \\ A' \rightarrow B | C

This isn't present in our example.

Calculate the FIRST Set of Every Non-Terminal

The FIRST set of a non-terminal is defined as,

FIRST(A)={0@vAv}\text{FIRST}(A) = \{0@v | A \Rightarrow^* v \}

If AλA \Rightarrow^* \lambda, then,

FIRST(A)={0@vAv}+{λ}\text{FIRST}(A) = \{0@v | A \Rightarrow^* v \} + \{\lambda\}

We also define the FIRST set of a terminal as,

FIRST(a)={a}\text{FIRST}(a) = \{a\}

For a sentence xx, the FIRST set is defined as,

FIRST(x)=FIRST(0@x)\text{FIRST}(x) = \text{FIRST}(0@x)

That is, for xx, FIRST(x)\text{FIRST}(x) contains the first character of every possible derivation of xx. For null sentence, we treat the first character as λ\lambda.

Because we have eliminated left recursion, getting FIRST set is simple- we try every path of the derivation and, once we have a terminal symbol at the beginning, we add it to the set and end this path.

To be more specific, for each rule, if it's in form of,

AawA \rightarrow aw

So that aa should be added to the FIRST set of AA.

Or,

ABwA \rightarrow Bw

So then every symbol in \text{FIRST}(B) - \{\lambda\}$$ should be added to the FIRST set of A$.

If BλB \Rightarrow^* \lambda and wλw \rightarrow^* \lambda, then we also add λ\lambda to the FIRST set of AA.

Sometimes, we also have,

AλA \rightarrow \lambda

We will add λ\lambda to the FIRST set of AA.

Because there is no left recursion, the second case won't case any problem.

For example,

SAAefAfAaAAbAλS \rightarrow A | Ae | f \\ A \rightarrow fA' | aA' \\ A' \rightarrow bA' | \lambda

This is a subset of our previous left recursion elimination example. Thus there is also no left recursion.

We usually draw a graph that denotes the generated-by relationship between non-terminals. If a non-terminal AA can be generated by BB, we draw an edge from BB to AA.

We ignore self-loops.

So,

Then we usually do this in the order of topological sort. So that if we need to use the FIRST set of a non-terminal to calculate the FIRST set of another, they must have already been calculated.

tip

The topological order of a set of nodes in a graph is a linear ordering of the nodes such that for every edge uvu\rightarrow v in the graph, uu comes before vv in the ordering.

In practice, we take the first node that doesn't have any incoming edge as the start of the ordering. Then we remove this node and repeat the process until all nodes have been visited.

For acyclic graphs, we can always do thi.

So we start with AA', because the only rules for AA' are,

AbAλA' \rightarrow bA' | \lambda

So FIRST(A)={b,λ}\text{FIRST}(A') = \{b, \lambda\}.

Then the AA. Obviously, we can have,

FIRST(A)={f,a}\text{FIRST}(A) = \{f, a\}

Lastly for SS, because SS yields non-terminals AA at the left, or terminal ff, so,

FIRST(S)={f}+FIRST(A)={f,a}\text{FIRST}(S) = \{f\} + \text{FIRST}(A) = \{f, a\}

Calculate the FOLLOW Set of Every Non-Terminal

The FOLLOW set of a non-terminal is defined as,

FOLLOW(A)={avI,v=xAay}\text{FOLLOW}(A) = \{ a | v \in \mathcal{I}, v = xAay\}

Where aa is the terminal symbol that is after AA. xx and yy are arbitrary symbols.

That is to say, the FOLLOW set of a non-terminal is all the possible terminal symbols that are after the non-terminal during the derivation.

Specially, we will always add S\sout{S} to the FOLLOW set of SS. This is a special character denoting the end of the sentence. Because empty input is accepted, SS can always have S\sout{S} in the FOLLOW set.

Or alternatively, if you don't what to remember such an abrupt rule, you can add the following grammar,

SSSSS' \rightarrow \sout{S}S\sout{S}

Use SS' as the new starting symbol, then everything will play out naturally.

To get the FOLLOW set, more practically, we check every rule.

If it's in form of,

AzBaxA \rightarrow zBax

Then we add aa to the FOLLOW set of BB (x,zx, z is any intermediate sentence and aa is a terminal symbol).

If it's in form of,

AzBCxA \rightarrow zBCx

Then we add all of FIRST(C)\text{FIRST}(C) to the FOLLOW set of BB.

If it's in form of,

AzBββλA \rightarrow zB\beta \quad \beta \Rightarrow^* \lambda

Then we add all of FOLLOW(A)\text{FOLLOW}(A) to the FOLLOW set of BB.

If,

AaAsββλA \rightarrow aA s\beta \quad \beta \Rightarrow^* \lambda

We ignore it because it doesn't contribute to the FOLLOW set of AA.

Specially, we will always add S\sout{S} to the FOLLOW set of SS.

We still use,

SAAefAfAaAAbAλS \rightarrow A | Ae | f \\ A \rightarrow fA' | aA' \\ A' \rightarrow bA' | \lambda

And do it in reversed topological order.

Previously, we have FIRST sets,

FIRST(S)={f,a}FIRST(A)={f,a}FIRST(A)={b,λ}\text{FIRST}(S) = \{f, a\} \\ \text{FIRST}(A) = \{f, a\} \\ \text{FIRST}(A') = \{b, \lambda\}

There are no rules generating SS, and because SS is the start symbol, so, FOLLOW(S)={S}\text{FOLLOW}(S)=\{\sout{S}\}.

For AA,

SASAeS \rightarrow A \\ S \rightarrow Ae

Thus, FOLLOW(A)=FOLLOW(S)+{e}={e,S}\text{FOLLOW}(A) = \text{FOLLOW}(S) + \{e\} = \{e, \sout{S} \}

For AA',

AfAaAAbAA \rightarrow fA' | aA' \\ A' \rightarrow bA'

Thus, FOLLOW(A)=FOLLOW(A)={e,S}\text{FOLLOW}(A') = \text{FOLLOW}(A) = \{e, \sout{S} \}.

Calculate the SELECT Set of Every Rule

The SELECT set of a rule AxA \rightarrow x is defined as,

SELECT(Ax)={FIRST(x)FOLLOW(A)if λFIRST(x)FIRST(x)if λ∉FIRST(x)\text{SELECT}(A \rightarrow x) = \begin{cases} \text{FIRST}(x) \cup \text{FOLLOW}(A) & \text{if } \lambda \in \text{FIRST}(x) \\ \text{FIRST}(x) & \text{if } \lambda \not\in \text{FIRST}(x) \end{cases}

SELECT(Ax)\text{SELECT}(A \rightarrow x) is all the possible terminal symbols that can exist in a final sentence that can be generated from the rule.

Previously, we had,

FIRST(S)={f,a}FOLLOW(S)={S}FIRST(A)={f,a}FOLLOW(A)={e,S}FIRST(A)={b,λ}FOLLOW(A)={e,S}\text{FIRST}(S) = \{f, a\} \\ \text{FOLLOW}(S) = \{\sout{S}\} \\ \text{FIRST}(A) = \{f, a\} \\ \text{FOLLOW}(A) = \{e, \sout{S}\} \\ \text{FIRST}(A') = \{b, \lambda\} \\ \text{FOLLOW}(A') = \{e, \sout{S}\}

And rules,

SAAefAfAaAAbAλS \rightarrow A | Ae | f \\ A \rightarrow fA' | aA' \\ A' \rightarrow bA' | \lambda

So,

SELECT(SA)=FIRST(A)={f,a}SELECT(SAe)=FIRST(A)={f,a}SELECT(AfA)=FIRST(fA)={f}SELECT(AaA)=FIRST(aA)={a}SELECT(AbA)=FIRST(bA)+FOLLOW(A)={b,e,S}SELECT(Aλ)=FIRST(λ)={λ}\text{SELECT}(S \rightarrow A) = \text{FIRST}(A) = \{f, a\} \\ \text{SELECT}(S \rightarrow Ae) = \text{FIRST}(A) = \{f, a\} \\ \text{SELECT}(A \rightarrow fA') = \text{FIRST}(fA') = \{f\} \\ \text{SELECT}(A \rightarrow aA') = \text{FIRST}(aA') = \{a\} \\ \text{SELECT}(A' \rightarrow bA') = \text{FIRST}(bA') + \text{FOLLOW}(A') = \{b, e, \sout{S}\} \\ \text{SELECT}(A' \rightarrow \lambda) = \text{FIRST}(\lambda) = \{\lambda\}

Test for LL(1)LL(1)

For every rule in form of,

Ax0x1xnA \rightarrow x_0 | x_1 | \ldots | x_n

If

SELECT(Ax0)SELECT(Ax1)SELECT(Axn)\text{SELECT}(A\rightarrow x_0) \cap \text{SELECT}(A\rightarrow x_1) \cap \ldots \cap \text{SELECT}(A\rightarrow x_n) \neq \emptyset

Then the grammar is not LL(1)LL(1).

In this example,

SELECT(SA)=FIRST(A)={f,a}SELECT(SAe)=FIRST(A)={f,a}SELECT(AfA)=FIRST(fA)={f}SELECT(AaA)=FIRST(aA)={a}SELECT(AbA)=FIRST(bA)+FOLLOW(A)={b,e,S}\text{SELECT}(S \rightarrow A) = \text{FIRST}(A) = \{f, a\} \\ \text{SELECT}(S \rightarrow Ae) = \text{FIRST}(A) = \{f, a \} \\ \text{SELECT}(A \rightarrow fA') = \text{FIRST}(fA') = \{f\} \\ \text{SELECT}(A \rightarrow aA') = \text{FIRST}(aA') = \{a\} \\ \text{SELECT}(A' \rightarrow bA') = \text{FIRST}(bA') + \text{FOLLOW}(A') = \{b, e, \sout{S}\}

SAS \rightarrow A and SAeS \rightarrow Ae generates overlapping SELECT sets, so the grammar is not LL(1)LL(1).

Constructing LL Table

Now we need to create a table based on the FIRST and FOLLOW sets. This table can tell us what rule we should use to derive a sequence.

The table has row indices of every non-terminal, and column indices of every terminals. We note this table as, LLTb(A,a)\mathrm{LLTb}(A, a).

  • For every AxA \rightarrow x, if aFIRST(x)\exist a \in FIRST(x), such that a=0@xa = 0 @ x or aFIRST(0@x)a \in \text{FIRST}(0 @ x), then we add AxA \rightarrow x to the table cell LLTb(A,a)\mathrm{LLTb}(A, a).
  • For every AxA \rightarrow x, if x=λx = \lambda or λFIRST(x)\lambda \in \text{FIRST}(x), for each bFOLLOW(A)b \in \text{FOLLOW}(A), we add AxA \rightarrow x to the table cell LLTb(b,a)\mathrm{LLTb}(b, a).
tip

Do both checks for every rule.

Let's consider rules,

SAdAfAAbAλS \rightarrow A | d \\ A \rightarrow fA' \\ A' \rightarrow bA' | \lambda
Non-terminalFIRST
SS{d,f}\{d, f\}
AA{f}\{f\}
AA'{b,λ}\{b, \lambda\}
Non-terminalFOLLOW
SS{S}\{\sout{S}\}
AA{S}\{\sout{S}\}
AA'{S}\{\sout{S}\}

If you calculate the SELECT, you can find out that it is LL(1)LL(1).

Based on the given rules, we create the LL table,

For SdS \rightarrow d, dFIRST(d)d \in \text{FIRST}(d), so we add SdS \rightarrow d to the table cell LLTb(S,d)\mathrm{LLTb}(S, d).

For SAS \rightarrow A, fFIRST(A)f \in \text{FIRST}(A), so we add SAS \rightarrow A to the table cell LLTb(S,f)\mathrm{LLTb}(S, f).

For AfAA \rightarrow fA', fFIRST(fA)f \in \text{FIRST}(fA'), so we add AfAA \rightarrow fA' to the table cell LLTb(A,f)\mathrm{LLTb}(A, f).

For AbAA' \rightarrow bA', bFIRST(bA)b \in \text{FIRST}(bA'), so we add AbAA' \rightarrow bA' to the table cell LLTb(A,b)\mathrm{LLTb}(A', b).

Because AλA' \rightarrow \lambda generates λ\lambda, we should add AλA' \rightarrow \lambda to all LLTb(A,c)cFOLLOW(A)\mathrm{LLTb}(A', c) \quad \forall c \in \text{FOLLOW}(A').

That is, we have the following LL table,

bbddffS\sout{S}
SSSdS\rightarrow dSAS\rightarrow A
AAAfAA\rightarrow fA'
AA'AbAA'\rightarrow bA'AλA'\rightarrow \lambda
tip

If there exists a cell that has more than one rule, then the grammar is not LL(1)LL(1). If you met this, you should check if you got the correct FIRST, FOLLOW sets, and correctly tested it.

LL(1) Derivation

LL table tells us, if we want a sequence xx, how we should derive it.

  1. We append S\sout{S} to xx, then, we have a pointer at 00, call it ii. Now we have the intermediate sentence mm. Initially, m=Sm = S.
  2. We check the table cell, LLTb(Leftmost(m),i@x)\mathrm{LLTb}(\text{Leftmost}(m), i@x). We use the rule in this cell to perform leftmost derivation on the current intermediate sentence. If there exists no rule in the cell, it means that the given sequence cannot be derived from the grammar. If there exists no more non-terminals, the derivation is complete.
  3. Let i+ki + k be the new pointer, then repeat 2. kk is the length of the newly generated terminals on the left side.

We collect every derivation in the step 22. This is how we can use LL table to derive a sequence with leftmost derivation.

For our table,

bbddffS\sout{S}
SSSdS\rightarrow dSAS\rightarrow A
AAAfAA\rightarrow fA'
AA'AbAA'\rightarrow bA'AλA'\rightarrow \lambda

Let's consider sequence x=fbbx = fbb.

  1. We make x=fbbSx = fbb\sout{S}. Let pointer i=0i = 0. Our initial sentence is SS.
  2. We check the cell LLTb(S,f)\mathrm{LLTb}(S, f). It tells to derive SAS \Rightarrow A.
  3. i=0i = 0
  4. We check LLTb(A,f)\mathrm{LLTb}(A, f). It tells to derive AfAA \Rightarrow fA'.
  5. i=1i = 1
  6. We check LLTb(A,b)\mathrm{LLTb}(A', b). It tells to derive fAfbAfA' \Rightarrow fbA'.
  7. i=2i = 2
  8. We check LLTb(A,b)\mathrm{LLTb}(A', b). It tells to derive fbAfbbAfbA' \Rightarrow fbbA'.
  9. i=3i = 3
  10. We check LLTb(A,S)\mathrm{LLTb}(A', \sout{S}). It tells to derive AλA' \Rightarrow \lambda.

So the leftmost derivation to get fbbfbb is,

SlmAlmfAlmfbAlmfbbS \Rightarrow_{lm} A \Rightarrow_{lm} fA' \Rightarrow_{lm} fbA' \Rightarrow_{lm} fbb

Now we have the leftmost derivation, we can get the parse tree. Later, we will see how the parse tree of a programming language LL(1)LL(1) can be converted to intermediate representation.

However, LL(1)LL(1) has it's intrinsic flaw- that it can't represent some languages. The next chapter, we will introduce LR(1)LR(1), which is more powerful than LL(1)LL(1).