For some unambiguous language, there is a way that we can scan the sentence from left to right one by one, and find the only possible derivation based on the next one character (if there exists two, it's ambiguous). This is called an LL(1) grammar. The first L stands for left starting. The second L stands for leftmost derivation. 1 means that the grammar only need to look ahead of one symbol to know which rule to apply.
This part, we will focus on how we can test wether a CFG is LL(1), and how we can build a parse tree for an LL(1) language.
Testing a CFG for LL(1)
Eliminate Left Recursion
Left recursion refers to the rules in form of,
A→Aa
Or there may exists indirect left recursion,
A→BbB→Aa
If A⇒Ax can be inferred from the given rule set, the given rule set exists left recursion.
All CFG that has left recursion can't be LL(1). However, we can eliminate left recursion by rewriting. That is,
Rewrite Left Recursion
A→Aα∣β
To eliminate the left recursion,
- Create a new non-terminal (let’s call it A′) to handle the recursive part.
- Reformulate the rules so that the recursive calls happen in the right-hand side and are postponed.
That is, making it, into:
A→βA’
A’→αA’∣λ
So that it is right recursion instead of left recursion.
Flatten Indirect Left Recursion
For indirect left recursion, we can flat them into direct left recursion. That is, for,
Ei→Aixi′∣ziA0→A1x1∣y1A1→A2x2∣y2…An→A0x0∣y0
We should flatten this by iteratively substitution. That is, we can replace,
E1→A1x1′∣z1A0→A1x1∣y1A1→A2x2∣y2
With,
A0→A2x2x1∣y2y1∣y1E1→A2x2x1′∣y2x1′∣z1
This can be done iteratively until,
A0→A0x0xn…x1∣…
So that there is no indirect left recursion.
Example
S→A∣CA→Ab∣Ca∣aC→Ae∣f
We first flatten the indirect left recursion caused by,
A→CaC→Ae
So, we can eliminate C,
S→A∣Ae∣fA→Ab∣Aea∣f∣a
Now there is no indirect left recursion.
Then we eliminate left recursion,
S→A∣Ae∣fA→fA′∣aA′∣fA′′∣aA′′A′→bA′∣λA′′→eA′′∣λ
This is the final grammar without left recursion.
Factor Common Prefixes
Sometimes we may have,
A→xB∣xC
There are two targets with same prefix. We should factor them out. That is,
A→xA′A′→B∣C
This isn't present in our example.
Calculate the FIRST Set of Every Non-Terminal
The FIRST set of a non-terminal is defined as,
FIRST(A)={0@v∣A⇒∗v}
If A⇒∗λ, then,
FIRST(A)={0@v∣A⇒∗v}+{λ}
We also define the FIRST set of a terminal as,
FIRST(a)={a}
For a sentence x, the FIRST set is defined as,
FIRST(x)=FIRST(0@x)
That is, for x, FIRST(x) contains the first character of every possible derivation of x. For null sentence, we treat the first character as λ.
Because we have eliminated left recursion, getting FIRST set is simple- we try every path of the derivation and, once we have a terminal symbol at the beginning, we add it to the set and end this path.
To be more specific, for each rule, if it's in form of,
A→aw
So that a should be added to the FIRST set of A.
Or,
A→Bw
So then every symbol in \text{FIRST}(B) - \{\lambda\}$$ should be added to the FIRST set of A$.
If B⇒∗λ and w→∗λ, then we also add λ to the FIRST set of A.
Sometimes, we also have,
A→λ
We will add λ to the FIRST set of A.
Because there is no left recursion, the second case won't case any problem.
For example,
S→A∣Ae∣fA→fA′∣aA′A′→bA′∣λ
This is a subset of our previous left recursion elimination example. Thus there is also no left recursion.
We usually draw a graph that denotes the generated-by relationship between non-terminals. If a non-terminal A can be generated by B, we draw an edge from B to A.
We ignore self-loops.
So,
Then we usually do this in the order of topological sort. So that if we need to use the FIRST set of a non-terminal to calculate the FIRST set of another, they must have already been calculated.
The topological order of a set of nodes in a graph is a linear ordering of the nodes such that for every edge u→v in the graph, u comes before v in the ordering.
In practice, we take the first node that doesn't have any incoming edge as the start of the ordering. Then we remove this node and repeat the process until all nodes have been visited.
For acyclic graphs, we can always do thi.
So we start with A′, because the only rules for A′ are,
A′→bA′∣λ
So FIRST(A′)={b,λ}.
Then the A. Obviously, we can have,
FIRST(A)={f,a}
Lastly for S, because S yields non-terminals A at the left, or terminal f, so,
FIRST(S)={f}+FIRST(A)={f,a}
Calculate the FOLLOW Set of Every Non-Terminal
The FOLLOW set of a non-terminal is defined as,
FOLLOW(A)={a∣v∈I,v=xAay}
Where a is the terminal symbol that is after A. x and y are arbitrary symbols.
That is to say, the FOLLOW set of a non-terminal is all the possible terminal symbols that are after the non-terminal during the derivation.
Specially, we will always add S to the FOLLOW set of S. This is a special character denoting the end of the sentence. Because empty input is accepted, S can always have S in the FOLLOW set.
Or alternatively, if you don't what to remember such an abrupt rule, you can add the following grammar,
S′→SSS
Use S′ as the new starting symbol, then everything will play out naturally.
To get the FOLLOW set, more practically, we check every rule.
If it's in form of,
A→zBax
Then we add a to the FOLLOW set of B (x,z is any intermediate sentence and a is a terminal symbol).
If it's in form of,
A→zBCx
Then we add all of FIRST(C) to the FOLLOW set of B.
If it's in form of,
A→zBββ⇒∗λ
Then we add all of FOLLOW(A) to the FOLLOW set of B.
If,
A→aAsββ⇒∗λ
We ignore it because it doesn't contribute to the FOLLOW set of A.
Specially, we will always add S to the FOLLOW set of S.
We still use,
S→A∣Ae∣fA→fA′∣aA′A′→bA′∣λ
And do it in reversed topological order.
Previously, we have FIRST sets,
FIRST(S)={f,a}FIRST(A)={f,a}FIRST(A′)={b,λ}
There are no rules generating S, and because S is the start symbol, so, FOLLOW(S)={S}.
For A,
S→AS→Ae
Thus, FOLLOW(A)=FOLLOW(S)+{e}={e,S}
For A′,
A→fA′∣aA′A′→bA′
Thus, FOLLOW(A′)=FOLLOW(A)={e,S}.
Calculate the SELECT Set of Every Rule
The SELECT set of a rule A→x is defined as,
SELECT(A→x)={FIRST(x)∪FOLLOW(A)FIRST(x)if λ∈FIRST(x)if λ∈FIRST(x)
SELECT(A→x) is all the possible terminal symbols that can exist in a final sentence that can be generated from the rule.
Previously, we had,
FIRST(S)={f,a}FOLLOW(S)={S}FIRST(A)={f,a}FOLLOW(A)={e,S}FIRST(A′)={b,λ}FOLLOW(A′)={e,S}
And rules,
S→A∣Ae∣fA→fA′∣aA′A′→bA′∣λ
So,
SELECT(S→A)=FIRST(A)={f,a}SELECT(S→Ae)=FIRST(A)={f,a}SELECT(A→fA′)=FIRST(fA′)={f}SELECT(A→aA′)=FIRST(aA′)={a}SELECT(A′→bA′)=FIRST(bA′)+FOLLOW(A′)={b,e,S}SELECT(A′→λ)=FIRST(λ)={λ}
Test for LL(1)
For every rule in form of,
A→x0∣x1∣…∣xn
If
SELECT(A→x0)∩SELECT(A→x1)∩…∩SELECT(A→xn)=∅
Then the grammar is not LL(1).
In this example,
SELECT(S→A)=FIRST(A)={f,a}SELECT(S→Ae)=FIRST(A)={f,a}SELECT(A→fA′)=FIRST(fA′)={f}SELECT(A→aA′)=FIRST(aA′)={a}SELECT(A′→bA′)=FIRST(bA′)+FOLLOW(A′)={b,e,S}
S→A and S→Ae generates overlapping SELECT sets, so the grammar is not LL(1).
Constructing LL Table
Now we need to create a table based on the FIRST and FOLLOW sets. This table can tell us what rule we should use to derive a sequence.
The table has row indices of every non-terminal, and column indices of every terminals. We note this table as, LLTb(A,a).
- For every A→x, if ∃a∈FIRST(x), such that a=0@x or a∈FIRST(0@x), then we add A→x to the table cell LLTb(A,a).
- For every A→x, if x=λ or λ∈FIRST(x), for each b∈FOLLOW(A), we add A→x to the table cell LLTb(b,a).
Do both checks for every rule.
Let's consider rules,
S→A∣dA→fA′A′→bA′∣λ
Non-terminal | FIRST |
---|
S | {d,f} |
A | {f} |
A′ | {b,λ} |
Non-terminal | FOLLOW |
---|
S | {S} |
A | {S} |
A′ | {S} |
If you calculate the SELECT, you can find out that it is LL(1).
Based on the given rules, we create the LL table,
For S→d, d∈FIRST(d), so we add S→d to the table cell LLTb(S,d).
For S→A, f∈FIRST(A), so we add S→A to the table cell LLTb(S,f).
For A→fA′, f∈FIRST(fA′), so we add A→fA′ to the table cell LLTb(A,f).
For A′→bA′, b∈FIRST(bA′), so we add A′→bA′ to the table cell LLTb(A′,b).
Because A′→λ generates λ, we should add A′→λ to all LLTb(A′,c)∀c∈FOLLOW(A′).
That is, we have the following LL table,
| b | d | f | S |
---|
S | | S→d | S→A | |
A | | | A→fA′ | |
A′ | A′→bA′ | | | A′→λ |
If there exists a cell that has more than one rule, then the grammar is not LL(1). If you met this, you should check if you got the correct FIRST, FOLLOW sets, and correctly tested it.
LL(1) Derivation
LL table tells us, if we want a sequence x, how we should derive it.
- We append S to x, then, we have a pointer at 0, call it i. Now we have the intermediate sentence m. Initially, m=S.
- We check the table cell, LLTb(Leftmost(m),i@x). We use the rule in this cell to perform leftmost derivation on the current intermediate sentence. If there exists no rule in the cell, it means that the given sequence cannot be derived from the grammar. If there exists no more non-terminals, the derivation is complete.
- Let i+k be the new pointer, then repeat 2. k is the length of the newly generated terminals on the left side.
We collect every derivation in the step 2. This is how we can use LL table to derive a sequence with leftmost derivation.
For our table,
| b | d | f | S |
---|
S | | S→d | S→A | |
A | | | A→fA′ | |
A′ | A′→bA′ | | | A′→λ |
Let's consider sequence x=fbb.
- We make x=fbbS. Let pointer i=0. Our initial sentence is S.
- We check the cell LLTb(S,f). It tells to derive S⇒A.
- i=0
- We check LLTb(A,f). It tells to derive A⇒fA′.
- i=1
- We check LLTb(A′,b). It tells to derive fA′⇒fbA′.
- i=2
- We check LLTb(A′,b). It tells to derive fbA′⇒fbbA′.
- i=3
- We check LLTb(A′,S). It tells to derive A′⇒λ.
So the leftmost derivation to get fbb is,
S⇒lmA⇒lmfA′⇒lmfbA′⇒lmfbb
Now we have the leftmost derivation, we can get the parse tree. Later, we will see how the parse tree of a programming language LL(1) can be converted to intermediate representation.
However, LL(1) has it's intrinsic flaw- that it can't represent some languages. The next chapter, we will introduce LR(1), which is more powerful than LL(1).