Sethi first defines context free grammars using the notation of Backus-Naur forms (BNF), and then later also discusses extended BNF (EBNF) and syntax charts. These three formalisms are equivalent, in the sense that exactly the same set of languages can be defined using any of them as using the others; moreover, there are several other formalisms that are equivalent to these, including context free grammars in the sense of Chomsky, which replace the symbol ::= by ->, and do not use the | symbol. Moreover, there are several other kinds of grammar, including recursive grammars and Post systems, which are more expressive than context free grammars, in the sense that the set of languages that can be defined using them is strictly larger than the set that can be defined using context free grammars; and there are also grammars that are less expressive than the context free grammars, including regular expressions.
Perhaps it will help to be really formal about some of this, since Sethi is less formal. So here we go:-
Mathematically, a language is a set of (finite) strings, where a string is a sequence of letters, from another set that is usually called the alphabet of the language. Formally, we write A* for the set of all finite strings from the alphabet A, including the empty string, which we will usually write [] - but beware, because many other notations are used for it, including the Greek letter lambda and the Greek letter epsilon; Sethi uses the notation <empty>.
For programming, context free languages are especially important, and as remarked above, BNF is just one kind of grammar for defining these languages. Here is Chomsky's original version:
A context free grammar or CFG is a 4-tuple (N,T,P,S), where:
N0 -> w0 N1 w1 N2 ... wn-1 Nn wnwhere each Ni is a non-terminal, and each wi is a string of terminals (possibly empty);
Given a CFG G, let N be a nonterminal, p a production N -> r, and e some string in A*. Then we write
e => e' with pto indicate that some single instance of N in e is replaced by r, and we may also omit "with p" when the above holds for some p in P. Next, we write
e =*=> e'when e => e1 => ... => en = e' for some n; this is called a derivation. Finally, we can now define
L(G) = { t | S =*=> t and t in A*} .
and we can make an important definition: A grammar G is
ambiguous if there is an element of L(G) that has more than
one derivation; this is the same as that element having more than one parse
tree. Note that this is a syntactic notion of ambiguity, and that
not all syntactic ambiguities lead to semantic ambiguities in programming
languages (e.g., the syntactic ambiguity of a+b+c), though some do (e.g.,
the infamous dangling else ambiguity).
Sethi seems a bit confused about the notion of abstract syntax and in particular, the discussion of 4-2-1 on page 44 seems misleading, because it is actually impossible to get the subexpression 4-2 in the right associative grammar R. For our purposes, we can just define the abstract syntax of an expression to be just the intended parse tree for that expression. (It is possible to be more precise about this, but not worth the trouble for this course.)
Here is a "recipe" for eliminating ambiguities in infix operations: If there are N levels of precedence, use N+1 nonterminals, one for each level plus one for atoms (i.e., the smallest subterms); if an operation @ is left associative, use the production L -> L @ K, and if it is right associative, use the production R -> K @ R, where @ is any (binary infix) operator and K is the nonterminal for expressions having the next lower precedence from @.
<rexp> ::= Ø
<rexp> ::= (<rexp> <rexp>)
<rexp> ::= (<rexp> + <rexp>)
<rexp> ::= <rexp>*
<rexp> ::= a for each a in A
where <rexp> is the nonterminal for regular expressions. In the first
line, "Ø" is a Danish letter, standing for the empty set, as usual in
mathematics. The second line has an implicit binary infix operator, that is
usually called juxtaposition or concatenation. The last line is
really an abbreviation for a finite number of lines, one for each "letter" (or
terminal symbol) in the alphabet A. It is usual to leave out the
parentheses when they can be inferred by the associativity of + or of
justaxposition, or by assuming a stronger binding for juxtaposition than for
+. For a simple example of a regular expression,
(a b)(a b)*defines the language consisting of finite nonempty strings of ab. Here is another example,
(a* b)(a* b)* + (b* a)(b* a)*defining the language of nonempty strings of b's with some number (possibly zero) of a's in front of each b, and of nonempty strings of a's with some number (possibly zero) of b's in front of each a.
Many computer scientists' ideas about natural language seem to have been influenced by concepts like the above, which take a formal view along the lines Noam Chomsky, rather than that of linguists who study what real people actually write and say. An important point is that in natural language, context determines whether or not something makes sense; formal syntax and semantics are very far from adequate, and indeed, the distinction among syntax, semantics and pragmatics does not hold up under close examination. On the other hand, the formal linguists' way of looking at syntax and semantics works well for programming languages, because we can define things to work that way, and because traditionally, programming language designers want to achieve as much independence from context as possible (though this might change for future computer systems!).
It follows that the above definition of "language" is not suitable for natural language, where context determines whether something is acceptable or not, and where even then, there may be degrees of acceptability, and these may vary among different subcommunities, and even individuals, who speak a given langauge; morever, all of this changes (usually slowly) over time. For example, "yeah" is listed as a word in some dictionaries but not in others; perhaps it will gradually become more and more accepted, as have many other words over the course of time. At the level of syntax, English as spoken in black Harlem (a neighborhood of New York City) differs in some significant ways from "standard English" (if there is such a thing), in its lexicon, its syntax, and its pronunciation; this dialect has been studied under the name "Black English" by William Labov and others, who have shown that in some ways it is more coherent than so called "standard English".
It may be interesting to note that the first (known) grammar was given by Panini for Sanskrit, a sacred language of ancient India, more than 2,500 years ago. This work included very sophisticated components for phonetics, lexicon, and syntax, and was in many ways similar to modern context free grammars. The motivation for this work was to preserve the ancient sacred texts of Hinduism.
If we denote the empty set by "{}" and the empty string by the empty character, then there will be no way to tell the difference between the empty set and the set containing the empty string. So this is a bad idea, and it helps to explain why we use [] or maybe the Greek letter epsilon for the empty string, and Ø for the empty set. I am afraid you will have to get used to there being a lot of notational variantion in mathematics, just as there is a lot of notational variation for programming languages; in fact, I am afraid that notational variation will be an ongoing part of the life of every computer science professional for the foreseeable future, so that developing a tolerance for it should be part of your professional training.
[[_]] : R -> P(A*) ,where A* indicates the set of all finite strings over A and P(A*) indicates the "power set" of A*, i.e., the set of all subsets of A (another notation uses exponentiation, so that 2 ** X indicates the set of all subsets of X). Then we can actually define this denotation function recursively as follows:
[[Ø]] = Ø
[[E E']] = [[E]] º [[E']]
[[E + E']] = [[E]] U [[E']]
[[E*]] = [[E]]*
[[a]] = {a} for each a in A
where the operations º and * on sets of strings are
defined as follows:
A º B = { ab | a in A and b in B }
A* = { [] } U { a1 a2...an | ai in A, n > 0 }
Using all these equations, we can compute the denotation of any regular
expression, just by applying them recursively until all [[_]] pairs have been
eliminated. Notice that this semantics is compositional, in the sense
that the denotation of each part is "composed from" (i.e., computed from) the
denotations of its parts. For example,
[[ (a b) (a b)* ]] = [[ (a b) ]] [[ (a b)* ]]
= ([[ a ]] [[ b ]]) [[ (a b) ]]*
= ({ a } { b }) ({ a } { b })*
= { ab } { ab }*
= { (ab)n | n > 0 }
Although there are clever ways to give a compositional semantics the property that the meaning of a part depends on the context within which it occurs, in the sense of the other parts of which it is a part, a mathematical semantics for a programming language is not going to give us what we usually intuitively think of as the meaning of programs written in it. For example, the fact that a certain FORTRAN program computes a certain formula does not tell us that it gives the estimated yield of a certain variety of corn under various weather conditions, let alone that this figure is only accurate under certain assumptions about the soil, and even then it is only accurate to within about 5 percent. Yet all this is an important part of the meaning of the program.
For another example, just a little more complex, here is a context free grammar G for a very simple class of expressions:
E -> 0
E -> 1
E -> (E + E)
E -> (E * E)
where N = {E} with E as the start symbol, and T = {0,
1, (, ), +, *}. Note that this grammar is ambiguous. We can define a
denotation function
[[_]] : L(G) -> NATfor these expressions, where NAT is the set of natural numbers, with the following:
[[ 0 ]] = 0
[[ 1 ]] = 1
[[ E + E' ]] = [[ E ]] + [[ E' ]]
[[ E * E' ]] = [[ E ]] * [[ E' ]]
From this, we can compute, for example, that
[[ (1 + 1) * (1 + 1) ]] = 4
S.val = E.val
E.val = 0
E.val = 1
E1.val = E2.val + E3.val
E1.val = E2.val * E3.val
Then a parse tree for the expression (1 + 1) * (1 + 0) looks as
follows
S
|
E
/ | \
/ | \
/ * \
/ \
E E
/ | \ / | \
E + E E + E
| | | |
1 1 1 0
where for simplicity the parentheses are left out. Then the synthesized
attribute will percolate up the tree, by applying the equations from the
bottom up, producing the values that are shown here
2
|
2
/ | \
/ | \
/ * \
/ \
2 1
/ | \ / | \
1 + 1 1 + 0
where we see that the final value is 2.
For another illustration, we can do the semantics of binary digits; the grammar here is
B -> D
B -> DB
D -> 0
D -> 1
and the equations associated with these four rules are
B.pos = 0 B.val = D.val
B1.pos = B2.pos + 1 B1.val = D.val *(2 ** B1.pos) + B2.val
D.val = 0
D.val = 1
It is now a good exercise to compute the value of the binary number
1010 by first writing its parse tree and then computing how the
values of the two (synthesized) attributes perculate up the tree.