|
We've barely started trying to figure out exactly what an expression is and then how to code a recursive descent parser for it and already we've encountered an embarassing problem -- how to decide between several options for a given production -- particularly when it causes our code to call itself over and over again, infinitely, or else simply quit too early. This is a common problem in setting up useful productions for any language. And algebraic expressions are no exception. Left RecursionHere are the productions, so far:
The first production, expression, has its last two choices using the production expression and, as such, it recursively calls the production expression. But note that this recursive use is on the left side of each of these last two choices. In other words, it implies that we'd need to recursively call the Expression function before a check is made for a plus or minus, not after. Of course, we already know this from trying to code it. And it is a problem. This is called left-recursion and in recursive descent parsing circles, this is a bad thing. What you really want is right-recursion, as that means you only make a recursive call if and only if everything else has already been verified. It's just fine to recursively call a production, if it's the last thing you do as by then you've already figured out which choice is better. You really want to arrange things so that recursive calls are the last thing to do -- sometimes called 'tail recursion.' Right RecursionWell, as you might guess, there's a trick in switching from a left-recursive form to a right recursive form. And once you see the result you'll see why it's worth doing. It works kind of like this -- if your right-recursive form can be though of as being, in some abstract sense:
Then it can be re-written into two new abstract productions, looking like this:
This is one of those parser guru tricks (well, actually, it's a first semester compiler guru trick) used to get rid of left recursion in productions. And it's not hard to memorize (or figure out.) Let's see how the above works. Look again at the left-recursive form (the one with only one production.) It's simply a z production followed by zero or more x productions. Note, I said followed by. Why not write it like that?
We don't yet know how the production B should be written, yet. But it is easy to imagine that it is zero or more x's. So how to write that? Like this:
That simply says that B can be nothing at all (empty) or it can be an x followed by a B. This is what gets us the "zero or more" meaning. The meaning of <null> is simply that it matches the empty set. In other words, B can be nothing at all -- empty -- and still be quite valid. Anyway, you put those two together and it works to get rid of the left-recursion and replace it with right-recursion, which is our desired result. What about the case with several left-recursive forms and several non-recursive forms? Well, you simply define A to be any one of the non-recursive forms, where each is followed by a B that is any one of the recursive forms or <null>. So if you first have:
Then this would become:
I hope that makes how it works a little clearer. Now, getting back to our expressions, if you look at the first production:
It can be expressed in this slightly altered form:
That is, it can be if and only if we've also arbitrarily declared that,
Now we can follow those rules, so it can be transformed such that:
Substituting back in for x and y, now, we get:
Or, renaming B to something more meaningful, like moreterms, gives us:
Recalling our production for term and adding that back to the list, we get:
The main benefit in doing all this is to lead off with options that start with plus and minus, so that we have an easy way to detect which of the options to take. In other words, we can predict based on the next token. Predictive parsing! Note that we now have right-recursion and that we have three productions, instead of two. Plus, one of them can be matched by nothing. But the main result here is that our production moreterms can actually be _decided_ upon easily by a parser, so that which of the three possibilities should be accepted is clear and concrete. It is either going to start with a +, start with a -, or else it simply is the third option, the empty set, and requires no tokens at all. But there is no ambiguity about which. Having neatly tucked the recursive call to moreterms at the end of each option,we've improved the situation and we can now write code that won't get into an infinite, recursive call. A nice thing! We can now easily decide which option is the right option for a production. In the case of moreterms, all we need to is check for a plus or minus. If either are found, we know exactly which option to use. If not, there is only one other choice. By the way, you might also try and imagine rewriting the last set of productions above as:
This might occur to you because of the patterns you might match, by eye. However, this set of productions wouldn't be the same thing as the earlier set. It would associate the minus, for example, from right to left instead of from left to right. In other words, the expression (2 - 3 - 5) would calculate as 4 instead of -6. Do you see why this is true? Coding?In this case, we create the routines called MoreTerms, Terms, Expression, and Number. With these, it's now pretty clear how to write the code for them, as well, so that they can easily determine which of their options to select. When there are options from which to select, the options is always clear from the next character being examined. This makes things much nicer. Again avoiding Number, we have:
Note that now I'm using Match to match either of two characters, in MoreTerms, above. An example of a working program for this is provided here: We can also take advantage of the tail recursion and combine Expression and MoreTerms, this way:
If you don't see how I did that quite yet, that's fine. But it might be worth thinking through it at this point, too. Here's an example set of code to try out:
A NoteWe could have tried something like this:
That would do away with expression. But it creates other problems. Which routine gets called first, to evaluate an expression? If you call Term, you can't just have an expression of 5+3 because it's not a number, but neither does it begin with an open paren. So, it doesn't work. We could call two routines in consecutive order, namely Term and then MoreTerms, but that forces the any user of these functions to know that detail, too, because it will have to replicate it. This pushes knowledge back onto the user's code, where it shouldn't be. We could try and expand it to read something like:
But then we are taken squarely back into that left-recusion problem in trying to figure out which option to take. That's what we were trying to avoid, in the first place. No. The correct way is to make a new production using expression and refer to that.
Last updated: Friday, January 13, 2006 23:58
|