Parsing

1. Expressions
2. Add/Subtract
3. Right Recursion
4. Times/Divide
5. Powers
6. Numbers
7. Calculating?

Parsing is an interesting and common need in programming. Compilers and interpreters do a lot of it. But there are many times when it's nice for an application to be able to accept a convenient ASCII file that an operator has prepared -- for example, games may need to parse user-written action routines. It's useful to learn how to write simple parsers. It's another tool you can use when the time is right.

There are many approaches to developing parsing algorithms. In fact, a number of very thick books have been written on this general subject. Some of the methods aren't intuitive, yet are very powerful and widely used. Some of the methods are easily learned and still quite useful.

There are also programs that can take your specifications and automatically write a sophisticated lexer (tokenizer), a sophisticated parser, or both. But even writing those specifications effectively takes some learning. And, in the end, there are times when you just won't have those tools available, or cannot use them for some reason, and you'll need to fall back on some basic skills of your own. It's worth developing those skills. (BASIC is exactly one of those cases, by the way, because I don't know of any automated tools that generate BASIC code for lexical analysis or parsing.)

Parsing?

In an automated processing sense, parsing is a process of finding out if a string of tokens can be generated by a rigorously defined grammar. Let's say we have the following set of 1-character tokens available:

( ) + - 0 1 2 3 4 5 6 7 8 9

That's 10 digits, a plus and minus, and two different parentheses (open and close). Let's then say we are allowed to combine these in any combination and order, and at any length (including none). In that case, 81+4 is one possible case, and so is )+3. These are just two possible productions drawn from the token set. In contrast, 43+6#791 is not valid, as it includes a character not in the token set. It's not even a possible production.
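Checking that kind of rule-free validity is easy to do mechanically. Here's a minimal QBASIC sketch; the function name IsProduction% is my own invention, purely for illustration:

    DECLARE FUNCTION IsProduction% (s$)

    PRINT IsProduction%("81+4")      ' prints -1 (a possible production)
    PRINT IsProduction%(")+3")       ' prints -1 (also possible; no rules yet!)
    PRINT IsProduction%("43+6#791")  ' prints  0 ('#' is not in the token set)

    ' Return -1 (true) if every character of s$ is drawn from the
    ' token set, 0 (false) otherwise.  An empty string passes, since
    ' zero-length combinations were allowed above.
    FUNCTION IsProduction% (s$)
        tokens$ = "()+-0123456789"
        FOR i% = 1 TO LEN(s$)
            IF INSTR(tokens$, MID$(s$, i%, 1)) = 0 THEN
                IsProduction% = 0
                EXIT FUNCTION
            END IF
        NEXT i%
        IsProduction% = -1
    END FUNCTION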

That token set is an example of having a starting set of symbols and few or no rules about how they can be combined to produce something. It's possible, and often useful, to instead provide some restrictive rules about how symbols can be combined. Having rules means that a specific production may be invalid either because it includes tokens not in the original token set or because it combines them in ways the rules cannot explain.

Parsing, in this sense, is simply examining a hypothetical production and seeing if it actually is one of the possible productions, given a set of tokens and a rigorously defined set of rules.
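To make "rules" concrete, here's an informal rule set of my own (not the grammar the lessons will use) that produces only well-formed sums and differences over our token set:

    expr  -> num
    expr  -> expr "+" num
    expr  -> expr "-" num
    num   -> digit
    num   -> num digit
    digit -> "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Under these rules, 81+4 can be derived, but )+3 cannot, even though every one of its characters is a legal token: no rule ever produces a parenthesis. (A rule like expr -> "(" expr ")" would have to be added to allow grouping.)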

But the practical essence of parsing is also gathering some kind of meaning and purpose out of words, phrases, and grammatical marks so that they can be acted on appropriately. In other words, it's not just deciding if some production can be produced by some grammar (parsers limited to producing yes/no decisions are called 'recognizers'), but often more about doing something with what was learned in the process of parsing. Parsers will often include "action routines" at appropriate places in order to do something useful.

For example, in parsing an algebraic mathematical expression, one could simply parse it to find out if it is valid (using a recognizer). But often, the programming will go further than this -- yielding a resulting value or converting algebraic notation into reverse polish notation. In compilers, instead of either of these steps, it may be to prepare a parse tree, later to be turned into a directed acyclic graph after optimizing, to be converted into code, to be fed to a peephole optimizer before being emitted into an object file that a linker will then use to generate a program, which can then efficiently evaluate this expression when it runs. Get the idea?

Lexing?

Parsing is distinguished from just converting ASCII into tokens. That process is called lexical analysis, and the part of the program that handles it is often called a 'lexer'; it is often a precursor step that then feeds a parser. It's possible for the parser to simply check the characters itself and avoid having a distinct lexer, but one of the advantages of having a lexer is that it aids efficient parsing. A lexer does the testing for particular character sequences once and generates a compact token value to represent them, saving the parser from re-testing the same text with slower string comparisons. But a distinct lexer is not always necessary or desired.
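To make that concrete, here is one way such a lexer might look in QBASIC. The token codes and the name GetToken% are my own choices for this sketch, not machinery from the lessons:

    DECLARE FUNCTION GetToken% (s$, pos%, value#)

    ' Token codes (arbitrary values chosen for this sketch).
    CONST TOKPLUS = 1, TOKMINUS = 2, TOKLPAR = 3, TOKRPAR = 4
    CONST TOKNUM = 5, TOKEND = 6, TOKBAD = 7

    ' Scan the next token from s$ starting at pos%, advancing pos%
    ' past whatever was consumed.  For TOKNUM, the numeric value
    ' comes back through value#.
    FUNCTION GetToken% (s$, pos%, value#)
        DO WHILE MID$(s$, pos%, 1) = " "     ' skip blanks
            pos% = pos% + 1
        LOOP
        IF pos% > LEN(s$) THEN GetToken% = TOKEND: EXIT FUNCTION
        SELECT CASE MID$(s$, pos%, 1)
            CASE "+": pos% = pos% + 1: GetToken% = TOKPLUS
            CASE "-": pos% = pos% + 1: GetToken% = TOKMINUS
            CASE "(": pos% = pos% + 1: GetToken% = TOKLPAR
            CASE ")": pos% = pos% + 1: GetToken% = TOKRPAR
            CASE "0" TO "9"                  ' the number is itself parsed here
                value# = 0
                DO WHILE MID$(s$, pos%, 1) >= "0" AND MID$(s$, pos%, 1) <= "9"
                    value# = value# * 10 + VAL(MID$(s$, pos%, 1))
                    pos% = pos% + 1
                LOOP
                GetToken% = TOKNUM
            CASE ELSE: GetToken% = TOKBAD    ' not in the token set
        END SELECT
    END FUNCTION

A caller keeps a position variable, starts it at 1, and calls GetToken% in a loop until TOKEND is returned; the parser then works with small integer codes instead of raw text.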

Actually, a lexer is really just another parser, but at a "lower level" of analysis. For example, lexers are often delegated the task of deciding what a number looks like (scientific notation, for example) and then must parse that number before passing on the token and value to the parser. That process of figuring out what a number is requires, itself, the act of parsing. I suppose one could imagine many of these levels, each passing on increasingly complex "tokens" to the next, higher level parsing stage.

Preface to the Lessons

Although professional, production parsers are rarely written in BASIC, it's usually been Quick BASIC programmers asking me how to parse. So I'll discuss the elementary details and develop an algebraic expression analyzer for Quick BASIC. The same ideas apply as well to C and many other languages, but QBASIC is readily available on most PCs (and if you don't have it, you should look on your O/S CD under the 'oldmsdos' subdirectory and get it copied over).

If you find you don't have the MS-DOS program, QBASIC.EXE, on your system, then you can download it at Microsoft's site. Every program I provide in this tutorial will run fine using it. You can go "up" to the parent page here for links to get QBASIC or one of the BASIC compilers.

To keep this first experience useful, while at the same time relying only on basic programming skills, I'm selecting an algebraic expression evaluator. There are several factors in such a program that keep things manageable.

  1. Multiple lines won't be supported. All expressions must fit into a single string in BASIC. It wouldn't be difficult to extend it to support more lines, but why bother? And the details of doing it would only serve as an obstacle to understanding the basic thrust of writing the evaluator.

  2. A recursive descent approach will be used. This is very easy to understand and follow, as a rule, certainly by comparison with other methods. Making it a predictive parser will complicate setting up the rules a bit (right recursive form instead of left recursive, for example), but not much. (A small taste of the technique appears just after this list.)

  3. At first, I'll dispense with writing a lexer for expression tokens or a separate expression-execution engine. Recursive descent alone is quite sufficient for evaluating and executing simple expressions, so parsing, execution, and lexical analysis will be conflated into a single approach.

  4. We'll start with a modest concept and grow it over time to include other features, such as a variable table for naming and saving the values from prior calculations for use in later expressions.
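To give that small taste of what's ahead, here is a bare-bones recursive descent evaluator of my own devising (not the code the lessons will develop). It handles sums, differences, and parenthesized groups, with lexing, parsing, and evaluation folded together as point 3 describes:

    DECLARE FUNCTION Expr# (s$, pos%)
    DECLARE FUNCTION Num# (s$, pos%)

    p% = 1
    PRINT Expr#("12+3-(4-2)", p%)    ' prints 13

    ' expr -> num { ("+" | "-") num }
    FUNCTION Expr# (s$, pos%)
        total# = Num#(s$, pos%)
        DO
            op$ = MID$(s$, pos%, 1)
            IF op$ <> "+" AND op$ <> "-" THEN EXIT DO
            pos% = pos% + 1              ' consume the operator
            IF op$ = "+" THEN
                total# = total# + Num#(s$, pos%)
            ELSE
                total# = total# - Num#(s$, pos%)
            END IF
        LOOP
        Expr# = total#
    END FUNCTION

    ' num -> digits | "(" expr ")"
    FUNCTION Num# (s$, pos%)
        IF MID$(s$, pos%, 1) = "(" THEN
            pos% = pos% + 1              ' consume "("
            Num# = Expr#(s$, pos%)       ' the recursion in "recursive descent"
            IF MID$(s$, pos%, 1) = ")" THEN pos% = pos% + 1
        ELSE
            v# = 0
            DO WHILE MID$(s$, pos%, 1) >= "0" AND MID$(s$, pos%, 1) <= "9"
                v# = v# * 10 + VAL(MID$(s$, pos%, 1))
                pos% = pos% + 1
            LOOP
            Num# = v#
        END IF
    END FUNCTION

Notice that Expr# handles repeated + and - with a loop, while Num# calls back into Expr# for parenthesized sub-expressions; that call is where the "recursive" in recursive descent comes from. Error handling is omitted to keep the sketch short.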

Algebraic expressions are interesting, well-understood by most people with some high-school grounding in algebra and math, and present some interesting challenges. It'll be fun to work through the details.

 

Feel free to email me.

Last updated: Sunday, July 18, 2004 19:32