Skip to content

Pratt Parsing 1

So I thought I would start a series on how Pratt parsing works. This is actually to help me understand this 3 months from now. This time, we will look at parsing -7+3*4 + 9

We first tokenize getting -,7, +, 3, *, 4, +, 9

One can use a regex for tokenizing such as

/(?:([a-zA-Z][a-zA-Z0-9]*)|((?:[0-9]*\.[0-9]+|[0-9]+)(?:[E][\-+]?[0-9]+)?)|([^a-zA-Z0-9\s]))/g

and a word match is in the first, a number in the second, and other symbols in the third (one character taken). You can use this in a stream like fashion where each token is taken one at a time. A whole list of tokens is not needed prior to parsing and there is good reason not to (foreign code parsing or comment parsing, for example).

The expression code is something like:

function expression (bp) {
   var t = node; //node is a global variable containing current node; we are about to get a new one.
   loadNewNodeFromTokens; 
   left = t.nud();  
   while (rightBindingPower < node.leftBindingPower) {
      t = node; 
      left = t.led(left); 
   }
}

nud basically creates a representation of the object as a standalone. For a number, it creates a number token. For a prefix operator, such as negation, it will actually parse an expression that sucks up the next one and returns a suitable node for further processing.

led takes what is to the left, created by the nud, and then does something, say get a right node to combine for infix operators such as addition

Once we have the token stream, we start processing.

  1. First up -. This is a prefix. It has a nud that says, parse expression with bp = 30 (high bp).
  2. The 7 is seen next with a nud that returns it.
  3. + is parsed. Its bp is 10. So the loop does not commence and 7 is returned up to the nud of negation
  4. Negation is now done. It takes 7 and negates it. + is still the next node to deal with.
  5. + has a bp of 10 and the rbp is 0. So loop commences.
  6. Next symbol parsed is node= 3 and + is stored in t.
  7. We process + led with the left of (-7) being passed in. In that led, it calls expression with a rbp of 10.
  8. 3 is processed at the start of the +’s right expression parsing. It returns itself and * is up next.
  9. * has a binding power of 20. This is greater than 10 and so we enter the loop
  10. node is 4, we do an led of *’s passing a bp of 20 to the expression.
  11. 4 is now nud’d. The next node is + whose bp = 10 < 20 of the *. We do not process the loop and we return 4
  12. 4 is now passed to * as the right node.
  13. * returns a node structure like *(3, 4) and this goes back to the + level
  14. Does the + loop iterate? The next token is a +. So it will not iterate with strict inequality. With equality, it would and this choice will change whether the node structure is +(+(-7, *(3,4)),9) or +(-7, +(*(3,4), 9). Both are equivalent because of the associativity of addition, but the first is what comes with strict inequality, I believe.
  15. So at this point we have everything done to the left of the second plus. It is now contained in the left variable in the first, outer expression loop (bp =0)
  16. We process the + and grab out the 9 into the right.

So ends pratt parsing 1

Post a Comment

Your email is never published nor shared. Required fields are marked *