A few months ago I went down the rabbit hole of writing my own interpreter from scratch. That project eventually became Liakos, a full-on interpreted language with variables, functions, environments, built-ins, conditionals, pretty much everything.
At the core of it, though, everything starts with parsing expressions. And for that, I used a Pratt parser.
We're gonna build a small Pratt parser as a REPL, walking through the core pieces.
What are we even building?
By the end, we want this:
>> 5 + 3 * 2
11
>> a = 2
>> a + 4
6
No parser generators, no third-party libraries, nothing extra.
A Pratt parser is basically this:
parse expressions by assigning each operator a binding power and letting recursion do the rest.
Tokens
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
    Atom(char),
    Op(char),
    EOF,
}
- Atom: numbers or variables (a, x, 3)
- Op: everything else (+, *, -, =, (, etc.)
- EOF: end of file (input in this case)
Lexer
The lexer just strips the whitespace, turns chars into tokens, and lets us peek() and next().
struct Lexer {
    tokens: Vec<Token>,
}

impl Lexer {
    fn new(input: &str) -> Self {
        let mut tokens = input
            .chars()
            .filter(|it| !it.is_ascii_whitespace())
            .map(|c| match c {
                '0'..='9' | 'a'..='z' | 'A'..='Z' => Token::Atom(c),
                _ => Token::Op(c),
            })
            .collect::<Vec<_>>();
        tokens.reverse();
        Lexer { tokens }
    }

    fn next(&mut self) -> Token {
        self.tokens.pop().unwrap_or(Token::EOF)
    }

    fn peek(&mut self) -> Token {
        self.tokens.last().copied().unwrap_or(Token::EOF)
    }
}
fyi: We reverse the vector so pop() can be O(1), pretty neat huh :D
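To make peek() vs next() concrete, here's a quick sanity-check sketch (the test and its name are mine, just for illustration):
#[test]
fn lexer_peek_and_next() {
    let mut lexer = Lexer::new("a + 3");

    // peek() looks at the upcoming token without consuming it...
    assert_eq!(lexer.peek(), Token::Atom('a'));
    assert_eq!(lexer.peek(), Token::Atom('a'));

    // ...while next() consumes it and moves forward.
    assert_eq!(lexer.next(), Token::Atom('a'));
    assert_eq!(lexer.next(), Token::Op('+'));
    assert_eq!(lexer.next(), Token::Atom('3'));

    // Once the input runs out, we just keep getting EOF.
    assert_eq!(lexer.next(), Token::EOF);
    assert_eq!(lexer.next(), Token::EOF);
}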
The AST
enum Expression {
    Atom(char),
    Operation(char, Vec<Expression>),
}
Some examples to better understand this enumeration:
5 => Atom('5')
1 + 2 => Operation('+', [Atom('1'), Atom('2')])
a = 3 => Operation('=', [Atom('a'), Atom('3')])
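Nesting is just Operation nodes holding other Operation nodes. For instance, a = 2 + 3 * 4 (an example we'll meet again in a moment) is this value, written out by hand purely to show the shape:
// a = 2 + 3 * 4, built by hand just to show the nesting
let ast = Expression::Operation('=', vec![
    Expression::Atom('a'),
    Expression::Operation('+', vec![
        Expression::Atom('2'),
        Expression::Operation('*', vec![
            Expression::Atom('3'),
            Expression::Atom('4'),
        ]),
    ]),
]);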
Binding Power
fn infix_binding_power(op: char) -> (f32, f32) {
    match op {
        '=' => (0.2, 0.1),
        '+' | '-' => (1.0, 1.1),
        '*' | '/' => (2.0, 2.1),
        _ => panic!("unknown operator: {:?}", op),
    }
}
What does this even mean?
Well, for starters, higher number = tighter binding:
- `*` beats `+`
- `=` binds the weakest
- the left/right difference controls associativity: `+` is (1.0, 1.1), right side slightly higher, so chains like `1 + 2 + 3` group to the left; `=` is (0.2, 0.1), right side slightly lower, so `a = b = 1` groups to the right
This is how:
a = 2 + 3 * 4
becomes:
a = (2 + (3 * 4))
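If you want to see the numbers doing that work, here's a tiny illustrative check (my test, it just calls infix_binding_power from above):
#[test]
fn binding_powers_make_sense() {
    let (plus_l, plus_r) = infix_binding_power('+'); // (1.0, 1.1)
    let (star_l, _) = infix_binding_power('*');      // (2.0, 2.1)
    let (_, eq_r) = infix_binding_power('=');        // (0.2, 0.1)

    // '*' binds tighter than '+', so in 2 + 3 * 4 the 3 goes to the '*'.
    assert!(star_l > plus_r);

    // '+' binds tighter than '=', so everything after '=' stays on its right side.
    assert!(plus_l > eq_r);
}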
The Parser itself
fn parse_expression(lexer: &mut Lexer, min_bp: f32) -> Expression {
    let mut lhs = match lexer.next() {
        Token::Atom(it) => Expression::Atom(it),
        Token::Op('(') => {
            let lhs = parse_expression(lexer, 0.0);
            assert_eq!(lexer.next(), Token::Op(')'));
            lhs
        }
        t => panic!("bad token: {:?}", t),
    };

    loop {
        let op = match lexer.peek() {
            Token::EOF => break,
            Token::Op(')') => break,
            Token::Op(op) => op,
            t => panic!("bad token: {:?}", t),
        };

        let (l_bp, r_bp) = infix_binding_power(op);
        if l_bp < min_bp {
            break;
        }

        lexer.next();
        let rhs = parse_expression(lexer, r_bp);
        lhs = Expression::Operation(op, vec![lhs, rhs]);
    }

    lhs
}
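Calling it is the easy part: make a lexer and start with a min_bp of 0.0, so nothing gets filtered out at the top level. A tiny wrapper (my own helper, not something the parser needs) could look like this:
// Convenience wrapper: parse a whole input string.
fn parse(input: &str) -> Expression {
    let mut lexer = Lexer::new(input);
    // min_bp = 0.0: at the top level, every operator is strong enough to bind.
    parse_expression(&mut lexer, 0.0)
}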
parse_expression itself is quite a lot to take in at once, so let's break down what's happening.
The most important thing here is min_bp. Think of it as: "How strong does the next operator have to be for me to even care?" Whenever we recurse into a right-hand side, the operator's right binding power becomes the new min_bp, so only stronger operators get pulled in.
We start by parsing the left-hand side:
let mut lhs = match lexer.next() {
    Token::Atom(it) => Expression::Atom(it),
    Token::Op('(') => {
        let lhs = parse_expression(lexer, 0.0);
        assert_eq!(lexer.next(), Token::Op(')'));
        lhs
    }
    t => panic!("bad token: {:?}", t),
};
Every expression starts with something: either a number, a variable, or a parenthesized expression. At this point, we haven't handled any operators yet.
In the next step, we try to extend the expression:
loop {
    let op = match lexer.peek() {
        Token::EOF => break,
        Token::Op(')') => break,
        Token::Op(op) => op,
        t => panic!("bad token: {:?}", t),
    };
We peek instead of consuming. Why? Because we need to decide whether this operator is allowed to bind before we commit to it.
let (l_bp, r_bp) = infix_binding_power(op);
if l_bp < min_bp {
    break;
}
This is the most important check in the entire parser. If the operator is too weak, we stop. If it's strong enough, it becomes part of the current expression.
Finally,
let rhs = parse_expression(lexer, r_bp);
lhs = Expression::Operation(op, vec![lhs, rhs]);
Once we accept the operator, we consume it, recursively parse the right-hand side, and build a new AST node.
We then loop again and try to extend the expression even further.
An Example
Let's parse this:
5 + 3 * 2
- `5` becomes `lhs`
- See `+` => allowed
- Parse the RHS with a higher `min_bp`
- Inside the RHS:
  - `3` becomes `lhs`
  - See `*` => stronger than `+`
  - Parse `2`
  - Build `3 * 2`
- Build `5 + (3 * 2)`
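Which leaves us with this tree, in the same notation as the AST examples earlier:
Operation('+', [Atom('5'), Operation('*', [Atom('3'), Atom('2')])])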
Evaluation
use std::collections::HashMap;

impl Expression {
    fn eval(&self, variables: &HashMap<char, f32>) -> f32 {
        match self {
            Expression::Atom(c) => match c {
                '0'..='9' => c.to_digit(10).unwrap() as f32,
                'a'..='z' | 'A'..='Z' => *variables
                    .get(c)
                    .unwrap_or_else(|| panic!("undefined variable {}", c)),
                _ => unreachable!(),
            },
            Expression::Operation(operator, operands) => {
                let lhs = operands.first().unwrap().eval(variables);
                let rhs = operands.last().unwrap().eval(variables);
                match operator {
                    '+' => lhs + rhs,
                    '-' => lhs - rhs,
                    '*' => lhs * rhs,
                    '/' => lhs / rhs,
                    op => panic!("Bad operator: {}", op),
                }
            }
        }
    }
}
This is simple. Atoms are either:
- numbers => return the value
- variables => look them up in the environment
What you need to take away from the evaluation is that it mirrors the structure of the AST.
If the parser grouped something, like (3 * 2), evaluation evaluates that first.
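One last piece, to connect this back to the REPL session at the top: eval doesn't know anything about =, so assignments have to be handled outside of it, by storing the evaluated right-hand side into the variables map. Here's a minimal sketch of how such a loop could look, assuming the Lexer, parse_expression, Expression, and eval from this post (plus its HashMap import) are in scope; it's not Liakos' actual REPL, just one way to wire the pieces together:
use std::io::{self, Write};

fn main() {
    // The environment: variable name -> value.
    let mut variables: HashMap<char, f32> = HashMap::new();

    loop {
        print!(">> ");
        io::stdout().flush().unwrap();

        let mut line = String::new();
        if io::stdin().read_line(&mut line).unwrap() == 0 {
            break; // EOF (Ctrl-D)
        }
        if line.trim().is_empty() {
            continue;
        }

        let mut lexer = Lexer::new(&line);
        let ast = parse_expression(&mut lexer, 0.0);

        match ast {
            // a = <expr>: evaluate the right side, remember it, print nothing.
            Expression::Operation('=', operands) => {
                if let Expression::Atom(name) = operands[0] {
                    let value = operands[1].eval(&variables);
                    variables.insert(name, value);
                }
            }
            // Anything else: evaluate against the environment and print.
            expr => println!("{}", expr.eval(&variables)),
        }
    }
}
That's enough to reproduce the session from the start of the post: 5 + 3 * 2 prints 11, a = 2 stores 2 under 'a', and a + 4 prints 6.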
So if you're gonna build:
- a language
- an expression engine
- a query language
…learn Pratt parsing and you’re ready to start!
If this post helped you, check out Liakos here:
Liakos Language
That’s the grown-up version of what we just built.
Now go break it, extend it, and make it yours.