A few months ago I went down the rabbit hole of writing my own interpreter from scratch. That project eventually became Liakos, a full-on interpreted language with variables, functions, environments, built-ins, conditionals, pretty much everything.
At the core of it, though, everything starts with parsing expressions. And for that, I used a Pratt parser.
We're gonna build a small Pratt parser as a REPL, walking through the core pieces.
What are we even building?
By the end, we want this:
>> 5 + 3 * 2
11
>> a = 2
>> a + 4
6
No parser generators, no third-party libraries, nothing extra.
A Pratt parser is basically this:
parse expressions by assigning each operator a binding power and letting recursion do the rest.
Tokens
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
    Atom(char),
    Op(char),
    EOF,
}
- Atom: numbers or variables (a, x, 3)
- Op: everything else (+, *, -, =, (, etc.)
- EOF: end of file (input in this case)
Lexer
The lexer just strips the whitespace, turns chars into tokens, and lets us peek() and next().
struct Lexer {
    tokens: Vec<Token>,
}

impl Lexer {
    fn new(input: &str) -> Self {
        let mut tokens = input
            .chars()
            .filter(|it| !it.is_ascii_whitespace())
            .map(|c| match c {
                '0'..='9' | 'a'..='z' | 'A'..='Z' => Token::Atom(c),
                _ => Token::Op(c),
            })
            .collect::<Vec<_>>();
        tokens.reverse();
        Lexer { tokens }
    }

    fn next(&mut self) -> Token {
        self.tokens.pop().unwrap_or(Token::EOF)
    }

    fn peek(&mut self) -> Token {
        self.tokens.last().copied().unwrap_or(Token::EOF)
    }
}
fyi: We reverse the vector so pop() can be O(1), pretty neat huh :D
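To make peek() vs next() concrete, here's a quick sanity-check sketch (the test and its name are mine, just for illustration):
#[test]
fn lexer_peek_and_next() {
    let mut lexer = Lexer::new("a + 3");

    // peek() looks at the upcoming token without consuming it...
    assert_eq!(lexer.peek(), Token::Atom('a'));
    assert_eq!(lexer.peek(), Token::Atom('a'));

    // ...while next() consumes it and moves forward.
    assert_eq!(lexer.next(), Token::Atom('a'));
    assert_eq!(lexer.next(), Token::Op('+'));
    assert_eq!(lexer.next(), Token::Atom('3'));

    // Once the input runs out, we just keep getting EOF.
    assert_eq!(lexer.next(), Token::EOF);
    assert_eq!(lexer.next(), Token::EOF);
}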
The AST
enum Expression {
    Atom(char),
    Operation(char, Vec<Expression>),
}
Some examples to better understand this enumeration:
5 => Atom('5')
1 + 2 => Operation('+', [Atom('1'), Atom('2')])
a = 3 => Operation('=', [Atom('a'), Atom('3')])
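Nesting is just Operation nodes holding other Operation nodes. For instance, a = 2 + 3 * 4 (an example we'll meet again in a moment) is this value, written out by hand purely to show the shape:
// a = 2 + 3 * 4, built by hand just to show the nesting
let ast = Expression::Operation('=', vec![
    Expression::Atom('a'),
    Expression::Operation('+', vec![
        Expression::Atom('2'),
        Expression::Operation('*', vec![
            Expression::Atom('3'),
            Expression::Atom('4'),
        ]),
    ]),
]);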
Binding Power
fn infix_binding_power(op: char) -> (f32, f32) {
    match op {
        '=' => (0.2, 0.1),
        '+' | '-' => (1.0, 1.1),
        '*' | '/' => (2.0, 2.1),
        _ => panic!("unknown operator: {:?}", op),
    }
}
What does this even mean?
Well, for starters, higher number = tighter binding:
- `*` beats `+`
- `=` binds the weakest
- the left/right difference controls associativity: `+` is (1.0, 1.1), right side slightly higher, so chains like `1 + 2 + 3` group to the left; `=` is (0.2, 0.1), right side slightly lower, so `a = b = 1` groups to the right
This is how:
a = 2 + 3 * 4
becomes:
a = (2 + (3 * 4))
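If you want to see the numbers doing that work, here's a tiny illustrative check (my test, it just calls infix_binding_power from above):
#[test]
fn binding_powers_make_sense() {
    let (plus_l, plus_r) = infix_binding_power('+'); // (1.0, 1.1)
    let (star_l, _) = infix_binding_power('*');      // (2.0, 2.1)
    let (_, eq_r) = infix_binding_power('=');        // (0.2, 0.1)

    // '*' binds tighter than '+', so in 2 + 3 * 4 the 3 goes to the '*'.
    assert!(star_l > plus_r);

    // '+' binds tighter than '=', so everything after '=' stays on its right side.
    assert!(plus_l > eq_r);
}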
The Parser itself
fn parse_expression(lexer: &mut Lexer, min_bp: f32) -> Expression {
    let mut lhs = match lexer.next() {
        Token::Atom(it) => Expression::Atom(it),
        Token::Op('(') => {
            let lhs = parse_expression(lexer, 0.0);
            assert_eq!(lexer.next(), Token::Op(')'));
            lhs
        }
        t => panic!("bad token: {:?}", t),
    };

    loop {
        let op = match lexer.peek() {
            Token::EOF => break,
            Token::Op(')') => break,
            Token::Op(op) => op,
            t => panic!("bad token: {:?}", t),
        };

        let (l_bp, r_bp) = infix_binding_power(op);
        if l_bp < min_bp {
            break;
        }

        lexer.next();
        let rhs = parse_expression(lexer, r_bp);
        lhs = Expression::Operation(op, vec![lhs, rhs]);
    }

    lhs
}
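Calling it is the easy part: make a lexer and start with a min_bp of 0.0, so nothing gets filtered out at the top level. A tiny wrapper (my own helper, not something the parser needs) could look like this:
// Convenience wrapper: parse a whole input string.
fn parse(input: &str) -> Expression {
    let mut lexer = Lexer::new(input);
    // min_bp = 0.0: at the top level, every operator is strong enough to bind.
    parse_expression(&mut lexer, 0.0)
}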
parse_expression itself is quite a lot to take in at once, so let's break down what's happening.
The most important thing here is min_bp. Think of it as: "How strong does the next operator have to be for me to even care?" Whenever we recurse into a right-hand side, the operator's right binding power becomes the new min_bp, so only stronger operators get pulled in.
We start by parsing the left-hand side:
let mut lhs = match lexer.next() {
    Token::Atom(it) => Expression::Atom(it),
    Token::Op('(') => {
        let lhs = parse_expression(lexer, 0.0);
        assert_eq!(lexer.next(), Token::Op(')'));
        lhs
    }
    t => panic!("bad token: {:?}", t),
};
Every expression starts with something: either a number, a variable, or a parenthesized expression. At this point, we haven't handled any operators yet.
In the next step, we try to extend the expression:
loop {
    let op = match lexer.peek() {
        Token::EOF => break,
        Token::Op(')') => break,
        Token::Op(op) => op,
        t => panic!("bad token: {:?}", t),
    };
We peek instead of consuming. Why? Because we need to decide whether this operator is allowed to bind before we commit to it.
let (l_bp, r_bp) = infix_binding_power(op);
if l_bp < min_bp {
    break;
}
This is the most important check in the entire parser. If the operator is too weak, we stop. If it's strong enough, it becomes part of the current expression.
Finally,
let rhs = parse_expression(lexer, r_bp);
lhs = Expression::Operation(op, vec![lhs, rhs]);
Once we accept the operator, we consume it, recursively parse the right-hand side, and build a new AST node.
We then loop again and try to extend the expression even further.
An Example
Let's parse this:
5 + 3 * 2
- `5` becomes `lhs`
- See `+` => allowed
- Parse the RHS with a higher `min_bp`
- Inside the RHS:
  - `3` becomes `lhs`
  - See `*` => stronger than `+`
  - Parse `2`
  - Build `3 * 2`
- Build `5 + (3 * 2)`
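Which leaves us with this tree, in the same notation as the AST examples earlier:
Operation('+', [Atom('5'), Operation('*', [Atom('3'), Atom('2')])])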
Evaluation
use std::collections::HashMap;

impl Expression {
    fn eval(&self, variables: &HashMap<char, f32>) -> f32 {
        match self {
            Expression::Atom(c) => match c {
                '0'..='9' => c.to_digit(10).unwrap() as f32,
                'a'..='z' | 'A'..='Z' => *variables
                    .get(c)
                    .unwrap_or_else(|| panic!("undefined variable {}", c)),
                _ => unreachable!(),
            },
            Expression::Operation(operator, operands) => {
                let lhs = operands.first().unwrap().eval(variables);
                let rhs = operands.last().unwrap().eval(variables);
                match operator {
                    '+' => lhs + rhs,
                    '-' => lhs - rhs,
                    '*' => lhs * rhs,
                    '/' => lhs / rhs,
                    op => panic!("Bad operator: {}", op),
                }
            }
        }
    }
}
This is simple. Atoms are either:
- numbers => return the value
- variables => look them up in the environment
What you need to take away from the evaluation is that it mirrors the structure of the AST.
If the parser grouped something, like (3 * 2), evaluation evaluates that first.
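One last piece, to connect this back to the REPL session at the top: eval doesn't know anything about =, so assignments have to be handled outside of it, by storing the evaluated right-hand side into the variables map. Here's a minimal sketch of how such a loop could look, assuming the Lexer, parse_expression, Expression, and eval from this post (plus its HashMap import) are in scope; it's not Liakos' actual REPL, just one way to wire the pieces together:
use std::io::{self, Write};

fn main() {
    // The environment: variable name -> value.
    let mut variables: HashMap<char, f32> = HashMap::new();

    loop {
        print!(">> ");
        io::stdout().flush().unwrap();

        let mut line = String::new();
        if io::stdin().read_line(&mut line).unwrap() == 0 {
            break; // EOF (Ctrl-D)
        }
        if line.trim().is_empty() {
            continue;
        }

        let mut lexer = Lexer::new(&line);
        let ast = parse_expression(&mut lexer, 0.0);

        match ast {
            // a = <expr>: evaluate the right side, remember it, print nothing.
            Expression::Operation('=', operands) => {
                if let Expression::Atom(name) = operands[0] {
                    let value = operands[1].eval(&variables);
                    variables.insert(name, value);
                }
            }
            // Anything else: evaluate against the environment and print.
            expr => println!("{}", expr.eval(&variables)),
        }
    }
}
That's enough to reproduce the session from the start of the post: 5 + 3 * 2 prints 11, a = 2 stores 2 under 'a', and a + 4 prints 6.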
So if you're gonna build:
- a language
- an expression engine
- a query language
…learn Pratt parsing and you’re ready to start!
If this post helped you, check out Liakos here:
Liakos Language
That’s the grown-up version of what we just built.
Now go break it, extend it, and make it yours.