
I don’t know why exactly, but the other day I got curious about how Excel’s backend might work at a high level.
So, I decided to mess around and code a tiny version of it, and it actually got pretty interesting pretty quickly.
Here’s a short write-up to give you a peek behind the curtain.
The Core Data Unit - Cells
Every cell can hold:
- a number (10, 3.14)
- a boolean (true,false)
- a string ("hello")
- a or an expression (like =A1 + B2 * 2)
Each cell looks like this:
type Cell struct {
    Raw string // what the user typed
    Val any // the actual evaluated result
}
When we set a cell, we store the raw input, but if it starts with =, that means it's an expression that we'll need to parse and evaluate later
Lexical Analysis (aka Tokenization)
First things first, we need to break the input into pieces. This is called lexical analysis, or tokenization
Input: =A1 + 3 * b2
Tokenizer output:
[A1] [+] [3] [*] [(] [B2] [-] [1] [)]
Each token has a type (identifier, number, operator, parenthesis) That's lexical analysis, basically chopping the string into tokens
Parsing - Building the AST out of Tokens
Now we'll parse those tokens into an abstract syntax tree (AST) that respects operator precedence
For example:
=A1 + 3 * (B2 - 1)
The AST would look like this:
        (+)
       /   \
     A1     (*)
           /   \
         3     (-)
              /   \
            B2     1
Notice how * binds tigher than +, and parentheses change the evaluation order. That's operator precedence
The parser in the code does this in multiple layers:
- parseExpression() handles `+` and `-`
- parseMulDiv() handles `*` and `/`
- parsePrimary() handles literals, parenthees, or cell references
That way A1 + 3 * 2 becomes A1 + (3 * 2)
Evaluating
Once we have a tree, evaluating is simple recursion:
- if it's a number -> just return it
- if it's a cell reference -> look up that cell's current value
- if it's an operator -> evaluate left and right, then combine
But... what happens when one cell depends on another?
| Cell | Formula | Value | 
|---|---|---|
| A1 | 10 | 10 | 
| A2 | 5 | 5 | 
| A3 | =A1 + A2 * 2 | 20 | 
When you change A1 to 20, A3 should now become 30. How do we know that A3 depends on A3?
Dependency Tracking
We track two graphs:
- `deps[cell]` -> all the cells that this **this cell depends on**
- `revDeps[cell]` -> all the cells that **depend on this cell**
When we parse a formula like for cell A3: =A1 + A2 * 2, we extract its dependencies (A1, A2) and record them
For example, for A3 = A1 + A2 * 3 we store:
deps[A3] = {A1, A2}
revDeps[A1] = {A3}
revDeps[A2] = {A3}
Then if A1 changes, we know we must re-evaluate A3 (and any cells depending on A3)
Recalculation - Topological Sort
The thing is, we can't just recalculate arbitrarily all those cells. We must process cells from the least dependent to the most dependent
For example
- *A3* depends on both *A2* and *A1*
- *A2* depends on *A1*
Dependency graph:
`A1 -> A2 -> A3`
If A1 changes, we must evaluate
`A1 -> A2 -> A3`
And we achieve this by running a topological sort (classic graph algorithm)
- Nodes: `A1, A2, A3`
- Edges: `A1 -> A2`, `A2 -> A3`
Topological sort returns:
`[A1, A2, A3]`
That's the order in which we evaluate cells safely
Eager Recalculation
I chose to use an eager update strategy, for simplicity, when a cell changes:
- We first mark it as dirty
- Collect all dependents (via revDeps)
- Run topological sort
- Re-evaluate everything in order
In Conlusion
- We tokenize to make sense of raw formulas
- We parse tokens into a structure that respects operator precedence
- We evaluate by looking up cell values recursively
- We track dependencies between cells
- We recalculate efficiently using topological sorting
GitHub