Skip to content

Formula Engine Deep-Dive

The CalcBridge Formula Engine processes Excel formulas safely and efficiently using Abstract Syntax Tree (AST) parsing. This approach provides security, performance, and maintainability benefits over alternatives.

Architecture Overview

flowchart TB
    subgraph Input["Input"]
        FORMULA["=IF(A1>0, SUM(B1:B10), 0)"]
    end

    subgraph Lexer["Lexical Analysis"]
        TOKENIZER["Tokenizer"]
        TOKENS["Token Stream"]
    end

    subgraph Parser["Syntax Analysis"]
        PARSER["Parser"]
        AST["Abstract Syntax Tree"]
    end

    subgraph Semantic["Semantic Analysis"]
        VALIDATOR["Validator"]
        RESOLVER["Reference Resolver"]
    end

    subgraph Execution["Safe Execution"]
        TRANSLATOR["Translator"]
        EVALUATOR["Safe Evaluator"]
        VECTORIZED["Vectorized Ops"]
    end

    subgraph Output["Output"]
        RESULT["Result Value"]
    end

    FORMULA --> TOKENIZER
    TOKENIZER --> TOKENS
    TOKENS --> PARSER
    PARSER --> AST
    AST --> VALIDATOR
    VALIDATOR --> RESOLVER
    RESOLVER --> TRANSLATOR
    TRANSLATOR --> EVALUATOR
    EVALUATOR --> VECTORIZED
    VECTORIZED --> RESULT

    style EVALUATOR fill:#DCFCE7,stroke:#22C55E

Tokenizer (Lexical Analysis)

The tokenizer converts formula strings into a stream of tokens.

Token Types

# src/calculations/parser/tokenizer.py
class TokenType(Enum):
    """Token types for Excel formula lexer."""

    # Literals
    NUMBER = auto()      # 42, 3.14, 1.5E10
    STRING = auto()      # "Hello", 'World'
    BOOLEAN = auto()     # TRUE, FALSE
    ERROR = auto()       # #DIV/0!, #N/A

    # References
    CELL_REF = auto()    # A1, $A$1
    RANGE_REF = auto()   # A1:B10
    COLUMN_REF = auto()  # A:A
    ROW_REF = auto()     # 1:1
    SHEET_REF = auto()   # Sheet1!

    # Structured References
    STRUCTURED_REF = auto()  # [@[Column Name]]
    TABLE_REF = auto()       # Table1[Column]

    # Operators
    PLUS = auto()        # +
    MINUS = auto()       # -
    MULTIPLY = auto()    # *
    DIVIDE = auto()      # /
    POWER = auto()       # ^
    CONCAT = auto()      # &

    # Comparators
    EQ = auto()          # =
    NE = auto()          # <>
    LT = auto()          # <
    GT = auto()          # >
    LE = auto()          # <=
    GE = auto()          # >=

    # Functions
    FUNCTION = auto()    # SUM, IF, XLOOKUP

Tokenization Example

# Input formula
formula = "=IF(A1>0, SUM(B1:B10)*2, 0)"

# Tokenization result
tokens = [
    Token(FUNCTION, "IF", pos=1),
    Token(LPAREN, "(", pos=3),
    Token(CELL_REF, "A1", pos=4),
    Token(GT, ">", pos=6),
    Token(NUMBER, "0", pos=7),
    Token(COMMA, ",", pos=8),
    Token(FUNCTION, "SUM", pos=10),
    Token(LPAREN, "(", pos=13),
    Token(RANGE_REF, "B1:B10", pos=14),
    Token(RPAREN, ")", pos=20),
    Token(MULTIPLY, "*", pos=21),
    Token(NUMBER, "2", pos=22),
    Token(COMMA, ",", pos=23),
    Token(NUMBER, "0", pos=25),
    Token(RPAREN, ")", pos=26),
    Token(EOF, "", pos=27),
]

Structured Reference Handling

The tokenizer handles Excel's structured reference syntax:

Pattern Type Example
[@Column] This row [@[Sales Amount]]
[@] Entire current row [@]
[#Headers] Header row [#Headers]
[#Totals] Totals row [#Totals]
Table[Column] Table column Sales[Amount]

Parser (Syntax Analysis)

The parser builds an Abstract Syntax Tree from the token stream.

AST Node Types

classDiagram
    class ASTNode {
        <<abstract>>
        +node_type: NodeType
        +location: SourceLocation
        +accept(visitor): Any
        +children(): List~ASTNode~
    }

    class NumberNode {
        +value: float
    }

    class StringNode {
        +value: str
    }

    class CellRefNode {
        +column: str
        +row: int
        +column_absolute: bool
        +row_absolute: bool
        +sheet: Optional~str~
    }

    class RangeRefNode {
        +start: CellRefNode
        +end: CellRefNode
    }

    class FunctionNode {
        +name: str
        +arguments: List~ASTNode~
    }

    class BinaryOpNode {
        +operator: BinaryOperator
        +left: ASTNode
        +right: ASTNode
    }

    ASTNode <|-- NumberNode
    ASTNode <|-- StringNode
    ASTNode <|-- CellRefNode
    ASTNode <|-- RangeRefNode
    ASTNode <|-- FunctionNode
    ASTNode <|-- BinaryOpNode

AST Example

For formula =IF(A1>0, A1*2, 0):

graph TB
    IF["FunctionNode(IF)"]
    COMP["BinaryOpNode(>)"]
    MULT["BinaryOpNode(*)"]
    A1_1["CellRefNode(A1)"]
    ZERO1["NumberNode(0)"]
    A1_2["CellRefNode(A1)"]
    TWO["NumberNode(2)"]
    ZERO2["NumberNode(0)"]

    IF --> COMP
    IF --> MULT
    IF --> ZERO2
    COMP --> A1_1
    COMP --> ZERO1
    MULT --> A1_2
    MULT --> TWO

Visitor Pattern

The AST uses the Visitor pattern for traversal:

# src/calculations/parser/ast_nodes.py
class ASTVisitor(ABC):
    """Base visitor for AST traversal."""

    @abstractmethod
    def visit_number(self, node: NumberNode) -> Any: pass

    @abstractmethod
    def visit_cell_ref(self, node: CellRefNode) -> Any: pass

    @abstractmethod
    def visit_function(self, node: FunctionNode) -> Any: pass

    @abstractmethod
    def visit_binary_op(self, node: BinaryOpNode) -> Any: pass
    # ... other visit methods

class DependencyExtractor(DefaultASTVisitor):
    """Extract cell references from a formula."""

    def __init__(self):
        self.dependencies: set[str] = set()

    def visit_cell_ref(self, node: CellRefNode) -> None:
        self.dependencies.add(node.address)

    def visit_range_ref(self, node: RangeRefNode) -> None:
        # Expand range to individual cells
        for cell in expand_range(node):
            self.dependencies.add(cell)

Translator (AST to Executable Code)

The translator converts AST nodes to safe, executable operations.

Translation Strategy

flowchart LR
    subgraph AST["AST Nodes"]
        FN["FunctionNode(SUM)"]
        RNG["RangeRefNode(A1:A100)"]
    end

    subgraph Ops["Pandas Operations"]
        SEL["df['A'].iloc[0:100]"]
        AGG[".sum()"]
    end

    FN --> SEL
    RNG --> SEL
    SEL --> AGG

Function Mapping

Excel Function Python/Pandas Implementation
SUM(range) series.sum()
AVERAGE(range) series.mean()
IF(cond, t, f) np.where(cond, t, f)
VLOOKUP(...) series.map(lookup_dict)
SUMIFS(...) df.loc[mask, col].sum()
COUNTIF(...) series[mask].count()

Safe Expression Evaluator

The safe evaluator executes translated operations without dynamic code execution.

Security Features

# src/calculations/safe_evaluator.py
class SafeExpressionEvaluator:
    """Safe expression evaluator - NO dynamic code execution."""

    # Maximum expression length
    MAX_EXPRESSION_LENGTH = 10000

    # Maximum recursion depth
    MAX_DEPTH = 50

    # Dangerous patterns - BLOCKED
    DANGEROUS_PATTERNS = [
        r'__\w+__',      # Dunder methods
        r'\bimport\b',   # Import statements
        r'\bexec\b',     # Dynamic execution
        r'\bcompile\b',  # Code compilation
        r'\bopen\b',     # File operations
        r'\bgetattr\b',  # Attribute access
        r'\bglobals\b',  # Globals access
        r'\blocals\b',   # Locals access
        r'\bos\.\w+',    # OS module
        r'\bsys\.\w+',   # Sys module
    ]

    # Whitelisted functions ONLY
    SAFE_FUNCTIONS = {
        'abs': abs,
        'round': round,
        'min': min,
        'max': max,
        'sum': sum,
        'SUM': lambda *args: sum(args),
        'AVERAGE': lambda *args: sum(args) / len(args),
        'SQRT': math.sqrt,
        # ... other safe functions
    }

Evaluation Flow

flowchart TB
    subgraph Input
        EXPR["Expression String"]
    end

    subgraph Validate["1. Validation"]
        LEN["Check Length"]
        PAT["Check Patterns"]
    end

    subgraph Parse["2. Parse"]
        AST["Python AST"]
    end

    subgraph Evaluate["3. Evaluate"]
        WALK["Walk AST Nodes"]
        OPS["Apply Operations"]
    end

    subgraph Output
        RESULT["EvaluationResult"]
    end

    EXPR --> LEN
    LEN --> PAT
    PAT --> AST
    AST --> WALK
    WALK --> OPS
    OPS --> RESULT

    style PAT fill:#FEF3C7,stroke:#F59E0B
    style OPS fill:#DCFCE7,stroke:#22C55E

Supported Operations

Category Operations
Arithmetic +, -, *, /, **, //, %
Comparison ==, !=, <, >, <=, >=
Logical and, or, not
Unary +x, -x
Conditional x if condition else y

Supported Excel Functions (50+)

Logical Functions

Function Description Example
IF Conditional =IF(A1>0, "Yes", "No")
IFS Multiple conditions =IFS(A1>90, "A", A1>80, "B")
IFERROR Error handling =IFERROR(A1/B1, 0)
AND Logical AND =AND(A1>0, B1>0)
OR Logical OR =OR(A1>0, B1>0)
NOT Logical NOT =NOT(A1=0)
SWITCH Value matching =SWITCH(A1, 1, "One", 2, "Two")

Lookup Functions

Function Description Example
VLOOKUP Vertical lookup =VLOOKUP(A1, B:D, 3, FALSE)
HLOOKUP Horizontal lookup =HLOOKUP(A1, 1:3, 2, FALSE)
XLOOKUP Modern lookup =XLOOKUP(A1, B:B, C:C)
INDEX Index retrieval =INDEX(A:A, 5)
MATCH Position finding =MATCH(A1, B:B, 0)
XMATCH Modern match =XMATCH(A1, B:B)

Math & Statistical Functions

Function Description Example
SUM Sum values =SUM(A1:A100)
SUMIF Conditional sum =SUMIF(A:A, ">0")
SUMIFS Multi-condition sum =SUMIFS(A:A, B:B, "X")
AVERAGE Mean =AVERAGE(A1:A100)
COUNT Count numbers =COUNT(A:A)
COUNTIF Conditional count =COUNTIF(A:A, ">0")
MIN Minimum =MIN(A1:A100)
MAX Maximum =MAX(A1:A100)

Text Functions

Function Description Example
CONCATENATE Join text =CONCATENATE(A1, B1)
TEXTJOIN Join with delimiter =TEXTJOIN(",", TRUE, A1:A10)
LEFT Left characters =LEFT(A1, 5)
RIGHT Right characters =RIGHT(A1, 3)
MID Middle characters =MID(A1, 2, 5)
LEN Length =LEN(A1)
TRIM Remove spaces =TRIM(A1)

Vectorized Operations

For performance, calculations use vectorized operations:

# src/calculations/functions/vectorized.py

def vectorized_if(
    condition: pd.Series,
    value_if_true: Any,
    value_if_false: Any
) -> pd.Series:
    """Vectorized IF - processes entire column at once."""
    return np.where(condition, value_if_true, value_if_false)

def vectorized_xlookup(
    lookup_values: pd.Series,
    lookup_dict: dict,
    default: Any = None
) -> pd.Series:
    """Vectorized XLOOKUP using dictionary mapping."""
    return lookup_values.map(lookup_dict).fillna(default)

def vectorized_sumifs(
    sum_range: pd.Series,
    *criteria_pairs  # (range, criteria, range, criteria, ...)
) -> float:
    """Vectorized SUMIFS with multiple criteria."""
    mask = pd.Series([True] * len(sum_range))
    for i in range(0, len(criteria_pairs), 2):
        criteria_range = criteria_pairs[i]
        criteria = criteria_pairs[i + 1]
        mask &= (criteria_range == criteria)
    return sum_range[mask].sum()

Performance Comparison

Operation Row-by-Row Vectorized Speedup
SUM (10K rows) 150ms 1.5ms 100x
IF (10K rows) 200ms 2ms 100x
XLOOKUP (10K rows) 500ms 5ms 100x
SUMIFS (10K rows) 300ms 3ms 100x

Financial Precision

For monetary calculations, the engine supports Decimal precision:

# src/calculations/types.py
from decimal import Decimal, ROUND_HALF_UP

class Money:
    """Money type with 2 decimal place precision."""
    PRECISION = Decimal("0.01")

    def __init__(self, value: Union[int, float, Decimal], currency: str = "USD"):
        self.value = Decimal(str(value)).quantize(
            self.PRECISION,
            rounding=ROUND_HALF_UP
        )
        self.currency = currency

class FinancialRate:
    """Financial rate with 6 decimal place precision."""
    PRECISION = Decimal("0.000001")

    def __init__(self, value: Union[int, float, Decimal]):
        self.value = Decimal(str(value)).quantize(
            self.PRECISION,
            rounding=ROUND_HALF_UP
        )

Precision Modes

Mode Precision Use Case
float ~15 digits General calculations
money 2 decimals Currency amounts
rate 6 decimals Interest rates
percentage 4 decimals Percentages
exact Unlimited Maximum precision

Error Handling

The engine provides detailed error information:

# Error types
class FormulaError(Enum):
    DIV_ZERO = "#DIV/0!"
    VALUE = "#VALUE!"
    REF = "#REF!"
    NAME = "#NAME?"
    NA = "#N/A"
    NUM = "#NUM!"

# Error propagation
def handle_division(numerator: Any, denominator: Any) -> Any:
    if denominator == 0:
        return FormulaError.DIV_ZERO
    return numerator / denominator

def handle_lookup_miss(lookup_value: Any, found: bool) -> Any:
    if not found:
        return FormulaError.NA
    return lookup_value

Extension Points

Adding New Functions

# 1. Define the function
def custom_function(arg1: Any, arg2: Any) -> Any:
    """Custom Excel function implementation."""
    return arg1 + arg2

# 2. Register in SAFE_FUNCTIONS
SAFE_FUNCTIONS["CUSTOM"] = custom_function

# 3. Add to tokenizer FUNCTIONS set
FUNCTIONS.add("CUSTOM")

Adding New Operators

# 1. Add token type
class TokenType(Enum):
    CUSTOM_OP = auto()

# 2. Add to tokenizer patterns
PATTERNS.append((r'\$\$', TokenType.CUSTOM_OP))

# 3. Add to binary ops
BINARY_OPS[ast.CustomOp] = lambda a, b: custom_operation(a, b)