Formula Engine Deep-Dive¶

The CalcBridge Formula Engine processes Excel formulas safely and efficiently using Abstract Syntax Tree (AST) parsing. This approach provides security, performance, and maintainability benefits over alternatives.

Architecture Overview¶

flowchart TB
    subgraph Input["Input"]
        FORMULA["=IF(A1>0, SUM(B1:B10), 0)"]
    end

    subgraph Lexer["Lexical Analysis"]
        TOKENIZER["Tokenizer"]
        TOKENS["Token Stream"]
    end

    subgraph Parser["Syntax Analysis"]
        PARSER["Parser"]
        AST["Abstract Syntax Tree"]
    end

    subgraph Semantic["Semantic Analysis"]
        VALIDATOR["Validator"]
        RESOLVER["Reference Resolver"]
    end

    subgraph Execution["Safe Execution"]
        TRANSLATOR["Translator"]
        EVALUATOR["Safe Evaluator"]
        VECTORIZED["Vectorized Ops"]
    end

    subgraph Output["Output"]
        RESULT["Result Value"]
    end

    FORMULA --> TOKENIZER
    TOKENIZER --> TOKENS
    TOKENS --> PARSER
    PARSER --> AST
    AST --> VALIDATOR
    VALIDATOR --> RESOLVER
    RESOLVER --> TRANSLATOR
    TRANSLATOR --> EVALUATOR
    EVALUATOR --> VECTORIZED
    VECTORIZED --> RESULT

    style EVALUATOR fill:#DCFCE7,stroke:#22C55E

Tokenizer (Lexical Analysis)¶

The tokenizer converts formula strings into a stream of tokens.

Token Types¶

# src/calculations/parser/tokenizer.py
class TokenType(Enum):
    """Token types for Excel formula lexer."""

    # Literals
    NUMBER = auto()      # 42, 3.14, 1.5E10
    STRING = auto()      # "Hello", 'World'
    BOOLEAN = auto()     # TRUE, FALSE
    ERROR = auto()       # #DIV/0!, #N/A

    # References
    CELL_REF = auto()    # A1, $A$1
    RANGE_REF = auto()   # A1:B10
    COLUMN_REF = auto()  # A:A
    ROW_REF = auto()     # 1:1
    SHEET_REF = auto()   # Sheet1!

    # Structured References
    STRUCTURED_REF = auto()  # [@[Column Name]]
    TABLE_REF = auto()       # Table1[Column]

    # Operators
    PLUS = auto()        # +
    MINUS = auto()       # -
    MULTIPLY = auto()    # *
    DIVIDE = auto()      # /
    POWER = auto()       # ^
    CONCAT = auto()      # &

    # Comparators
    EQ = auto()          # =
    NE = auto()          # <>
    LT = auto()          # <
    GT = auto()          # >
    LE = auto()          # <=
    GE = auto()          # >=

    # Functions
    FUNCTION = auto()    # SUM, IF, XLOOKUP

Tokenization Example¶

# Input formula
formula = "=IF(A1>0, SUM(B1:B10)*2, 0)"

# Tokenization result
tokens = [
    Token(FUNCTION, "IF", pos=1),
    Token(LPAREN, "(", pos=3),
    Token(CELL_REF, "A1", pos=4),
    Token(GT, ">", pos=6),
    Token(NUMBER, "0", pos=7),
    Token(COMMA, ",", pos=8),
    Token(FUNCTION, "SUM", pos=10),
    Token(LPAREN, "(", pos=13),
    Token(RANGE_REF, "B1:B10", pos=14),
    Token(RPAREN, ")", pos=20),
    Token(MULTIPLY, "*", pos=21),
    Token(NUMBER, "2", pos=22),
    Token(COMMA, ",", pos=23),
    Token(NUMBER, "0", pos=25),
    Token(RPAREN, ")", pos=26),
    Token(EOF, "", pos=27),
]

Structured Reference Handling¶

The tokenizer handles Excel's structured reference syntax:

Pattern	Type	Example
`[@Column]`	This row	`[@[Sales Amount]]`
`[@]`	Entire current row	`[@]`
`[#Headers]`	Header row	`[#Headers]`
`[#Totals]`	Totals row	`[#Totals]`
`Table[Column]`	Table column	`Sales[Amount]`

Parser (Syntax Analysis)¶

The parser builds an Abstract Syntax Tree from the token stream.

AST Node Types¶

classDiagram
    class ASTNode {
        <<abstract>>
        +node_type: NodeType
        +location: SourceLocation
        +accept(visitor): Any
        +children(): List~ASTNode~
    }

    class NumberNode {
        +value: float
    }

    class StringNode {
        +value: str
    }

    class CellRefNode {
        +column: str
        +row: int
        +column_absolute: bool
        +row_absolute: bool
        +sheet: Optional~str~
    }

    class RangeRefNode {
        +start: CellRefNode
        +end: CellRefNode
    }

    class FunctionNode {
        +name: str
        +arguments: List~ASTNode~
    }

    class BinaryOpNode {
        +operator: BinaryOperator
        +left: ASTNode
        +right: ASTNode
    }

    ASTNode <|-- NumberNode
    ASTNode <|-- StringNode
    ASTNode <|-- CellRefNode
    ASTNode <|-- RangeRefNode
    ASTNode <|-- FunctionNode
    ASTNode <|-- BinaryOpNode

AST Example¶

For formula =IF(A1>0, A1*2, 0):

graph TB
    IF["FunctionNode(IF)"]
    COMP["BinaryOpNode(>)"]
    MULT["BinaryOpNode(*)"]
    A1_1["CellRefNode(A1)"]
    ZERO1["NumberNode(0)"]
    A1_2["CellRefNode(A1)"]
    TWO["NumberNode(2)"]
    ZERO2["NumberNode(0)"]

    IF --> COMP
    IF --> MULT
    IF --> ZERO2
    COMP --> A1_1
    COMP --> ZERO1
    MULT --> A1_2
    MULT --> TWO

Visitor Pattern¶

The AST uses the Visitor pattern for traversal:

# src/calculations/parser/ast_nodes.py
class ASTVisitor(ABC):
    """Base visitor for AST traversal."""

    @abstractmethod
    def visit_number(self, node: NumberNode) -> Any: pass

    @abstractmethod
    def visit_cell_ref(self, node: CellRefNode) -> Any: pass

    @abstractmethod
    def visit_function(self, node: FunctionNode) -> Any: pass

    @abstractmethod
    def visit_binary_op(self, node: BinaryOpNode) -> Any: pass
    # ... other visit methods

class DependencyExtractor(DefaultASTVisitor):
    """Extract cell references from a formula."""

    def __init__(self):
        self.dependencies: set[str] = set()

    def visit_cell_ref(self, node: CellRefNode) -> None:
        self.dependencies.add(node.address)

    def visit_range_ref(self, node: RangeRefNode) -> None:
        # Expand range to individual cells
        for cell in expand_range(node):
            self.dependencies.add(cell)

Translator (AST to Executable Code)¶

The translator converts AST nodes to safe, executable operations.

Translation Strategy¶

flowchart LR
    subgraph AST["AST Nodes"]
        FN["FunctionNode(SUM)"]
        RNG["RangeRefNode(A1:A100)"]
    end

    subgraph Ops["Pandas Operations"]
        SEL["df['A'].iloc[0:100]"]
        AGG[".sum()"]
    end

    FN --> SEL
    RNG --> SEL
    SEL --> AGG

Function Mapping¶

Excel Function	Python/Pandas Implementation
`SUM(range)`	`series.sum()`
`AVERAGE(range)`	`series.mean()`
`IF(cond, t, f)`	`np.where(cond, t, f)`
`VLOOKUP(...)`	`series.map(lookup_dict)`
`SUMIFS(...)`	`df.loc[mask, col].sum()`
`COUNTIF(...)`	`series[mask].count()`

Safe Expression Evaluator¶

The safe evaluator executes translated operations without dynamic code execution.

Security Features¶

# src/calculations/safe_evaluator.py
class SafeExpressionEvaluator:
    """Safe expression evaluator - NO dynamic code execution."""

    # Maximum expression length
    MAX_EXPRESSION_LENGTH = 10000

    # Maximum recursion depth
    MAX_DEPTH = 50

    # Dangerous patterns - BLOCKED
    DANGEROUS_PATTERNS = [
        r'__\w+__',      # Dunder methods
        r'\bimport\b',   # Import statements
        r'\bexec\b',     # Dynamic execution
        r'\bcompile\b',  # Code compilation
        r'\bopen\b',     # File operations
        r'\bgetattr\b',  # Attribute access
        r'\bglobals\b',  # Globals access
        r'\blocals\b',   # Locals access
        r'\bos\.\w+',    # OS module
        r'\bsys\.\w+',   # Sys module
    ]

    # Whitelisted functions ONLY
    SAFE_FUNCTIONS = {
        'abs': abs,
        'round': round,
        'min': min,
        'max': max,
        'sum': sum,
        'SUM': lambda *args: sum(args),
        'AVERAGE': lambda *args: sum(args) / len(args),
        'SQRT': math.sqrt,
        # ... other safe functions
    }

Evaluation Flow¶

flowchart TB
    subgraph Input
        EXPR["Expression String"]
    end

    subgraph Validate["1. Validation"]
        LEN["Check Length"]
        PAT["Check Patterns"]
    end

    subgraph Parse["2. Parse"]
        AST["Python AST"]
    end

    subgraph Evaluate["3. Evaluate"]
        WALK["Walk AST Nodes"]
        OPS["Apply Operations"]
    end

    subgraph Output
        RESULT["EvaluationResult"]
    end

    EXPR --> LEN
    LEN --> PAT
    PAT --> AST
    AST --> WALK
    WALK --> OPS
    OPS --> RESULT

    style PAT fill:#FEF3C7,stroke:#F59E0B
    style OPS fill:#DCFCE7,stroke:#22C55E

Supported Operations¶

Category	Operations
Arithmetic	`+`, `-`, ``, `/`, `*`, `//`, `%`
Comparison	`==`, `!=`, `<`, `>`, `<=`, `>=`
Logical	`and`, `or`, `not`
Unary	`+x`, `-x`
Conditional	`x if condition else y`

Supported Excel Functions (50+)¶

Logical Functions¶

Function	Description	Example
`IF`	Conditional	`=IF(A1>0, "Yes", "No")`
`IFS`	Multiple conditions	`=IFS(A1>90, "A", A1>80, "B")`
`IFERROR`	Error handling	`=IFERROR(A1/B1, 0)`
`AND`	Logical AND	`=AND(A1>0, B1>0)`
`OR`	Logical OR	`=OR(A1>0, B1>0)`
`NOT`	Logical NOT	`=NOT(A1=0)`
`SWITCH`	Value matching	`=SWITCH(A1, 1, "One", 2, "Two")`

Lookup Functions¶

Function	Description	Example
`VLOOKUP`	Vertical lookup	`=VLOOKUP(A1, B:D, 3, FALSE)`
`HLOOKUP`	Horizontal lookup	`=HLOOKUP(A1, 1:3, 2, FALSE)`
`XLOOKUP`	Modern lookup	`=XLOOKUP(A1, B:B, C:C)`
`INDEX`	Index retrieval	`=INDEX(A:A, 5)`
`MATCH`	Position finding	`=MATCH(A1, B:B, 0)`
`XMATCH`	Modern match	`=XMATCH(A1, B:B)`

Math & Statistical Functions¶

Function	Description	Example
`SUM`	Sum values	`=SUM(A1:A100)`
`SUMIF`	Conditional sum	`=SUMIF(A:A, ">0")`
`SUMIFS`	Multi-condition sum	`=SUMIFS(A:A, B:B, "X")`
`AVERAGE`	Mean	`=AVERAGE(A1:A100)`
`COUNT`	Count numbers	`=COUNT(A:A)`
`COUNTIF`	Conditional count	`=COUNTIF(A:A, ">0")`
`MIN`	Minimum	`=MIN(A1:A100)`
`MAX`	Maximum	`=MAX(A1:A100)`

Text Functions¶

Function	Description	Example
`CONCATENATE`	Join text	`=CONCATENATE(A1, B1)`
`TEXTJOIN`	Join with delimiter	`=TEXTJOIN(",", TRUE, A1:A10)`
`LEFT`	Left characters	`=LEFT(A1, 5)`
`RIGHT`	Right characters	`=RIGHT(A1, 3)`
`MID`	Middle characters	`=MID(A1, 2, 5)`
`LEN`	Length	`=LEN(A1)`
`TRIM`	Remove spaces	`=TRIM(A1)`

Vectorized Operations¶

For performance, calculations use vectorized operations:

# src/calculations/functions/vectorized.py

def vectorized_if(
    condition: pd.Series,
    value_if_true: Any,
    value_if_false: Any
) -> pd.Series:
    """Vectorized IF - processes entire column at once."""
    return np.where(condition, value_if_true, value_if_false)

def vectorized_xlookup(
    lookup_values: pd.Series,
    lookup_dict: dict,
    default: Any = None
) -> pd.Series:
    """Vectorized XLOOKUP using dictionary mapping."""
    return lookup_values.map(lookup_dict).fillna(default)

def vectorized_sumifs(
    sum_range: pd.Series,
    *criteria_pairs  # (range, criteria, range, criteria, ...)
) -> float:
    """Vectorized SUMIFS with multiple criteria."""
    mask = pd.Series([True] * len(sum_range))
    for i in range(0, len(criteria_pairs), 2):
        criteria_range = criteria_pairs[i]
        criteria = criteria_pairs[i + 1]
        mask &= (criteria_range == criteria)
    return sum_range[mask].sum()

Performance Comparison¶

Operation	Row-by-Row	Vectorized	Speedup
SUM (10K rows)	150ms	1.5ms	100x
IF (10K rows)	200ms	2ms	100x
XLOOKUP (10K rows)	500ms	5ms	100x
SUMIFS (10K rows)	300ms	3ms	100x

Financial Precision¶

For monetary calculations, the engine supports Decimal precision:

# src/calculations/types.py
from decimal import Decimal, ROUND_HALF_UP

class Money:
    """Money type with 2 decimal place precision."""
    PRECISION = Decimal("0.01")

    def __init__(self, value: Union[int, float, Decimal], currency: str = "USD"):
        self.value = Decimal(str(value)).quantize(
            self.PRECISION,
            rounding=ROUND_HALF_UP
        )
        self.currency = currency

class FinancialRate:
    """Financial rate with 6 decimal place precision."""
    PRECISION = Decimal("0.000001")

    def __init__(self, value: Union[int, float, Decimal]):
        self.value = Decimal(str(value)).quantize(
            self.PRECISION,
            rounding=ROUND_HALF_UP
        )

Precision Modes¶

Mode	Precision	Use Case
`float`	~15 digits	General calculations
`money`	2 decimals	Currency amounts
`rate`	6 decimals	Interest rates
`percentage`	4 decimals	Percentages
`exact`	Unlimited	Maximum precision

Error Handling¶

The engine provides detailed error information:

# Error types
class FormulaError(Enum):
    DIV_ZERO = "#DIV/0!"
    VALUE = "#VALUE!"
    REF = "#REF!"
    NAME = "#NAME?"
    NA = "#N/A"
    NUM = "#NUM!"

# Error propagation
def handle_division(numerator: Any, denominator: Any) -> Any:
    if denominator == 0:
        return FormulaError.DIV_ZERO
    return numerator / denominator

def handle_lookup_miss(lookup_value: Any, found: bool) -> Any:
    if not found:
        return FormulaError.NA
    return lookup_value

Extension Points¶

Adding New Functions¶

# 1. Define the function
def custom_function(arg1: Any, arg2: Any) -> Any:
    """Custom Excel function implementation."""
    return arg1 + arg2

# 2. Register in SAFE_FUNCTIONS
SAFE_FUNCTIONS["CUSTOM"] = custom_function

# 3. Add to tokenizer FUNCTIONS set
FUNCTIONS.add("CUSTOM")

Adding New Operators¶

# 1. Add token type
class TokenType(Enum):
    CUSTOM_OP = auto()

# 2. Add to tokenizer patterns
PATTERNS.append((r'\$\$', TokenType.CUSTOM_OP))

# 3. Add to binary ops
BINARY_OPS[ast.CustomOp] = lambda a, b: custom_operation(a, b)