Formula Engine Deep-Dive¶
The CalcBridge Formula Engine processes Excel formulas safely and efficiently using Abstract Syntax Tree (AST) parsing. This approach provides security, performance, and maintainability benefits over alternatives.
Architecture Overview¶
flowchart TB
subgraph Input["Input"]
FORMULA["=IF(A1>0, SUM(B1:B10), 0)"]
end
subgraph Lexer["Lexical Analysis"]
TOKENIZER["Tokenizer"]
TOKENS["Token Stream"]
end
subgraph Parser["Syntax Analysis"]
PARSER["Parser"]
AST["Abstract Syntax Tree"]
end
subgraph Semantic["Semantic Analysis"]
VALIDATOR["Validator"]
RESOLVER["Reference Resolver"]
end
subgraph Execution["Safe Execution"]
TRANSLATOR["Translator"]
EVALUATOR["Safe Evaluator"]
VECTORIZED["Vectorized Ops"]
end
subgraph Output["Output"]
RESULT["Result Value"]
end
FORMULA --> TOKENIZER
TOKENIZER --> TOKENS
TOKENS --> PARSER
PARSER --> AST
AST --> VALIDATOR
VALIDATOR --> RESOLVER
RESOLVER --> TRANSLATOR
TRANSLATOR --> EVALUATOR
EVALUATOR --> VECTORIZED
VECTORIZED --> RESULT
style EVALUATOR fill:#DCFCE7,stroke:#22C55E Tokenizer (Lexical Analysis)¶
The tokenizer converts formula strings into a stream of tokens.
Token Types¶
# src/calculations/parser/tokenizer.py
class TokenType(Enum):
"""Token types for Excel formula lexer."""
# Literals
NUMBER = auto() # 42, 3.14, 1.5E10
STRING = auto() # "Hello", 'World'
BOOLEAN = auto() # TRUE, FALSE
ERROR = auto() # #DIV/0!, #N/A
# References
CELL_REF = auto() # A1, $A$1
RANGE_REF = auto() # A1:B10
COLUMN_REF = auto() # A:A
ROW_REF = auto() # 1:1
SHEET_REF = auto() # Sheet1!
# Structured References
STRUCTURED_REF = auto() # [@[Column Name]]
TABLE_REF = auto() # Table1[Column]
# Operators
PLUS = auto() # +
MINUS = auto() # -
MULTIPLY = auto() # *
DIVIDE = auto() # /
POWER = auto() # ^
CONCAT = auto() # &
# Comparators
EQ = auto() # =
NE = auto() # <>
LT = auto() # <
GT = auto() # >
LE = auto() # <=
GE = auto() # >=
# Functions
FUNCTION = auto() # SUM, IF, XLOOKUP
Tokenization Example¶
# Input formula
formula = "=IF(A1>0, SUM(B1:B10)*2, 0)"
# Tokenization result
tokens = [
Token(FUNCTION, "IF", pos=1),
Token(LPAREN, "(", pos=3),
Token(CELL_REF, "A1", pos=4),
Token(GT, ">", pos=6),
Token(NUMBER, "0", pos=7),
Token(COMMA, ",", pos=8),
Token(FUNCTION, "SUM", pos=10),
Token(LPAREN, "(", pos=13),
Token(RANGE_REF, "B1:B10", pos=14),
Token(RPAREN, ")", pos=20),
Token(MULTIPLY, "*", pos=21),
Token(NUMBER, "2", pos=22),
Token(COMMA, ",", pos=23),
Token(NUMBER, "0", pos=25),
Token(RPAREN, ")", pos=26),
Token(EOF, "", pos=27),
]
Structured Reference Handling¶
The tokenizer handles Excel's structured reference syntax:
| Pattern | Type | Example |
|---|---|---|
[@Column] | This row | [@[Sales Amount]] |
[@] | Entire current row | [@] |
[#Headers] | Header row | [#Headers] |
[#Totals] | Totals row | [#Totals] |
Table[Column] | Table column | Sales[Amount] |
Parser (Syntax Analysis)¶
The parser builds an Abstract Syntax Tree from the token stream.
AST Node Types¶
classDiagram
class ASTNode {
<<abstract>>
+node_type: NodeType
+location: SourceLocation
+accept(visitor): Any
+children(): List~ASTNode~
}
class NumberNode {
+value: float
}
class StringNode {
+value: str
}
class CellRefNode {
+column: str
+row: int
+column_absolute: bool
+row_absolute: bool
+sheet: Optional~str~
}
class RangeRefNode {
+start: CellRefNode
+end: CellRefNode
}
class FunctionNode {
+name: str
+arguments: List~ASTNode~
}
class BinaryOpNode {
+operator: BinaryOperator
+left: ASTNode
+right: ASTNode
}
ASTNode <|-- NumberNode
ASTNode <|-- StringNode
ASTNode <|-- CellRefNode
ASTNode <|-- RangeRefNode
ASTNode <|-- FunctionNode
ASTNode <|-- BinaryOpNode AST Example¶
For formula =IF(A1>0, A1*2, 0):
graph TB
IF["FunctionNode(IF)"]
COMP["BinaryOpNode(>)"]
MULT["BinaryOpNode(*)"]
A1_1["CellRefNode(A1)"]
ZERO1["NumberNode(0)"]
A1_2["CellRefNode(A1)"]
TWO["NumberNode(2)"]
ZERO2["NumberNode(0)"]
IF --> COMP
IF --> MULT
IF --> ZERO2
COMP --> A1_1
COMP --> ZERO1
MULT --> A1_2
MULT --> TWO Visitor Pattern¶
The AST uses the Visitor pattern for traversal:
# src/calculations/parser/ast_nodes.py
class ASTVisitor(ABC):
"""Base visitor for AST traversal."""
@abstractmethod
def visit_number(self, node: NumberNode) -> Any: pass
@abstractmethod
def visit_cell_ref(self, node: CellRefNode) -> Any: pass
@abstractmethod
def visit_function(self, node: FunctionNode) -> Any: pass
@abstractmethod
def visit_binary_op(self, node: BinaryOpNode) -> Any: pass
# ... other visit methods
class DependencyExtractor(DefaultASTVisitor):
"""Extract cell references from a formula."""
def __init__(self):
self.dependencies: set[str] = set()
def visit_cell_ref(self, node: CellRefNode) -> None:
self.dependencies.add(node.address)
def visit_range_ref(self, node: RangeRefNode) -> None:
# Expand range to individual cells
for cell in expand_range(node):
self.dependencies.add(cell)
Translator (AST to Executable Code)¶
The translator converts AST nodes to safe, executable operations.
Translation Strategy¶
flowchart LR
subgraph AST["AST Nodes"]
FN["FunctionNode(SUM)"]
RNG["RangeRefNode(A1:A100)"]
end
subgraph Ops["Pandas Operations"]
SEL["df['A'].iloc[0:100]"]
AGG[".sum()"]
end
FN --> SEL
RNG --> SEL
SEL --> AGG Function Mapping¶
| Excel Function | Python/Pandas Implementation |
|---|---|
SUM(range) | series.sum() |
AVERAGE(range) | series.mean() |
IF(cond, t, f) | np.where(cond, t, f) |
VLOOKUP(...) | series.map(lookup_dict) |
SUMIFS(...) | df.loc[mask, col].sum() |
COUNTIF(...) | series[mask].count() |
Safe Expression Evaluator¶
The safe evaluator executes translated operations without dynamic code execution.
Security Features¶
# src/calculations/safe_evaluator.py
class SafeExpressionEvaluator:
"""Safe expression evaluator - NO dynamic code execution."""
# Maximum expression length
MAX_EXPRESSION_LENGTH = 10000
# Maximum recursion depth
MAX_DEPTH = 50
# Dangerous patterns - BLOCKED
DANGEROUS_PATTERNS = [
r'__\w+__', # Dunder methods
r'\bimport\b', # Import statements
r'\bexec\b', # Dynamic execution
r'\bcompile\b', # Code compilation
r'\bopen\b', # File operations
r'\bgetattr\b', # Attribute access
r'\bglobals\b', # Globals access
r'\blocals\b', # Locals access
r'\bos\.\w+', # OS module
r'\bsys\.\w+', # Sys module
]
# Whitelisted functions ONLY
SAFE_FUNCTIONS = {
'abs': abs,
'round': round,
'min': min,
'max': max,
'sum': sum,
'SUM': lambda *args: sum(args),
'AVERAGE': lambda *args: sum(args) / len(args),
'SQRT': math.sqrt,
# ... other safe functions
}
Evaluation Flow¶
flowchart TB
subgraph Input
EXPR["Expression String"]
end
subgraph Validate["1. Validation"]
LEN["Check Length"]
PAT["Check Patterns"]
end
subgraph Parse["2. Parse"]
AST["Python AST"]
end
subgraph Evaluate["3. Evaluate"]
WALK["Walk AST Nodes"]
OPS["Apply Operations"]
end
subgraph Output
RESULT["EvaluationResult"]
end
EXPR --> LEN
LEN --> PAT
PAT --> AST
AST --> WALK
WALK --> OPS
OPS --> RESULT
style PAT fill:#FEF3C7,stroke:#F59E0B
style OPS fill:#DCFCE7,stroke:#22C55E Supported Operations¶
| Category | Operations |
|---|---|
| Arithmetic | +, -, *, /, **, //, % |
| Comparison | ==, !=, <, >, <=, >= |
| Logical | and, or, not |
| Unary | +x, -x |
| Conditional | x if condition else y |
Supported Excel Functions (50+)¶
Logical Functions¶
| Function | Description | Example |
|---|---|---|
IF | Conditional | =IF(A1>0, "Yes", "No") |
IFS | Multiple conditions | =IFS(A1>90, "A", A1>80, "B") |
IFERROR | Error handling | =IFERROR(A1/B1, 0) |
AND | Logical AND | =AND(A1>0, B1>0) |
OR | Logical OR | =OR(A1>0, B1>0) |
NOT | Logical NOT | =NOT(A1=0) |
SWITCH | Value matching | =SWITCH(A1, 1, "One", 2, "Two") |
Lookup Functions¶
| Function | Description | Example |
|---|---|---|
VLOOKUP | Vertical lookup | =VLOOKUP(A1, B:D, 3, FALSE) |
HLOOKUP | Horizontal lookup | =HLOOKUP(A1, 1:3, 2, FALSE) |
XLOOKUP | Modern lookup | =XLOOKUP(A1, B:B, C:C) |
INDEX | Index retrieval | =INDEX(A:A, 5) |
MATCH | Position finding | =MATCH(A1, B:B, 0) |
XMATCH | Modern match | =XMATCH(A1, B:B) |
Math & Statistical Functions¶
| Function | Description | Example |
|---|---|---|
SUM | Sum values | =SUM(A1:A100) |
SUMIF | Conditional sum | =SUMIF(A:A, ">0") |
SUMIFS | Multi-condition sum | =SUMIFS(A:A, B:B, "X") |
AVERAGE | Mean | =AVERAGE(A1:A100) |
COUNT | Count numbers | =COUNT(A:A) |
COUNTIF | Conditional count | =COUNTIF(A:A, ">0") |
MIN | Minimum | =MIN(A1:A100) |
MAX | Maximum | =MAX(A1:A100) |
Text Functions¶
| Function | Description | Example |
|---|---|---|
CONCATENATE | Join text | =CONCATENATE(A1, B1) |
TEXTJOIN | Join with delimiter | =TEXTJOIN(",", TRUE, A1:A10) |
LEFT | Left characters | =LEFT(A1, 5) |
RIGHT | Right characters | =RIGHT(A1, 3) |
MID | Middle characters | =MID(A1, 2, 5) |
LEN | Length | =LEN(A1) |
TRIM | Remove spaces | =TRIM(A1) |
Vectorized Operations¶
For performance, calculations use vectorized operations:
# src/calculations/functions/vectorized.py
def vectorized_if(
condition: pd.Series,
value_if_true: Any,
value_if_false: Any
) -> pd.Series:
"""Vectorized IF - processes entire column at once."""
return np.where(condition, value_if_true, value_if_false)
def vectorized_xlookup(
lookup_values: pd.Series,
lookup_dict: dict,
default: Any = None
) -> pd.Series:
"""Vectorized XLOOKUP using dictionary mapping."""
return lookup_values.map(lookup_dict).fillna(default)
def vectorized_sumifs(
sum_range: pd.Series,
*criteria_pairs # (range, criteria, range, criteria, ...)
) -> float:
"""Vectorized SUMIFS with multiple criteria."""
mask = pd.Series([True] * len(sum_range))
for i in range(0, len(criteria_pairs), 2):
criteria_range = criteria_pairs[i]
criteria = criteria_pairs[i + 1]
mask &= (criteria_range == criteria)
return sum_range[mask].sum()
Performance Comparison¶
| Operation | Row-by-Row | Vectorized | Speedup |
|---|---|---|---|
| SUM (10K rows) | 150ms | 1.5ms | 100x |
| IF (10K rows) | 200ms | 2ms | 100x |
| XLOOKUP (10K rows) | 500ms | 5ms | 100x |
| SUMIFS (10K rows) | 300ms | 3ms | 100x |
Financial Precision¶
For monetary calculations, the engine supports Decimal precision:
# src/calculations/types.py
from decimal import Decimal, ROUND_HALF_UP
class Money:
"""Money type with 2 decimal place precision."""
PRECISION = Decimal("0.01")
def __init__(self, value: Union[int, float, Decimal], currency: str = "USD"):
self.value = Decimal(str(value)).quantize(
self.PRECISION,
rounding=ROUND_HALF_UP
)
self.currency = currency
class FinancialRate:
"""Financial rate with 6 decimal place precision."""
PRECISION = Decimal("0.000001")
def __init__(self, value: Union[int, float, Decimal]):
self.value = Decimal(str(value)).quantize(
self.PRECISION,
rounding=ROUND_HALF_UP
)
Precision Modes¶
| Mode | Precision | Use Case |
|---|---|---|
float | ~15 digits | General calculations |
money | 2 decimals | Currency amounts |
rate | 6 decimals | Interest rates |
percentage | 4 decimals | Percentages |
exact | Unlimited | Maximum precision |
Error Handling¶
The engine provides detailed error information:
# Error types
class FormulaError(Enum):
DIV_ZERO = "#DIV/0!"
VALUE = "#VALUE!"
REF = "#REF!"
NAME = "#NAME?"
NA = "#N/A"
NUM = "#NUM!"
# Error propagation
def handle_division(numerator: Any, denominator: Any) -> Any:
if denominator == 0:
return FormulaError.DIV_ZERO
return numerator / denominator
def handle_lookup_miss(lookup_value: Any, found: bool) -> Any:
if not found:
return FormulaError.NA
return lookup_value
Extension Points¶
Adding New Functions¶
# 1. Define the function
def custom_function(arg1: Any, arg2: Any) -> Any:
"""Custom Excel function implementation."""
return arg1 + arg2
# 2. Register in SAFE_FUNCTIONS
SAFE_FUNCTIONS["CUSTOM"] = custom_function
# 3. Add to tokenizer FUNCTIONS set
FUNCTIONS.add("CUSTOM")