Adding Excel Functions¶
This guide walks through the process of adding new Excel function support to CalcBridge's calculation engine. The engine uses vectorized operations for high performance on large financial datasets.
Architecture Overview¶
The calculation engine has three main components:
flowchart LR
A["Formula String<br/>=SUM(A1:A10)"] --> B["Parser<br/>src/calculations/parser/"]
B --> C["AST<br/>Abstract Syntax Tree"]
C --> D["Translator<br/>src/calculations/parser/translator.py"]
D --> E["Vectorized Functions<br/>src/calculations/functions/"]
E --> F["Result"] | Component | Location | Purpose |
|---|---|---|
| Parser | src/calculations/parser/ | Tokenizes and parses Excel formulas into AST |
| AST Nodes | src/calculations/parser/ast_nodes.py | Defines AST node types |
| Translator | src/calculations/parser/translator.py | Converts AST to executable code |
| Vectorized Functions | src/calculations/functions/vectorized.py | High-performance implementations |
| Function Registry | src/calculations/functions/__init__.py | Exports available functions |
Step-by-Step Guide¶
Step 1: Implement in vectorized.py¶
First, add your function implementation to src/calculations/functions/vectorized.py:
def vectorized_round(
values: pd.Series | np.ndarray,
decimals: int = 0,
) -> np.ndarray:
"""
Vectorized implementation of Excel ROUND function.
Equivalent to: =ROUND(value, decimals)
Args:
values: Values to round
decimals: Number of decimal places (can be negative)
Returns:
numpy array with rounded values
Example:
>>> df['Rounded'] = vectorized_round(df['Amount'], 2)
Excel equivalent:
=ROUND([@Amount], 2)
"""
return np.round(np.asarray(values), decimals)
Implementation Guidelines¶
- Always use vectorized operations - Avoid loops; use numpy/pandas operations
- Handle edge cases - Empty arrays, NaN values, type mismatches
- Match Excel behavior - Test against Excel to ensure identical results
- Document thoroughly - Include docstring, example, and Excel equivalent
Common Patterns¶
# Conditional logic: Use np.where
def vectorized_if(condition, value_true, value_false):
return np.where(condition, value_true, value_false)
# Multiple conditions: Chain np.where or use np.select
def vectorized_ifs(*conditions_and_values, default=None):
conditions = [c for i, c in enumerate(conditions_and_values) if i % 2 == 0]
values = [v for i, v in enumerate(conditions_and_values) if i % 2 == 1]
return np.select(conditions, values, default=default)
# Aggregation with conditions: Use boolean masking
def vectorized_sumif(range_values, criteria_range, criteria):
mask = criteria_range == criteria
return np.sum(np.where(mask, range_values, 0))
# Lookups: Use dictionary mapping
def vectorized_xlookup(lookup_values, lookup_dict, default=None):
return lookup_values.map(lookup_dict).fillna(default)
Step 2: Register in Function Registry¶
Export your function in src/calculations/functions/__init__.py:
from src.calculations.functions.vectorized import (
# ... existing imports ...
vectorized_round, # Add your new function
)
__all__ = [
# ... existing exports ...
"vectorized_round", # Add to __all__
]
Step 3: Add Parser Support (if needed)¶
Most functions use standard function call syntax and don't need parser changes. However, if your function has special syntax, you may need to modify the parser.
Standard Syntax (No Changes Needed)¶
Special Syntax (Parser Changes Required)¶
If your function has unique syntax (like array formulas or special operators), modify the parser:
# Add to the function signature registry
FUNCTION_SIGNATURES = {
# ... existing functions ...
"ROUND": FunctionSignature(
name="ROUND",
min_args=1,
max_args=2,
arg_types=["number", "number"],
return_type="number",
),
}
Step 4: Add Translator Support¶
Update the translator to convert the AST node to your vectorized function call:
def translate_function_call(node: FunctionCallNode) -> str:
"""Translate a function call AST node to Python code."""
func_name = node.function_name.upper()
# ... existing function translations ...
if func_name == "ROUND":
args = [translate_expression(arg) for arg in node.arguments]
value = args[0]
decimals = args[1] if len(args) > 1 else "0"
return f"vectorized_round({value}, {decimals})"
# ... rest of translations ...
Step 5: Write Tests¶
Create comprehensive tests for your function:
import pytest
import numpy as np
import pandas as pd
from numpy.testing import assert_array_equal, assert_array_almost_equal
from src.calculations.functions import vectorized_round
class TestVectorizedRound:
"""Tests for vectorized_round function."""
def test_round_to_integer(self):
"""Test rounding to integer (0 decimal places)."""
values = pd.Series([1.4, 1.5, 1.6, 2.5])
result = vectorized_round(values, 0)
# Note: numpy uses banker's rounding (round half to even)
expected = np.array([1.0, 2.0, 2.0, 2.0])
assert_array_equal(result, expected)
def test_round_to_decimals(self):
"""Test rounding to specific decimal places."""
values = pd.Series([1.234, 5.678, 9.999])
result = vectorized_round(values, 2)
expected = np.array([1.23, 5.68, 10.0])
assert_array_almost_equal(result, expected)
def test_round_negative_decimals(self):
"""Test rounding to tens, hundreds, etc."""
values = pd.Series([123, 456, 789])
result = vectorized_round(values, -1)
expected = np.array([120, 460, 790])
assert_array_equal(result, expected)
def test_round_empty_array(self):
"""Test handling of empty input."""
values = pd.Series([], dtype=float)
result = vectorized_round(values, 2)
assert len(result) == 0
def test_round_with_nan(self):
"""Test handling of NaN values."""
values = pd.Series([1.5, np.nan, 2.5])
result = vectorized_round(values, 0)
assert np.isnan(result[1])
assert result[0] == 2.0
assert result[2] == 2.0
def test_round_numpy_array_input(self):
"""Test with numpy array input."""
values = np.array([1.234, 5.678])
result = vectorized_round(values, 1)
expected = np.array([1.2, 5.7])
assert_array_almost_equal(result, expected)
Step 6: Test Parser Integration¶
import pytest
from src.calculations.parser import parse_formula, translate_formula
class TestRoundFunction:
"""Parser tests for ROUND function."""
def test_parse_round_with_decimals(self):
"""Test parsing ROUND with decimal argument."""
ast = parse_formula("=ROUND(A1, 2)")
assert ast.function_name == "ROUND"
assert len(ast.arguments) == 2
def test_parse_round_default_decimals(self):
"""Test parsing ROUND with default decimals."""
ast = parse_formula("=ROUND(A1)")
assert ast.function_name == "ROUND"
assert len(ast.arguments) == 1
def test_translate_round(self):
"""Test translation to vectorized function."""
result = translate_formula("=ROUND(A1, 2)")
assert "vectorized_round" in result.code
Step 7: Update Documentation¶
Add your function to the function reference:
### ROUND
Rounds a number to a specified number of digits.
**Syntax**: `=ROUND(value, [decimals])`
**Arguments**:
| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `value` | number | Yes | The value to round |
| `decimals` | number | No | Number of decimal places (default: 0) |
**Returns**: number
**Examples**:
Example: Implementing AVERAGEIF¶
Let's walk through a complete example of implementing AVERAGEIF:
1. Implementation¶
def vectorized_averageif(
range_values: pd.Series | np.ndarray,
criteria_range: pd.Series | np.ndarray,
criteria: Any,
average_range: pd.Series | np.ndarray | None = None,
) -> float:
"""
Vectorized implementation of Excel AVERAGEIF function.
Equivalent to: =AVERAGEIF(criteria_range, criteria, [average_range])
Args:
range_values: The range to evaluate with criteria
criteria_range: The range containing the criteria values
criteria: The criteria to match (value, expression, or pattern)
average_range: Optional range to average (defaults to criteria_range)
Returns:
Average of values where criteria is met, or NaN if no matches
Example:
>>> df['AvgByCategory'] = vectorized_averageif(
... df['Value'],
... df['Category'],
... 'A'
... )
Excel equivalent:
=AVERAGEIF(Category, "A", Value)
"""
criteria_arr = np.asarray(criteria_range)
values_arr = np.asarray(average_range if average_range is not None else range_values)
# Build condition based on criteria type
if isinstance(criteria, str):
if criteria.startswith(">"):
threshold = float(criteria[1:])
mask = criteria_arr > threshold
elif criteria.startswith("<"):
threshold = float(criteria[1:])
mask = criteria_arr < threshold
elif criteria.startswith(">="):
threshold = float(criteria[2:])
mask = criteria_arr >= threshold
elif criteria.startswith("<="):
threshold = float(criteria[2:])
mask = criteria_arr <= threshold
elif criteria.startswith("<>"):
compare_val = criteria[2:]
mask = criteria_arr != compare_val
else:
mask = criteria_arr == criteria
else:
mask = criteria_arr == criteria
matching_values = values_arr[mask]
if len(matching_values) == 0:
return np.nan
return np.mean(matching_values)
2. Register¶
from src.calculations.functions.vectorized import (
# ... existing ...
vectorized_averageif,
)
__all__ = [
# ... existing ...
"vectorized_averageif",
]
3. Tests¶
import pytest
import numpy as np
import pandas as pd
from src.calculations.functions import vectorized_averageif
class TestVectorizedAverageIf:
"""Tests for vectorized_averageif function."""
def test_basic_equality(self):
"""Test averaging with equality criteria."""
values = pd.Series([10, 20, 30, 40])
categories = pd.Series(['A', 'B', 'A', 'B'])
result = vectorized_averageif(values, categories, 'A')
assert result == 20.0 # Average of 10 and 30
def test_greater_than(self):
"""Test with greater than criteria."""
values = pd.Series([10, 20, 30, 40, 50])
result = vectorized_averageif(values, values, '>25')
assert result == 40.0 # Average of 30, 40, 50
def test_no_matches_returns_nan(self):
"""Test that no matches returns NaN."""
values = pd.Series([10, 20, 30])
categories = pd.Series(['A', 'A', 'A'])
result = vectorized_averageif(values, categories, 'B')
assert np.isnan(result)
def test_separate_average_range(self):
"""Test with separate average range."""
criteria_range = pd.Series(['X', 'Y', 'X', 'Y'])
average_range = pd.Series([100, 200, 300, 400])
result = vectorized_averageif(
average_range,
criteria_range,
'X',
average_range
)
assert result == 200.0 # Average of 100 and 300
Performance Considerations¶
Vectorization is Critical¶
Always use vectorized operations. The performance difference is significant:
| Approach | 10K rows | 100K rows | 1M rows |
|---|---|---|---|
| Vectorized (numpy) | 1ms | 5ms | 50ms |
| df.apply() | 50ms | 500ms | 5s |
| df.iterrows() | 500ms | 5s | 50s |
Memory Efficiency¶
For large datasets, consider memory-efficient approaches:
# Good: In-place operations where possible
def vectorized_abs_inplace(values: np.ndarray) -> np.ndarray:
"""Absolute value with potential for in-place operation."""
return np.abs(values, out=values if values.flags.writeable else None)
# Good: Use appropriate dtypes
df['flag'] = df['value'] > 0 # bool uses 1 byte vs 8 for float64
Caching Lookup Dictionaries¶
For lookup functions, build dictionaries once and reuse:
# Good: Build lookup once
lookup_dict = dict(zip(reference_df['key'], reference_df['value']))
df['result'] = df['lookup_col'].map(lookup_dict)
# Avoid: Repeated lookups
# for idx, row in df.iterrows():
# df.at[idx, 'result'] = reference_df[reference_df['key'] == row['lookup_col']]['value'].iloc[0]
Checklist¶
Before submitting a new function:
- Implementation in
vectorized.pywith full docstring - Function exported in
__init__.py - Parser signature registered (if needed)
- Translator support added
- Unit tests with edge cases
- Integration tests with parser
- Documentation in function reference
- Performance validated on large datasets
- Matches Excel behavior (tested against Excel)