Adding Excel Functions¶

This guide walks through the process of adding new Excel function support to CalcBridge's calculation engine. The engine uses vectorized operations for high performance on large financial datasets.

Architecture Overview¶

The calculation engine has three main components:

flowchart LR
    A["Formula String<br/>=SUM(A1:A10)"] --> B["Parser<br/>src/calculations/parser/"]
    B --> C["AST<br/>Abstract Syntax Tree"]
    C --> D["Translator<br/>src/calculations/parser/translator.py"]
    D --> E["Vectorized Functions<br/>src/calculations/functions/"]
    E --> F["Result"]

Component	Location	Purpose
Parser	`src/calculations/parser/`	Tokenizes and parses Excel formulas into AST
AST Nodes	`src/calculations/parser/ast_nodes.py`	Defines AST node types
Translator	`src/calculations/parser/translator.py`	Converts AST to executable code
Vectorized Functions	`src/calculations/functions/vectorized.py`	High-performance implementations
Function Registry	`src/calculations/functions/__init__.py`	Exports available functions

Step-by-Step Guide¶

Step 1: Implement in vectorized.py¶

First, add your function implementation to src/calculations/functions/vectorized.py:

src/calculations/functions/vectorized.py

def vectorized_round(
    values: pd.Series | np.ndarray,
    decimals: int = 0,
) -> np.ndarray:
    """
    Vectorized implementation of Excel ROUND function.

    Equivalent to: =ROUND(value, decimals)

    Args:
        values: Values to round
        decimals: Number of decimal places (can be negative)

    Returns:
        numpy array with rounded values

    Example:
        >>> df['Rounded'] = vectorized_round(df['Amount'], 2)

    Excel equivalent:
        =ROUND([@Amount], 2)
    """
    return np.round(np.asarray(values), decimals)

Implementation Guidelines¶

Always use vectorized operations - Avoid loops; use numpy/pandas operations
Handle edge cases - Empty arrays, NaN values, type mismatches
Match Excel behavior - Test against Excel to ensure identical results
Document thoroughly - Include docstring, example, and Excel equivalent

Common Patterns¶

# Conditional logic: Use np.where
def vectorized_if(condition, value_true, value_false):
    return np.where(condition, value_true, value_false)

# Multiple conditions: Chain np.where or use np.select
def vectorized_ifs(*conditions_and_values, default=None):
    conditions = [c for i, c in enumerate(conditions_and_values) if i % 2 == 0]
    values = [v for i, v in enumerate(conditions_and_values) if i % 2 == 1]
    return np.select(conditions, values, default=default)

# Aggregation with conditions: Use boolean masking
def vectorized_sumif(range_values, criteria_range, criteria):
    mask = criteria_range == criteria
    return np.sum(np.where(mask, range_values, 0))

# Lookups: Use dictionary mapping
def vectorized_xlookup(lookup_values, lookup_dict, default=None):
    return lookup_values.map(lookup_dict).fillna(default)

Step 2: Register in Function Registry¶

Export your function in src/calculations/functions/__init__.py:

src/calculations/functions/__init__.py

from src.calculations.functions.vectorized import (
    # ... existing imports ...
    vectorized_round,  # Add your new function
)

__all__ = [
    # ... existing exports ...
    "vectorized_round",  # Add to __all__
]

Step 3: Add Parser Support (if needed)¶

Most functions use standard function call syntax and don't need parser changes. However, if your function has special syntax, you may need to modify the parser.

Standard Syntax (No Changes Needed)¶

=ROUND(A1, 2)
=SUM(A1:A10)
=IF(A1>0, "Positive", "Non-positive")

Special Syntax (Parser Changes Required)¶

If your function has unique syntax (like array formulas or special operators), modify the parser:

src/calculations/parser/parser.py

# Add to the function signature registry
FUNCTION_SIGNATURES = {
    # ... existing functions ...
    "ROUND": FunctionSignature(
        name="ROUND",
        min_args=1,
        max_args=2,
        arg_types=["number", "number"],
        return_type="number",
    ),
}

Step 4: Add Translator Support¶

Update the translator to convert the AST node to your vectorized function call:

src/calculations/parser/translator.py

def translate_function_call(node: FunctionCallNode) -> str:
    """Translate a function call AST node to Python code."""
    func_name = node.function_name.upper()

    # ... existing function translations ...

    if func_name == "ROUND":
        args = [translate_expression(arg) for arg in node.arguments]
        value = args[0]
        decimals = args[1] if len(args) > 1 else "0"
        return f"vectorized_round({value}, {decimals})"

    # ... rest of translations ...

Step 5: Write Tests¶

Create comprehensive tests for your function:

tests/unit/test_vectorized_functions.py

import pytest
import numpy as np
import pandas as pd
from numpy.testing import assert_array_equal, assert_array_almost_equal

from src.calculations.functions import vectorized_round


class TestVectorizedRound:
    """Tests for vectorized_round function."""

    def test_round_to_integer(self):
        """Test rounding to integer (0 decimal places)."""
        values = pd.Series([1.4, 1.5, 1.6, 2.5])
        result = vectorized_round(values, 0)
        # Note: numpy uses banker's rounding (round half to even)
        expected = np.array([1.0, 2.0, 2.0, 2.0])
        assert_array_equal(result, expected)

    def test_round_to_decimals(self):
        """Test rounding to specific decimal places."""
        values = pd.Series([1.234, 5.678, 9.999])
        result = vectorized_round(values, 2)
        expected = np.array([1.23, 5.68, 10.0])
        assert_array_almost_equal(result, expected)

    def test_round_negative_decimals(self):
        """Test rounding to tens, hundreds, etc."""
        values = pd.Series([123, 456, 789])
        result = vectorized_round(values, -1)
        expected = np.array([120, 460, 790])
        assert_array_equal(result, expected)

    def test_round_empty_array(self):
        """Test handling of empty input."""
        values = pd.Series([], dtype=float)
        result = vectorized_round(values, 2)
        assert len(result) == 0

    def test_round_with_nan(self):
        """Test handling of NaN values."""
        values = pd.Series([1.5, np.nan, 2.5])
        result = vectorized_round(values, 0)
        assert np.isnan(result[1])
        assert result[0] == 2.0
        assert result[2] == 2.0

    def test_round_numpy_array_input(self):
        """Test with numpy array input."""
        values = np.array([1.234, 5.678])
        result = vectorized_round(values, 1)
        expected = np.array([1.2, 5.7])
        assert_array_almost_equal(result, expected)

Step 6: Test Parser Integration¶

tests/unit/test_parser.py

import pytest
from src.calculations.parser import parse_formula, translate_formula


class TestRoundFunction:
    """Parser tests for ROUND function."""

    def test_parse_round_with_decimals(self):
        """Test parsing ROUND with decimal argument."""
        ast = parse_formula("=ROUND(A1, 2)")
        assert ast.function_name == "ROUND"
        assert len(ast.arguments) == 2

    def test_parse_round_default_decimals(self):
        """Test parsing ROUND with default decimals."""
        ast = parse_formula("=ROUND(A1)")
        assert ast.function_name == "ROUND"
        assert len(ast.arguments) == 1

    def test_translate_round(self):
        """Test translation to vectorized function."""
        result = translate_formula("=ROUND(A1, 2)")
        assert "vectorized_round" in result.code

Step 7: Update Documentation¶

Add your function to the function reference:

docs-site/docs/reference/functions.md

### ROUND

Rounds a number to a specified number of digits.

**Syntax**: `=ROUND(value, [decimals])`

**Arguments**:

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `value` | number | Yes | The value to round |
| `decimals` | number | No | Number of decimal places (default: 0) |

**Returns**: number

**Examples**:

=ROUND(1.234, 2) -> 1.23 =ROUND(1.5) -> 2 =ROUND(123, -1) -> 120

Example: Implementing AVERAGEIF¶

Let's walk through a complete example of implementing AVERAGEIF:

1. Implementation¶

src/calculations/functions/vectorized.py

def vectorized_averageif(
    range_values: pd.Series | np.ndarray,
    criteria_range: pd.Series | np.ndarray,
    criteria: Any,
    average_range: pd.Series | np.ndarray | None = None,
) -> float:
    """
    Vectorized implementation of Excel AVERAGEIF function.

    Equivalent to: =AVERAGEIF(criteria_range, criteria, [average_range])

    Args:
        range_values: The range to evaluate with criteria
        criteria_range: The range containing the criteria values
        criteria: The criteria to match (value, expression, or pattern)
        average_range: Optional range to average (defaults to criteria_range)

    Returns:
        Average of values where criteria is met, or NaN if no matches

    Example:
        >>> df['AvgByCategory'] = vectorized_averageif(
        ...     df['Value'],
        ...     df['Category'],
        ...     'A'
        ... )

    Excel equivalent:
        =AVERAGEIF(Category, "A", Value)
    """
    criteria_arr = np.asarray(criteria_range)
    values_arr = np.asarray(average_range if average_range is not None else range_values)

    # Build condition based on criteria type
    if isinstance(criteria, str):
        if criteria.startswith(">"):
            threshold = float(criteria[1:])
            mask = criteria_arr > threshold
        elif criteria.startswith("<"):
            threshold = float(criteria[1:])
            mask = criteria_arr < threshold
        elif criteria.startswith(">="):
            threshold = float(criteria[2:])
            mask = criteria_arr >= threshold
        elif criteria.startswith("<="):
            threshold = float(criteria[2:])
            mask = criteria_arr <= threshold
        elif criteria.startswith("<>"):
            compare_val = criteria[2:]
            mask = criteria_arr != compare_val
        else:
            mask = criteria_arr == criteria
    else:
        mask = criteria_arr == criteria

    matching_values = values_arr[mask]

    if len(matching_values) == 0:
        return np.nan

    return np.mean(matching_values)

2. Register¶

src/calculations/functions/__init__.py

from src.calculations.functions.vectorized import (
    # ... existing ...
    vectorized_averageif,
)

__all__ = [
    # ... existing ...
    "vectorized_averageif",
]

3. Tests¶

tests/unit/test_averageif.py

import pytest
import numpy as np
import pandas as pd

from src.calculations.functions import vectorized_averageif


class TestVectorizedAverageIf:
    """Tests for vectorized_averageif function."""

    def test_basic_equality(self):
        """Test averaging with equality criteria."""
        values = pd.Series([10, 20, 30, 40])
        categories = pd.Series(['A', 'B', 'A', 'B'])

        result = vectorized_averageif(values, categories, 'A')
        assert result == 20.0  # Average of 10 and 30

    def test_greater_than(self):
        """Test with greater than criteria."""
        values = pd.Series([10, 20, 30, 40, 50])

        result = vectorized_averageif(values, values, '>25')
        assert result == 40.0  # Average of 30, 40, 50

    def test_no_matches_returns_nan(self):
        """Test that no matches returns NaN."""
        values = pd.Series([10, 20, 30])
        categories = pd.Series(['A', 'A', 'A'])

        result = vectorized_averageif(values, categories, 'B')
        assert np.isnan(result)

    def test_separate_average_range(self):
        """Test with separate average range."""
        criteria_range = pd.Series(['X', 'Y', 'X', 'Y'])
        average_range = pd.Series([100, 200, 300, 400])

        result = vectorized_averageif(
            average_range,
            criteria_range,
            'X',
            average_range
        )
        assert result == 200.0  # Average of 100 and 300

Performance Considerations¶

Vectorization is Critical¶

Always use vectorized operations. The performance difference is significant:

Approach	10K rows	100K rows	1M rows
Vectorized (numpy)	1ms	5ms	50ms
df.apply()	50ms	500ms	5s
df.iterrows()	500ms	5s	50s

Memory Efficiency¶

For large datasets, consider memory-efficient approaches:

# Good: In-place operations where possible
def vectorized_abs_inplace(values: np.ndarray) -> np.ndarray:
    """Absolute value with potential for in-place operation."""
    return np.abs(values, out=values if values.flags.writeable else None)

# Good: Use appropriate dtypes
df['flag'] = df['value'] > 0  # bool uses 1 byte vs 8 for float64

Caching Lookup Dictionaries¶

For lookup functions, build dictionaries once and reuse:

# Good: Build lookup once
lookup_dict = dict(zip(reference_df['key'], reference_df['value']))
df['result'] = df['lookup_col'].map(lookup_dict)

# Avoid: Repeated lookups
# for idx, row in df.iterrows():
#     df.at[idx, 'result'] = reference_df[reference_df['key'] == row['lookup_col']]['value'].iloc[0]

Checklist¶

Before submitting a new function:

Implementation in vectorized.py with full docstring
Function exported in __init__.py
Parser signature registered (if needed)
Translator support added
Unit tests with edge cases
Integration tests with parser
Documentation in function reference
Performance validated on large datasets
Matches Excel behavior (tested against Excel)