Skip to content

Adding Excel Functions

This guide walks through the process of adding new Excel function support to CalcBridge's calculation engine. The engine uses vectorized operations for high performance on large financial datasets.


Architecture Overview

The calculation engine has three main components:

flowchart LR
    A["Formula String<br/>=SUM(A1:A10)"] --> B["Parser<br/>src/calculations/parser/"]
    B --> C["AST<br/>Abstract Syntax Tree"]
    C --> D["Translator<br/>src/calculations/parser/translator.py"]
    D --> E["Vectorized Functions<br/>src/calculations/functions/"]
    E --> F["Result"]
Component Location Purpose
Parser src/calculations/parser/ Tokenizes and parses Excel formulas into AST
AST Nodes src/calculations/parser/ast_nodes.py Defines AST node types
Translator src/calculations/parser/translator.py Converts AST to executable code
Vectorized Functions src/calculations/functions/vectorized.py High-performance implementations
Function Registry src/calculations/functions/__init__.py Exports available functions

Step-by-Step Guide

Step 1: Implement in vectorized.py

First, add your function implementation to src/calculations/functions/vectorized.py:

src/calculations/functions/vectorized.py
def vectorized_round(
    values: pd.Series | np.ndarray,
    decimals: int = 0,
) -> np.ndarray:
    """
    Vectorized implementation of Excel ROUND function.

    Equivalent to: =ROUND(value, decimals)

    Args:
        values: Values to round
        decimals: Number of decimal places (can be negative)

    Returns:
        numpy array with rounded values

    Example:
        >>> df['Rounded'] = vectorized_round(df['Amount'], 2)

    Excel equivalent:
        =ROUND([@Amount], 2)
    """
    return np.round(np.asarray(values), decimals)

Implementation Guidelines

  1. Always use vectorized operations - Avoid loops; use numpy/pandas operations
  2. Handle edge cases - Empty arrays, NaN values, type mismatches
  3. Match Excel behavior - Test against Excel to ensure identical results
  4. Document thoroughly - Include docstring, example, and Excel equivalent

Common Patterns

# Conditional logic: Use np.where
def vectorized_if(condition, value_true, value_false):
    return np.where(condition, value_true, value_false)

# Multiple conditions: Chain np.where or use np.select
def vectorized_ifs(*conditions_and_values, default=None):
    conditions = [c for i, c in enumerate(conditions_and_values) if i % 2 == 0]
    values = [v for i, v in enumerate(conditions_and_values) if i % 2 == 1]
    return np.select(conditions, values, default=default)

# Aggregation with conditions: Use boolean masking
def vectorized_sumif(range_values, criteria_range, criteria):
    mask = criteria_range == criteria
    return np.sum(np.where(mask, range_values, 0))

# Lookups: Use dictionary mapping
def vectorized_xlookup(lookup_values, lookup_dict, default=None):
    return lookup_values.map(lookup_dict).fillna(default)

Step 2: Register in Function Registry

Export your function in src/calculations/functions/__init__.py:

src/calculations/functions/__init__.py
from src.calculations.functions.vectorized import (
    # ... existing imports ...
    vectorized_round,  # Add your new function
)

__all__ = [
    # ... existing exports ...
    "vectorized_round",  # Add to __all__
]

Step 3: Add Parser Support (if needed)

Most functions use standard function call syntax and don't need parser changes. However, if your function has special syntax, you may need to modify the parser.

Standard Syntax (No Changes Needed)

=ROUND(A1, 2)
=SUM(A1:A10)
=IF(A1>0, "Positive", "Non-positive")

Special Syntax (Parser Changes Required)

If your function has unique syntax (like array formulas or special operators), modify the parser:

src/calculations/parser/parser.py
# Add to the function signature registry
FUNCTION_SIGNATURES = {
    # ... existing functions ...
    "ROUND": FunctionSignature(
        name="ROUND",
        min_args=1,
        max_args=2,
        arg_types=["number", "number"],
        return_type="number",
    ),
}

Step 4: Add Translator Support

Update the translator to convert the AST node to your vectorized function call:

src/calculations/parser/translator.py
def translate_function_call(node: FunctionCallNode) -> str:
    """Translate a function call AST node to Python code."""
    func_name = node.function_name.upper()

    # ... existing function translations ...

    if func_name == "ROUND":
        args = [translate_expression(arg) for arg in node.arguments]
        value = args[0]
        decimals = args[1] if len(args) > 1 else "0"
        return f"vectorized_round({value}, {decimals})"

    # ... rest of translations ...

Step 5: Write Tests

Create comprehensive tests for your function:

tests/unit/test_vectorized_functions.py
import pytest
import numpy as np
import pandas as pd
from numpy.testing import assert_array_equal, assert_array_almost_equal

from src.calculations.functions import vectorized_round


class TestVectorizedRound:
    """Tests for vectorized_round function."""

    def test_round_to_integer(self):
        """Test rounding to integer (0 decimal places)."""
        values = pd.Series([1.4, 1.5, 1.6, 2.5])
        result = vectorized_round(values, 0)
        # Note: numpy uses banker's rounding (round half to even)
        expected = np.array([1.0, 2.0, 2.0, 2.0])
        assert_array_equal(result, expected)

    def test_round_to_decimals(self):
        """Test rounding to specific decimal places."""
        values = pd.Series([1.234, 5.678, 9.999])
        result = vectorized_round(values, 2)
        expected = np.array([1.23, 5.68, 10.0])
        assert_array_almost_equal(result, expected)

    def test_round_negative_decimals(self):
        """Test rounding to tens, hundreds, etc."""
        values = pd.Series([123, 456, 789])
        result = vectorized_round(values, -1)
        expected = np.array([120, 460, 790])
        assert_array_equal(result, expected)

    def test_round_empty_array(self):
        """Test handling of empty input."""
        values = pd.Series([], dtype=float)
        result = vectorized_round(values, 2)
        assert len(result) == 0

    def test_round_with_nan(self):
        """Test handling of NaN values."""
        values = pd.Series([1.5, np.nan, 2.5])
        result = vectorized_round(values, 0)
        assert np.isnan(result[1])
        assert result[0] == 2.0
        assert result[2] == 2.0

    def test_round_numpy_array_input(self):
        """Test with numpy array input."""
        values = np.array([1.234, 5.678])
        result = vectorized_round(values, 1)
        expected = np.array([1.2, 5.7])
        assert_array_almost_equal(result, expected)

Step 6: Test Parser Integration

tests/unit/test_parser.py
import pytest
from src.calculations.parser import parse_formula, translate_formula


class TestRoundFunction:
    """Parser tests for ROUND function."""

    def test_parse_round_with_decimals(self):
        """Test parsing ROUND with decimal argument."""
        ast = parse_formula("=ROUND(A1, 2)")
        assert ast.function_name == "ROUND"
        assert len(ast.arguments) == 2

    def test_parse_round_default_decimals(self):
        """Test parsing ROUND with default decimals."""
        ast = parse_formula("=ROUND(A1)")
        assert ast.function_name == "ROUND"
        assert len(ast.arguments) == 1

    def test_translate_round(self):
        """Test translation to vectorized function."""
        result = translate_formula("=ROUND(A1, 2)")
        assert "vectorized_round" in result.code

Step 7: Update Documentation

Add your function to the function reference:

docs-site/docs/reference/functions.md
### ROUND

Rounds a number to a specified number of digits.

**Syntax**: `=ROUND(value, [decimals])`

**Arguments**:

| Argument | Type | Required | Description |
|----------|------|----------|-------------|
| `value` | number | Yes | The value to round |
| `decimals` | number | No | Number of decimal places (default: 0) |

**Returns**: number

**Examples**:
=ROUND(1.234, 2) -> 1.23 =ROUND(1.5) -> 2 =ROUND(123, -1) -> 120


Example: Implementing AVERAGEIF

Let's walk through a complete example of implementing AVERAGEIF:

1. Implementation

src/calculations/functions/vectorized.py
def vectorized_averageif(
    range_values: pd.Series | np.ndarray,
    criteria_range: pd.Series | np.ndarray,
    criteria: Any,
    average_range: pd.Series | np.ndarray | None = None,
) -> float:
    """
    Vectorized implementation of Excel AVERAGEIF function.

    Equivalent to: =AVERAGEIF(criteria_range, criteria, [average_range])

    Args:
        range_values: The range to evaluate with criteria
        criteria_range: The range containing the criteria values
        criteria: The criteria to match (value, expression, or pattern)
        average_range: Optional range to average (defaults to criteria_range)

    Returns:
        Average of values where criteria is met, or NaN if no matches

    Example:
        >>> df['AvgByCategory'] = vectorized_averageif(
        ...     df['Value'],
        ...     df['Category'],
        ...     'A'
        ... )

    Excel equivalent:
        =AVERAGEIF(Category, "A", Value)
    """
    criteria_arr = np.asarray(criteria_range)
    values_arr = np.asarray(average_range if average_range is not None else range_values)

    # Build condition based on criteria type
    if isinstance(criteria, str):
        if criteria.startswith(">"):
            threshold = float(criteria[1:])
            mask = criteria_arr > threshold
        elif criteria.startswith("<"):
            threshold = float(criteria[1:])
            mask = criteria_arr < threshold
        elif criteria.startswith(">="):
            threshold = float(criteria[2:])
            mask = criteria_arr >= threshold
        elif criteria.startswith("<="):
            threshold = float(criteria[2:])
            mask = criteria_arr <= threshold
        elif criteria.startswith("<>"):
            compare_val = criteria[2:]
            mask = criteria_arr != compare_val
        else:
            mask = criteria_arr == criteria
    else:
        mask = criteria_arr == criteria

    matching_values = values_arr[mask]

    if len(matching_values) == 0:
        return np.nan

    return np.mean(matching_values)

2. Register

src/calculations/functions/__init__.py
from src.calculations.functions.vectorized import (
    # ... existing ...
    vectorized_averageif,
)

__all__ = [
    # ... existing ...
    "vectorized_averageif",
]

3. Tests

tests/unit/test_averageif.py
import pytest
import numpy as np
import pandas as pd

from src.calculations.functions import vectorized_averageif


class TestVectorizedAverageIf:
    """Tests for vectorized_averageif function."""

    def test_basic_equality(self):
        """Test averaging with equality criteria."""
        values = pd.Series([10, 20, 30, 40])
        categories = pd.Series(['A', 'B', 'A', 'B'])

        result = vectorized_averageif(values, categories, 'A')
        assert result == 20.0  # Average of 10 and 30

    def test_greater_than(self):
        """Test with greater than criteria."""
        values = pd.Series([10, 20, 30, 40, 50])

        result = vectorized_averageif(values, values, '>25')
        assert result == 40.0  # Average of 30, 40, 50

    def test_no_matches_returns_nan(self):
        """Test that no matches returns NaN."""
        values = pd.Series([10, 20, 30])
        categories = pd.Series(['A', 'A', 'A'])

        result = vectorized_averageif(values, categories, 'B')
        assert np.isnan(result)

    def test_separate_average_range(self):
        """Test with separate average range."""
        criteria_range = pd.Series(['X', 'Y', 'X', 'Y'])
        average_range = pd.Series([100, 200, 300, 400])

        result = vectorized_averageif(
            average_range,
            criteria_range,
            'X',
            average_range
        )
        assert result == 200.0  # Average of 100 and 300

Performance Considerations

Vectorization is Critical

Always use vectorized operations. The performance difference is significant:

Approach 10K rows 100K rows 1M rows
Vectorized (numpy) 1ms 5ms 50ms
df.apply() 50ms 500ms 5s
df.iterrows() 500ms 5s 50s

Memory Efficiency

For large datasets, consider memory-efficient approaches:

# Good: In-place operations where possible
def vectorized_abs_inplace(values: np.ndarray) -> np.ndarray:
    """Absolute value with potential for in-place operation."""
    return np.abs(values, out=values if values.flags.writeable else None)

# Good: Use appropriate dtypes
df['flag'] = df['value'] > 0  # bool uses 1 byte vs 8 for float64

Caching Lookup Dictionaries

For lookup functions, build dictionaries once and reuse:

# Good: Build lookup once
lookup_dict = dict(zip(reference_df['key'], reference_df['value']))
df['result'] = df['lookup_col'].map(lookup_dict)

# Avoid: Repeated lookups
# for idx, row in df.iterrows():
#     df.at[idx, 'result'] = reference_df[reference_df['key'] == row['lookup_col']]['value'].iloc[0]

Checklist

Before submitting a new function:

  • Implementation in vectorized.py with full docstring
  • Function exported in __init__.py
  • Parser signature registered (if needed)
  • Translator support added
  • Unit tests with edge cases
  • Integration tests with parser
  • Documentation in function reference
  • Performance validated on large datasets
  • Matches Excel behavior (tested against Excel)