System Design

This document provides a comprehensive overview of CalcBridge's system architecture, including component interactions, request flows, and deployment topology.

High-Level System Diagram

flowchart TB
    subgraph External["External"]
        USER["Users"]
        EXT_API["External APIs"]
    end

    subgraph Edge["Edge Layer"]
        CDN["CDN<br/>(Static Assets)"]
        LB["Load Balancer<br/>(NGINX/ALB)"]
    end

    subgraph Frontend["Frontend Tier"]
        NEXT["Next.js 14<br/>(SSR + CSR)"]
    end

    subgraph API["API Tier"]
        direction TB
        FASTAPI1["FastAPI Instance 1"]
        FASTAPI2["FastAPI Instance 2"]
        FASTAPI3["FastAPI Instance N"]
    end

    subgraph Services["Service Layer"]
        AUTH["Auth Service"]
        CALC["Calculation Engine"]
        TENANT["Tenant Service"]
        AUDIT["Audit Service"]
        ENCRYPT["Encryption Service"]
    end

    subgraph Workers["Worker Tier"]
        WORKER1["Celery Worker<br/>(Default Queue)"]
        WORKER2["Celery Worker<br/>(Calculate Queue)"]
        WORKER3["Celery Worker<br/>(Priority Queue)"]
    end

    subgraph Data["Data Tier"]
        PG_PRIMARY[("PostgreSQL 16<br/>(Primary)")]
        PG_REPLICA[("PostgreSQL<br/>(Read Replica)")]
        VALKEY[("Valkey<br/>(Cache)")]
    end

    subgraph Observability["Observability"]
        PROM["Prometheus"]
        GRAFANA["Grafana"]
        OTEL["OpenTelemetry<br/>Collector"]
    end

    USER --> CDN
    USER --> LB
    EXT_API --> LB
    CDN --> NEXT
    LB --> NEXT
    LB --> API
    NEXT --> API

    FASTAPI1 --> Services
    FASTAPI2 --> Services
    FASTAPI3 --> Services

    Services --> Workers
    Services --> Data
    Workers --> Data

    API --> PROM
    Workers --> PROM
    PROM --> GRAFANA
    API --> OTEL

    PG_PRIMARY --> PG_REPLICA

    style PG_PRIMARY fill:#DBEAFE,stroke:#3B82F6
    style VALKEY fill:#FEF3C7,stroke:#F59E0B
    style CALC fill:#DCFCE7,stroke:#22C55E

Component Overview

API Layer (FastAPI)

The API layer handles all HTTP requests and provides the REST interface.

# src/api/main.py - App factory pattern
from fastapi import FastAPI
from src.api.middleware import (
    TenantMiddleware,
    RateLimitMiddleware,
    SecurityHeadersMiddleware,
    CorrelationMiddleware,
)

def create_app() -> FastAPI:
    app = FastAPI(
        title="CalcBridge API",
        version="1.0.0",
        docs_url="/docs",
    )

    # Middleware order matters: the last middleware added is the outermost
    # and runs first on each incoming request
    app.add_middleware(SecurityHeadersMiddleware)
    app.add_middleware(RateLimitMiddleware)
    app.add_middleware(TenantMiddleware)
    app.add_middleware(CorrelationMiddleware)

    return app
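
For local development, the factory can be served directly with uvicorn. A minimal sketch (the host and port values are illustrative assumptions):

# Local entry point (illustrative; host/port are assumptions)
import uvicorn
from src.api.main import create_app

app = create_app()

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)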

Key Components:

| Component | Purpose |
|---|---|
| Routes | API endpoint handlers |
| Middleware | Cross-cutting concerns (auth, rate limit) |
| Dependencies | Dependency injection (DB sessions, auth) |
| Exception Handlers | Unified error responses |
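
To make the "Exception Handlers" row concrete, here is a hedged sketch of a unified error response. The AppError class and the error envelope shape are illustrative assumptions, not taken from the CalcBridge codebase.

# Illustrative only: a unified error-response handler (AppError and the
# envelope shape are assumptions, not the project's real error types).
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

class AppError(Exception):
    def __init__(self, code: str, message: str, status_code: int = 400):
        self.code = code
        self.message = message
        self.status_code = status_code

def register_exception_handlers(app: FastAPI) -> None:
    @app.exception_handler(AppError)
    async def handle_app_error(request: Request, exc: AppError) -> JSONResponse:
        # Every handled error returns the same JSON envelope
        return JSONResponse(
            status_code=exc.status_code,
            content={"error": {"code": exc.code, "message": exc.message}},
        )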

Calculation Engine

The calculation engine processes Excel formulas safely and efficiently.

flowchart LR
    subgraph Input
        FORMULA["Excel Formula"]
    end

    subgraph Parser["Parser Pipeline"]
        TOK["Tokenizer"]
        PARSE["Parser"]
        AST["AST Builder"]
    end

    subgraph Executor["Execution"]
        TRANS["Translator"]
        SAFE["Safe Evaluator"]
        VEC["Vectorized Ops"]
    end

    subgraph Output
        RESULT["Result"]
    end

    FORMULA --> TOK
    TOK --> PARSE
    PARSE --> AST
    AST --> TRANS
    TRANS --> SAFE
    SAFE --> VEC
    VEC --> RESULT

    style SAFE fill:#DCFCE7,stroke:#22C55E

Processing Pipeline:

  1. Tokenizer: Converts formula string to tokens
  2. Parser: Builds Abstract Syntax Tree from tokens
  3. Translator: Converts AST to executable operations
  4. Safe Evaluator: Executes without dynamic code execution
  5. Vectorized Operations: Uses pandas/numpy for performance
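
To illustrate step 4, here is a minimal sketch of whitelist-based evaluation: the evaluator walks the AST and dispatches only to an explicit table of allowed operations, so no eval()/exec() is ever involved. The node shape and the function table are assumptions for illustration, not the engine's real types.

# Illustrative sketch of whitelist-based evaluation (node shapes are assumptions).
from dataclasses import dataclass
from typing import Union

import numpy as np

@dataclass
class Call:
    func: str
    args: list

Node = Union[Call, float, np.ndarray]

# Only explicitly whitelisted operations can run; no dynamic code execution.
ALLOWED_FUNCS = {
    "SUM": lambda *xs: np.sum(xs),
    "AVERAGE": lambda *xs: np.mean(xs),
    "MAX": lambda *xs: np.max(xs),
}

def evaluate(node: Node):
    if isinstance(node, Call):
        if node.func not in ALLOWED_FUNCS:
            raise ValueError(f"Function not allowed: {node.func}")
        return ALLOWED_FUNCS[node.func](*[evaluate(a) for a in node.args])
    return node  # literal value or pre-resolved range

# Example: =SUM(1, 2, MAX(3, 4)) -> 7.0
result = evaluate(Call("SUM", [1.0, 2.0, Call("MAX", [3.0, 4.0])]))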

Database Layer

PostgreSQL 16 with Row-Level Security for tenant isolation.

erDiagram
    TENANT ||--o{ USER : contains
    TENANT ||--o{ WORKBOOK : owns
    TENANT ||--o{ API_KEY : has
    USER ||--o{ WORKBOOK : creates
    WORKBOOK ||--o{ SHEET : contains
    SHEET ||--o{ CELL_DATA : stores

    TENANT {
        uuid id PK
        string name
        string slug UK
        string subscription_tier
        jsonb settings
        boolean is_active
    }

    USER {
        uuid id PK
        uuid tenant_id FK
        string email UK
        string password_hash
        string role
    }

    WORKBOOK {
        uuid id PK
        uuid tenant_id FK
        uuid created_by FK
        string name
        jsonb metadata
    }

    SHEET {
        uuid id PK
        uuid workbook_id FK
        string name
        integer row_count
        integer col_count
    }

    CELL_DATA {
        uuid id PK
        uuid sheet_id FK
        jsonb data
    }
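
RLS policies read a per-session tenant setting, which TenantMiddleware establishes for each request (the "Set tenant context" step in the sequence diagram below). A minimal sketch, assuming a GUC named app.current_tenant_id; the setting name and the example policy are assumptions, not taken from the schema.

# Sketch: applying a per-request tenant setting that RLS policies can read.
# The GUC name "app.current_tenant_id" is an assumption.
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

async def set_tenant_context(session: AsyncSession, tenant_id: str) -> None:
    # set_config(..., true) scopes the value to the current transaction, so a
    # policy such as:
    #   USING (tenant_id = current_setting('app.current_tenant_id')::uuid)
    # only ever sees rows belonging to the caller's tenant.
    await session.execute(
        text("SELECT set_config('app.current_tenant_id', :tenant_id, true)"),
        {"tenant_id": tenant_id},
    )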

Cache Layer (Valkey)

Valkey provides distributed caching and rate limiting.

Cache Patterns:

| Pattern | Use Case | TTL |
|---|---|---|
| Session Cache | User sessions | 15 min |
| Calculation Cache | Formula results | 5 min |
| Rate Limit | Request counters | 60 sec |
| Tenant Config | Tenant settings | 10 min |

# Cache key structure
CACHE_KEYS = {
    "session": "calcbridge:session:{session_id}",
    "rate_limit": "calcbridge:rate:{identifier}",
    "calculation": "calcbridge:calc:{tenant_id}:{formula_hash}",
    "tenant": "calcbridge:tenant:{tenant_id}:config",
}
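
A sketch of the read-through pattern behind the calculation cache, using the key template and the 5-minute TTL above. The client wiring and the compute_formula callable are assumptions; Valkey is wire-compatible with Redis clients.

# Sketch of a read-through cache for formula results (client wiring and the
# compute_formula callable are assumptions).
import json
import redis.asyncio as redis

valkey = redis.Redis(host="localhost", port=6379, decode_responses=True)

async def cached_calculation(tenant_id: str, formula_hash: str, compute_formula) -> dict:
    key = CACHE_KEYS["calculation"].format(tenant_id=tenant_id, formula_hash=formula_hash)
    cached = await valkey.get(key)
    if cached is not None:
        return json.loads(cached)
    result = await compute_formula()
    # TTL matches the 5-minute calculation-cache policy above
    await valkey.set(key, json.dumps(result), ex=300)
    return result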

Worker Layer (Celery)

Celery handles asynchronous task processing.

Queue Configuration:

| Queue | Workers | Purpose |
|---|---|---|
| default | 4 | General tasks, exports |
| parse | 2 | Excel file parsing |
| calculate | 4 | Formula calculations |
| priority | 2 | High-priority tasks |

# Task routing configuration
CELERY_TASK_ROUTES = {
    "src.workers.tasks.parse.*": {"queue": "parse"},
    "src.workers.tasks.calculate.*": {"queue": "calculate"},
    "src.workers.tasks.export.*": {"queue": "default"},
    "src.workers.tasks.priority.*": {"queue": "priority"},
}
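
A sketch of how the routing table might be attached to the Celery app; the broker URL and worker module path are assumptions, not the deployed configuration.

# Sketch: wiring the routes into the Celery app (broker URL and module
# path are assumptions).
from celery import Celery

celery_app = Celery("calcbridge", broker="redis://localhost:6379/0")
celery_app.conf.task_routes = CELERY_TASK_ROUTES
celery_app.conf.task_default_queue = "default"

# A worker dedicated to one queue is then started with, for example:
#   celery -A src.workers.app worker -Q calculate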

Request/Response Flow

Typical API Request

sequenceDiagram
    participant C as Client
    participant LB as Load Balancer
    participant API as FastAPI
    participant MW as Middleware
    participant SVC as Service
    participant DB as PostgreSQL
    participant CACHE as Valkey

    C->>LB: POST /api/v1/calculations/evaluate
    LB->>API: Forward request

    API->>MW: Correlation ID
    MW->>MW: Rate limit check
    MW->>CACHE: Get rate limit counter
    CACHE-->>MW: Counter value

    alt Rate limit OK
        MW->>MW: JWT validation
        MW->>MW: Set tenant context
        MW->>SVC: Process request
        SVC->>DB: Query with RLS
        DB-->>SVC: Results
        SVC->>CACHE: Cache result
        SVC-->>MW: Response
        MW-->>API: Add security headers
        API-->>LB: JSON response
        LB-->>C: 200 OK
    else Rate limited
        MW-->>C: 429 Too Many Requests
    end
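
The "Rate limit check" step can be read as a fixed-window counter in Valkey. A sketch, reusing the illustrative valkey client from the cache example above; the per-window limit of 100 is an assumption, while the 60-second window and key template come from the cache tables.

# Sketch: fixed-window rate limiting against Valkey (the limit value is an assumption).
async def is_rate_limited(identifier: str, limit: int = 100, window_seconds: int = 60) -> bool:
    key = CACHE_KEYS["rate_limit"].format(identifier=identifier)
    count = await valkey.incr(key)
    if count == 1:
        # First hit in this window: start the 60-second expiry
        await valkey.expire(key, window_seconds)
    return count > limit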

Async Task Flow

sequenceDiagram
    participant C as Client
    participant API as FastAPI
    participant Q as Celery Queue
    participant W as Worker
    participant DB as PostgreSQL
    participant WS as WebSocket

    C->>API: POST /api/v1/workbooks/upload
    API->>DB: Create workbook record
    API->>Q: Enqueue parse task
    API-->>C: 202 Accepted (job_id)

    Q->>W: Dequeue task
    W->>W: Parse Excel file
    W->>DB: Store parsed data
    W->>WS: Notify completion
    WS-->>C: Job complete event

    C->>API: GET /api/v1/jobs/{job_id}
    API->>DB: Get job status
    API-->>C: 200 OK (completed)
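
A sketch of the upload endpoint side of this flow. The task import path and response shape are assumptions, but .delay() enqueueing onto the parse queue follows the routing table above.

# Sketch of the upload endpoint (task import path and response shape are assumptions).
import uuid

from fastapi import APIRouter, UploadFile, status

from src.workers.tasks.parse import parse_workbook  # assumed task name

router = APIRouter(prefix="/api/v1/workbooks")

@router.post("/upload", status_code=status.HTTP_202_ACCEPTED)
async def upload_workbook(file: UploadFile):
    workbook_id = str(uuid.uuid4())  # stand-in for the DB insert shown in the flow
    job = parse_workbook.delay(workbook_id, file.filename)
    # 202 Accepted: the client polls /api/v1/jobs/{job_id} or listens on the WebSocket
    return {"job_id": job.id, "status": "queued"}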

Deployment Topology

Development Environment

flowchart LR
    subgraph Local["Local Machine"]
        DEV["Developer"]
        VENV["Python venv"]
        NODE["Node.js"]
    end

    subgraph Docker["Docker Compose"]
        PG["PostgreSQL"]
        VK["Valkey"]
        CELERY["Celery"]
    end

    DEV --> VENV
    DEV --> NODE
    VENV --> Docker

Production Environment

flowchart TB
    subgraph Internet
        USERS["Users"]
    end

    subgraph CloudProvider["Cloud Provider"]
        subgraph Edge
            CDN["CloudFront/CDN"]
            ALB["Application LB"]
        end

        subgraph Compute["ECS/Kubernetes"]
            subgraph APICluster["API Cluster"]
                API1["API Pod 1"]
                API2["API Pod 2"]
                APIN["API Pod N"]
            end

            subgraph WorkerCluster["Worker Cluster"]
                W1["Worker Pod 1"]
                W2["Worker Pod 2"]
            end
        end

        subgraph Database["RDS"]
            PG_PRIMARY["Primary"]
            PG_REPLICA["Read Replica"]
        end

        subgraph Cache["ElastiCache"]
            VALKEY["Valkey Cluster"]
        end

        subgraph Storage["S3"]
            FILES["Excel Files"]
            LOGS["Audit Logs"]
        end
    end

    USERS --> CDN
    USERS --> ALB
    CDN --> ALB
    ALB --> APICluster
    APICluster --> Database
    APICluster --> Cache
    WorkerCluster --> Database
    WorkerCluster --> Cache
    WorkerCluster --> Storage

Scalability Considerations

Horizontal Scaling

| Component | Scaling Strategy |
|---|---|
| API | Add instances behind the load balancer |
| Workers | Add workers per queue based on queue depth |
| Database | Read replicas for read-heavy workloads |
| Cache | Valkey cluster for sharding |

Vertical Scaling

| Component | Recommended Sizing |
|---|---|
| API | 2 vCPU, 4 GB RAM per instance |
| Workers (calc) | 4 vCPU, 8 GB RAM |
| Workers (parse) | 2 vCPU, 4 GB RAM |
| Database | 4 vCPU, 16 GB RAM (primary) |

Performance Targets

| Metric | Target | Monitoring |
|---|---|---|
| API latency (P50) | <100 ms | Prometheus histogram |
| API latency (P95) | <200 ms | Prometheus histogram |
| API latency (P99) | <500 ms | Prometheus histogram |
| Error rate | <0.1% | Prometheus counter |
| Availability | 99.9% | Uptime monitoring |
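
The latency targets are observed through a request-duration histogram. A sketch of the instrumentation, assuming prometheus_client; the metric name, labels, and buckets are illustrative, not the deployed configuration.

# Sketch of the latency histogram behind the P50/P95/P99 targets (metric
# name, labels, and buckets are assumptions).
import time

from prometheus_client import Histogram
from starlette.middleware.base import BaseHTTPMiddleware

REQUEST_LATENCY = Histogram(
    "calcbridge_request_duration_seconds",
    "API request latency",
    ["method", "path"],
    buckets=(0.05, 0.1, 0.2, 0.5, 1.0, 2.0),
)

class MetricsMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        # In production the route template, not the raw path, would keep
        # label cardinality bounded.
        REQUEST_LATENCY.labels(request.method, request.url.path).observe(
            time.perf_counter() - start
        )
        return response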

Health Checks

Liveness Probe

GET /health/live
# Returns 200 if process is running

Readiness Probe

GET /health/ready
# Returns 200 if:
# - Database connection OK
# - Valkey connection OK
# - Celery broker connection OK
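
A sketch of the readiness endpoint; the individual check helpers are assumptions and would each do a cheap round-trip (SELECT 1 for PostgreSQL, PING for Valkey, a broker connection check for Celery).

# Sketch of the readiness probe (check helpers are assumed).
from fastapi import APIRouter, Response, status

health_router = APIRouter(prefix="/health")

@health_router.get("/ready")
async def ready(response: Response) -> dict:
    checks = {
        "database": await check_database(),      # assumed helper: SELECT 1
        "cache": await check_valkey(),           # assumed helper: PING
        "celery_broker": await check_broker(),   # assumed helper: broker connect
    }
    if not all(checks.values()):
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
    return checks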

Startup Probe

GET /health
# Returns 200 with system status:
{
  "status": "healthy",
  "version": "1.0.0",
  "components": {
    "database": "ok",
    "cache": "ok",
    "workers": "ok"
  }
}

Resilience Patterns

Circuit Breaker

Prevents cascade failures when downstream services are unavailable.

from src.core.resilience import circuit_breaker

@circuit_breaker(
    failure_threshold=5,
    recovery_timeout=30,
    expected_exceptions=(ConnectionError, TimeoutError)
)
async def call_external_service():
    # External service call
    pass

Retry with Backoff

Handles transient failures gracefully.

from celery import shared_task

@shared_task(
    bind=True,
    max_retries=3,
    autoretry_for=(ConnectionError,),
    retry_backoff=True,  # Exponential backoff
    retry_backoff_max=300,
)
def process_workbook(self, workbook_id: str):
    pass

Graceful Degradation

Fallback behavior when services are unavailable.

from typing import Optional

async def get_cached_result(key: str) -> Optional[dict]:
    try:
        return await valkey.get(key)
    except ConnectionError:
        # Cache unavailable: fall back to the (slower) database copy
        return await db.get_cached_result(key)