System Design¶
This document provides a comprehensive overview of CalcBridge's system architecture, including component interactions, request flows, and deployment topology.
High-Level System Diagram¶
flowchart TB
subgraph External["External"]
USER["Users"]
EXT_API["External APIs"]
end
subgraph Edge["Edge Layer"]
CDN["CDN<br/>(Static Assets)"]
LB["Load Balancer<br/>(NGINX/ALB)"]
end
subgraph Frontend["Frontend Tier"]
NEXT["Next.js 14<br/>(SSR + CSR)"]
end
subgraph API["API Tier"]
direction TB
FASTAPI1["FastAPI Instance 1"]
FASTAPI2["FastAPI Instance 2"]
FASTAPI3["FastAPI Instance N"]
end
subgraph Services["Service Layer"]
AUTH["Auth Service"]
CALC["Calculation Engine"]
TENANT["Tenant Service"]
AUDIT["Audit Service"]
ENCRYPT["Encryption Service"]
end
subgraph Workers["Worker Tier"]
WORKER1["Celery Worker<br/>(Default Queue)"]
WORKER2["Celery Worker<br/>(Calculate Queue)"]
WORKER3["Celery Worker<br/>(Priority Queue)"]
end
subgraph Data["Data Tier"]
PG_PRIMARY[("PostgreSQL 16<br/>(Primary)")]
PG_REPLICA[("PostgreSQL<br/>(Read Replica)")]
VALKEY[("Valkey<br/>(Cache)")]
end
subgraph Observability["Observability"]
PROM["Prometheus"]
GRAFANA["Grafana"]
OTEL["OpenTelemetry<br/>Collector"]
end
USER --> CDN
USER --> LB
EXT_API --> LB
CDN --> NEXT
LB --> NEXT
LB --> API
NEXT --> API
FASTAPI1 --> Services
FASTAPI2 --> Services
FASTAPI3 --> Services
Services --> Workers
Services --> Data
Workers --> Data
API --> PROM
Workers --> PROM
PROM --> GRAFANA
API --> OTEL
PG_PRIMARY --> PG_REPLICA
style PG_PRIMARY fill:#DBEAFE,stroke:#3B82F6
style VALKEY fill:#FEF3C7,stroke:#F59E0B
style CALC fill:#DCFCE7,stroke:#22C55E
Component Overview¶
API Layer (FastAPI)¶
The API layer handles all HTTP requests and provides the REST interface.
# src/api/main.py - App factory pattern
from fastapi import FastAPI
from src.api.middleware import (
TenantMiddleware,
RateLimitMiddleware,
SecurityHeadersMiddleware,
CorrelationMiddleware,
)
def create_app() -> FastAPI:
app = FastAPI(
title="CalcBridge API",
version="1.0.0",
docs_url="/docs",
)
# Middleware order matters: the last one added runs first on each request
app.add_middleware(SecurityHeadersMiddleware)
app.add_middleware(RateLimitMiddleware)
app.add_middleware(TenantMiddleware)
app.add_middleware(CorrelationMiddleware)
return app
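The middleware classes imported above are project-specific. As an illustration of the cross-cutting style, a correlation-ID middleware could be sketched roughly as follows; the header name and implementation details are assumptions, not the actual CorrelationMiddleware.
# Hypothetical sketch of a correlation-ID middleware (illustrative, not the real implementation)
import uuid
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

class CorrelationIdSketch(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # Reuse the caller's correlation ID if present, otherwise mint one
        correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4()))
        request.state.correlation_id = correlation_id
        response = await call_next(request)
        # Echo the ID back so client and server logs can be joined
        response.headers["X-Correlation-ID"] = correlation_id
        return response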
Key Components:
| Component | Purpose |
|---|---|
| Routes | API endpoint handlers |
| Middleware | Cross-cutting concerns (auth, rate limit) |
| Dependencies | Dependency injection (DB sessions, auth) |
| Exception Handlers | Unified error responses |
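For the Dependencies row, a common pattern is a generator dependency that yields a database session per request. The sketch below assumes SQLAlchemy's async engine; the connection URL and names are illustrative.
# Hypothetical sketch of a per-request DB-session dependency (names and URL are illustrative)
from collections.abc import AsyncIterator
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine

engine = create_async_engine("postgresql+asyncpg://localhost/calcbridge")
SessionFactory = async_sessionmaker(engine, expire_on_commit=False)

async def get_db_session() -> AsyncIterator[AsyncSession]:
    # FastAPI opens the session before the handler runs and closes it afterwards
    async with SessionFactory() as session:
        yield session
A route handler would then declare session: AsyncSession = Depends(get_db_session) to receive it.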
Calculation Engine¶
The calculation engine processes Excel formulas safely and efficiently.
flowchart LR
subgraph Input
FORMULA["Excel Formula"]
end
subgraph Parser["Parser Pipeline"]
TOK["Tokenizer"]
PARSE["Parser"]
AST["AST Builder"]
end
subgraph Executor["Execution"]
TRANS["Translator"]
SAFE["Safe Evaluator"]
VEC["Vectorized Ops"]
end
subgraph Output
RESULT["Result"]
end
FORMULA --> TOK
TOK --> PARSE
PARSE --> AST
AST --> TRANS
TRANS --> SAFE
SAFE --> VEC
VEC --> RESULT
style SAFE fill:#DCFCE7,stroke:#22C55E
Processing Pipeline:
- Tokenizer: Converts formula string to tokens
- Parser: Builds Abstract Syntax Tree from tokens
- Translator: Converts AST to executable operations
- Safe Evaluator: Executes without dynamic code execution (see the sketch after this list)
- Vectorized Operations: Uses pandas/numpy for performance
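As a sketch of the evaluation step, the function below walks a parsed tree and only ever dispatches to whitelisted operations, so no eval() or exec() is involved. The tuple-based AST is a simplification, not the engine's actual representation.
# Hypothetical whitelist-based evaluation (illustrative, not the actual engine)
import operator

# Only whitelisted operations can ever execute; there is no dynamic code execution
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def evaluate(node):
    # Leaf nodes are literal values; branch nodes are (op, left, right) tuples
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    return OPS[op](evaluate(left), evaluate(right))

# =(1+2)*3 parsed into a tree, then evaluated safely
assert evaluate(("*", ("+", 1, 2), 3)) == 9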
Database Layer¶
PostgreSQL 16 with Row-Level Security for tenant isolation.
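Row-Level Security policies typically compare each row's tenant_id to a per-session setting that is bound before any query runs. A minimal sketch is shown below; the setting name app.current_tenant and the use of SQLAlchemy are assumptions.
# Hypothetical sketch: bind the tenant to the session so RLS policies can filter rows
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

async def set_tenant_context(session: AsyncSession, tenant_id: str) -> None:
    # set_config(..., true) scopes the value to the current transaction
    await session.execute(
        text("SELECT set_config('app.current_tenant', :tenant_id, true)"),
        {"tenant_id": tenant_id},
    )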
erDiagram
TENANT ||--o{ USER : contains
TENANT ||--o{ WORKBOOK : owns
TENANT ||--o{ API_KEY : has
USER ||--o{ WORKBOOK : creates
WORKBOOK ||--o{ SHEET : contains
SHEET ||--o{ CELL_DATA : stores
TENANT {
uuid id PK
string name
string slug UK
string subscription_tier
jsonb settings
boolean is_active
}
USER {
uuid id PK
uuid tenant_id FK
string email UK
string password_hash
string role
}
WORKBOOK {
uuid id PK
uuid tenant_id FK
uuid created_by FK
string name
jsonb metadata
}
SHEET {
uuid id PK
uuid workbook_id FK
string name
integer row_count
integer col_count
}
CELL_DATA {
uuid id PK
uuid sheet_id FK
jsonb data
}
Cache Layer (Valkey)¶
Valkey provides distributed caching and rate limiting.
Cache Patterns:
| Pattern | Use Case | TTL |
|---|---|---|
| Session Cache | User sessions | 15 min |
| Calculation Cache | Formula results | 5 min |
| Rate Limit | Request counters | 60 sec |
| Tenant Config | Tenant settings | 10 min |
# Cache key structure
CACHE_KEYS = {
"session": "calcbridge:session:{session_id}",
"rate_limit": "calcbridge:rate:{identifier}",
"calculation": "calcbridge:calc:{tenant_id}:{formula_hash}",
"tenant": "calcbridge:tenant:{tenant_id}:config",
}
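Since Valkey speaks the Redis protocol, a get-or-compute helper built on the key structure above might look roughly like the following; the client setup and JSON serialization are assumptions.
# Hypothetical get-or-compute helper for the calculation cache (client wiring is illustrative)
import json
import redis.asyncio as redis

client = redis.Redis(host="localhost", port=6379, decode_responses=True)

async def cached_calculation(tenant_id: str, formula_hash: str, compute):
    # Matches the CACHE_KEYS["calculation"] pattern above
    key = f"calcbridge:calc:{tenant_id}:{formula_hash}"
    cached = await client.get(key)
    if cached is not None:
        return json.loads(cached)
    result = await compute()
    # 5-minute TTL, matching the Calculation Cache row in the table above
    await client.set(key, json.dumps(result), ex=300)
    return result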
Worker Layer (Celery)¶
Celery handles asynchronous task processing.
Queue Configuration:
| Queue | Workers | Purpose |
|---|---|---|
| default | 4 | General tasks, exports |
| parse | 2 | Excel file parsing |
| calculate | 4 | Formula calculations |
| priority | 2 | High-priority tasks |
# Task routing configuration
CELERY_TASK_ROUTES = {
"src.workers.tasks.parse.*": {"queue": "parse"},
"src.workers.tasks.calculate.*": {"queue": "calculate"},
"src.workers.tasks.export.*": {"queue": "default"},
"src.workers.tasks.priority.*": {"queue": "priority"},
}
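With that routing in place, the API tier only needs to enqueue a task and hand its id back to the client as the job_id. The sketch below is illustrative; the task name and broker URL are assumptions.
# Hypothetical sketch of enqueueing a parse task from the upload path
from celery import Celery

celery_app = Celery("calcbridge", broker="redis://localhost:6379/0")
celery_app.conf.task_routes = {"src.workers.tasks.parse.*": {"queue": "parse"}}

@celery_app.task(name="src.workers.tasks.parse.parse_workbook")
def parse_workbook(workbook_id: str) -> None:
    ...  # parse the uploaded Excel file and persist sheet/cell data

def enqueue_parse(workbook_id: str) -> str:
    # .delay() routes the task to the "parse" queue via the routing table
    result = parse_workbook.delay(workbook_id)
    return result.id  # returned to the client as job_id with 202 Accepted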
Request/Response Flow¶
Typical API Request¶
sequenceDiagram
participant C as Client
participant LB as Load Balancer
participant API as FastAPI
participant MW as Middleware
participant SVC as Service
participant DB as PostgreSQL
participant CACHE as Valkey
C->>LB: POST /api/v1/calculations/evaluate
LB->>API: Forward request
API->>MW: Correlation ID
MW->>MW: Rate limit check
MW->>CACHE: Get rate limit counter
CACHE-->>MW: Counter value
alt Rate limit OK
MW->>MW: JWT validation
MW->>MW: Set tenant context
MW->>SVC: Process request
SVC->>DB: Query with RLS
DB-->>SVC: Results
SVC->>CACHE: Cache result
SVC-->>MW: Response
MW-->>API: Add security headers
API-->>LB: JSON response
LB-->>C: 200 OK
else Rate limited
MW-->>C: 429 Too Many Requests
end
Async Task Flow¶
sequenceDiagram
participant C as Client
participant API as FastAPI
participant Q as Celery Queue
participant W as Worker
participant DB as PostgreSQL
participant WS as WebSocket
C->>API: POST /api/v1/workbooks/upload
API->>DB: Create workbook record
API->>Q: Enqueue parse task
API-->>C: 202 Accepted (job_id)
Q->>W: Dequeue task
W->>W: Parse Excel file
W->>DB: Store parsed data
W->>WS: Notify completion
WS-->>C: Job complete event
C->>API: GET /api/v1/jobs/{job_id}
API->>DB: Get job status
API-->>C: 200 OK (completed)
Deployment Topology¶
Development Environment¶
flowchart LR
subgraph Local["Local Machine"]
DEV["Developer"]
VENV["Python venv"]
NODE["Node.js"]
end
subgraph Docker["Docker Compose"]
PG["PostgreSQL"]
VK["Valkey"]
CELERY["Celery"]
end
DEV --> VENV
DEV --> NODE
VENV --> Docker
Production Environment¶
flowchart TB
subgraph Internet
USERS["Users"]
end
subgraph CloudProvider["Cloud Provider"]
subgraph Edge
CDN["CloudFront/CDN"]
ALB["Application LB"]
end
subgraph Compute["ECS/Kubernetes"]
subgraph APICluster["API Cluster"]
API1["API Pod 1"]
API2["API Pod 2"]
APIN["API Pod N"]
end
subgraph WorkerCluster["Worker Cluster"]
W1["Worker Pod 1"]
W2["Worker Pod 2"]
end
end
subgraph Database["RDS"]
PG_PRIMARY["Primary"]
PG_REPLICA["Read Replica"]
end
subgraph Cache["ElastiCache"]
VALKEY["Valkey Cluster"]
end
subgraph Storage["S3"]
FILES["Excel Files"]
LOGS["Audit Logs"]
end
end
USERS --> CDN
USERS --> ALB
CDN --> ALB
ALB --> APICluster
APICluster --> Database
APICluster --> Cache
WorkerCluster --> Database
WorkerCluster --> Cache
WorkerCluster --> Storage
Scalability Considerations¶
Horizontal Scaling¶
| Component | Scaling Strategy |
|---|---|
| API | Add instances behind load balancer |
| Workers | Add workers per queue based on depth |
| Database | Read replicas for read-heavy workloads |
| Cache | Valkey cluster for sharding |
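Scaling workers by queue depth presumes the backlog per queue can be measured. Assuming the Celery broker is Valkey/Redis (an assumption, not stated above), each queue is a list keyed by its name, so a depth check can be a simple list-length query.
# Hypothetical queue-depth check to drive worker scaling decisions
import redis

broker = redis.Redis(host="localhost", port=6379)

def queue_depth(queue_name: str) -> int:
    # With the Redis/Valkey broker, pending Celery tasks sit in a list named after the queue
    return broker.llen(queue_name)

for queue in ("default", "parse", "calculate", "priority"):
    print(queue, queue_depth(queue))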
Vertical Scaling¶
| Component | Recommended Sizing |
|---|---|
| API | 2 vCPU, 4GB RAM per instance |
| Workers (calc) | 4 vCPU, 8GB RAM |
| Workers (parse) | 2 vCPU, 4GB RAM |
| Database | 4 vCPU, 16GB RAM (primary) |
Performance Targets¶
| Metric | Target | Monitoring |
|---|---|---|
| API P50 | <100ms | Prometheus histogram |
| API P95 | <200ms | Prometheus histogram |
| API P99 | <500ms | Prometheus histogram |
| Error Rate | <0.1% | Prometheus counter |
| Availability | 99.9% | Uptime monitoring |
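The latency targets are typically backed by a request-duration histogram. A minimal prometheus_client sketch follows; the metric name and bucket boundaries are assumptions.
# Hypothetical request-latency histogram backing the P50/P95/P99 targets
import time
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "API request latency",
    labelnames=("method", "route"),
    buckets=(0.05, 0.1, 0.2, 0.5, 1.0, 2.5),
)

def observe(method: str, route: str, started: float) -> None:
    # Record how long the request took, labelled by method and route
    REQUEST_LATENCY.labels(method=method, route=route).observe(time.time() - started)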
Health Checks¶
Liveness Probe¶
Readiness Probe¶
GET /health/ready
# Returns 200 if:
# - Database connection OK
# - Valkey connection OK
# - Celery broker connection OK
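A readiness handler matching those checks might be sketched as follows; the three check helpers are illustrative stubs, not the project's actual functions.
# Hypothetical readiness endpoint; the check helpers are illustrative stubs
from fastapi import APIRouter, Response, status

router = APIRouter()

async def check_database() -> bool:
    ...  # e.g. run SELECT 1 against the primary
    return True

async def check_cache() -> bool:
    ...  # e.g. PING Valkey
    return True

async def check_broker() -> bool:
    ...  # e.g. verify the Celery broker connection
    return True

@router.get("/health/ready")
async def readiness(response: Response) -> dict[str, bool]:
    checks = {
        "database": await check_database(),
        "cache": await check_cache(),
        "broker": await check_broker(),
    }
    if not all(checks.values()):
        # Any failing dependency marks the instance as not ready
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
    return checks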
Startup Probe¶
GET /health
# Returns 200 with system status:
{
"status": "healthy",
"version": "1.0.0",
"components": {
"database": "ok",
"cache": "ok",
"workers": "ok"
}
}
Resilience Patterns¶
Circuit Breaker¶
Prevents cascade failures when downstream services are unavailable.
from src.core.resilience import circuit_breaker
@circuit_breaker(
failure_threshold=5,
recovery_timeout=30,
expected_exceptions=(ConnectionError, TimeoutError)
)
async def call_external_service():
# External service call
pass
Retry with Backoff¶
Handles transient failures gracefully.
from celery import shared_task
@shared_task(
bind=True,
max_retries=3,
autoretry_for=(ConnectionError,),
retry_backoff=True, # Exponential backoff
retry_backoff_max=300,
)
def process_workbook(self, workbook_id: str):
pass
Graceful Degradation¶
Fallback behavior when services are unavailable.
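For example, a tenant-config lookup might serve from Valkey when it can and fall back to PostgreSQL when the cache is unreachable. The sketch below is illustrative; the cache and db objects are assumed interfaces, not the project's actual ones.
# Hypothetical fallback: serve from Valkey when possible, degrade to the database otherwise
import json
import logging

logger = logging.getLogger(__name__)

async def get_tenant_config(tenant_id: str, cache, db) -> dict:
    key = f"calcbridge:tenant:{tenant_id}:config"
    try:
        cached = await cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except ConnectionError:
        # Cache outage: log, skip caching, and serve directly from PostgreSQL
        logger.warning("Valkey unavailable, falling back to database for tenant %s", tenant_id)
    return await db.fetch_tenant_config(tenant_id)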