Operational Resilience

This document specifies the requirements for CalcBridge's operational resilience capabilities, including error handling, recovery, and system reliability.


Overview

CalcBridge is designed to handle failures gracefully, ensuring that no data is lost and that the system degrades predictably under adverse conditions.


Requirements

Functional Requirements

ID     | Requirement                                           | Priority    | Status
OR-001 | Failed tasks shall be captured in a dead letter queue | Must Have   | Implemented
OR-002 | Circuit breakers shall prevent cascade failures       | Must Have   | Implemented
OR-003 | System shall support graceful degradation             | Must Have   | Implemented
OR-004 | Health checks shall verify all dependencies           | Must Have   | Implemented
OR-005 | Maintenance mode shall allow controlled degradation   | Should Have | Implemented
OR-006 | Feature flags shall allow runtime feature toggling    | Should Have | Implemented
OR-007 | DLQ entries shall be retryable                        | Must Have   | Implemented

Dead Letter Queue

Failed Celery tasks are automatically captured:

Feature    | Description
Capture    | All failed tasks stored with full context
Retry      | Individual or forced retry of failed tasks
Resolution | Mark as resolved with notes
Discard    | Remove unrecoverable entries
Statistics | Health monitoring with severity levels
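
The lifecycle in the table above (capture, retry, resolve, discard) can be sketched as a small in-memory store. This is a minimal illustration, not CalcBridge's implementation: the names `DLQEntry` and `DeadLetterQueue`, the field set, and the status strings are all hypothetical, and "forced retry" is not modelled because the source does not define it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict


@dataclass
class DLQEntry:
    """One failed task plus the context needed to retry it (hypothetical shape)."""
    entry_id: int
    task_name: str
    args: tuple
    kwargs: dict
    error: str
    status: str = "pending"  # assumed values: "pending" or "resolved"
    notes: str = ""
    retry_count: int = 0
    failed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class DeadLetterQueue:
    """In-memory stand-in for a persistent dead letter queue."""

    def __init__(self) -> None:
        self._entries: Dict[int, DLQEntry] = {}
        self._next_id = 1

    def capture(self, task_name: str, args: tuple, kwargs: dict, exc: Exception) -> DLQEntry:
        # Store the failure together with the original call arguments.
        entry = DLQEntry(self._next_id, task_name, tuple(args), dict(kwargs), repr(exc))
        self._entries[entry.entry_id] = entry
        self._next_id += 1
        return entry

    def retry(self, entry_id: int, runner: Callable[..., Any]) -> bool:
        # Re-run the task; resolve on success, stay pending on failure.
        entry = self._entries[entry_id]
        entry.retry_count += 1
        try:
            runner(*entry.args, **entry.kwargs)
        except Exception as exc:
            entry.error = repr(exc)
            return False
        self.resolve(entry_id, "resolved by retry")
        return True

    def resolve(self, entry_id: int, notes: str) -> None:
        entry = self._entries[entry_id]
        entry.status = "resolved"
        entry.notes = notes

    def discard(self, entry_id: int) -> None:
        # Remove an unrecoverable entry entirely.
        del self._entries[entry_id]

    def pending_count(self) -> int:
        return sum(1 for e in self._entries.values() if e.status == "pending")
```

In a Celery deployment, `capture` would plausibly be called from a task failure hook; that wiring is left out here because the source does not describe it.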

DLQ Health Levels

Level    | Criteria
Healthy  | 0 pending entries
Warning  | 1-10 pending entries
Critical | >10 pending entries
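
The thresholds above map directly onto a small classification function. The function name and the lowercase return strings are illustrative assumptions; only the boundaries come from the table.

```python
def dlq_health(pending: int) -> str:
    """Classify DLQ health from the number of pending entries."""
    if pending < 0:
        raise ValueError("pending count cannot be negative")
    if pending == 0:
        return "healthy"
    if pending <= 10:
        return "warning"
    return "critical"
```

Note that exactly 10 pending entries is still a warning; the critical level starts at 11.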

Circuit Breaker

Prevents cascade failures when downstream services are unavailable:

State     | Description
Closed    | Normal operation, requests pass through
Open      | Service unavailable, requests fail fast
Half-Open | Testing if service has recovered

Configuration:

  • Failure threshold: 5 failures trigger open state
  • Recovery timeout: 30 seconds before half-open
  • Expected exceptions: ConnectionError, TimeoutError
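
A minimal single-threaded sketch of the three states, using the configuration values listed above as defaults. The class and exception names are hypothetical. Two behaviours are assumptions the source does not specify: failures are counted consecutively (any success resets the count), and a failed trial call in the half-open state reopens the circuit immediately.

```python
import time
from typing import Any, Callable, Optional, Tuple, Type


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        expected_exceptions: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exceptions = expected_exceptions
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        state = self.state
        if state == "open":
            # Fail fast without touching the downstream service.
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except self.expected_exceptions:
            self._failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Exceptions outside `expected_exceptions` propagate without affecting the breaker, so programming errors are not mistaken for an unavailable dependency. The injectable `clock` exists only to make the timeout testable.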

Graceful Degradation

When services are impaired, CalcBridge:

  • Falls back to cached data when the database is slow
  • Uses synchronous task execution when Celery is unavailable
  • Returns partial results rather than errors when possible
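
The first two fallbacks above can be sketched without any framework. Everything here is illustrative: `ReadThroughCache` and `submit_or_run_inline` are hypothetical names, a "slow" database is assumed to surface as a `TimeoutError` from the query, and an unavailable broker is assumed to surface as a `ConnectionError` from the enqueue call. A real Celery setup may raise a different exception type when the broker is unreachable, so the `except` clause would need adjusting.

```python
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """Serve the last known value when the primary query fails or times out."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def get(self, key: str, query: Callable[[], Any]) -> Tuple[Any, bool]:
        """Return (value, is_stale)."""
        try:
            value = query()
        except (TimeoutError, ConnectionError):
            if key in self._store:
                return self._store[key], True  # degraded: cached copy
            raise  # nothing cached, so the error must surface
        self._store[key] = value
        return value, False


def submit_or_run_inline(
    enqueue: Callable[..., Any],
    run_inline: Callable[..., Any],
    *args: Any,
    **kwargs: Any,
) -> Tuple[str, Any]:
    """Queue the task if possible; otherwise execute it synchronously."""
    try:
        return "queued", enqueue(*args, **kwargs)
    except ConnectionError:
        return "inline", run_inline(*args, **kwargs)
```

Returning the `is_stale` flag and the `"queued"`/`"inline"` marker lets callers tell users when they are seeing degraded behaviour rather than hiding it.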

Health Monitoring

Probe     | Endpoint      | Purpose
Liveness  | /health/live  | Process is running
Readiness | /health/ready | Dependencies healthy
Deep      | /health/deep  | All components checked
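
One way the readiness and deep probes could aggregate their checks is shown below, framework-free. This is a sketch under assumptions: the function names, the response shape, and the use of HTTP 200/503 are not stated in the source, and which checks belong to readiness versus deep is left to the caller.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], None]  # a check passes by returning and fails by raising


def liveness() -> Tuple[int, dict]:
    # If this code runs at all, the process is alive.
    return 200, {"status": "alive"}


def run_checks(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Run every check and report per-component results."""
    components: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            components[name] = "ok"
        except Exception as exc:
            components[name] = f"error: {exc}"
    healthy = all(result == "ok" for result in components.values())
    body = {"status": "healthy" if healthy else "unhealthy", "components": components}
    return (200 if healthy else 503), body
```

Every check runs even after one fails, so a single response shows all impaired components rather than only the first.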

Feature Flags

Runtime toggleable features:

  • Enable/disable via API
  • Admin-only access
  • Persisted state
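
The three properties above can be illustrated with a small file-backed store. The `FeatureFlags` class, the JSON file as the persistence layer, and the `actor_is_admin` parameter are assumptions made for this sketch; the source does not say how CalcBridge stores flags or checks roles.

```python
import json
from pathlib import Path
from typing import Dict


class FeatureFlags:
    """Feature flags persisted to a JSON file (hypothetical storage choice)."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._flags: Dict[str, bool] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def is_enabled(self, name: str, default: bool = False) -> bool:
        return bool(self._flags.get(name, default))

    def set(self, name: str, enabled: bool, *, actor_is_admin: bool) -> None:
        # Only administrators may toggle flags.
        if not actor_is_admin:
            raise PermissionError("only administrators may change feature flags")
        self._flags[name] = bool(enabled)
        # Write through so the state survives a restart.
        self._path.write_text(json.dumps(self._flags, indent=2))
```

An API handler would call `set` after authenticating the caller, and application code would call `is_enabled` at the point where the feature branches.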

User Stories

US-OR-001: Manage Failed Tasks

As a System Administrator
I want to review and retry failed background tasks
So that no work is silently lost

US-OR-002: Monitor System Health

As a System Administrator
I want to see comprehensive health status
So that I can proactively address issues