Operational Resilience
This document specifies the requirements for CalcBridge's operational resilience capabilities, including error handling, recovery, and system reliability.
Overview
CalcBridge is designed to handle failures gracefully: failed background work is captured for review rather than silently dropped, and the system degrades predictably under adverse conditions.
Requirements
Functional Requirements
| ID | Requirement | Priority | Status |
|---|---|---|---|
| OR-001 | Failed tasks shall be captured in a dead letter queue | Must Have | Implemented |
| OR-002 | Circuit breakers shall prevent cascade failures | Must Have | Implemented |
| OR-003 | System shall support graceful degradation | Must Have | Implemented |
| OR-004 | Health checks shall verify all dependencies | Must Have | Implemented |
| OR-005 | Maintenance mode shall allow controlled degradation | Should Have | Implemented |
| OR-006 | Feature flags shall allow runtime feature toggling | Should Have | Implemented |
| OR-007 | DLQ entries shall be retryable | Must Have | Implemented |
Dead Letter Queue
Failed Celery tasks are automatically captured:
| Feature | Description |
|---|---|
| Capture | All failed tasks stored with full context |
| Retry | Individual or forced retry of failed tasks |
| Resolution | Mark as resolved with notes |
| Discard | Remove unrecoverable entries |
| Statistics | Health monitoring with severity levels |
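
The entry lifecycle in the table above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `DLQEntry`, its fields, the status strings, and the helper functions are hypothetical names chosen for illustration, not CalcBridge's actual schema or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class DLQEntry:
    """Hypothetical record of one failed task, kept with enough context to retry it."""
    task_name: str
    args: Tuple[Any, ...]
    kwargs: Dict[str, Any]
    error: str
    status: str = "pending"  # assumed states: pending | resolved | discarded
    retries: int = 0
    notes: Optional[str] = None
    failed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def retry_entry(entry: DLQEntry, task: Callable[..., Any]) -> bool:
    """Re-run the task with its captured arguments; resolve the entry on success."""
    entry.retries += 1
    try:
        task(*entry.args, **entry.kwargs)
    except Exception as exc:
        entry.error = repr(exc)  # keep the latest failure for the next review
        return False
    resolve_entry(entry, "resolved by retry")
    return True


def resolve_entry(entry: DLQEntry, notes: str) -> None:
    """Mark an entry as handled and record why."""
    entry.status = "resolved"
    entry.notes = notes


def discard_entry(entry: DLQEntry) -> None:
    """Mark an entry as unrecoverable so it no longer counts as pending."""
    entry.status = "discarded"
```

Keeping the original arguments on the entry is what makes a later retry possible without asking the caller to resubmit the work.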
DLQ Health Levels
| Level | Criteria |
|---|---|
| Healthy | 0 pending entries |
| Warning | 1-10 pending entries |
| Critical | >10 pending entries |
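
The thresholds above map directly onto a small classification function. The sketch below assumes hypothetical names (`DLQHealth`, `classify_dlq_health`); only the boundaries come from the table.

```python
from enum import Enum


class DLQHealth(Enum):
    HEALTHY = "healthy"
    WARNING = "warning"
    CRITICAL = "critical"


def classify_dlq_health(pending_entries: int) -> DLQHealth:
    """Map a pending-entry count to a health level using the documented thresholds."""
    if pending_entries < 0:
        raise ValueError("pending_entries must be non-negative")
    if pending_entries == 0:
        return DLQHealth.HEALTHY
    if pending_entries <= 10:
        return DLQHealth.WARNING
    return DLQHealth.CRITICAL
```

For example, `classify_dlq_health(10)` yields `DLQHealth.WARNING`, while `classify_dlq_health(11)` yields `DLQHealth.CRITICAL`.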
Circuit Breaker
Prevents cascade failures when downstream services are unavailable:
| State | Description |
|---|---|
| Closed | Normal operation, requests pass through |
| Open | Service unavailable, requests fail fast |
| Half-Open | Testing if service has recovered |
Configuration:
- Failure threshold: 5 failures trigger open state
- Recovery timeout: 30 seconds before half-open
- Expected exceptions: ConnectionError, TimeoutError
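
A minimal sketch of the three states and the configuration values listed above. It is not CalcBridge's implementation: the `CircuitBreaker` and `CircuitOpenError` names, the injectable `clock` parameter, and the choice to count consecutive failures (resetting on any success) are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Optional, Tuple, Type


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        expected_exceptions: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
        clock: Callable[[], float] = time.monotonic,  # injectable so tests can control time
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exceptions = expected_exceptions
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except self.expected_exceptions:
            self._failures += 1
            # A failed trial call in half-open, or reaching the threshold
            # while closed, opens the circuit and restarts the timeout.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Only the expected exception types count toward the threshold; any other exception propagates without changing the breaker's state.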
Graceful Degradation
When services are impaired, CalcBridge:
- Falls back to cached data when the database is slow
- Uses synchronous task execution when Celery is unavailable
- Returns partial results rather than errors when possible
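
The first two fallbacks can be sketched with plain callables. Everything here is hypothetical: the function names, the use of `TimeoutError` to signal a slow database, and the use of `ConnectionError` to signal an unreachable task queue are stand-ins, and a real Celery deployment may raise different exception types.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


def read_with_cache_fallback(
    key: Hashable,
    load_from_db: Callable[[Hashable], Any],
    cache: Dict[Hashable, Any],
) -> Tuple[Any, bool]:
    """Return (value, degraded); degraded is True when the value came from the cache."""
    try:
        value = load_from_db(key)
    except TimeoutError:
        if key in cache:
            return cache[key], True
        raise  # nothing cached, so surface the original error
    cache[key] = value  # refresh the cache on every successful read
    return value, False


def dispatch(
    task: Callable[..., Any],
    enqueue: Callable[..., None],
    *args: Any,
) -> Tuple[str, Any]:
    """Queue the task if possible; otherwise run it synchronously in this process."""
    try:
        enqueue(task, *args)
    except ConnectionError:
        return "sync", task(*args)
    return "queued", None
```

Returning an explicit `degraded` flag or mode string lets callers tell users that a response may be stale or was computed inline.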
Health Monitoring
| Probe | Endpoint | Purpose |
|---|---|---|
| Liveness | /health/live | Process is running |
| Readiness | /health/ready | Dependencies healthy |
| Deep | /health/deep | All components checked |
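
As an illustration of how a readiness-style probe might aggregate dependency checks, consider the sketch below. The function name, the check names, the response shape, and the 200/503 status codes are assumptions, not the documented behavior of the endpoints above.

```python
from typing import Callable, Dict, Tuple


def run_readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every dependency check and return (http_status, response_body)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A check that raises is reported as unhealthy, not propagated.
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "ready" if healthy else "not_ready", "checks": results}
    return (200 if healthy else 503), body
```

A liveness probe, by contrast, would typically skip the dependency checks entirely and only confirm that the process can respond.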
Feature Flags
Features can be toggled at runtime:
- Enable/disable via API
- Admin-only access
- Persisted state
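
The three properties above (runtime toggling, admin-only access, persisted state) can be sketched with a small file-backed store. The `FeatureFlagStore` class, the JSON file backend, and the `is_admin` parameter are hypothetical; the actual persistence layer and authorization mechanism are not specified here.

```python
import json
from pathlib import Path
from typing import Dict


class FeatureFlagStore:
    """Hypothetical flag store that persists state to a JSON file."""

    def __init__(self, path: Path) -> None:
        self._path = path
        # Load previously persisted flags, if any, so state survives restarts.
        self._flags: Dict[str, bool] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def is_enabled(self, name: str, default: bool = False) -> bool:
        return bool(self._flags.get(name, default))

    def set_flag(self, name: str, enabled: bool, *, is_admin: bool) -> None:
        if not is_admin:
            raise PermissionError("only administrators may change feature flags")
        self._flags[name] = enabled
        self._path.write_text(json.dumps(self._flags, sort_keys=True))
```

Unknown flags fall back to `default`, so code paths guarded by a new flag stay off until an administrator enables them.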
User Stories
US-OR-001: Manage Failed Tasks
As a System Administrator, I want to review and retry failed background tasks, so that no work is silently lost.
US-OR-002: Monitor System Health
As a System Administrator, I want to see comprehensive health status, so that I can proactively address issues.
Related Documentation
- Health API - Health endpoint documentation
- DLQ API - Dead letter queue endpoints
- Metrics API - Prometheus metrics
- Role Guide: System Admin - Admin workflow