Business continuity
Beakr maintains business continuity through redundant infrastructure, automated recovery, and defined incident response procedures.
Availability
Beakr targets 99.9% uptime for the production platform. Availability is supported by:
- Multi-AZ deployment across two availability zones in AWS us-east-1.
- Auto-scaling compute (Amazon ECS) to handle load spikes.
- Multi-AZ database (RDS) with automatic failover.
- Multi-AZ cache (ElastiCache Redis) with replication.
- Deployment circuit breaker with automatic rollback of failed deployments.
- Health checks on all services with automatic replacement of unhealthy instances.
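The deployment circuit breaker behavior can be sketched as follows. This is a minimal illustration, not Beakr's actual implementation; in ECS this behavior is provided natively by the deployment circuit breaker setting with rollback enabled, and the failure threshold of 3 here is an assumption for the example.

```python
class DeploymentCircuitBreaker:
    """Illustrative sketch: trip and roll back after repeated task failures.

    ECS provides this natively; the threshold of 3 consecutive
    failures is an assumption chosen for demonstration.
    """

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.rolled_back = False

    def record_task_launch(self, healthy: bool) -> None:
        if self.rolled_back:
            return  # deployment already reverted to the last stable version
        if healthy:
            self.consecutive_failures = 0  # counter resets on success
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.rolled_back = True  # trip: revert to last stable release

breaker = DeploymentCircuitBreaker()
for healthy in (False, False, False):
    breaker.record_task_launch(healthy)
# breaker.rolled_back is now True: three consecutive failures tripped it
```

A single healthy task launch resets the counter, so transient one-off failures do not trigger a rollback.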
Disaster recovery
| Scenario | RPO | RTO | Mechanism |
|---|---|---|---|
| Single AZ failure | 0 (synchronous replication) | < 5 minutes | Multi-AZ automatic failover (RDS, ElastiCache) |
| Database corruption | < 5 minutes | 30 -- 60 minutes | Point-in-time recovery (PITR) from continuous transaction logs |
| Accidental data deletion | 0 | 1 -- 2 hours | RDS snapshot restore or PITR. S3 versioning (30-day retention). |
| Full region failure | Up to 24 hours | 4 -- 8 hours | Snapshot restore to alternate region. Terraform-based infrastructure rebuild. |
Backup schedule
- Database. Automated daily snapshots retained for 30 days (production). Continuous transaction log capture for point-in-time recovery.
- File storage. S3 versioning enabled with 30-day noncurrent version retention.
- Backup encryption. All backups encrypted with AES-256.
- Backup testing. Quarterly restore testing performed and documented.
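The 30-day retention rule above amounts to pruning any snapshot older than the retention window. A minimal sketch of that calculation (RDS handles this automatically; the helper below is illustrative only):

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # production retention from the schedule above

def snapshots_to_prune(snapshot_dates, today):
    """Illustrative helper: return snapshots older than the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [d for d in snapshot_dates if d < cutoff]

today = date(2024, 7, 1)
dates = [today - timedelta(days=n) for n in (1, 15, 31, 45)]
expired = snapshots_to_prune(dates, today)  # the 31- and 45-day-old snapshots
```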
Incident response
Beakr follows a structured incident response process for security events and service disruptions.
Severity classification
| Severity | Definition | Response time | Notification |
|---|---|---|---|
| P1 -- Critical | Service outage, data breach, or active security incident affecting multiple customers. | Within 1 hour | Affected customers notified within 4 hours. Status page updated. |
| P2 -- High | Significant degradation, security vulnerability with active exploit risk, or single-customer data incident. | Within 4 hours | Affected customers notified within 24 hours. |
| P3 -- Medium | Minor degradation, potential vulnerability without active exploit, or non-critical system issue. | Within 1 business day | Included in next scheduled communication. |
| P4 -- Low | Cosmetic issues, informational security findings, or minor configuration drift. | Within 5 business days | Resolved in normal operations. |
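The response and notification windows above translate directly into deadlines from the detection time. A hypothetical sketch (the policy table and function names are illustrative; note it approximates business days for P3/P4 as calendar days, which is a simplification):

```python
from datetime import datetime, timedelta

# Windows from the severity table above. P3/P4 business days are
# approximated as calendar days here, a simplification for illustration.
SEVERITY_POLICY = {
    "P1": {"response": timedelta(hours=1), "notify": timedelta(hours=4)},
    "P2": {"response": timedelta(hours=4), "notify": timedelta(hours=24)},
    "P3": {"response": timedelta(days=1),  "notify": None},
    "P4": {"response": timedelta(days=5),  "notify": None},
}

def deadlines(severity: str, detected_at: datetime) -> dict:
    """Hypothetical helper: compute response/notification deadlines."""
    policy = SEVERITY_POLICY[severity]
    return {
        "respond_by": detected_at + policy["response"],
        "notify_by": (detected_at + policy["notify"]) if policy["notify"] else None,
    }

deadlines("P1", datetime(2024, 1, 1, 12, 0))
# respond_by 13:00 (1 hour), notify_by 16:00 (4 hours)
```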
Incident response process
- Detection. Automated alerting via GuardDuty, CloudTrail anomaly detection, CloudWatch alarms, and WAF. Alerts are delivered to the on-call engineer via Slack and SNS.
- Triage. On-call engineer assesses severity and scope. Incident is classified per the severity table above.
- Containment. Immediate actions to limit impact -- isolate affected systems, revoke compromised credentials, block malicious IPs.
- Resolution. Root cause identified and remediated. Service restored.
- Notification. Affected customers notified per severity timeline. BAA customers receive breach notifications per contractual terms.
- Post-incident review. Root cause analysis documented. Preventive measures implemented. Lessons learned shared with the team.
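The steps above form an ordered lifecycle: each incident moves through every stage, and none may be skipped. A hypothetical state-machine sketch of that ordering (not Beakr's tooling):

```python
# Stage order from the incident response process above.
STAGES = [
    "detection", "triage", "containment",
    "resolution", "notification", "post_incident_review",
]

class Incident:
    """Illustrative sketch: enforce the ordered incident lifecycle."""

    def __init__(self):
        self._index = 0  # every incident starts at detection

    @property
    def stage(self) -> str:
        return STAGES[self._index]

    def advance(self) -> str:
        """Move to the next stage; stages cannot be skipped."""
        if self._index >= len(STAGES) - 1:
            raise ValueError("incident already at post-incident review")
        self._index += 1
        return self.stage
```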
Change management
All changes to the Beakr platform follow a controlled process:
- Code changes require pull request review before merge.
- Infrastructure changes are defined in Terraform, reviewed in pull requests, and applied through CI/CD.
- Database migrations run as separate tasks before application deployment.
- Deployments use a circuit breaker with automatic rollback on failure.
- Production deployments are logged in CloudTrail.
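The controls above act as a gate: a change ships only after its pull request is reviewed and its migrations have run. A minimal hypothetical sketch of such a gate (field names are illustrative, not Beakr's pipeline schema):

```python
def may_deploy(change: dict) -> bool:
    """Illustrative pre-deployment gate mirroring the controls above."""
    return (
        change.get("pr_reviewed", False)          # PR review before merge
        and change.get("migrations_completed", False)  # migrations run first
    )

may_deploy({"pr_reviewed": True, "migrations_completed": True})   # allowed
may_deploy({"pr_reviewed": True, "migrations_completed": False})  # blocked
```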
Status and incident history
For real-time platform status and incident history, visit our Trust Center.