
Cloud Reliability & Recovery Engineer
AlphaSense
6 days ago
Remote, India
Senior / Mid Level
H1B Sponsor
Responsibilities
- Design and implement multi-region, multi-AZ AWS architectures that meet RTO/RPO targets.
- Engineer active-active and active-passive failover patterns using Route 53, Global Accelerator, and CloudFront.
- Build automated DR runbooks and playbooks using AWS Systems Manager Automation and Step Functions.
- Implement chaos engineering practices using AWS Fault Injection Service (FIS) to validate resiliency.
- Architect cross-region replication strategies for S3, DynamoDB Global Tables, RDS, and Aurora Global Database.
- Review containerized workloads running on Kubernetes for resilience.
- Administer AWS Backup across all services with policy-based automation.
- Design immutable backup vaults and cross-account/cross-region backup replication pipelines.
- Develop and automate data recovery testing procedures.
- Implement point-in-time recovery (PITR) for databases and storage.
- Maintain Business Continuity Plans (BCP) and Disaster Recovery (DR) strategies.
- Author and maintain Terraform/CloudFormation templates for BCP/DR infrastructure.
- Automate DR testing pipelines through CI/CD.
- Write scripts to orchestrate failover, failback, and health-check workflows.
- Build CloudWatch dashboards and alarms for availability and DR-readiness indicators.
- Participate in on-call rotations and lead DR incident response.
- Conduct regular BCP/DR tabletop exercises and full failover simulations.
- Ensure DR controls meet compliance requirements.
- Maintain current and accurate DR documentation.
Requirements
- 5+ years in cloud infrastructure, SRE, or IT disaster recovery engineering roles.
- 3+ years of hands-on AWS experience in production environments at scale.
- Proven delivery of multi-region DR architectures with defined and tested RTO/RPO targets.
- Expert-level proficiency with core AWS resilience services.
- Strong scripting skills: Python, Bash, or PowerShell for automation.
- Experience with Infrastructure as Code: Terraform and/or AWS CloudFormation.
- Solid understanding of networking fundamentals: VPC, Transit Gateway (TGW), Direct Connect, VPN, DNS failover.
- Excellent written and verbal communication skills.