New York, NY, USA or Remote, United States
Staff+
Base Salary
$180k - $230k/yr
Responsibilities
- Act as the technical leader for reliability across one or more domains.
- Drive reliability strategy by defining SLOs, SLIs, and reliability KPIs.
- Lead incident response maturity and improve incident command practices.
- Architect and implement automation to reduce toil and risk.
- Advance GitOps delivery practices using Argo CD.
- Scale infrastructure management with Crossplane and Terraform.
- Lead operational readiness and reliability reviews for new features.
- Improve performance and cost efficiency through capacity planning.
- Champion infrastructure security best practices for PHI environments.
- Mentor Staff and Senior engineers to raise reliability standards.
Requirements
- 8+ years of experience in SRE, platform engineering, or related roles.
- Demonstrated principal-level impact in leading cross-team initiatives.
- Expertise in Kubernetes operations and troubleshooting.
- Strong GitOps experience with Argo CD and Argo Workflows.
- Experience with Crossplane and Terraform for infrastructure orchestration.
- Deep AWS experience and understanding of reliability in cloud systems.
- Proficiency in Python for automation and tooling.
- Strong incident management and on-call leadership experience.
- Excellent communication skills for translating technical risks.
Benefits
- Be part of a mission-driven company transforming the healthcare industry.
- Flexible, remote-friendly work environment.
- Employee-driven programs for personal and professional development.
- Join a diverse and purpose-driven community at Arcadia.
Tech Stack
Apache Cassandra, Apache Spark, Argo CD, AWS, Kubernetes, Python, Terraform
Categories
DevOps, Security