Responsibilities
- Design, build, and operate large-scale cloud infrastructure and production services.
- Participate in an on-call rotation supporting highly available customer-facing systems.
- Lead incident response efforts and drive post-incident reviews focused on systemic improvements.
- Define, measure, and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
- Partner with engineering teams to improve service availability, scalability, performance, and resilience.
- Continuously improve observability through metrics, logging, tracing, dashboards, and alerting.
- Develop software, automation, and infrastructure using Go, Python, Terraform, and related technologies.
- Eliminate operational toil through automation, tooling, and platform engineering.
- Improve deployment safety and operational workflows through CI/CD and GitOps practices.
- Collaborate on modernizing existing workloads and aligning them with evolving platform capabilities.
- Build self-service platforms, operational guardrails, and automation that improve developer velocity while maintaining reliability and security.
- Lead complex reliability initiatives spanning multiple engineering teams.
- Guide engineers in adopting operational best practices and reliability engineering principles.
- Mentor engineers through technical collaboration, design reviews, incident analysis, and knowledge sharing.
- Influence architecture and operational decisions through data-driven recommendations and engineering expertise.
- Drive projects from conception through production rollout and long-term operational ownership.
- Explore and apply AI-assisted engineering techniques to improve operational efficiency, incident response, troubleshooting, and automation.
Requirements
- Strong experience operating large-scale production services in AWS and/or GCP.
- Deep expertise with Kubernetes in production environments.
- Experience troubleshooting Kubernetes networking, storage, scheduling, scaling, and workload lifecycle issues.
- Extensive experience with Infrastructure as Code technologies such as Terraform and Helm.
- Strong software engineering skills in Go and/or Python.
- Experience building automation and internal engineering platforms.
- Experience operating and troubleshooting distributed data platforms such as PostgreSQL, Redis, OpenSearch, MySQL, or similar technologies.
- Strong understanding of cloud networking fundamentals including DNS, load balancing, ingress, TLS, service networking, and traffic management.
- Experience with observability platforms, monitoring strategies, and production telemetry.
- Experience with or strong interest in AI-assisted engineering and operational automation.
- Strong expertise in operating customer-facing production systems.
- Experience leading incident response and driving operational improvements.
- Deep understanding of reliability engineering concepts including SLIs, SLOs, error budgets, and capacity planning.
- Strong understanding of CI/CD pipelines, deployment strategies, and automation-first operational practices.
- Proven ability to balance reliability, scalability, security, and engineering velocity.