Responsibilities
- Build and operate foundational, security-critical services with a focus on availability and fault tolerance.
- Automate infrastructure to reduce operational toil and improve system reliability.
- Design and implement systems using SRE best practices.
- Define and refine SLIs, SLOs, and error budgets.
- Enhance observability, alerting, and incident response.
- Participate in on-call rotations with a focus on sustainable operations.
- Conduct quantitative analysis to understand system behavior and capacity constraints.
- Identify systemic risks and drive long-term solutions.
- Collaborate with product, platform, and security engineers.
- Mentor and pair with other engineers to improve operational maturity.
Requirements
- Proficiency in writing production-quality code (e.g., Python, Go).
- Experience with distributed, cloud-native systems and an understanding of their failure modes.
- Familiarity with containerized workloads and platforms (e.g., Kubernetes).
- Comfort participating in on-call rotations and diagnosing production issues.
- Experience designing and operating observability systems.
- Knowledge of SRE concepts such as SLIs, SLOs, and error budgets.
- Hands-on experience with infrastructure as code (e.g., Terraform).
- Experience with capacity planning and performance analysis.
- Ability to contribute to post-incident reviews.
- Interest in experimenting with AI tools and workflows.