Responsibilities
- Design and implement comprehensive monitoring, logging, and tracing solutions.
- Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Lead incident management and response during high-impact incidents.
- Architect and improve automation to eliminate operational toil.
- Optimize performance of large-scale cloud deployments, focusing on Kubernetes.
- Debug and harden distributed systems to enhance robustness.
- Provide guidance on reliability, scalability, and operational integrity.
- Educate and mentor the engineering team on reliability best practices.
- Write high-quality, well-tested code in Python or Go.
Requirements
- 8-10 years of experience in Site Reliability Engineering or similar roles.
- Strong programming skills in Python or Go.
- Deep understanding of distributed systems and service-oriented architecture.
- Experience with container orchestration platforms, specifically Kubernetes.
- Proven track record in designing and maintaining monitoring solutions.
- Strong incident management skills with experience in complex systems.
- Familiarity with infrastructure as code tools like Terraform or Pulumi.
- Excellent communication skills for explaining complex concepts.
- Strong interpersonal skills for mentoring engineers.
Benefits
- Competitive Salary & Equity.
- 401(k) Program with a 4% match.
- Health, Dental, Vision and Life Insurance.
- Short-Term and Long-Term Disability.
- Paid Parental, Medical, and Caregiver Leave.
- Commuter Benefits.
- Monthly Wellness Stipend.
- Autonomous Work Environment.
- In-Office Set-Up Reimbursement.
- Flexible Time Off (FTO) + Holidays.
- Quarterly Team Gatherings.
- In-Office Amenities.
