Responsibilities
- Shape engineering practices for reliability and operational excellence to maintain high system uptime.
- Drive performance testing, tuning, and capacity planning to ensure effective system scaling.
- Identify and automate systemic manual processes to improve efficiency.
- Debug and resolve reliability and performance issues across services and codebases.
- Embed security and compliance into engineering platforms and delivery pipelines.
- Design and implement observability solutions for actionable insights into system health.
- Participate in incident response and postmortem reviews to drive systemic improvements.
- Help teams optimize cloud usage in line with business objectives and budget constraints.
Requirements
- Strong background in software engineering with experience in SRE or platform practices.
- Experience owning or operating systems in production, including incident response.
- Ability to take ownership of complex systems and improve them over time.
- Proficiency in navigating and understanding unfamiliar codebases.
- Experience debugging complex distributed systems across service boundaries.
- Strong communication skills for effective collaboration with technical and non-technical stakeholders.
- Passion for continuous improvement and staying current with emerging trends.
- Solid experience with Infrastructure as Code tools like Terraform.
- Experience with containerized workloads on cloud-native platforms.
- Familiarity with AWS Well-Architected Framework or equivalent standards.
- Experience designing observability strategies for metrics, logs, and traces.
- Strong programming skills in languages like Go or Python.