Responsibilities
- Design, implement, and maintain observability solutions for datacentre infrastructure.
- Develop and maintain operational and reliability components of a large-scale observability and telemetry collection platform.
- Participate in the entire lifecycle of services from inception to deployment and refinement.
- Analyze and optimize the performance of datacentre systems and applications.
- Implement SRE best practices to improve reliability and scalability.
- Develop automation scripts for infrastructure provisioning and management.
- Conduct root cause analysis and post-mortem reviews.
- Collaborate with engineering teams to meet observability and reliability requirements.
- Ensure compliance with security policies and industry standards.
- Provide support for observability- and reliability-related issues.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- 8–12 years of progressive software engineering experience with a focus on distributed systems.
- Strong proficiency in Go or Python with a deep understanding of networked systems.
- Expert-level knowledge of Kubernetes internals and containerization ecosystems.
- Proven experience with load balancing and service mesh at scale.
- A strong ownership mindset with a track record of maintaining high-availability systems.