Responsibilities
- Lead improvements across the full service lifecycle, from inception to refinement.
- Drive sustainable scaling of systems through automation.
- Establish low-noise incident response rotations and conduct blameless postmortems.
- Write and review code, develop documentation, and debug complex problems.
- Collaborate with software engineers to maintain and optimize SLOs.
- Participate in a 24x7 on-call rotation.
Requirements
- Bachelor's degree in Computer Science or a related technical field.
- Proficiency in at least one modern programming language, preferably Golang.
- A systematic approach to problem solving and effective communication skills.
- 10+ years of industry experience designing and supporting large-scale systems (preferred).
- Experience with modern observability tools and production monitoring practices (preferred).
- Experience with capacity and load testing of distributed applications (preferred).
- Experience with containers and orchestration systems like Kubernetes (preferred).
- Experience managing scalable and fault-tolerant Linux infrastructure (preferred).
- Hands-on experience with public cloud providers (AWS, Azure, or GCP) (preferred).