Senior Site Reliability Engineer - Remote

3 months ago

Remote, Australia

Senior / Staff+

H1B Sponsor

Responsibilities

Collaborate with engineering teams to design and implement scalable, secure systems.
Establish and manage service level objectives (SLOs) and service level agreements (SLAs).
Ensure infrastructure components have monitoring and alerting for incident detection.
Enhance incident response processes and conduct post-mortem analysis.
Continuously improve the reliability and performance of ClickHouse services.
Plan and drive chaos engineering initiatives across engineering teams.
Manage on-call processes for performance and reliability issues.

Requirements

Bachelor’s or Master’s degree in Computer Science or a related field.
At least 8 years of experience in Site Reliability Engineering or a related field.
Hands-on experience with Go and/or Python.
Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud.
Excellent understanding of distributed databases and SQL, particularly ClickHouse.
Experience with container orchestration tools like Kubernetes or Docker Swarm.
Strong experience with automation and configuration management tools like Ansible, Terraform, or Puppet.
Strong problem-solving skills and production debugging capabilities.
Passionate about efficiency, availability, scalability, and data governance.
Ability to thrive in a fast-paced environment and partner with the business.
High level of responsibility, ownership, and accountability.
Excellent communication and interpersonal skills.