Senior Site Reliability Engineer, AI Research

6 months ago

Remote, Australia

Senior

H1B Sponsor

Responsibilities

Support and evolve the reliability of platforms used by the AI Research team.
Ensure production services meet expectations for availability, latency, and operational readiness.
Design infrastructure and operational patterns that prioritize iteration speed.
Work closely with researchers and engineers in a cross-functional setting.
Participate directly in team planning and execution.
Help researchers self-serve infrastructure safely and effectively.
Build and maintain Kubernetes-based services on GCP using infrastructure-as-code.
Own and improve CI/CD pipelines for services primarily written in Go.
Design and operate observability systems using tools such as Datadog.
Participate in an on-call rotation, responding to incidents.

Qualifications

Strong experience operating cloud-first infrastructure.
Hands-on experience running production services on Kubernetes.
Proficiency with infrastructure-as-code (Terraform) and CI/CD systems.
Experience supporting production services written in Go.
Solid grounding in service reliability, incident response, and operational best practices.
Comfort working in ambiguous environments.