Cohere

Site Reliability Engineer, Inference Infrastructure

4 months ago
Toronto, Canada +4 more
Senior
H1B Sponsor

Responsibilities

  • Build self-service systems for managing, deploying, and operating services.
  • Develop custom Kubernetes operators for language model deployments.
  • Automate environment observability and resilience.
  • Participate in an on-call rotation to ensure defined SLOs are met.
  • Build strong relationships with internal developers to influence the Infrastructure team’s roadmap.
  • Engage in knowledge sharing and active review processes to develop the team.

Requirements

  • 5+ years of engineering experience running production infrastructure at large scale.
  • Experience designing highly available distributed systems with Kubernetes and GPU workloads.
  • Proficient in Kubernetes development and production support.
  • Familiarity with GCP, Azure, AWS, OCI, and multi-cloud on-prem/hybrid serving.
  • Experience in complex Linux-based computing environments.
  • Strong collaboration and troubleshooting skills for mission-critical systems.
  • Adaptability to solve evolving technical challenges.
  • Familiarity with computational characteristics of accelerators like GPUs and TPUs.
  • Strong understanding of distributed systems.
  • Experience in Golang, C++, or other high-performance server languages.

Benefits

  • An open and inclusive culture and work environment.
  • Weekly lunch stipend, in-office lunches, and snacks.
  • Full health and dental benefits, including a mental health budget.
  • 100% Parental Leave top-up for up to 6 months.
  • Personal enrichment benefits for arts, culture, fitness, and workspace improvement.
  • Remote-flexible work options with offices in major cities.
  • 6 weeks of vacation (30 working days).

Tech Stack

AWS, Azure, C++, Go, Google Cloud Platform, Kubernetes, Linux