SambaNova Systems

Cloud Site Reliability Engineer

about 3 hours ago
San Jose, CA, USA
Mid Level / Senior
H1B Sponsor

Responsibilities

  • Take ownership of the production inferencing service, ensuring its availability and performance.
  • Participate in a shared on-call rotation for 24/7 service support.
  • Develop and maintain monitoring and alerting systems using tools like Prometheus and Grafana.
  • Identify and eliminate performance bottlenecks and implement auto-scaling policies.
  • Manage cloud infrastructure using Infrastructure as Code tools like Terraform.
  • Champion CI/CD automation for seamless deployment of model updates.
  • Forecast infrastructure needs and manage cloud costs effectively.
  • Define and report on Service Level Objectives and Indicators.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 3 to 5 or more years of experience in a Site Reliability Engineer or DevOps role.
  • Strong programming skills in Python, Go, or Java.
  • Experience with containerization and orchestration technologies like Docker and Kubernetes.
  • Deep understanding of monitoring and observability tools.
  • Solid experience with Infrastructure as Code practices.
  • Familiarity with CI/CD principles and tools.
  • Excellent problem-solving skills for complex distributed systems.

Benefits

  • Competitive compensation including equity and excellent benefits.
  • 95% premium coverage for employee medical insurance.
  • 77% premium coverage for dependents.
  • Health Savings Account (HSA) with employer contribution.
  • Access to well-being benefits including gym memberships and counseling services.

Tech Stack

AWS, Azure, Datadog, Docker, GitHub Actions, Go, Google Cloud Platform, Grafana, Java, Jenkins, Kubernetes, Prometheus, Python, Redis, Terraform

Categories

AI & ML, DevOps