Cloud Site Reliability Engineer
SambaNova Systems
about 3 hours ago
San Jose, CA, USA
Mid Level / Senior
H1B Sponsor
Responsibilities
- Take ownership of the production inferencing service, ensuring its availability and performance.
- Participate in a shared on-call rotation for 24/7 service support.
- Develop and maintain monitoring and alerting systems using tools like Prometheus and Grafana.
- Identify and eliminate performance bottlenecks and implement auto-scaling policies.
- Manage cloud infrastructure using Infrastructure as Code tools like Terraform.
- Champion CI/CD automation for seamless deployment of model updates.
- Forecast infrastructure needs and manage cloud costs effectively.
- Define and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 3-5+ years of experience in a Site Reliability Engineer or DevOps role.
- Strong programming skills in Python, Go, or Java.
- Experience with containerization and orchestration technologies like Docker and Kubernetes.
- Deep understanding of monitoring and observability tools.
- Solid experience with Infrastructure as Code practices.
- Familiarity with CI/CD principles and tools.
- Excellent problem-solving skills for complex distributed systems.
Benefits
- Competitive compensation including equity and excellent benefits.
- 95% premium coverage for employee medical insurance.
- 77% premium coverage for dependents.
- Health Savings Account (HSA) with employer contribution.
- Access to well-being benefits including gym memberships and counseling services.
Tech Stack
AWS, Azure, Datadog, Docker, GitHub Actions, Go, Google Cloud Platform, Grafana, Java, Jenkins, Kubernetes, Prometheus, Python, Redis, Terraform
Categories
AI & ML, DevOps