Cloud Site Reliability Engineer

3 months ago

San Jose, CA, USA

Mid Level / Senior

H1B Sponsor

Responsibilities

Take ownership of the production inferencing service, ensuring its availability and performance.
Participate in a shared on-call rotation for 24/7 service support.
Develop and maintain monitoring and alerting systems using tools like Prometheus and Grafana.
Identify and eliminate performance bottlenecks and implement auto-scaling policies.
Manage cloud infrastructure using Infrastructure as Code tools like Terraform.
Champion CI/CD automation for seamless deployment of model updates.
Forecast infrastructure needs and manage cloud costs effectively.
Define and report on Service Level Objectives and Indicators.

Qualifications

Bachelor's degree in Computer Science, Engineering, or a related field.
3-5+ years of experience in a Site Reliability Engineer or DevOps role.
Strong programming skills in Python, Go, or Java.
Experience with containerization and orchestration technologies like Docker and Kubernetes.
Deep understanding of monitoring and observability tools.
Solid experience with Infrastructure as Code practices.
Familiarity with CI/CD principles and tools.
Excellent problem-solving skills for complex distributed systems.

Benefits

Competitive compensation including equity and excellent benefits.
95% premium coverage for employee medical insurance.
77% premium coverage for dependents.
Health Savings Account (HSA) with employer contribution.
Access to well-being benefits including gym memberships and counseling services.