Senior Site Reliability Engineer, Platform Infrastructure (Foundations)

about 2 months ago

Palo Alto, CA, USA or San Francisco, CA, USA

Senior / Mid Level

H1B Sponsor

Responsibilities

Design, build, and scale services for orchestrating Ray clusters across cloud and on-prem environments.
Optimize control plane components for large-scale, distributed AI/ML workloads.
Build intelligent scheduling and resource management systems for heterogeneous compute clusters.
Develop features to enhance reliability, performance, scalability, and observability of Ray workloads.
Support and optimize accelerator integration, including GPUs and TPUs.
Manage container images and dependency resolution for distributed workloads.
Participate in code reviews and design discussions.
Provide on-call support and troubleshoot infrastructure issues.

Qualifications

Bachelor's degree in Computer Science or Engineering, or equivalent experience.
3+ years of experience writing high-quality production code.
Hands-on experience building and maintaining scalable distributed systems.
Expertise in cloud-native technologies and Kubernetes-based deployments.
Deep understanding of networking, security, and authentication in cloud environments.
Familiarity with observability stacks like Prometheus and Grafana.
Proficiency in Go and Python.
Knowledge of low-level operating system fundamentals.