Anyscale

Senior Site Reliability Engineer, Platform Infrastructure (Foundations)

Posted about 6 hours ago
Palo Alto, CA, USA or San Francisco, CA, USA
Senior / Mid Level
H-1B Sponsor

Responsibilities

  • Design, build, and scale services for orchestrating Ray clusters across cloud and on-prem environments.
  • Optimize control plane components for large-scale, distributed AI/ML workloads.
  • Build intelligent scheduling and resource management systems for heterogeneous compute clusters.
  • Develop features to enhance reliability, performance, scalability, and observability of Ray workloads.
  • Support and optimize accelerator integration, including GPUs and TPUs.
  • Manage container images and dependency resolution for distributed workloads.
  • Participate in code reviews and design discussions.
  • Provide on-call support and troubleshoot infrastructure issues.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or equivalent experience.
  • 3+ years of experience writing high-quality production code.
  • Hands-on experience in building and maintaining scalable distributed systems.
  • Expertise in cloud-native technologies and Kubernetes-based deployments.
  • Deep understanding of networking, security, and authentication in cloud environments.
  • Familiarity with observability stacks like Prometheus and Grafana.
  • Proficiency in Go and Python.
  • Knowledge of low-level operating system foundations.

Categories

AI & ML, Data Engineering, DevOps