Palo Alto, CA, USA or San Francisco, CA, USA
Senior / Mid Level
H-1B Sponsor
Responsibilities
- Design, build, and scale services for orchestrating Ray clusters across cloud and on-prem environments.
- Optimize control plane components for large-scale, distributed AI/ML workloads.
- Build intelligent scheduling and resource management systems for heterogeneous compute clusters.
- Develop features to enhance reliability, performance, scalability, and observability of Ray workloads.
- Support and optimize accelerator integration, including GPUs and TPUs.
- Manage container images and dependency resolution for distributed workloads.
- Participate in code reviews and design discussions.
- Provide on-call support and troubleshoot infrastructure issues.
Requirements
- Bachelor's degree in Computer Science, Engineering, or equivalent experience.
- 3+ years of experience writing high-quality production code.
- Hands-on experience building and maintaining scalable distributed systems.
- Expertise in cloud-native technologies and Kubernetes-based deployments.
- Deep understanding of networking, security, and authentication in cloud environments.
- Familiarity with observability stacks such as Prometheus and Grafana.
- Proficiency in Go and Python.
- Knowledge of low-level operating system fundamentals.
