Software Engineer, Platform Infrastructure (Foundations)

Palo Alto, CA, USA or San Francisco, CA, USA

Mid Level / Senior

H1B Sponsor

Responsibilities

Design, build, and scale services that orchestrate Ray clusters across cloud and on-prem environments.
Optimize control plane components for large-scale, distributed AI/ML workloads.
Build intelligent scheduling and resource management systems for heterogeneous compute clusters.
Develop features to enhance the reliability, performance, scalability, and observability of Anyscale-managed Ray workloads.
Support and optimize accelerator integration (e.g., GPUs, TPUs).
Handle container image management and dependency resolution for distributed workloads.
Participate in code reviews and in design and architecture discussions.
Provide on-call support and troubleshoot infrastructure issues.
Collaborate with distributed systems and machine learning experts.

Qualifications

Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
3+ years of experience writing high-quality production code.
Hands-on experience building and maintaining highly available, scalable, and performant distributed systems.
Expertise in cloud platforms (AWS, Azure, GCP) and Kubernetes-based deployments.
Deep understanding of networking, security, and authentication mechanisms in cloud environments.
Familiarity with observability stacks (Prometheus, Grafana, etc.).
Proficiency in Go and Python.
Knowledge of low-level operating system foundations (Linux kernel, file systems, containers).