Stack AV

Senior Compute Platform Engineer

Remote, Worldwide or Pittsburgh, PA, USA
Senior
H1B Sponsor

Responsibilities

  • Design and operate distributed systems for scheduling and executing large-scale batch workloads across Kubernetes clusters.
  • Build and maintain compute platform abstractions.
  • Optimize utilization of compute resources.
  • Develop and improve multi-tenant scheduling strategies.
  • Improve reliability and fault tolerance of large-scale distributed jobs and platform components.
  • Collaborate with teams across the company to understand workload requirements and improve platform capabilities.
  • Contribute to platform tooling, automation, and CI/CD workflows.

Requirements

  • 7+ years of experience building and operating distributed systems or infrastructure platforms.
  • Strong experience with Kubernetes and container orchestration in production-grade environments.
  • Proficiency developing in Golang and Python.
  • Experience designing and operating large-scale batch compute systems.
  • Strong debugging and problem-solving skills in complex distributed systems.
  • Ability to collaborate across teams and communicate technical concepts clearly.
  • Experience with at least one batch scheduling system such as Kueue, Armada, Volcano, or Slurm.

Categories

AI & ML, Data Engineering, DevOps