DigitalOcean

Staff Engineer

2 days ago

Base Salary

$191k - $239k/yr

Responsibilities

  • Design and optimize hierarchical, high-throughput scheduling architectures for massive Kubernetes clusters.
  • Implement fractional GPU allocation to maximize GPU utilization in multi-tenant environments.
  • Deploy topology-aware scheduling to minimize communication latency for multi-GPU operations.
  • Tune etcd and optimize admission webhooks to enhance cluster performance.
  • Design secure environments for executing untrusted LLM-generated code.
  • Orchestrate efficient model weight distribution and implement fault recovery capabilities.
  • Implement robust gang scheduling for tightly-coupled, multi-node training jobs.
  • Manage disaggregated AI inference pipelines with multilevel autoscaling.

Requirements

  • Deep technical knowledge of Kubernetes core components and API performance optimization.
  • Proven experience with AI-specific Kubernetes schedulers and orchestrators.
  • Understanding of GPU architectures and how hardware topology impacts performance.
  • Experience balancing performance and cost using resource management strategies.
  • Familiarity with container runtime internals and security contexts.
  • Strong understanding of modern LLM serving architectures.
  • Experience tracking infrastructure and inference metrics.

Benefits

  • Competitive benefits, including an Employee Assistance Program and a flexible time-off policy.
  • Reimbursement for relevant conferences, training, and education.
  • Access to LinkedIn Learning's 10,000+ courses for continued growth.
  • Potential for bonuses and equity compensation based on performance.

Tech Stack

Kubernetes, PyTorch
