
Staff Engineer
DigitalOcean
Posted 2 days ago
Base Salary
$191k - $239k/yr
Responsibilities
- Design and optimize hierarchical, high-throughput scheduling architectures for massive Kubernetes clusters.
- Implement fractional GPU allocation to maximize GPU utilization in multi-tenant environments.
- Deploy topology-aware scheduling to minimize communication latency for multi-GPU operations.
- Tune etcd and optimize admission webhooks to enhance cluster performance.
- Design secure environments for executing untrusted LLM-generated code.
- Orchestrate efficient model weight distribution and implement fault recovery capabilities.
- Implement robust gang scheduling for tightly-coupled, multi-node training jobs.
- Manage disaggregated AI inference pipelines with multilevel autoscaling.
Requirements
- Deep technical knowledge of Kubernetes core components and API performance optimization.
- Proven experience with AI-specific Kubernetes schedulers and orchestrators.
- Understanding of GPU architectures and how hardware topology impacts performance.
- Experience balancing performance and cost using resource management strategies.
- Familiarity with container runtime internals and security contexts.
- Strong understanding of modern LLM serving architectures.
- Experience tracking infrastructure and inference metrics.
Benefits
- Competitive benefits, including an Employee Assistance Program and a flexible time-off policy.
- Reimbursement for relevant conferences, training, and education.
- Access to LinkedIn Learning's 10,000+ courses for continued growth.
- Potential for bonuses and equity compensation based on performance.