Staff Engineer

2 days ago

H-1B Sponsor

Base Salary

$191k - $239k/yr

Responsibilities

Design and optimize hierarchical, high-throughput scheduling architectures for massive Kubernetes clusters.
Implement fractional GPU allocation to maximize GPU utilization in multi-tenant environments.
Deploy topology-aware scheduling to minimize communication latency for multi-GPU operations.
Tune etcd and optimize admission webhooks to enhance cluster performance.
Design secure environments for executing untrusted LLM-generated code.
Orchestrate efficient model weight distribution and implement fault recovery capabilities.
Implement robust gang scheduling for tightly-coupled, multi-node training jobs.
Manage disaggregated AI inference pipelines with multilevel autoscaling.

Qualifications

Deep technical knowledge of Kubernetes core components and API performance optimization.
Proven experience with AI-specific Kubernetes schedulers and orchestrators.
Understanding of GPU architectures and how hardware topology impacts performance.
Experience balancing performance and cost using resource management strategies.
Familiarity with container runtime internals and security contexts.
Strong understanding of modern LLM serving architectures.
Experience tracking infrastructure and inference metrics.

Benefits

Competitive array of benefits including an Employee Assistance Program and flexible time off policy.
Reimbursement for relevant conferences, training, and education.
Access to LinkedIn Learning's 10,000+ courses for continued growth.
Potential for bonuses and equity compensation based on performance.

Kubernetes, PyTorch