Base Salary
$180k - $240k/yr
Responsibilities
- Architect and maintain mission-critical Kubernetes clusters optimized for heavy GPU/TPU workloads.
- Implement and optimize Kubernetes-native GPU scheduling to ensure maximum hardware utilization.
- Drive the 'Everything as Code' philosophy using Terraform, Helm, and cloud-native tools.
- Deploy autonomous AI agents to monitor cluster health and enable automated triage of hardware failures.
- Build large-scale data pipelines using Apache Airflow, Kafka, and Spark.
- Implement robust GitOps workflows to automate deployment of infrastructure and model artifacts.
- Maintain visibility into infrastructure health and model serving performance using monitoring tools.
- Develop agent-driven workflows to optimize the developer experience.
- Design and maintain MLflow and feature-store integrations for model tracking.
- Build automated model-lifecycle pipelines using Airflow and Kubernetes.
- Support deployment of models into simulation and production environments.
- Enable researchers to scale models across multi-node setups using distributed training frameworks.
- Optimize low-level communication to minimize latency for large-scale training.
- Partner with researchers to fine-tune performance across multi-node GPU clusters.
Requirements
- 5+ years of experience in Cloud Infrastructure, DevOps, or MLOps supporting high-scale compute environments.
- Deep expertise in Kubernetes, Helm, and container orchestration.
- Strong background in Apache Airflow, Argo Workflows, MLflow, and Terraform.
- Practical experience supporting distributed systems frameworks like Ray and PyTorch Distributed.
- Proficiency in Python, Bash scripting, and a solid understanding of IAM/RBAC.
