about 2 months ago
Base Salary
$180k - $240k/yr
Responsibilities
- Design and build high-performance AI infrastructure for autonomous driving models.
- Enable distributed training of complex models across multi-node setups.
- Optimize multi-GPU setups for efficient model and data parallelism.
- Implement intelligent resource scheduling to maximize hardware utilization.
- Deploy and scale optimized model artifacts for inference performance.
- Architect self-healing AI infrastructure with automated hardware monitoring.
- Develop agent-driven automation for infrastructure and data tasks.
- Automate the end-to-end model lifecycle using ML infrastructure tools.
- Collaborate with data teams to scale ETL pipelines for dataset management.
- Define and track key ML system metrics for performance monitoring.
Requirements
- 5+ years of experience in ML infrastructure, MLOps, or DevOps.
- Deep understanding of multi-GPU training strategies and high-performance networking.
- Mastery of Kubernetes, Terraform, and Helm for infrastructure automation.
- Experience with AI agent frameworks for infrastructure automation.
- Expertise in MLflow, Argo Workflows, and Kubernetes.
- Strong experience with Docker and containerization technologies.
- Proficiency in Apache Airflow, Kafka, Spark, and GitOps automation.
- Core programming skills in Python and Bash; experience with Go or Rust is a plus.
