GrepJob
Gatik AI

Senior AI Infrastructure Engineer

about 2 months ago
Mountain View, CA, USA
Senior
H1B Sponsor

Base Salary

$180k - $240k/yr

Responsibilities

  • Design and build high-performance AI infrastructure for autonomous driving models.
  • Enable distributed training of complex models across multi-node setups.
  • Optimize multi-GPU setups for efficient model and data parallelism.
  • Implement intelligent resource scheduling to maximize hardware utilization.
  • Deploy and scale optimized model artifacts for inference performance.
  • Architect self-healing AI infrastructure for automated hardware monitoring.
  • Develop agent-driven automation for infrastructure and data tasks.
  • Automate the end-to-end model lifecycle using ML infrastructure tools.
  • Collaborate with data teams to scale ETL pipelines for dataset management.
  • Define and track key ML system metrics for performance monitoring.

Requirements

  • 5+ years of experience in ML infrastructure, MLOps, or DevOps.
  • Deep understanding of multi-GPU training strategies and high-performance networking.
  • Mastery of Kubernetes, Terraform, and Helm for infrastructure automation.
  • Experience with AI agent frameworks for infrastructure automation.
  • Expertise in MLflow, Argo Workflows, and Kubernetes.
  • Strong experience with Docker and containerization technologies.
  • Proficiency in Apache Airflow, Kafka, Spark, and GitOps automation.
  • Core programming skills in Python and Bash; experience with Go or Rust is a plus.

Tech Stack

Apache Airflow, Apache Kafka, Apache Spark, Bash, Docker, Go, Grafana, Helm, Kubernetes, MLflow, Prometheus, Python, PyTorch, Rust, Terraform

Categories

AI & ML, Data Engineering, DevOps