Gatik AI

Senior Cloud Infrastructure Engineer

about 2 months ago
Mountain View, CA, USA
Senior
H1B Sponsor

Base Salary

$180k - $240k/yr

Responsibilities

  • Architect and maintain mission-critical Kubernetes clusters optimized for heavy GPU/TPU workloads.
  • Implement and optimize Kubernetes-native GPU scheduling to ensure maximum hardware utilization.
  • Drive the 'Everything as Code' philosophy using Terraform, Helm, and cloud-native tools.
  • Deploy autonomous AI agents to monitor cluster health and automate triage of hardware failures.
  • Build large-scale data pipelines using Apache Airflow, Kafka, and Spark.
  • Implement robust GitOps workflows to automate deployment of infrastructure and model artifacts.
  • Maintain visibility into infrastructure health and model serving performance using monitoring tools.
  • Develop agent-driven workflows to optimize the developer experience.
  • Design and maintain MLflow and feature store integrations for model tracking.
  • Build automated model lifecycles using Airflow and Kubernetes.
  • Support deployment of models into simulation and production environments.
  • Enable researchers to scale models across multi-node setups using distributed training frameworks.
  • Optimize low-level communication to minimize latency for large-scale training.
  • Partner with researchers to fine-tune performance across multi-node GPU clusters.

Requirements

  • 5+ years of experience in Cloud Infrastructure, DevOps, or MLOps supporting high-scale compute environments.
  • Deep expertise in Kubernetes, Helm, and container orchestration.
  • Strong background in Apache Airflow, Argo Workflows, MLflow, and Terraform.
  • Practical experience supporting distributed systems frameworks like Ray and PyTorch Distributed.
  • Proficiency in Python, Bash scripting, and a solid understanding of IAM/RBAC.

Tech Stack

Apache Airflow, Apache Kafka, Apache Spark, Bash, GitLab CI/CD, Grafana, Helm, Kubernetes, MLflow, Prometheus, Python, PyTorch, Terraform

Categories

AI & ML, Data Engineering, DevOps