Base Salary
$180k - $240k/yr
Responsibilities
- Architect and maintain mission-critical Kubernetes clusters optimized for heavy GPU/TPU workloads.
- Implement and optimize Kubernetes-native GPU scheduling to ensure maximum hardware utilization.
- Drive the 'Everything as Code' philosophy using Terraform, Helm, and cloud-native tools.
- Deploy autonomous AI agents to monitor cluster health and enable automated triage of hardware failures.
- Build large-scale data pipelines using Apache Airflow, Kafka, and Spark.
- Implement robust GitOps workflows to automate deployment of infrastructure and model artifacts.
- Maintain visibility into infrastructure health and model serving performance using monitoring tools.
- Develop agent-driven workflows to optimize the developer experience.
- Design and maintain MLflow and feature-store integrations for model tracking.
- Build automated model-lifecycle pipelines using Airflow and Kubernetes.
- Support deployment of models into simulation and production environments.
- Enable researchers to scale models across multi-node setups using distributed training frameworks.
- Optimize low-level communication to minimize latency for large-scale training.
- Partner with researchers to fine-tune performance across multi-node GPU clusters.
Requirements
- 5+ years of experience in Cloud Infrastructure, DevOps, or MLOps supporting high-scale compute environments.
- Deep expertise in Kubernetes, Helm, and container orchestration.
- Strong background in Apache Airflow, Argo Workflows, MLflow, and Terraform.
- Practical experience supporting distributed systems frameworks like Ray and PyTorch Distributed.
- Proficiency in Python, Bash scripting, and a solid understanding of IAM/RBAC.
