Raft Company Website

Principal MLOps Engineer

14 days ago
Remote, United States +3 more
Staff+

Base Salary

$150k - $200k/yr

Responsibilities

  • Design, build, and maintain secure, scalable MLOps infrastructure and deployment pipelines for production ML systems.
  • Mature Raft’s internal ML platform and model lifecycle capabilities, including model packaging and monitoring.
  • Deploy and manage machine learning workloads on Kubernetes, including GPU-enabled clusters.
  • Support model serving and inference infrastructure for various ML use cases.
  • Build and maintain CI/CD workflows for ML services and platform components.
  • Collaborate with ML engineers and product teams to transition models from experimentation to deployment.
  • Enhance observability, reliability, security, and maintainability across ML infrastructure.
  • Evaluate and standardize runtime patterns and deployment architectures for production ML workloads.
  • Contribute to infrastructure decisions across edge, on-prem, and cloud-hosted environments.
  • Support compliance-driven deployment practices in defense environments.
  • Engage with customers in the Department of War.

Requirements

  • 7+ years of relevant experience in software engineering, platform engineering, DevOps, or MLOps.
  • 5+ years of experience with Docker and Kubernetes in production environments.
  • 5+ years of experience with enterprise cloud infrastructure in AWS, Azure, or similar.
  • Strong experience provisioning and troubleshooting Kubernetes clusters in production.
  • Experience building and maintaining machine learning platforms or pipelines.
  • Practical experience deploying machine learning workloads on Kubernetes.
  • Experience managing GPU-enabled clusters or workloads.
  • Strong understanding of Helm and Kubernetes deployment patterns.
  • Strong scripting or programming skills, preferably in Python.
  • Experience with modern software engineering practices including Git and CI/CD.
  • Strong troubleshooting, systems thinking, and communication skills.
  • Ability to work independently and collaboratively in a fast-paced environment.
  • Ability to obtain and maintain a Top Secret clearance.
  • Ability to obtain Security+ certification within the first 90 days.

Benefits

  • Highly competitive salary.
  • Fully covered medical, dental, and vision insurance.
  • 401(k) with company match.
  • Take-as-you-need PTO plus 11 paid holidays.
  • Education and training benefits.
  • Annual budget for tech and gadget needs.
  • Monthly box of snacks.
  • Remote, hybrid, and flexible work options.
  • Team off-site events in fun locations.
  • Generous referral bonuses.

Tech Stack

AWS, Azure, Docker, Git, Istio, Kubernetes, Python, Terraform

Categories

AI & ML, Data Engineering, DevOps