GrepJob
Reddit

Staff Machine Learning Engineer, GenAI Platform

Reddit
Apply
2 days ago
Remote, United States
Staff+
H1B Sponsor

Base Salary

$253k - $355k/yr

Responsibilities

  • Propose, design, and lead the architecture of the next-generation LLM platform.
  • Architect highly fault-tolerant training infrastructure for multi-week distributed workloads.
  • Design and implement robust pipelines for LLM fine-tuning.
  • Build scalable systems for automated model evaluation and regression detection.
  • Extend distributed data platforms to handle multimodal datasets efficiently.
  • Mentor senior engineers and define technical roadmaps.

Requirements

  • 10+ years of experience in production software development or distributed data systems.
  • Proven track record in designing and operating large-scale ML systems.
  • Hands-on experience with fault-tolerant, petabyte-scale distributed systems.
  • Deep understanding of ML orchestration and fine-tuning pipelines.
  • Experience with CUDA environments and GPU virtualization/containerization.
  • Proficiency in Kubernetes, Docker, and production-quality code in Python and/or Go.
  • Strong organizational and communication skills.

Benefits

  • Comprehensive Healthcare Benefits and Income Replacement Programs.
  • 401k with Employer Match.
  • Global Benefit programs that fit your lifestyle.
  • Family Planning Support.
  • Gender-Affirming Care.
  • Mental Health & Coaching Benefits.
  • Flexible Vacation & Paid Volunteer Time Off.
  • Generous Paid Parental Leave.

Tech Stack

DockerGoKubernetesMLflowPython

Categories

AI & MLData EngineeringDevOps