Baseten

Software Engineer - Training Infrastructure

8 months ago
San Francisco, CA, USA or New York, NY, USA
Mid Level / Senior

Base Salary

$165k - $330k/yr

Responsibilities

  • Design and architect scalable infrastructure systems for the ML training platform.
  • Partner closely with developers and research engineers to translate training requirements into technical solutions.
  • Design and architect a global training scheduler.
  • Design and architect reinforcement learning systems and continuous learning pipelines.
  • Drive long-term improvements to enhance system reliability and development velocity.
  • Collaborate with SRE and Capacity teams to optimize training infrastructure.
  • Make critical architectural decisions balancing performance and reliability.
  • Lead technical discussions and mentor junior engineers on best practices.
  • Contribute to the long-term technical strategy and infrastructure roadmap.

Requirements

  • Bachelor’s degree in Computer Science or related field.
  • Proficiency in Go, with Python experience preferred.
  • Deep expertise with Kubernetes in production environments.
  • Extensive experience with major cloud providers such as AWS and GCP.
  • Advanced understanding of distributed systems concepts and performance tuning.
  • Proven experience designing observability systems.
  • Experience with ML/AI workloads and MLOps platforms is highly valued.

Benefits

  • Competitive compensation, including meaningful equity.
  • 100% coverage of medical, dental, and vision insurance for employees and dependents.
  • Flexible PTO policy including a company-wide Winter Break.
  • Paid parental leave.
  • Fertility and family-building stipend through Carrot.
  • Company-facilitated 401(k).
  • Exposure to a variety of ML startups for learning and networking opportunities.

Tech Stack

Apache Airflow, AWS, DigitalOcean, Go, Google Cloud Platform, Kubernetes, Python, PyTorch

Categories

AI & ML, Data Engineering, DevOps