OpenAI

Software Engineer, Platform Systems

Posted about 1 month ago
San Francisco, CA, USA
Mid Level / Senior

Base Salary

$310k - $460k/yr

Responsibilities

  • Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs.
  • Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior.
  • Improve observability, reliability, and performance across OpenAI’s training platform.
  • Debug and resolve issues in complex, high-throughput distributed systems.
  • Collaborate with systems, infrastructure, and research teams to evolve platform capabilities.
  • Extend and adapt failure detection and tracing systems to support new training paradigms and workloads.

Requirements

  • Care deeply about performance, stability, and observability in distributed systems.
  • Enjoy finding and fixing issues in large-scale systems and automating operational workflows.
  • Have experience writing low-level software where system details matter.
  • Understand hardware, operating systems, networking, concurrency, and distributed systems.
  • Have a background in high-performance computing or low-level systems engineering.
  • Are excited to work on critical infrastructure that powers frontier AI research.

Categories

AI & ML, Backend, Data Engineering, DevOps