Software Engineer, Platform Systems
OpenAI
about 1 month ago
London, United Kingdom
Mid Level / Senior
Responsibilities
- Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs.
- Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior.
- Improve observability, reliability, and performance across OpenAI’s training platform.
- Debug and resolve issues in complex, high-throughput distributed systems.
- Collaborate with systems, infrastructure, and research teams to evolve platform capabilities.
- Extend and adapt failure detection and tracing systems to support new training paradigms and workloads.
Requirements
- Care deeply about performance, stability, and observability in distributed systems.
- Enjoy finding and fixing issues in large-scale systems and automating operational workflows.
- Have experience writing low-level software where system details matter.
- Understand hardware, operating systems, networking, concurrency, and distributed systems.
- Have a background in high-performance computing or low-level systems engineering.
- Are excited to work on critical infrastructure that powers frontier AI research.
Categories
AI & ML, Backend, Data Engineering, DevOps