Software Engineer, GPU Infrastructure - HPC
OpenAI
9 months ago
San Francisco, CA, USA
Mid Level / Senior
Base Salary
$230k - $490k/yr
Responsibilities
- Build and maintain automation systems for provisioning and managing server fleets.
- Develop tools to monitor server health, performance, and lifecycle events.
- Collaborate with the clusters, networking, and infrastructure teams.
- Partner with external operators to ensure a high level of quality.
- Identify and fix performance bottlenecks and inefficiencies.
- Continuously improve automation to reduce manual work.
Requirements
- Experience managing large-scale server environments.
- Proficiency in Python, Go, or similar programming languages.
- Strong knowledge of Linux, networking, and server hardware.
- Comfort with data analysis using SQL, PromQL, and Pandas.
- Prior hardware expertise is not required.
Tech Stack
Go, Grafana, Linux, Pandas, Prometheus, Python, SQL
Categories
AI & ML, Data Engineering, DevOps