about 2 months ago
San Francisco, CA, USA or Sunnyvale, CA, USAStaff+
Base Salary
$208k - $253k/yr
Responsibilities
- Develop and implement diagnostics and troubleshooting for GPU hardware faults.
- Create automation tooling for GPU platforms including NVIDIA and AMD systems.
- Develop AI agents for diagnosing and remediating hardware issues.
- Collaborate with data center operations to innovate tooling for critical environments.
- Build post-repair validation and testing tools to ensure system performance.
- Oversee deployment, monitoring, and operational support of developed tooling.
- Create automation for facilities management and cooling hardware systems.
Requirements
- Proven software engineering experience.
- Ability to identify problems and develop scalable solutions quickly.
- Experience in setting technical direction for projects.
- Expertise in distributed systems and cloud platforms like Kubernetes and GCP.
- Proficiency in at least one programming language such as Go, Python, Java, or Rust.
- Strong analytical and problem-solving skills.
- Excellent communication and collaboration abilities.
- Capability to work independently and as part of a team.
Benefits
- Industry competitive pay.
- Restricted Stock Units in a fast-growing technology company.
- Health insurance options including HDHP and PPO, vision, and dental.
- Employer contributions to HSA accounts.
- Paid Parental Leave.
- Paid life insurance and disability coverage.
- 401(k) with a 100% match up to 4% of salary.
- Generous paid time off and holiday schedule.
- Cell phone reimbursement and tuition reimbursement.
- Subscription to the Calm app and company-paid commuter benefit.
