9 days ago
San Francisco, CA, USA
Mid Level / Senior
Base Salary
$180k - $250k/yr
Responsibilities
- Build and maintain a Python-based fleet tracking system for server lifecycle management.
- Develop server management tooling for automation of provisioning and health checks.
- Create metrics, dashboards, and alerting for hardware health monitoring.
- Leverage AI to enhance tools and automate alerting and recovery.
- Implement OS-level security measures and compliance automation.
- Manage and optimize distributed and local storage systems.
- Tune Linux systems for AI workloads and optimize performance.
- Develop automated error detection and recovery processes.
- Collaborate with partners to resolve technical issues.
Requirements
- 3+ years of experience managing large server fleets (100+ nodes).
- Strong software engineering skills in Python for production tooling.
- Deep knowledge of Linux systems including boot process and kernel tuning.
- Experience with configuration management and infrastructure-as-code tools.
- Solid understanding of storage technologies and Linux I/O stack tuning.
- Familiarity with hardware diagnostics and failure modes.
- Experience building internal tools or dashboards for infrastructure visibility.
- Excellent communication skills and ability to drive technical decisions.
- Self-starter with a focus on ownership and continuous improvement.
Benefits
- Interesting and challenging work.
- Opportunities for learning and growth.
- Relocation assistance to San Francisco.
- Health, dental, and vision insurance.
- Regular team events and offsites.
