fal

Software Engineer, Infrastructure

9 days ago
San Francisco, CA, USA
Mid Level / Senior

Base Salary

$180k - $250k/yr

Responsibilities

  • Build and maintain a Python fleet tracking system for server lifecycle management.
  • Develop server management tooling for automation of provisioning and health checks.
  • Create metrics, dashboards, and alerting for hardware health monitoring.
  • Leverage AI to enhance tools and automate alerting and recovery.
  • Implement OS-level security measures and compliance automation.
  • Manage and optimize distributed and local storage systems.
  • Tune Linux systems for AI workloads and optimize performance.
  • Develop automated error detection and recovery processes.
  • Collaborate with partners to resolve technical issues.

Requirements

  • 3+ years of experience managing large server fleets (100+ nodes).
  • Strong software engineering skills in Python for production tooling.
  • Deep knowledge of Linux systems including boot process and kernel tuning.
  • Experience with configuration management and infrastructure-as-code tools.
  • Solid understanding of storage technologies and Linux I/O stack tuning.
  • Familiarity with hardware diagnostics and failure modes.
  • Experience building internal tools or dashboards for infrastructure visibility.
  • Excellent communication skills and ability to drive technical decisions.
  • Self-starter with a focus on ownership and continuous improvement.

Benefits

  • Interesting and challenging work.
  • Opportunities for learning and growth.
  • Relocation assistance to San Francisco.
  • Health, dental, and vision insurance.
  • Regular team events and offsites.

Tech Stack

Ansible, Linux, Python, Terraform