6 days ago
Istanbul, Turkey
Mid Level / Senior / Staff+
Responsibilities
- Build and maintain a Python-based fleet tracking system for server lifecycle management.
- Develop server management tools that automate provisioning and health checks.
- Create metrics, dashboards, and alerting systems for hardware health monitoring.
- Leverage AI to automate alerting and recovery processes.
- Implement OS-level security measures and compliance automation.
- Manage and optimize distributed and local storage systems.
- Tune Linux systems for optimal AI workload performance.
- Develop automated error detection and recovery processes.
- Collaborate with partners to resolve technical issues.
Requirements
- 3+ years of experience managing large server fleets (100+ nodes).
- Strong software engineering skills in Python for production tooling.
- Deep knowledge of Linux systems including boot processes and kernel tuning.
- Experience with configuration management and infrastructure-as-code tools.
- Solid understanding of storage technologies and Linux I/O stack tuning.
- Familiarity with hardware diagnostics and failure modes.
- Experience building internal tools or dashboards for infrastructure visibility.
- Excellent communication skills and ability to drive technical decisions.
- Self-starter with a focus on ownership and continuous improvement.
Benefits
- Interesting and challenging work.
- Opportunities for learning and growth.
- Regular team events and offsites.
