
Member of Technical Staff, Cluster Management
Fireworks AI
2 months ago
San Mateo, CA, USA
Senior / Staff+
H1B Sponsor
Responsibilities
- Ensure systems are designed for high availability, scalability, and performance.
- Lead incident detection, response, and resolution for production issues.
- Develop and maintain monitoring, alerting, logging, and tracing solutions.
- Automate repetitive operational tasks to improve efficiency.
- Proactively plan capacity to handle growth and optimize performance.
- Collaborate with engineers to embed reliability principles into development.
- Participate in the on-call rotation to support the production environment.
Requirements
- Bachelor's degree in Computer Science or related technical field.
- 5+ years of experience in Site Reliability Engineering or similar roles.
- Deep expertise in SRE principles, including SLOs and incident management.
- Extensive experience with public cloud platforms like AWS, GCP, or Azure.
- Strong experience with containerization (Docker) and orchestration (Kubernetes).
- Proficiency with monitoring and logging tools such as Prometheus and Grafana.
- Solid programming skills in at least one language for automation.
- In-depth knowledge of Linux, networking fundamentals, and system debugging.
- Proven ability to troubleshoot complex issues across the stack.
- Excellent communication and collaboration skills.
- Willingness to participate in on-call rotations.
Benefits
- Tackle challenges at the forefront of AI infrastructure.
- Work with cutting-edge technology impacting global AI usage.
- Join a passionate team where your work shapes the future of AI.
- Collaborate with world-class engineers and AI researchers.