Fireworks AI

Member of Technical Staff, Cluster Management

2 months ago
San Mateo, CA, USA
Senior / Staff+
H1B Sponsor

Responsibilities

  • Ensure systems are designed for high availability, scalability, and performance.
  • Lead incident detection, response, and resolution for production issues.
  • Develop and maintain monitoring, alerting, logging, and tracing solutions.
  • Automate repetitive operational tasks to improve efficiency.
  • Proactively plan capacity to handle growth and optimize performance.
  • Collaborate with engineers to embed reliability principles into development.
  • Participate in the on-call rotation to support the production environment.

Requirements

  • Bachelor's degree in Computer Science or related technical field.
  • 5+ years of experience in Site Reliability Engineering or similar roles.
  • Deep expertise in SRE principles, including SLOs and incident management.
  • Extensive experience with public cloud platforms like AWS, GCP, or Azure.
  • Strong experience with containerization (Docker) and orchestration (Kubernetes).
  • Proficiency with monitoring and logging tools such as Prometheus and Grafana.
  • Solid programming skills in at least one language for automation.
  • In-depth knowledge of Linux, networking fundamentals, and system debugging.
  • Proven ability to troubleshoot complex issues across the stack.
  • Excellent communication and collaboration skills.
  • Willingness to participate in on-call rotations.

Benefits

  • Tackle challenges at the forefront of AI infrastructure.
  • Work with cutting-edge technology impacting global AI usage.
  • Join a passionate team where your work shapes the future of AI.
  • Collaborate with world-class engineers and AI researchers.

Tech Stack

AWS, Azure, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python

Categories

AI & ML, Data Engineering, DevOps, Security