Member of Technical Staff, Cluster Management

2 months ago

San Mateo, CA, USA

Senior / Staff+

H1B Sponsor

Responsibilities

Ensure systems are designed for high availability, scalability, and performance.
Lead incident detection, response, and resolution for production issues.
Develop and maintain monitoring, alerting, logging, and tracing solutions.
Automate repetitive operational tasks to improve efficiency.
Proactively plan capacity to handle growth and optimize performance.
Collaborate with engineers to embed reliability principles into development.
Participate in on-call rotation to support the production environment.

Qualifications

Bachelor's degree in Computer Science or a related technical field.
5+ years of experience in Site Reliability Engineering or similar roles.
Deep expertise in SRE principles, including SLOs and incident management.
Extensive experience with public cloud platforms like AWS, GCP, or Azure.
Strong experience with containerization (Docker) and orchestration (Kubernetes).
Proficiency with monitoring and logging tools such as Prometheus and Grafana.
Solid programming skills in at least one language for automation.
In-depth knowledge of Linux, networking fundamentals, and system debugging.
Proven ability to troubleshoot complex issues across the stack.
Excellent communication and collaboration skills.
Willingness to participate in on-call rotations.

AWS, Azure, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python