2 months ago
San Francisco, CA, USA
Mid Level / Senior
Base Salary
$175k - $250k/yr
Responsibilities
- Architect, operate, and continuously improve the core infrastructure powering our compute engine.
- Build and evolve the observability stack to detect issues proactively.
- Define, monitor, and drive SLOs/SLIs for system reliability.
- Lead incident response efforts, including root cause analysis and post-mortems.
- Design and implement automated operational systems to reduce manual toil.
- Tune performance across compute, networking, and storage layers under extreme workloads.
- Build automation and tooling to streamline operations and failure prediction.
- Conduct load testing, chaos engineering, and performance benchmarking.
- Ensure security best practices at the infrastructure layer.
- Collaborate with platform engineers to integrate reliability into new features.
Requirements
- 3+ years of experience in SRE, DevOps, or infrastructure engineering roles.
- Strong proficiency in at least one programming language such as Go, Rust, or Python.
- Hands-on experience with a major cloud provider (e.g., AWS or GCP).
- Solid knowledge of Linux systems, networking fundamentals, and distributed systems.
- Experience with bare-metal servers and datacenter operations.
- Experience with Kubernetes or similar orchestrators.
- Familiarity with observability tools such as Prometheus and Grafana.
- Experience building and maintaining CI/CD pipelines.
- Strong debugging, problem-solving, and incident-management skills.
