Site Reliability Engineer

4 months ago

Base Salary

$175k - $250k/yr

Responsibilities

Architect, operate, and continuously improve the core infrastructure powering our compute engine.
Build and evolve the observability stack to detect issues proactively.
Define, monitor, and drive SLOs/SLIs for system reliability.
Lead incident response efforts, including root cause analysis and post-mortems.
Design and implement automated operational systems to reduce manual toil.
Tune performance across compute, networking, and storage layers under extreme workloads.
Build automation and tooling to streamline operations and failure prediction.
Conduct load testing, chaos engineering, and performance benchmarking.
Ensure security best practices at the infrastructure layer.
Collaborate with platform engineers to integrate reliability into new features.

3+ years in SRE, DevOps, or infrastructure engineering roles.
Strong proficiency in at least one programming language such as Go, Rust, or Python.
Hands-on experience with a major cloud provider (AWS, GCP).
Solid knowledge of Linux systems, networking fundamentals, and distributed systems.
Experience with bare-metal servers and datacenter operations.
Experience with Kubernetes or similar orchestrators.
Familiarity with observability stacks like Prometheus or Grafana.
Experience building and maintaining CI/CD pipelines.
Strong debugging, problem-solving, and incident-management skills.

AWSDatadogGitHub ActionsGitLab CI/CDGo Google Cloud PlatformGrafanaJenkinsKubernetes LinuxPrometheusPython Rust Terraform