Blaxel

Site Reliability Engineer

2 months ago
San Francisco, CA, USA
Mid Level / Senior

Base Salary

$175k - $250k/yr

Responsibilities

  • Architect, operate, and continuously improve the core infrastructure powering our compute engine.
  • Build and evolve the observability stack to detect issues proactively.
  • Define, monitor, and drive SLOs/SLIs for system reliability.
  • Lead incident response efforts, including root cause analysis and post-mortems.
  • Design and implement automated operational systems to reduce manual toil.
  • Tune performance across compute, networking, and storage layers under extreme workloads.
  • Build automation and tooling to streamline operations and predict failures.
  • Conduct load testing, chaos engineering, and performance benchmarking.
  • Ensure security best practices at the infrastructure layer.
  • Collaborate with platform engineers to integrate reliability into new features.

Requirements

  • 3+ years in SRE, DevOps, or infrastructure engineering roles.
  • Strong proficiency in at least one programming language such as Go, Rust, or Python.
  • Hands-on experience with a major cloud provider (AWS or GCP).
  • Solid knowledge of Linux systems, networking fundamentals, and distributed systems.
  • Experience with bare-metal servers and datacenter operations.
  • Experience with Kubernetes or similar orchestrators.
  • Familiarity with observability tools such as Prometheus or Grafana.
  • Experience building and maintaining CI/CD pipelines.
  • Strong debugging, problem-solving, and incident-management skills.

Tech Stack

AWS, Datadog, GitHub Actions, GitLab CI/CD, Go, Google Cloud Platform, Grafana, Jenkins, Kubernetes, Linux, Prometheus, Python, Rust, Terraform