Alembic

Senior Site Reliability Engineer

5 months ago
San Francisco, CA, USA
Senior / Staff+
H1B Sponsor

Base Salary

$210k - $240k/yr

Responsibilities

  • Design, build, and maintain scalable infrastructure for real-time analytics and machine learning workloads.
  • Improve system reliability and performance through automation and observability.
  • Own and evolve CI/CD pipelines, deployment automation, and config management.
  • Implement and maintain monitoring, alerting, and incident response processes.
  • Collaborate with engineering and data science teams to promote performance and reliability.
  • Ensure security, compliance, and operational readiness of cloud infrastructure.
  • Drive post-incident analysis and continuous improvement initiatives.

Requirements

  • 8+ years of experience in SRE, DevOps, or infrastructure engineering roles.
  • 5+ years of experience with datacenter operations or system and network administration.
  • Experience with containerization (Docker) and orchestration (Kubernetes).
  • Strong knowledge of Linux systems, networking, and performance tuning.
  • Solid understanding of infrastructure-as-code tools like Terraform and Ansible.
  • Good programming and scripting skills with tools and languages such as Terraform, Ansible, Bash, or Python.
  • Experience with monitoring and observability stacks like Prometheus or Grafana.
  • Proficiency with CI/CD tools and pipelines such as GitHub Actions.

Benefits

  • Ownership of mission-critical infrastructure in a company solving real-world problems.
  • A front-row seat to a high-performance engineering culture.
  • The ability to influence platform scaling from deployment to incident management.
  • An environment that values curiosity, accountability, and impact.

Tech Stack

Ansible, Apache Airflow, Apache Kafka, Apache Spark, AWS, Bash, Datadog, Docker, GitHub Actions, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform

Categories

AI & ML, Data Engineering, DevOps, Security