5 months ago
Base Salary
$210k - $240k/yr
Responsibilities
- Design, build, and maintain scalable infrastructure for real-time analytics and machine learning workloads.
- Improve system reliability and performance through automation and observability.
- Own and evolve CI/CD pipelines, deployment automation, and configuration management.
- Implement and maintain monitoring, alerting, and incident response processes.
- Collaborate with engineering and data science teams to promote performance and reliability.
- Ensure security, compliance, and operational readiness of cloud infrastructure.
- Drive post-incident analysis and continuous improvement initiatives.
Requirements
- 8+ years of experience in SRE, DevOps, or infrastructure engineering roles.
- 5+ years of experience with datacenter operations or system and network administration.
- Experience with containerization (Docker) and orchestration (Kubernetes).
- Strong knowledge of Linux systems, networking, and performance tuning.
- Solid understanding of infrastructure-as-code tools like Terraform and Ansible.
- Good programming and scripting skills with tools and languages such as Terraform (HCL), Ansible (YAML), Bash, or Python.
- Experience with monitoring and observability stacks like Prometheus or Grafana.
- Proficiency with CI/CD tools and pipelines such as GitHub Actions.
Benefits
- Ownership of mission-critical infrastructure in a company solving real-world problems.
- A front-row seat to a high-performance engineering culture.
- The ability to influence platform scaling from deployment to incident management.
- An environment that values curiosity, accountability, and impact.
