16 days ago
Base Salary
$170k - $190k/yr
Responsibilities
- Automate and control datacenter and cloud-based infrastructure.
- Improve system reliability and performance through automation and observability.
- Create and configure tools for datacenter provisioning and configuration management.
- Enhance developer experience by providing self-service tools.
- Implement and maintain monitoring, alerting, and incident response processes.
- Collaborate with engineering and data science teams to promote performance and reliability.
- Ensure security, compliance, and operational readiness across infrastructure.
- Drive post-incident analysis and continuous improvement initiatives.
Requirements
- 5+ years of experience in tools development, SRE, DevOps, or platform engineering roles.
- Proficiency with IaC and configuration management tools such as Ansible, Helm, and Kustomize.
- Strong programming skills in Python or Go.
- Deep experience with Docker and Kubernetes.
- Strong knowledge of Linux systems and networking fundamentals.
- Experience with monitoring and observability stacks like Prometheus and Grafana.
- Proficiency with CI/CD tools and pipelines such as GitHub Actions.
- Ability to debug complex systems and automate solutions using scripting languages.
- Excellent communication skills for cross-functional collaboration.
Benefits
- Ownership of mission-critical infrastructure.
- A front-row seat to a high-performance engineering culture.
- Influence over platform scaling from deployment to incident management.
- An environment that values curiosity, accountability, and impact.
