DevOps / Site Reliability Engineer

3 days ago

Remote, Worldwide

Mid Level / Senior

H1B Sponsor

Responsibilities

Own cloud infrastructure on AWS including EC2, EKS, RDS, S3, IAM, and VPC.
Manage Kubernetes clusters and container orchestration end-to-end.
Build and maintain CI/CD pipelines using GitHub Actions or similar tools.
Implement monitoring, alerting, and observability using tools such as Prometheus and Grafana.
Improve reliability, performance, and security of production systems.
Automate infrastructure using Terraform or similar IaC tools.
Debug and resolve issues across complex, distributed systems.
Participate in design reviews to enhance infrastructure quality.

Benefits

Competitive compensation and meaningful equity.
Direct impact on frontier AI model training and evaluation infrastructure.
Flexible, remote-friendly work environment with low bureaucracy.
Opportunity to work with a small, high-caliber team with deep AI research expertise.
Health, wellness, and learning & development benefits.

AWS, Datadog, GitHub Actions, Go, Grafana, Kubernetes, Prometheus, Python, Terraform