Site Reliability Engineer

about 2 months ago

San José, Costa Rica

Senior

H-1B Sponsor

Responsibilities

Design and deploy production-grade infrastructure on cloud platforms using Infrastructure as Code tools.
Optimize system performance and architecture to ensure maximum uptime and minimal latency.
Architect robust deployment pipelines and manage hosted and self-hosted runners.
Create infrastructure that ensures new applications have logging, metrics, and alerts enabled by default.
Build internal AI plugins and automation scripts to enhance operational efficiency.
Participate in incident management workflows and lead rapid incident response for production outages.
Collaborate with Security, Engineering, and Support teams to deliver real business outcomes.

Requirements

5+ years of production-level experience with strong proficiency in Python.
Expert-level proficiency in Terraform or Pulumi for Infrastructure as Code.
Hands-on experience with AWS, Azure, or GCP, along with Kubernetes and Docker.
Deep understanding of observability pillars and experience with tools like Datadog or Prometheus.
Proficiency in running distributed systems built on technologies like Kafka.
Advanced knowledge of GitHub Actions, including hosted and self-hosted runners.
Ability to take ownership of ambiguous projects and execute independently.