Databricks

Site Reliability Engineer

12 days ago
San José, Costa Rica
Senior
H-1B Sponsor

Responsibilities

  • Design and deploy production-grade infrastructure on cloud platforms using Infrastructure as Code tools.
  • Optimize system performance and architecture to ensure maximum uptime and minimal latency.
  • Architect robust deployment pipelines and manage hosted and self-hosted runners.
  • Create infrastructure that ensures new applications have logging, metrics, and alerts enabled by default.
  • Build internal AI plugins and automation scripts to enhance operational efficiency.
  • Participate in incident management workflows and lead rapid incident response for production outages.
  • Collaborate with Security, Engineering, and Support teams to deliver real business outcomes.

Requirements

  • 5+ years of production-level experience with strong proficiency in Python.
  • Expert-level proficiency in Terraform or Pulumi for Infrastructure as Code.
  • Hands-on experience with AWS, Azure, or GCP, along with Kubernetes and Docker.
  • Deep understanding of the pillars of observability (logs, metrics, and traces) and experience with tools like Datadog or Prometheus.
  • Proficiency in running distributed systems built on technologies like Kafka.
  • Advanced knowledge of GitHub Actions and GitHub Runners.
  • Ability to take ownership of ambiguous projects and execute independently.

Categories

AI & ML, Data Engineering, DevOps