Site Reliability Engineer
Databricks
12 days ago
Responsibilities
- Design and deploy production-grade infrastructure on cloud platforms using Infrastructure as Code tools.
- Optimize system performance and architecture to ensure maximum uptime and minimal latency.
- Architect robust deployment pipelines and manage hosted and self-hosted runners.
- Create infrastructure that ensures new applications have logging, metrics, and alerts enabled by default.
- Build internal AI plugins and automation scripts to enhance operational efficiency.
- Participate in incident management workflows and lead rapid incident response for production outages.
- Collaborate with Security, Engineering, and Support teams to deliver real business outcomes.
Requirements
- 5+ years of production-level experience, with strong proficiency in Python.
- Expert-level proficiency in Terraform or Pulumi for Infrastructure as Code.
- Hands-on experience with AWS, Azure, or GCP, along with Kubernetes and Docker.
- Deep understanding of the pillars of observability and experience with tools like Datadog or Prometheus.
- Proficiency in running distributed systems built on technologies like Kafka.
- Advanced knowledge of GitHub Actions and GitHub Runners.
- Ability to take ownership of ambiguous projects and execute independently.
Tech Stack
Apache Kafka, AWS, Azure, Datadog, Docker, GitHub Actions, Google Cloud Platform, Kubernetes, Prometheus, Python, Terraform