fal

Software Engineer, Site Reliability

9 days ago
Istanbul, Turkey
Senior

Responsibilities

  • Own and operate Kubernetes infrastructure including lifecycle and upgrades.
  • Build and maintain CI/CD pipelines and deployment infrastructure.
  • Automate analysis and resolution of production issues using AI.
  • Create dashboards, alerting, and anomaly detection systems.
  • Define SLOs and develop incident response processes.
  • Manage networking, load balancing, and service mesh configurations.
  • Drive reliability improvements through automation and chaos engineering.

Requirements

  • 5+ years of experience managing critical production systems.
  • Strong experience with Kubernetes at scale and infrastructure-as-code tools.
  • Deep knowledge of Linux and container networking.
  • Experience building CI/CD systems and GitOps workflows.
  • Proficiency in Python and either Go or Bash.
  • Strong experience with logging, monitoring, and alerting tools.
  • Excellent communication skills and ability to drive technical decisions.
  • Self-starter with a focus on ownership and continuous improvement.

Benefits

  • Interesting and challenging work.
  • Opportunities for learning and growth.
  • Regular team events and offsites.

Tech Stack

Ansible, Bash, Datadog, Go, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform