Site Reliability Engineer

2 months ago

Berlin, Germany +2 more

Mid Level / Senior

Responsibilities

Design and build cloud-native infrastructure platforms on-premises, focusing on Kubernetes-based solutions.
Create robust observability frameworks using Grafana, Prometheus, and distributed tracing.
Architect and implement secure, multi-tenant Kubernetes clusters with strong access controls.
Develop operators and controllers to automate infrastructure provisioning and compliance.
Build and maintain MLOps platforms for deploying and monitoring machine learning models.
Collaborate with Security teams to implement supply chain security and runtime protection.

Requirements

Experience writing automation in Python, Go, Rust, or Bash/Shell.
Deep experience operating production Kubernetes clusters and writing custom controllers/operators.
Hands-on experience with CNCF ecosystem tools such as Helm and ArgoCD, as well as container runtime security tooling.
Expert-level knowledge of observability tools such as Grafana, Prometheus, and OpenTelemetry.
Strong understanding of networking concepts, protocols, and security.
Experience with MLOps platforms like Kubeflow or MLflow.
Proficiency in Infrastructure as Code tools like Terraform and Ansible.
Deep understanding of Linux/Unix system administration and distributed systems.

Ansible, Grafana, Helm, Istio, Kubernetes, Linux, MLflow, Prometheus, Terraform