Helsing

Site Reliability Engineer

19 days ago
Berlin, Germany +2 more
Mid Level / Senior

Responsibilities

  • Design and build cloud-native infrastructure platforms on-premises, focusing on Kubernetes-based solutions.
  • Create robust observability frameworks using Grafana, Prometheus, and distributed tracing.
  • Architect and implement secure, multi-tenant Kubernetes clusters with strong access controls.
  • Develop operators and controllers to automate infrastructure provisioning and compliance.
  • Build and maintain MLOps platforms for deploying and monitoring machine learning models.
  • Collaborate with Security teams to implement supply chain security and runtime protection.

Requirements

  • Experience in scripting with Python, Go, Rust, or Bash/Shell for automation.
  • Deep experience operating production Kubernetes clusters and writing custom controllers/operators.
  • Hands-on experience with CNCF ecosystem tools like Helm, ArgoCD, and container runtime security tools.
  • Expert-level knowledge of observability tools such as Grafana, Prometheus, and OpenTelemetry.
  • Strong understanding of networking concepts, protocols, and security.
  • Experience with MLOps platforms like Kubeflow or MLflow.
  • Proficiency in Infrastructure as Code tools like Terraform and Ansible.
  • Deep understanding of Linux/Unix system administration and distributed systems.

Benefits

  • Focus on outcomes rather than time-tracking.
  • Competitive compensation and VSOP options.
  • Relocation support.
  • Social and education allowances.
  • Regular company events and all-hands meetings.
  • Hands-on onboarding program to learn the tech stack and company processes.

Tech Stack

Ansible, Grafana, Helm, Istio, Kubernetes, Linux, MLflow, Prometheus, Terraform