ML Ops Engineer (EMEA Remote)

2 months ago

Prague, Czechia +7 more

Mid Level / Senior

Responsibilities

Build and operate production-grade model serving infrastructure using frameworks like vLLM, TGI, or Triton.
Design and implement robust deployment pipelines with blue/green and canary rollout strategies for ML models.
Develop and maintain auto-scaling systems and intelligent request routing layers.
Optimize GPU utilization, memory efficiency, and network throughput.
Design observability systems for tracking inference metrics and system health.
Manage model registries and CI/CD pipelines for automated model deployments.
Own the full lifecycle of ML systems, including operational support.
Define engineering best practices in a fast-moving startup environment.

Requirements

4+ years of experience in ML Ops, Platform Engineering, or similar roles focused on ML systems.
Hands-on experience with model serving frameworks like vLLM, TGI, or Triton.
Strong background in container orchestration and operating GPU-based workloads.
Experience with ML Ops tooling, including model registries and automated deployment pipelines.
Proficiency in Python and infrastructure-as-code tools like Terraform or Helm.
Strong understanding of distributed systems and production reliability engineering.
Ability to effectively use AI coding assistants for development and debugging.
Ownership mindset with the ability to operate independently in a remote-first environment.

Why Join

Take ownership of critical infrastructure for a rapidly scaling AI-native cloud platform.
Build foundational ML inference systems from the ground up in a high-growth startup.
Work at the intersection of distributed systems, GPU computing, and sustainable cloud architecture.
Gain deep expertise in next-generation AI infrastructure and large-scale model serving systems.
Influence core engineering decisions and define best practices for scalability.