GrepJob

ML Infrastructure Engineer

about 2 months ago
Vancouver, Canada
Mid Level / Senior

Responsibilities

  • Define and own the long-term ML infrastructure roadmap.
  • Establish best practices for model lifecycle management and deployment standards.
  • Identify infrastructure gaps and design scalable solutions.
  • Design, build, and maintain production-grade model deployment systems.
  • Automate end-to-end ML lifecycle workflows.
  • Implement robust monitoring systems for model performance and infrastructure health.
  • Operate across AWS and GCP environments to manage ML workloads.
  • Develop and maintain infrastructure-as-code for secure cloud environments.
  • Implement and optimize CI/CD workflows for ML automation.
  • Collaborate with cross-functional teams to support end-to-end ML workflows.
  • Stay current on emerging ML Ops practices and tools.

Requirements

  • 4+ years of experience in ML Ops, ML infrastructure, or backend engineering.
  • Experience in cloud-native environments (AWS and/or GCP).
  • Proven track record designing and implementing CI/CD pipelines for ML systems.
  • Strong experience with Amazon SageMaker, Docker, and Flask-based APIs.
  • Hands-on experience with ML lifecycle tooling such as MLflow or SageMaker Studio.
  • Experience managing container orchestration platforms like Kubernetes.
  • Strong programming experience in Python; additional languages like Go or Java are a plus.
  • Experience with infrastructure-as-code tools such as Terraform or CloudFormation.
  • Familiarity with observability tools like CloudWatch or Prometheus.
  • Experience managing GPU-based workloads and scaling ML systems.
  • Familiarity with data infrastructure tools like BigQuery.
  • Bonus: Experience with LLMs, generative AI pipelines, or ML governance frameworks.

Tech Stack

AWS, Datadog, Docker, Flask, GitHub Actions, GitLab CI/CD, Go, Google BigQuery, Google Cloud Platform, Grafana, Java, Kubernetes, MLflow, Prometheus, Python, Scala, Terraform

Categories

AI & ML, Data Engineering, DevOps