Pluralis Research

Machine Learning Engineer - ML Training Platform

2 months ago
Sydney, Australia or Melbourne, Australia
Senior

Responsibilities

  • Design resource management systems for multi-cloud infrastructure using infrastructure-as-code.
  • Architect fault-tolerant infrastructure for distributed machine learning.
  • Build systems to simulate real-world network conditions for efficient data flow.
  • Manage dynamic scaling and state synchronization across heterogeneous nodes.
  • Enable continuous experimentation and large-scale model training.

Requirements

  • 5+ years of experience in infrastructure and platform engineering.
  • Production experience with infrastructure-as-code tools like Pulumi or Terraform.
  • Deep understanding of distributed training workflows and decentralized networking.
  • Strong Python engineering skills with experience in observability and SRE practices.
  • Experience in a startup environment or at a large technology company.

Tech Stack

AWS, Azure, Docker, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Terraform

Categories

AI & ML, Data Engineering, DevOps