Pluralis Research

Machine Learning Engineer - ML Training Platform

Posted about 1 month ago

Responsibilities

  • Design resource management systems for multi-cloud infrastructure using infrastructure-as-code.
  • Architect fault-tolerant infrastructure for distributed machine learning.
  • Build systems to simulate real-world network conditions for efficient data flow.
  • Manage dynamic scaling and state synchronization across heterogeneous nodes.
  • Enable continuous experimentation and large-scale model training.

Requirements

  • 5+ years of experience in infrastructure and platform engineering.
  • Proficiency in infrastructure-as-code tools like Pulumi, Terraform, or CloudFormation.
  • Deep understanding of distributed training workflows and decentralized networking.
  • Strong Python programming skills with experience in observability and SRE practices.
  • Experience in a startup environment or at a large technology company.

Tech Stack

AWS, Azure, Docker, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Terraform

Categories

AI & ML, Data Engineering, DevOps