
Machine Learning Engineer - ML Training Platform
Pluralis Research
2 months ago
Sydney, Australia or Melbourne, Australia
Senior
Responsibilities
- Design resource management systems for multi-cloud infrastructure using infrastructure-as-code.
- Architect fault-tolerant infrastructure for distributed machine learning.
- Build systems to simulate real-world network conditions for efficient data flow.
- Manage dynamic scaling and state synchronization across heterogeneous nodes.
- Enable continuous experimentation and large-scale model training.
Requirements
- 5+ years of experience in infrastructure and platform engineering.
- Production experience with infrastructure-as-code tools like Pulumi or Terraform.
- Deep understanding of distributed training workflows and decentralized networking.
- Strong Python engineering skills with experience in observability and SRE practices.
- Experience in a startup environment or at a large technology company.