
Machine Learning Engineer - ML Training Platform
Pluralis Research
Posted about 1 month ago
Responsibilities
- Design resource management systems for multi-cloud infrastructure using infrastructure-as-code.
- Architect fault-tolerant infrastructure for distributed machine learning.
- Build systems to simulate real-world network conditions for efficient data flow.
- Manage dynamic scaling and state synchronization across heterogeneous nodes.
- Enable continuous experimentation and large-scale model training.
Requirements
- 5+ years of experience in infrastructure and platform engineering.
- Proficiency in infrastructure-as-code tools like Pulumi, Terraform, or CloudFormation.
- Deep understanding of distributed training workflows and decentralized networking.
- Strong Python programming skills with experience in observability and SRE practices.
- Experience working at a startup or a large technology company.