
Machine Learning Engineer — Training Optimization
Featherless AI
4 months ago
Remote, Worldwide
Mid Level / Senior
Responsibilities
- Optimize large-scale model training pipelines for throughput, convergence, stability, and cost.
- Improve distributed training strategies including data, model, and pipeline parallelism.
- Tune optimizers, schedulers, batch sizing, and precision settings.
- Reduce training time and compute costs through profiling and bottleneck analysis.
- Collaborate with researchers on architecture-aware training strategies.
- Build and maintain robust training infrastructure for checkpointing and fault tolerance.
- Evaluate and integrate new training techniques and own training performance metrics.
Requirements
- Strong experience training large neural networks or similarly large models.
- Hands-on experience with training optimization techniques.
- Solid understanding of backpropagation, optimization algorithms, and training dynamics.
- Experience with distributed systems for machine learning training.
- Proficiency in PyTorch is required.
- Comfortable working close to hardware constraints such as GPU compute and memory limits.
Benefits
- Real ownership at a Series-A stage company.
- Opportunity to work on cutting-edge models and training systems at scale.
- Small, highly technical team with fast feedback loops.
- Strong emphasis on engineering quality and research rigor.
- Competitive compensation with meaningful equity.
Tech Stack
PyTorch
Categories
AI & ML
Data Engineering