
ML Systems Engineer, Infrastructure & Cloud
Basis Research Institute
6 months ago
Cambridge, MA, USA or New York, NY, USA
Mid Level / Senior
H1B Sponsor
Responsibilities
- Own distributed training infrastructure including job launchers and monitoring systems.
- Debug and resolve training failures by diagnosing issues across the stack.
- Profile and optimize training performance by identifying bottlenecks.
- Manage cloud infrastructure and costs, including capacity planning and storage optimization.
- Implement security and compliance measures for sensitive data handling.
- Build evaluation and benchmarking infrastructure for consistent model performance measurement.
- Develop monitoring and alerting systems for training metrics and system health.
- Maintain development environments for reproducibility of results.
- Document and share knowledge through runbooks and training materials.
- Collaborate with researchers to understand requirements and suggest infrastructure solutions.
Requirements
- Demonstrated expertise in ML systems engineering.
- Deep knowledge of distributed training frameworks like PyTorch and JAX.
- Strong cloud administration skills with AWS, GCP, or Azure.
- Understanding of the full ML stack from hardware to evaluation pipelines.
- Skilled at debugging complex failures across the stack.
- Commitment to documentation and knowledge sharing.
- Ability to work autonomously while coordinating with researchers.