ML Systems Engineer, Infrastructure & Cloud

7 months ago

Cambridge, MA, USA or New York, NY, USA

Mid Level / Senior

H-1B Sponsor

Responsibilities

Own distributed training infrastructure including job launchers and monitoring systems.
Debug and resolve training failures by diagnosing issues across the stack.
Profile and optimize training performance by identifying bottlenecks.
Manage cloud infrastructure and costs, including capacity planning and storage optimization.
Implement security and compliance measures for sensitive data handling.
Build evaluation and benchmarking infrastructure for consistent model performance measurement.
Develop monitoring and alerting systems for training metrics and system health.
Maintain development environments for reproducibility of results.
Document and share knowledge through runbooks and training materials.
Collaborate with researchers to understand requirements and suggest infrastructure solutions.