Basis Research Institute

ML Systems Engineer, Infrastructure & Cloud

6 months ago
Cambridge, MA, USA or New York, NY, USA
Mid Level / Senior
H1B Sponsor

Responsibilities

  • Own distributed training infrastructure including job launchers and monitoring systems.
  • Debug and resolve training failures by diagnosing issues across the stack.
  • Profile and optimize training performance by identifying bottlenecks.
  • Manage cloud infrastructure and costs, including capacity planning and storage optimization.
  • Implement security and compliance measures for sensitive data handling.
  • Build evaluation and benchmarking infrastructure for consistent model performance measurement.
  • Develop monitoring and alerting systems for training metrics and system health.
  • Maintain development environments for reproducibility of results.
  • Document and share knowledge through runbooks and training materials.
  • Collaborate with researchers to understand requirements and suggest infrastructure solutions.

Requirements

  • Demonstrated expertise in ML systems engineering.
  • Deep knowledge of distributed training frameworks like PyTorch and JAX.
  • Strong cloud administration skills with AWS, GCP, or Azure.
  • Understanding of the full ML stack from hardware to evaluation pipelines.
  • Skill in debugging complex failures across the stack.
  • Commitment to documentation and knowledge sharing.
  • Ability to work autonomously while coordinating with researchers.

Tech Stack

AWS, Azure, Google Cloud Platform, Kubernetes, PyTorch, Terraform

Categories

AI & ML, Data Engineering, DevOps