Fireworks AI

Member of Technical Staff, AI Training Infrastructure

Posted 2 months ago
San Mateo, CA, USA
Mid Level / Senior
H1B Sponsor

Base Salary

$175k - $220k/yr

Responsibilities

  • Design and implement scalable infrastructure for large-scale model training workloads.
  • Develop and maintain distributed training pipelines for LLMs and multimodal models.
  • Optimize training performance across multiple GPUs, nodes, and data centers.
  • Implement monitoring, logging, and debugging tools for training operations.
  • Architect and maintain data storage solutions for large-scale training datasets.
  • Automate infrastructure provisioning, scaling, and orchestration for model training.
  • Collaborate with researchers to implement and optimize training methodologies.
  • Analyze and improve efficiency, scalability, and cost-effectiveness of training systems.
  • Troubleshoot complex performance issues in distributed training environments.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience.
  • 3+ years of experience with distributed systems and ML infrastructure.
  • Experience with PyTorch.
  • Proficiency in cloud platforms (AWS, GCP, Azure).
  • Experience with containerization and orchestration (Kubernetes, Docker).
  • Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP).

Benefits

  • Total compensation includes meaningful equity in a fast-growing startup.
  • Competitive salary and comprehensive benefits package.

Tech Stack

AWS, Azure, Docker, Google Cloud Platform, Kubernetes, PyTorch

Categories

AI & ML, Data Engineering, DevOps