
Member of Technical Staff, AI Training Infrastructure
Fireworks AI
2 months ago
San Mateo, CA, USA
Mid Level / Senior
H1B Sponsor
Base Salary
$175k - $220k/yr
Responsibilities
- Design and implement scalable infrastructure for large-scale model training workloads.
- Develop and maintain distributed training pipelines for LLMs and multimodal models.
- Optimize training performance across multiple GPUs, nodes, and data centers.
- Implement monitoring, logging, and debugging tools for training operations.
- Architect and maintain data storage solutions for large-scale training datasets.
- Automate infrastructure provisioning, scaling, and orchestration for model training.
- Collaborate with researchers to implement and optimize training methodologies.
- Analyze and improve efficiency, scalability, and cost-effectiveness of training systems.
- Troubleshoot complex performance issues in distributed training environments.
Requirements
- Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience.
- 3+ years of experience with distributed systems and ML infrastructure.
- Experience with PyTorch.
- Proficiency in cloud platforms (AWS, GCP, Azure).
- Experience with containerization and orchestration (Kubernetes, Docker).
- Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP).
Benefits
- Total compensation includes meaningful equity in a fast-growing startup.
- Competitive salary and comprehensive benefits package.