ML Infrastructure Engineer

about 2 months ago

Remote, Worldwide

Mid Level / Senior

Responsibilities

Profile and analyze GPU performance at the system and kernel level.
Evaluate and compare GPU performance across different platforms and software stacks.
Debug and optimize ML workloads for efficient GPU execution.
Perform acceptance testing for new GPU clusters to ensure performance and compatibility.
Conduct experiments on GPU configurations to assess performance impacts.
Develop tools and dashboards to visualize performance metrics and trends.
Contribute to internal tooling, frameworks, and best practices.

Requirements

Strong understanding of the theoretical foundations of machine learning.
Deep knowledge of the performance characteristics of large neural networks.
Experience with modern deep learning frameworks such as PyTorch and JAX.
Good understanding of the GPU stack, including CUDA and related libraries.
Familiarity with containerized environments such as Docker and Kubernetes.
Strong communication skills and ability to work independently.

Benefits

Competitive compensation.
Career growth and learning opportunities.
Flexibility and work-life balance.
Collaborative and innovative culture.
Opportunity to work on impactful AI projects.
International environment and talented teams.

Tech Stack

AWS, Docker, Google Cloud Platform, Kubernetes, PyTorch

Categories

AI & ML, Data Engineering