Nebius

ML Infrastructure Engineer

about 2 hours ago
Remote, Worldwide
Mid Level / Senior

Responsibilities

  • Profile and analyze GPU performance at the system and kernel level.
  • Evaluate and compare GPU performance across different platforms and software stacks.
  • Debug and optimize ML workloads for efficient GPU execution.
  • Perform acceptance testing for new GPU clusters to ensure performance and compatibility.
  • Conduct experiments on GPU configurations to assess performance impacts.
  • Develop tools and dashboards to visualize performance metrics and trends.
  • Contribute to internal tooling, frameworks, and best practices.

Requirements

  • Strong understanding of the theoretical foundations of machine learning.
  • Deep knowledge of the performance characteristics of large neural networks.
  • Experience with modern deep learning frameworks like PyTorch and JAX.
  • Good understanding of the GPU stack including CUDA and relevant libraries.
  • Familiarity with containerized environments such as Docker and Kubernetes.
  • Strong communication skills and ability to work independently.

Benefits

  • Competitive compensation.
  • Career growth and learning opportunities.
  • Flexibility and work-life balance.
  • Collaborative and innovative culture.
  • Opportunity to work on impactful AI projects.
  • International environment and talented teams.

Tech Stack

AWS, Docker, Google Cloud Platform, Kubernetes, PyTorch

Categories

AI & ML, Data Engineering