Cerebras Systems

Distributed Software Engineer

Posted about 2 months ago
Bengaluru, India +2 more
Mid Level / Senior
H1B Sponsor

Responsibilities

  • Automate bare-metal configuration of networking, OS, and application software in large clusters.
  • Develop workflows for cluster upgrades, downgrades, and security patching.
  • Create an orchestration and scheduler system for resource allocation and job submission.
  • Support deployment and operations in both on-premises and cloud modes.
  • Implement a robust system for monitoring and handling failures in clusters.
  • Develop broad cluster and job monitoring and visualization capabilities.
  • Create user-facing tools to monitor job status and collect metrics.
  • Build administrator-facing tools to manage and operate large clusters.

Requirements

  • Strong track record of software architecture, system design, and development.
  • Experience developing software for distributed clusters.
  • Deep understanding of the Kubernetes software ecosystem, Prometheus, and Grafana.
  • Proficient in Go, Python, and Bash.
  • Strong debugging skills with distributed systems.
  • Ability to develop tests for new features and regression tests for existing features.

Benefits

  • Opportunity to build a breakthrough AI platform beyond GPU constraints.
  • Ability to publish and open source cutting-edge AI research.
  • Work on one of the fastest AI supercomputers in the world.
  • Enjoy job stability with startup vitality.
  • Experience a simple, non-corporate work culture that respects individual beliefs.

Tech Stack

Bash, Go, Grafana, Kubernetes, Prometheus, Python

Categories

AI & ML, Data Engineering, DevOps