Distributed Software Engineer
Cerebras Systems
about 2 months ago
Bengaluru, India +2 more
Mid Level / Senior
H1B Sponsor
Responsibilities
- Automate bare-metal configuration of networking, OS, and application software in large clusters.
- Develop workflows for cluster upgrades, downgrades, and security patching.
- Create an orchestration and scheduler system for resource allocation and job submission.
- Support deployment and operations in both on-premises and cloud environments.
- Implement a robust system for monitoring and handling failures in clusters.
- Develop broad cluster and job monitoring and visualization capabilities.
- Create user-facing tools to monitor job status and collect metrics.
- Build administrator-facing tools to manage and operate large clusters.
Requirements
- Strong track record of software architecture, system design, and development.
- Experience in development for distributed clusters.
- Deep understanding of the Kubernetes software ecosystem, Prometheus, and Grafana.
- Proficient in Go, Python, and Bash.
- Strong debugging skills with distributed systems.
- Ability to develop tests for new features and regression tests for existing features.
Benefits
- Opportunity to build a breakthrough AI platform beyond GPU constraints.
- Ability to publish and open-source cutting-edge AI research.
- Work on one of the fastest AI supercomputers in the world.
- Enjoy job stability with startup vitality.
- Experience a simple, non-corporate work culture that respects individual beliefs.
Tech Stack
Bash, Go, Grafana, Kubernetes, Prometheus, Python
Categories
AI & ML, Data Engineering, DevOps