Responsibilities
- Contribute to all phases of product development, from definition to early customer support.
- Design and implement fault-remediation solutions at scale.
- Implement multi-component integrations using Graphcore and third-party technologies.
- Create reference designs including documentation and source code.
- Deploy solutions for engineering teams to aid in debugging and performance analysis.
- Maintain and improve deployed infrastructure for optimal customer service.
- Collaborate with development and QA teams to ensure solutions are tested.
- Mentor and guide junior engineers to foster continuous learning.
Requirements
- BSc or MSc in Computer Engineering or Computer Science, or equivalent experience.
- Proven experience in architecting and implementing scalable cluster management systems.
- Experience managing large-scale datacenters with a focus on hardware observability.
- Familiarity with observability stacks like Prometheus, Grafana, and Elastic Stack.
- Understanding of secure telemetry practices and data exposure controls.
- Working knowledge of Datadog, Dynatrace, or Splunk.
- Experience with large-scale telemetry datasets and actionable dashboards.
- Proficiency in automation technologies such as Ansible or Terraform.
- Experience in containerization with Docker and Kubernetes.
- Strong programming skills in C/C++/Go and Python.
- Excellent written and verbal communication skills.
Benefits
- Competitive salary and annual leave policy.
- Medical and dental health plans.
- Gym card and employee pension contributions matched up to 4%.
- Yearly review of benefits to ensure value and rewards for employees.
- Commitment to building an inclusive work environment.