Remote, Worldwide or Pittsburgh, PA, USA
Mid Level / Senior
H1B Sponsor
Responsibilities
- Instrument the systems that schedule and execute large-scale batch workloads across Kubernetes clusters.
- Diagnose and triage job failures for customers.
- Collaborate with teams across the company to understand workload requirements and improve platform capabilities.
- Scale the reliability and velocity of our systems and processes through increased automation.
- Document actions to build a comprehensive library of runbooks.
- Participate in an on-call rotation to uphold the SLOs and SLAs of production services.
- Contribute to platform tooling, automation, and CI/CD workflows.
Requirements
- Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems.
- Strong experience with Kubernetes and container orchestration in production-grade environments.
- Understanding of engineering design limitations and the ability to provide guidance to teams.
- Strong experience implementing and debugging cloud-native and open-source tools such as Kubernetes, etcd, Prometheus, and OpenTelemetry.
- Strong communication skills and the ability to work effectively in a diverse and distributed team.