Staff/Sr Software Engineer, Compute Capacity
Anthropic
6 days ago
New York, NY, USA or San Francisco, CA, USA
Senior / Staff+
H1B Sponsor
Base Salary
$405k - $485k/yr
Responsibilities
- Build and operate data pipelines that ingest accelerator occupancy, utilization, and cost data from multiple cloud providers into BigQuery.
- Develop and maintain observability infrastructure, including Prometheus recording rules and Grafana dashboards.
- Instrument and analyze compute efficiency metrics across training, inference, and eval workloads.
- Build internal tooling and platforms for capacity planning and workload attribution.
- Operate Kubernetes-native systems at scale, managing workload labeling infrastructure.
- Normalize and reconcile data across heterogeneous sources, including AWS, GCP, and Azure.
- Collaborate with teams across the organization to gather requirements and communicate trade-offs.
Requirements
- 5+ years of software engineering experience with a strong track record in production systems.
- Operational fluency in Kubernetes, including experience with scheduling and with debugging cluster-level issues.
- Experience in designing and building production data pipelines, preferably with BigQuery.
- Familiarity with observability tooling such as Prometheus and Grafana.
- Production-quality proficiency in Python and SQL.
- Familiarity with at least one major cloud provider (AWS, GCP, or Azure) at the infrastructure level.
- Strong cross-team communication skills and ability to navigate ambiguity.
Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- A collaborative office space.
Tech Stack
AWS, Azure, ClickHouse, Google BigQuery, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Rust, SQL, Terraform
Categories
AI & ML, Data Engineering, DevOps