Senior Software Platform Engineer
PsiQuantum
6 days ago
Responsibilities
- Own AWS infrastructure end-to-end and actively shape its evolution.
- Reduce friction in the deployment pipeline for developers.
- Harden systems by securing IAM roles, container images, and authentication flows.
- Implement monitoring and alerting to catch production issues proactively.
- Make deployments faster, easier to roll back, and less prone to failure.
- Lead incident response and post-mortems as necessary.
- Abstract GPU cluster operations away from researchers and manage CUDA compatibility.
- Build standardized SLURM job submission workflows for researchers.
- Package and containerize Python simulation code for reproducibility.
- Monitor job health across utilization, cost, and runtime efficiency.
Requirements
- 5+ years of experience in Platform Engineering, DevOps, or SRE roles.
- Production AWS experience with ECS/EKS and multi-account networking.
- Proficiency in Infrastructure as Code, particularly Terraform or Pulumi/CDK.
- Experience improving CI/CD pipelines in production environments.
- Experience supporting GPU workloads in production, including code optimization and job scheduler setup.