Staff Site Reliability Engineer
PathAI
about 2 months ago
Boston, MA, USA or Remote, Worldwide
Staff+
H1B Sponsor
Base Salary
$166k - $224k/yr
Responsibilities
- Implement SRE best practices focusing on users, monitoring, and automation.
- Engineer infrastructure patterns for cloud environments in AWS.
- Design, build, and operate data centers to support the Machine Learning team.
- Integrate on-premises data center environments with cloud infrastructure.
- Improve infrastructure reliability through root-cause analysis.
- Participate in platform on-call rotations and assist with incident response.
Requirements
- 8+ years of relevant experience.
- Strong automation skills using scripting and configuration management tools.
- Experience building monitoring infrastructure with observability tools.
- Familiarity with infrastructure-as-code tools like Terraform or CloudFormation.
- Experience administering physical hardware stacks in production settings.
- Knowledge of storage solutions optimized for high-performance workloads.
- Familiarity with modern network designs and operations across network layers.
- Experience with virtualization, containerization, or orchestration platforms.
- Operations experience managing critical production infrastructure.
- Bachelor's degree in Computer Science or equivalent experience.
- Intellectual curiosity and ability to learn quickly.
- Willingness to travel up to 25% of the time.
Tech Stack
Ansible, AWS, Datadog, Go, Grafana, Prometheus, Python, Terraform
Categories
AI & ML, Data Engineering, DevOps