PathAI

Staff Site Reliability Engineer

PathAI

Apply
about 2 months ago
Boston, MA, USA or Remote, Worldwide
Staff+
H1B Sponsor

Base Salary

$166k - $224k/yr

Responsibilities

  • Implement SRE best practices focusing on users, monitoring, and automation.
  • Engineer infrastructure patterns for cloud environments in AWS.
  • Design, build, and operate data centers to support the Machine Learning team.
  • Integrate on-premises datacenter environments with cloud infrastructure.
  • Improve infrastructure reliability through root-cause analysis.
  • Participate in platform on-call rotations and assist with incident response.

Requirements

  • 8+ years of relevant experience.
  • Strong automation skills using scripting and configuration management tools.
  • Experience building monitoring infrastructure with observability tools.
  • Familiarity with infrastructure as code tools like Terraform or Cloudformation.
  • Experience administering physical hardware stacks in production settings.
  • Knowledge of storage solutions optimized for high performance workloads.
  • Familiarity with modern network designs and operations across network layers.
  • Experience with virtualization, containerization, or orchestration platforms.
  • Operations experience managing critical production infrastructure.
  • Bachelor's degree in Computer Science or equivalent experience.
  • Intellectual curiosity and ability to learn quickly.
  • Willingness to travel up to 25% of the time.

Tech Stack

AnsibleAWSDatadogGoGrafanaPrometheusPythonTerraform

Categories

AI & MLData EngineeringDevOps