Site Reliability Engineer (SRE)

about 2 months ago

Atlanta, GA, USA

Mid Level / Senior

Responsibilities

Participate as an embedded member of a Scrum team in planning and reviews.
Use AI-powered tools to enhance system reliability and operational efficiency.
Design, build, and operate reliable cloud infrastructure.
Apply AI-assisted analysis to monitoring and observability data.
Define and maintain SLOs, SLIs, and error budgets.
Collaborate with software engineers to embed reliability into the development lifecycle.
Lead incident response and root cause analysis efforts.
Automate operational tasks through AI-enabled and traditional methods.
Contribute to disaster recovery planning and operational readiness.
Produce and maintain documentation such as runbooks and system diagrams.

Qualifications

Passionate about building reliable, scalable systems using AI-enabled approaches.
Strong understanding of cloud-native and distributed system architectures.
Experience applying SRE principles in a production environment.
Hands-on experience with cloud platforms, preferably AWS.
Experience using AI-assisted tools for coding and operational analysis.
Strong background in Linux, networking, and system operations.
Experience with infrastructure-as-code and automation tools such as Terraform.
Familiarity with modern observability practices, including AI-enhanced analysis.
Comfortable working in an agile, cross-functional Scrum team.
Strong problem-solving, communication, and collaboration skills.
4+ years of experience in SRE, DevOps, or similar roles.
Experience supporting production systems at scale.