xAI

Member of Technical Staff

13 days ago
Memphis, TN, USA
Mid Level / Senior
H1B Sponsor

Responsibilities

  • Design, develop, and deploy scalable code and services to automate reliability workflows.
  • Implement and maintain observability tools and practices for real-time system health insights.
  • Collaborate with cross-functional teams to identify reliability bottlenecks and automate solutions to them.
  • Troubleshoot and resolve complex issues in data center environments.
  • Optimize Linux-based systems for performance, security, and reliability.
  • Understand network topologies in multi-data-center environments for effective troubleshooting.
  • Participate in on-call rotations and post-incident reviews to enhance site reliability.
  • Mentor junior team members and document processes for knowledge sharing.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field.
  • 5+ years of hands-on experience in site reliability engineering, infrastructure engineering, or DevOps.
  • Strong programming skills in Python, plus familiarity with Rust or a willingness to learn it.
  • Solid experience with Linux systems administration and performance tuning.
  • Practical knowledge of containerization and orchestration technologies like Docker and Kubernetes.
  • Experience implementing observability solutions and troubleshooting complex issues.
  • Understanding of networking fundamentals in large-scale environments.
  • Experience with on-call rotations and incident response practices.
  • Ability to collaborate effectively with cross-functional teams.

Tech Stack

Docker, Grafana, Kubernetes, Linux, Prometheus, Python, Rust

Categories

AI & ML, Data Engineering, DevOps