Staff, Site Reliability Engineer (Tech Infra)

Seoul, South Korea

Staff+

H1B Sponsor

Responsibilities

Serve as the primary point of contact for the reliability, health, and performance of all customer-facing services.
Gain deep knowledge of Coupang's application workflows and dependencies.
Define and track key performance indicators (KPIs) and service-level objectives (SLOs).
Build incident management processes and automation for fast incident remediation.
Develop best practices for monitoring, alerting, and telemetry systems.
Automate disaster recovery testing and load testing.
Collaborate with product development teams to ensure scalable and operable designs.
Establish guardrails and automation for deploying production changes.
Participate in a 24x7 rotation for production issue escalations.
Communicate effectively with stakeholders at all levels of the organization.

Qualifications

5+ years of experience building and operating large-scale distributed systems.
Deep knowledge of UNIX/Linux systems and administration.
Programming skills in Python, Java, Golang, or Ruby.
Strong problem-solving and analytical skills across systems, network, and code.
Experience with cloud infrastructure such as AWS, Azure, or Google Cloud Platform.
Understanding of DevOps and SRE practices, including CI/CD and IaC.
Experience with containerization and orchestration technologies like Docker and Kubernetes.
Excellent communication and collaboration skills.
Knowledge of observability tools like Prometheus, Grafana, or Datadog.