Coupang

Staff, Site Reliability Engineer (Tech Infra)

about 2 hours ago
Seoul, Korea, South
Staff+
H1B Sponsor

Responsibilities

  • Serve as the primary point of contact responsible for the reliability, health, and performance of all customer-facing services.
  • Gain deep knowledge of Coupang's application workflows and dependencies.
  • Define and track key performance indicators (KPIs) and service-level objectives (SLOs).
  • Build incident management processes and automation for fast incident remediation.
  • Develop best practices for monitoring, alerting, and telemetry systems.
  • Automate disaster recovery testing and load testing.
  • Collaborate with product development teams to ensure scalable and operable designs.
  • Establish guardrails and automation for deploying production changes.
  • Participate in a 24x7 rotation for production issue escalations.
  • Communicate effectively with stakeholders at all levels of the organization.

Requirements

  • 5+ years of experience building and operating large-scale distributed systems.
  • Deep knowledge of UNIX/Linux systems and administration.
  • Programming skills in Python, Java, Golang, or Ruby.
  • Strong problem-solving and analytical skills across systems, network, and code.
  • Experience with cloud infrastructure such as AWS, Azure, or Google Cloud Platform.
  • Understanding of DevOps and SRE practices, including CI/CD and IaC.
  • Experience with containerization and orchestration technologies like Docker and Kubernetes.
  • Excellent communication and collaboration skills.
  • Knowledge of observability tools like Prometheus, Grafana, or Datadog.