Site Reliability Engineer

about 2 months ago

Remote, Worldwide

Mid Level / Senior

Responsibilities

Design and implement a cloud platform to support OXIO backend services.
Automate technical operations such as deployments, scaling, and recovery.
Monitor and maintain mission-critical production infrastructure to ensure maximum uptime.
Participate in an on-call rotation and promote a culture of continuous improvement through blameless postmortems.
Enable Engineering/Telecom/Data Engineering teams by providing operational tools.

Requirements

Understanding of Unix-like systems, primarily Linux.
Familiarity with Linux/Unix system internals like process management and networking.
Proficiency in at least one programming language (Python, Go, or Ruby) and strong scripting skills (Bash, Perl).
Experience with infrastructure provisioning tools such as Terraform or Ansible.
Familiarity with containerization (Docker) and orchestration tools (Kubernetes).
Experience with monitoring tools like Prometheus or Grafana.
Knowledge of incident management practices and experience in on-call rotations.
Hands-on experience with cloud providers (AWS, Google Cloud, Azure).
Understanding of TCP/IP, DNS, HTTP/HTTPS, load balancing, and firewalls.

Ansible, Apache Cassandra, Apache Kafka, AWS, Azure, Bash, CircleCI, Datadog, Docker, Elasticsearch, GitLab CI/CD, Go, Google Cloud, Grafana, Jenkins, Kubernetes, Linux, Perl, Prometheus, Python, Ruby, Splunk, SQL, Terraform