Site Reliability Engineer

Everscale Group

As a Site Reliability Engineer, your role covers the entire lifecycle of a product, from helping developers with architecture and delivery to on-call incident response and triage. You will be responsible for both on-prem and cloud resources and should have a good understanding of cloud infrastructure fundamentals.

Responsibilities:
  • You will design and architect distributed systems in the cloud and understand how to move systems from on-prem data centers to the cloud.
  • You will create monitoring, alerting, and dashboarding solutions that improve visibility into our application performance and business metrics.
  • You will develop and troubleshoot distributed, large-scale production systems spanning on-prem and cloud-based hosting.
  • You will perform root cause analysis and post-mortems with an eye toward future prevention.
  • You will use automation technologies to ensure repeatability, eliminate toil, repair services, and reduce mean time to detection (MTTD) and mean time to resolution (MTTR).
  • You will design CI/CD pipelines.
  • You will produce documentation and support tooling for online support teams.

Qualifications:
  • Experience monitoring infrastructure and application availability against defined SLIs and SLOs.
  • Experience with virtualization, containerization, and cloud computing (AWS preferred), including VMware ecosystems, Kubernetes, or Docker.
  • Knowledge of Elasticsearch, Prometheus, Graphite, and Kafka.
  • Systems Administration experience, including an understanding of *nix.
  • Network experience, including an understanding of standard protocols/components.
  • Automation and orchestration experience, including Chef, Puppet, Terraform, Packer, or Jenkins.
  • Experience writing code in Python, Golang, and/or Java.
  • Experience working with distributed systems.

To apply for this job, email your details to jobs@everscalegroup.com.