As a Site Reliability Engineer, your role covers the entire lifecycle of a product, from helping developers with architecture and delivery to on-call incident response and triage. You will be responsible for on-prem and cloud resources and should have a good understanding of cloud infrastructure fundamentals.
- You will design distributed systems in the cloud and understand how to migrate systems from on-prem data centers to the cloud.
- You will create monitoring, alerting, and dashboarding solutions that improve visibility into EA’s application performance and business metrics.
- You will develop and troubleshoot distributed, large-scale production systems spanning on-prem and cloud-based hosting.
- You will perform root cause analysis and post-mortems with an eye toward future prevention.
- You will use automation technologies to ensure repeatability, eliminate toil, repair services, and reduce mean time to detection (MTTD) and mean time to resolution (MTTR).
- You will design CI/CD pipelines.
- You will produce documentation and support tooling for online support teams.
- Experience monitoring infrastructure and application availability against service level indicators (SLIs) to ensure service level objectives (SLOs) are met.
- Experience with virtualization, containerization, and cloud computing (AWS preferred), including VMware ecosystems, Kubernetes, or Docker.
- Knowledge of Elasticsearch, Prometheus, Graphite, and Kafka.
- Systems administration experience, including an understanding of *nix.
- Network experience, including an understanding of standard protocols/components.
- Automation and orchestration experience, including Chef, Puppet, Terraform, Packer, or Jenkins.
- Experience writing code in Python, Golang, and/or Java.
- Experience working with distributed systems.
To apply for this job, email your details to email@example.com