Description: Our client is currently seeking a Site Reliability Engineer (SRE) - Remote
- Lead the newly established SRE team supporting Discover.com and Discover’s mobile application. Discover.com gets 3.5 million and Discover Mobile gets 2.5 million logins daily!
- Champion a culture of learning, continuous improvement, and blameless retrospection within your team.
- Mentor and grow your junior engineers, and empower and unblock your senior ones.
- Partner with our Talent Acquisition team as we recruit, interview and hire the best engineering talent to join Discover’s growing SRE practice.
- Partner with Product teams and Solution Architects to help design solutions that achieve the required reliability outcomes for their services.
- Be a leader in the SRE community of practice and evolve the SRE practice for the entire organization.
- Use the core Site Reliability Engineering principles of change management, monitoring, emergency response, capacity planning, and production readiness reviews to run platforms.
- Partner with security engineers to develop plans and automation that respond aggressively and safely to new risks and vulnerabilities.
• Well versed with the entire software development lifecycle, DevOps, and SRE practices
• Expertise and operational experience at scale - designing and operating highly available, scalable and fault-tolerant systems using container platforms
• Experience with operational monitoring tools (AppDynamics, New Relic, Instana, Catchpoint) with a mindset towards predictive analysis
• Experience with Splunk or ELK Stack, Grafana, Datadog, or Sysdig
• Working knowledge of automation tools such as Ansible, Terraform, or Chef
• Experience with Pivotal Cloud Foundry (PCF), OpenShift (OCP), Amazon Web Services (AWS), and Google Cloud Platform (GCP)
- Good understanding of networking, including L2 and L3 concepts, firewalls, load balancing, routing, and switching.
- A working knowledge of Linux-based systems and virtual machine (VM) technology
- Strong scripting skills, including the ability to write scripts from scratch in Python and/or Bash
- Basic knowledge and understanding of Security (CIA Model and PCI compliance) is a plus
- Experience with Continuous Integration and Continuous Delivery models including Blue/Green and Canary release models is a plus
• You have 5+ years of SRE experience in a highly customer-focused environment.
• You have 3+ years of experience successfully managing a team of engineers on large-scale projects that included technical deep-dives and production troubleshooting in the areas of distributed systems, programming, configuration management, networking, storage, and operating systems.
• You possess strong leadership skills and the ability to motivate teams.
• You bring a strong perspective and a collaborative partnership that drives change and motivates engineers to develop simple solutions to complex operational or reliability challenges.
• You have experience formulating a team's technical strategy and roadmap, and you've collaborated and partnered effectively with several other teams.
• You are capable of leading a discussion with upper management, and are able to tailor the level of technical detail to suit your audience.
B.S. in Computer Science or equivalent experience