Senior Site Reliability Engineer - EPAM Systems
Newtown, PA 18940
About the Job
EPAM is hiring a Remote Lead Site Reliability Engineer. If you are looking for a high-impact, exciting role with a company that leads the globe in the digital transformation space, EPAM is the perfect next step in your career! As an EPAMer, you’ll have the opportunity to work with a supportive team on a variety of interesting and impactful projects for some of the largest and most recognizable brands in the world. Are you ready to advance in your career journey? Apply now!
Responsibilities
• Lead weekly operational state reviews covering performance trends, anomalies, errors and other availability events with SREs, product owners, and development teams
• Participate in quarterly business and operational reviews to align on roadmaps, development velocity, efficiency, growth trends, etc.
• Socialize SRE culture across teams within the organization to publicize the value of SRE, and mentor and train other engineers in proactive reliability decision-making and planning
• Review code instrumentation with development teams and ensure the necessary dashboards are created to monitor SLIs, SLOs, and SLAs
• Establish, test, and tune alerting for varying tiers of applications
• Document and maintain runbooks and procedures, automating as much as possible
• Plan and execute periodic disaster recovery exercises, load and scalability testing, and peak readiness reviews
• Define what it means for a service to be available and develop, monitor, and alert on SLIs/SLOs
Requirements
• 5+ years of SRE or Systems Engineering experience
• 2+ years as team lead or SRE champion
• Bachelor's degree in Computer Science or a similar technical field of study, or equivalent practical experience
• Proven experience troubleshooting, mitigating, and resolving issues in a distributed system
• Strong communication and collaboration skills for varying groups of stakeholders
• Self-motivated, with the ability to prioritize effectively among competing demands
• Experience with implementing SRE practices for services and applications deployed in production in the cloud
• Must understand most SRE concepts, including SLI/SLO/SLA, Error Budget, MTTD/MTTR/MTBF, Toil, Capacity Planning, Observability, Monitoring/Alerting, Release Engineering, and Incident Management/Blameless Post-Mortems
Benefits
• Medical, Dental and Vision Insurance (Subsidized)
• Health Savings Account
• Flexible Spending Accounts (Healthcare, Dependent Care, Commuter)
• Short-Term and Long-Term Disability (Company Provided)
• Life and AD&D Insurance (Company Provided)
• Employee Assistance Program
• Unlimited access to LinkedIn learning solutions
• Matched 401(k) Retirement Savings Plan
• Paid Time Off – the employee will be eligible to accrue 15-25 paid days, depending on specific level and tenure with EPAM (accrual eligibility may change over time)
• Paid Holidays – nine (9) total per year
• Legal Plan and Identity Theft Protection
• Accident Insurance
• Employee Discounts
• Pet Insurance
• Employee Stock Purchase Program
• If otherwise eligible, participation in the discretionary annual bonus program
• If otherwise eligible and hired into a qualifying level, participation in the discretionary Long-Term Incentive (LTI) Program
About EPAM
• EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.