Site Reliability Engineer (SRE) Lead - Diverse Linx
Remote, MI
About the Job
Site Reliability Engineer (SRE) Lead - Public Sector Core Framework team
Remote
About the Role
The client is seeking a Lead Software Engineer to join its Public Sector Core Framework platform team and serve as a Site Reliability Engineer (SRE) within its Azure/Kubernetes ecosystem. In this role, you will be responsible for the stability, scalability, and performance of the platform, contributing directly to the client's continued success.
Key Responsibilities
- Champion SRE Practices: Lead the team in strengthening SRE practices, including defining service level indicators (SLIs), objectives (SLOs), error budgets, thresholds, alerting, and error management systems.
- Site Planning and Optimization: Collaborate with development and testing teams to plan changes for production and other environments. Optimize planned outages, streamlining DevOps activities and minimizing downtime.
- Toil Reduction: Identify repetitive tasks (toil) and develop solutions to improve efficiency and reduce manual workload.
- Automation Advocacy: Leverage automation wherever possible to enhance stability, functionality, and overall platform management.
- Alert Management: Strengthen alerting systems by establishing goals, criteria, and processes for alert recalls, resets, enabling/disabling alerts, and revising error budgets based on team toil.
- Outage Prevention and Response: Proactively address non-critical alerts and collaborate with development and testing teams to prevent outages.
- Performance Verification: Work closely with Load and Performance teams to redefine parameters like load and concurrent user capacity.
- Incident Management: Lead and facilitate meetings with development and operations teams during incidents to ensure effective resolution.
- Post-Incident Reviews: Lead post-incident reviews with teams to perform root cause analysis (RCA), develop long-term solutions (code changes, configuration adjustments, architectural modifications, or capacity planning), and apply the lessons learned to prevent future issues.
- Reliability Reporting: Generate reports using defined reliability metrics, including availability, Mean Time to Restore (MTTR), Mean Time Between Repairs (MTBR), and Probability of Failure.
- Continuous Improvement: Develop and maintain a backlog of opportunities for SRE improvements.
- Security Clearance: With company sponsorship, obtain and maintain a U.S. Federal Government "Public Trust" suitability clearance (required).
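For candidates less familiar with the terms above, the following is a minimal Python sketch of how an error budget and two of the reporting metrics (availability and MTTR) might be computed. The `Incident` record, the function names, the 99.9% SLO, and the sample incident durations are all illustrative assumptions, not details of the client's platform or tooling.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical outage record: time taken to restore service, in minutes."""
    minutes_to_restore: float


def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """Allowed downtime in the window for an availability SLO (e.g. 0.999)."""
    return (1.0 - slo) * window_minutes


def availability(incidents: list[Incident], window_minutes: float) -> float:
    """Fraction of the window during which the service was up."""
    downtime = sum(i.minutes_to_restore for i in incidents)
    return 1.0 - downtime / window_minutes


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean Time to Restore: average restore time across incidents."""
    if not incidents:
        return 0.0
    return sum(i.minutes_to_restore for i in incidents) / len(incidents)


if __name__ == "__main__":
    window = 30 * 24 * 60  # assumed 30-day reporting window, in minutes
    incidents = [Incident(12.0), Incident(18.0)]  # made-up sample data
    budget = error_budget_minutes(0.999, window)
    used = sum(i.minutes_to_restore for i in incidents)
    print(f"error budget: {budget:.1f} min, used: {used:.1f} min")
    print(f"availability: {availability(incidents, window):.5f}")
    print(f"MTTR: {mttr_minutes(incidents):.1f} min")
```

In practice these figures would typically be derived from monitoring data rather than hand-entered records; the sketch only shows the arithmetic behind the terms.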
Requirements
- Proven experience and expertise within the Site Reliability Engineering (SRE) discipline.
- In-depth knowledge and experience administering Azure systems.
- Proficiency with Kubernetes systems and familiarity with Podman/Docker and Helm Charts.
- Strong programming skills in Python.
- Experience using GitHub for version control.
- Understanding of resiliency and reliability design patterns.
Bonus Points (a strong plus)
- Experience with Prometheus, AKS Monitoring, Grafana, and automation tools.
Benefits:
- Opportunity to work with cutting-edge technologies
- Work in a collaborative and fast-paced environment
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
Source: Diverse Linx