Site Reliability Engineer at MTW Recruit
Bloomington, MN 55437
About the Job
This is a hybrid role: 3 days on site in Bloomington and 2 days remote.
The Senior Site Reliability Engineer leads best practices in system reliability, performance optimization, and observability, particularly for data-intensive and machine learning (ML) workloads. The position emphasizes building frameworks for scalability, defining and implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs), and collaborating across teams to improve system reliability and performance.
Key Points
Responsibilities
- Build frameworks for flagship product reliability and scalability.
- Design and implement observability frameworks focused on data and ML pipelines.
- Define and establish SLOs and SLIs aligned with business goals.
- Translate performance requirements into actionable strategies.
- Identify and resolve system performance bottlenecks.
- Design and execute performance regression test suites for data and ML workloads.
- Own and monitor reliability and performance metrics to ensure system excellence.
- Use tools like Datadog, Jira, and GitHub for performance monitoring and project management.
- Collaborate with subject matter experts on performance challenges in data and ML domains.
- Establish and track success metrics for reliability and performance targets.
- Promote continuous improvement of performance engineering practices.
Experience
- Minimum 5 years in site reliability engineering, with a focus on cloud-native environments.
- Expertise in creating SLOs/SLIs and observability frameworks for complex systems.
- Proficiency with AWS and scalable, reliable architecture design.
- Hands-on experience with tools like Datadog for performance monitoring.
- Experience with Git/GitHub and infrastructure-as-code tools (e.g., Terraform).
Preferred Skills
- Proficiency in Java programming, including REST, Spring, and microservices.
- Strong understanding of RDBMS schema design and index optimization.
Salary
160,000 - 170,000/year