Service Reliability Engineering and Automation - Diverse Lynx
O'Fallon, MO
About the Job
Job Summary: Service Reliability Engineering and Automation
Location: O'Fallon, MO
Contract Role
We are looking for a Software Engineer for Service Reliability Engineering with a strong focus on automation. The ideal candidate will have experience automating complex infrastructure, optimizing CI/CD pipelines, and incorporating AI and machine learning models to improve service reliability, incident response, and infrastructure management. You will work alongside development, operations, and AI teams to build resilient, scalable, and automated solutions.
Key Responsibilities:
- Automate Infrastructure & Operations:
- Develop and implement automation strategies to manage large-scale infrastructure (e.g., provisioning, configuration management, patch management).
- Build and maintain Infrastructure-as-Code (IaC) solutions.
- AI-Driven Monitoring & Incident Response:
- Integrate AI and machine learning models into monitoring systems to predict potential failures and optimize response times.
- Use AI tools and techniques to improve anomaly detection, system health predictions, and proactive incident resolution.
- CI/CD Pipeline Management:
- Automate CI/CD processes using tools such as Jenkins, Bitbucket Pipelines, GitLab CI, or similar.
- Incorporate AI/ML into CI/CD workflows to optimize build and test times and to improve code quality predictions.
- Collaborate with the development team to enhance and optimize deployment pipelines.
- AI-Powered Optimization:
- Utilize AI to perform predictive scaling, system optimization, and capacity planning.
- Implement self-healing capabilities through AI-based predictive analysis and automation tools.
- Monitoring & Alerting Automation:
- Automate monitoring and alerting solutions to detect anomalies, failures, and capacity issues early.
- Implement observability tools like Prometheus, Grafana, and Dynatrace for efficient system monitoring.
- Reliability & Scalability:
- Design and build self-healing, scalable systems that reduce manual intervention.
- Perform capacity planning and optimize system performance through automation.
- Incident Management & Response:
- Build automated runbooks and workflows to address incidents quickly.
- Set up automated playbooks for incident detection, troubleshooting, and remediation.
- Security & Compliance Automation:
- Implement automated security checks and audits within the CI/CD pipeline.
- Automate compliance reporting, vulnerability scanning, and patching.
- Technical Expertise:
- Hands-on experience with on-premises machines and cloud platforms such as PCF, AWS, and Azure.
- Proficiency in programming languages such as Java, Python, and Bash for scripting automation tasks.
- Strong knowledge of CI/CD tools (e.g., Jenkins, Bitbucket, GitLab) and version control systems.
- Ability to integrate machine learning models into infrastructure for automation and predictive monitoring.
- Infrastructure Automation:
- Expertise in containerization and orchestration tools (e.g., Docker, Kubernetes).
- Monitoring & Observability:
- Familiarity with monitoring tools such as Prometheus, Grafana, Dynatrace, and Splunk, and with alerting frameworks.
- Reliability Engineering:
- Experience with building and automating scalable, reliable, and self-healing systems.
- Strong troubleshooting skills.
- F5 Knowledge (good to have, not a mandatory requirement):
- Understanding of F5 BIG-IP, including LTM (Local Traffic Manager), GTM (Global Traffic Manager), and iRules scripting.
- Understanding of load balancing strategies, SSL termination, and traffic management for high availability systems.
- Collaboration & Communication:
- Excellent communication and collaboration skills to work cross-functionally with development, operations, and QA teams.
- Familiarity with Agile and DevOps practices.
- Experience with automation in large-scale distributed systems.
- Experience working with both microservices and monolithic architectures.
- Familiarity with AI/ML-driven infrastructure optimization.
- Problem-solving mindset and analytical thinking.
- Ability to thrive in a fast-paced and high-pressure environment.
- Team player with excellent collaboration skills.
Diverse Lynx LLC is an Equal Employment Opportunity employer. All qualified applicants will receive due consideration for employment without any discrimination. All applicants will be evaluated solely on the basis of their ability, competence and their proven capability to perform the functions outlined in the corresponding role. We promote and support a diverse workforce across all levels in the company.
Source: Diverse Lynx