Sr. Site Reliability Engineer at Sygna LLC
Pittsburgh, PA
About the Job
Job Title: Sr. Site Reliability Engineer
Contract Type: Contract to hire
Location: Hybrid (Dallas, TX / Pittsburgh, PA)
Must Have and Metrics Technical Skills:
Years of experience: 7+
- Ability to collaborate with cross-functional teams, troubleshoot effectively, and proactively identify areas for improvement in network reliability and performance
- Ansible Tower
- BigPanda
- Configuring, managing, and troubleshooting network performance and latency issues across complex, distributed systems
- Dynatrace
- Grafana
- Network performance tuning and monitoring, with a deep understanding of network protocols and network optimization techniques
- ThousandEyes
- Extensive experience in network performance tuning and monitoring
- Deep understanding of network protocols (e.g., TCP/IP, DNS, HTTP/S) and network optimization techniques.
- Proficiency with Dynatrace and BigPanda for real-time monitoring, root cause analysis, and incident response; hands-on experience with these tools is required.
- Strong background in configuring, managing, and troubleshooting network performance and latency issues across complex, distributed systems.
- Experience with additional monitoring and observability tools such as ThousandEyes and Grafana.
- Skilled in Ansible Tower for automation of network and system configurations.
- Demonstrated ability to collaborate with cross-functional teams, troubleshoot effectively, and proactively identify areas for improvement in network reliability and performance.
Flex Skills/Nice to Have:
- Proven experience in incident/problem management, with a good understanding of any of the tools used for this purpose.
- Good understanding of both UNIX and Windows operating systems
- Good understanding of web hosting technologies like Apache/Tomcat or other equivalent web/app servers.
- Good understanding of Big Data & cloud concepts.
- Good understanding of database technologies like Oracle and SQL.
- Good understanding of monitoring tools is an added advantage.
- Solid understanding of the major functionality bundled into a release, both from a technology and business point of view.
- Strong knowledge of relevant applications and development life cycles.
- Experience working with geographically distributed and culturally diverse work-groups.
- Strong desire to learn new technology.
Roles and Responsibilities:
- Monitor infrastructure, servers, middleware, databases, and batch jobs.
- Aggressively respond to service requests from business-partner-facing support teams, Operations, Risk/Control partners, etc.
- Troubleshoot environment, data control and operational issues.
- Create and maintain documentation to ensure knowledge accessibility.
- Automate and streamline processes using scripts and scheduling tools.
- Liaise with other application support teams and internal/external business and technical partners.
- Provide ad hoc and on-demand reports.
- Perform timely escalation of critical issues and proactively identify patterns of recurring issues to improve production.
- Lead problem resolution and conduct root cause analysis and establish processes that will help incident prevention.
- Participate in the Incident and Problem Management processes as a resolver accountable for root cause analysis, resolution, and reporting.
- Ensure that all production changes are processed according to Change Management policies and procedures.
- Ensure that appropriate levels of Quality Assurance have been met for all new and existing products.
- Support Sustained Resiliency, Disaster Recovery, and High Availability events.
- Help the Level 2 operations team set up monitoring and bridge gaps in the current monitoring setup.
- Play a key part in setting up reporting and be a key component in the Monitor -> Report -> Improve principle.
- Coordinate incident management to ensure appropriate coverage.
- Call facilitation, coordination and communications during critical outage situations.
- Call documentation, queue management, ticket analysis and interface to impacting lines of business for incident impact analysis via the Production Assurance process.
- End to end view of issues for objectivity.
- Influence senior technology leads across organizations to ensure timely resolution of incidents.
- Problem Management:
- Participate in RCA (root cause analysis) activities on client-impacting incidents and ensure they are executed and that action items are assigned and completed.
- Provide expertise and support during critical incidents, interfacing with all impacted groups to better manage the message.
- Chronic issue coordination and leadership.
- Guidance to all staff involved and vendors in driving a coordinated approach for results.
- Hygiene and Capacity Maintenance:
- Responsible for data quality of PLM.
- Work aggressively to make sure all servers meet company standards for uptime, patch level, etc.
- Work on capacity planning for applications, estimating and analyzing growth rates of vital infrastructure components and adding capacity proactively as required.
- Understand the application code, workflow, and business usage of the application.
- Understand the DB component of the application.
- Understand how seasonality affects critical applications.
- Document known errors and play an important role in knowledge transfer to the Level 1 team.
- Reduce escalations to Level 3 based on incremental learning about applications.
Intended length of Assignment: 4/5/2025
Reason for open position: SRE/SRC Special Projects
Potential for Contract Extension: N/A
This position is a contract with the right to hire if a need becomes available. The manager will only consider candidates who are open to converting to a full-time PNC employee. PNC will not sponsor work visas if the decision is made to hire the contingent worker: YES
Initiatives/Projects: SRE / SRC Special Projects
Industry background: Technical
Soft Skills:
- Excellent communication skills, both verbal and written, with the ability to lead/manage large conference calls.
- Comfortable providing clear problem descriptions and guidance to business users in a time-critical environment.
- Ability to be proactive with a strong bias for action, naturally inquisitive, and bias for continuous improvement of practices / processes.
- Excellent influence, negotiation and presentation skills.
- Experience working with cross-line-of-business teams, outside service providers, and partner organizations.
- Outstanding interpersonal skills and ability to establish strong relationships with all levels of management.
- Ability to work independently as a self-starter, and within a team environment.
Interview Process:
Logistics:
2-step interview
1st round with hiring manager
2nd round: panel interview with engineering managers