Data Center Systems Engineer - EDI Staffing
New York City, NY 10010
About the Job
The Data Center Systems Engineer deploys, operates, and maintains computational infrastructure, from supercomputers to desktop workstations, and provides resources, services, and expertise to the research community. Responsibilities include hardware monitoring and replacement for large-scale HPC clusters with high-performance interconnects such as InfiniBand and Ethernet. The engineer will work with complex systems and networks in production and research environments, actively monitor clusters to ensure high system availability, and apply standard methods and procedures to resolve systems problems of moderate scope, where analysis of situations or data requires reviewing a variety of factors.
Responsibilities:
- Support and troubleshoot cluster compute and network hardware for large Linux-based HPC systems connected via Ethernet, InfiniBand, or other networks.
- Perform basic networking tasks to interconnect servers or components of clusters for communication.
- Monitor state of HPC systems and initiate appropriate actions to maintain a high availability level of resources.
- Fulfill technology support requests from staff.
- Manage issue resolution cycle including creation and tracking of tickets, communication with vendors, and generation of summary reports.
- Work with hardware vendors to resolve hardware issues.
- Provide guidance and assistance to the SCC team on escalated requests and contact vendors for support as necessary.
- Assist with department reporting and operations.
- Keep up to date with data center technologies.
- Utilize the Bright Cluster Manager and other configuration management systems to maintain configuration state of clusters.
- Track all changes to configuration with Git or similar version control system.
- Write and execute shell scripts in support of systems management, log analysis and other system administration duties for multiple systems.
- Use DevOps tools to document configuration changes and commit them to revision control systems.
- Travel to our data center in Secaucus, NJ is required 1-2 times per week.
Qualifications:
- Experience troubleshooting and repairing HPC/Data Center hardware, including compute and storage servers and network switches.
- Knowledge of structured cabling components and concepts (patch panels, fiber, copper).
- Experience with Data Center facilities components (Rack Power Distribution Units, UPS, single- and 3-phase power, CDUs).
- Basic understanding of Unix (Linux), Ethernet, and IP networking.
- Ability to write technical documentation in a clear and concise manner.
- Understanding of system performance monitoring and actions that can be taken to improve or correct performance.
- Knowledge of the design, development and application of technology and systems to meet business needs.
- General knowledge of other areas of IT.
- Understanding of how system management actions affect users and dependent or related functions, particularly in the multi-tenant and multi-node environments characteristic of HPC systems.
- Experience using issue-tracking systems to manage and document problem resolution.
Source: EDI Staffing