HPC System Administrator - JR25462-3800 - The University of Chicago
Chicago, IL
About the Job
This job was posted by https://illinoisjoblink.illinois.gov. For more
information, please see:
https://illinoisjoblink.illinois.gov/jobs/12398219
Department
Provost Research Computing Center
About the Department
The University of Chicago Research Computing Center (RCC), a unit in the
Office of Research, provides high-end research computing resources to
researchers at the University of Chicago. It is dedicated to enabling
research by providing access to centrally managed High-Performance
Computing (HPC), storage, and visualization resources. These resources
include hardware, software, high-level scientific and technical user
support, and the education and training required to help researchers
make full use of modern HPC technology and local and national
supercomputing resources. The Office of Research oversees the conduct of
sponsored research, research program development, and contract
management functions.
Job Summary
This position designs automated, scalable, and rapidly deployable
solutions for infrastructure development and server configuration. Works
independently to install, configure, and maintain operating systems. Uses
best practices and systems knowledge to manage monitoring and alerting
systems, utility software, and firewalls. Guides maintenance for
production servers, including Windows and Linux servers.
The University of Chicago Research Computing Center (RCC) is seeking a
highly qualified HPC system engineer to join its system and operation
team that builds and manages RCC HPC systems and facility operations.
The individual in this position will be involved in the management and
administration of RCC hardware and software.
Responsibilities
- Installing, configuring, and maintaining large computer
clusters/servers and software.
- Day-to-day operations of the systems including systems
administration, monitoring and storage performance up to and
including network components.
- Management of the system's network switch, parallel file system and
HPC software stack and tools.
- Configuration of the scheduling and queuing system.
- Diagnosing and resolving system operational problems quickly and
effectively. Coordinating with vendors to resolve hardware and
software problems. Assisting users with access and other help desk
ticket requests or issues.
- Building and deploying open source software and software from
vendors/partners.
- Providing reliable and efficient backups/restores for all managed
systems. Maintaining and monitoring the security of the HPC systems
and servers.
- Documenting system administration procedures for routine and complex
tasks.
- Plans and installs necessary patches and upgrades for servers and
their associated storage, network, communications, and peripheral
sub-systems. Installs and maintains an appropriate level of
intrusion detection, monitoring, and auditing software as required.
- Tracks compliance and maintains documentation for hardware,
software, and service inventories for management reports.
- Performs other related work as needed.
Minimum Qualifications
Education:
Minimum requirements include a college or university degree in a related
field.
---
Work Experience:
Minimum requirements include knowledge and skills developed through 5-7
years of work experience in a related job discipline.
---
Certifications:
---
Preferred Qualifications
Education:
- Bachelor's degree in Computer Science or closely related field.
Experience:
- A minimum of three years of Linux system administration experience
in a large distributed computing environment.
- At least two years of experience in HPC system administration or
managing large HPC clusters.
Technical Skills or Knowledge:
Knowledge of Linux.
Experience scripting with one or more languages such as Python, Shell,
Perl.
Experience with Linux build automation tools such as Puppet, Ansible,
Git, Docker highly preferred.
Experience implementing automation and monitoring using shell scripting
and other related tools strongly preferred.
Experience with installing, configuring, and maintaining job management
tools (such as SLURM, Moab, TORQUE, PBS, etc.) strongly preferred.
Experience with operating system deployment tools (e.g. xCAT, Rocks)
strongly preferred.
Experience configuring, administering, and supporting network storage
subsystems (e.g. IBM, NetApp, DataDirect Networks, LSI, etc.) strongly
preferred.
Experience with one or more distributed file systems (GPFS, Lustre,
Gluster, etc.) strongly preferred.
Experience configuring, installing, tuning and maintaining scientific
application software strongly preferred.
Experience configuring, installing, maintaining and/or using performance
monitoring and optimization tools strongly preferred.
Experienc
Source: The University of Chicago