Systems Engineer II HPC - Cold Spring Harbor Laboratory
Cold Spring Harbor, NY
About the Job
- Position Description
Join our cutting-edge team at Cold Spring Harbor Laboratory (CSHL) as a Systems Engineer II HPC, where you will play a pivotal role in advancing research through state-of-the-art high-performance computing (HPC) infrastructure. In this position, you'll leverage your expertise to support and manage CSHL's AI-driven compute cluster powered by NVIDIA H100 GPUs, empowering groundbreaking discoveries in science and innovation.
Collaborating closely with the Sr. Systems Engineer HPC, you'll translate complex user requirements into robust system designs while maintaining a proactive, responsive, and customer-focused approach. This role offers the opportunity to make a direct impact on CSHL's research community by delivering reliable systems engineering, administration, and user support, all while staying at the forefront of emerging technologies in HPC and AI.
- Position Responsibilities
Cluster Implementation and Management:
- Administration of the CSHL HPC cluster and storage system.
- Optimizes, installs, and maintains the HPC software (EasyBuild, Anaconda).
- Administration of HPC workload managers (Slurm, Grid Engine).
- Collaborates with cross-functional teams to ensure seamless integration of hardware, software, and networking components.
- Optimizes system performance, scalability, and reliability. Tunes GPU performance and firmware to improve the efficiency and scalability of decentralized AI inference tasks, as well as overall performance, processing, and utilization.
- Monitors cluster performance, identifies bottlenecks, and implements performance enhancements.
- Adheres to best practice models to improve client services including ISO 20000 practices for service and application support, problem and incident management, server technology management, identity and access management, and management of continuous improvement.
- Provides support and/or services for provisioning, installation/configuration, and maintenance of IT server systems hardware, software, and related infrastructure in alignment with organizational goals and requirements. Helps the CSHL community adhere to configuration standards.
- Participates in new initiatives such as cluster expansion and storage usage efficiency.
- Manages the full lifecycle of hardware development, from conception through deployment and maintenance.
User Support and/or Services:
- Creates and updates end-user HPC documentation.
- Works closely with scientists to optimize computational workloads, data movement, and parallel processing. Trains scientists on using the cluster effectively for AI workloads.
- Optimizes, deploys, and maintains robust software to support high-performance AI/ML computations and parallel processing.
- Collaborates with scientists and AI/ML engineers to tailor solutions that meet the specific needs of their research.
- Provides technical support, troubleshoots issues, and addresses user queries related to the cluster.
- Assists in developing best practices for AI model training and deployment.
Service Management:
- As a key member of the IT Systems Engineering team, provides efficient and effective resolution of incidents and problems with a service-centric approach, ensuring the stability and performance of CSHL services.
- Documents systems configurations, processes, and procedures to ensure reproducible, stable systems that can be efficiently supported by CSHL IT teams. Works with other CSHL IT teams to ensure knowledge transfer that results in effective resolution of problems.
- Contributes to the continual improvement of effective management of issues and incidents. Collaborates with other members of the Systems Engineering team to establish and monitor key performance indicators (KPIs) to measure systems and identify areas for improvement.
- Maintains current knowledge of key technology trends, proactively preparing to assist the community with recommendations.
- Communicates with and builds strong collaborative relationships with key stakeholders.
Vendor Management:
- Coordinates with vendors and/or other CSHL teams to aid the procurement of necessary hardware, software, and services, ensuring cost-effective solutions that align with business needs.
- Position Requirements
EDUCATION:
Bachelor's degree in information technology, computer science, or a related field (or equivalent combination of education and work experience).
EXPERIENCE:
- 2+ years of experience in GPU computing, with a focus on performance optimization and parallel programming.
- Proficiency in GPU programming languages such as CUDA.
- Strong understanding of computer architecture, memory systems and parallel algorithms.
- Experience with profiling and debugging tools for GPU applications, such as NVIDIA Nsight, desired.
- Experience in IT system administration and in IT server infrastructure operations.
- IT Systems Engineering experience, including incident, problem, and request management processes.
- Strong verbal and written communication skills, including ability to communicate, motivate, and collaborate effectively with diverse groups of people.
- Ability to troubleshoot and support/drive issues to resolution, including root cause analysis.
- Motivated, friendly, committed, and energetic self-starter, dedicated to providing high quality and responsive IT services.
- Excellent organization, documentation, time management and prioritization skills to manage multiple projects, locations, and technology needs.
- Ability to maintain problem oversight and manage multiple simultaneous project tasks, prioritizing demands across functional work areas.
- Ability to establish a practical working knowledge of CSHL business processes, interacting with key users to recommend solutions that best meet the strategic needs.
- Has a mindset of improving standards, simplifying, enhancing functionality, and/or transitioning to solutions that improve supportability.
- Supplemental Information
How to Apply:
For immediate consideration, candidates should create an account and apply to the position found here: Systems Engineer II HPC | Job Details tab | Career Pages
Position ID: 01409
Environment:
Cold Spring Harbor Laboratory is a world-renowned biomedical research institution in New York. It has shaped contemporary biomedical research and is the home of eight Nobel Prize laureates. Cold Spring Harbor Laboratory provides a highly dynamic and interactive research environment and also a unique opportunity of timely exposure to advances in various biomedical research fields and of interaction with a broad range of researchers from all over the world through its renowned Meetings and Courses program. We believe that science is for everyone. We have had researchers with a variety of backgrounds and believe in the importance of diversity, equity, and inclusion.
Compensation and Benefits:
Our employees are compensated in many ways for their contributions to our mission, including competitive pay, exceptional health benefits, retirement plans, time off, and a range of recognition and wellness programs. Visit our CSHL Benefits sites to learn more. The salary range for this role is $90,000 - $100,000. The salary range and/or hourly rate listed is a good faith determination of potential base compensation that may be offered to a successful applicant for this position at the time of this job advertisement and may be modified in the future. When determining a base salary and/or rate, several factors may be considered as applicable (e.g., years of relevant experience, education, credentials, and internal equity).
CSHL is an EO/AA Employer. All qualified applicants will receive consideration for employment and will not be discriminated against on the basis of race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability or protected veteran status. VEVRAA Federal Contractor
Minimum Salary: 90000.00
Maximum Salary: 100000.00
Salary Unit: Yearly