HPC System Administrator
U.S., Canada, UK, and elsewhere
Position: HPC System Administrator
As an HPC System Administrator at Qvelo, you will be responsible for managing and maintaining high-performance computing (HPC) systems that power advanced simulations, data analytics, and AI-driven workloads. Your expertise in system administration, cluster management, and performance tuning will ensure the reliability, security, and efficiency of our clients’ HPC environments. You will play a key role in troubleshooting issues, optimizing system performance, and ensuring that HPC resources are effectively utilized to meet the demands of large-scale, data-intensive applications.
Key Responsibilities:
- Install, configure, and maintain HPC clusters, including compute nodes, storage systems, and high-speed networking components.
- Monitor system performance and troubleshoot issues related to hardware, software, and network connectivity, ensuring maximum uptime and availability.
- Manage job scheduling and workload management systems (e.g., SLURM, PBS, LSF) to optimize the use of HPC resources.
- Perform software upgrades, patches, and security updates across HPC environments, ensuring system stability and compliance with security protocols.
- Manage storage systems such as Lustre, GPFS, or NFS, ensuring efficient data access and management for large-scale workloads.
- Implement backup and disaster recovery strategies for critical HPC resources and data.
- Develop and maintain scripts and automation tools for routine system tasks, improving operational efficiency.
- Collaborate with HPC users to identify performance bottlenecks and recommend optimizations to improve system performance.
- Ensure network infrastructure is configured and optimized for low-latency, high-throughput communication between compute nodes.
- Document system configurations, processes, and troubleshooting procedures to ensure consistency and knowledge sharing within the team.
Requirements:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- Proven experience managing and administering HPC systems, including installation, configuration, and maintenance of clusters.
- In-depth knowledge of Linux operating systems (e.g., CentOS, Ubuntu, RHEL) and scripting (Bash, Python).
- Experience with job scheduling systems such as SLURM, PBS, or LSF.
- Knowledge of parallel file systems (Lustre, GPFS) and storage management in HPC environments.
- Experience with networking technologies such as InfiniBand, Ethernet, and TCP/IP protocols for high-performance environments.
- Strong troubleshooting and performance tuning skills.
- Excellent communication and collaboration abilities, with a focus on supporting end-users in optimizing their HPC workflows.
- Familiarity with security best practices and experience implementing system monitoring tools (e.g., Nagios, Ganglia).
Preferred Qualifications:
- Experience with cloud-based HPC or hybrid HPC environments (AWS, Azure, or Google Cloud).
- Familiarity with containerization technologies (e.g., Docker, Singularity) in an HPC context.
- Certifications in HPC systems administration or related technologies are a plus.
- Knowledge of GPU computing and experience managing GPU-accelerated HPC workloads.
Department
CTO Office
Employment Type
Contract
Location
Remote or Hybrid (depending on your flexibility)
Workplace Type
Hybrid/Remote
Compensation
Competitive, based on experience
Security Clearance
Canadian, U.S., or NATO clearance is desirable but not mandatory. Some projects will require applicants to obtain a clearance at Secret level or higher.
Why Join Us?
As an HPC System Administrator at Qvelo, you’ll play a vital role in supporting cutting-edge research, simulations, and AI workloads across a variety of industries. You’ll work alongside a team of experienced professionals in a collaborative, innovative environment where your expertise will directly contribute to the success of critical projects. We offer opportunities for continuous learning, exposure to the latest HPC technologies, and a career path that evolves with the rapidly changing world of high-performance computing.