HPC System Administrator

U.S., Canada, UK, and elsewhere

Position: HPC System Administrator

As an HPC System Administrator at Qvelo, you will be responsible for managing and maintaining high-performance computing (HPC) systems that power advanced simulations, data analytics, and AI-driven workloads. Your expertise in system administration, cluster management, and performance tuning will ensure the reliability, security, and efficiency of our clients’ HPC environments. You will play a key role in troubleshooting issues, optimizing system performance, and ensuring that HPC resources are effectively utilized to meet the demands of large-scale, data-intensive applications.

 

Key Responsibilities:

  • Install, configure, and maintain HPC clusters, including compute nodes, storage systems, and high-speed networking components.
  • Monitor system performance and troubleshoot issues related to hardware, software, and network connectivity, ensuring maximum uptime and availability.
  • Manage job scheduling and workload management systems (e.g., SLURM, PBS, LSF) to optimize the use of HPC resources.
  • Perform software upgrades, patches, and security updates across HPC environments, ensuring system stability and compliance with security protocols.
  • Manage storage systems such as Lustre, GPFS, or NFS, ensuring efficient data access and management for large-scale workloads.
  • Implement backup and disaster recovery strategies for critical HPC resources and data.
  • Develop and maintain scripts and automation tools for routine system tasks, improving operational efficiency.
  • Collaborate with HPC users to identify performance bottlenecks and recommend optimizations to improve system performance.
  • Ensure network infrastructure is configured and optimized for low-latency, high-throughput communication between compute nodes.
  • Document system configurations, processes, and troubleshooting procedures to ensure consistency and knowledge sharing within the team.

Requirements:

  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • Proven experience in managing and administering HPC systems, including installation, configuration, and maintenance of clusters.
  • In-depth knowledge of Linux operating systems (e.g., CentOS, Ubuntu, RHEL) and scripting languages (e.g., Bash, Python).
  • Experience with job scheduling systems such as SLURM, PBS, or LSF.
  • Knowledge of parallel file systems (Lustre, GPFS) and storage management in HPC environments.
  • Experience with networking technologies such as InfiniBand, Ethernet, and TCP/IP protocols for high-performance environments.
  • Strong troubleshooting and performance tuning skills.
  • Excellent communication and collaboration abilities, with a focus on supporting end-users in optimizing their HPC workflows.
  • Familiarity with security best practices and experience implementing system monitoring tools (e.g., Nagios, Ganglia).

Preferred Qualifications:

  • Experience with cloud-based HPC or hybrid HPC environments (AWS, Azure, or Google Cloud).
  • Familiarity with containerization technologies (e.g., Docker, Singularity) in an HPC context.
  • Certifications in HPC systems administration or related technologies are a plus.
  • Knowledge of GPU computing and how to manage GPU-accelerated HPC workloads.

Department
CTO Office

Employment Type
Contract

Location
Remote or Hybrid (depending on your flexibility)

Workplace type
Hybrid/Remote

Compensation
Competitive, based on experience

Security Clearance
Canadian, U.S., or NATO clearance levels are desirable, but not mandatory. Some projects will require applicants to obtain a Secret-level clearance or higher.

Why Join Us?

As an HPC System Administrator at Qvelo, you’ll play a vital role in supporting cutting-edge research, simulations, and AI workloads across a variety of industries. You’ll work alongside a team of experienced professionals in a collaborative, innovative environment where your expertise will directly contribute to the success of critical projects. We offer opportunities for continuous learning, exposure to the latest HPC technologies, and a career path that evolves with the rapidly changing world of high-performance computing.