Senior Systems Administrator

Job Description

HPC4Health is a Compute Canada and Compute Ontario project dedicated to providing High Performance Computing (HPC), bioinformatics, and software development support to health institutions. The project is led by SickKids Hospital and the Princess Margaret Cancer Centre, part of UHN. The HPC4Health infrastructure is located in SickKids’ PGCRL data centre (686 Bay Street) and is maintained by HPC4Health’s centralized support team (SickKids staff). HPC4Health uses cloud technologies to provide HPC services to participating partners while maintaining the highly demanding security standards required in biomedical research. This position works on the operations and development of HPC4Health’s HPC cloud environment.

Employment Type:

Temporary, Full-Time (one or two year contract with possibility of extension)

Apply at SickKids Careers.

Responsibilities

  • Support the SickKids HPC system, which comprises more than 11,000 compute threads, high-speed networks (InfiniBand, 10 GigE), and petabytes of high-performance and archive storage.
  • Provide guidance and support in all aspects of high-end computing research to a large community composed of researchers and clinicians from SickKids, other Toronto hospitals, and University of Toronto research groups.
  • Maintain and update technical documentation.
  • Manage hardware and interact with vendor support teams.
  • Manage petabyte-scale data stores and archives with leading-edge data management tools and storage systems, such as iRODS, Ceph and Isilon.

Desired Skills and Experience

Required Skills

    The successful candidate is required to have:

  • Minimum of 5-7 years of experience supporting HPC systems in a multi-user environment.
  • Experience configuring and managing HPC workload management and scheduling software suites (SGE, Moab/Torque).
  • Proficiency in UNIX operating systems: Linux (Ubuntu, Red Hat, CentOS, SUSE Linux Enterprise) and Oracle Solaris.

Additional Assets

    The qualifications and experience below are considered assets:

  • Experience supporting large storage devices (SAN/NAS) and a good understanding of file systems such as OneFS, ZFS and XFS.
  • Good understanding of high-speed Ethernet and InfiniBand networks.
  • Good understanding of common protocols such as NFS, CIFS, LDAP, DHCP, TFTP, and NTP.
  • Good understanding of using and maintaining monitoring/alerting systems.
  • Good understanding of and experience with data management at scale, including performance, backup, archiving and monitoring.
  • Working knowledge of scripting languages such as Bash, Perl and Python.
  • Experience installing, configuring and maintaining application tools and databases: Bright Computing, MySQL, PostgreSQL, Apache HTTP Server, Drupal.
  • Experience managing an OpenStack cloud.
  • Must possess excellent verbal communication skills and the ability to interact with scientific and technical audiences.
  • Must have the initiative and ability to take ownership of assigned tasks and complete them to required standards and deadlines.