Integrating external PBS Master to CycleCloud (Cloud Bursting scenario)

This post has been republished via RSS; it originally appeared at Microsoft Tech Community - Latest Blogs.

Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High-Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale.

In this blog post, we discuss how to integrate an external PBS master node so that it can send jobs to CycleCloud for cloud bursting (sending on-premises workloads to the cloud for processing) or hybrid HPC scenarios. For demonstration purposes, I am creating the PBS master node in Azure as an external PBS server in one VNET, while the execute nodes run in CycleCloud in a separate VNET. We are not discussing the networking complexities involved in hybrid scenarios.

Architecture:

vinilv_1-1666874009491.png

Environment:

  1. External Master node (Standard D8s v4)
  2. Compute nodes on CycleCloud 8.2
  3. PBS Pro Scheduler
  4. CentOS 7 operating system (OpenLogic CentOS-HPC 7.7)
  5. cyclecloud-pbspro project version 2.0.9 (the latest version can also be used)
  6. Two subnets created for the deployment (hpc and default). I selected the default subnet for CycleCloud and the hpc subnet for the master node.

Preparing master node:

In this example, I am using the OpenLogic.CentOS-HPC-7.7 image on a Standard D8s v4 VM as the master (head) node for the PBS scheduler.

vinilv_0-1666872743293.png

Install the prerequisites for configuring the NFS server and installing the PBSPro scheduler.

 

 

 

yum install python3 nfs-utils -y

 

 

Create a shared directory to hold the centralized home directories.

 

 

mkdir /shared
echo "/shared *(rw,sync,no_root_squash)" >> /etc/exports
systemctl start nfs-server
systemctl enable nfs-server
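
The export above uses `*`, which allows any host that can reach the server to mount /shared. As a minimal sketch, assuming you would rather restrict the export to the CycleCloud compute subnet, the helper below appends an entry only if the directory is not exported yet. The function name `add_export` and the subnet 10.0.0.0/24 are illustrative assumptions, not part of the original setup.

```shell
#!/usr/bin/env bash
# Append an export entry only if the directory is not already exported.
# The exports file path is a parameter so the function can be tried on a
# scratch file before touching /etc/exports.
add_export() {
  local exports_file=$1 dir=$2 client=$3
  touch "$exports_file"
  if ! grep -Eq "^${dir}[[:space:]]" "$exports_file"; then
    printf '%s %s(rw,sync,no_root_squash)\n' "$dir" "$client" >> "$exports_file"
  fi
}

# Example (as root; 10.0.0.0/24 is a placeholder for the compute subnet):
#   add_export /etc/exports /shared 10.0.0.0/24 && exportfs -ra
```

Running `exportfs -ra` afterwards re-reads /etc/exports without restarting the NFS server.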

 

 

Check the NFS server status:

 

 

[root@hnpbs ~]# showmount -e
Export list for hnpbs:
/shared *

 

 

Download the following packages from GitHub. I am using version 2.0.9 here; if you have a different project version, download the packages from the matching release.

 

 

wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/hwloc-libs-1.11.9-3.el8.x86_64.rpm
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/pbspro-debuginfo-18.1.4-0.x86_64.rpm
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/pbspro-server-18.1.4-0.x86_64.rpm

 

 

Install the PBSPro package on the master node.

 

 

yum localinstall *.rpm

 

 

Check the PBS configuration file.

 

 

[root@hnpbs ~]# cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_SERVER=hnpbs
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
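
In the configuration shown above, the server, scheduler and comm daemons are enabled and the MoM is disabled, which fits a head node that does not run jobs itself. A small sketch for checking this from a script follows; the function names `pbs_conf_get` and `check_headnode_conf` are hypothetical helpers, and the file path is a parameter so the check can be tried on a copy.

```shell
#!/usr/bin/env bash
# Print the value of KEY from a pbs.conf-style KEY=value file.
pbs_conf_get() {
  local file=$1 key=$2
  sed -n "s/^${key}=//p" "$file" | tail -n 1
}

# Succeed only if the server is enabled and the MoM is disabled,
# matching the head node configuration shown in this post.
check_headnode_conf() {
  local file=$1
  [ "$(pbs_conf_get "$file" PBS_START_SERVER)" = "1" ] &&
    [ "$(pbs_conf_get "$file" PBS_START_MOM)" = "0" ]
}

# Example:
#   check_headnode_conf /etc/pbs.conf && echo "head node settings look as expected"
```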

 

 

Start the PBS service and enable it at boot.

 

 

systemctl start pbs
systemctl enable pbs

 

 

Preparing CycleCloud environment:

Create the headless template from the stock template. First, clone the project and change to its templates directory.

 

 

git clone https://github.com/Azure/cyclecloud-pbspro.git
cd cyclecloud-pbspro/templates

 

 

Remove the following sections to make a headless template.

[[node server]]

[[nodearray login]]

[[[parameter serverMachineType]]]

[[[parameter SchedulerImageName]]]

[[[parameter NumberLoginNodes]]]

Update the following settings to match your environment. I am using CentOS 7 and PBSPro version 18.

 

 


[[[configuration cyclecloud.mounts.nfs_sched]]]
        type = nfs
        mountpoint = /sched
        disabled = true
#IMPORTANT: update the master node hostname
<--............-->
  [[nodearray execute]]
    MachineType = $ExecuteMachineType
    MaxCoreCount = $MaxExecuteCoreCount
    Interruptible = $UseLowPrio
    AdditionalClusterInitSpecs = $ExecuteClusterInitSpecs

        [[[configuration]]]
        autoscale.enabled = $Autoscale
        pbspro.scheduler = hnpbs

 

 

The following changes are optional: you can also select the PBS version and OS version in the CycleCloud portal.

 

 

<--............-->

        [[[parameter ImageName]]]
        Label = Compute OS
        ParameterType = Cloud.Image
        Config.OS = linux
        DefaultValue = cycle.image.centos7
        Config.Filter := Package in {"cycle.image.centos7", "cycle.image.centos8"}

        [[[parameter PBSVersion]]]
        Label = PBS Version
        Config.Plugin = pico.form.Dropdown
        Config.Entries := {[Label="OpenPBS v20, el8-only"; Value="20.0.1-0"], [Label="PBSPro v18, el7-only"; Value="18.1.4-0"]}
        DefaultValue = 18.1.4-0

 

 

Reference template: https://github.com/vinilvadakkepurakkal/cyclecloud-pbsproheadless/blob/main/openpbs.txt  

Import the custom template into CycleCloud.

 

 

cyclecloud import_template -f openpbs.txt

 

 

You can now see a new template named “OpenPBS-headless” in the CycleCloud portal.

vinilv_0-1666873532554.png

Create a cluster using the following parameters.

a. Select the network: use a different subnet from the one the master node is in.

vinilv_2-1666874112672.png

b. Add the NFS server IP address (the master node's IP).

vinilv_3-1666874157538.png

c. Disable the return proxy, and select CentOS 7 and PBSPro v18 in the software selection.

vinilv_0-1666874303676.png

d. Add the following cloud-init script for master node name resolution (change the IP and hostname based on your setup). Save and start the cluster.

vinilv_1-1666874348362.png

e. Once the cluster has started, add a node to the cluster. This is required to set up hostname resolution from the master node to the CycleCloud compute nodes.

vinilv_2-1666874399979.png

f. The node will show an error state; that is normal. Connect to the node and copy its /etc/hosts file to the shared directory.

vinilv_3-1666874447944.png

 

 

ssh cyclecloud@10.0.0.7
$ sudo -i
# cp /etc/hosts /shared/

 

 

g. Terminate the newly added node.
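
The cloud-init script from step d appears only as a screenshot, so its exact content is not reproduced here. As a rough sketch of what such a name-resolution script could look like, the function below adds an IP-to-hostname mapping to a hosts file only if the name is not present yet. The helper name `add_host_entry` and the IP 10.1.0.4 are assumptions for illustration; `hnpbs` is the master hostname used in this post.

```shell
#!/bin/bash
# Add "<ip> <name>" to a hosts file unless the name is already listed,
# so re-running the script does not create duplicate entries.
add_host_entry() {
  local hosts_file=$1 ip=$2 name=$3
  if ! grep -qw -- "$name" "$hosts_file" 2>/dev/null; then
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
  fi
}

# In cloud-init (runs as root), the call would be along these lines.
# 10.1.0.4 is a placeholder: use the private IP of your master node.
#   add_host_entry /etc/hosts 10.1.0.4 hnpbs
```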

Integrating External Master node to CycleCloud

Add the hostnames of the CycleCloud execute nodes to the master node's /etc/hosts (/shared/hosts is the file we copied while adding a node).

 

 

grep ip- /shared/hosts  >> /etc/hosts
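
Because the command above appends with `>>`, running it a second time duplicates every entry. A minimal sketch of an idempotent variant follows, assuming the same `ip-` naming pattern; the function name `merge_hosts` is hypothetical.

```shell
#!/usr/bin/env bash
# Append every "ip-" line from SRC to DST, skipping lines that DST
# already contains verbatim.
merge_hosts() {
  local src=$1 dst=$2 line
  grep 'ip-' "$src" | while IFS= read -r line; do
    grep -qxF -- "$line" "$dst" || printf '%s\n' "$line" >> "$dst"
  done
}

# Example (as root on the master node):
#   merge_hosts /shared/hosts /etc/hosts
```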

 

 

Download the following package and extract it on the external master node.

 

 

wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/cyclecloud-pbspro-pkg-2.0.9.tar.gz
tar -zxf cyclecloud-pbspro-pkg-2.0.9.tar.gz

 

 

Run the installer to set up the CycleCloud environment.

 

 

cd cyclecloud-pbspro/
./initialize_pbs.sh
./initialize_default_queues.sh
./install.sh  --venv /opt/cycle/pbspro/venv

 

 

Create the autoscale.json file that the autoscaler requires. The following command generates it:

 

 

./generate_autoscale_json.sh --username username --password password --url https://fqdn:port --cluster-name cluster_name

 

 

Here is an example run with its output:

 

 

./generate_autoscale_json.sh --username vinil --password <password> --url https://<ipaddress_of_cc_server> --cluster-name hbpbs
testing that we can connect to CycleCloud...
success!

 

 

Run the azpbs validate command to validate the configuration. It suggests corrections to the configuration on the external master.

 

 

[root@hnpbs cyclecloud-pbspro]# azpbs validate
ungrouped is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_priv. Please add this and restart PBS
group_id is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_priv. Please add this and restart PBS

 

 

Edit /var/spool/pbs/sched_priv/sched_config, add ungrouped and group_id to the resources line, and then restart PBS as the validate output requests.

 

 

# grep ^resources /var/spool/pbs/sched_priv/sched_config
resources: "ncpus, mem, arch, host, vnode, aoe, eoe, ungrouped, group_id"
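
If you prefer to script this edit rather than make it by hand, the sketch below adds a resource name to the quoted list on the `resources:` line only when it is missing. It assumes the line has the quoted, comma-separated form shown above; the function name `add_sched_resource` is hypothetical.

```shell
#!/usr/bin/env bash
# Add RES to the quoted list on the "resources:" line of FILE if absent.
add_sched_resource() {
  local file=$1 res=$2 tmp
  # Nothing to do if the resource is already listed.
  if grep -Eq "^resources:.*[\" ,]${res}[\" ,]" "$file"; then
    return 0
  fi
  tmp=$(mktemp)
  # Insert ", <res>" before the closing quote of the resources line.
  # Writing back with cat keeps the original file's owner and mode.
  sed -E "/^resources:/ s/\"[[:space:]]*\$/, ${res}\"/" "$file" > "$tmp" &&
    cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example (as root, followed by a PBS restart):
#   add_sched_resource /var/spool/pbs/sched_priv/sched_config ungrouped
#   add_sched_resource /var/spool/pbs/sched_priv/sched_config group_id
```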

 

 

Run azpbs validate again to verify the changes. No output means the configuration is valid.

 

 

[root@hnpbs cyclecloud-pbspro]# azpbs validate
[root@hnpbs cyclecloud-pbspro]#

 

 

 

Run azpbs autoscale; it should complete without any errors.

 

 

[root@hnpbs cyclecloud-pbspro]# azpbs autoscale
NAME HOSTNAME PBS_STATE JOB_IDS STATE VM_SIZE DISK MEM NCPUS NGPUS EOE GROUP_ID NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR

 

 

Testing the jobs

Create a normal user for submitting jobs. I am creating the same user that is used in CycleCloud, with the same UID and GID, and with its home directory under /shared.

 

 

 

groupadd -g 20001 vinil
useradd -g 20001 -u 20001 -d /shared/home/vinil -s /bin/bash vinil
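
NFS with the default security mode identifies file owners by numeric IDs, so the UID and GID on the master node should match those on the CycleCloud nodes. As a small sketch for comparing them, the helper below prints `uid:gid` for a user from a passwd-format file; the function name `uid_gid_of` is hypothetical.

```shell
#!/usr/bin/env bash
# Print "uid:gid" for USER from a passwd-format file (default: /etc/passwd).
uid_gid_of() {
  local user=$1 passwd_file=${2:-/etc/passwd}
  awk -F: -v u="$user" '$1 == u { print $3 ":" $4 }' "$passwd_file"
}

# Example: run on the master node and on an execute node, then compare.
#   uid_gid_of vinil
```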

 

 

Submit an interactive job with qsub -I to test the setup.

 

 

[root@hnpbs server_priv]# su - vinil
[vinil@hnpbs ~]$ qsub -I
qsub: waiting for job 7.hnpbs to start

 

 

You can now see a new node getting created in the CycleCloud portal.

vinilv_0-1666879835260.png

You can also see the node being created when you run the azpbs autoscale command.

 

 

[root@hnpbs pbspro]# azpbs autoscale
NAME      HOSTNAME    PBS_STATE JOB_IDS STATE     VM_SIZE         DISK          MEM         NCPUS NGPUS EOE GROUP_ID NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR    ITR
execute-1 2yab3000005           7       Preparing Standard_F2s_v2 20.00g/20.00g 4.00g/4.00g 0/1   0/0       s_v2_pg0 execute   execute   false     26af9032ae0 3228.5 -1
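
Some columns in this table can be empty (PBS_STATE and EOE in the row above), so splitting rows on whitespace shifts the fields. As a rough sketch, assuming the output stays left-aligned under its header as it is here, the function below locates columns by their header position and prints each node's NAME and STATE. The function name `node_states` is hypothetical, and this is not an official azpbs interface.

```shell
#!/usr/bin/env bash
# Read an `azpbs autoscale` table on stdin and print "<NAME> <STATE>" per row.
# Columns are located by the position of their header, not by field number.
node_states() {
  awk '
    # Return the token starting at column "start", or "-" if the cell is empty.
    function col(line, start,    s) {
      s = substr(line, start)
      if (s == "" || s ~ /^ /) return "-"
      sub(/ .*/, "", s)
      return s
    }
    NR == 1 {
      # Pad the header so " STATE " cannot match inside "PBS_STATE".
      padded = " " $0 " "
      name_at  = index(padded, " NAME ")
      state_at = index(padded, " STATE ")
      next
    }
    NF > 0 { print col($0, name_at), col($0, state_at) }
  '
}

# Example:
#   azpbs autoscale | node_states
```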

 

 

The new node has now been provisioned by CycleCloud, and the interactive job starts on it.

 

 

[root@hnpbs cyclecloud-pbspro]# su - vinil
Last login: Thu Feb 10 10:38:35 UTC 2022 on pts/0
[vinil@hnpbs ~]$ qsub -I
qsub: waiting for job 7.hnpbs to start
qsub: job 7.hnpbs ready
[vinil@ip-0A00001C ~]$

 

 

The qstat and azpbs autoscale outputs are as follows:

 

 

[root@hnpbs pbspro]# qstat -an
hnpbs:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
7.hnpbs         vinil    workq    STDIN       11378   1   1    --    --  R 00:04
   ip-0A00001C/0

[root@hnpbs pbspro]#
[root@hnpbs pbspro]# azpbs autoscale
NAME      HOSTNAME    PBS_STATE JOB_IDS STATE VM_SIZE         DISK            MEM           NCPUS NGPUS EOE GROUP_ID NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR
execute-1 ip-0A00001C job-busy  7       Ready Standard_F2s_v2 20.00gb/20.00gb 4.00gb/4.00gb 0/1   0/0       s_v2_pg0 execute   execute   false     26af9032ae0 -1  -1

 

 

 

vinilv_1-1666880012942.png

We have successfully integrated an external PBS master node with CycleCloud.

NOTE: When you are working with an on-premises master node, make sure that the required network ports for the PBS scheduler, compute nodes, file shares, license server, etc. are open for successful communication.
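
One way to check reachability is a quick TCP probe from each side. The sketch below relies on bash's built-in /dev/tcp and the coreutils `timeout` command; the function name `port_open` is hypothetical. The ports in the example (15001 for the PBS server, 15004 for the scheduler, 17001 for pbs_comm, 2049 for NFS) are the commonly documented defaults and are an assumption here, so verify them against your own installation.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to HOST:PORT succeeds within 3 seconds.
port_open() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example: from an execute node, probe the master node.
#   for p in 15001 15004 17001 2049; do
#     if port_open hnpbs "$p"; then echo "$p open"; else echo "$p blocked"; fi
#   done
```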
