This post has been republished via RSS; it originally appeared at: Azure Compute articles.
Science and Nature recently published the newest AI-based protein-folding algorithms, RoseTTAFold and AlphaFold2, on the same day, promising a revolutionary breakthrough in human protein structure prediction. The corresponding code repositories were also released on GitHub.
How can you adopt this new protein-folding technology and speed up your research with the power of an HPC cluster? It is wiser and more convenient to use Azure HPC solutions, which can be ready in several hours, than to prepare on-premises servers and build a static cluster over several months.
Azure has several offerings at the HPC platform layer for quickly building purpose-built environments. Azure also has a rich set of VM types suited to HPC scenarios, including the newest Milan CPU-based HBv3 series, the HC series with high-speed InfiniBand, and the NC series accelerated by NVIDIA A100/V100/T4 GPUs.
Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale.
- Read and understand the license requirements of RoseTTAFold and its weight data.
- Apply for a PyRosetta license and download the installation package (the Python 3.7 Linux version is suggested).
- Have an Azure cloud account, or register a new one.
- Create an SSH key and save the pem file.
- Select the working Azure region (Southeast Asia is suggested). Create a resource group (e.g. rgCycleCloud) and a VNet.
- Submit a quota increase request for the NCas_T4_v3 series of Azure T4 GPU VMs. If you need more performance, request quota for the V100-based NCs_v3 series instead.
- This hands-on lab incurs cost. As a reference, with T4 VMs in the Southeast Asia region the estimate is less than $50 if completed within one day. Detailed pricing is here.
The first step is to prepare the CycleCloud server through an ARM template. Open Cloud Shell in the Azure console and run the command below to create a service principal. Note down the returned "appId", "password", and "tenant" values.
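The three values to note down can also be pulled out of the command's JSON output programmatically. The sketch below assumes the service principal was created with the Azure CLI command `az ad sp create-for-rbac` (an assumption, since the original command is not reproduced here) and that its JSON output was captured as text. The helper name `extract_sp_credentials` and all sample values are hypothetical placeholders.

```python
import json

# Fields the CycleCloud deployment asks for, as named in the text above.
REQUIRED_FIELDS = ("appId", "password", "tenant")


def extract_sp_credentials(cli_output: str) -> dict:
    """Return only the appId, password and tenant from the CLI's JSON output."""
    data = json.loads(cli_output)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise KeyError(f"missing fields in service principal output: {missing}")
    return {name: data[name] for name in REQUIRED_FIELDS}


# Placeholder output; real values come from your own Cloud Shell session.
sample_output = """
{
  "appId": "00000000-0000-0000-0000-000000000000",
  "displayName": "example-sp",
  "password": "<generated-secret>",
  "tenant": "11111111-1111-1111-1111-111111111111"
}
"""
creds = extract_sp_credentials(sample_output)
print(sorted(creds))
```

Treat the password as a secret: store it somewhere safe rather than in a shared file.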
Click the CycleCloud Server template link to jump to the custom deployment page in the Azure console. Set the region to Southeast Asia and the resource group to rgCycleCloud. Provide the service principal info just created and set a CycleCloud admin username and password for later login. Set the storage account name to sacyclecloud and leave the other parameters as is. Click "Review+create" and then click "Create".
When the resources are ready, go to the "cyclecloud" VM overview page to find its DNS name. Open it in another browser tab and log in with the admin username and password set previously. On the first page of the initial setup, give a site name; on the second page, accept the software license agreement. On the third page, the User ID and Password are separate from the admin credentials set in Step 1.2; you may set them to the same values so they are easy to remember. An SSH public key string is also required here; it is used to access the VMs later.
From the "cycleadmin" drop-down menu at the upper right, click "My profile -> Edit profile", enter your SSH public key string, and save. This step is mandatory because the public key is used when scaling VMs. Then log in to the CycleCloud server over SSH, run the initialize command, and press 'Enter' at each prompt. Then create an id_rsa file containing your SSH private key string. Keep this SSH window open.
Prepare RoseTTAFold VM Image
In the Azure console, open the VM creation page via Home -> Create Resource -> Virtual Machine. Set the basic configuration as follows:
- Resource Group: rgCycleCloud
- Virtual Machine name: vmRoseTTAFoldHPCImg
- Region: Southeast Asia
- Availability options: No infrastructure redundancy required
- Image: CentOS-based 7.9 HPC Gen1, with GPU driver, CUDA, and HPC tools pre-installed (click "See all images" and search for this image in the Marketplace)
- Size: Standard NC16as_T4_v3
- Username: cycleadmin
- SSH public key source: Use existing public key (if you use SSH keys in Azure)
- SSH public key: <your SSH public key>
- Virtual network: azurecyclecloud (or another existing VNet)
Click 'Review+Create' to check the settings, and then create the VM.
After the VM reaches Running status, one more step is needed to enlarge the system disk. First stop the VM, selecting the option to reserve the VM's public IP address. Once the status shows as stopped, click the VM's Disks menu -> the system disk link -> 'Size + performance', and set the system disk size to 64 GB with performance tier P6 or higher. Wait until the notification at the upper right shows that the update is complete, then go back and start the VM. The VM status will change to Running after several minutes.
Use your SSH terminal to log in to this VM and run the following commands to install the RoseTTAFold application. They cover these steps:
- Install Anaconda3. During installation, set the destination directory to /opt/anaconda3 and answer yes when asked whether to initialize conda.
- Download the RoseTTAFold GitHub repo. This uses a branch of the RoseTTAFold repo that has been modified for building on HPC.
- Configure two conda environments.
- Install the PyRosetta4 component in the folding conda environment. As an optional check of PyRosetta4, start Python in the folding environment and run "import pyrosetta" and "pyrosetta.init()"; the output should contain no errors.
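The optional PyRosetta4 check from the last step can be wrapped in a small script so that it reports a clear pass or fail. This is a minimal sketch, meant to be run with the Python interpreter of the folding conda environment; the function name `check_pyrosetta` is a hypothetical helper, while `import pyrosetta` and `pyrosetta.init()` are the two calls named above.

```python
import importlib.util


def check_pyrosetta() -> bool:
    """Return True if PyRosetta imports and initializes without raising."""
    # find_spec lets us report a missing package cleanly, for example when
    # the script is run outside the folding environment.
    if importlib.util.find_spec("pyrosetta") is None:
        print("pyrosetta is not installed in this environment")
        return False
    try:
        import pyrosetta

        pyrosetta.init()  # prints the Rosetta start-up messages
    except Exception as exc:
        print(f"PyRosetta check raised an error: {exc}")
        return False
    return True


print("PyRosetta OK" if check_pyrosetta() else "PyRosetta check failed")
```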
We strongly suggest taking a snapshot of this VM's OS disk before going on. Then run the prepare command in the SSH console and press 'y' to proceed.
When it completes, go to Cloud Shell and run these commands:
After the custom image is created, go to Images -> imgRoseTTAhpc -> Overview in the Azure console. Find the 'RESOURCE ID', which has the form '/subscriptions/xxx-xx-…xxxx/resourceGroups/rgCycleCloud/providers/Microsoft.Compute/images/imgRoseTTAhpc', and save it for later use.
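Because the resource ID is pasted into CycleCloud later, it can help to sanity-check the copied string first. The sketch below validates the layout shown above and extracts the resource group and image name. The function name `parse_image_resource_id` is hypothetical, and the subscription ID in the example is a placeholder, not a real value.

```python
def parse_image_resource_id(resource_id: str) -> dict:
    """Split a custom image resource ID into its named parts."""
    parts = resource_id.strip().strip("/").split("/")
    # Expected layout:
    # subscriptions/<id>/resourceGroups/<rg>/providers/Microsoft.Compute/images/<name>
    if len(parts) != 8:
        raise ValueError(f"expected 8 path segments, got {len(parts)}")
    labels = tuple(parts[i].lower() for i in (0, 2, 4, 5, 6))
    expected = ("subscriptions", "resourcegroups", "providers",
                "microsoft.compute", "images")
    if labels != expected:
        raise ValueError("not a Microsoft.Compute image resource ID")
    return {
        "subscription": parts[1],
        "resource_group": parts[3],
        "image_name": parts[7],
    }


# Placeholder subscription ID; substitute the value copied from the portal.
example_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/rgCycleCloud/providers/Microsoft.Compute"
    "/images/imgRoseTTAhpc"
)
info = parse_image_resource_id(example_id)
print(info["resource_group"], info["image_name"])
```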
Create an HPC cluster in CycleCloud
In the CycleCloud UI, add a new cluster and select the Slurm scheduler type. First give the cluster a name, e.g. clusRosetta1. Then configure the "required settings" page: choose NC16as_T4_v3 as the HPC VM type and set the quantity in the auto-scaling configuration. For the network, select the 'azurecyclecloud-compute' subnet. Click "Next".
Change the CycleCloud default NFS disk size to 5000 GB (the training dataset occupies about 3 TB); this disk is mounted at cluster startup. On the "advanced settings" page, set the HPC OS type to "Custom image" and set the image ID to the 'RESOURCE ID' saved in the previous step. Leave the other options as is and click the "Save" button at the bottom right.
Click "Start" to boot the cluster. CycleCloud then creates VMs according to the configuration. After several minutes, a scheduler VM appears in the list. Click this item, then click the "connect" button in the detail list below it to get a string like "cyclecloud connect scheduler -c clusRosetta1". Run this command in the CycleCloud server's SSH console to log in to the scheduler VM.
RoseTTAFold Dataset preparation
The next step is to prepare the datasets, including the weights and the reference protein PDB database. In the scheduler VM's SSH console, run the command below to load the datasets into the NFS volume mounted in the cluster. We provide copies of these datasets in Azure Blob storage to speed up the download. You can also switch to the original links, which are commented out. The unzip operations take several hours. We suggest unzipping in multiple SSH windows, without interruption, to ensure data integrity, and then checking the data size with the 'du -sh <directory_name>' command.
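The size check can also be scripted if you want to compare several dataset directories at once. This is a minimal sketch with hypothetical helper names (`directory_size_bytes`, `human_readable`). It sums apparent file sizes, so the figure can differ somewhat from 'du -sh', which reports disk blocks in use.

```python
import os


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            if not os.path.islink(full_path):
                total += os.path.getsize(full_path)
    return total


def human_readable(num_bytes: int) -> str:
    """Format a byte count with binary units, similar in spirit to du -h."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


# Formatting example only; call directory_size_bytes on your dataset path.
print(human_readable(3 * 1024**4))
```

A typical call would be `human_readable(directory_size_bytes("<directory_name>"))`, with the placeholder replaced by one of the unzipped dataset directories.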
Run a RoseTTAFold sample
The git repo contains a job submission script named runjob.sh. We can submit a RoseTTAFold analysis job with the Slurm sbatch command in the scheduler's SSH session, as below.
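When many jobs are submitted from a script, it is handy to capture the Slurm job ID that sbatch prints ("Submitted batch job <id>"). The sketch below is one way to do that. The helper names are hypothetical, and the arguments that runjob.sh expects are not stated here, so they are passed through as an opaque list.

```python
import re
import shutil
import subprocess
from typing import List, Optional


def parse_sbatch_job_id(sbatch_output: str) -> Optional[int]:
    """Extract the job ID from sbatch's confirmation line, if present."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None


def submit_job(script: str, script_args: List[str]) -> Optional[int]:
    """Submit a script with sbatch and return the Slurm job ID."""
    if shutil.which("sbatch") is None:
        # Not on a Slurm node; nothing is submitted.
        print("sbatch not found; run this on the scheduler VM")
        return None
    result = subprocess.run(
        ["sbatch", script, *script_args],
        capture_output=True, text=True, check=True,
    )
    return parse_sbatch_job_id(result.stdout)


print(parse_sbatch_job_id("Submitted batch job 9"))
```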
This sample job takes an estimated 30+ minutes, including the steps of MSA parameter generation, HHsearch, prediction, and modeling. The job's output can be checked in job<id>.out, and the log files are at ~/log_<id>/, where you can find more progress info. AI training logging info can be found at ./log_<id>/folding.stdout.
Because this is an HPC cluster, you can submit multiple jobs. The Slurm scheduler allocates jobs to the compute nodes in the cluster. The allocation and status of multiple jobs can be listed with the 'squeue' command, as below.
[cycleadmin@ip-0A00041F ~]$ squeue
JOBID  PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
7      hpc        RosettaO  cycleadm  R   1:13  1      hpc-pg0-1
8      hpc        RosettaO  cycleadm  R   0:08  1      hpc-pg0-2
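For scripted monitoring, the squeue listing can be turned into records keyed by the header names. This is a minimal sketch that assumes the default column layout shown above; the function name `parse_squeue` is hypothetical, and the sample text is the listing from this post.

```python
from typing import Dict, List


def parse_squeue(output: str) -> List[Dict[str, str]]:
    """Parse default-format squeue output into one dict per job."""
    lines = [line for line in output.splitlines() if line.strip()]
    # Skip anything before the header row, such as an echoed shell prompt.
    header_index = next(
        (i for i, line in enumerate(lines) if line.split()[0] == "JOBID"),
        None,
    )
    if header_index is None:
        raise ValueError("no squeue header row found")
    header = lines[header_index].split()
    jobs = []
    for line in lines[header_index + 1:]:
        # The last column may contain spaces, so limit the number of splits.
        fields = line.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


sample = """[cycleadmin@ip-0A00041F ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7 hpc RosettaO cycleadm R 1:13 1 hpc-pg0-1
8 hpc RosettaO cycleadm R 0:08 1 hpc-pg0-2
"""
jobs = parse_squeue(sample)
running = [job["JOBID"] for job in jobs if job["ST"] == "R"]
print(running)
```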
If there are not enough nodes, CycleCloud boots new nodes to accommodate the jobs. Meanwhile, to save cost, CycleCloud terminates nodes that have had no job running on them for a time window. The CycleCloud UI provides more detailed status info on the cluster and nodes. GPU utilization reaches nearly 100% during the prediction steps, and the GPU is idle at other times during the run.
A successful run prints output like that below. It writes the 5 preferred protein PDB results to ~/model_<id>/, named model_x.pdb.
[cycleadmin@ip-0A00041F ~]$ cat job9.out
Running HHblits of JobId rjob204
Running PSIPRED of JobId rjob204
Running hhsearch of JobId rjob204
Predicting distance and orientations of JobId rjob204
Running parallel RosettaTR.py
Running DeepAccNet-msa of JobId rjob204
Picking final models of JobId rjob204
Final models saved in: /shared/home/cycleadmin/model_204
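To collect results from several jobs, the final model directory can be read from the job output and the model files listed from it. The sketch below relies only on the "Final models saved in:" line and the model_x.pdb naming shown above; the function names are hypothetical.

```python
import glob
import os
import re
from typing import List, Optional


def final_model_dir(job_output: str) -> Optional[str]:
    """Return the directory from the 'Final models saved in:' line, if any."""
    match = re.search(
        r"^Final models saved in:\s*(\S+)", job_output, flags=re.MULTILINE
    )
    return match.group(1) if match else None


def list_models(model_dir: str) -> List[str]:
    """List the model_x.pdb files in a model directory, sorted by name."""
    return sorted(glob.glob(os.path.join(model_dir, "model_*.pdb")))


sample_output = """Picking final models of JobId rjob204
Final models saved in: /shared/home/cycleadmin/model_204
"""
print(final_model_dir(sample_output))
```

A job that has not finished yet has no such line, in which case `final_model_dir` returns None.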
The image below shows the two PDB protein structures, from the pyrosetta and end2end results, in the PyMOL UI.
You can change the parameters in the submission script to fully utilize the CPU and memory of your cluster's VM type. The next step is to upload your fasta input files to the NFS volume and submit your RoseTTAFold jobs.
If you will not keep this environment, delete the rgCycleCloud resource group to tear down all the related resources directly.
RoseTTAFold branch repo for HPC: RoseTTAFold for HPC