OCI HPC Cluster stack
MarcinZablocki committed May 26, 2021
1 parent 4511847 commit 89fe833
Showing 263 changed files with 27,036 additions and 1,534 deletions.
4 changes: 2 additions & 2 deletions LICENSE
@@ -1,4 +1,4 @@
Copyright (c) 2018-2019 Oracle and/or its affiliates. All rights reserved.
Copyright (c) 2018-2020 Oracle and/or its affiliates. All rights reserved.

The Universal Permissive License (UPL), Version 1.0

@@ -24,4 +24,4 @@ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLI
THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN THE SOFTWARE.
IN THE SOFTWARE.
175 changes: 114 additions & 61 deletions README.md
@@ -1,61 +1,114 @@
# oci-quickstart-hpc

<pre>
`-/+++++++++++++++++/-.`
`/syyyyyyyyyyyyyyyyyyyyyyys/.
:yyyyo/-...............-/oyyyy/
/yyys- .oyyy+
.yyyy` `syyy-
:yyyo /yyy/ Oracle Cloud HPC Cluster Demo
.yyyy` `syyy- https://github.com/oracle-quickstart/oci-hpc
/yyys. .oyyyo
/yyyyo:-...............-:oyyyy/`
`/syyyyyyyyyyyyyyyyyyyyyyys+.
`.:/+ooooooooooooooo+/:.`
`
</pre>

High Performance Computing and storage in the cloud can be confusing, and it can be difficult to determine where to start. This repository is designed to be a first step in exploring a cloud-based HPC storage and compute architecture. There are many different configurations and deployment methods that could be used, but this repository focuses on a bare metal compute system deployed with Terraform. After deployment, a fully independent and functioning IaaS HPC compute cluster is available, based on the architecture below.

This deployment is an example of cluster provisioning using Terraform and SaltStack. Terraform is used to provision infrastructure, while Salt is a configuration and cluster management system.

The Salt configuration is stored under the ./salt directory, which contains pillar/ (variables) and salt/ (state) information. Read more about Salt in the documentation: https://docs.saltstack.com/en/latest/

## Architecture
![Architecture](images/architecture.png)

## Authentication
terraform.tfvars contains the required authentication variables.

## Operations
Salt commands should be executed from the headnode.
IntelMPI installation: `sudo salt '*' state.apply intelmpi` (see also the examples below).
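
A few additional Salt invocations that are commonly useful from the headnode, as a minimal sketch (the `worker*` target pattern is illustrative and depends on how your minion IDs are named):
```
# Check that all minions respond
sudo salt '*' test.ping

# Apply the IntelMPI state to every node
sudo salt '*' state.apply intelmpi

# Apply the OpenMPI state to a subset of nodes (illustrative target pattern)
sudo salt 'worker*' state.apply openmpi
```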

## SSH Key
An SSH key is generated for each environment and written to the ./key.pem file.
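
A minimal sketch of connecting to the headnode with the generated key, assuming the default `opc` user and a placeholder public IP:
```
# Restrict permissions on the generated key before first use
chmod 600 ./key.pem

# Connect to the headnode (jump host); replace <headnode_public_ip> with the Terraform output
ssh -i ./key.pem opc@<headnode_public_ip>
```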

## Networking
* Public subnet - The headnode acts as a jump host and is placed in the public subnet. The subnet is open to SSH connections from everywhere; other ports are closed and can be opened using the custom-security-list in the OCI console/CLI. All connections from within the VCN are accepted. The host firewall service is disabled by default.
* Private subnet - All connections from within the VCN are accepted. Public IPs are prohibited in the subnet. Internet access is provided by a NAT gateway.

## Roles
Roles are set in variables.tf as additional_headnode_roles, additional_worker_roles, additional_storage_roles, or additional_role_all (see the sketch after the example roles below). Additional roles provide the ability to install and configure applications defined as Salt states.

Example roles:
* intelmpi: provides a configured Intel yum repository and installs the IntelMPI distribution
* openmpi: installs OpenMPI from the OL repository
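
As an illustration, role assignments in variables.tf might look like the following (the values are examples only; the exact variable definitions live in the stack's variables.tf):
```
additional_worker_roles  = ["intelmpi"]
additional_storage_roles = []
additional_role_all      = ["openmpi"]
```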

## Storage
* Storage nodes are required to be a DenseIO shape (NVMe devices are detected and configured).

### Filesystems

Storage role servers will be configured as filesystem nodes, while the headnode and worker nodes will act as clients (a quick mount check is sketched below).
* GlusterFS (requires storage role) - To use GlusterFS, set storage_type to glusterfs. The filesystem will be created as :/gfs and mounted under /mnt/gluster.
* BeeGFS (requires storage role) - To use BeeGFS, set storage_type to beegfs. The filesystem will be mounted under /mnt/beegfs.
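
After provisioning, a quick way to confirm the parallel filesystem is mounted on a client node (mount points as described above):
```
# On the headnode or a worker node
df -h /mnt/gluster   # storage_type = glusterfs
df -h /mnt/beegfs    # storage_type = beegfs
```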

### NFS
* Block volumes - Each node type can be configured with block volumes in variables.tf.
The headnode will export the first block volume as an NFS share under /mnt/share (configured in salt/salt/nfs.sls).
Other block volume attachments need to be configured manually after cluster provisioning.
* FSS - A File Storage service (FSS) endpoint will be created in the private subnet and mounted on each node under /mnt/fss.
# Stack to create an HPC cluster.

## Policies to deploy the stack:
```
allow service compute_management to use tag-namespace in tenancy
allow service compute_management to manage compute-management-family in tenancy
allow service compute_management to read app-catalog-listing in tenancy
allow group user to manage all-resources in compartment compartmentName
```
## Policies for autoscaling:
As described when you specify your variables, if you select instance-principal as the way of authenticating your nodes, make sure you generate a dynamic group and give it the following policies (a matching-rule sketch follows the policy statements below):
```
Allow dynamic-group instance_principal to read app-catalog-listing in tenancy
Allow dynamic-group instance_principal to use tag-namespace in tenancy
```
And also either:

```
Allow dynamic-group instance_principal to manage compute-management-family in compartment compartmentName
Allow dynamic-group instance_principal to manage instance-family in compartment compartmentName
Allow dynamic-group instance_principal to use virtual-network-family in compartment compartmentName
```
or:

`Allow dynamic-group instance_principal to manage all-resources in compartment compartmentName`
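
For reference, a sketch of a dynamic-group matching rule that captures all instances in a compartment (replace the OCID with your own compartment's; adjust the rule if only specific instances should belong to the group):
```
ANY {instance.compartment.id = 'ocid1.compartment.oc1..<your_compartment_ocid>'}
```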

# Autoscaling

Autoscaling works on a “cluster per job” basis: for each job waiting in the queue, a new cluster is launched specifically for that job. Autoscaling also takes care of spinning down clusters; by default, a cluster is left idle for 10 minutes before being shut down. Autoscaling is driven by a cronjob, which makes it easy to switch from one scheduler to another.

To turn on autoscaling, uncomment the following line in `crontab -e`:
```
* * * * * /home/opc/autoscaling/crontab/autoscale_slurm.sh >> /home/opc/autoscaling/logs/crontab_slurm.log 2>&1
```
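
To verify that the cron job is running and see what autoscaling is doing, you can follow the log it writes to:
```
tail -f /home/opc/autoscaling/logs/crontab_slurm.log
```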

# Submit
Slurm jobs can be submitted as usual, but a few additional constraints can be set. Example in `autoscaling/submit/sleep.sbatch`:

```
#!/bin/sh
#SBATCH -n 72
#SBATCH --ntasks-per-node 36
#SBATCH --exclusive
#SBATCH --job-name sleep_job
#SBATCH --constraint cluster-size-2,BM.HPC2.36
cd /nfs/scratch
mkdir $SLURM_JOB_ID
cd $SLURM_JOB_ID
MACHINEFILE="hostfile"
# Generate Machinefile for mpi such that hosts are in the same
# order as if run via srun
#
srun -N$SLURM_NNODES -n$SLURM_NNODES hostname > $MACHINEFILE
sed -i 's/$/:36/' $MACHINEFILE
cat $MACHINEFILE
# Run using the generated machine file, e.g.: mpirun -machinefile $MACHINEFILE ./your_mpi_binary
# (this demo job simply sleeps instead of launching MPI work)
sleep 1000
```
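
Assuming the path above relative to the opc home directory, the job is submitted and monitored with the usual Slurm commands:
```
sbatch autoscaling/submit/sleep.sbatch
squeue                 # the job stays pending while the cluster is created
sinfo                  # shows the new cluster's nodes once they join
```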

- cluster-size: Since clusters can be reused, you can decide to only use a cluster of exactly the right size. A created cluster will have the feature cluster-size-x. Set the constraint cluster-size-x to make sure the sizes match and to avoid, for example, a 1-node job using a 16-node cluster.

- shape: You can specify the OCI shape that you’d like to run on as a constraint. This makes sure that you run on the right shape and also generates the right cluster. Shapes are expected to be written in OCI format: BM.HPC2.36, BM.Standard.E3.128, BM.GPU4.8, ...
If you’d like to use flex shapes, you can use VM.Standard.E3.x, where x is the number of cores that you would like (see the sketch below).
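
For example, a minimal sketch of a job header requesting a 2-node cluster of 16-core E3 flex VMs (values are illustrative):
```
#!/bin/sh
#SBATCH -n 32
#SBATCH --ntasks-per-node 16
#SBATCH --exclusive
#SBATCH --job-name flex_sleep
#SBATCH --constraint cluster-size-2,VM.Standard.E3.16
sleep 600
```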


## Cluster folders:
```
~/autoscaling/clusters/clustername
```

## Logs:
```
~/autoscaling/logs
```

Each cluster will have its own logs, named `create_clustername_date.log` and `delete_clustername_date.log`.
The crontab log is written to `crontab_slurm.log`.


## Manual clusters:
You can create and delete your clusters manually.
### Cluster Creation
```
/home/opc/autoscaling/create_cluster.sh NodeNumber clustername shape Cluster-network-enabled
```
Example:
```
/home/opc/autoscaling/create_cluster.sh 4 cluster-6-amd3128 BM.Standard.E3.128 false
```

To be registered in Slurm, the cluster names must follow this pattern:
* BM.HPC2.36: cluster-i-hpc
* BM.Standard2.52: cluster-i-std252
* VM.Standard2.x: cluster-i-std2x
* BM.Standard.E2.64: cluster-i-amd264
* VM.Standard.E2.x: cluster-i-amd2x
* BM.Standard.E3.128: cluster-i-amd3128
* VM.Standard.E3.x: cluster-i-amd3x
* BM.GPU2.2: cluster-i-gpu22
* VM.GPU2.1: cluster-i-gpu21
* BM.GPU3.8: cluster-i-gpu38
* VM.GPU3.x: cluster-i-gpu3x
* BM.GPU4.8: cluster-i-gpu48

### Cluster Deletion:
```
/home/opc/autoscaling/delete_cluster.sh clustername
```
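
For example, to remove the cluster created above (assuming the same name):
```
/home/opc/autoscaling/delete_cluster.sh cluster-6-amd3128
```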
Binary file added autoscaling/.DS_Store
Binary file not shown.
16 changes: 16 additions & 0 deletions autoscaling/cleanup.sh
@@ -0,0 +1,16 @@
#!/bin/bash
#
# Cluster destroy script
scripts=`realpath $0`
folder=`dirname $scripts`
playbooks_path=$folder/../playbooks/
inventory_path=$folder/clusters/$1

ssh_options="-i ~/.ssh/id_rsa -o StrictHostKeyChecking=no"
if [[ "$2" == "FORCE" ]];
then
echo Force Deletion
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook $playbooks_path/destroy.yml -i $inventory_path/inventory -e "force=yes"
else
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook $playbooks_path/destroy.yml -i $inventory_path/inventory
fi
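
Based on the arguments the script reads, a usage sketch (the first argument is the cluster directory under autoscaling/clusters/, and the optional second argument forces deletion):
```
# Normal destroy
./cleanup.sh cluster-6-amd3128

# Ignore errors and force resource destruction
./cleanup.sh cluster-6-amd3128 FORCE
```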
1 change: 1 addition & 0 deletions autoscaling/clusters/README
@@ -0,0 +1 @@
Each Terraform configuration will be located in this folder. This folder is used to destroy the cluster; do not remove live clusters manually.
55 changes: 55 additions & 0 deletions autoscaling/configure.sh
@@ -0,0 +1,55 @@
#!/bin/bash
#
# Cluster init configuration script
#

#
# wait for cloud-init completion on the bastion host
#

scripts=`realpath $0`
folder=`dirname $scripts`
execution=1
playbooks_path=$folder/../playbooks/
inventory_path=$folder/clusters/$1

ssh_options="-i ~/.ssh/cluster.key -o StrictHostKeyChecking=no"

#
# A little waiter function to make sure all the nodes are up before we start configuring
#

echo "Waiting for SSH to come up"

for host in $(cat $inventory_path/hosts_$1) ; do
r=0
echo "validating connection to: ${host}"
while ! ssh ${ssh_options} opc@${host} uptime ; do

if [[ $r -eq 10 ]] ; then
execution=0
break
fi

echo "Still waiting for ${host}"
sleep 60
r=$(($r + 1))

done
done

#
# Ansible will take care of key exchange and learning the host fingerprints, but for the first time we need
# to disable host key checking.
#

if [[ $execution -eq 1 ]] ; then
ANSIBLE_HOST_KEY_CHECKING=False ansible all -m setup --tree /tmp/ansible > /dev/null 2>&1
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook $playbooks_path/new_nodes.yml -i $inventory_path/inventory
else

cat <<- EOF > /tmp/motd
At least one of the cluster nodes was inaccessible during installation. Please validate the hosts and re-run:
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook $playbooks_path/new_nodes.yml -i $inventory_path/inventory
EOF
fi
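
A usage sketch, assuming the cluster name matches a directory under autoscaling/clusters/ containing the generated inventory and hosts file:
```
./configure.sh cluster-6-amd3128
```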
38 changes: 38 additions & 0 deletions autoscaling/create_cluster.sh
@@ -0,0 +1,38 @@
#!/bin/bash

if [ $# -eq 0 ] || [ $# -eq 1 ]
then
echo "No enough arguments supplied, please supply number of nodes and cluster name"
exit
fi
date=`date '+%Y%m%d%H%M'`
scripts=`realpath $0`
folder=`dirname $scripts`
cp -r $folder/tf_init $folder/clusters/$2
cd $folder/clusters/$2
if [[ $3 == VM.Standard.E3.* ]]
then
sed "s/##NODES##/$1/g;s/##NAME##/$2/g;s/##SHAPE##/VM.Standard.E3.Flex/g;s/##CN##/$4/g;s/##OCPU##/${3:15}/g" $folder/tf_init/variables.tf > variables.tf
elif [[ $3 == VM.Standard.E4.* ]]
then
sed "s/##NODES##/$1/g;s/##NAME##/$2/g;s/##SHAPE##/VM.Standard.E4.Flex/g;s/##CN##/$4/g;s/##OCPU##/${3:15}/g" $folder/tf_init/variables.tf > variables.tf
else
sed "s/##NODES##/$1/g;s/##NAME##/$2/g;s/##SHAPE##/$3/g;s/##CN##/$4/g" $folder/tf_init/variables.tf > variables.tf
fi
echo "Started to build $2"
start=`date +%s`
terraform init > $folder/logs/create_$2_${date}.log
echo $1 $3 $4 >> currently_building
terraform apply -auto-approve >> $folder/logs/create_$2_${date}.log 2>&1
status=$?
end=`date +%s`
runtime=$((end-start))
if [ $status -eq 0 ]
then
echo "Successfully created $2 in $runtime seconds"
rm currently_building
else
echo "Could not create $2 with $1 nodes in $runtime seconds"
rm currently_building
$folder/delete_cluster.sh $2 FORCE
fi
5 changes: 5 additions & 0 deletions autoscaling/credentials/key.sh
@@ -0,0 +1,5 @@
#!/bin/bash
#
# Reflow a single-line RSA private key ($1) into a multi-line PEM file ($2)

# Drop the END footer, then break the key body onto separate lines
# (every space from the 4th onward becomes a newline, keeping the BEGIN header intact)
sed 's/-----END RSA PRIVATE KEY-----//' $1 | sed 's/ /\n/4g' > $2
echo -----END RSA PRIVATE KEY----- >> $2
chmod 600 $2
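
A usage sketch: the first argument is a single-line RSA private key, the second is where to write the reflowed PEM file (the path below is the one configure.sh expects):
```
./key.sh /path/to/oneline_key.pem ~/.ssh/cluster.key
```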