Researchers who have traditionally used HPC clusters have been asking how they can make use of Amazon Web Services (AWS), aka the cloud, to run their workloads. While AWS gives you great hardware infrastructure, it is really just renting you bare-metal machines by the hour.
I should stress that using AWS is not magic. If you are new to cloud computing, there is a lot you need to know to avoid extra costs or the risk of losing data. Before you start, contact ARC at hpc-support@umich.edu.
Admins of HPC clusters know that it takes a lot more than metal to make a useful HPC service, which is what researchers really want. Researchers don't want to spend time installing and configuring queueing systems, exporting shared storage, and building AMI images.
Luckily for the community, the nice folks at MIT created StarCluster. StarCluster is really a set of prebuilt AMIs plus a set of Python tools that use the AWS API to create HPC clusters on the fly. Their AMIs also include many common packages such as MPI libraries, compilers, and Python packages.
There is a great Quick-Start guide from the StarCluster team. Users can follow it directly, but HPC users at the University of Michigan can use the ARC cluster Flux, which has StarCluster installed as an application. Users only need a user account to access the login node, from which they can create clusters on AWS:
$ module load starcluster/0.95.5

Following the rest of the Quick-Start guide will get your first cluster up and running.
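If you have not built a config yet, here is a rough sketch of a minimal ~/.starcluster/config. The key name, template name, and credentials below are all placeholders, and the Quick-Start guide covers the full set of options:

[aws info]
AWS_ACCESS_KEY_ID = <your_access_key>
AWS_SECRET_ACCESS_KEY = <your_secret_key>
AWS_USER_ID = <your_aws_user_id>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 2
NODE_INSTANCE_TYPE = m3.medium

[global]
DEFAULT_TEMPLATE = smallcluster

With that in place, starting and (just as important for your bill) tearing down a cluster looks like:

$ starcluster start smallcluster
$ starcluster terminate smallcluster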
Common StarCluster Tasks
Switch Instance Type
AWS offers a number of instance types, each with its own features and costs. You switch your instance type with the NODE_INSTANCE_TYPE setting:
NODE_INSTANCE_TYPE=m3.medium
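This setting lives in the cluster template section of your config. For example, a hypothetical smallcluster template could pair a modest compute node type with an even cheaper head node via the optional MASTER_INSTANCE_TYPE setting:

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 2
# instance type for the compute nodes
NODE_INSTANCE_TYPE = m3.medium
# optional: run the head node on a smaller, cheaper type
MASTER_INSTANCE_TYPE = m1.small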
Make a Shared Disk on EBS
EBS is the persistent storage service on AWS. In StarCluster you can create an EBS volume, attach it to your master node, and share it across your entire cluster. Be careful that you don't leave your volumecreator cluster running: use the -s flag to createvolume and also check your running clusters with listclusters.
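As a sketch, the workflow might look like the following; the volume name, 50GB size, us-east-1c zone, and /data mount path are made-up examples, and --shutdown-volume-host is the long form of the -s flag mentioned above (check starcluster createvolume --help for the exact options in your version):

# create a 50GB volume in us-east-1c and shut the volumecreator host
# down as soon as the volume is ready
$ starcluster createvolume --name=mydata --shutdown-volume-host 50 us-east-1c

# double-check that no volumecreator (or other) cluster is left running
$ starcluster listclusters

Then attach the volume in ~/.starcluster/config so StarCluster NFS-shares it from the master to every node:

[volume mydata]
VOLUME_ID = vol-XXXXXXXX
MOUNT_PATH = /data

[cluster smallcluster]
VOLUMES = mydata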
Add/Remove Nodes in a Cluster
This is very important when using cloud services: don't leave machines running that you don't need. Compute nodes should be nothing special and should be clones of each other, so clusters can be resized to current needs with the addnode and removenode commands. In general, if you are not computing, you should remove all your compute nodes, leaving only the master/head node with the shared disk so you can stage data. You can still queue jobs in this state and then start nodes when you are ready.
$ starcluster addnode -n <num_nodes> <clustername>
$ starcluster removenode -n 3 smallcluster
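For example, a typical cost-saving cycle might look like this; smallcluster and the node counts are placeholders, and qstat assumes the default SGE scheduler that StarCluster configures:

# drop back to just the master while staging data and queueing jobs
$ starcluster removenode -n 2 smallcluster

# log into the master and check what is waiting in the queue
$ starcluster sshmaster smallcluster
$ qstat

# add compute nodes back when you are ready to run
$ starcluster addnode -n 2 smallcluster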