Update docs and clarify VirtualBox incompatibility

- Detail reasons for VirtualBox incompatibility in README
- Remove VirtualBox-specific config from Vagrantfile
- Add warnings about CPU overrides and slurm.conf updates
- Simplify apt retry setup in the provision script
Kris Lamoureux 2024-08-18 18:13:33 -04:00
parent e03fbe5c14
commit 03a0f05b49
Signed by: kris
GPG Key ID: 3EDA9C3441EDA925
5 changed files with 108 additions and 56 deletions

README.md (128 changed lines)

@@ -6,17 +6,57 @@ This repository contains a `Vagrantfile` and the necessary configuration for
 automating the setup of a Slurm cluster using Vagrant's shell provisioning on
 Debian 12 x86_64 VMs.
-### Prerequisites
+## Prerequisites
-This setup was developed using vagrant-libvirt with NFS for file sharing,
-rather than the more common VirtualBox configuration which typically uses
-VirtualBox's Shared Folders. However, VirtualBox should work fine.
+This setup was developed using vagrant-libvirt with NFS for file sharing.
-The core requirements for this setup are:
-- Vagrant (with functioning file sharing)
+- [Vagrant](https://wiki.debian.org/Vagrant)
+  (tested with 2.3.4 packaged by Debian 12)
+- [Vagrant-libvirt provider](https://vagrant-libvirt.github.io/vagrant-libvirt/)
+  (tested with 0.11.2 packaged by Debian 12)
+- Working Vagrant
+  [Synced Folders using NFS](https://developer.hashicorp.com/vagrant/docs/v2.3.4/synced-folders/nfs)
 - (Optional) Make (for convenience commands)
-### Cluster Structure
+### VirtualBox Incompatibility
+While efforts were made to support VirtualBox, several challenges prevent its
+use in the current state.
+1. **Hostname Resolution**: Unlike libvirt, VirtualBox doesn't provide
+automatic hostname resolution between VMs. This requires an additional private
+network and potentially custom scripting or plugins to enable inter-VM
+communication by hostname.
+2. **Sequential Provisioning**: VirtualBox provisions VMs sequentially, which,
+while preventing a now exceedingly rare race condition with the munge.key
+generation and distribution, significantly increases the overall setup time
+compared to vagrant-libvirt's parallel provisioning and complicates potential
+scripted solutions for hostname resolution.
+3. **Shared Folder Permissions**: VirtualBox's shared folder mechanism doesn't
+preserve Unix permissions from the host system. The `vagrant` user owns all
+shared files in the `/vagrant` mount point, complicating the setup of a
+non-privileged `submit` user and stripping the execution bit from shared
+scripts.
+4. **Provider-Specific Options**: Using mount options for VirtualBox shared
+folders is incompatible with libvirt, making maintaining a single,
+provider-agnostic configuration challenging.
+Potential workarounds like assigning static IPs compromise the flexibility of
+the current setup. The fundamental differences between VirtualBox and libvirt
+in handling shared folders and networking make it challenging to create a truly
+provider-agnostic solution without significant compromises or overhead.
+For now, this project focuses on the libvirt provider due to its better
+compatibility with the requirements of an automated Slurm cluster setup. Future
+development could explore creating separate, provider-specific configurations
+to support VirtualBox, acknowledging the additional maintenance this would
+require.
+## Cluster Structure
 - `node1`: Head Node (runs `slurmctld`)
 - `node2`: Login/Submit Node
 - `node3` / `node4`: Compute Nodes (runs `slurmd`)
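For reference (not part of this commit), the prerequisites above can typically be satisfied on a Debian 12 host with stock packages; the package names below are an assumption based on Debian's archive rather than anything this repository prescribes:

    # Host-side setup sketch for the prerequisites above (Debian 12 host assumed)
    sudo apt install vagrant vagrant-libvirt libvirt-daemon-system nfs-kernel-server
    sudo adduser "$USER" libvirt   # manage libvirt VMs without root
    vagrant plugin list            # the libvirt provider should appear as a system plugin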
@@ -25,9 +65,9 @@ By default, each node is allocated:
 - 2 threads/cores (depending on architecture)
 - 2 GB of RAM
-**Warning: 8 vCPUs and 8 GB of RAM is used in total resources**
+**Warning: 8 vCPUs and 8 GB of RAM are used by default in total resources**
-## Quick Start
+## Getting Started
 1. To build the cluster, you can use either of these methods
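Once the cluster is up (by whichever build method you chose), a couple of standard Vagrant commands, not part of this change, are enough to confirm the nodes are running and to reach the login node:

    vagrant status     # all four nodes should report "running"
    vagrant ssh node2  # node2 is the Login/Submit Node in the layout above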
@@ -48,7 +88,8 @@ By default, each node is allocated:
 /vagrant/primes.sh
-By default, this script searches for prime numbers from `1-10,000` and `10,001-20,000`
+By default, this script searches for prime numbers from `1-10,000` and
+`10,001-20,000`
 You can adjust the range searched per node by providing an integer argument, e.g.:
@@ -56,49 +97,50 @@ By default, each node is allocated:
 The script will then drop you into a `watch -n0.1 squeue` view so you can see
 the job computing on `nodes[3-4]`. You may `CTRL+c` out of this view, and
-the job will continue in the background. The home directory for the `submit`
-user is in the shared `/vagrant` directory, so the results from each node are
-shared back to the login node.
+the job will continue in the background. The `submit` user's home directory
+is in the NFS shared `/vagrant` directory, so the results from each node
+are shared with the login node.
-4. View the resulting prime numbers found, check `ls` for exact filenames
+4. View the resulting prime numbers found (check `ls` for exact filenames)
 less slurm-1_0.out
 less slurm-1_1.out
-### Configuration Tool
+## Configuration Tool
 On the Head Node (`node1`), you can access the configuration tools specific to
 the version distributed with Debian. Since this may not be the latest Slurm
-release, it's important to use the configuration tool that matches the
-installed version. To access these tools, you can use Python to run a simple
-web server:
+release, using the configuration tool that matches the installed version is
+important. To access these tools, you can use Python to run a simple web server
 python3 -m http.server 8080 --directory /usr/share/doc/slurm-wlm/html/
 You can then access the HTML documentation via the VM's IP address at port 8080
 in your web browser on the host machine.
-### Cleanup
+## Cleanup
 To clean up files placed on the host through Vagrant file sharing:
 make clean
-This command is useful when you want to remove all generated files and return
-to a clean state. The Makefile is quite simple, so you can refer to it directly
-to see exactly what's being cleaned up.
+This command is useful to remove all generated files and return to a clean
+state. The Makefile is quite simple, so you can refer to it directly to see
+what's being cleaned up.
 If you have included override settings that you want to remove as well, run:
 git clean -fdx
 This command will remove all untracked files and directories, including those
-ignored by .gitignore. Be cautious when using this command as it will delete
-files that are not tracked by Git. Use the `-n` flag to dry-run first.
+ignored by .gitignore. Be cautious when using this command, as it will delete
+files that Git does not track. Use the `-n` flag to dry-run first.
-## Global Overrides
-**WARNING:** Always update `slurm.conf` to match any CPU overrides to prevent
-resource allocation conflicts.
+## Overrides
+### Global Overrides
+**WARNING:** Always update `slurm.conf` to match any CPU overrides on compute
+nodes to prevent resource allocation conflicts.
 If you wish to override the default settings on a global level,
 you can do so by creating a `.settings.yml` file based on the provided
@@ -109,7 +151,7 @@ you can do so by creating a `.settings.yml` file based on the provided
 Once you have copied the `example-.settings.yml` to `.settings.yml`, you can
 edit it to override the default settings. Below are the available settings:
-### Vagrant Settings Overrides
+#### Vagrant Settings Overrides
 - `VAGRANT_BOX`
 - Default: `debian/bookworm64`
 - Tested most around Debian Stable x86_64 (currently Bookworm)
@@ -123,18 +165,7 @@ edit it to override the default settings. Below are the available settings:
 - Default: `false`
 - Enable this if you need to forward SSH agents to the Vagrant machines
-### Minimal Resource Setup
-Resource-conscious users can copy and use the provided `example-.settings.yml`
-file without modifications. This results in a cluster configuration using only
-1 vCPU and 1 GB RAM per node (totaling 4 threads/cores and 4 GB RAM), allowing
-basic operation on modest hardware.
-When using this minimal setup with 1 vCPU, you'll need to update the `slurm.conf` file.
-Apply the following change to the default `slurm.conf`:
-sed -i 's/CPUs=2/CPUs=1/g' slurm.conf
-### Slurm Settings Overrides
+#### Slurm Settings Overrides
 - `SLURM_NODES`
 - Default: `4`
 - The _total_ number of nodes in your Slurm cluster
@@ -142,7 +173,22 @@ Apply the following change to the default `slurm.conf`:
 - Default: `120`
 - Timeout in seconds for nodes to obtain the shared munge.key
-## Per-Node Overrides
+#### Minimal Resource Setup
+Resource-conscious users can copy and use the provided `example-.settings.yml`
+file without modifications. This results in a cluster configuration using only
+1 vCPU and 1 GB RAM per node (totaling 4 threads/cores and 4 GB RAM), allowing
+basic operation on modest hardware.
+When using this minimal setup with 1 vCPU, you'll need to update the
+`slurm.conf` file. Apply the following change to the default `slurm.conf`:
+sed -i 's/CPUs=2/CPUs=1/g' slurm.conf
+### Per-Node Overrides
+**WARNING:** Remember to update `slurm.conf` to match any CPU overrides on
+compute nodes to prevent resource allocation conflicts.
 The naming convention for nodes follows a specific pattern: `nodeX`, where `X`
 is a number corresponding to the node's position within the cluster. This
 convention is strictly adhered to due to the iteration logic within the
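The warnings added above keep pointing at `slurm.conf`: whenever the compute nodes' CPU count changes, the node definitions there must agree. Purely as an illustration of the kind of lines involved (the repository's actual `slurm.conf` may be laid out differently, and the partition name here is an assumption):

    # Hypothetical slurm.conf excerpt, matching the default 2-CPU compute nodes
    NodeName=node[3-4] CPUs=2 State=UNKNOWN
    PartitionName=debug Nodes=node[3-4] Default=YES MaxTime=INFINITE State=UP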

Vagrantfile (6 changed lines)

@@ -43,12 +43,6 @@ Vagrant.configure(2) do |vm_config|
 virt.cpus = NODES.dig("node#{count}", 'CPU') || VAGRANT_CPU
 end
-# VirtualBox
-config.vm.provider :virtualbox do |vbox|
-vbox.memory = NODES.dig("node#{count}", 'MEM') || VAGRANT_MEM
-vbox.cpus = NODES.dig("node#{count}", 'CPU') || VAGRANT_CPU
-end
 # Install and Setup Slurm
 config.vm.provision "shell", inline: <<-SHELL
 export JOIN_TIMEOUT=#{JOIN_TIMEOUT}
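After removing the VirtualBox block or making other local edits, two standard Vagrant commands (not introduced by this commit) give a quick sanity check and make the provider choice explicit:

    vagrant validate               # confirm the Vagrantfile still parses cleanly
    vagrant up --provider=libvirt  # bring the cluster up with the libvirt provider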

example-.settings.yml

@@ -1,9 +1,14 @@
 ########################
 ### Example settings ###
 ########################
 # This configuration as-is will take 4 threads/cores and 4 GB of RAM total.
-# Set per-node overrides in nodes.rb if your setup requires it
+#
+# Note: This example reduces the CPUs used in the default configuration. If you
+# use these settings, you must update the slurm.conf file to reflect the
+# reduced CPU count. Set the CPUs value in slurm.conf to match the number of
+# vCPUs you specify here, assuming there are no node-specific overrides.
+#
+# You may also set per-node overrides in nodes.rb if needed
 # Vagrant default global overrides
 #VAGRANT_BOX: debian/bookworm64
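Putting the example file and the README's minimal-resource note together, the workflow amounts to something like the following sketch (the editor invocation is arbitrary; the `sed` line is the one quoted in the README):

    cp example-.settings.yml .settings.yml
    ${EDITOR:-vi} .settings.yml              # uncomment or adjust the overrides you want
    sed -i 's/CPUs=2/CPUs=1/g' slurm.conf    # keep slurm.conf in step with the 1 vCPU example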

Example per-node overrides

@@ -2,12 +2,19 @@
 ### Example overrides ###
 #########################
-# This configuration as-is will take 10 threads/cores and 10 GB of RAM assuming
-# that .settings.yml isn't overriding the defaults. Make sure you have enough
-# resources before running something like this.
+# This configuration as-is will take 10 threads/cores and 10 GB of RAM,
+# assuming that .settings.yml isn't overriding the defaults. Make sure you have
+# enough resources to run something like this.
+# Set SLURM_NODES in .settings.yml and update the slurm.conf if you run
+# more/less than 4 total nodes (with 2 compute nodes). If the number of
+# compute nodes changes, this must be reflected in the slurm.conf file.
+#
+# Additionally, if the number of CPUs for the compute nodes changes, such as in
+# this example, this would also need to be reflected in the slurm.conf file.
+#
+# NOTE: The primes.sh script was only designed to run an array across two nodes
-# Set SLURM_NODES in .settings and update the slurm.conf if you run more/less than 4 nodes
-# NOTE: The primes.sh script was only designed to run an array across two nodes.
 NODES = {
 # Head node
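When a node's per-node CPU or MEM override changes, one conservative way to make the new resources take effect (not part of the diff; `node3` is only an example) is to recreate that VM and then re-check `slurm.conf` as the comments above insist:

    vagrant destroy -f node3   # drop the old VM definition
    vagrant up node3           # recreate it with the overridden resources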

Provision script

@@ -8,7 +8,7 @@
 set -xe
 # Increase APT retries and timeouts to improve provisioning reliability
-sudo tee /etc/apt/apt.conf.d/99custom-retries << EOF
+cat > /etc/apt/apt.conf.d/99custom-retries << EOF
 Acquire::Retries "5";
 Acquire::http::Timeout "120";
 Acquire::ftp::Timeout "120";
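Whether the drop-in is written with `sudo tee` or, as of this change, a plain redirection, a quick way to confirm APT actually picks it up (not part of the provision script) is:

    apt-config dump | grep -E 'Acquire::(Retries|http::Timeout|ftp::Timeout)'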