Update docs and clarify VirtualBox incompatibility
- Detail reasons for VirtualBox incompatibility in README
- Remove VirtualBox-specific config from Vagrantfile
- Add warnings about CPU overrides and slurm.conf updates
- Simplify apt retry setup in the provision script
parent e03fbe5c14
commit 03a0f05b49
128
README.md
@ -6,17 +6,57 @@ This repository contains a `Vagrantfile` and the necessary configuration for
|
|||||||
automating the setup of a Slurm cluster using Vagrant's shell provisioning on
|
automating the setup of a Slurm cluster using Vagrant's shell provisioning on
|
||||||
Debian 12 x86_64 VMs.
|
Debian 12 x86_64 VMs.
|
||||||
|
|
||||||
### Prerequisites
|
## Prerequisites
|
||||||
|
|
||||||
This setup was developed using vagrant-libvirt with NFS for file sharing,
|
This setup was developed using vagrant-libvirt with NFS for file sharing.
|
||||||
rather than the more common VirtualBox configuration which typically uses
|
|
||||||
VirtualBox's Shared Folders. However, VirtualBox should work fine.
|
|
||||||
|
|
||||||
The core requirements for this setup are:
|
- [Vagrant](https://wiki.debian.org/Vagrant)
|
||||||
- Vagrant (with functioning file sharing)
|
(tested with 2.3.4 packaged by Debian 12)
|
||||||
|
- [Vagrant-libvirt provider](https://vagrant-libvirt.github.io/vagrant-libvirt/)
|
||||||
|
(tested with 0.11.2 packaged by Debian 12)
|
||||||
|
- Working Vagrant
|
||||||
|
[Synced Folders using NFS](https://developer.hashicorp.com/vagrant/docs/v2.3.4/synced-folders/nfs)
|
||||||
- (Optional) Make (for convenience commands)
|
- (Optional) Make (for convenience commands)
|
||||||
|
|
||||||
### Cluster Structure
|
### VirtualBox Incompatibility
|
||||||
|
|
||||||
While efforts were made to support VirtualBox, several challenges prevent its
use in the current state.

1. **Hostname Resolution**: Unlike libvirt, VirtualBox doesn't provide
   automatic hostname resolution between VMs. This requires an additional private
   network and potentially custom scripting or plugins to enable inter-VM
   communication by hostname.

2. **Sequential Provisioning**: VirtualBox provisions VMs sequentially, which,
   while preventing a now exceedingly rare race condition with the munge.key
   generation and distribution, significantly increases the overall setup time
   compared to vagrant-libvirt's parallel provisioning and complicates potential
   scripted solutions for hostname resolution.

3. **Shared Folder Permissions**: VirtualBox's shared folder mechanism doesn't
   preserve Unix permissions from the host system. The `vagrant` user owns all
   shared files in the `/vagrant` mount point, complicating the setup of a
   non-privileged `submit` user and stripping the execution bit from shared
   scripts.

4. **Provider-Specific Options**: Using mount options for VirtualBox shared
   folders is incompatible with libvirt, making maintaining a single,
   provider-agnostic configuration challenging.

Potential workarounds like assigning static IPs compromise the flexibility of
the current setup. The fundamental differences between VirtualBox and libvirt
in handling shared folders and networking make it challenging to create a truly
provider-agnostic solution without significant compromises or overhead.

For now, this project focuses on the libvirt provider due to its better
compatibility with the requirements of an automated Slurm cluster setup. Future
development could explore creating separate, provider-specific configurations
to support VirtualBox, acknowledging the additional maintenance this would
require.
|
|
||||||
|
## Cluster Structure
|
||||||
|
|
||||||
- `node1`: Head Node (runs `slurmctld`)
|
- `node1`: Head Node (runs `slurmctld`)
|
||||||
- `node2`: Login/Submit Node
|
- `node2`: Login/Submit Node
|
||||||
- `node3` / `node4`: Compute Nodes (runs `slurmd`)
|
- `node3` / `node4`: Compute Nodes (runs `slurmd`)
|
||||||
@ -25,9 +65,9 @@ By default, each node is allocated:
|
|||||||
- 2 threads/cores (depending on architecture)
|
- 2 threads/cores (depending on architecture)
|
||||||
- 2 GB of RAM
|
- 2 GB of RAM
|
||||||
|
|
||||||
**Warning: 8 vCPUs and 8 GB of RAM is used in total resources**
|
**Warning: 8 vCPUs and 8 GB of RAM are used by default in total resources**
|
||||||
|
|
||||||
## Quick Start
|
## Getting Started
|
||||||
|
|
||||||
1. To build the cluster, you can use either of these methods
|
1. To build the cluster, you can use either of these methods
|
||||||
|
|
||||||
@ -48,7 +88,8 @@ By default, each node is allocated:
|
|||||||
|
|
||||||
/vagrant/primes.sh
|
/vagrant/primes.sh
|
||||||
|
|
||||||
By default, this script searches for prime numbers from `1-10,000` and `10,001-20,000`
|
By default, this script searches for prime numbers from `1-10,000` and
|
||||||
|
`10,001-20,000`
|
||||||
|
|
||||||
You can adjust the range searched per node by providing an integer argument, e.g.:
|
You can adjust the range searched per node by providing an integer argument, e.g.:
|
||||||
|
|
||||||
@ -56,49 +97,50 @@ By default, each node is allocated:
|
|||||||
|
|
||||||
The script will then drop you into a `watch -n0.1 squeue` view so you can see
|
The script will then drop you into a `watch -n0.1 squeue` view so you can see
|
||||||
the job computing on `nodes[3-4]`. You may `CTRL+c` out of this view, and
|
the job computing on `nodes[3-4]`. You may `CTRL+c` out of this view, and
|
||||||
the job will continue in the background. The home directory for the `submit`
|
the job will continue in the background. The `submit` user's home directory
|
||||||
user is in the shared `/vagrant` directory, so the results from each node are
|
is in the NFS shared `/vagrant` directory, so the results from each node
|
||||||
shared back to the login node.
|
are shared with the login node.
|
||||||
|
|
||||||
4. View the resulting prime numbers found, check `ls` for exact filenames
|
4. View the resulting prime numbers found (check `ls` for exact filenames)
|
||||||
|
|
||||||
less slurm-1_0.out
|
less slurm-1_0.out
|
||||||
less slurm-1_1.out
|
less slurm-1_1.out
|
||||||
|
|
||||||
### Configuration Tool
|
## Configuration Tool
|
||||||
|
|
||||||
On the Head Node (`node1`), you can access the configuration tools specific to
|
On the Head Node (`node1`), you can access the configuration tools specific to
|
||||||
the version distributed with Debian. Since this may not be the latest Slurm
|
the version distributed with Debian. Since this may not be the latest Slurm
|
||||||
release, it's important to use the configuration tool that matches the
|
release, using the configuration tool that matches the installed version is
|
||||||
installed version. To access these tools, you can use Python to run a simple
|
important. To access these tools, you can use Python to run a simple web server
|
||||||
web server:
|
|
||||||
|
|
||||||
python3 -m http.server 8080 --directory /usr/share/doc/slurm-wlm/html/
|
python3 -m http.server 8080 --directory /usr/share/doc/slurm-wlm/html/
|
||||||
|
|
||||||
You can then access the HTML documentation via the VM's IP address at port 8080
|
You can then access the HTML documentation via the VM's IP address at port 8080
|
||||||
in your web browser on the host machine.
|
in your web browser on the host machine.
|
||||||
|
|
||||||
### Cleanup
|
## Cleanup
|
||||||
To clean up files placed on the host through Vagrant file sharing:
|
To clean up files placed on the host through Vagrant file sharing:
|
||||||
|
|
||||||
make clean
|
make clean
|
||||||
|
|
||||||
This command is useful when you want to remove all generated files and return
|
This command is useful to remove all generated files and return to a clean
|
||||||
to a clean state. The Makefile is quite simple, so you can refer to it directly
|
state. The Makefile is quite simple, so you can refer to it directly to see
|
||||||
to see exactly what's being cleaned up.
|
what's being cleaned up.
|
||||||
|
|
||||||
If you have included override settings that you want to remove as well, run:
|
If you have included override settings that you want to remove as well, run:
|
||||||
|
|
||||||
git clean -fdx
|
git clean -fdx
|
||||||
|
|
||||||
This command will remove all untracked files and directories, including those
|
This command will remove all untracked files and directories, including those
|
||||||
ignored by .gitignore. Be cautious when using this command as it will delete
|
ignored by .gitignore. Be cautious when using this command, as it will delete
|
||||||
files that are not tracked by Git. Use the `-n` flag to dry-run first.
|
files that Git does not track. Use the `-n` flag to dry-run first.
|
||||||
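To make the dry-run advice above concrete, here is a minimal sketch that runs `git clean` with `-n` in a throwaway repository. The scratch directory and the `scratch.out` file are assumptions made for the demo and are not part of this project; only the `-n`, `-d`, and `-x` flags come from the README text.

```shell
# Scratch repository so nothing in a real checkout is touched (demo assumption).
repo="$(mktemp -d)"
git -C "$repo" init -q

# An untracked file standing in for generated output (hypothetical name).
touch "$repo/scratch.out"

# -n lists what would be deleted without deleting anything;
# -d includes untracked directories, -x includes ignored files.
dry_run_output="$(git -C "$repo" clean -ndx)"
echo "$dry_run_output"
```

Once the list looks right, replacing `-n` with `-f` performs the actual removal.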
|
|
||||||
## Global Overrides
|
## Overrides
|
||||||
|
|
||||||
**WARNING:** Always update `slurm.conf` to match any CPU overrides to prevent
|
### Global Overrides
|
||||||
resource allocation conflicts.
|
|
||||||
|
**WARNING:** Always update `slurm.conf` to match any CPU overrides on compute
|
||||||
|
nodes to prevent resource allocation conflicts.
|
||||||
|
|
||||||
If you wish to override the default settings on a global level,
|
If you wish to override the default settings on a global level,
|
||||||
you can do so by creating a `.settings.yml` file based on the provided
|
you can do so by creating a `.settings.yml` file based on the provided
|
||||||
@ -109,7 +151,7 @@ you can do so by creating a `.settings.yml` file based on the provided
|
|||||||
Once you have copied the `example-.settings.yml` to `.settings.yml`, you can
|
Once you have copied the `example-.settings.yml` to `.settings.yml`, you can
|
||||||
edit it to override the default settings. Below are the available settings:
|
edit it to override the default settings. Below are the available settings:
|
||||||
|
|
||||||
### Vagrant Settings Overrides
|
#### Vagrant Settings Overrides
|
||||||
- `VAGRANT_BOX`
|
- `VAGRANT_BOX`
|
||||||
- Default: `debian/bookworm64`
|
- Default: `debian/bookworm64`
|
||||||
- Tested most around Debian Stable x86_64 (currently Bookworm)
|
- Tested most around Debian Stable x86_64 (currently Bookworm)
|
||||||
@ -123,18 +165,7 @@ edit it to override the default settings. Below are the available settings:
|
|||||||
- Default: `false`
|
- Default: `false`
|
||||||
- Enable this if you need to forward SSH agents to the Vagrant machines
|
- Enable this if you need to forward SSH agents to the Vagrant machines
|
||||||
|
|
||||||
### Minimal Resource Setup
|
#### Slurm Settings Overrides
|
||||||
Resource-conscious users can copy and use the provided `example-.settings.yml`
|
|
||||||
file without modifications. This results in a cluster configuration using only
|
|
||||||
1 vCPU and 1 GB RAM per node (totaling 4 threads/cores and 4 GB RAM), allowing
|
|
||||||
basic operation on modest hardware.
|
|
||||||
|
|
||||||
When using this minimal setup with 1 vCPU, you'll need to update the `slurm.conf` file.
|
|
||||||
Apply the following change to the default `slurm.conf`:
|
|
||||||
|
|
||||||
sed -i 's/CPUs=2/CPUs=1/g' slurm.conf
|
|
||||||
|
|
||||||
### Slurm Settings Overrides
|
|
||||||
- `SLURM_NODES`
|
- `SLURM_NODES`
|
||||||
- Default: `4`
|
- Default: `4`
|
||||||
- The _total_ number of nodes in your Slurm cluster
|
- The _total_ number of nodes in your Slurm cluster
|
||||||
@ -142,7 +173,22 @@ Apply the following change to the default `slurm.conf`:
|
|||||||
- Default: `120`
|
- Default: `120`
|
||||||
- Timeout in seconds for nodes to obtain the shared munge.key
|
- Timeout in seconds for nodes to obtain the shared munge.key
|
||||||
|
|
||||||
## Per-Node Overrides
|
#### Minimal Resource Setup
|
||||||
|
Resource-conscious users can copy and use the provided `example-.settings.yml`
|
||||||
|
file without modifications. This results in a cluster configuration using only
|
||||||
|
1 vCPU and 1 GB RAM per node (totaling 4 threads/cores and 4 GB RAM), allowing
|
||||||
|
basic operation on modest hardware.
|
||||||
|
|
||||||
|
When using this minimal setup with 1 vCPU, you'll need to update the
|
||||||
|
`slurm.conf` file. Apply the following change to the default `slurm.conf`:
|
||||||
|
|
||||||
|
sed -i 's/CPUs=2/CPUs=1/g' slurm.conf
|
||||||
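As a rough illustration of what the `sed` substitution above does, the following sketch applies it to a scratch file. The two `NodeName` lines are hypothetical stand-ins and not the repository's actual `slurm.conf`; only the `CPUs=2` to `CPUs=1` substitution is taken from the README.

```shell
# Work on a scratch copy so the real slurm.conf is untouched (demo assumption).
tmpdir="$(mktemp -d)"
conf="$tmpdir/slurm.conf"

# Hypothetical compute-node definitions, for illustration only.
cat > "$conf" << 'EOF'
NodeName=node3 CPUs=2 State=UNKNOWN
NodeName=node4 CPUs=2 State=UNKNOWN
EOF

# Same substitution as in the README (GNU sed in-place edit).
sed -i 's/CPUs=2/CPUs=1/g' "$conf"

# Count the lines that now declare a single CPU.
updated="$(grep -c 'CPUs=1' "$conf")"
echo "lines with CPUs=1: $updated"
```

Note that the pattern is not anchored, so it would also rewrite the start of a longer value such as `CPUs=24`. That is harmless with the default of 2 CPUs but worth checking if the file has been customized.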
|
|
||||||
|
### Per-Node Overrides
|
||||||
|
|
||||||
|
**WARNING:** Remember to update `slurm.conf` to match any CPU overrides on
|
||||||
|
compute nodes to prevent resource allocation conflicts.
|
||||||
|
|
||||||
The naming convention for nodes follows a specific pattern: `nodeX`, where `X`
|
The naming convention for nodes follows a specific pattern: `nodeX`, where `X`
|
||||||
is a number corresponding to the node's position within the cluster. This
|
is a number corresponding to the node's position within the cluster. This
|
||||||
convention is strictly adhered to due to the iteration logic within the
|
convention is strictly adhered to due to the iteration logic within the
|
||||||
|
6
Vagrantfile
vendored
@ -43,12 +43,6 @@ Vagrant.configure(2) do |vm_config|
|
|||||||
virt.cpus = NODES.dig("node#{count}", 'CPU') || VAGRANT_CPU
|
virt.cpus = NODES.dig("node#{count}", 'CPU') || VAGRANT_CPU
|
||||||
end
|
end
|
||||||
|
|
||||||
# VirtualBox
|
|
||||||
config.vm.provider :virtualbox do |vbox|
|
|
||||||
vbox.memory = NODES.dig("node#{count}", 'MEM') || VAGRANT_MEM
|
|
||||||
vbox.cpus = NODES.dig("node#{count}", 'CPU') || VAGRANT_CPU
|
|
||||||
end
|
|
||||||
|
|
||||||
# Install and Setup Slurm
|
# Install and Setup Slurm
|
||||||
config.vm.provision "shell", inline: <<-SHELL
|
config.vm.provision "shell", inline: <<-SHELL
|
||||||
export JOIN_TIMEOUT=#{JOIN_TIMEOUT}
|
export JOIN_TIMEOUT=#{JOIN_TIMEOUT}
|
||||||
|
@ -1,9 +1,14 @@
|
|||||||
########################
|
########################
|
||||||
### Example settings ###
|
### Example settings ###
|
||||||
########################
|
########################
|
||||||
|
|
||||||
# This configuration as-is will take 4 threads/cores and 4 GB of RAM total.
|
# This configuration as-is will take 4 threads/cores and 4 GB of RAM total.
|
||||||
# Set per-node overrides in nodes.rb if your setup requires it
|
#
|
||||||
|
# Note: This example reduces the CPUs used in the default configuration. If you
|
||||||
|
# use these settings, you must update the slurm.conf file to reflect the
|
||||||
|
# reduced CPU count. Set the CPUs value in slurm.conf to match the number of
|
||||||
|
# vCPUs you specify here, assuming there are no node-specific overrides.
|
||||||
|
#
|
||||||
|
# You may also set per-node overrides in nodes.rb if needed
|
||||||
|
|
||||||
# Vagrant default global overrides
|
# Vagrant default global overrides
|
||||||
#VAGRANT_BOX: debian/bookworm64
|
#VAGRANT_BOX: debian/bookworm64
|
||||||
|
@ -2,12 +2,19 @@
|
|||||||
### Example overrides ###
|
### Example overrides ###
|
||||||
#########################
|
#########################
|
||||||
|
|
||||||
# This configuration as-is will take 10 threads/cores and 10 GB of RAM assuming
|
# This configuration as-is will take 10 threads/cores and 10 GB of RAM,
|
||||||
# that .settings.yml isn't overriding the defaults. Make sure you have enough
|
# assuming that .settings.yml isn't overriding the defaults. Make sure you have
|
||||||
# resources before running something like this.
|
# enough resources to run something like this.
|
||||||
|
|
||||||
|
# Set SLURM_NODES in .settings.yml and update the slurm.conf if you run
|
||||||
|
# more/less than 4 total nodes (with 2 compute nodes). If the number of
|
||||||
|
# compute nodes changes, this must be reflected in the slurm.conf file.
|
||||||
|
#
|
||||||
|
# Additionally, if the number of CPUs for the compute nodes changes, such as in
|
||||||
|
# this example, this would also need to be reflected in the slurm.conf file.
|
||||||
|
#
|
||||||
|
# NOTE: The primes.sh script was only designed to run an array across two nodes
|
||||||
|
|
||||||
# Set SLURM_NODES in .settings and update the slurm.conf if you run more/less than 4 nodes
|
|
||||||
# NOTE: The primes.sh script was only designed to run an array across two nodes.
|
|
||||||
|
|
||||||
NODES = {
|
NODES = {
|
||||||
# Head node
|
# Head node
|
||||||
|
@ -8,7 +8,7 @@
|
|||||||
set -xe
|
set -xe
|
||||||
|
|
||||||
# Increase APT retries and timeouts to improve provisioning reliability
|
# Increase APT retries and timeouts to improve provisioning reliability
|
||||||
sudo tee /etc/apt/apt.conf.d/99custom-retries << EOF
|
cat > /etc/apt/apt.conf.d/99custom-retries << EOF
|
||||||
Acquire::Retries "5";
|
Acquire::Retries "5";
|
||||||
Acquire::http::Timeout "120";
|
Acquire::http::Timeout "120";
|
||||||
Acquire::ftp::Timeout "120";
|
Acquire::ftp::Timeout "120";
|
||||||
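The change from `sudo tee` to a plain `cat >` redirect relies on the script already running as root, which is the default for Vagrant's shell provisioner. The sketch below shows the same heredoc pattern writing to a scratch file instead of `/etc/apt/apt.conf.d/99custom-retries`; the temporary path is an assumption for the demo, and the three `Acquire` lines are the ones visible in the diff.

```shell
# Write to a scratch path rather than /etc/apt/apt.conf.d (demo assumption).
apt_conf="$(mktemp)"

# Redirect form of the heredoc: no sudo or tee needed when already root.
cat > "$apt_conf" << EOF
Acquire::Retries "5";
Acquire::http::Timeout "120";
Acquire::ftp::Timeout "120";
EOF

# Read one setting back to confirm the file was written as expected.
retries_line="$(grep '^Acquire::Retries' "$apt_conf")"
echo "$retries_line"
```

Unlike `tee`, the redirect does not echo the file contents to standard output, which keeps the provisioning log slightly quieter.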
|