(3.9.0‐3.9.1) Default ThreadsPerCore Slurm setting causes reduced CPU utilization

Issue Description

ParallelCluster does not explicitly set ThreadsPerCore in the compute node configuration, so Slurm uses its default value of 1. Slurm v23.11 introduced a change that requires the ThreadsPerCore setting to match the number of threads per physical core of the underlying instance. For compute resources where multi-threading has not been disabled, this results in CPU underutilization, with usage at around 50%.
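As a rough illustration of where the 50% figure comes from, the sketch below models the schedulable share of an instance's vCPUs as the ratio of configured to actual threads per core. This is a simplified model and not Slurm's actual accounting; the function name and the example values are assumptions made for illustration.

```python
def schedulable_vcpu_fraction(configured_tpc: int, actual_tpc: int) -> float:
    """Simplified model: fraction of an instance's vCPUs that Slurm schedules.

    Assumes Slurm treats each core as offering `configured_tpc` threads,
    while the hardware actually exposes `actual_tpc` threads per core.
    """
    if configured_tpc < 1 or actual_tpc < 1:
        raise ValueError("threads per core must be >= 1")
    return min(configured_tpc, actual_tpc) / actual_tpc


# Hypothetical instance with 2 threads per core (multi-threading enabled).
print(schedulable_vcpu_fraction(configured_tpc=1, actual_tpc=2))  # 0.5
print(schedulable_vcpu_fraction(configured_tpc=2, actual_tpc=2))  # 1.0
```

Under this model, an instance with multi-threading disabled (one thread per core) is unaffected, which is consistent with the issue only applying where multi-threading is enabled.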

Affected Versions (OSes, schedulers)

ParallelCluster 3.9.0, 3.9.1
Slurm 23.11.4
All operating systems supported by ParallelCluster

Mitigation

To mitigate the issue, set the ThreadsPerCore value through the CustomSlurmSettings property of each compute resource in your cluster configuration where multi-threading is enabled (which is the default).

The steps are as follows:

For each compute resource where multi-threading is enabled, add the following section:

    CustomSlurmSettings:
        ThreadsPerCore: <default-threads-per-core>

Note: You can determine the default-threads-per-core of an instance type by running this command:

    aws ec2 describe-instance-types --instance-types <instance-type> --region <region> | grep DefaultThreadsPerCore
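For clusters with many compute resources, the lookup and the configuration change could be scripted. The following is a minimal sketch, assuming the JSON output of the command above has been captured as a string and the cluster configuration has already been loaded into a Python dict. The function names, the sample data, and the queue and resource names are hypothetical, and the sketch assumes each compute resource declares a single InstanceType key.

```python
import json

# Trimmed, illustrative sample of `aws ec2 describe-instance-types` JSON output.
SAMPLE_OUTPUT = """
{
  "InstanceTypes": [
    {
      "InstanceType": "c5.xlarge",
      "VCpuInfo": {"DefaultVCpus": 4, "DefaultCores": 2, "DefaultThreadsPerCore": 2}
    }
  ]
}
"""


def threads_per_core_by_type(describe_output: str) -> dict:
    """Map each instance type in the CLI output to its DefaultThreadsPerCore."""
    data = json.loads(describe_output)
    return {
        item["InstanceType"]: item["VCpuInfo"]["DefaultThreadsPerCore"]
        for item in data["InstanceTypes"]
    }


def apply_threads_per_core(config: dict, tpc_by_type: dict) -> dict:
    """Set CustomSlurmSettings.ThreadsPerCore on multi-threaded compute resources.

    Modifies `config` in place and returns it. Compute resources with
    DisableSimultaneousMultithreading set to true, or whose instance type is
    not in `tpc_by_type`, are left unchanged.
    """
    for queue in config.get("Scheduling", {}).get("SlurmQueues", []):
        for resource in queue.get("ComputeResources", []):
            if resource.get("DisableSimultaneousMultithreading", False):
                continue
            tpc = tpc_by_type.get(resource.get("InstanceType"))
            if tpc is not None:
                resource.setdefault("CustomSlurmSettings", {})["ThreadsPerCore"] = tpc
    return config


# Hypothetical cluster configuration with one queue and one compute resource.
cluster_config = {
    "Scheduling": {
        "Scheduler": "slurm",
        "SlurmQueues": [
            {
                "Name": "queue1",
                "ComputeResources": [{"Name": "cr1", "InstanceType": "c5.xlarge"}],
            }
        ],
    }
}

apply_threads_per_core(cluster_config, threads_per_core_by_type(SAMPLE_OUTPUT))
print(cluster_config["Scheduling"]["SlurmQueues"][0]["ComputeResources"][0])
```

Any existing keys under CustomSlurmSettings are preserved, because the sketch only adds or overwrites ThreadsPerCore.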

For the changes to take effect, update your existing clusters or create new clusters with the updated configuration file, following the instructions here.

Note: If your system is configured with more than one thread per core, running a different job on each thread is not supported. However, a single job can run one task per thread within one job step, or run a distinct job step on each of the threads. This is reported in the official Slurm doc here.
