These are simple helm charts to run HPC applications in Kubernetes using the Flux Operator. You can customize each application to your needs, from the container to the size to the number of iterations. We have a simple strategy that uses:
- base-template: A base MiniCluster template that is used across apps.
- Applications: each is included in a subdirectory here. Usage is consistent across applications, with the exception of application-specific parameters, which are documented in the respective READMEs.
Each application can be customized for anything related to the MiniCluster (e.g., size, flux view, logging, TBA resources) and anything related to the application itself (parameters, containers, etc.). Given the use of a common template, the actual definition of an application is fairly small (and thus easy to write). This is a nice approach because:
- We don't require extra software installed into the MiniCluster
- An application definition is simple (and can be written easily / quickly)
- Changing logic for the MiniCluster only needs to be done in one place!
- Applications can be programmatically built and tested (when possible)
- Experiments can be orchestrated using these helm charts with a custom values.yaml for each application (see our example runs in the Google Performance Study).
The following variables are available for every experiment and are already part of the template. Variables with a default will have that default set; otherwise the corresponding flag (or similar) is usually left out.
Name | Description | Default | Options |
---|---|---|---|
nodes | Number of nodes -N for each job | 1 | |
tasks | Number of tasks -n for each job | unset | |
cpu_affinity | Set --cpu-affinity | per-task | (off,per-task,map:LIST,on) |
gpu_affinity | Set --gpu-affinity | off | (off,per-task,map:LIST,on) |
run_threads | Sets OMP_NUM_THREADS | unset | |
cores_per_task | Set --cores-per-task | unset | |
exclusive | Add the --exclusive flag | unset | |
You define them via --set experiment.<name>=<value> or in a values.yaml to create the experiment from:
experiment:
  nodes: 5
Experiment-specific variables are defined in the values.yaml files associated with the experiment.
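For example, you could write a small values file and point helm at it. This is a sketch: my-values.yaml and the release name are hypothetical, and application-specific keys come from each chart's own values.yaml.

```bash
# Sketch: write a hypothetical custom values file for an experiment
cat > my-values.yaml <<'EOF'
experiment:
  nodes: 4
  tasks: 8
  iterations: 3
EOF

# Install any chart here with it (replace <app> with a chart directory)
helm install my-run -f ./my-values.yaml ./<app>
```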
This example will walk through running lammps. Other example runs are also provided below.
For simple local development:
# Create the cluster
kind create cluster --config ./kind-config.yaml
For eBPF (which requires mounting the host) I recommend a cloud cluster:
NODES=1
GOOGLE_PROJECT=llnl-flux
INSTANCE=h3-standard-88
time gcloud container clusters create test-cluster --threads-per-core=1 --num-nodes=$NODES --machine-type=$INSTANCE --placement-type=COMPACT --image-type=UBUNTU_CONTAINERD --region=us-central1-a --project=${GOOGLE_PROJECT}
# When it is time to delete
gcloud container clusters delete test-cluster --region=us-central1-a
Finally, install the Flux Operator:
kubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/refs/heads/main/examples/dist/flux-operator.yaml
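To confirm the operator is running before installing a chart, check for its controller pod (the namespace depends on how you installed it, so this just greps across all namespaces):

```bash
# Verify the Flux Operator controller is up (namespace may vary by install)
kubectl get pods --all-namespaces | grep -i operator
```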
Here are the values we can customize (any value can be exposed, really; it's very simple).
helm show values ./lammps-reax
# Default values for lammps experiment
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
# Logging (quiet will hide flux setup)
logging:
  quiet: true
experiment:
  iterations: 1
  nodes: 1
  tasks: 2
env:
  app: lammps
lammps:
  binary: /usr/bin/lmp
  input: in.reaxff.hns
  x: 2
  y: 2
  z: 2
flux:
  image: ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy
minicluster:
  # Container image
  image: "ghcr.io/converged-computing/lammps-reax:ubuntu2204"
  # Interactive MiniCluster?
  interactive: false
  # MiniCluster size
  size: 1
  # Number of NVIDIA gpus
  gpus: 0
  # Add flux on the fly (set to false if Flux is already in the container)
  addFlux: false
If there are changes to the base template:
helm dependency update lammps-reax/
helm install lammps lammps-reax/ --debug --dry-run
Then install the chart. This will deploy the Flux MiniCluster and run lammps for some number of iterations. All variables are technically defined, so you don't need any --set.
container=$(ocifit ghcr.io/converged-computing/lammps-reax --instance)
helm install \
--set minicluster.size=1 \
--set minicluster.image=${container} \
--set minicluster.addFlux=true \
lammps ./lammps-reax
NAME: lammps
LAST DEPLOYED: Sun May 11 13:10:50 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Or just look at the chart. If you want to debug or otherwise print the rendered templates to the console:
helm template --debug \
--set minicluster.size=4 \
--set minicluster.image=ghcr.io/converged-computing/metric-lammps-cpu:zen4-reax \
./lammps-reax
The output can be seen in the lead broker pod!
kubectl logs lammps-0-xxxx -f
LAMMPS output
Defaulted container "lammps" out of: lammps, flux-view (init)
#!/bin/bash
set -euo pipefail
mkdir -p /tmp/output
flux resource list
for i in {1..1}
do
echo "FLUX-RUN START lammps-iter-$i"
flux run --setattr=user.study_id=lammps-iter-$i -N1 -n 2 -o cpu-affinity=per-task -o gpu-affinity=off /usr/bin/lmp -v x 2 -v y 2 -v z 2 -in in.reaxff.hns -nocite |& tee /tmp/lammps.out
echo "FLUX-RUN END lammps-iter-$i"
done
STATE NNODES NCORES NGPUS NODELIST
free 1 8 0 lammps-0
allocated 0 0 0
down 0 0 0
FLUX-RUN START lammps-iter-1
LAMMPS (17 Apr 2024 - Development - a8687b5)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Reading data file ...
triclinic box = (0 0 0) to (22.326 11.1412 13.778966) with tilt (0 -5.02603 0)
2 by 1 by 1 MPI processor grid
reading atoms ...
304 atoms
reading velocities ...
304 velocities
read_data CPU = 0.001 seconds
Replication is creating a 2x2x2 = 8 times larger system...
triclinic box = (0 0 0) to (44.652 22.2824 27.557932) with tilt (0 -10.05206 0)
2 by 1 by 1 MPI processor grid
bounding box image = (0 -1 -1) to (0 1 1)
bounding box extra memory = 0.03 MB
average # of replicas added to proc = 5.00 out of 8 (62.50%)
2432 atoms
replicate CPU = 0.000 seconds
Neighbor list info ...
update: every = 20 steps, delay = 0 steps, check = no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 11
ghost atom cutoff = 11
binsize = 5.5, bins = 10 5 6
2 neighbor lists, perpetual/occasional/extra = 2 0 0
(1) pair reaxff, perpetual
attributes: half, newton off, ghost
pair build: half/bin/ghost/newtoff
stencil: full/ghost/bin/3d
bin: standard
(2) fix qeq/reax, perpetual, copy from (1)
attributes: half, newton off
pair build: copy
stencil: none
bin: none
Setting up Verlet run ...
Unit style : real
Current step : 0
Time step : 0.1
Per MPI rank memory allocation (min/avg/max) = 143.9 | 143.9 | 143.9 Mbytes
Step Temp PotEng Press E_vdwl E_coul Volume
0 300 -113.27833 437.52134 -111.57687 -1.7014647 27418.867
10 299.38517 -113.27631 1439.2511 -111.57492 -1.7013814 27418.867
20 300.27107 -113.27884 3764.3921 -111.57762 -1.7012246 27418.867
30 302.21063 -113.28428 7007.6315 -111.58335 -1.7009364 27418.867
40 303.52265 -113.28799 9844.7899 -111.58747 -1.7005187 27418.867
50 301.87059 -113.28324 9663.0837 -111.58318 -1.7000523 27418.867
60 296.67807 -113.26777 7273.8688 -111.56815 -1.6996136 27418.867
70 292.2 -113.25435 5533.5999 -111.55514 -1.6992157 27418.867
80 293.58679 -113.25831 5993.3978 -111.55946 -1.6988534 27418.867
90 300.62637 -113.27925 7202.8885 -111.58069 -1.6985591 27418.867
100 305.38276 -113.29357 10085.741 -111.59518 -1.6983875 27418.867
Loop time of 9.43821 on 2 procs for 100 steps with 2432 atoms
Performance: 0.092 ns/day, 262.173 hours/ns, 10.595 timesteps/s, 25.768 katom-step/s
99.8% CPU use with 2 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 6.9119 | 7.0673 | 7.2228 | 5.8 | 74.88
Neigh | 0.11603 | 0.11763 | 0.11922 | 0.5 | 1.25
Comm | 0.013927 | 0.16934 | 0.32476 | 37.8 | 1.79
Output | 0.00029069 | 0.00029232 | 0.00029395 | 0.0 | 0.00
Modify | 2.0813 | 2.0829 | 2.0845 | 0.1 | 22.07
Other | | 0.0006819 | | | 0.01
Nlocal: 1216 ave 1216 max 1216 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Nghost: 7591.5 ave 7597 max 7586 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 432912 ave 432942 max 432882 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Total # of neighbors = 865824
Ave neighs/atom = 356.01316
Neighbor list builds = 5
Dangerous builds not checked
Total wall time: 0:00:09
To clean up the run, you need to uninstall:
helm uninstall lammps
If you specify more than one iteration (as we often do when running experiments), each iteration will be run.
helm install \
--set minicluster.size=1 \
--set minicluster.image=ghcr.io/converged-computing/metric-lammps-cpu:zen4-reax \
--set experiment.iterations=2 \
--set minicluster.addFlux=true \
lammps ./lammps-reax
For actual experiments, we usually want to capture the total wrapped duration and other events from the workload manager, and to be able to pipe the entire kubectl logs output to a file that we can parse later. That's easy to add:
```bash
helm install \
--set minicluster.save_logs=true \
lammps ./lammps-reax
```
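Then follow the lead broker pod and tee the complete log to a file for later parsing. This mirrors the pattern used in the loops below; it assumes the release is named lammps as above and that a ./logs directory exists.

```bash
# Wait for the lead broker pod, then stream and save the full log
time kubectl wait --for=condition=ready pod -l job-name=lammps --timeout=600s
pod=$(kubectl get pods -o json | jq -r .items[0].metadata.name)
kubectl logs ${pod} -f |& tee ./logs/lammps.out
```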
Here is how the output file has changed:
LAMMPS output with flux events
Defaulted container "lammps" out of: lammps, flux-view (init)
#!/bin/bash
set -euo pipefail
mkdir -p /tmp/output
flux resource list
for i in {1..1}
do
echo "FLUX-RUN START lammps-iter-$i"
flux run --setattr=user.study_id=lammps-iter-$i -N1 -n 2 -o cpu-affinity=per-task -o gpu-affinity=off /usr/bin/lmp -v x 2 -v y 2 -v z 2 -in in.reaxff.hns -nocite |& tee /tmp/lammps.out
echo "FLUX-RUN END lammps-iter-$i"
done
output=./results/${app}
(apt-get update && apt-get install -y jq) || (yum update -y && yum install -y jq)
mkdir -p $output
for jobid in $(flux jobs -a --json | jq -r .jobs[].id); do
echo
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "FLUX-JOB START ${jobid} ${study_id}"
echo "FLUX-JOB-JOBSPEC START"
flux job info $jobid jobspec
echo "FLUX-JOB-JOBSPEC END"
echo "FLUX-JOB-RESOURCES START"
flux job info ${jobid} R
echo "FLUX-JOB-RESOURCES END"
echo "FLUX-JOB-EVENTLOG START"
flux job info $jobid guest.exec.eventlog
echo "FLUX-JOB-EVENTLOG END"
echo "FLUX-JOB END ${jobid} ${study_id}"
done
echo "FLUX JOB STATS"
flux job stats
STATE NNODES NCORES NGPUS NODELIST
free 1 8 0 lammps-0
allocated 0 0 0
down 0 0 0
FLUX-RUN START lammps-iter-1
LAMMPS (17 Apr 2024 - Development - a8687b5)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Reading data file ...
triclinic box = (0 0 0) to (22.326 11.1412 13.778966) with tilt (0 -5.02603 0)
2 by 1 by 1 MPI processor grid
reading atoms ...
304 atoms
reading velocities ...
304 velocities
read_data CPU = 0.002 seconds
Replication is creating a 2x2x2 = 8 times larger system...
triclinic box = (0 0 0) to (44.652 22.2824 27.557932) with tilt (0 -10.05206 0)
2 by 1 by 1 MPI processor grid
bounding box image = (0 -1 -1) to (0 1 1)
bounding box extra memory = 0.03 MB
average # of replicas added to proc = 5.00 out of 8 (62.50%)
2432 atoms
replicate CPU = 0.000 seconds
Neighbor list info ...
update: every = 20 steps, delay = 0 steps, check = no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 11
ghost atom cutoff = 11
binsize = 5.5, bins = 10 5 6
2 neighbor lists, perpetual/occasional/extra = 2 0 0
(1) pair reaxff, perpetual
attributes: half, newton off, ghost
pair build: half/bin/ghost/newtoff
stencil: full/ghost/bin/3d
bin: standard
(2) fix qeq/reax, perpetual, copy from (1)
attributes: half, newton off
pair build: copy
stencil: none
bin: none
Setting up Verlet run ...
Unit style : real
Current step : 0
Time step : 0.1
Per MPI rank memory allocation (min/avg/max) = 143.9 | 143.9 | 143.9 Mbytes
Step Temp PotEng Press E_vdwl E_coul Volume
0 300 -113.27833 437.52134 -111.57687 -1.7014647 27418.867
10 299.38517 -113.27631 1439.2511 -111.57492 -1.7013814 27418.867
20 300.27107 -113.27884 3764.3921 -111.57762 -1.7012246 27418.867
30 302.21063 -113.28428 7007.6315 -111.58335 -1.7009364 27418.867
40 303.52265 -113.28799 9844.7899 -111.58747 -1.7005187 27418.867
50 301.87059 -113.28324 9663.0837 -111.58318 -1.7000523 27418.867
60 296.67807 -113.26777 7273.8688 -111.56815 -1.6996136 27418.867
70 292.2 -113.25435 5533.5999 -111.55514 -1.6992157 27418.867
80 293.58679 -113.25831 5993.3978 -111.55946 -1.6988534 27418.867
90 300.62637 -113.27925 7202.8885 -111.58069 -1.6985591 27418.867
100 305.38276 -113.29357 10085.741 -111.59518 -1.6983875 27418.867
Loop time of 9.48714 on 2 procs for 100 steps with 2432 atoms
Performance: 0.091 ns/day, 263.532 hours/ns, 10.541 timesteps/s, 25.635 katom-step/s
99.8% CPU use with 2 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 6.8829 | 7.0529 | 7.2229 | 6.4 | 74.34
Neigh | 0.11578 | 0.11587 | 0.11596 | 0.0 | 1.22
Comm | 0.010545 | 0.18042 | 0.35029 | 40.0 | 1.90
Output | 0.00031558 | 0.00032584 | 0.0003361 | 0.0 | 0.00
Modify | 2.1369 | 2.137 | 2.137 | 0.0 | 22.52
Other | | 0.0006946 | | | 0.01
Nlocal: 1216 ave 1216 max 1216 min
Histogram: 2 0 0 0 0 0 0 0 0 0
Nghost: 7591.5 ave 7597 max 7586 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Neighs: 432912 ave 432942 max 432882 min
Histogram: 1 0 0 0 0 0 0 0 0 1
Total # of neighbors = 865824
Ave neighs/atom = 356.01316
Neighbor list builds = 5
Dangerous builds not checked
Total wall time: 0:00:09
FLUX-RUN END lammps-iter-1
Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:3 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [2901 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:5 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1245 kB]
Get:6 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [4282 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [3211 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1546 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [4436 kB]
Get:11 http://archive.ubuntu.com/ubuntu jammy-backports/main amd64 Packages [83.2 kB]
Fetched 18.1 MB in 2s (8642 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
jq is already the newest version (1.6-2.1ubuntu3).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
FLUX-JOB START 6660554752 lammps-iter-1
FLUX-JOB-JOBSPEC START
{"resources": [{"type": "node", "count": 1, "with": [{"type": "slot", "count": 2, "with": [{"type": "core", "count": 1}], "label": "task"}]}], "tasks": [{"command": ["/usr/bin/lmp", "-v", "x", "2", "-v", "y", "2", "-v", "z", "2", "-in", "in.reaxff.hns", "-nocite"], "slot": "task", "count": {"per_slot": 1}}], "attributes": {"system": {"duration": 0, "cwd": "/opt/lammps/examples/reaxff/HNS", "shell": {"options": {"rlimit": {"cpu": -1, "fsize": -1, "data": -1, "stack": 8388608, "core": -1, "nofile": 1048576, "as": -1, "rss": -1, "nproc": -1}, "cpu-affinity": "per-task", "gpu-affinity": "off"}}}, "user": {"study_id": "lammps-iter-1"}}, "version": 1}
FLUX-JOB-JOBSPEC END
FLUX-JOB-RESOURCES START
{"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "6-7"}}], "nodelist": ["lammps-0"], "starttime": 1746991421, "expiration": 4900591421}}
FLUX-JOB-RESOURCES END
FLUX-JOB-EVENTLOG START
{"timestamp":1746991421.5843747,"name":"init"}
{"timestamp":1746991421.5915587,"name":"shell.init","context":{"service":"0-shell-fB9a5su","leader-rank":0,"size":1}}
{"timestamp":1746991421.5945508,"name":"shell.start","context":{"taskmap":{"version":1,"map":[[0,1,2,1]]}}}
{"timestamp":1746991421.5846651,"name":"starting"}
{"timestamp":1746991432.7146101,"name":"shell.task-exit","context":{"localid":1,"rank":1,"state":"Exited","pid":107,"wait_status":0,"signaled":0,"exitcode":0}}
{"timestamp":1746991432.7171538,"name":"complete","context":{"status":0}}
{"timestamp":1746991432.7171805,"name":"done"}
FLUX-JOB-EVENTLOG END
FLUX-JOB END 6660554752 lammps-iter-1
FLUX JOB STATS
{"job_states":{"depend":0,"priority":0,"sched":0,"run":0,"cleanup":0,"inactive":1,"total":1},"successful":1,"failed":0,"canceled":0,"timeout":0,"inactive_purged":0,"queues":[]}
We will provide functions in an associated library that are useful for parsing the metadata from these logs.
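Until that library is available, here is a minimal sketch of the idea (not the library itself; the log path is hypothetical): split a saved log on the FLUX-RUN markers.

```bash
# Sketch: split a saved log into one file per FLUX-RUN section, keyed by study id
log=./logs/lammps.out   # hypothetical path to a saved "kubectl logs" capture
mkdir -p ./parsed
awk '/^FLUX-RUN START/ {id=$3; out="./parsed/" id ".out"; next}
     /^FLUX-RUN END/   {close(out); id=""; next}
     id != "" {print > out}' "$log"
```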
The normal running mode assumes a distributed application (across nodes) and simply runs iterations and prints application output from the lead broker. However, we have a few custom modes for different cases.
For paired runs (between pairs of nodes) you might want to run something that selects samples from pairs. We support that with experiment.pairs. Here is how to select 28 combinations of 8 nodes (the pairs parameter), 2 at a time, for a loop over two OSU benchmarks. This is intended to run in kind on a local machine, but you'd want to adjust the sizes for your cluster.
helm dependency update osu-benchmarks
for app in osu_latency osu_bw
do
helm install \
--set experiment.nodes=8 \
--set minicluster.size=8 \
--set minicluster.tasks=12 \
--set minicluster.save_logs=true \
--set experiment.pairs=8 \
--set osu.binary=/opt/osu-benchmark/build.openmpi/mpi/pt2pt/$app \
--set experiment.tasks=2 \
osu osu-benchmarks/
sleep 5
time kubectl wait --for=condition=ready pod -l job-name=osu --timeout=600s
pod=$(kubectl get pods -o json | jq -r .items[0].metadata.name)
kubectl logs ${pod} -f
helm uninstall osu
done
The sleep isn't strictly necessary, but on rare occasions the deployment is slow enough that the next line (the kubectl wait) errors out. To save to an output file, you would change the second-to-last line in the loop:
kubectl logs ${pod} -f |& tee ./logs/$app.out
If you have a single node benchmark, you might want to run one instance on each node in the cluster. Here is an example of doing that with our single node benchmark.
helm dependency update ./single-node
helm install \
--set experiment.nodes=2 \
--set minicluster.size=2 \
--set minicluster.tasks=8 \
--set experiment.tasks=1 \
--set minicluster.save_logs=true \
--set minicluster.show_logs=true \
--set experiment.foreach=true \
--set experiment.iterations=1 \
single-node ./single-node
time kubectl wait --for=condition=ready pod -l job-name=single-node --timeout=600s
pod=$(kubectl get pods -o json | jq -r .items[0].metadata.name)
kubectl logs ${pod} -f
helm uninstall single-node
For this setup, the jobs are launched with flux submit, so they run at the same time on single nodes. The output is presented afterward, along with the flux events, which is why you want to set minicluster.show_logs=true to see that output.
This setup will deploy a sidecar and monitor different interactions with bcc. We have several programs that help to understand tcp, file open/closes, cpu, shared memory, or futex wait times. There are two approaches:
- Multiple sidecars per pod: this adds overhead, but that is acceptable given what the HPC community already does, and the benefit is measuring the same thing across applications.
- Single sidecar per pod (with metrics distributed across the cluster): low to zero overhead, and better for summary metrics or models. We use an algorithm to select from the set of programs you request.
Although for both approaches you can filter to a cgroup or command, by default we allow all containers in the pod to be seen. This generates a lot more data, but it is interesting. Here is how to select a metric for the single-sidecar-per-pod method:
helm install \
--set experiment.monitor=open-close \
--set minicluster.save_logs=true \
--dry-run lammps ./lammps-reax
Looking for /opt/programs/open-close/ebpf-collect.c
Starting eBPF (Tracepoint for open entry).
Start Indicator file defined '/mnt/flux/start_ebpf_collection'. Waiting.
{"event": "OPEN", "command": "python3", "retval": 12, "ts_sec": 779.540036095, "tgid": 0, "tid": 14554, "ppid": 14554, "cgroup_id": 0, "filename": "/sys/bus/event_source/devices/kprobe/type"}
{"event": "OPEN", "command": "python3", "retval": 12, "ts_sec": 779.540051234, "tgid": 0, "tid": 14554, "ppid": 14554, "cgroup_id": 0, "filename": "/sys/bus/event_source/devices/kprobe/format/retprobe"}
{"event": "OPEN", "command": "containerd", "retval": 193, "ts_sec": 779.629315113, "tgid": 0, "tid": 3600, "ppid": 3618, "cgroup_id": 0, "filename": "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/204/fs"}
{"event": "CLOSE", "command": "containerd", "retval": 0, "ts_sec": 779.629342785, "tgid": 1, "tid": 3600, "ppid": 3618, "cgroup_id": 6520}
...
{"event": "OPEN", "command": "touch", "retval": 3, "ts_sec": 803.043308743, "tgid": 0, "tid": 14883, "ppid": 14883, "cgroup_id": 3257288213055174703, "filename": "/usr/lib/locale/C.utf8/LC_NUMERIC"}
{"event": "CLOSE", "command": "touch", "retval": 0, "ts_sec": 803.043310733, "tgid": 14414, "tid": 14883, "ppid": 14883, "cgroup_id": 13176}
{"event": "OPEN", "command": "touch", "retval": 3, "ts_sec": 803.043316595, "tgid": 0, "tid": 14883, "ppid": 14883, "cgroup_id": 3257288213055174703, "filename": "/usr/lib/locale/C.utf8/LC_CTYPE"}
{"event": "CLOSE", "command": "touch", "retval": 0, "ts_sec": 803.043318514, "tgid": 14414, "tid": 14883, "ppid": 14883, "cgroup_id": 13176}
{"event": "OPEN", "command": "touch", "retval": 3, "ts_sec": 803.043359627, "tgid": 0, "tid": 14883, "ppid": 14883, "cgroup_id": 3257288213055174703, "filename": "/mnt/flux/stop_ebpf_collection"}
{"event": "CLOSE", "command": "touch", "retval": 0, "ts_sec": 803.043360931, "tgid": 14414, "tid": 14883, "ppid": 14883, "cgroup_id": 13176}
Indicator file '/mnt/flux/stop_ebpf_collection' found. Stopping.
Cleaning up BPF resources...
Here is how to do multiple at once (each still a single sidecar):
helm install \
--set experiment.monitor="cpu|shmem|tcp|futex|open-close" \
--set minicluster.save_logs=true \
lammps ./lammps-reax
Here is how to deploy multiple sidecars:
helm install \
--set experiment.monitor_multiple="flamegraph|open-close" \
--set experiment.monitor.sleep=true \
--set minicluster.save_logs=true \
--dry-run lammps ./lammps-reax
For the flamegraph, you'll want to enable the monitor container to sleep so you can copy the svg and folded files out afterward.
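Once the run finishes and the monitor container is sleeping, copy the artifacts out of the pod. This is a sketch: the container name and output path are assumptions, so check the chart's monitor definition for the actual values.

```bash
# Copy flamegraph artifacts out of the sleeping monitor sidecar
# "monitor" and /mnt/flux/results are hypothetical; adjust to your deployment
pod=$(kubectl get pods -o json | jq -r .items[0].metadata.name)
kubectl cp ${pod}:/mnt/flux/results ./flamegraph-results -c monitor
```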
To clean up:
helm uninstall lammps
Here are all the examples. For any example, you need to update dependencies before you run:
helm dependency update ./<app>
helm install amg ./amg2023
helm install kripke ./kripke
helm install lammps ./lammps-reax
helm install laghos ./laghos
helm install minife ./minife
helm install mixbench ./mixbench
helm install mtgemm ./mt-gemm
helm install osu ./osu-benchmarks
helm install quicksilver ./quicksilver
helm install single-node ./single-node
helm install stream ./stream
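To prepare all of the charts at once, a simple loop over the same directories works (a sketch; the names match the install commands above):

```bash
# Update dependencies for every chart before installing any of them
for app in amg2023 kripke lammps-reax laghos minife mixbench mt-gemm osu-benchmarks quicksilver single-node stream; do
  helm dependency update ./$app
done
```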
And an example using a custom yaml file (more ideal for reproducible experiments):
helm install amg -f ./examples/amg2023/flux-minicluster.yaml ./amg2023
HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.
See LICENSE, COPYRIGHT, and NOTICE for details.
SPDX-License-Identifier: (MIT)
LLNL-CODE-842614