The shape of VM to come
Get qemu
git clone https://gitlab.com/qemu-project/qemu.git
Install qemu
cd qemu
mkdir build
cd build
../configure --enable-slirp
make -j
sudo make install
Get the Debian 12.2.0 image that we want to boot as a virtual machine:
wget https://www.debian.org/distrib/netinst/debian-12.2.0-amd64-netinst.iso
Create a disk image (qcow2 format) where the VM will store its data:
qemu-img create -f qcow2 mydisk.img 20G
Install Debian in the VM with qemu:
qemu-system-x86_64 -boot d -cdrom debian-12.2.0-amd64-netinst.iso -m 4G \
-device e1000,netdev=net0,mac=52:54:00:12:34:56 -netdev user,id=net0,hostfwd=tcp::10022-:22 \
-hda mydisk.img -accel kvm
Follow all the instructions from the installer and you're done. -accel kvm speeds up the installation (from 1h30 to 20 min in my case).
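Since -accel kvm only works when the host exposes KVM, a launcher script can fall back to software emulation. The following is a minimal sketch: the function name pick_accel is hypothetical, and it assumes that a readable and writable /dev/kvm is a good enough sign that KVM is usable.

```shell
#!/bin/sh
# pick_accel (hypothetical helper): print "kvm" when /dev/kvm looks usable,
# otherwise "tcg" (QEMU's software emulation, which is much slower).
pick_accel() {
    if [ -r /dev/kvm ] && [ -w /dev/kvm ]; then
        echo kvm
    else
        echo tcg
    fi
}

# Example use (commented out so the sketch has no side effects):
# qemu-system-x86_64 -hda mydisk.img -m 4G -accel "$(pick_accel)"
```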
Let's say we want to run Debian with 8 GB of RAM:
qemu-system-x86_64 -hda mydisk.img -m 8G -accel kvm
A VM can use a lot of resources, which slows it down. We can lighten the load by disabling the graphical interface. Open a terminal within the VM and run:
sudo systemctl set-default multi-user.target
sudo reboot
Just in case, you can re-enable it with:
sudo systemctl set-default graphical.target
sudo reboot
Before going further, we can save our mydisk.img as snapshots thanks to the qcow2 format:
qemu-img create -f qcow2 -b mydisk.img -F qcow2 snapshot.img
mydisk.img should not be modified anymore, because any change could corrupt the snapshots that use it as a backing file.
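One way to guard against accidental changes to the backing file is to drop its write permission on the host. This is only a sketch: protect_base_image is a hypothetical helper name, and it assumes that QEMU only needs read access to a backing file. It does not stop root from writing to the file.

```shell
#!/bin/sh
# protect_base_image (hypothetical helper): remove write permission from a
# backing file so that it cannot be modified by accident.
protect_base_image() {
    chmod a-w "$1"
}

# Example use (commented out so the sketch has no side effects):
# protect_base_image mydisk.img
# qemu-img info --backing-chain snapshot.img   # prints the backing chain
```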
Using qemu, let's configure the VM's hardware with 4 NUMA nodes, each with 4 CPUs, and with 4, 2, 1 and 1 GB of memory respectively:
qemu-system-x86_64 -hda snapshot.img -m 8G \
-accel kvm \
-smp cpus=16 \
-object memory-backend-ram,size=4G,id=ram0 \
-object memory-backend-ram,size=2G,id=ram1 \
-object memory-backend-ram,size=1G,id=ram2 \
-object memory-backend-ram,size=1G,id=ram3 \
-numa node,nodeid=0,memdev=ram0,cpus=0-3 \
-numa node,nodeid=1,memdev=ram1,cpus=4-7 \
-numa node,nodeid=2,memdev=ram2,cpus=8-11 \
-numa node,nodeid=3,memdev=ram3,cpus=12-15
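The -object and -numa pairs above follow a regular pattern, so they can be generated instead of typed by hand. The sketch below assumes every node gets the same number of CPUs, numbered consecutively; numa_args is a hypothetical helper name, not part of qemu.

```shell
#!/bin/sh
# numa_args CPUS_PER_NODE SIZE... (hypothetical helper)
# Print one -object and one -numa option per node, one option per line.
numa_args() {
    cpus_per_node=$1
    shift
    node=0
    for size in "$@"; do
        first=$((node * cpus_per_node))
        last=$((first + cpus_per_node - 1))
        printf '%s\n' "-object memory-backend-ram,size=${size},id=ram${node}"
        printf '%s\n' "-numa node,nodeid=${node},memdev=ram${node},cpus=${first}-${last}"
        node=$((node + 1))
    done
}

# Reproduce the 4-node layout used above (4, 2, 1 and 1 GB; 4 CPUs each).
numa_args 4 4G 2G 1G 1G
```

Inside the guest, numactl --hardware (if the numactl package is installed) should then report the four nodes with their CPUs and memory sizes.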
qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
-device e1000,netdev=net0,mac=52:54:00:12:34:56 -netdev user,id=net0,hostfwd=tcp::10022-:22 \
-machine pc,nvdimm=on \
-m 8G,slots=1,maxmem=9G \
-smp cpus=16 \
-object memory-backend-ram,size=4G,id=ram0 \
-object memory-backend-ram,size=2G,id=ram1 \
-object memory-backend-ram,size=1G,id=ram2 \
-object memory-backend-ram,size=1G,id=ram3 \
-device nvdimm,id=nvdimm1,memdev=nvdimm1,unarmed=off,node=4 \
-object memory-backend-file,id=nvdimm1,share=on,mem-path=img/nvdimm.img,size=1G \
-numa node,nodeid=0,memdev=ram0,cpus=0-3 \
-numa node,nodeid=1,memdev=ram1,cpus=4-7 \
-numa node,nodeid=2,memdev=ram2,cpus=8-11 \
-numa node,nodeid=3,memdev=ram3,cpus=12-15 \
-numa node,nodeid=4
Running ndctl list -NRD lists the active and enabled nvdimm devices:
{
"dimms":[
{
"dev":"nmem0",
"id":"8680-56341200",
"handle":1,
"phys_id":0
}
],
"regions":[
{
"dev":"region0",
"size":1073741824,
"align":16777216,
"available_size":0,
"max_available_extent":0,
"type":"pmem",
"mappings":[
{
"dimm":"nmem0",
"offset":0,
"length":1073741824,
"position":0
}
],
"persistence_domain":"unknown",
"namespaces":[
{
"dev":"namespace0.0",
"mode":"raw",
"size":1073741824,
"sector_size":512,
"blockdev":"pmem0"
}
]
}
]
}
By default, namespaceX.Y (here namespace0.0) is in raw mode, which means the nvdimm device acts as a memory disk that does not support DAX. We need to disable the namespace, create a new one in devdax mode, and finally reconfigure the resulting DAX device as system RAM with the following commands:
sudo ndctl disable-namespace namespace0.0
sudo ndctl create-namespace -m devdax
sudo daxctl reconfigure-device -m system-ram all --force
Node 4 is now configured as dax:
{
"dimms":[
{
"dev":"nmem0",
"id":"8680-56341200",
"handle":1,
"phys_id":0
}
],
"regions":[
{
"dev":"region0",
"size":1073741824,
"align":16777216,
"available_size":0,
"max_available_extent":0,
"type":"pmem",
"mappings":[
{
"dimm":"nmem0",
"offset":0,
"length":1073741824,
"position":0
}
],
"persistence_domain":"unknown",
"namespaces":[
{
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":1054867456,
"uuid":"ed8bb2a9-41fb-48e0-a0b2-7dbf0d9ca9ba",
"chardev":"dax0.0",
"align":2097152
}
]
}
]
}
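To check the namespace mode from a script rather than by reading the JSON, the "mode" fields can be extracted with standard text tools. This is a rough sketch that assumes the output layout shown above; namespace_modes is a hypothetical helper name, and a real JSON parser would be more robust.

```shell
#!/bin/sh
# namespace_modes (hypothetical helper): read ndctl JSON on stdin and print
# the value of every "mode" field, one per line.
namespace_modes() {
    grep -Eo '"mode": *"[^"]*"' | cut -d'"' -f4
}

# Example use inside the VM (commented out so the sketch has no side effects):
# sudo ndctl list -N | namespace_modes    # expected to print devdax after the steps above
```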
To be sure, we work with the latest Linux kernel: 6.7.0-rc3+
First we need a CXL host bridge (PCI eXpander Bridge, i.e. pxb-cxl, "cxl.1" here), then we attach a root port (cxl-rp, "root_port13" here), then a Type 3 device.
In this case it is a pmem device, so it needs two "memory-backend-file" objects: one for the memory ("pmem0" here) and one for its label storage area (LSA, i.e. "cxl-lsa0"). Finally we need a Fixed Memory Window (FMW, i.e. cxl-fmw) to map that memory in the host:
qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
-machine q35,nvdimm=on,cxl=on \
-device e1000,netdev=net0,mac=52:54:00:12:34:56 \
-netdev user,id=net0,hostfwd=tcp::10022-:22 \
-m 4G,slots=8,maxmem=8G \
-smp 4 \
-object memory-backend-ram,size=4G,id=mem0 \
-numa node,nodeid=0,cpus=0-3,memdev=mem0 \
-object memory-backend-file,id=pmem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
-object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa.raw,size=256M \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
-device cxl-type3,bus=root_port13,persistent-memdev=pmem0,lsa=cxl-lsa0,id=cxl-pmem0 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
We need to create the region using cxl create-region and make it available as an NVM NUMA node:
sudo cxl create-region -m -d decoder0.0 -t pmem mem0
sudo daxctl reconfigure-device -m system-ram dax0.0 --force
Let's build a VM with 2 sockets. Each socket has 2 CPUs, 2 CXL devices and 1 switch.
We need one PXB per socket, each with 2 root ports (RP). A switch is installed on each socket, with 1 upstream port and 2 downstream ports. On both PXBs, the root port that carries the switch's upstream port has to be attached on slot 0. Hence, each NUMA node needs its own chassis number to tell them apart.
In this case the devices are volatile memory (vmem) devices, so we need two "memory-backend-ram" objects per socket. Finally we set 2 Fixed Memory Windows to map both memories in the host:
qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
-machine q35,nvdimm=on,cxl=on \
-device e1000,netdev=net0,mac=52:54:00:12:34:56 \
-netdev user,id=net0,hostfwd=tcp::10022-:22 \
-m 2G,slots=8,maxmem=10G \
-smp cpus=4,cores=2,sockets=2 \
-object memory-backend-ram,size=1G,id=ram0 \
-object memory-backend-ram,size=1G,id=ram1 \
-object memory-backend-ram,id=cxl-mem0,share=on,size=256M \
-object memory-backend-ram,id=cxl-mem1,share=on,size=256M \
-object memory-backend-ram,id=cxl-mem2,share=on,size=256M \
-object memory-backend-ram,id=cxl-mem3,share=on,size=256M \
-numa node,nodeid=0,cpus=0-1,memdev=ram0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram1 \
-device pxb-cxl,numa_node=0,bus_nr=24,bus=pcie.0,id=pxb-cxl.1 \
-device pxb-cxl,numa_node=1,bus_nr=32,bus=pcie.0,id=pxb-cxl.2 \
-device cxl-rp,port=0,bus=pxb-cxl.1,id=root_port1,chassis=0,slot=0 \
-device cxl-rp,port=1,bus=pxb-cxl.1,id=root_port2,chassis=0,slot=1 \
-device cxl-rp,port=2,bus=pxb-cxl.2,id=root_port3,chassis=1,slot=0 \
-device cxl-rp,port=3,bus=pxb-cxl.2,id=root_port4,chassis=1,slot=2 \
-device cxl-upstream,bus=root_port1,id=us0 \
-device cxl-upstream,bus=root_port3,id=us1 \
-device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=3 \
-device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \
-device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=4 \
-device cxl-type3,bus=swport1,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
-device cxl-downstream,port=2,bus=us1,id=swport2,chassis=1,slot=5 \
-device cxl-type3,bus=swport2,volatile-memdev=cxl-mem2,id=cxl-vmem2 \
-device cxl-downstream,port=3,bus=us1,id=swport3,chassis=1,slot=6 \
-device cxl-type3,bus=swport3,volatile-memdev=cxl-mem3,id=cxl-vmem3 \
-M cxl-fmw.0.targets.0=pxb-cxl.1,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=pxb-cxl.2,cxl-fmw.1.size=4G
Here, we selected root_port1 and root_port3 to be plugged on slot 0 of chassis 0 and chassis 1 respectively. The bus_nr values of the PXBs may lead to error messages if they are already in use. In that case, just change them to another value.
From the VM, list the CXL memory devices with cxl list -M:
[
{
"memdev":"mem1",
"ram_size":268435456,
"serial":0,
"numa_node":1,
"host":"0000:23:00.0"
},
{
"memdev":"mem0",
"ram_size":268435456,
"serial":0,
"numa_node":1,
"host":"0000:24:00.0"
},
{
"memdev":"mem2",
"ram_size":268435456,
"serial":0,
"numa_node":0,
"host":"0000:1b:00.0"
},
{
"memdev":"mem3",
"ram_size":268435456,
"serial":0,
"numa_node":0,
"host":"0000:1c:00.0"
}
]
We can list the available decoders with cxl list -D:
[
{
"root decoders":[
{
"decoder":"decoder0.0",
"size":4294967296,
"interleave_ways":1,
"max_available_extent":-17985175553,
"pmem_capable":true,
"volatile_capable":true,
"accelmem_capable":true,
"nr_targets":1
},
{
"decoder":"decoder0.1",
"size":4294967296,
"interleave_ways":1,
"max_available_extent":-22280142849,
"pmem_capable":true,
"volatile_capable":true,
"accelmem_capable":true,
"nr_targets":1
}
]
}
]
We assemble a CXL region with the cxl create-region command. We need to select the decoder under which the region will be created and which contains the CXL devices. Below, we first assemble mem1 and mem0, located under decoder0.1, with 2-way interleaving:
sudo cxl create-region -m -d decoder0.1 -t ram -w 2 mem1 mem0
And we assemble mem2 and mem3 under decoder0.0, each with 1-way interleaving:
sudo cxl create-region -m -d decoder0.0 -t ram -w 1 mem2
sudo cxl create-region -m -d decoder0.0 -t ram -w 1 mem3
We can see they are now available with the command daxctl list:
[
{
"chardev":"dax1.0",
"size":268435456,
"target_node":3,
"align":2097152,
"mode":"system-ram"
},
{
"chardev":"dax3.0",
"size":268435456,
"target_node":3,
"align":2097152,
"mode":"system-ram"
},
{
"chardev":"dax0.0",
"size":536870912,
"target_node":2,
"align":2097152,
"mode":"system-ram"
}
]
New DAX devices should appear under /sys/bus/dax/devices. By default, new NUMA nodes appear offline. Run daxctl online-memory all to bring them online.
Let's build a VM with 4 sockets: one socket with only CPUs, one with a CXL pmem device, one with 2 CXL devices in 2-way interleaving, and one with 2 CXL devices in 1-way interleaving:
qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
-machine q35,nvdimm=on,cxl=on \
-device e1000,netdev=net0,mac=52:54:00:12:34:56 \
-netdev user,id=net0,hostfwd=tcp::10022-:22 \
-m 4G,slots=8,maxmem=10G \
-smp cpus=8,cores=2,sockets=4 \
-object memory-backend-ram,size=1G,id=ram0 \
-object memory-backend-ram,size=1G,id=ram1 \
-object memory-backend-ram,size=1G,id=ram2 \
-object memory-backend-ram,size=1G,id=ram3 \
-object memory-backend-ram,id=cxl-mem0,share=on,size=256M \
-object memory-backend-ram,id=cxl-mem1,share=on,size=256M \
-object memory-backend-ram,id=cxl-mem2,share=on,size=256M \
-object memory-backend-ram,id=cxl-mem3,share=on,size=256M \
-object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/cxltest.raw,size=256M \
-object memory-backend-file,id=cxl-lsa4,share=on,mem-path=/tmp/lsa.raw,size=256M \
-numa node,nodeid=0,cpus=0-1,memdev=ram0 \
-numa node,nodeid=1,cpus=2-3,memdev=ram1 \
-numa node,nodeid=2,cpus=4-5,memdev=ram2 \
-numa node,nodeid=3,cpus=6-7,memdev=ram3 \
-device pxb-cxl,numa_node=0,bus_nr=24,bus=pcie.0,id=pxb-cxl.1 \
-device pxb-cxl,numa_node=1,bus_nr=32,bus=pcie.0,id=pxb-cxl.2 \
-device pxb-cxl,numa_node=3,bus_nr=40,bus=pcie.0,id=pxb-cxl.3 \
-device cxl-rp,port=0,bus=pxb-cxl.1,id=root_port1,chassis=0,slot=0 \
-device cxl-rp,port=1,bus=pxb-cxl.1,id=root_port2,chassis=0,slot=3 \
-device cxl-rp,port=2,bus=pxb-cxl.2,id=root_port3,chassis=1,slot=0 \
-device cxl-rp,port=3,bus=pxb-cxl.2,id=root_port4,chassis=1,slot=5 \
-device cxl-rp,port=0,bus=pxb-cxl.3,id=root_port5,chassis=2,slot=0 \
-device cxl-upstream,bus=root_port1,id=us0 \
-device cxl-upstream,bus=root_port3,id=us1 \
-device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=7 \
-device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \
-device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=8 \
-device cxl-type3,bus=swport1,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
-device cxl-downstream,port=2,bus=us1,id=swport2,chassis=1,slot=9 \
-device cxl-type3,bus=swport2,volatile-memdev=cxl-mem2,id=cxl-vmem2 \
-device cxl-downstream,port=3,bus=us1,id=swport3,chassis=1,slot=10 \
-device cxl-type3,bus=swport3,volatile-memdev=cxl-mem3,id=cxl-vmem3 \
-device cxl-type3,bus=root_port5,persistent-memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem0 \
-M cxl-fmw.0.targets.0=pxb-cxl.1,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=pxb-cxl.2,cxl-fmw.1.size=4G,cxl-fmw.2.targets.0=pxb-cxl.3,cxl-fmw.2.size=512M
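The -M cxl-fmw argument grows with every host bridge and is easy to mistype. As a sketch, it can be built from a list of target=size pairs; cxl_fmw_arg is a hypothetical helper name, and it assumes one target per window (no interleaving across host bridges).

```shell
#!/bin/sh
# cxl_fmw_arg TARGET=SIZE... (hypothetical helper)
# Build the value of the -M option that declares one Fixed Memory Window
# per TARGET=SIZE pair, numbered from 0 in argument order.
cxl_fmw_arg() {
    arg=""
    i=0
    for spec in "$@"; do
        target=${spec%%=*}   # text before the first "="
        size=${spec#*=}      # text after the first "="
        arg="${arg}${arg:+,}cxl-fmw.${i}.targets.0=${target},cxl-fmw.${i}.size=${size}"
        i=$((i + 1))
    done
    printf '%s\n' "$arg"
}

# Reproduce the three windows used in the command above.
cxl_fmw_arg pxb-cxl.1=4G pxb-cxl.2=4G pxb-cxl.3=512M
```

The printed string can then be passed to qemu as -M "$(cxl_fmw_arg ...)".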
TIP: How to identify which decoder corresponds to which device.
When listing with cxl list -Dv, look at the id of each decoder's target. Here decoder0.0 corresponds to id=24. This id is the bus number of the PXB attached to a node. In our previous qemu script, bus_nr=24 corresponds to numa_node=0.
"decoders:root0":[
{
"decoder":"decoder0.0",
"size":4294967296,
"interleave_ways":1,
"max_available_extent":4294967296,
"pmem_capable":true,
"volatile_capable":true,
"accelmem_capable":true,
"nr_targets":1,
"targets":[
{
"target":"pci0000:18",
"alias":"ACPI0016:02",
"position":0,
"id":24
}
]
},
{
"decoder":"decoder0.1",
"size":4294967296,
"interleave_ways":1,
"max_available_extent":4294967296,
"pmem_capable":true,
"volatile_capable":true,
"accelmem_capable":true,
"nr_targets":1,
"targets":[
{
"target":"pci0000:20",
"alias":"ACPI0016:01",
"position":0,
"id":32
}
]
},
{
"decoder":"decoder0.2",
"size":536870912,
"interleave_ways":1,
"max_available_extent":536870912,
"pmem_capable":true,
"volatile_capable":true,
"accelmem_capable":true,
"nr_targets":1,
"targets":[
{
"target":"pci0000:28",
"alias":"ACPI0016:00",
"position":0,
"id":40
}
]
}
]
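In the output above, each "id" is a bus_nr in decimal, while the "target" name carries the same number in hexadecimal (24 is 0x18, 32 is 0x20, 40 is 0x28). A small sketch of that conversion, assuming the PCI domain is 0000 as in this output; bus_nr_to_target is a hypothetical helper name.

```shell
#!/bin/sh
# bus_nr_to_target BUS_NR (hypothetical helper): print the host bridge name
# that cxl list -Dv shows as "target" for a given decimal bus_nr.
bus_nr_to_target() {
    printf 'pci0000:%02x\n' "$1"
}

# The three bus_nr values used in the qemu command above.
for nr in 24 32 40; do
    bus_nr_to_target "$nr"
done
```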
Let's select the id 24, which is attached to decoder0.0. To identify which memory devices are below that decoder, run cxl list -M:
[
{
"memdev":"mem0",
"pmem_size":268435456,
"serial":0,
"numa_node":3,
"host":"0000:29:00.0"
},
{
"memdev":"mem1",
"ram_size":268435456,
"serial":0,
"numa_node":0,
"host":"0000:1b:00.0"
},
{
"memdev":"mem4",
"ram_size":268435456,
"serial":0,
"numa_node":0,
"host":"0000:1c:00.0"
},
{
"memdev":"mem3",
"ram_size":268435456,
"serial":0,
"numa_node":1,
"host":"0000:23:00.0"
},
{
"memdev":"mem2",
"ram_size":268435456,
"serial":0,
"numa_node":1,
"host":"0000:24:00.0"
}
]
We can see that mem1 and mem4 are located in numa_node 0.
So we can run sudo cxl create-region -m -t ram -d decoder0.0 -w2 mem4 mem1 with confidence that it is the right decoder with the right memory devices.
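Reading the memdev to numa_node mapping by eye gets error-prone with more devices. The sketch below groups the devices by node with awk; it assumes the output layout shown above (one key per line, "memdev" listed before "numa_node" in each entry), and memdevs_by_node is a hypothetical helper name.

```shell
#!/bin/sh
# memdevs_by_node (hypothetical helper): read `cxl list -M` JSON on stdin and
# print "<numa_node> <memdev>" pairs, sorted by node.
memdevs_by_node() {
    awk -F'"' '
        $2 == "memdev"    { dev = $4 }
        $2 == "numa_node" { gsub(/[^0-9]/, "", $3); print $3, dev }
    ' | sort -n
}

# Example use inside the VM (commented out so the sketch has no side effects):
# cxl list -M | memdevs_by_node
```

With the listing above, the first two lines of output would be the devices on node 0, which are the ones to pass to cxl create-region for decoder0.0.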
Then we finalize the configuration with 1-way interleaving:
sudo cxl create-region -m -t ram -d decoder0.1 -w1 mem3
sudo cxl create-region -m -t ram -d decoder0.1 -w1 mem2
Then we create the region with persistent memory and its namespace:
sudo cxl create-region -m -t pmem -d decoder0.2 mem0
sudo ndctl create-namespace -t pmem -m devdax -r region2 -f
And finally online all devices:
sudo daxctl online-memory all
sudo daxctl reconfigure-device -m system-ram dax2.0 --force
We observed error messages like: failed to create namespace: No space left on device after running the namespace creation. To work around this issue, erase the files declared in the mem-path arguments (usually under /tmp/) of your qemu script, then reboot the VM.
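The cleanup described above can be scripted on the host. This is a minimal sketch, to be run while the VM is stopped; reset_backing_files is a hypothetical helper name, and the paths in the usage comment are the mem-path values from the qemu commands above.

```shell
#!/bin/sh
# reset_backing_files PATH... (hypothetical helper): delete the files that
# back the emulated persistent memory and its label storage area, so that
# qemu starts from empty files on the next boot.
reset_backing_files() {
    for f in "$@"; do
        rm -f -- "$f"
    done
}

# Example use on the host (commented out so the sketch has no side effects):
# reset_backing_files /tmp/cxltest.raw /tmp/lsa.raw
```

Any data or namespace labels stored in those files are lost, so regions and namespaces have to be created again afterwards.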