Optimising Virtual Desktop Performance

2026-04-06 @ 10 minute(s)

notes


Proxmox VM Dashboard metrics for the Virtual Desktop

I ran into a bit of a problem with my virtual desktop under certain workloads. Performance on CPU-intensive tasks was worse than I expected based on the CPU and the number of cores dedicated to the VM. I confirmed the issue further while playing the recently released Death Stranding 2 via Proton on Steam. GPU utilisation stayed persistently below 60% of maximum, which is clearly suboptimal and shouldn’t happen with this pairing of CPU and graphics card (Ryzen 9 9900X, RTX 5070 Ti). In general, the issue manifested as lower than expected thread utilisation and output metrics during workloads, which initially led me to believe that something related to either scheduling or caching was causing stalls.

CCD & CPU Affinity

Illustration of Core Complex Die (CCD) and the IO die

My first instinct was to verify that the VM was actually assigned the correct cores/threads, and to make sure that it had priority access to them for scheduling. As the Ryzen 9 9900X uses a design with two Core Complex Dies (CCDs), it’s rather important that the VM be given cores/threads that are all on the same CCD, to avoid potential latency issues stemming from the L3 cache being split between the CCDs. I was already aware of this potential point of contention when I was choosing the CPU, but I had presumed that the “NUMA aware” flag in Proxmox would be smart enough to handle this allocation automatically, and thus sort of handwaved the details away. Regardless, I went on to verify the CPU’s NUMA layout using lscpu, and then cross-referenced those details with what top was showing for the VM process.

root@pve.lan.tbk.fi:~# lscpu
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             48 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      24
  On-line CPU(s) list:       0-23
Vendor ID:                   AuthenticAMD
  Model name:                AMD Ryzen 9 9900X 12-Core Processor
    CPU family:              26
    Model:                   68
    Thread(s) per core:      2
    Core(s) per socket:      12
    Socket(s):               1
    Stepping:                0
    Frequency boost:         enabled
    CPU(s) scaling MHz:      62%
    CPU max MHz:             5662.0161
    CPU min MHz:             613.9540
    BogoMIPS:                8782.95
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good
                             amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm
                             cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault
                             cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq adx
                             smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni
                             avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic
                             v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect
                             movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze
Virtualization features:
  Virtualization:            AMD-V
Caches (sum of all):
  L1d:                       576 KiB (12 instances)
  L1i:                       384 KiB (12 instances)
  L2:                        12 MiB (12 instances)
  L3:                        64 MiB (2 instances)
# This is the relevant section
NUMA:
  NUMA node(s):              2
  NUMA node0 CPU(s):         0-5,12-17
  NUMA node1 CPU(s):         6-11,18-23
#
Vulnerabilities:
  Gather data sampling:      Not affected
  Ghostwrite:                Not affected
  Indirect target selection: Not affected
  Itlb multihit:             Not affected
  L1tf:                      Not affected
  Mds:                       Not affected
  Meltdown:                  Not affected
  Mmio stale data:           Not affected
  Old microcode:             Not affected
  Reg file data sampling:    Not affected
  Retbleed:                  Not affected
  Spec rstack overflow:      Mitigation; IBPB on VMEXIT only
  Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                     Not affected
  Tsa:                       Not affected
  Tsx async abort:           Not affected
  Vmscape:                   Mitigation; IBPB on VMEXIT

This output confirms the two-node layout, and that the CPU numbering for each CCD is not contiguous: each node is split across two ranges (most likely the six physical cores followed by their SMT siblings). My goal is therefore to make sure the high-performance VM ideally only uses cores/threads from either the 0-5,12-17 range or the 6-11,18-23 range. I’d also want to make sure that the KVM process has high scheduling priority, so other host processes steal as few cycles as possible.
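The NUMA lines are easy to pull out of lscpu with a small script instead of by eye. Below is a minimal sketch; the function names `numa_node_cpus` and `expand_cpu_list` are made up for this example, and the parsing assumes the lscpu output format shown above.

```shell
# Read `lscpu` output on stdin and print "<node> <cpu-list>" per NUMA node.
numa_node_cpus() {
    awk '/NUMA node[0-9]+ CPU\(s\):/ { print $2, $4 }'
}

# Expand a kernel-style CPU list such as "6-11,18-23" into single CPU numbers
# separated by spaces.
expand_cpu_list() {
    echo "$1" | awk -F, '{
        for (i = 1; i <= NF; i++) {
            n = split($i, r, "-")
            hi = (n == 2) ? r[2] : r[1]   # a lone CPU is a range of one
            for (c = r[1] + 0; c <= hi + 0; c++)
                printf "%s%d", (out++ ? " " : ""), c
        }
        print ""
    }'
}
```

On the host this would be used as `lscpu | numa_node_cpus`, and `expand_cpu_list 6-11,18-23` turns one node's list into individual CPU numbers for further scripting.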

The next logical step was to see which CPUs the active KVM process was currently using. I went through the active KVM processes using ps -ef | grep "/usr/bin/kvm" and noted the process ID for my vdesk. I then used top -p <ID> to inspect the active CPUs for the VM process using the P (Last Used SMP) field.
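The same check can be done non-interactively. This is a rough sketch, not the method used above: `ps -L -o tid=,psr=` prints each thread ID together with the CPU it last ran on, and the helper `threads_outside` (a name invented here) filters out the threads that sit on a CPU outside a given list.

```shell
# Read "<tid> <cpu>" pairs on stdin (e.g. from `ps -L -o tid=,psr= -p <PID>`)
# and print every pair whose CPU is not in the allowed list given as $1,
# written in kernel list syntax such as "6-11,18-23".
threads_outside() {
    awk -v allowed="$1" '
        BEGIN {
            n = split(allowed, parts, ",")
            for (i = 1; i <= n; i++) {
                m = split(parts[i], r, "-")
                hi = (m == 2) ? r[2] : r[1]
                for (c = r[1] + 0; c <= hi + 0; c++) ok[c] = 1
            }
        }
        NF >= 2 && !(($2 + 0) in ok) { print $1, $2 }
    '
}
```

Usage would be along the lines of `ps -L -o tid=,psr= -p <PID> | threads_outside 6-11,18-23`, where an empty result means every thread was last seen on the chosen CCD. Keep in mind that the value is only a snapshot of the last-used CPU, so it is worth running a few times under load.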

I noticed that the VM process was, in fact, using CPUs from both NUMA node ranges. I’m a little surprised by this, but to be fair, the flag is just called “Enable NUMA”. This doesn’t really mean anything concrete at face value, and I haven’t actually looked into what the option does under the hood, so fair enough. I’ll just specify the affinity manually.

I figure twelve threads is enough for this VM, which allows me to dedicate the second CCD entirely to the vdesk. This leaves the other CCD entirely for other virtual hosts, making sure that the high-performance VM can’t possibly grind the whole hypervisor to a halt. I’ve also raised the CPU units weight for this VM, so the scheduler will prioritise this VM’s needs within that CCD when appropriate. This means it should still be safe enough to allow other VMs access to the CCD hardware if needed.

Proxmox VM Processor Settings for CPU Affinity
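For reference, the same settings can most likely be applied from the host shell as well. The sketch below assumes a Proxmox VE release whose `qm set` accepts the `--affinity` and `--cpuunits` options; the VM ID 114 and the values are the ones from the VM configuration shown later in this post, and `qm_pin_cmd` is a hypothetical helper that only prints the command so it can be reviewed before running it.

```shell
# Print (rather than run) the qm command that pins a VM to a CPU list and
# sets its CPU weight. Arguments: <vmid> <cpu-list> <cpuunits>.
qm_pin_cmd() {
    vmid=$1 cpus=$2 units=$3
    printf 'qm set %s --affinity %s --cpuunits %s\n' "$vmid" "$cpus" "$units"
}

# Example: qm_pin_cmd 114 6-11,18-23 2048
```

Printing the command first is a deliberate choice here, since a wrong CPU list is easier to spot before it ends up in the VM configuration.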

After making these changes and restarting the vdesk, I confirmed that the affinity setting was being utilised for scheduling as directed:

root@pve.lan.tbk.fi:~# top -p 56066
  PID   USER      PR  NI   VIRT   RES     SHR  S   %CPU     P
  56192 root      20   0   42.1g  53716   9980 S   4.0       7
  56182 root      20   0   42.1g  53716   9980 S   1.7       8
  56185 root      20   0   42.1g  53716   9980 S   1.3       9
  56186 root      20   0   42.1g  53716   9980 S   1.3      10
  56190 root      20   0   42.1g  53716   9980 S   1.3      21
  56187 root      20   0   42.1g  53716   9980 S   1.0      20
  56188 root      20   0   42.1g  53716   9980 S   0.7       6
  56193 root      20   0   42.1g  53716   9980 S   0.7       8
  56183 root      20   0   42.1g  53716   9980 S   0.3      18
  56184 root      20   0   42.1g  53716   9980 S   0.3      23
  56189 root      20   0   42.1g  53716   9980 S   0.3      19
  56191 root      20   0   42.1g  53716   9980 S   0.3      10
  56066 root      20   0   42.1g  53716   9980 S   0.0       6
  56067 root      20   0   42.1g  53716   9980 S   0.0       6
  56068 root      20   0   42.1g  53716   9980 S   0.0      19
  56069 root      20   0   42.1g  53716   9980 S   0.0      11
  56167 root      20   0   42.1g  53716   9980 S   0.0      18
  56195 root      20   0   42.1g  53716   9980 S   0.0      23
  56218 root      20   0   42.1g  53716   9980 S   0.0      21
  56219 root      20   0   42.1g  53716   9980 S   0.0      21
  56260 root      20   0   42.1g  53716   9980 S   0.0       7
 713924 root      20   0   42.1g  53716   9980 S   0.0       8

Huge Pages

And just because I was curious and already at it, I decided to see if I could get hugepages working for this VM. Supposedly this can speed up memory lookups by reducing the number of TLB entries needed to map the same amount of memory, which I imagine improves performance to a degree that depends on the workload and on how much total memory the system has. There’s a small benchmark article on the topic at Red Hat Developer, but I’m not about to run any benchmarks for this little demo, since for now I’m just interested in seeing if I can get it to work.

To use 1GiB hugepages the CPU must support the pdpe1gb feature, so have a gander at lscpu or cat /proc/cpuinfo for the available CPU features.

flags       : fpu vme de pse tsc msr pae mce cx8
              apic sep mtrr pge mca cmov pat pse36
              clflush mmx fxsr sse sse2 ht syscall
              nx mmxext fxsr_opt pdpe1gb rdtscp lm # <-- 'pdpe1gb'
              constant_tsc rep_good amd_lbr_v2 nopl
              xtopology nonstop_tsc cpuid extd_apicid
              aperfmperf rapl pni pclmulqdq monitor
              ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt
              aes xsave avx f16c rdrand lahf_lm cmp_legacy
              svm extapic cr8_legacy abm sse4a misalignsse
              3dnowprefetch osvw ibs skinit wdt tce topoext
              perfctr_core perfctr_nb bpext perfctr_llc
              mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate
              ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced
              vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2
              erms invpcid cqm rdt_a avx512f avx512dq adx smap
              avx512ifma clflushopt clwb avx512cd sha_ni
              avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves
              cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
              user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr
              rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save
              tsc_scale vmcb_clean flushbyasid decodeassists
              pausefilter pfthreshold avic v_vmsave_vmload
              vgif x2avic v_spec_ctrl vnmi avx512vbmi umip
              pku ospke avx512_vbmi2 gfni vaes vpclmulqdq
              avx512_vnni avx512_bitalg avx512_vpopcntdq
              rdpid bus_lock_detect movdiri movdir64b
              overflow_recov succor smca fsrm avx512_vp2intersect
              flush_l1d amd_lbr_pmc_freeze
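Scanning that wall of flags by eye is error-prone, so here is a small sketch that does an exact-match lookup instead. The helper name `has_cpu_flag` is made up for this example, and it assumes the x86 /proc/cpuinfo layout where the list lives on a line starting with "flags".

```shell
# Succeed (exit 0) if the flag named in $1 appears as a whole word on the
# first "flags" line of cpuinfo-style text read from stdin.
has_cpu_flag() {
    awk -v flag="$1" '
        /^flags[ \t]*:/ {
            for (i = 2; i <= NF; i++) if ($i == flag) found = 1
            exit                      # one core is enough, skip the rest
        }
        END { exit !found }
    '
}
```

For example: `has_cpu_flag pdpe1gb < /proc/cpuinfo && echo "1GiB pages supported"`. Matching whole words matters here, because a plain substring search for a short flag such as pse would also hit pse36.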

Seems like my CPU supports it, so I followed the general guide for hugepages from Proxmox by adjusting the kernel CMDLINE contents in /etc/default/grub. I’ve given vdesk 24GiB of RAM, so I want at least 24 pages (each 1GiB).

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=on default_hugepagesz=1G hugepagesz=1G hugepages=26"
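The page count itself is simple arithmetic: guest memory divided by page size, rounded up, plus however many spare pages seem prudent. A minimal sketch follows; the helper name `hugepages_needed` and the idea of passing the spare count as a third argument are assumptions made for this example.

```shell
# Print how many hugepages are needed to back a guest.
# Arguments: <guest memory in MiB> <page size in MiB> [spare pages, default 0]
hugepages_needed() {
    mem_mib=$1 page_mib=$2 spare=${3:-0}
    # Integer ceiling division, then add the spare pages on top.
    echo $(( (mem_mib + page_mib - 1) / page_mib + spare ))
}
```

With 24GiB of guest memory (24576 MiB), 1GiB pages and two spare pages, `hugepages_needed 24576 1024 2` gives the 26 used in the command line above.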

Curiously I couldn’t find a setting for enabling or configuring hugepages in the VM hardware configuration menu within the Web UI, so I manually added the line into the VM configuration.

root@pve.lan.tbk.fi:~# vim /etc/pve/qemu-server/114.conf
[...]
name: Hybrid
machine: q35
cores: 12
cpu: host
cpuunits: 2048
affinity: 6-11,18-23
hugepages: 1024        # <-- addition
[...]

Lastly I ran update-grub ; proxmox-boot-tool refresh to make sure the new CMDLINE was pushed, and then rebooted the machine.
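After the reboot, it’s worth confirming that the new parameters actually reached the running kernel by looking at /proc/cmdline. A small sketch for picking a single parameter out of it is below; `cmdline_param` is a name invented for this example, and it only handles simple key=value parameters.

```shell
# Read a kernel command line on stdin and print the value of the parameter
# named in $1 (exact key match, so "hugepages" does not match "hugepagesz").
cmdline_param() {
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}
```

On the host, `cmdline_param hugepages < /proc/cmdline` should then print 26 if the GRUB change was picked up.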

After the reboot I started vdesk and checked current memory information from /proc/meminfo:

root@pve.lan.tbk.fi:~# grep -i huge /proc/meminfo
AnonHugePages:  33568768 kB
ShmemHugePages:        0 kB
FileHugePages:     49152 kB
HugePages_Total:      26
HugePages_Free:        2
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        27262976 kB

This would indicate that the VM was correctly given 24GiB worth of hugepages for its memory space: 26 pages in total, of which 2 are still free. Note that I overprovisioned the number of pages on Proxmox because I’m not yet familiar with how the system actually behaves under the hood, or whether I need some extra or not.
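The arithmetic behind that reading is (HugePages_Total - HugePages_Free) × Hugepagesize. As a rough sketch, it can be scripted like this; the helper name `hugepages_in_use_mib` is made up for this example.

```shell
# Read /proc/meminfo-style text on stdin and print how many MiB of
# pre-allocated hugepages are currently in use.
hugepages_in_use_mib() {
    awk '
        $1 == "HugePages_Total:" { total = $2 }
        $1 == "HugePages_Free:"  { free = $2 }
        $1 == "Hugepagesize:"    { size_kb = $2 }
        END { printf "%d\n", (total - free) * size_kb / 1024 }
    '
}
```

Running `hugepages_in_use_mib < /proc/meminfo` against the values above gives 24576 MiB, which is the 24GiB assigned to the VM.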

Storage

The VM disk is currently on a Gen5 NVMe SSD. There’s not much I can do to improve that, especially with the current hardware prices.

Conclusion and Performance Improvement

So, good news and bad news. The good news is that the manual CPU affinity changes definitely improved performance across the board in basically all tasks, including the video game (see image below). The bad news is that I still can’t play my game, as there appear to be unresolved issues on Linux when using Blackwell cards and Proton in select titles.

But even with the issues, a few key details are evident when comparing the initial and final metrics side by side. The CPU load distribution is far more uniform across the cores when running the video game, and the frametime graph shows far more consistent timing. To me this indicates that the changes were very successful.

Mangohud performance metrics before and after changes

The low GPU usage quirk in DS2 is probably due to an issue in Nvidia drivers, Proton, or just the game itself. Other titles showed a similar (big) improvement in frametimes and more uniform CPU load, without the very low GPU utilisation issue.