Feature: Topology Manager

Topology Manager coordinates Pod admission on the Node, using Topology Hints from other kubelet components to optimize workload performance. It collects these hints on a pod-by-pod or container-by-container basis; the hints are derived from the requests and limits under spec.containers[].resources and spec.initContainers[].resources. If the generated Topology Hints are not compatible with the Node, the Pod may be rejected.
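
For example, a container whose CPU and memory requests equal its limits is placed in the Guaranteed QoS class, which is what lets the kubelet's hint providers generate meaningful NUMA hints for it. A minimal sketch, assuming a hypothetical Pod name and image:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: tm-demo                                   # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/demo/app:latest   # illustrative image
    resources:
      requests:
        cpu: "4"        # integer CPU request equals the limit, so the Pod is Guaranteed
        memory: 1Gi
      limits:
        cpu: "4"
        memory: 1Gi
EOF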

This document includes supporting use cases that show how the feature works on IBM Power Systems hardware, along with supporting references and utilities.
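
Before walking through the use cases, note that the Topology Manager only acts when a non-default policy is configured on the kubelet. On OpenShift this is done with a KubeletConfig custom resource targeted at a MachineConfigPool; the sketch below assumes a pool labeled custom-kubelet=topology-manager, and the object name and label are illustrative:

$ cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: topology-manager-enabled          # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: topology-manager    # label on the target MachineConfigPool
  kubeletConfig:
    cpuManagerPolicy: static              # lets the CPU Manager contribute NUMA hints
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node
EOF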

Use Cases

<To Be Added>

  1. 1_Topology_Manager_with_Hugepages.md

References

The following are useful references related to the Topology Manager.

Appendix: Utilities

  1. NUMA: Check basic information on the CPU topology with lscpu.
[root@iwdc04-st-pvc-n1 ~]# lscpu 
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  8
Core(s) per socket:  3
Socket(s):           3
NUMA node(s):        2
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7,16-23,32-39,48-55,64-71
NUMA node1 CPU(s):   8-15,24-31,40-47,56-63,72-79
Physical sockets:    2
Physical chips:      1
Physical cores/chip: 10

A few examples pipe the output through grep, e.g. lscpu | grep -i -e cpu -e thread -e core -e socket, to focus on the topology-related fields.
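
Reconstructed from the lscpu listing above, that filtered view looks roughly like this:

$ lscpu | grep -i -e cpu -e thread -e core -e socket
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  8
Core(s) per socket:  3
Socket(s):           3
NUMA node0 CPU(s):   0-7,16-23,32-39,48-55,64-71
NUMA node1 CPU(s):   8-15,24-31,40-47,56-63,72-79
Physical sockets:    2
Physical cores/chip: 10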

  2. numactl on RHEL is helpful for debugging and seeing what is going on with NUMA. See the Linux manual page: numa

a. Install the NUMA tools.

$ yum install -y numatop numactl
...
Complete!

b. Type numatop, then press N for the node overview (it gets more interesting with more than one node)

                                       NumaTOP v2.0, (C) 2015 Intel Corporation

Node Overview (interval: 5.0s)

 NODE     MEM.ALL    MEM.FREE     RMA(K)     LMA(K)    RMA/LMA       CPU%
    0       14.9G        2.2G        0.4       52.4        0.0        0.3

c. Press b to go back (once again, more interesting with more than one node)

                                       NumaTOP v2.0, (C) 2015 Intel Corporation

Monitoring 206 processes and 505 threads (interval: 5.0s)

   PID           PROC     RMA(K)     LMA(K)    RMA/LMA        CPI     *CPU%
 44547        haproxy        0.1       21.8        0.0       3.82       0.1
   718    xfsaild/dm-        0.1       12.8        0.0       3.61       0.1
   839    systemd-jou        0.2       10.8        0.0       2.99       0.1
  1664       rsyslogd        0.0        9.5        0.0       2.93       0.0
232987        numatop        0.0       10.6        0.0       2.91       0.0
  1365          tuned        0.0        5.0        0.0       2.43       0.0
     1        systemd        0.0        0.0        0.0       0.00       0.0
     2       kthreadd        0.0        0.0        0.0       0.00       0.0
     3         rcu_gp        0.0        0.0        0.0       0.00       0.0
     4     rcu_par_gp        0.0        0.0        0.0       0.00       0.0
     6    kworker/0:0        0.0        0.0        0.0       0.00       0.0
     8    mm_percpu_w        0.0        0.0        0.0       0.00       0.0
     9    rcu_tasks_r        0.0        0.0        0.0       0.00       0.0
    10    rcu_tasks_t        0.0        0.0        0.0       0.00       0.0
    11    ksoftirqd/0        0.0        0.0        0.0       0.00       0.0
    12      rcu_sched        0.0        0.0        0.0       0.00       0.0

<- Hotkey for sorting: 1(RMA), 2(LMA), 3(RMA/LMA), 4(CPI), 5(CPU%) ->
CPU% = system CPU utilization

A handy NUMA reference script is available at numactl; it is only really helpful if you have more than one node.

If you want to really exercise your NUMA nodes, you can use the numademo command that ships with numactl; see the sketch below.
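
A hedged sketch of the invocations that are useful here (./my_workload is a hypothetical binary; output varies by system and is omitted):

# Summarize the NUMA nodes, their CPUs, memory sizes and inter-node distances
$ numactl --hardware

# Show the NUMA policy of the current shell
$ numactl --show

# Run a command bound to a single node's CPUs and memory
$ numactl --cpunodebind=0 --membind=0 ./my_workload

# Run the numademo memset test over a 64MB buffer
$ numademo 64M memset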

  3. Check the kernel ring buffer with dmesg
$ dmesg | grep numa
[    0.000000] numa: Partition configured for 32 NUMA nodes.
[    0.000000] numa:   NODE_DATA [mem 0x3ffdd7c00-0x3ffde3fff]
[    0.005284] numa: Node 0 CPUs: 0-7

The output shows a single NUMA node with eight CPUs.
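
Two other quick checks that need no extra packages: the node layout under sysfs, and (if the kernel was built with it) the automatic NUMA balancing setting. A sketch:

# List the NUMA nodes the kernel has brought online
$ ls /sys/devices/system/node/ | grep node
node0

# 1 = automatic NUMA balancing enabled, 0 = disabled
$ cat /proc/sys/kernel/numa_balancing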

  4. Use the hardware locality tool hwloc

a. Install hwloc

$ yum install -y hwloc
Updating Subscription Management repositories.
...
Total download size: 2.2 M
Installed size: 3.7 M
Downloading Packages:
(1/2): hwloc-libs-2.2.0-3.el8.ppc64le.rpm                                              8.9 MB/s | 2.0 MB     00:00    
(2/2): hwloc-2.2.0-3.el8.ppc64le.rpm                                                   730 kB/s | 167 kB     00:00    
-----------------------------------------------------------------------------------------------------------------------
Total                                                                                  9.5 MB/s | 2.2 MB     00:00     
...
Complete!

b. Check the hardware layout and, more importantly, the NUMANode layout

$ export HWLOC_ALLOW=all; lstopo-no-graphics -v
Machine (P#0 total=15624320KB PlatformName=pSeries PlatformModel="CHRP IBM,9009-22A" Backend=Linux OSName=Linux OSRelease=4.18.0-XYZ.el8.ppc64le OSVersion="#1 SMP Fri Apr 15 21:55:01 EDT 2022" HostName=XYZ.xip.io Architecture=ppc64le hwlocVersion=2.2.0 ProcessName=lstopo-no-graphics)
  Core L#0 (P#0 total=15624320KB)
    NUMANode L#0 (P#0 local=15624320KB total=15624320KB)
    L1dCache L#0 (size=32KB linesize=128 ways=32)
      L1iCache L#0 (size=32KB linesize=128 ways=32)
        Package L#0 (CPUModel="POWER9 (architected), altivec supported" CPURevision="2.2 (pvr XYZ)")
          PU L#0 (P#0)
        PU L#1 (P#2)
    L1dCache L#1 (size=32KB linesize=128 ways=32)
      L1iCache L#1 (size=32KB linesize=128 ways=32)
        PU L#2 (P#1)
        PU L#3 (P#3)
  Block(Disk) L#0 (Size=125829120 SectorSize=512 LinuxDeviceID=8:80 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdf"
  Block(Disk) L#1 (Size=314572800 SectorSize=512 LinuxDeviceID=8:224 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdo"
  Block(Disk) L#2 (Size=125829120 SectorSize=512 LinuxDeviceID=8:48 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdd"
  Block(Disk) L#3 (Size=314572800 SectorSize=512 LinuxDeviceID=8:192 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdm"
  Block(Disk) L#4 (Size=125829120 SectorSize=512 LinuxDeviceID=8:16 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdb"
  Block(Disk) L#5 (Size=314572800 SectorSize=512 LinuxDeviceID=8:160 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdk"
  Block(Disk) L#6 (Size=314572800 SectorSize=512 LinuxDeviceID=8:128 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdi"
  Block L#7 (Size=514 SectorSize=2048 LinuxDeviceID=11:0 Vendor=AIX Model=VOPTA) "sr0"
  Block(Disk) L#8 (Size=314572800 SectorSize=512 LinuxDeviceID=8:96 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdg"
  Block(Disk) L#9 (Size=314572800 SectorSize=512 LinuxDeviceID=8:64 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sde"
  Block(Disk) L#10 (Size=125829120 SectorSize=512 LinuxDeviceID=8:208 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdn"
  Block(Disk) L#11 (Size=314572800 SectorSize=512 LinuxDeviceID=8:32 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdc"
  Block(Disk) L#12 (Size=125829120 SectorSize=512 LinuxDeviceID=8:176 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdl"
  Block(Disk) L#13 (Size=314572800 SectorSize=512 LinuxDeviceID=8:0 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sda"
  Block(Disk) L#14 (Size=125829120 SectorSize=512 LinuxDeviceID=8:144 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdj"
  Block(Disk) L#15 (Size=125829120 SectorSize=512 LinuxDeviceID=8:112 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdh"
  Block(Disk) L#16 (Size=125829120 SectorSize=512 LinuxDeviceID=8:240 Vendor=IBM Model=2145 Revision=0000 SerialNumber=XYZ) "sdp"
  Network L#17 (Address=fa:5f:e3:71:51:21) "env3"
  Network L#18 (Address=3a:e2:ec:7d:cf:64) "env4"
  Network L#19 (Address=fa:5f:e3:71:51:20) "env2"
depth 0:           1 Machine (type #0)
 depth 1:          1 Core (type #2)
  depth 2:         2 L1dCache (type #4)
   depth 3:        2 L1iCache (type #9)
    depth 4:       1 Package (type #1)
     depth 5:      4 PU (type #3)
Special depth -3:  1 NUMANode (type #13)
Special depth -6:  20 OSDev (type #16)
60 processors not represented in topology: 0xffffffff,0xfffffff0

Note: I hit some issues when I did not export HWLOC_ALLOW=all.
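
If you only care about where the NUMA nodes sit, lstopo can restrict its textual output to a single object type. A sketch, assuming hwloc 2.x where the --only flag is available:

# Print only the NUMANode objects (one line per node; exact format varies by hwloc version)
$ export HWLOC_ALLOW=all; lstopo-no-graphics --only NUMANode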

  5. Use the hardware locality GUI hwloc-gui to output a graphical layout of the CPU topology.

a. Install hwloc-gui

$  yum install -y hwloc-gui
...
Total download size: 73 k
Installed size: 142 k
Downloading Packages:
hwloc-gui-2.2.0-3.el8.ppc64le.rpm                                                      452 kB/s |  73 kB     00:00    
-----------------------------------------------------------------------------------------------------------------------
Total                                                                                  449 kB/s |  73 kB     00:00     
...
Complete!

b. Run the following lstopo command to generate a graphics file

$ lstopo --pid $$ --no-io --of svg > topology.svg

c. Download and view the file (note: certain parts were removed to hide the hostname and date-time)

images/topology.svg

  6. Check that /sys/fs/cgroup/memory has the memory.numa_stat file
$ cat /sys/fs/cgroup/memory/memory.numa_stat 
total=227 N1=227
file=133 N1=133
anon=0 N1=0
unevictable=94 N1=94
hierarchical_total=21096 N1=20097
hierarchical_file=14617 N1=12969
hierarchical_anon=3960 N1=4851
hierarchical_unevictable=2519 N1=2277
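
The same file exists per pod and per container under the kubelet's cgroup hierarchy, which is handy for checking where a specific workload's memory actually landed. A sketch assuming cgroup v1 with the systemd driver (slice names below are placeholders):

# Locate the per-pod memory.numa_stat files
$ find /sys/fs/cgroup/memory/kubepods.slice -name memory.numa_stat | head -3

# Then inspect the one for the pod or container of interest
$ cat /sys/fs/cgroup/memory/kubepods.slice/<pod-slice>/memory.numa_stat
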
  7. cpuset-visualizer is a handy tool for visualizing CPU set usage (see the sketch below for the underlying cgroup data).
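
The cpuset cgroup files are the raw data such a visualizer renders; they show which CPUs and memory nodes a container is confined to. A sketch assuming cgroup v1 paths; the slice and scope names are placeholders:

# CPUs the container may run on
$ cat /sys/fs/cgroup/cpuset/kubepods.slice/<pod-slice>/<container-scope>/cpuset.cpus

# Memory nodes the container may allocate from
$ cat /sys/fs/cgroup/cpuset/kubepods.slice/<pod-slice>/<container-scope>/cpuset.mems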

Is this a Red Hat or IBM supported solution?

No. This is only a proof of concept that serves as a good starting point for understanding how the Topology Manager works in OpenShift.