Modern multi-socket servers exhibit NUMA characteristics that can hurt application performance if ignored. On a NUMA (Non-Uniform Memory Access) system, all memory is shared among processors. Each processor has access to its own memory (local memory) as well as memory that is local to another processor (remote memory). However, memory access time (latency) depends on the location of the memory relative to the processor: a processor can access its local memory faster than remote memory, and these varying memory latencies play a big role in application performance.
Solaris organizes the hardware resources -- CPUs, memory and I/O devices -- into one or more logical groups based on their proximity to each other, in such a way that all the hardware resources in a group are considered local to that group. These groups are referred to as locality groups or NUMA nodes. In other words, a locality group (lgroup) is an abstraction that tells what hardware resources are near each other on a NUMA system. Each locality group has at least one processor and possibly some associated memory and/or I/O devices. To minimize the impact of NUMA characteristics, Solaris considers the lgroup-based physical topology when mapping threads and data to CPUs and memory.
Note that even though Solaris attempts to provide good performance out of the box, some applications may still suffer the impact of NUMA, whether due to misconfiguration of the hardware or software or for some other reason. Engineered systems such as Oracle SuperCluster go to great lengths in setting up customer environments to minimize the impact of NUMA so that applications perform as expected in a predictable manner. Still, application developers and system/application administrators need to take the NUMA factor into account when developing for and managing applications on large systems. Solaris-provided tools and APIs can be used to observe, diagnose, control and even correct or fix issues related to locality and latency. The rest of this post is about the tools that can be used to examine the locality of cores, memory and I/O devices.
The sample outputs below were collected from a SPARC T4-4 server.
Locality Group Hierarchy
The lgrpinfo utility prints information about the lgroup hierarchy and its contents. It is useful in understanding the context in which the OS is trying to optimize applications for locality, and also in figuring out which CPUs are close to each other, how much memory is near them, and the relative latencies between the CPUs and the different memory blocks.
e.g.,
# lgrpinfo -a
lgroup 0 (root):
        Children: 1-4
        CPUs: 0-255
        Memory: installed 1024G, allocated 75G, free 948G
        Lgroup resources: 1-4 (CPU); 1-4 (memory)
        Latency: 18
lgroup 1 (leaf):
        Children: none, Parent: 0
        CPUs: 0-63
        Memory: installed 256G, allocated 18G, free 238G
        Lgroup resources: 1 (CPU); 1 (memory)
        Load: 0.0227
        Latency: 12
lgroup 2 (leaf):
        Children: none, Parent: 0
        CPUs: 64-127
        Memory: installed 256G, allocated 15G, free 241G
        Lgroup resources: 2 (CPU); 2 (memory)
        Load: 0.000153
        Latency: 12
lgroup 3 (leaf):
        Children: none, Parent: 0
        CPUs: 128-191
        Memory: installed 256G, allocated 20G, free 236G
        Lgroup resources: 3 (CPU); 3 (memory)
        Load: 0.016
        Latency: 12
lgroup 4 (leaf):
        Children: none, Parent: 0
        CPUs: 192-255
        Memory: installed 256G, allocated 23G, free 233G
        Lgroup resources: 4 (CPU); 4 (memory)
        Load: 0.00824
        Latency: 12

Lgroup latencies:

------------------
  |  0  1  2  3  4
------------------
0 | 18 18 18 18 18
1 | 18 12 18 18 18
2 | 18 18 12 18 18
3 | 18 18 18 12 18
4 | 18 18 18 18 12
------------------
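The same information is also available programmatically through the liblgrp(3LGRP) API. The following is a minimal sketch (not a substitute for lgrpinfo) that walks the lgroup hierarchy and prints each lgroup's installed and free memory along with its latency relative to the root lgroup; the 64-entry child array and the megabyte units are arbitrary choices made just for this example. Compile with something like: cc lgrps.c -o lgrps -llgrp

/* lgrps.c -- walk the lgroup hierarchy with liblgrp(3LGRP) */
#include <stdio.h>
#include <sys/types.h>
#include <sys/lgrp_user.h>

static void
walk(lgrp_cookie_t cookie, lgrp_id_t lgrp, int depth)
{
        lgrp_id_t children[64];         /* arbitrary upper bound for this example */
        int nchildren, i;

        printf("%*slgroup %d: installed %lld MB, free %lld MB, latency to root %d\n",
            depth * 2, "", (int)lgrp,
            (long long)(lgrp_mem_size(cookie, lgrp, LGRP_MEM_SZ_INSTALLED,
            LGRP_CONTENT_DIRECT) >> 20),
            (long long)(lgrp_mem_size(cookie, lgrp, LGRP_MEM_SZ_FREE,
            LGRP_CONTENT_DIRECT) >> 20),
            lgrp_latency(lgrp_root(cookie), lgrp));

        nchildren = lgrp_children(cookie, lgrp, children, 64);
        for (i = 0; i < nchildren && i < 64; i++)
                walk(cookie, children[i], depth + 1);
}

int
main(void)
{
        /* LGRP_VIEW_OS exposes all lgroups known to the OS */
        lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);

        if (cookie == LGRP_COOKIE_NONE) {
                perror("lgrp_init");
                return (1);
        }
        walk(cookie, lgrp_root(cookie), 0);
        (void) lgrp_fini(cookie);
        return (0);
}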
CPU Locality
The lgrpinfo utility shown above already presents CPU locality in a clear manner. Here is another way to retrieve the association between CPU IDs and lgroups.
# echo ::lgrp -p | mdb -k
   LGRPID  PSRSETID    LOAD  #CPU  CPUS
        1         0   17873    64  0-63
        2         0   17755    64  64-127
        3         0    2256    64  128-191
        4         0   18173    64  192-255
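The same CPU-to-lgroup association can be retrieved in a program with lgrp_cpus() from liblgrp(3LGRP). Below is a minimal sketch written under the assumption that the root lgroup's children are the leaf lgroups (as on the T4-4 above); the array sizes are arbitrary example values.

/* lgrpcpus.c -- list CPU IDs per leaf lgroup with liblgrp(3LGRP) */
#include <stdio.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/lgrp_user.h>

int
main(void)
{
        lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);
        lgrp_id_t leaves[64];           /* arbitrary example bounds */
        processorid_t cpus[1024];
        int nleaves, ncpus, i, j;

        if (cookie == LGRP_COOKIE_NONE) {
                perror("lgrp_init");
                return (1);
        }

        /* assumption: the root lgroup's children are the leaf lgroups */
        nleaves = lgrp_children(cookie, lgrp_root(cookie), leaves, 64);
        for (i = 0; i < nleaves && i < 64; i++) {
                ncpus = lgrp_cpus(cookie, leaves[i], cpus, 1024,
                    LGRP_CONTENT_DIRECT);
                printf("lgroup %d (%d CPUs):", (int)leaves[i], ncpus);
                for (j = 0; j < ncpus && j < 1024; j++)
                        printf(" %d", (int)cpus[j]);
                printf("\n");
        }
        (void) lgrp_fini(cookie);
        return (0);
}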
Memory Locality
The lgrpinfo utility shows the total memory that belongs to each locality group, but it does not show exactly which memory blocks belong to which locality group. One of mdb's debugger commands (dcmds) helps retrieve this information.
1. List memory blocks

# ldm list-devices -a memory
MEMORY
    PA                   SIZE            BOUND
    0xa00000             32M             _sys_
    0x2a00000            96M             _sys_
    0x8a00000            374M            _sys_
    0x20000000           1048064M        primary

2. Print the physical memory layout of the system

# echo ::syslayout | mdb -k
         STARTPA            ENDPA     SIZE  MG MN    STL    ETL
        20000000        200000000     7.5g   0  0      4     40
       200000000        400000000       8g   1  1    800    840
       400000000        600000000       8g   2  2   1000   1040
       600000000        800000000       8g   3  3   1800   1840
       800000000        a00000000       8g   0  0     40     80
       a00000000        c00000000       8g   1  1    840    880
       c00000000        e00000000       8g   2  2   1040   1080
       e00000000       1000000000       8g   3  3   1840   1880
      1000000000       1200000000       8g   0  0     80     c0
      1200000000       1400000000       8g   1  1    880    8c0
      1400000000       1600000000       8g   2  2   1080   10c0
      1600000000       1800000000       8g   3  3   1880   18c0
...
...
The values in the MN (memory node) column can be treated as lgroup numbers after adding 1: a value of 0 under MN translates to lgroup 1, 1 translates to lgroup 2, and so on. Better yet, the ::mnode debugger command lists the mapping of mnodes to lgroups, as shown below.
# echo ::mnode | mdb -k
          MNODE ID LGRP ASLEEP  UTOTAL  UFREE UCACHE  KTOTAL  KFREE KCACHE
    2075ad80000  0    1      -    249g   237g   114m    5.7g   714m      -
    2075ad802c0  1    2      -    240g   236g   288m     15g   4.8g      -
    2075ad80580  2    3      -    246g   234g   619m    9.6g   951m      -
    2075ad80840  3    4      -    247g   231g    24m      9g   897m      -
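To determine which lgroup backs a particular virtual address from within a program, the meminfo(2) system call can be used with the MEMINFO_VLGRP request. A minimal sketch follows; the 4 MB buffer is just an example allocation, and the buffer is touched first so that physical memory is actually allocated behind it.

/* vlgrp.c -- find the lgroup backing a virtual address with meminfo(2) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

int
main(void)
{
        size_t len = 4 * 1024 * 1024;           /* example 4 MB buffer */
        char *buf = malloc(len);
        uint64_t inaddr, outdata;
        uint_t info_req = MEMINFO_VLGRP;        /* lgroup owning the backing memory */
        uint_t validity;

        if (buf == NULL)
                return (1);
        memset(buf, 0, len);                    /* touch it so pages are allocated */

        inaddr = (uint64_t)(uintptr_t)buf;
        if (meminfo(&inaddr, 1, &info_req, 1, &outdata, &validity) != 0) {
                perror("meminfo");
                return (1);
        }

        /* bit 0: address valid; bit 1: first (and only) info request valid */
        if ((validity & 3) == 3)
                printf("%p is backed by memory in lgroup %llu\n",
                    (void *)buf, (unsigned long long)outdata);
        else
                printf("%p is not backed by physical memory\n", (void *)buf);

        free(buf);
        return (0);
}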
Unrelated notes:
- Main memory on the T4-4 is interleaved across all memory banks with an 8 GB interleave size -- meaning the first 8 GB chunk (excluding the _sys_ blocks) is populated in lgroup 1, closest to processor #1; the second 8 GB chunk in lgroup 2, closest to processor #2; the third 8 GB chunk in lgroup 3, closest to processor #3; the fourth 8 GB chunk in lgroup 4, closest to processor #4; the fifth 8 GB chunk again in lgroup 1, closest to processor #1; and so on. Memory is not interleaved on T5 and M6 systems (confirm by running the ::syslayout dcmd). Conceptually, memory interleaving is similar to disk striping. A tiny arithmetic sketch of this mapping follows these notes.
- Keep in mind that debugger commands (dcmds) are not committed interfaces, so there is no guarantee that they will continue to work on future versions of Solaris. Some of these dcmds may not work on some of the existing versions of Solaris.
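To make the interleaving arithmetic concrete, here is a tiny illustrative sketch, valid only for the four-lgroup, 8 GB interleave layout described above and ignoring the _sys_ carve-outs, that maps an offset within the interleaved range to the lgroup that owns it (matching the MN column of ::syslayout plus one).

/* interleave.c -- illustrative mapping for the T4-4 layout described above */
#include <stdio.h>

#define INTERLEAVE      (8ULL << 30)    /* 8 GB interleave size */
#define NUM_LGROUPS     4               /* four leaf lgroups on this T4-4 */

/* offset is relative to the start of the interleaved range */
static int
lgroup_of_offset(unsigned long long offset)
{
        return ((int)((offset / INTERLEAVE) % NUM_LGROUPS) + 1);
}

int
main(void)
{
        int i;

        /* 0-8 GB -> lgroup 1, 8-16 GB -> lgroup 2, ..., 32-40 GB -> lgroup 1 again */
        for (i = 0; i <= 4; i++)
                printf("chunk starting at %d GB -> lgroup %d\n",
                    i * 8, lgroup_of_offset((unsigned long long)i * INTERLEAVE));
        return (0);
}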
I/O Device Locality
The -d option of the lgrpinfo utility accepts the path to an I/O device and returns the ID(s) of the lgroup(s) closest to that device. Each I/O device on the system can be connected to one or more NUMA nodes, so it is not uncommon to see more than one lgroup ID returned by lgrpinfo.
e.g.,
# lgrpinfo -d /dev/dsk/c1t0d0
lgroup ID : 1

# dladm show-phys | grep 10000
net4            Ethernet      up         10000  full      ixgbe0

# lgrpinfo -d /dev/ixgbe0
lgroup ID : 1

# dladm show-phys | grep ibp0
net12           Infiniband    up         32000  unknown   ibp0

# lgrpinfo -d /dev/ibp0
lgroup IDs : 1-4
NUMA IO Groups
The ::numaio_group debugger command shows information about all NUMA I/O groups.
# dladm show-phys | grep up
net0            Ethernet      up         1000   full      igb0
net12           Ethernet      up         10     full      usbecm2
net4            Ethernet      up         10000  full      ixgbe0

# echo ::numaio_group | mdb -k
            ADDR GROUP_NAME                       CONSTRAINT
    10050e1eba48 net4                             lgrp : 1
    10050e1ebbb0 net0                             lgrp : 1
    10050e1ebd18 usbecm2                          lgrp : 1
    10050e1ebe80 scsi_hba_ngrp_mpt_sas1           lgrp : 4
    10050e1ebef8 scsi_hba_ngrp_mpt_sas0           lgrp : 1
Relying on prtconf is another way to find the NUMA I/O locality of an I/O device.
e.g.,
# dladm show-phys | grep up | grep ixgbe
net4            Ethernet      up         10000  full      ixgbe0

== Find the device path for the network interface ==

# grep ixgbe /etc/path_to_inst | grep " 0 "
"/pci@400/pci@1/pci@0/pci@4/network@0" 0 "ixgbe"

== Find NUMA IO Lgroups ==

# prtconf -v /devices/pci@400/pci@1/pci@0/pci@4/network@0
...
    Hardware properties:
    ...
        name='numaio-lgrps' type=int items=1
            value=00000001
...
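The same device property can also be read programmatically with libdevinfo(3LIB). The sketch below is illustrative only: it walks the device tree looking for the device path found above and prints the numaio-lgrps property if it is present; the path is hard-coded from the example and error handling is minimal. Compile with something like: cc numaio.c -o numaio -ldevinfo

/* numaio.c -- read the numaio-lgrps property with libdevinfo(3LIB) */
#include <stdio.h>
#include <string.h>
#include <libdevinfo.h>

static int
print_numaio_lgrps(di_node_t node, void *arg)
{
        const char *wanted = arg;       /* device path we are looking for */
        char *path = di_devfs_path(node);
        int *lgrps, n, i;

        if (path != NULL && strcmp(path, wanted) == 0) {
                n = di_prop_lookup_ints(DDI_DEV_T_ANY, node,
                    "numaio-lgrps", &lgrps);
                if (n > 0) {
                        printf("%s: numaio-lgrps =", path);
                        for (i = 0; i < n; i++)
                                printf(" %d", lgrps[i]);
                        printf("\n");
                } else {
                        printf("%s: no numaio-lgrps property\n", path);
                }
        }
        if (path != NULL)
                di_devfs_path_free(path);
        return (DI_WALK_CONTINUE);
}

int
main(void)
{
        /* device path taken from the path_to_inst lookup above */
        char *devpath = "/pci@400/pci@1/pci@0/pci@4/network@0";
        di_node_t root = di_init("/", DINFOCPYALL);

        if (root == DI_NODE_NIL) {
                perror("di_init");
                return (1);
        }
        (void) di_walk_node(root, DI_WALK_CLDFIRST, devpath, print_numaio_lgrps);
        di_fini(root);
        return (0);
}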
Resource Groups
The list-rsrc-group subcommand of the Logical Domains Manager command-line interface (ldm) shows a consolidated list of the processor cores, memory blocks and I/O devices that belong to each resource group. This subcommand is available in ldm 3.2 and later versions.
In a resource group, resources are grouped based on the underlying physical relationship between cores, memory and I/O buses. Depending on the hardware platform, some server configurations, such as SPARC M7-8, may have resource groups that map directly to locality groups.
# ldm ls-rsrc-group
NAME                                    CORE  MEMORY   IO
/SYS/CMIOU0                             32    480G     4
/SYS/CMIOU3                             32    480G     4

# ldm ls-rsrc-group -l /SYS/CMIOU0
NAME                                    CORE  MEMORY   IO
/SYS/CMIOU0                             32    480G     4

CORE
    CID                                             BOUND
    0, 1, 2, 3, 8, 9, 10, 11                        primary
    16, 17, 18, 19, 24, 25                          primary
    ...

MEMORY
    PA               SIZE             BOUND
    0x0              60M              _sys_
    0x3c00000        32M              _sys_
    0x5c00000        94M              _sys_
    0x4c000000       64M              _sys_
    0x50000000       15104M           primary
    0x400000000      128G             primary
    ...
    0x7400000000     16128M           primary
    0x77f0000000     64M              _sys_
    0x77f4000000     192M             _sys_

IO
    DEVICE           PSEUDONYM        BOUND
    pci@300          pci_0            primary
    pci@301          pci_1            primary
    pci@303          pci_3            primary
    pci@304          pci_4            primary
Process, Thread Locality
- The -H option of the prstat command shows the home lgroup of active user processes and threads.
- The -h option of the ps command can be used to examine the home lgroup of all user processes and threads. The -H option can be used to list all processes that are in a certain locality group.
  [Related] Solaris assigns a thread to an lgroup when the thread is created. That lgroup is called the thread's home lgroup. Solaris runs the thread on the CPUs in the thread's home lgroup and allocates memory from that lgroup whenever possible.
- The plgrp tool shows the placement of threads among locality groups. The same tool can be used to set the home locality group and lgroup affinities for one or more processes, threads or LWPs. (A minimal C sketch of the corresponding liblgrp calls follows this list.)
- The -L option of the pmap command shows the lgroup that contains the physical memory backing some virtual memory.
  [Related] Breakdown of Oracle SGA into Solaris Locality Groups
- Memory placement among lgroups can possibly be achieved by using pmadvise while the application is running, or by using the madvise() system call during development; both provide advice to the kernel's virtual memory manager, which uses the hint to determine how to allocate memory for the specified range. This mechanism is beneficial when administrators and developers understand the target application's data access patterns (see the madvise() sketch after the examples below). It is not possible to specify memory placement locality for OSM and ISM segments using the pmadvise command or the madvise() system call (DISM is an exception).
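As referenced in the plgrp item above, the same observation and adjustment can be done from within a program using liblgrp(3LGRP). The following minimal sketch prints the calling thread's home lgroup and then requests a strong affinity to another lgroup, which effectively re-homes the thread -- roughly what plgrp -H does from the command line. The target lgroup ID used here (2) is just an example value and must exist on the system. Compile with something like: cc rehome.c -o rehome -llgrp

/* rehome.c -- print and change the calling thread's home lgroup */
#include <stdio.h>
#include <sys/types.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>

int
main(void)
{
        lgrp_id_t target = 2;   /* example lgroup ID; must exist on the system */

        printf("home lgroup before: %d\n", (int)lgrp_home(P_LWPID, P_MYID));

        /* a strong affinity makes the specified lgroup the new home lgroup */
        if (lgrp_affinity_set(P_LWPID, P_MYID, target, LGRP_AFF_STRONG) != 0)
                perror("lgrp_affinity_set");

        printf("home lgroup after : %d\n", (int)lgrp_home(P_LWPID, P_MYID));
        return (0);
}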
Examples:
# prstat -H

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU LGRP PROCESS/NLWP
  1865 root      420M  414M sleep   59    0 447:51:13 0.1%    2 java/108
  3659 oracle   1428M 1413M sleep   38    0  68:39:28 0.0%    4 oracle/1
  1814 oracle    155M  110M sleep   59    0  70:45:17 0.0%    4 gipcd.bin/9
     8 root        0K    0K sleep   60    -  70:52:21 0.0%    0 vmtasks/257
  3765 root      447M  413M sleep   59    0  29:24:20 0.0%    3 crsd.bin/43
  3949 oracle    505M  456M sleep   59    0   0:59:42 0.0%    2 java/124
 10825 oracle   1097M 1074M sleep   59    0  18:13:27 0.0%    3 oracle/1
  3941 root      210M  184M sleep   59    0  20:03:37 0.0%    4 orarootagent.bi/14
  3743 root      119M   98M sleep  110    -  24:53:29 0.0%    1 osysmond.bin/13
  3324 oracle    266M  225M sleep  110    -  19:52:31 0.0%    4 ocssd.bin/34
  1585 oracle    122M   91M sleep   59    0  18:06:34 0.0%    3 evmd.bin/10
  3918 oracle    168M  144M sleep   58    0  14:35:31 0.0%    1 oraagent.bin/28
  3427 root      112M   80M sleep   59    0  12:34:28 0.0%    4 octssd.bin/12
  3635 oracle   1425M 1406M sleep  101    -  13:55:31 0.0%    4 oracle/1
  1951 root      183M  161M sleep   59    0   9:26:51 0.0%    4 orarootagent.bi/21
Total: 251 processes, 2414 lwps, load averages: 1.37, 1.46, 1.47

== Locality group 2 is the home lgroup of the java process with pid 1865 ==

# plgrp 1865
     PID/LWPID    HOME
    1865/1        2
    1865/2        2
     ...          ...
    1865/22       4
    1865/23       4
     ...          ...
    1865/41       1
    1865/42       1
     ...          ...
    1865/60       3
    1865/61       3
     ...          ...

# plgrp 1865 | awk '{print $2}' | grep 2 | wc -l
      30
# plgrp 1865 | awk '{print $2}' | grep 1 | wc -l
      25
# plgrp 1865 | awk '{print $2}' | grep 3 | wc -l
      25
# plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
      28

== Let's reset the home lgroup of the java process id 1865 to 4 ==

# plgrp -H 4 1865
     PID/LWPID    HOME
    1865/1        2 => 4
    1865/2        2 => 4
    1865/3        2 => 4
    1865/4        2 => 4
     ...          ...
    1865/184      1 => 4
    1865/188      4 => 4

# plgrp 1865 | awk '{print $2}' | egrep "1|2|3" | wc -l
       0
# plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
     108

# prstat -H -p 1865
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU LGRP PROCESS/NLWP
  1865 root      420M  414M sleep   59    0 447:57:30 0.1%    4 java/108

== List the home lgroup of all processes ==

# ps -aeH
   PID LGRP TTY         TIME CMD
     0    0 ?           0:11 sched
     5    0 ?           4:47 zpool-rp
     1    4 ?          21:04 init
     8    0 ?        4253:54 vmtasks
    75    4 ?           0:13 ipmgmtd
    11    3 ?           3:09 svc.star
    13    4 ?           2:45 svc.conf
  3322    1 ?         301:51 cssdagen
 ...
 11155    3 ?           0:52 oracle
 13091    4 ?           0:00 sshd
 13124    3 pts/5       0:00 bash
 24703    4 pts/8       0:00 bash
 12812    2 pts/3       0:00 bash
 ...

== Find out the lgroups which shared memory segments are allocated from ==

# pmap -Ls 24513 | egrep "Lgrp|256M|2G"
         Address       Bytes Pgsz Mode   Lgrp Mapped File
0000000400000000   33554432K   2G rwxs-     1   [ osm shmid=0x78000047 ]
0000000C00000000     262144K 256M rwxs-     3   [ osm shmid=0x78000048 ]
0000000C10000000     524288K 256M rwxs-     2   [ osm shmid=0x78000048 ]
0000000C30000000     262144K 256M rwxs-     3   [ osm shmid=0x78000048 ]
0000000C40000000     524288K 256M rwxs-     1   [ osm shmid=0x78000048 ]
0000000C60000000     262144K 256M rwxs-     2   [ osm shmid=0x78000048 ]

== Apply MADV_ACCESS_LWP policy advice to a segment at a specific address ==

# pmap -Ls 1865 | grep anon
00000007DAC00000      20480K   4M rw---     4   [ anon ]
00000007DC000000       4096K    - rw---     -   [ anon ]
00000007DFC00000      90112K   4M rw---     4   [ anon ]
00000007F5400000     110592K   4M rw---     4   [ anon ]

# pmadvise -o 7F5400000=access_lwp 1865

# pmap -Ls 1865 | grep anon
00000007DAC00000      20480K   4M rw---     4   [ anon ]
00000007DC000000       4096K    - rw---     -   [ anon ]
00000007DFC00000      90112K   4M rw---     4   [ anon ]
00000007F5400000      73728K   4M rw---     4   [ anon ]
00000007F9C00000      28672K    - rw---     -   [ anon ]
00000007FB800000       8192K   4M rw---     4   [ anon ]
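As noted in the list above, the in-process counterpart of the pmadvise example is the madvise(3C) call. The sketch below creates a throwaway anonymous mapping and applies MADV_ACCESS_LWP advice to it, telling the kernel that the next LWP to touch the range will access it heavily so its pages should be placed near that LWP's home lgroup; the 64 MB size is an arbitrary example value.

/* access_lwp.c -- apply MADV_ACCESS_LWP advice to an anonymous mapping */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
        size_t len = 64 * 1024 * 1024;          /* example 64 MB segment */
        caddr_t addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANON, -1, 0);

        if (addr == MAP_FAILED) {
                perror("mmap");
                return (1);
        }

        /* roughly equivalent to: pmadvise -o <addr>=access_lwp <pid> */
        if (madvise(addr, len, MADV_ACCESS_LWP) != 0)
                perror("madvise");

        memset(addr, 0, len);   /* pages get placed near this LWP's home lgroup */
        (void) munmap(addr, len);
        return (0);
}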
SEE ALSO:
- Man pages of lgrpinfo(1), plgrp(1), pmap(1), prstat(1M), ps(1), pmadvise(1), madvise(3C), madv.so.1(1), mdb(1)
- Web search keywords: NUMA, cc-NUMA, locality group, lgroup, lgrp, Memory Placement Optimization, MPO