NUMA Basics Experiment Report
Measurement Plan for the Basic NUMA Model Parameters
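Purpose
This experiment measures the values and distributions of the parameters of the IO-NUMA model on a specific platform. The data serve as the reference for building a new NUMA model that takes I/O access into account: we fill in the corresponding matrices and so obtain an empirical model from the measurements.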
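Principle
A different type of benchmark is used for each distance:
CPU-memory distance: measured with STREAM, a classic benchmark for memory access bandwidth.
CPU-NIC distance: measured with iPerf, which lets us specify the buffer size. With a small buffer size the memory chunk stays in the CPU cache, so Intel processors write directly into the cache using DCA, which gives an access scenario that does not depend on memory.
NIC-memory distance: measured with the classic netperf. When netperf measures network performance, the main work is the NIC buffer being written to memory via DMA, so the result is relatively independent of the CPU the thread runs on.
In the experiment we use numactl to bind the CPU and the memory that each benchmark runs on, which keeps the experimental variables under control.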
Related Tools
iPerf
iPerf2 features currently supported by iPerf3 :
TCP and UDP tests
Set port (-p)
Setting TCP options: No delay, MSS, etc.
Setting UDP bandwidth (-b)
Setting socket buffer size (-w)
Reporting intervals (-i)
Setting the iPerf buffer (-l)
Bind to specific interfaces (-B)
IPv6 tests (-6)
Number of bytes to transmit (-n)
Length of test (-t)
Parallel streams (-P)
Setting DSCP/TOS bit vectors (-S)
Change number output format (-f)
New Features in iPerf 3.0 :
Dynamic server (client/server parameter exchange) – Most server options from iPerf2 can now be dynamically set by the client
Client/server results exchange
An iPerf3 server accepts a single client at a time (multiple clients simultaneously for iPerf2)
iPerf API (libiperf) – Provides an easy way to use, customize and extend iPerf functionality
-R, Reverse test mode – Server sends, client receives
-O, --omit N : omit the first N seconds (to ignore TCP slowstart)
-b, --bandwidth n[KM] for TCP (only UDP for iPerf2): Set target bandwidth to n bits/sec (default 1 Mbit/sec for UDP, unlimited for TCP).
-V, --verbose : more detailed output than before
-J, --json : output in JSON format
-Z, --zerocopy : use a ‘zero copy’ sendfile() method of sending data. This uses much less CPU.
-T, --title str : prefix every output line with this string
-F, --file name : xmit/recv the specified file
-A, --affinity n/n,m : set CPU affinity (cores are numbered from 0 - Linux and FreeBSD only)
-k, --blockcount #[KMG] : number of blocks (packets) to transmit (instead of -t or -n)
-4, --version4 : only use IPv4
-6, --version6 : only use IPv6
-L, --flowlabel : set IPv6 flow label (Linux only)
-C, --linux-congestion : set congestion control algorithm (Linux and FreeBSD only) (-Z in iPerf2)
-d, --debug : emit debugging output. Primarily (perhaps exclusively) of use to developers.
-s, --server : iPerf2 can handle multiple client requests. iPerf3 will only allow one iperf connection at a time.
Features in iPerf 3.1 :
-I, --pidfile file : write a file with the process ID, most useful when running as a daemon.
--cport : Specify the client-side port.
--sctp : use SCTP rather than TCP (Linux, FreeBSD and Solaris).
--udp-counters-64bit : Support very long-running UDP tests, which could cause a counter to overflow
--logfile file : send output to a log file.
iPerf2 Features Not Supported by iPerf3 :
Bidirectional testing (-d / -r)
Data transmitted from stdin (-I)
TTL : time-to-live, for multicast (-T)
Exclude C(connection) D(data) M(multicast) S(settings) V(server) reports (-x)
Report as a Comma-Separated Values (-y)
Compatibility mode allows for use with older version of iPerf (-C)
Stream
Netperf
numactl
- usage:
numactl [ --interleave nodes ] [ --preferred node ] [ --membind nodes ] [ --cpunodebind nodes ] [ --physcpubind cpus ] [ --localalloc ] [--] command {arguments …}
numactl --show
numactl --hardware
numactl [ --huge ] [ --offset offset ] [ --shmmode shmmode ] [ --length length ] [ --strict ]
[ --shmid id ] --shm shmkeyfile | --file tmpfsfile
[ --touch ] [ --dump ] [ --dump-nodes ] memory policy
The command to use in this experiment is as follows:
numactl --membind= --cpubind=
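To make the binding concrete, the following is a minimal sketch of how the numactl command line could be generated for every combination of CPU node and memory node. It assumes a 4-node platform (as in the tables of this report) and uses `--cpunodebind`, which appears in the usage listing above, in place of `--cpubind`. The benchmark path `./stream` and the helper name `numactl_cmd` are hypothetical.

```python
from itertools import product

NODES = range(4)       # assumption: 4 NUMA nodes, as in the tables of this report
BENCH = ["./stream"]   # hypothetical path to the benchmark binary


def numactl_cmd(cpu_node, mem_node, bench=BENCH):
    """Build the argv that runs `bench` on cpu_node with memory taken from mem_node."""
    return ["numactl", f"--cpunodebind={cpu_node}", f"--membind={mem_node}", *bench]


# One command per (CPU node, memory node) pair.
commands = {(c, m): numactl_cmd(c, m) for c, m in product(NODES, NODES)}

if __name__ == "__main__":
    for (c, m), argv in commands.items():
        print(f"cpu node {c} / mem node {m}: {' '.join(argv)}")
```

Each argv list can be passed to `subprocess.run` unchanged; keeping the commands in a dictionary keyed by `(cpu_node, mem_node)` makes it easy to map each result back to its matrix cell.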
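NUMA Model
Platform-related information:
C-M Matrix
This matrix records the bandwidth with which each physical CPU accesses each memory region under this NUMA architecture.
N-M Matrix
This matrix records the average distance from the NIC to each memory region. Because this distance is independent of the CPU, we take the average over the CPU bindings to reduce measurement error.
C-N Vector
This vector records the distance from the NIC to the CPUs, represented by the average over the CPUs in the same node.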
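One possible way to hold these three structures in code is sketched below. The class name `NumaModel`, the field names, and the 4-node size are illustrative assumptions, and the zeros are placeholders rather than measured values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NumaModel:
    """Container for the three measured structures (names are illustrative)."""
    c_m: List[List[float]]  # c_m[i][j]: bandwidth from CPUs on node i to memory on node j
    n_m: List[float]        # n_m[j]: averaged NIC distance to memory on node j
    c_n: List[float]        # c_n[i]: averaged distance between the NIC and CPUs on node i

    def validate(self) -> None:
        """Raise ValueError if the shapes do not describe an n-node platform."""
        n = len(self.c_m)
        if any(len(row) != n for row in self.c_m):
            raise ValueError("C-M must be a square matrix")
        if len(self.n_m) != n or len(self.c_n) != n:
            raise ValueError("N-M and C-N need one entry per node")


# Placeholder zeros only; real values come from the measurements.
model = NumaModel(c_m=[[0.0] * 4 for _ in range(4)], n_m=[0.0] * 4, c_n=[0.0] * 4)
model.validate()
```

Storing N-M as a vector reflects the averaging over CPU bindings described above; if the per-CPU rows need to be kept as well, the field could be widened to a matrix.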
Experiment Content
CPU-Memory Distance
C-M
C\M | Node0 | Node1 | Node2 | Node3 |
---|---|---|---|---|
cpu 0~7 | | | | |
cpu 8~15 | | | | |
cpu 16~23 | | | | |
cpu 24~31 | | | | |
NIC-Memory Distance
N\M | Node0 | Node1 | Node2 | Node3 |
---|---|---|---|---|
Distance (cpu 0~7) | | | | |
Distance (cpu 8~15) | | | | |
Distance (cpu 16~23) | | | | |
Distance (cpu 24~31) | | | | |
Avg | | | | |
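The Avg row of this table is a column-wise mean over the four CPU bindings. A minimal sketch, with the hypothetical helper name `average_over_cpu_groups` and clearly synthetic numbers in the usage line (not measurements):

```python
from statistics import mean


def average_over_cpu_groups(rows):
    """Column-wise mean: one averaged value per memory node."""
    if not rows or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("rows must be non-empty and of equal length")
    # zip(*rows) transposes the table so each tuple is one memory-node column.
    return [mean(col) for col in zip(*rows)]


# Synthetic 2x2 example, only to show the shape of the computation.
example_avg = average_over_cpu_groups([[1.0, 2.0], [3.0, 4.0]])
```

Here `example_avg` is `[2.0, 3.0]`: the first entry averages 1.0 and 3.0, the second averages 2.0 and 4.0.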
CPU-NIC Distance
C\N | Node0 (cpu 0-7) | Node1 (cpu 8-15) | Node2 (cpu 16-23) | Node3 (cpu 24-31) |
---|---|---|---|---|
Bandwidth (avg) | | | | |
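If the iPerf runs are made with iPerf3 and its JSON output option (`-J`, listed in the feature summary above), the per-node average could be computed roughly as follows. This is a sketch under assumptions: it presumes TCP tests, whose JSON report carries the receiver-side rate under `end.sum_received.bits_per_second`, and the helper names `received_gbps` and `node_average` are hypothetical.

```python
import json
from statistics import mean


def received_gbps(iperf3_json: str) -> float:
    """Receiver-side throughput of one TCP run, in Gbit/s."""
    report = json.loads(iperf3_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def node_average(reports) -> float:
    """Average throughput over several runs bound to the same node."""
    return mean(received_gbps(r) for r in reports)
```

Averaging several runs per node, rather than taking a single run, matches the "avg" in the table header and dampens run-to-run noise.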
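Experiment Steps
CPU-Memory Distance Measurement
Use numactl to bind the CPU and the memory of the STREAM threads to fixed locations, then measure the resulting bandwidth. In the experiment we set STREAM's array size to 100M, which is larger than the machine's total L3 cache of 64M (16M*4), so that the cache cannot accelerate the accesses.
NIC-Memory Distance Measurement
CPU-NIC Distance Measurement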