NUMA Basics Experiment Report

Measurement Plan for the Basic NUMA Model Parameters

Objective

This experiment measures the concrete values and distribution of each parameter of the IO-NUMA model on a specific platform. With these data as a reference, we build a new NUMA model that takes IO access into account and fill in the corresponding matrices, thereby forming an empirical model obtained through measurement.

Principle

Each distance is measured with a different type of benchmark:

The CPU-memory distance is measured with Stream, a classic memory-bandwidth benchmark. The CPU-NIC distance can be measured with iPerf: iPerf lets us set the buffer size, and when the buffer size is small the memory chunk stays in the CPU cache, so Intel processors use DCA (Direct Cache Access) to write straight into the cache, giving us an access scenario that is independent of memory.

The NIC-memory distance can be measured with the classic netperf. When netperf measures network performance, the main process consists of the NIC buffer writing to memory via DMA operations, so the result is largely independent of which CPU runs the thread.

In the experiments we use numactl to bind the CPUs and memory that a benchmark runs on, which lets us control the experimental variables.
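
A minimal sketch of the binding pattern (the node numbers below are placeholders for this machine's actual topology):

    # Inspect the NUMA topology: nodes, their CPUs, memory sizes, and inter-node distances
    numactl --hardware
    # Run a benchmark on node 0's CPUs while forcing all of its memory
    # allocations onto node 1 (a remote node for those CPUs)
    numactl --cpunodebind=0 --membind=1 <benchmark> <args>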

Related Tools

iPerf

  • iPerf2 features currently supported by iPerf3:
    TCP and UDP tests
    Set port (-p)
    Setting TCP options: No delay, MSS, etc.
    Setting UDP bandwidth (-b)
    Setting socket buffer size (-w)
    Reporting intervals (-i)
    Setting the iPerf buffer (-l)
    Bind to specific interfaces (-B)
    IPv6 tests (-6)
    Number of bytes to transmit (-n)
    Length of test (-t)
    Parallel streams (-P)
    Setting DSCP/TOS bit vectors (-S)
    Change number output format (-f)

  • New Features in iPerf 3.0:
    Dynamic server (client/server parameter exchange): most server options from iPerf2 can now be dynamically set by the client
    Client/server results exchange
    An iPerf3 server accepts a single client at a time (iPerf2 accepts multiple clients simultaneously)
    iPerf API (libiperf): provides an easy way to use, customize and extend iPerf functionality
    -R: reverse test mode (server sends, client receives)
    -O, --omit N: omit the first N seconds (to ignore TCP slow start)
    -b, --bandwidth n[KM] for TCP (UDP only in iPerf2): set target bandwidth to n bits/sec (default 1 Mbit/sec for UDP, unlimited for TCP)
    -V, --verbose: more detailed output than before
    -J, --json: output in JSON format
    -Z, --zerocopy: use a 'zero copy' sendfile() method of sending data; this uses much less CPU
    -T, --title str: prefix every output line with this string
    -F, --file name: xmit/recv the specified file
    -A, --affinity n/n,m: set CPU affinity (cores are numbered from 0; Linux and FreeBSD only)
    -k, --blockcount #[KMG]: number of blocks (packets) to transmit (instead of -t or -n)
    -4, --version4: only use IPv4
    -6, --version6: only use IPv6
    -L, --flowlabel: set IPv6 flow label (Linux only)
    -C, --linux-congestion: set congestion control algorithm (Linux and FreeBSD only; -Z in iPerf2)
    -d, --debug: emit debugging output, primarily (perhaps exclusively) of use to developers
    -s, --server: iPerf2 can handle multiple client requests; iPerf3 will only allow one iperf connection at a time

  • Features in iPerf 3.1:
    -I, --pidfile file: write a file with the process ID; most useful when running as a daemon
    --cport: specify the client-side port
    --sctp: use SCTP rather than TCP (Linux, FreeBSD and Solaris)
    --udp-counters-64bit: support very long-running UDP tests, which could cause a counter to overflow
    --logfile file: send output to a log file

  • iPerf2 Features Not Supported by iPerf3:
    Bidirectional testing (-d / -r)
    Data transmitted from stdin (-I)
    TTL : time-to-live, for multicast (-T)
    Exclude C(connection) D(data) M(multicast) S(settings) V(server) reports (-x)
    Report as a Comma-Separated Values (-y)
    Compatibility mode allows for use with older versions of iPerf (-C)

Stream

Netperf

numactl

  • usage:
    numactl [ --interleave nodes ] [ --preferred node ] [ --membind nodes ] [ --cpunodebind nodes ] [ --physcpubind cpus ] [ --localalloc ] [--] command {arguments ...}
    numactl --show
    numactl --hardware
    numactl [ --huge ] [ --offset offset ] [ --shmmode shmmode ] [ --length length ] [ --strict ]
    [ --shmid id ] --shm shmkeyfile | --file tmpfsfile
    [ --touch ] [ --dump ] [ --dump-nodes ] memory policy

The command used in this experiment takes the form:

numactl --membind=<nodes> --cpubind=<nodes> <command>
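
To fill the matrices, each (CPU node, memory node) pair gets one run; an illustrative sweep for the assumed 4-node machine (<benchmark> is a placeholder):

for c in 0 1 2 3; do
  for m in 0 1 2 3; do
    # one run per matrix cell: CPUs from node $c, memory from node $m
    numactl --cpubind=$c --membind=$m <benchmark>
  done
done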

NUMA Model

Platform information:

C-M Matrix
This matrix records, for this NUMA architecture, the bandwidth with which each physical CPU accesses memory in each region.

N-M Matrix
This matrix records the average distance from the NIC to memory in each region. Because this path is CPU-independent, we average over the per-CPU measurements to guard against error.

C-N Vector
This vector records the NIC-to-CPU distance, characterized by the average over the CPUs within the same node.

Experiment Content

CPU-Memory Distance

C-M

C\M Node0 Node1 Node2 Node3
cpu0~7
cpu8~15
cpu16~23
cpu24~31

NIC-Memory Distance

N\M Node0 Node1 Node2 Node3
Distance(cpu 0~7)
Distance(cpu 8~15)
Distance(cpu 16~23)
Distance(cpu 24~31)
Avg

CPU-NIC Distance

C\N Node0(cpu 0~7) Node1(cpu 8~15) Node2(cpu 16~23) Node3(cpu 24~31)
Bandwidth(avg)

Procedure

  • CPU-memory distance measurement
    Use numactl to bind the CPUs and memory that the Stream threads run on to fixed locations, and measure the resulting bandwidth. In the experiment we set the Stream array size to 100M, larger than the machine's total L3 cache of 64M (16M * 4), so that the cache cannot accelerate the accesses. A possible invocation is sketched below.
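
    A sketch of the invocation, assuming stream.c from the recent official STREAM distribution, where the array size is set via the STREAM_ARRAY_SIZE macro (the exact value and the node numbers are illustrative):

      # Build STREAM with arrays well beyond the 64M total L3
      gcc -O2 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream
      # Measure node-0 CPUs accessing node-1 memory; repeat per (cpu, mem) pair
      numactl --cpunodebind=0 --membind=1 ./stream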

  • NIC-memory distance measurement (a possible invocation is sketched below)
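    A possible sketch with netperf, following the principle above (the server address and node numbers are placeholders):

      # On the remote machine: start the netperf daemon
      netserver
      # Locally: fix the memory node (--membind), vary the CPU node across runs
      # and average, since the DMA write path should be largely CPU-independent
      numactl --cpunodebind=0 --membind=2 netperf -H <server_ip> -t TCP_STREAM -l 30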

  • CPU-NIC distance measurement (a possible invocation is sketched below)
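    A possible sketch with iPerf3, using a small application buffer so the chunk stays cache-resident per the DCA reasoning above (the address, CPU number and buffer size are placeholders; iPerf3's own -A option could bind the CPU as well):

      # On the remote machine: start the iPerf3 server
      iperf3 -s
      # Locally: a small -l buffer keeps the data in cache; pin the client to
      # one CPU per run and record the achieved bandwidth
      numactl --physcpubind=0 iperf3 -c <server_ip> -l 8K -t 30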

Results