NUMA Basics Experiment Report

Measurement Plan for the Basic NUMA Model Parameters

Objective

This experiment measures the concrete values and distribution of each parameter of the IO-NUMA model on a specific platform. With these data as a reference, we build a new NUMA model that takes IO access into account and fill in the corresponding matrices, thereby forming an empirical model obtained through measurement.

Principle

Each distance is measured with a different type of benchmark:

The CPU-memory distance is measured with Stream, a classic memory-bandwidth benchmark. The CPU-NIC distance can be measured with iPerf: iPerf lets us set the buffer size, and when the buffer size is small the memory chunk stays in the CPU cache, so Intel processors use DCA (Direct Cache Access) to write straight into the cache, giving us an access scenario that is independent of memory.

The NIC-memory distance can be measured with the classic netperf. When netperf measures network performance, the main process consists of the NIC buffer writing to memory via DMA operations, so the result is largely independent of which CPU runs the thread.

In the experiments we use numactl to bind the CPUs and memory that a benchmark runs on, which lets us control the experimental variables.
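
A minimal sketch of the binding pattern (the node numbers below are placeholders for this machine's actual topology):

    # Inspect the NUMA topology: nodes, their CPUs, memory sizes, and inter-node distances
    numactl --hardware
    # Run a benchmark on node 0's CPUs while forcing all of its memory
    # allocations onto node 1 (a remote node for those CPUs)
    numactl --cpunodebind=0 --membind=1 <benchmark> <args>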

Related Tools

iPerf

  • iPerf2 features currently supported by iPerf3:
    TCP and UDP tests
    Set port (-p)
    Setting TCP options: No delay, MSS, etc.
    Setting UDP bandwidth (-b)
    Setting socket buffer size (-w)
    Reporting intervals (-i)
    Setting the iPerf buffer (-l)
    Bind to specific interfaces (-B)
    IPv6 tests (-6)
    Number of bytes to transmit (-n)
    Length of test (-t)
    Parallel streams (-P)
    Setting DSCP/TOS bit vectors (-S)
    Change number output format (-f)

  • New Features in iPerf 3.0:
    Dynamic server (client/server parameter exchange): most server options from iPerf2 can now be dynamically set by the client
    Client/server results exchange
    An iPerf3 server accepts a single client at a time (iPerf2 accepts multiple clients simultaneously)
    iPerf API (libiperf): provides an easy way to use, customize and extend iPerf functionality
    -R: reverse test mode (server sends, client receives)
    -O, --omit N: omit the first N seconds (to ignore TCP slow start)
    -b, --bandwidth n[KM] for TCP (UDP only in iPerf2): set target bandwidth to n bits/sec (default 1 Mbit/sec for UDP, unlimited for TCP)
    -V, --verbose: more detailed output than before
    -J, --json: output in JSON format
    -Z, --zerocopy: use a 'zero copy' sendfile() method of sending data; this uses much less CPU
    -T, --title str: prefix every output line with this string
    -F, --file name: xmit/recv the specified file
    -A, --affinity n/n,m: set CPU affinity (cores are numbered from 0; Linux and FreeBSD only)
    -k, --blockcount #[KMG]: number of blocks (packets) to transmit (instead of -t or -n)
    -4, --version4: only use IPv4
    -6, --version6: only use IPv6
    -L, --flowlabel: set IPv6 flow label (Linux only)
    -C, --linux-congestion: set congestion control algorithm (Linux and FreeBSD only; -Z in iPerf2)
    -d, --debug: emit debugging output, primarily (perhaps exclusively) of use to developers
    -s, --server: iPerf2 can handle multiple client requests; iPerf3 will only allow one iperf connection at a time

  • Features in iPerf 3.1:
    -I, --pidfile file: write a file with the process ID; most useful when running as a daemon
    --cport: specify the client-side port
    --sctp: use SCTP rather than TCP (Linux, FreeBSD and Solaris)
    --udp-counters-64bit: support very long-running UDP tests, which could cause a counter to overflow
    --logfile file: send output to a log file

  • iPerf2 Features Not Supported by iPerf3:
    Bidirectional testing (-d / -r)
    Data transmitted from stdin (-I)
    TTL : time-to-live, for multicast (-T)
    Exclude C(connection) D(data) M(multicast) S(settings) V(server) reports (-x)
    Report as a Comma-Separated Values (-y)
    Compatibility mode allows for use with older versions of iPerf (-C)

Stream

Netperf

numactl

  • usage:
    numactl [ --interleave nodes ] [ --preferred node ] [ --membind nodes ] [ --cpunodebind nodes ] [ --physcpubind cpus ] [ --localalloc ] [--] command {arguments ...}
    numactl --show
    numactl --hardware
    numactl [ --huge ] [ --offset offset ] [ --shmmode shmmode ] [ --length length ] [ --strict ]
    [ --shmid id ] --shm shmkeyfile | --file tmpfsfile
    [ --touch ] [ --dump ] [ --dump-nodes ] memory policy

The command used in this experiment takes the form:

numactl --membind=<nodes> --cpubind=<nodes> <command>
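
To fill the matrices, each (CPU node, memory node) pair gets one run; an illustrative sweep for the assumed 4-node machine (<benchmark> is a placeholder):

for c in 0 1 2 3; do
  for m in 0 1 2 3; do
    # one run per matrix cell: CPUs from node $c, memory from node $m
    numactl --cpubind=$c --membind=$m <benchmark>
  done
done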

NUMA Model

Platform information:

C-M Matrix
This matrix records, for this NUMA architecture, the bandwidth with which each physical CPU accesses memory in each region.

N-M Matrix
This matrix records the average distance from the NIC to memory in each region. Because this path is CPU-independent, we average over the per-CPU measurements to guard against error.

C-N Vector
This vector records the NIC-to-CPU distance, characterized by the average over the CPUs within the same node.

Experiment Content

CPU-Memory Distance

C-M

C\M Node0 Node1 Node2 Node3
cpu0~7
cpu8~15
cpu16~23
cpu24~31

NIC-Memory Distance

N\M Node0 Node1 Node2 Node3
Distance(cpu 0~7)
Distance(cpu 8~15)
Distance(cpu 16~23)
Distance(cpu 24~31)
Avg

CPU-NIC Distance

C\N Node0(cpu 0~7) Node1(cpu 8~15) Node2(cpu 16~23) Node3(cpu 24~31)
Bandwidth(avg)

Procedure

  • CPU-memory distance measurement
    Use numactl to bind the CPUs and memory that the Stream threads run on to fixed locations, and measure the resulting bandwidth. In the experiment we set the Stream array size to 100M, larger than the machine's total L3 cache of 64M (16M * 4), so that the cache cannot accelerate the accesses. A possible invocation is sketched below.
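
    A sketch of the invocation, assuming stream.c from the recent official STREAM distribution, where the array size is set via the STREAM_ARRAY_SIZE macro (the exact value and the node numbers are illustrative):

      # Build STREAM with arrays well beyond the 64M total L3
      gcc -O2 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream
      # Measure node-0 CPUs accessing node-1 memory; repeat per (cpu, mem) pair
      numactl --cpunodebind=0 --membind=1 ./stream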

  • NIC-memory distance measurement (a possible invocation is sketched below)
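    A possible sketch with netperf, following the principle above (the server address and node numbers are placeholders):

      # On the remote machine: start the netperf daemon
      netserver
      # Locally: fix the memory node (--membind), vary the CPU node across runs
      # and average, since the DMA write path should be largely CPU-independent
      numactl --cpunodebind=0 --membind=2 netperf -H <server_ip> -t TCP_STREAM -l 30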

  • CPU-NIC distance measurement (a possible invocation is sketched below)
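    A possible sketch with iPerf3, using a small application buffer so the chunk stays cache-resident per the DCA reasoning above (the address, CPU number and buffer size are placeholders; iPerf3's own -A option could bind the CPU as well):

      # On the remote machine: start the iPerf3 server
      iperf3 -s
      # Locally: a small -l buffer keeps the data in cache; pin the client to
      # one CPU per run and record the achieved bandwidth
      numactl --physcpubind=0 iperf3 -c <server_ip> -l 8K -t 30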

Results