MLC: A Memory Latency and Bandwidth Measurement Tool
Why MLC
Two key factors influence application performance:
① the time the application spends fetching data from the processor caches and from the memory subsystem, which involves various latencies;
② bandwidth (b/w).
MLC (Intel Memory Latency Checker) measures exactly these two things.
What MLC measures
Node access speed
Under the NUMA (Non-Uniform Memory Access) architecture, memory devices and CPU cores belong to different nodes, and each node has its own integrated memory controller (IMC, Integrated Memory Controller). All processors still share a single address space, but because each node has its own local memory, the bus-bandwidth and memory-contention problems of a single shared memory bus are avoided.
(Aside: core = physical CPU, an independent physical execution unit; thread = logical CPU, a hardware thread.
A socket roughly corresponds to a node, i.e. a CPU slot on the motherboard. Within a node, cores reach memory through the IMC bus; between nodes, communication goes over QPI (Quick Path Interconnect).)
Same-city express delivery is obviously faster than international mail, so QPI (remote) latency is noticeably higher than IMC-bus (local) latency.
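The node-to-core mapping described above can be inspected directly on Linux. A minimal sketch, assuming standard Linux sysfs paths (`/sys/devices/system/node`); machines without NUMA typically expose only `node0`:

```python
# Sketch: inspect the NUMA topology Linux exposes under sysfs (assumes Linux;
# on a non-NUMA machine this typically reports a single node0).
from pathlib import Path

def parse_cpulist(cpulist: str) -> list[int]:
    """Expand a sysfs cpulist such as '0-3,8-11' into [0, 1, 2, 3, 8, 9, 10, 11]."""
    cpus = []
    for part in cpulist.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def numa_nodes() -> dict[str, list[int]]:
    """Map each NUMA node to the logical CPUs (threads) it owns."""
    nodes = {}
    for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        cpulist = (node / "cpulist").read_text().strip()
        nodes[node.name] = parse_cpulist(cpulist)
    return nodes

if __name__ == "__main__":
    print(numa_nodes())
```

`numactl --hardware` prints the same information in a friendlier form; the snippet is just the raw source of that data.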
Example:
Command to measure memory access latency:
./mlc --latency_matrix
Output:
                Numa node
Numa node            0       1
        0         82.2   129.6
        1        131.1    81.6
This is the idle-memory access-latency matrix within and between nodes, in ns: the diagonal entries are local accesses, the off-diagonal entries are remote accesses.
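A quick way to quantify the local-vs-remote gap is to pull the numbers out of the matrix and compare the diagonal with the off-diagonal entries. A minimal sketch, hard-coding the two data rows from the sample above (real parsing of `mlc` output should be more defensive):

```python
# Sketch: extract the latency matrix rows and compare local (diagonal)
# vs. remote (off-diagonal) access latency. The rows below are copied
# from the sample --latency_matrix output; this assumes a 2-node layout.

def parse_latency_matrix(rows: list[str]) -> list[list[float]]:
    """Each data row is '<node> <lat0> <lat1> ...'; drop the node label."""
    return [[float(x) for x in row.split()[1:]] for row in rows]

rows = ["0 82.2 129.6", "1 131.1 81.6"]
matrix = parse_latency_matrix(rows)

local = [matrix[i][i] for i in range(len(matrix))]
remote = [matrix[i][j] for i in range(len(matrix))
          for j in range(len(matrix)) if i != j]

ratio = (sum(remote) / len(remote)) / (sum(local) / len(local))
print(f"remote/local latency ratio: {ratio:.2f}")  # ~1.6x on this sample
```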
Bandwidth
Bandwidth reflects the transfer rate per unit of time: the wider the road, the less likely a traffic jam.
Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :  69143.9
3:1 Reads-Writes :  61908.4
2:1 Reads-Writes :  60040.5
1:1 Reads-Writes :  54517.6
Stream-triad like:  57473.4
r:w denotes the memory bandwidth under different read:write ratios.
In general, memory writes are slower than reads,
so as the read share of the mix falls, the measured bandwidth drops (the road gets narrower and traffic backs up).
Troubleshooting: if bandwidth drops sharply, there may be more write traffic than before, or a writer may have a problem and be running too slowly.
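The "writes drag bandwidth down" effect can be illustrated with a simple back-of-the-envelope model: if reads and writes each have their own peak bandwidth, the time for a mixed stream is the sum of the two, so the effective bandwidth is a weighted harmonic mean. This is only an illustrative model, not MLC's methodology, and the peak numbers below are made-up assumptions:

```python
# Illustrative model (not MLC's methodology): effective bandwidth of a
# read/write mix as a weighted harmonic mean of separate read and write
# peaks. READ_BW and WRITE_BW are hypothetical numbers, MB/s.

def mixed_bandwidth(read_bw: float, write_bw: float, read_frac: float) -> float:
    """Effective bandwidth for a stream that is `read_frac` reads by bytes."""
    write_frac = 1.0 - read_frac
    return 1.0 / (read_frac / read_bw + write_frac / write_bw)

READ_BW, WRITE_BW = 69000.0, 40000.0   # hypothetical peaks

for reads, writes in [(1, 0), (3, 1), (2, 1), (1, 1)]:
    frac = reads / (reads + writes)
    print(f"{reads}:{writes} -> {mixed_bandwidth(READ_BW, WRITE_BW, frac):,.0f} MB/s")
```

The model reproduces the qualitative trend in the table above: each step from ALL Reads toward 1:1 lowers the effective bandwidth.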
Example
Command to measure memory access bandwidth (it can also be used on its own to check whether cross-node memory access is healthy):
./mlc --bandwidth_matrix
Output:
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1
        0        35216.6 32537.9
        1        31875.1 35048.5
Troubleshooting: if the two off-diagonal values differ greatly, the bandwidth between the two nodes is noticeably asymmetric.
Remedy: when such an imbalance appears, check the DIMM population (which slots the modules sit in), whether any DIMM is faulty, and NUMA balancing.
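The off-diagonal comparison is easy to automate when sweeping a fleet of machines. A minimal sketch for the two-node case; the 10% tolerance is an arbitrary assumption to tune for your hardware:

```python
# Sketch: flag a suspicious node-to-node bandwidth asymmetry in a 2-node
# --bandwidth_matrix result. The matrix values are from the sample output
# above; the 10% tolerance is an assumed threshold, not an MLC default.

def check_cross_node_bandwidth(matrix: list[list[float]],
                               tolerance: float = 0.10) -> bool:
    """True if the two cross-node bandwidths (0->1 and 1->0) are within
    `tolerance` relative difference of each other."""
    a, b = matrix[0][1], matrix[1][0]
    return abs(a - b) / max(a, b) <= tolerance

bw = [[35216.6, 32537.9],
      [31875.1, 35048.5]]
print("cross-node bandwidth balanced:", check_cross_node_bandwidth(bw))
```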
Relationship between memory access bandwidth and memory latency (read-only traffic)
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  523.74    69057.4
 00002  589.55    68668.7
 00008  686.99    68571.4
 00015  549.87    68873.6
 00050  575.48    68673.0
 00100  524.74    68877.5
 00200  197.61    64225.8
 00300  131.60    47141.0
 00400  110.39    36803.0
 00500  117.32    30135.2
 00700  100.90    22179.1
 01000  100.93    15762.8
 01300   91.74    12351.6
 01700   98.61     9475.2
 02500   86.66     6927.8
 03500   88.13     5132.6
 05000   87.68     3818.6
 09000   85.36     2473.5
 20000   84.83     1538.7
This shows how memory latency responds as the load changes (a smaller inject delay means more aggressively injected traffic), and whether response time becomes unacceptable once bandwidth approaches saturation.
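One practical way to read such a curve is to ask: what is the highest bandwidth the system delivers while latency stays under a given budget? A minimal sketch over an abridged copy of the table above; the 150 ns budget is an arbitrary assumption to replace with your own SLO:

```python
# Sketch: find the highest bandwidth point on the loaded-latency curve that
# still meets a latency budget. Samples are abridged from the table above;
# the 150 ns budget is an assumed example value.

# (inject_delay, latency_ns, bandwidth_mb_s)
samples = [
    (100, 524.74, 68877.5),
    (200, 197.61, 64225.8),
    (300, 131.60, 47141.0),
    (400, 110.39, 36803.0),
    (1000, 100.93, 15762.8),
    (20000, 84.83, 1538.7),
]

def max_bw_under_budget(samples, latency_budget_ns: float) -> float:
    """Highest measured bandwidth whose latency is within the budget."""
    ok = [bw for _, lat, bw in samples if lat <= latency_budget_ns]
    return max(ok) if ok else 0.0

print(max_bw_under_budget(samples, 150.0))  # -> 47141.0
```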
Measuring CPU-cache-to-CPU-cache transfer latency (HIT: the line is clean in the other core's cache; HITM: the line has been modified there)
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency   38.6
Local Socket L2->L2 HITM latency   43.6
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                Reader Socket
Writer Socket        0       1
        0            -     133.4
        1          133.7     -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                Reader Socket
Writer Socket        0       1
        0            -     133.5
        1          133.7     -
Peak bandwidth
Command
mlc --peak_bandwidth
Output
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :  50035.2
3:1 Reads-Writes :  48119.3
2:1 Reads-Writes :  47434.3
1:1 Reads-Writes :  48325.5
Stream-triad like:  44029.0
Idle memory latency
Command
mlc --idle_latency
Output
Using buffer size of 200.000MB
Each iteration took 260.5 core clocks ( 113.3 ns)
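MLC reports the idle latency both in core clocks and in nanoseconds, so the two numbers imply the core frequency it observed during the run. A quick sanity check:

```python
# Sketch: the latency in core clocks divided by the latency in ns gives
# the core frequency MLC observed: freq_GHz = clocks / ns.

clocks, ns = 260.5, 113.3        # from the --idle_latency output above
freq_ghz = clocks / ns
print(f"implied core frequency: {freq_ghz:.2f} GHz")  # ~2.30 GHz
```

If the implied frequency looks wrong (e.g. far below the nominal clock), the machine may have been throttled or running in a power-saving state during the measurement.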
Loaded memory latency
Command
mlc --loaded_latency
Output
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  217.32    49703.4
 00002  258.98    49482.4
 00008  217.48    49908.1
 00015  220.12    49973.7
 00050  206.33    49185.7
 00100  174.02    43811.8
 00200  141.63    27651.1
 00300  130.65    19614.6
 00400  126.05    15217.0
 00500  122.70    12506.0
 00700  121.46     9253.0
 01000  120.55     6690.6
 01300  118.75     5314.9
 01700  120.18     4148.7
 02500  119.53     3055.7
 03500  119.60     2349.4
 05000  116.60     1816.9
 09000  116.17     1257.8
 20000  116.87      867.6
Other options (to be continued)
- Measure access latency between specified nodes
- Measure CPU-cache access latency
- Measure access bandwidth within a specified subset of cores/sockets
- Measure bandwidth under different read:write ratios
- Measure with a random access pattern instead of the default sequential one
- Specify the stride used during the test