Aha! Mysticial was right! Hardware prefetching was somehow optimizing my reads/writes.
If this were a caching optimization, then forcing a memory barrier would defeat it:
c = __sync_fetch_and_add(((char*)x) + j, 1);
But that made no difference. What did make a difference was multiplying my iterator index by the prime 1009, which defeats the prefetch optimization:
*(((char*)x) + ((j * 1009) % N)) += 1;
With that change, the NUMA asymmetry shows up clearly:
numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114
At least, I think that's what's going on.
Thanks, Mysticial!
For anyone just skimming this post for the general NUMA performance characteristics, here is the bottom line according to my tests:
Memory access to a non-local NUMA node has about 1.33x the latency of memory access to the local node.