In high-performance computing it is often said that the cost of a cache-miss is the largest performance penalty for an algorithm. For many years the increase in speed of our processors has greatly outstripped latency gains to main-memory. Bandwidth to main-memory has greatly increased via wider, and multi-channel, buses however the latency has not significantly reduced. To hide this latency our processors employ evermore complex cache sub-systems that have many layers.
The 1994 paper "Hitting the memory wall: implications of the obvious" describes the problem and goes on to argue that caches do not ultimately help because of compulsory cache-misses. I aim to show that by using access patterns which display consideration for the cache hierarchy, this conclusion is not inevitable.
Let's start putting the problem in context with some examples. Our hardware tries to hide the main-memory latency via a number of techniques. Basically three major bets are taken on memory access patterns:
public class TestMemoryAccessPatterns { private static final int LONG_SIZE = 8; private static final int PAGE_SIZE = 2 * 1024 * 1024; private static final int ONE_GIG = 1024 * 1024 * 1024; private static final long TWO_GIG = 2L * ONE_GIG; private static final int ARRAY_SIZE = (int)(TWO_GIG / LONG_SIZE); private static final int WORDS_PER_PAGE = PAGE_SIZE / LONG_SIZE; private static final int ARRAY_MASK = ARRAY_SIZE - 1; private static final int PAGE_MASK = WORDS_PER_PAGE - 1; private static final int PRIME_INC = 514229; private static final long[] memory = new long[ARRAY_SIZE]; static { for (int i = 0; i < ARRAY_SIZE; i++) { memory[i] = 777; } } public enum StrideType { LINEAR_WALK { public int next(final int pageOffset, final int wordOffset, final int pos) { return (pos + 1) & ARRAY_MASK; } }, RANDOM_PAGE_WALK { public int next(final int pageOffset, final int wordOffset, final int pos) { return pageOffset + ((pos + PRIME_INC) & PAGE_MASK); } }, RANDOM_HEAP_WALK { public int next(final int pageOffset, final int wordOffset, final int pos) { return (pos + PRIME_INC) & ARRAY_MASK; } }; public abstract int next(int pageOffset, int wordOffset, int pos); } public static void main(final String[] args) { final StrideType strideType; switch (Integer.parseInt(args[0])) { case 1: strideType = StrideType.LINEAR_WALK; break; case 2: strideType = StrideType.RANDOM_PAGE_WALK; break; case 3: strideType = StrideType.RANDOM_HEAP_WALK; break; default: throw new IllegalArgumentException("Unknown StrideType"); } for (int i = 0; i < 5; i++) { perfTest(i, strideType); } } private static void perfTest(final int runNumber, final StrideType strideType) { final long start = System.nanoTime(); int pos = -1; long result = 0; for (int pageOffset = 0; pageOffset < ARRAY_SIZE; pageOffset += WORDS_PER_PAGE) { for (int wordOffset = pageOffset, limit = pageOffset + WORDS_PER_PAGE; wordOffset < limit; wordOffset++) { pos = strideType.next(pageOffset, wordOffset, pos); result += memory[pos]; } } final long duration = System.nanoTime() - start; final double nsOp = duration / (double)ARRAY_SIZE; if (208574349312L != result) { throw new IllegalStateException(); } System.out.format("%d - %.2fns %s\n", Integer.valueOf(runNumber), Double.valueOf(nsOp), strideType); } }
Intel U4100 @ 1.3GHz, 4GB RAM DDR2 800MHz, Windows 7 64-bit, Java 1.7.0_05 =========================================== 0 - 2.38ns LINEAR_WALK 1 - 2.41ns LINEAR_WALK 2 - 2.35ns LINEAR_WALK 3 - 2.36ns LINEAR_WALK 4 - 2.39ns LINEAR_WALK 0 - 12.45ns RANDOM_PAGE_WALK 1 - 12.27ns RANDOM_PAGE_WALK 2 - 12.17ns RANDOM_PAGE_WALK 3 - 12.22ns RANDOM_PAGE_WALK 4 - 12.18ns RANDOM_PAGE_WALK 0 - 152.86ns RANDOM_HEAP_WALK 1 - 151.80ns RANDOM_HEAP_WALK 2 - 151.72ns RANDOM_HEAP_WALK 3 - 151.91ns RANDOM_HEAP_WALK 4 - 151.36ns RANDOM_HEAP_WALK Intel i7-860 @ 2.8GHz, 8GB RAM DDR3 1333MHz, Windows 7 64-bit, Java 1.7.0_05 ============================================= 0 - 1.06ns LINEAR_WALK 1 - 1.05ns LINEAR_WALK 2 - 0.98ns LINEAR_WALK 3 - 1.00ns LINEAR_WALK 4 - 1.00ns LINEAR_WALK 0 - 3.80ns RANDOM_PAGE_WALK 1 - 3.85ns RANDOM_PAGE_WALK 2 - 3.79ns RANDOM_PAGE_WALK 3 - 3.65ns RANDOM_PAGE_WALK 4 - 3.64ns RANDOM_PAGE_WALK 0 - 30.04ns RANDOM_HEAP_WALK 1 - 29.05ns RANDOM_HEAP_WALK 2 - 29.14ns RANDOM_HEAP_WALK 3 - 28.88ns RANDOM_HEAP_WALK 4 - 29.57ns RANDOM_HEAP_WALK Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, Linux 3.4.6 kernel 64-bit, Java 1.7.0_05 ================================================= 0 - 0.91ns LINEAR_WALK 1 - 0.92ns LINEAR_WALK 2 - 0.88ns LINEAR_WALK 3 - 0.89ns LINEAR_WALK 4 - 0.89ns LINEAR_WALK 0 - 3.29ns RANDOM_PAGE_WALK 1 - 3.35ns RANDOM_PAGE_WALK 2 - 3.33ns RANDOM_PAGE_WALK 3 - 3.31ns RANDOM_PAGE_WALK 4 - 3.30ns RANDOM_PAGE_WALK 0 - 9.58ns RANDOM_HEAP_WALK 1 - 9.20ns RANDOM_HEAP_WALK 2 - 9.44ns RANDOM_HEAP_WALK 3 - 9.46ns RANDOM_HEAP_WALK 4 - 9.47ns RANDOM_HEAP_WALK
perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $ Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1': 1,496,626,053 L1-dcache-loads 274,255,164 L1-dcache-misses # 18.32% of all L1-dcache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2': 1,537,057,965 L1-dcache-loads 1,570,105,933 L1-dcache-misses # 102.15% of all L1-dcache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3': 4,321,888,497 L1-dcache-loads 1,780,223,433 L1-dcache-misses # 41.19% of all L1-dcache hits likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $ java -Xmx4g TestMemoryAccessPatterns 1 +-----------------------+-------------+ | Event | core 2 | +-----------------------+-------------+ | INSTR_RETIRED_ANY | 5.94918e+09 | | CPU_CLK_UNHALTED_CORE | 5.15969e+09 | | L2_TRANS_ALL_REQUESTS | 1.07252e+09 | | L2_RQSTS_MISS | 3.25413e+08 | +-----------------------+-------------+ +-----------------+-----------+ | Metric | core 2 | +-----------------+-----------+ | Runtime [s] | 2.15481 | | CPI | 0.867293 | | L2 request rate | 0.18028 | | L2 miss rate | 0.0546988 | | L2 miss ratio | 0.303409 | +-----------------+-----------+ +------------------------+-------------+ | Event | core 2 | +------------------------+-------------+ | L3_LAT_CACHE_REFERENCE | 1.26545e+08 | | L3_LAT_CACHE_MISS | 2.59059e+07 | +------------------------+-------------+ java -Xmx4g TestMemoryAccessPatterns 2 +-----------------------+-------------+ | Event | core 2 | +-----------------------+-------------+ | INSTR_RETIRED_ANY | 1.48772e+10 | | CPU_CLK_UNHALTED_CORE | 1.64712e+10 | | L2_TRANS_ALL_REQUESTS | 3.41061e+09 | | L2_RQSTS_MISS | 1.5547e+09 | +-----------------------+-------------+ +-----------------+----------+ | Metric | core 2 | +-----------------+----------+ | Runtime [s] | 6.87876 | | CPI | 1.10714 | | L2 request rate | 0.22925 | | L2 miss rate | 0.104502 | | L2 miss ratio | 0.455843 | +-----------------+----------+ +------------------------+-------------+ | Event | core 2 | +------------------------+-------------+ | L3_LAT_CACHE_REFERENCE | 1.52088e+09 | | L3_LAT_CACHE_MISS | 1.72918e+08 | +------------------------+-------------+ java -Xmx4g TestMemoryAccessPatterns 3 +-----------------------+-------------+ | Event | core 2 | +-----------------------+-------------+ | INSTR_RETIRED_ANY | 6.49533e+09 | | CPU_CLK_UNHALTED_CORE | 4.18416e+10 | | L2_TRANS_ALL_REQUESTS | 4.67488e+09 | | L2_RQSTS_MISS | 1.43442e+09 | +-----------------------+-------------+ +-----------------+----------+ | Metric | core 2 | +-----------------+----------+ | Runtime [s] | 17.474 | | CPI | 6.4418 | | L2 request rate | 0.71973 | | L2 miss rate | 0.220838 | | L2 miss ratio | 0.306835 | +-----------------+----------+ +------------------------+-------------+ | Event | core 2 | +------------------------+-------------+ | L3_LAT_CACHE_REFERENCE | 1.40079e+09 | | L3_LAT_CACHE_MISS | 1.34832e+09 | +------------------------+-------------+Note: The cache-miss rate of the combined L1D, L2 and L3 increases significantly as the pattern of access becomes more random.
perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $ Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1': 1,496,128,634 dTLB-loads 310,901 dTLB-misses # 0.02% of all dTLB cache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2': 1,551,585,263 dTLB-loads 340,230 dTLB-misses # 0.02% of all dTLB cache hits Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3': 4,031,344,537 dTLB-loads 1,345,807,418 dTLB-misses # 33.38% of all dTLB cache hitsNote: We only incur significant TLB misses when randomly walking the whole heap when huge pages are employed.
likwid-perfctr -C 2 -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $ java -Xmx4g TestMemoryAccessPatterns 1 +--------------------+-------------+ | Event | core 2 | +--------------------+-------------+ | LOAD_HIT_PRE_HW_PF | 1.31613e+09 | +--------------------+-------------+ java -Xmx4g TestMemoryAccessPatterns 2 +--------------------+--------+ | Event | core 2 | +--------------------+--------+ | LOAD_HIT_PRE_HW_PF | 368930 | +--------------------+--------+ java -Xmx4g TestMemoryAccessPatterns 3 +--------------------+--------+ | Event | core 2 | +--------------------+--------+ | LOAD_HIT_PRE_HW_PF | 324373 | +--------------------+--------+Note: We have a significant success rate for load hits with the pre-fetcher on the linear walk.
private static final long LARGE_PRIME_INC = 70368760954879L; RANDOM_HEAP_WALK { public int next(final int pageOffset, final int wordOffset, final int pos) { return (int)(pos + LARGE_PRIME_INC) & ARRAY_MASK; } };
Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz, Linux 3.4.6 kernel 64-bit, Java 1.7.0_05 ================================================= 0 - 29.06ns RANDOM_HEAP_WALK 1 - 29.47ns RANDOM_HEAP_WALK 2 - 29.48ns RANDOM_HEAP_WALK 3 - 29.43ns RANDOM_HEAP_WALK 4 - 29.42ns RANDOM_HEAP_WALK Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3': 9,444,928,682 dTLB-loads 4,371,982,327 dTLB-misses # 46.29% of all dTLB cache hits 9,390,675,639 L1-dcache-loads 1,471,647,016 L1-dcache-misses # 15.67% of all L1-dcache hits +-----------------------+-------------+ | Event | core 2 | +-----------------------+-------------+ | INSTR_RETIRED_ANY | 7.71171e+09 | | CPU_CLK_UNHALTED_CORE | 1.31717e+11 | | L2_TRANS_ALL_REQUESTS | 8.4912e+09 | | L2_RQSTS_MISS | 2.79635e+09 | +-----------------------+-------------+ +-----------------+----------+ | Metric | core 2 | +-----------------+----------+ | Runtime [s] | 55.0094 | | CPI | 17.0801 | | L2 request rate | 1.10108 | | L2 miss rate | 0.362611 | | L2 miss ratio | 0.329324 | +-----------------+----------+ +--------------------+-------------+ | Event | core 2 | +--------------------+-------------+ | LOAD_HIT_PRE_HW_PF | 3.59509e+06 | +--------------------+-------------+ +------------------------+-------------+ | Event | core 2 | +------------------------+-------------+ | L3_LAT_CACHE_REFERENCE | 1.30318e+09 | | L3_LAT_CACHE_MISS | 2.62346e+07 | +------------------------+-------------+
Ref: http://mechanical-sympathy.blogspot.com/2012/08/memory-access-patterns-are-important.html