In high-performance computing it is often said that the cost of a
cache-miss is the largest performance penalty for an algorithm. For
many years the increase in speed of our processors has greatly
outstripped the reduction in main-memory latency. Bandwidth to
main-memory has greatly increased via wider, multi-channel buses;
however, the latency has not significantly reduced. To hide this
latency, our processors employ ever more complex cache sub-systems
with many layers.
To illustrate these three bets in action let’s write some code and measure the results.
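The benchmark used for the measurements below is not reproduced here, so the following is a minimal sketch of that kind of harness, consistent with the invocations and the RANDOM_HEAP_WALK snippet shown later in this article. The array size, notional page size, PRIME_INC value, run count, and printed checksum are assumptions for illustration; pattern 1 is a linear walk, pattern 2 a random walk within each page, and pattern 3 a random walk across the whole heap.

import java.util.Arrays;

public class TestMemoryAccessPatterns
{
    // Sizes are assumptions for illustration: a 2GB long[] heap walked one word
    // per iteration, with a notional 2MB "page" for the paged walk.
    private static final int PAGE_SIZE = 2 * 1024 * 1024;
    private static final int ARRAY_SIZE = (int)((2L * 1024 * 1024 * 1024) / 8);
    private static final int WORDS_PER_PAGE = PAGE_SIZE / 8;
    private static final int ARRAY_MASK = ARRAY_SIZE - 1;
    private static final int PAGE_MASK = WORDS_PER_PAGE - 1;
    private static final int PRIME_INC = 514229; // an arbitrary large prime stride

    private static final long[] memory = new long[ARRAY_SIZE];

    enum StrideType
    {
        LINEAR_WALK          // 1: all three bets pay off
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + 1) & ARRAY_MASK;
            }
        },
        RANDOM_PAGE_WALK     // 2: random within a page, pages visited in order
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return pageOffset + ((pos + PRIME_INC) & PAGE_MASK);
            }
        },
        RANDOM_HEAP_WALK     // 3: random across the whole heap
        {
            public int next(final int pageOffset, final int wordOffset, final int pos)
            {
                return (pos + PRIME_INC) & ARRAY_MASK;
            }
        };

        public abstract int next(int pageOffset, int wordOffset, int pos);
    }

    public static void main(final String[] args)
    {
        final StrideType strideType = StrideType.values()[Integer.parseInt(args[0]) - 1];
        Arrays.fill(memory, 777L);

        for (int run = 0; run < 5; run++)
        {
            final long start = System.nanoTime();
            long result = 0;
            int pos = -1;

            for (int pageOffset = 0; pageOffset < ARRAY_SIZE; pageOffset += WORDS_PER_PAGE)
            {
                for (int i = 0; i < WORDS_PER_PAGE; i++)
                {
                    pos = strideType.next(pageOffset, pageOffset + i, pos);
                    result += memory[pos];
                }
            }

            final double nsPerOp = (System.nanoTime() - start) / (double)ARRAY_SIZE;
            System.out.format("%d - %.2fns %s (checksum %d)%n", run, nsPerOp, strideType, result);
        }
    }
}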
The random heap walk produced an interesting result. This is our
worst-case scenario, and given the hardware specifications of these
systems, we could be looking at 150ns, 65ns, and 75ns for the above
tests respectively, based on memory controller and memory module
latencies. For the Nehalem (i7-860) I can further subvert the cache
sub-system by using a 4GB array, resulting in ~55ns on average per
iteration. The i7-2760QM has larger load buffers and TLB caches, and
Linux is running with transparent huge pages, all of which work to
further hide the latency. By playing with different prime numbers for
the stride, results can vary wildly depending on processor type, e.g.
try PRIME_INC = 39916801 for Nehalem. I'd like to test this on a much
larger heap with Sandy Bridge. The main takeaway is that the more
predictable the pattern of access to memory, the better the cache
sub-systems are at hiding main-memory latency. Let's look at these
cache sub-systems in a little detail to try and understand the observed
results.
Hardware Components
We have many layers of cache plus the pre-fetchers to consider for
how latency gets hidden. In this section I'll try to cover the major
components used to hide latency that our hardware and systems-software
friends have put in place. We will investigate these latency-hiding
components and use the Linux perf and Lightweight Performance Counters
utilities to retrieve the performance counters from our CPUs, which
tell us how effective these components are when we execute our programs.
Performance counters are CPU-specific, and the ones I've used here are
specific to Sandy Bridge.
Data Caches
Processors typically have 2 or 3 layers of data cache. Each layer as we
move out is progressively larger with increasing latency. The latest
Intel processors have 3 layers (L1D, L2, and L3); with sizes 32KB,
256KB, and 4-30MB; and ~1ns, ~4ns, and ~15ns latency respectively for a
3.0GHz CPU.
Data caches are effectively hardware hash tables with a fixed number
of slots for each hash value. These slots are known as “ways”. An
8-way associative cache will have 8 slots to hold values for addresses
that hash to the same cache location. Within these slots the data
caches do not store words, they store cache-lines of multiple words.
For an Intel processor these cache-lines are typically 64-bytes, that
is 8 words on a 64-bit machine. This plays to the spatial bet that
adjacent memory is likely to be required soon, which is typically the
case if we think of arrays or fields of an object.
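As a small illustration of the spatial bet (this example is not from the original tests; the matrix size is an arbitrary assumption), traversing a Java 2D array row-by-row walks each row array sequentially and so uses whole cache-lines, whereas traversing it column-by-column touches a different row array, and therefore a different cache-line, on almost every access:

public class MatrixTraversal
{
    private static final int SIZE = 2048; // assumed size, big enough to exceed L1/L2
    private static final int[][] grid = new int[SIZE][SIZE];

    // Row-major: consecutive accesses walk each row array sequentially, so every
    // 64-byte cache-line fetched delivers 16 useful ints.
    static long sumRowMajor()
    {
        long sum = 0;
        for (int row = 0; row < SIZE; row++)
        {
            for (int col = 0; col < SIZE; col++)
            {
                sum += grid[row][col];
            }
        }
        return sum;
    }

    // Column-major: each access lands in a different row array, touching a new
    // cache-line on every iteration and defeating the spatial bet.
    static long sumColumnMajor()
    {
        long sum = 0;
        for (int col = 0; col < SIZE; col++)
        {
            for (int row = 0; row < SIZE; row++)
            {
                sum += grid[row][col];
            }
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        for (int run = 0; run < 5; run++)
        {
            long start = System.nanoTime();
            final long a = sumRowMajor();
            final long rowMs = (System.nanoTime() - start) / 1_000_000;

            start = System.nanoTime();
            final long b = sumColumnMajor();
            final long colMs = (System.nanoTime() - start) / 1_000_000;

            System.out.format("run %d: row-major %dms, column-major %dms (checksums %d/%d)%n",
                run, rowMs, colMs, a, b);
        }
    }
}

On most hardware the column-major version can be expected to run several times slower, purely because of how it interacts with cache-lines and the pre-fetchers.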
Data caches are typically evicted in an LRU manner. Caches work by
using a write-back algorithm where stores need only be propagated to
main-memory when a modified cache-line is evicted. This gives rise to
the interesting phenomenon that a load can cause a write-back to the
outer cache layers and eventually to main-memory.
perf stat -e L1-dcache-loads,L1-dcache-load-misses java -Xmx4g TestMemoryAccessPatterns $
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
1,496,626,053 L1-dcache-loads
274,255,164 L1-dcache-misses
# 18.32% of all L1-dcache hits
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
1,537,057,965 L1-dcache-loads
1,570,105,933 L1-dcache-misses
# 102.15% of all L1-dcache hits
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
4,321,888,497 L1-dcache-loads
1,780,223,433 L1-dcache-misses
# 41.19% of all L1-dcache hits
likwid-perfctr -C 2 -g L2CACHE java -Xmx4g TestMemoryAccessPatterns $
java -Xmx4g TestMemoryAccessPatterns 1
+-----------------------+-------------+
| Event | core 2 |
+-----------------------+-------------+
| INSTR_RETIRED_ANY | 5.94918e+09 |
| CPU_CLK_UNHALTED_CORE | 5.15969e+09 |
| L2_TRANS_ALL_REQUESTS | 1.07252e+09 |
| L2_RQSTS_MISS | 3.25413e+08 |
+-----------------------+-------------+
+-----------------+-----------+
| Metric | core 2 |
+-----------------+-----------+
| Runtime [s] | 2.15481 |
| CPI | 0.867293 |
| L2 request rate | 0.18028 |
| L2 miss rate | 0.0546988 |
| L2 miss ratio | 0.303409 |
+-----------------+-----------+
+------------------------+-------------+
| Event | core 2 |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.26545e+08 |
| L3_LAT_CACHE_MISS | 2.59059e+07 |
+------------------------+-------------+
java -Xmx4g TestMemoryAccessPatterns 2
+-----------------------+-------------+
| Event | core 2 |
+-----------------------+-------------+
| INSTR_RETIRED_ANY | 1.48772e+10 |
| CPU_CLK_UNHALTED_CORE | 1.64712e+10 |
| L2_TRANS_ALL_REQUESTS | 3.41061e+09 |
| L2_RQSTS_MISS | 1.5547e+09 |
+-----------------------+-------------+
+-----------------+----------+
| Metric | core 2 |
+-----------------+----------+
| Runtime [s] | 6.87876 |
| CPI | 1.10714 |
| L2 request rate | 0.22925 |
| L2 miss rate | 0.104502 |
| L2 miss ratio | 0.455843 |
+-----------------+----------+
+------------------------+-------------+
| Event | core 2 |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.52088e+09 |
| L3_LAT_CACHE_MISS | 1.72918e+08 |
+------------------------+-------------+
java -Xmx4g TestMemoryAccessPatterns 3
+-----------------------+-------------+
| Event | core 2 |
+-----------------------+-------------+
| INSTR_RETIRED_ANY | 6.49533e+09 |
| CPU_CLK_UNHALTED_CORE | 4.18416e+10 |
| L2_TRANS_ALL_REQUESTS | 4.67488e+09 |
| L2_RQSTS_MISS | 1.43442e+09 |
+-----------------------+-------------+
+-----------------+----------+
| Metric | core 2 |
+-----------------+----------+
| Runtime [s] | 17.474 |
| CPI | 6.4418 |
| L2 request rate | 0.71973 |
| L2 miss rate | 0.220838 |
| L2 miss ratio | 0.306835 |
+-----------------+----------+
+------------------------+-------------+
| Event | core 2 |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.40079e+09 |
| L3_LAT_CACHE_MISS | 1.34832e+09 |
+------------------------+-------------+
Note: The cache-miss rate of the combined L1D, L2 and L3 increases significantly as the pattern of access becomes more random.
Translation Lookaside Buffers (TLBs)
Our programs deal with virtual memory addresses that need to be
translated to physical memory addresses. Virtual memory systems do this
by mapping pages. We need to know the offset of a given page, and its
size, for any memory operation. Typically, page sizes are 4KB and are
gradually moving to 2MB and greater. Linux introduced Transparent Huge Pages in the 2.6.38 kernel, giving us 2MB pages. The translation of virtual memory pages to physical pages is maintained by the page table.
This translation can require multiple accesses to the page table, which
is a huge performance penalty. To accelerate this lookup, processors
have a small hardware cache at each cache level called the TLB cache. A
miss on the TLB cache can be hugely expensive because the page table
may not be in a nearby data cache. By moving to larger pages, a TLB
cache can cover a larger address range for the same number of entries.
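A rough, back-of-the-envelope view of TLB reach (the 64-entry dTLB below is an assumed figure purely for illustration; real TLB sizes and levels vary by micro-architecture):

public class TlbReach
{
    public static void main(final String[] args)
    {
        // A 64-entry data TLB is an assumed figure purely for illustration;
        // real TLB sizes vary by micro-architecture.
        final long tlbEntries = 64;
        final long smallPage = 4L * 1024;         // 4KB pages
        final long hugePage = 2L * 1024 * 1024;   // 2MB transparent huge pages

        System.out.format("4KB pages: %d entries cover %dKB of address space%n",
            tlbEntries, (tlbEntries * smallPage) / 1024);
        System.out.format("2MB pages: %d entries cover %dMB of address space%n",
            tlbEntries, (tlbEntries * hugePage) / (1024 * 1024));
        // Prints 256KB vs. 128MB: the same TLB covers 512 times the address range.
    }
}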
perf stat -e dTLB-loads,dTLB-load-misses java -Xmx4g TestMemoryAccessPatterns $
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 1':
1,496,128,634 dTLB-loads
310,901 dTLB-misses
# 0.02% of all dTLB cache hits
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 2':
1,551,585,263 dTLB-loads
340,230 dTLB-misses
# 0.02% of all dTLB cache hits
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
4,031,344,537 dTLB-loads
1,345,807,418 dTLB-misses
# 33.38% of all dTLB cache hits
Note: With huge pages employed, we only incur significant TLB misses when randomly walking the whole heap.
Hardware Pre-Fetchers
Hardware will try to predict the next memory access our programs will
make and speculatively load that memory into fill buffers. At its
simplest level this is done by pre-loading adjacent cache-lines for the
spatial bet, or by recognising regular stride-based access patterns,
typically less than 2KB in stride length. In the tests below we are
measuring the number of loads that hit a fill buffer from a hardware
pre-fetch.
likwid-perfctr -C 2 -g LOAD_HIT_PRE_HW_PF:PMC0 java -Xmx4g TestMemoryAccessPatterns $
java -Xmx4g TestMemoryAccessPatterns 1
+--------------------+-------------+
| Event | core 2 |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 1.31613e+09 |
+--------------------+-------------+
java -Xmx4g TestMemoryAccessPatterns 2
+--------------------+--------+
| Event | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 368930 |
+--------------------+--------+
java -Xmx4g TestMemoryAccessPatterns 3
+--------------------+--------+
| Event | core 2 |
+--------------------+--------+
| LOAD_HIT_PRE_HW_PF | 324373 |
+--------------------+--------+
Note: We have a significant success rate for load hits with the pre-fetcher on the linear walk.
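To see the stride limit in isolation, a sketch along the following lines can be used (not part of the original tests; the array size, touch count, and the choice of 64B/1KB/4KB strides are assumptions, with 4KB chosen to exceed the ~2KB stride mentioned above):

public class StrideTest
{
    // Array size and touch count are assumptions chosen so the working set is far
    // larger than the caches; only the stride length changes between runs.
    private static final int ARRAY_SIZE = 64 * 1024 * 1024; // 64M ints = 256MB
    private static final int[] data = new int[ARRAY_SIZE];

    static long walk(final int strideInts, final int touches)
    {
        long sum = 0;
        int index = 0;
        for (int i = 0; i < touches; i++)
        {
            sum += data[index];
            index += strideInts;
            if (index >= ARRAY_SIZE)
            {
                index -= ARRAY_SIZE;
            }
        }
        return sum;
    }

    public static void main(final String[] args)
    {
        final int touches = 16 * 1024 * 1024;
        // 64B is one cache-line, 1KB is within the ~2KB stride the pre-fetchers
        // typically recognise, 4KB is beyond it.
        for (final int strideBytes : new int[] {64, 1024, 4096})
        {
            final long start = System.nanoTime();
            final long sum = walk(strideBytes / 4, touches);
            final double nsPerTouch = (System.nanoTime() - start) / (double)touches;
            System.out.format("stride %4dB: %.2fns per access (checksum %d)%n",
                strideBytes, nsPerTouch, sum);
        }
    }
}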
Memory Controllers and Row Buffers
Beyond our last-level cache (LLC) sit the memory controllers that
manage access to the SDRAM banks. Memory is organised into rows and
columns. To access an address, first the row address must be selected
(RAS), then the column address is selected (CAS) within that row to get
the word. The row is typically a page in size and is loaded into a row
buffer. Even at this stage the hardware is still helping to hide the
latency. A queue of memory access requests is maintained and re-ordered
so that multiple words can be fetched from the same row if possible.
Non-Uniform Memory Access (NUMA)
Systems now have memory controllers on the CPU socket. This move to
on-socket memory controllers gave a latency reduction of ~50ns over the
previous designs using a front side bus (FSB) and external Northbridge memory controllers. Systems with multiple sockets employ memory interconnects, such as QPI from
Intel, which are used when one CPU wants to access memory managed by
another CPU socket. The presence of these interconnects gives rise to
the non-uniform nature of server memory access. In a 2-socket system
memory may be local or 1 hop away. On an 8-socket system memory can be
up to 3 hops away, where each hop adds 20ns latency in each direction.
What does this mean for algorithms?
The difference between an L1D cache-hit and a full miss resulting in
main-memory access is 2 orders of magnitude, i.e. <1ns vs.
65-100ns. If algorithms randomly walk around our ever-increasing
address spaces, then we are less likely to benefit from the hardware
support that hides this latency.
Is there anything we can do about this when designing algorithms and
data-structures? Yes, there is a lot we can do. If we perform chunks of
work on data that is co-located, and we stride around memory in a
predictable fashion, then our algorithms can be many times faster. For
example, rather than using bucket-and-chain hash tables,
as in the JDK, we can employ hash tables using open-addressing with
linear-probing, as sketched below. Rather than using linked-lists or trees with single
items in each node, we can store an array of many items in each node.
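As an illustration of the open-addressing, linear-probing approach (a minimal sketch, not the JDK's or any particular library's implementation: no resizing or removal, and zero is reserved as the empty-key marker):

import java.util.Arrays;

// A minimal open-addressing, linear-probing map from int keys to int values.
public final class IntIntLinearProbeMap
{
    private static final int EMPTY = 0;  // key 0 is reserved as the empty marker

    private final int mask;
    private final int[] keys;    // keys and values in flat arrays keep probes
    private final int[] values;  // walking adjacent cache-lines

    public IntIntLinearProbeMap(final int capacityPowerOfTwo)
    {
        mask = capacityPowerOfTwo - 1;
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        Arrays.fill(keys, EMPTY);
    }

    public void put(final int key, final int value)
    {
        int index = hash(key);
        while (keys[index] != EMPTY && keys[index] != key)
        {
            index = (index + 1) & mask;   // linear probe: try the next adjacent slot
        }
        keys[index] = key;
        values[index] = value;
    }

    public int get(final int key, final int missingValue)
    {
        int index = hash(key);
        while (keys[index] != EMPTY)
        {
            if (keys[index] == key)
            {
                return values[index];
            }
            index = (index + 1) & mask;
        }
        return missingValue;
    }

    private int hash(final int key)
    {
        return (key * 0x9E3779B9) & mask;  // simple multiplicative hash
    }
}

Because collisions are resolved by probing adjacent slots in a flat array, a miss on the initial slot usually resolves within the same cache-line, in contrast to chasing a chain of separately allocated nodes.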
Research is advancing on algorithmic approaches that work in harmony with cache sub-systems. One area I find fascinating is Cache Oblivious Algorithms.
The name is a bit misleading but there are some great concepts here
for how to improve software performance and better execute in parallel.
This article is a great illustration of the performance benefits that can be gained.
Conclusion
To achieve great performance it is important to have sympathy for the
cache sub-systems. We have seen in this article what can be achieved
by accessing memory in patterns which work with, rather than against,
these caches. When designing algorithms and data structures, it is now
vitally important to consider cache-misses, probably even more so than
counting steps in the algorithm. This is not what we were taught in
algorithm theory when studying computer science. The last decade has
seen some fundamental changes in technology. For me the two most
significant are the rise of multi-core, and now big-memory systems with
64-bit address spaces.
One thing is certain: if we want software to execute faster and scale
better, we need to make better use of the many cores in our CPUs, and
pay attention to memory access patterns.
Update: 06-August-2012
Trying to design a random walk algorithm for all processors and memory
sizes is tricky. If I use the algorithm below, then my Sandy Bridge
processor is slower but the Nehalem is faster. The point is that
performance will be very unpredictable when you walk around memory in a
random fashion. I've also included the L3 cache counters for more
detail in all the tests.
private static final long LARGE_PRIME_INC = 70368760954879L;

RANDOM_HEAP_WALK
{
    public int next(final int pageOffset, final int wordOffset, final int pos)
    {
        return (int)(pos + LARGE_PRIME_INC) & ARRAY_MASK;
    }
};
Intel i7-2760QM @ 2.40GHz, 8GB RAM DDR3 1600MHz,
Linux 3.4.6 kernel 64-bit, Java 1.7.0_05
=================================================
0 - 29.06ns RANDOM_HEAP_WALK
1 - 29.47ns RANDOM_HEAP_WALK
2 - 29.48ns RANDOM_HEAP_WALK
3 - 29.43ns RANDOM_HEAP_WALK
4 - 29.42ns RANDOM_HEAP_WALK
Performance counter stats for 'java -Xmx4g TestMemoryAccessPatterns 3':
9,444,928,682 dTLB-loads
4,371,982,327 dTLB-misses
# 46.29% of all dTLB cache hits
9,390,675,639 L1-dcache-loads
1,471,647,016 L1-dcache-misses
# 15.67% of all L1-dcache hits
+-----------------------+-------------+
| Event | core 2 |
+-----------------------+-------------+
| INSTR_RETIRED_ANY | 7.71171e+09 |
| CPU_CLK_UNHALTED_CORE | 1.31717e+11 |
| L2_TRANS_ALL_REQUESTS | 8.4912e+09 |
| L2_RQSTS_MISS | 2.79635e+09 |
+-----------------------+-------------+
+-----------------+----------+
| Metric | core 2 |
+-----------------+----------+
| Runtime [s] | 55.0094 |
| CPI | 17.0801 |
| L2 request rate | 1.10108 |
| L2 miss rate | 0.362611 |
| L2 miss ratio | 0.329324 |
+-----------------+----------+
+--------------------+-------------+
| Event | core 2 |
+--------------------+-------------+
| LOAD_HIT_PRE_HW_PF | 3.59509e+06 |
+--------------------+-------------+
+------------------------+-------------+
| Event | core 2 |
+------------------------+-------------+
| L3_LAT_CACHE_REFERENCE | 1.30318e+09 |
| L3_LAT_CACHE_MISS | 2.62346e+07 |
+------------------------+-------------+
Article reposted from the Concurrent Programming Network, ifeve.com