Inter Thread Latency
Message rates between threads are fundamentally determined by the
latency of memory exchange between CPU cores. The minimum unit of
transfer will be a cache line exchanged via shared caches or socket
interconnects. In a previous article I explained Memory Barriers and
why they are important to concurrent programming between threads.
These are the instructions that cause a CPU to make memory visible to
other cores in an ordered and timely manner.
Lately I’ve been asked a lot about how much faster the Disruptor would
be if C++ was used instead of Java. For sure C++ would give more
control for memory alignment and potential access to underlying CPU
instructions such as memory barriers and lock instructions. In this
article I’ll directly compare C++ and Java to measure the cost of
signalling a change between threads.
For the test we’ll use two counters each updated by their own
thread. A simple ping-pong algorithm will be used to signal from one to
the other and back again. The exchange will be repeated millions of
times to measure the average latency between cores. This measurement
will give us the latency of exchanging a cache line between cores in a
serial manner.
For Java we’ll use volatile counters, for which the JVM will kindly insert a lock instruction on each update, giving us an effective memory barrier.
public final class InterThreadLatency
    implements Runnable
{
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    public static volatile long s1;
    public static volatile long s2;

    public static void main(final String[] args)
    {
        Thread t = new Thread(new InterThreadLatency());
        t.setDaemon(true);
        t.start();

        long start = System.nanoTime();

        long value = s1;
        while (s1 < ITERATIONS)
        {
            while (s2 != value)
            {
                // busy spin until the other thread responds
            }
            value = ++s1; // the volatile store is the memory barrier
        }

        long duration = System.nanoTime() - start;

        System.out.println("duration = " + duration);
        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("op/sec = " +
            (ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
        System.out.println("s1 = " + s1 + ", s2 = " + s2);
    }

    public void run()
    {
        long value = s2;
        while (true)
        {
            while (value == s1)
            {
                // busy spin until the main thread pings
            }
            value = ++s2; // pong back with a volatile store
        }
    }
}
For C++ we’ll use the GNU Atomic Builtins (https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html), which give us a similar lock instruction insertion to that which the JVM uses.
#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin until the main thread pings
        }
        value = __sync_add_and_fetch(&s2, 1); // locked add is the barrier
    }
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin until the other thread responds
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %llu\n", duration);
    printf("ns per op = %llu\n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %llu\n",
        ((ITERATIONS * 2LL * 1000LL * 1000LL * 1000LL) / duration));
    printf("s1 = %llu, s2 = %llu\n", s1, s2);

    return 0;
}
Results
$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency
duration = 50790271150
ns per op = 50
op/sec = 19688810
s1 = 500000000, s2 = 500000000

$ g++ -O3 -lpthread -lrt -o itl itl.cpp
$ taskset -c 2,4 ./itl
duration = 45087955393
ns per op = 45
op/sec = 22178872
s1 = 500000000, s2 = 500000000
The C++ version is slightly faster on my Intel Sandy Bridge laptop.
So what does this tell us? Well, that the latency between 2 cores on a
2.2 GHz machine is ~45ns and that you can exchange 22m messages per
second in a serial fashion. On an Intel CPU this is fundamentally the
cost of the lock instruction enforcing total order and forcing the store
buffer and write combining buffers to
drain, followed by the resulting cache coherency traffic between the
cores. Note that each core has a 96 GB/s port onto the L3 cache ring bus, yet 22m * 64 bytes is only ~1.4 GB/s. This is because we have measured latency and not throughput. We could easily fit some nice fat messages between those memory barriers as part of the exchange, provided the data is written before the lock instruction executes.
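To make that last point concrete, here is a minimal sketch (the class and payload fields are hypothetical, not from the original benchmark) of how plain writes made before a volatile store get published by that same barrier:

// A minimal sketch: the payload fields are plain writes that ride for
// free on the volatile store to 'sequence'. Under the Java Memory Model,
// a consumer that reads 'sequence' and sees the new value is also
// guaranteed to see the payload written before it.
public final class PiggyBackedMessage
{
    private long payload1;               // plain fields, no barrier cost
    private long payload2;
    private volatile long sequence = -1; // the only barrier-bearing write

    public void publish(final long p1, final long p2, final long seq)
    {
        payload1 = p1;    // plain stores
        payload2 = p2;
        sequence = seq;   // single volatile store publishes the lot
    }

    public long consume(final long expectedSeq)
    {
        while (sequence < expectedSeq)
        {
            // busy spin until published
        }
        return payload1 + payload2; // visible after the volatile read
    }
}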
So what does this all mean for the Disruptor? Basically, the latency
of the Disruptor is about as low as we can get from Java. It would be
possible to get a ~10% latency improvement by moving to C++. I’d expect
a similar improvement in throughput for C++. The main win with C++ would be the control, and therefore the predictability, that comes with it if used correctly. The JVM gives us nice safety features like garbage collection in complex applications, but we pay a little for that with the extra instructions it inserts, which can be seen if you get Hotspot to dump the assembler instructions it is generating.
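For reference, and assuming a Hotspot build with the hsdis disassembler plugin available, those generated instructions can be dumped with the standard diagnostic flags:

$ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly InterThreadLatency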
How does the Disruptor achieve more than 25m messages per second, I hear you ask? Well, that is one of the neat parts of its design. The "waitFor" semantics on the SequenceBarrier enable a very efficient form of batching, which allows the BatchEventProcessor to process a series of events that occurred since it last checked in with the RingBuffer, all without incurring a memory barrier. For real-world applications this batching effect is really significant. For micro-benchmarks it only makes the results more random, especially when there is little work done other than accepting the message.
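As an illustration, here is a simplified sketch of that consume loop, modelled on the shape of the Disruptor’s BatchEventProcessor against the 3.x API (the real class also handles alerts, timeouts, and exceptions):

import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.Sequence;
import com.lmax.disruptor.SequenceBarrier;

// Simplified sketch of the batching consume loop: one waitFor call may
// report many available events, which are then handled without any
// further memory barriers until progress is published for the batch.
final class BatchingLoopSketch<T>
{
    void processEvents(final RingBuffer<T> ringBuffer,
                       final SequenceBarrier sequenceBarrier,
                       final EventHandler<T> eventHandler,
                       final Sequence sequence) throws Exception
    {
        long nextSequence = sequence.get() + 1L;
        while (true)
        {
            // One potentially waiting call, amortised over the batch.
            final long availableSequence = sequenceBarrier.waitFor(nextSequence);

            // Drain every event published since we last checked in.
            while (nextSequence <= availableSequence)
            {
                final T event = ringBuffer.get(nextSequence);
                eventHandler.onEvent(event, nextSequence,
                                     nextSequence == availableSequence);
                nextSequence++;
            }

            // A single ordered write publishes progress for the whole batch.
            sequence.set(availableSequence);
        }
    }
}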
Conclusion
So when processing events in series, the measurements tell us that the current generation of processors can do between 20 and 30 million exchanges per second at a latency of less than 50ns. The Disruptor design allows us to get greater throughput without explicit batching on the publisher side. In addition, the Disruptor has an explicit batching API on the publisher side that can give over 100 million messages per second.
(Article reproduced from the Concurrent Programming Network, ifeve.com.)