\newpage
\section{Destructive Compiler Optimizations}
In (say) the early 1990s, compilers did fewer optimizations,
in part because there were fewer compiler writers and in part due to the
relatively small memories of that era. Nevertheless, problems did arise, as
shown in Listing 4.14, which the compiler is within its rights to transform
into Listing 4.15. As you can see, the temporary on line 1 of Listing 4.14
has been optimized away, so that global\_ptr will be loaded up to three
times.
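Listings 4.14 and 4.15 are not reproduced here, but the following sketch illustrates the kind of transformation being described; only global\_ptr is taken from the discussion above, while high\_address, do\_low(), and the function names are hypothetical placeholders.
\begin{verbatim}
/* Sketch with hypothetical names, not the actual listings. */
extern int *global_ptr;
extern int *high_address;
void do_low(int *p);

void as_written(void)
{
        int *ptr = global_ptr;  /* single load into a temporary */

        if (ptr != NULL && ptr < high_address)
                do_low(ptr);
}

void as_transformed(void)
{
        /* The temporary is gone, so global_ptr may be loaded
         * up to three times. */
        if (global_ptr != NULL && global_ptr < high_address)
                do_low(global_ptr);
}
\end{verbatim}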
Given code that does plain loads and stores,\footnote{That is, normal loads and stores instead of C11 atomics, inline assembly, or volatile
accesses.} the compiler is within its rights
to assume that the affected variables are neither accessed nor modified by
any other thread. This assumption allows the compiler to carry out a large
number of transformations, including load tearing, store tearing, load fusing,
store fusing, code reordering, invented loads, invented stores, store-to-load
transformations, and dead-code elimination, all of which work just fine in
single-threaded code. But concurrent code can be broken by each of these
transformations, or shared-variable shenanigans, as described below.
\subsection{Load tearing}
Load tearing occurs when the compiler uses multiple load instructions
for a single access. For example, the compiler could in theory compile the
load from global\_ptr (see line 1 of Listing 4.14) as a series of one-byte
loads. If some other thread was concurrently setting global\_ptr to NULL,
the result might have all but one byte of the pointer set to zero, thus forming
a “wild pointer”. Stores using such a wild pointer could corrupt arbitrary
regions of memory, resulting in rare and difficult-to-debug crashes.
Worse yet, on (say) an 8-bit system with 16-bit pointers, the compiler
might have no choice but to use a pair of 8-bit instructions to access a given
pointer. Because the C standard must support all manner of systems, the
standard cannot rule out load tearing in the general case.
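As a rough sketch of the hazard, consider the following fragment, in which the structure type and function names are hypothetical; a plain load of global\_ptr may in principle be split into smaller loads, so that a concurrent store of NULL can yield a wild pointer.
\begin{verbatim}
/* Sketch with hypothetical names. */
struct foo { int a; };
struct foo *global_ptr;

void reader(void)
{
        struct foo *p = global_ptr;  /* plain load: may be torn */

        if (p)
                p->a = 1;  /* possible wild-pointer store */
}

void writer(void)
{
        global_ptr = NULL;  /* concurrent plain store */
}
\end{verbatim}
Marking both accesses with READ\_ONCE() and WRITE\_ONCE(), or declaring global\_ptr volatile as discussed at the end of this section, rules out tearing for properly aligned machine-sized accesses.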
\subsection{Store tearing}
Store tearing occurs when the compiler uses multiple store instructions
for a single access. For example, one thread might store 0x12345678 to
a four-byte integer variable at the same time another thread stored
0xabcdef00. If the compiler used 16-bit stores for either access, the result
might well be 0x1234ef00, which could come as quite a surprise to code
loading from this integer. Nor is this a strictly theoretical issue. For example,
there are CPUs that feature small immediate instruction fields, and on such
CPUs, the compiler might split a 64-bit store into two 32-bit stores in order
to reduce the overhead of explicitly forming the 64-bit constant in a register,
even on a 64-bit CPU. There are historical reports of this actually happening
in the wild.
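A minimal sketch of the scenario, with a hypothetical variable name, might look as follows; if the compiler implements either store as a pair of 16-bit stores, a concurrent reader can observe a mixture such as 0x1234ef00.
\begin{verbatim}
/* Sketch with hypothetical names. */
unsigned int shared_val;

void t0(void)
{
        shared_val = 0x12345678;  /* may be split into 16-bit halves */
}

void t1(void)
{
        shared_val = 0xabcdef00;  /* concurrent plain store */
}
\end{verbatim}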
\subsection{Load fusing}
Load fusing occurs when the compiler uses the result of a prior load
from a given variable instead of repeating the load. Not only is this sort
of optimization just fine in single-threaded code, it is often just fine in
multithreaded code. Unfortunately, the word “often” hides some truly
annoying exceptions.
For example, suppose that a real-time system needs to invoke a function named do\_something\_quickly() repeatedly until the variable need\_to\_stop is set, and that the compiler can see that do\_something\_quickly() does not store to need\_to\_stop. One (unsafe) way to code this is shown in Listing 4.16. The compiler might reasonably unroll this loop sixteen times in order to reduce the per-invocation overhead of the backwards branch at the end of the loop. Worse yet, because the compiler knows that do\_something\_quickly() does not store to need\_to\_stop, the compiler could quite reasonably decide to check this variable only once, resulting in the code shown in Listing 4.17. Once entered, the loop on lines 2–19 will never exit, regardless of how many times some other thread stores a non-zero value to need\_to\_stop. The result will at best be consternation, and might well also include severe physical damage.
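The listings are not reproduced here, but the following sketch, using the names from the discussion above, shows the unsafe loop and one transformation the compiler is permitted to make.
\begin{verbatim}
/* Sketch of the unsafe pattern; not the book's listings. */
int need_to_stop;
void do_something_quickly(void);

void unsafe_wait(void)
{
        while (!need_to_stop)  /* plain load: may be fused */
                do_something_quickly();
}

void as_transformed(void)
{
        if (!need_to_stop)     /* checked only once... */
                for (;;) {     /* ...then loop forever  */
                        do_something_quickly();
                        do_something_quickly();
                        /* ... unrolled sixteen times ... */
                }
}
\end{verbatim}
Loading need\_to\_stop with READ\_ONCE(), or declaring it volatile as described at the end of this section, forces the compiler to re-load it on each iteration.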
The compiler can fuse loads across surprisingly large spans of code. For example, in Listing 4.18, t0() and t1() run concurrently, and do\_something() and do\_something\_else() are inline functions. Line 1 declares pointer gp, which C initializes to NULL by default. At some point, line 5 of t0() stores a non-NULL pointer to gp. Meanwhile, t1() loads from gp three times on lines 10, 12, and 15. Given that line 13 finds that gp is non-NULL, one might hope that the dereference on line 15 would be guaranteed never to fault. Unfortunately, the compiler is within its rights to fuse the reads on lines 10 and 15, which means that if line 10 loads NULL and line 12 loads \&myvar, line 15 could load NULL, resulting in a fault. Note that the intervening READ\_ONCE() does not prevent the other two loads from being fused, despite the fact that all three are loading from the same variable.
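A sketch of this pattern follows; the line numbers in the text refer to Listing 4.18 rather than to this sketch, and the READ\_ONCE()/WRITE\_ONCE() definitions shown are simplified stand-ins for the Linux-kernel macros.
\begin{verbatim}
/* Simplified stand-ins for the Linux-kernel macros. */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

int *gp;        /* C initializes this to NULL by default */
int myvar;
void do_something(int *p);
void do_something_else(void);

void t0(void)
{
        WRITE_ONCE(gp, &myvar);   /* store a non-NULL pointer */
}

void t1(void)
{
        int *p1, *p2, p3;

        p1 = gp;                  /* plain load */
        do_something(p1);
        p2 = READ_ONCE(gp);       /* marked load */
        if (p2) {
                do_something_else();
                p3 = *gp;         /* plain load of gp: may be fused with
                                   * the first, yielding NULL and a fault */
        }
}
\end{verbatim}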
\subsection{Store fusing}
Store fusing can occur when the compiler notices a pair of successive
stores to a given variable with no intervening loads from that variable. In
this case, the compiler is within its rights to omit the first store. This is never
a problem in single-threaded code, and in fact it is usually not a problem in
correctly written concurrent code. After all, if the two stores are executed
in quick succession, there is very little chance that some other thread could
load the value from the first store.
However, there are exceptions, for example as shown in Listing 4.19. The function shut\_it\_down() stores to the shared variable status on lines 3 and 8, and so assuming that neither start\_shutdown() nor finish\_shutdown() accesses status, the compiler could reasonably remove the store to status on line 3. Unfortunately, this would mean that work\_until\_shut\_down() would never exit its loop spanning lines 14 and 15, and thus would never set other\_task\_ready, which would in turn mean that shut\_it\_down() would never exit its loop spanning lines 5 and 6, even if the compiler chooses not to fuse the successive loads from other\_task\_ready on line 5.
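A sketch of the pattern follows; the line numbers in the text refer to Listing 4.19, not to this sketch, and the status constants are hypothetical.
\begin{verbatim}
/* Sketch of the pattern; hypothetical constants, simplified layout. */
enum { IDLE, SHUTTING_DOWN, SHUT_DOWN };

int status;
int other_task_ready;

void start_shutdown(void);
void finish_shutdown(void);
void do_more_work(void);
void do_something_else(void);

void shut_it_down(void)
{
        status = SHUTTING_DOWN;  /* may be fused with the later store */
        start_shutdown();
        while (!other_task_ready)
                continue;
        finish_shutdown();
        status = SHUT_DOWN;
        do_something_else();
}

void work_until_shut_down(void)
{
        while (status != SHUTTING_DOWN)
                do_more_work();
        other_task_ready = 1;
}
\end{verbatim}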
And there are more problems with the code in Listing 4.19, including
code reordering.
\subsection{Code reordering}
Code reordering is a common compilation technique used to combine
common subexpressions, reduce register pressure, and improve utilization of
the many functional units available on modern superscalar microprocessors.
It is also another reason why the code in Listing 4.19 is buggy. For example, suppose that the do\_more\_work() function on line 15 does not access other\_task\_ready. Then the compiler would be within its rights to move the assignment to other\_task\_ready on line 16 to precede line 14, which might be a great disappointment for anyone hoping that the last call to do\_more\_work() on line 15 happens before the call to finish\_shutdown() on line 7.
It might seem futile to prevent the compiler from changing the order of
accesses in cases where the underlying hardware is free to reorder them.
However, modern machines have exact exceptions and exact interrupts,
meaning that any interrupt or exception will appear to have happened at a
specific place in the instruction stream. This means that the handler will see
the effect of all prior instructions, but won’t see the effect of any subsequent
instructions. READ\_ONCE() and WRITE\_ONCE() can therefore be used to
control communication between interrupted code and interrupt handlers,
independent of the ordering provided by the underlying hardware.
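As a hedged sketch of this use case, with hypothetical names, the following uses marked accesses for a flag and an argument shared between mainline code and an interrupt handler on the same CPU; the compiler may not reorder or fuse the marked accesses, and exact interrupts make CPU ordering a non-issue here.
\begin{verbatim}
/* Sketch with hypothetical names; READ_ONCE()/WRITE_ONCE() are the
 * Linux-kernel macros referred to in the text (simplified here). */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

int irq_work_pending;
int irq_work_arg;
void handle_request(int arg);

void request_irq_work(int arg)           /* interrupted code */
{
        WRITE_ONCE(irq_work_arg, arg);
        WRITE_ONCE(irq_work_pending, 1); /* not reordered before the
                                          * argument store */
}

void irq_handler(void)                   /* runs on the same CPU */
{
        if (READ_ONCE(irq_work_pending)) {
                handle_request(READ_ONCE(irq_work_arg));
                WRITE_ONCE(irq_work_pending, 0);
        }
}
\end{verbatim}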
\subsection{Invented loads}
Invented loads were illustrated by the code in Listings 4.14 and 4.15, in
which the compiler optimized away a temporary variable, thus loading from
a shared variable more often than intended.
Invented loads can be a performance hazard. These hazards can occur when a load of a variable in a “hot” cacheline is hoisted out of an if statement. These hoisting optimizations are not uncommon, and can cause
significant increases in cache misses, and thus significant degradation of
both performance and scalability.
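A minimal sketch of such a hoist, with hypothetical names, follows; the transformed version loads the hot variable even when the branch is not taken.
\begin{verbatim}
/* Sketch with hypothetical names. */
int hot_shared_counter;  /* lives in a heavily contended cache line */
int rarely_true(void);
void consume(int v);

void as_written(void)
{
        if (rarely_true())
                consume(hot_shared_counter);
}

void as_transformed(void)
{
        int tmp = hot_shared_counter;  /* invented (hoisted) load */

        if (rarely_true())
                consume(tmp);
}
\end{verbatim}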
\subsection{Invented stores}
Invented stores can occur in a number of situations. For example, a compiler emitting code for work\_until\_shut\_down() in Listing 4.19 might notice that other\_task\_ready is not accessed by do\_more\_work(), and is stored to on line 16. If do\_more\_work() were a complex inline function,
it might be necessary to do a register spill, in which case one attractive place
to use for temporary storage is other\_task\_ready. After all, there are no
accesses to it, so what is the harm?
Of course, a non-zero store to this variable at just the wrong time would
result in the while loop on line 5 terminating prematurely, again allowing
finish\_shutdown() to run concurrently with do\_more\_work(). Given
that the entire point of this while loop appears to be to prevent such concurrency,
this is not a good thing.
Using a stored-to variable as a temporary might seem outlandish, but
it is permitted by the standard. Nevertheless, readers might be justified
in wanting a less outlandish example, which is provided by Listings 4.20
and 4.21.
A compiler emitting code for Listing 4.20 might know that the value of
a is initially zero, which might be a strong temptation to optimize away
one branch by transforming this code to that in Listing 4.21. Here, line 1
unconditionally stores 1 to a, then resets the value back to zero on line 3
if condition was not set. This transforms the if-then-else into an if-then,
saving one branch.
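The two listings are not reproduced here, but the following sketch illustrates the transformation; only the do\_a\_bunch\_of\_stuff() placeholder is invented, while a and condition come from the discussion above.
\begin{verbatim}
/* Sketch with hypothetical names; a is known to start out zero. */
int a;
int condition;
void do_a_bunch_of_stuff(void);

void as_written(void)
{
        if (condition)
                a = 1;
        else
                do_a_bunch_of_stuff();
}

void as_transformed(void)
{
        a = 1;                  /* invented store, visible to others */
        if (!condition) {
                a = 0;          /* put the old value back */
                do_a_bunch_of_stuff();
        }
}
\end{verbatim}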
Finally, pre-C11 compilers could invent writes to unrelated variables
that happened to be adjacent to written-to variables [Boe05, Section 4.2].
This variant of invented stores has been outlawed by the prohibition against
compiler optimizations that invent data races.
\subsection{Store-to-load transformations}
Store-to-load transformations can occur when the compiler notices that
a plain store might not actually change the value in memory. For example,
consider Listing 4.22. Line 1 fetches p, but the “if” statement on line 2
also tells the compiler that the developer thinks that p is usually zero. The
barrier() statement on line 4 forces the compiler to forget the value of
p, but one could imagine a compiler choosing to remember the hint—or
getting an additional hint via feedback-directed optimization. Doing so
would cause the compiler to realize that line 5 is often an expensive no-op.
Such a compiler might therefore guard the store of NULL with a check,
as shown on lines 5 and 6 of Listing 4.23. Although this transformation is
often desirable, it could be problematic if the actual store was required for
ordering. For example, a write memory barrier (Linux kernel smp\_wmb())
would order the store, but not the load. This situation might suggest use of
smp\_store\_release() over smp\_wmb().
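A sketch of the transformation, with hypothetical names, follows; barrier() and unlikely() are shown with their usual Linux-kernel definitions.
\begin{verbatim}
/* Sketch with hypothetical names. */
#define barrier()   __asm__ __volatile__("" ::: "memory")
#define unlikely(x) __builtin_expect(!!(x), 0)

int *p;
void do_something_with(int *q);

void as_written(void)
{
        int *r1 = p;

        if (unlikely(r1))      /* hints that p is usually NULL */
                do_something_with(r1);
        barrier();
        p = NULL;              /* often stores the value already there */
}

void as_transformed(void)
{
        int *r1 = p;

        if (unlikely(r1))
                do_something_with(r1);
        barrier();
        if (p != NULL)         /* invented load guarding the store */
                p = NULL;
}
\end{verbatim}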
\subsection{Dead-code elimination}
Dead-code elimination can occur when the compiler notices that the
value from a load is never used, or when a variable is stored to, but never
loaded from. This can of course eliminate an access to a shared variable,
which can in turn defeat a memory-ordering primitive, which could cause
your concurrent code to act in surprising ways. Experience thus far indicates
that relatively few such surprises will be at all pleasant. Elimination of
store-only variables is especially dangerous in cases where external code
locates the variable via symbol tables: The compiler is necessarily ignorant
of such external-code accesses, and might thus eliminate a variable that the
external code relies upon.
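As a hedged sketch of the store-only case, with hypothetical names, consider a counter that only an external debugging tool ever reads, locating it via the symbol table.
\begin{verbatim}
/* Sketch with hypothetical names. */
int progress_counter;  /* read only by an external tool */
void do_unit_of_work(void);

void worker(void)
{
        int i;

        for (i = 0; ; i++) {
                do_unit_of_work();
                progress_counter = i;  /* store-only in this code: the
                                        * compiler may eliminate it */
        }
}
\end{verbatim}
Declaring progress\_counter volatile (or marking the store with WRITE\_ONCE()) forces the stores to be emitted.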
\subsection{A Volatile Solution}
The volatile keyword can prevent load tearing and store
tearing in cases where the loads and stores are machine-sized and properly
aligned. It can also prevent load fusing, store fusing, invented loads, and
invented stores. However, although it does prevent the compiler from
reordering volatile accesses with each other, it does nothing to prevent
the CPU from reordering these accesses. Furthermore, it does nothing to
prevent either compiler or CPU from reordering non-volatile accesses
with each other or with volatile accesses. Preventing these types of
reordering requires the techniques described in the next section.