<h1 id="sec:opaqueness">3.2 Monitoring</h1>
<p>Obstacles to effective monitoring of AI systems to identify and avoid hazards include the opaqueness of
AI systems and the surprising “emergent” capabilities that appear as they become more advanced. To better
monitor AI systems, we need progress in research areas such as representation reading, model evaluations, and anomaly detection.</p>
<h2 id="ml-systems-are-opaque">3.2.1 ML Systems are Opaque</h2>
<p>The internal operations of many AI systems are opaque. We might be
able to reveal and prevent harmful behavior if we can make these systems
more transparent. In this section, we will discuss why AI systems are
often called <em>black boxes</em> and explore ways to understand them.
Although early research into transparency shows that the problem is
highly difficult and conceptually fraught, its potential to improve AI
safety is substantial.</p>
<p>The most capable machine learning models today are based on deep
neural networks. Whereas most conventional software is directly written
by humans, deep learning (DL) systems independently learn how to
transform inputs to outputs layer-by-layer and step-by-step. We can
direct DL models to learn how to give the right outputs, but we do not
know how to interpret the model’s intermediate computations. In other
words, we do not understand how to make sense of a model’s activations
given a real-world data input. As a result, we cannot make reliable
predictions about a model’s behavior when given new inputs. This section
will present a handful of analogies and results that illustrate the
difficulty of understanding machine learning systems.</p>
<p><strong>Deep learning models as a black box.</strong> Machine
learning researchers often refer to deep learning models as a <em>black box</em>
<span class="citation"
data-cites="lipton2018interpretability">[1]</span>, a system that can only be understood in terms of its input-output behavior without insight into its
internal workings. Humans are black boxes—we see their behavior, but not
the internal brain activity that produces it, let alone how to fully
understand that brain activity. Although a deep neural network’s weights
and activations are observable, these long lists of numbers do not
currently help us understand how a model will behave. We cannot reduce
all the numerical operations of a state-of-the-art model into a form
that is meaningful to humans.<p>
</p>
<figure id="fig:comp-graph">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/computational_graph.png" class="tb-img-full" style="width: 80%" />
<p class="tb-caption">Figure 3.1: ML systems can be broken down into computational graphs with many sections. <span
class="citation" data-cites="zoph2017neural">[2]</span></p>
<!--<figcaption>A section of a computational graph for an ML system - <span-->
<!--class="citation" data-cites="zoph2017neural">[2]</span></figcaption>-->
</figure>
<p><strong>Even simple ML techniques suffer from opaqueness.</strong>
Opaqueness is not unique to neural networks. Even simple ML techniques
such as Principal Component Analysis (PCA), which are better understood
theoretically than DL, suffer from similar flaws. For example, Figure 3.2 depicts the results of performing
PCA on pictures of human faces. This yields a set of “eigenfaces”,
capturing the most important features identifying a face. Any picture of
a face can then be represented as a particular combination of these
eigenfaces.<p>
</p>
<figure id="fig:Eigenfaces">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/Eigenfaces.jpg" class="tb-img-full" style="width: 60%"/>
<p class="tb-caption">Figure 3.2: A human face can be made by combining several eigenfaces, each of which represents
different facial features. <span class="citation"
data-cites="zhang2008eigenfaces">[3]</span></p>
<!--<figcaption>Eigenfaces - <span class="citation"-->
<!--data-cites="zhang2008eigenfaces">[3]</span></figcaption>-->
</figure>
<p>In some cases, we can guess what facial features an eigenface
represents: for example, eigenfaces 1, 2 and 3 seem to capture the
lighting and shading of the face, while eigenface 11 may detect facial
hair. However, most eigenfaces do not represent clear facial features,
and it is difficult to verify that our hypotheses for any single feature
capture the entirety of their role. The fact that even simple techniques
like PCA remain opaque is a sign of the difficulty of the problem in the
more complicated techniques like DL.</p>
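<p>To make the eigenface example concrete, the sketch below is a minimal illustration using scikit-learn, assuming its bundled Olivetti faces dataset can be downloaded; it fits PCA to face images and reconstructs one face from its top components. It is a representative procedure, not the exact one behind Figure 3.2.</p>
<pre><code class="language-python"># Minimal eigenfaces sketch using scikit-learn's Olivetti faces dataset.
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()           # 400 grayscale 64x64 face images
X = faces.data                           # shape (400, 4096), flattened images

pca = PCA(n_components=50)               # keep the top 50 "eigenfaces"
codes = pca.fit_transform(X)             # each face as 50 coefficients

eigenfaces = pca.components_.reshape(-1, 64, 64)

# Reconstruct the first face as a weighted combination of eigenfaces.
reconstruction = pca.inverse_transform(codes[:1]).reshape(64, 64)
error = np.mean((reconstruction - X[0].reshape(64, 64)) ** 2)
print(f"Reconstruction MSE with 50 components: {error:.4f}")
</code></pre>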
<p><strong>Feature visualizations demonstrate that deep learning neurons
are hard to interpret.</strong> In a neural network, a neuron is a
component of an activation vector. One attempt to understand deep
networks involves looking for simple quantitative or algorithmic
descriptions of the relationship between inputs and neurons such as “if
the ear feature has been detected, the model will output either dog or
cat” <span class="citation" data-cites="bau2017vision">[4]</span>. For
image models, we can create <em>feature visualizations</em>, artificial
images that highly activate a particular neuron (or set of neurons)
<span class="citation" data-cites="olah2017feature">[5]</span>. We can
also examine natural images that highly activate that neuron.<p>
</p>
<figure id="fig:random-neuron">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/feature_vis.png" />
<p class="tb-caption"> Figure 3.3: Left: a feature visualization that “highly activates” a particular neuron. Right: a
collection of natural images that activate a particular neuron. <span class="citation"
data-cites="schubert2020openai">[6]</span></p>
<!--<figcaption>From <span class="citation"-->
<!--data-cites="schubert2020openai">[6]</span>. A randomly selected neuron-->
<!--in the CLIP ResNet-50 image model. Left: a feature visualization that-->
<!--“highly activates” the neuron, meaning the neuron reads a high value-->
<!--when the image is input to the model. Right: example images that-->
<!--activate the neuron</figcaption>-->
</figure>
<p>Like eigenfaces, neurons may be more or less interpretable.
Sometimes, feature visualizations identify neurons that seem to depend
on a pattern of the input that is clear to humans. For example, a neuron
might activate only when an image contains dog ears. In other cases, we
observe <em>polysemantic neurons</em>, which defy a single
interpretation <span class="citation"
data-cites="elhage2022softmax">[7]</span>. Consider Figure 3.3, which shows images that
highly activate a randomly chosen neuron in an image model. Judging from
the natural images, it seems like the neuron often activates when text
associated with traveling or moving is present, but it’s hard to be
sure.</p>
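<p>Feature visualizations of this kind are typically produced by optimizing the input image itself. The sketch below is a simplified PyTorch version, assuming a pretrained torchvision ResNet-50; the choice of layer and channel is illustrative, and real implementations add regularizers and image transformations to obtain cleaner visualizations.</p>
<pre><code class="language-python"># Sketch of feature visualization by gradient ascent on the input image (PyTorch).
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)              # we only optimize the input image

activations = {}
def hook(module, inputs, output):
    activations["layer3"] = output

model.layer3.register_forward_hook(hook)                  # layer choice is illustrative

image = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
optimizer = torch.optim.Adam([image], lr=0.05)
channel = 7                                                # hypothetical channel index

for step in range(256):
    optimizer.zero_grad()
    model(image)
    # Maximize the mean activation of the chosen channel (minimize its negative).
    loss = -activations["layer3"][0, channel].mean()
    loss.backward()
    optimizer.step()

print("final activation value:", -loss.item())
</code></pre>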
<p><strong>Neural networks are complex systems.</strong> Both human
brains and deep neural networks are complex systems, and so involve
interdependent and nonlinear interactions between many components. Like
many other complex systems (see the Complex Systems chapter), the emergent behaviors of
neural networks are difficult to understand in terms of their
components. Just as neuroscientists struggle to identify how a
particular biological neuron contributes to a mind’s behavior, ML
researchers struggle to determine how a particular artificial neuron
contributes to a DL model’s behavior. There are limits on our ability to
systematically understand and predict complex systems, which suggests
that ML opaqueness may be a special case of the opaqueness of complex
systems.</p>
<h2 id="motivations-for-transparency-research">3.2.2 Motivations for
Transparency Research</h2>
<p>There is often no way to tell whether a model will perform well on
new inputs. If the model performs poorly, we generally cannot tell why.
With better transparency tools, we might be able to reveal and
proactively prevent failure modes, detect the emergence of new
capabilities, and build trust that models will perform as expected in
new circumstances. High-stakes domains might demand guarantees of
reliability based on the soundness or security of internal AI processes,
but virtually no such guarantees can be made for neural networks given
the current state of transparency research.<p>
If we could meaningfully understand how a model treats a given input, we
would be better able to monitor and audit its behavior. Additionally, by
understanding how models solve difficult and novel problems,
transparency might also become a source of conceptual and scientific
insight <span class="citation"
data-cites="lipton2018interpretability">[1]</span>.</p>
<p><strong>Ethical obligations to make AI transparent.</strong> Model
transparency can help ensure that model decision making is fair,
unbiased, and ethical. For example, if a criminal justice system uses an
opaque AI to make decisions about policing, sentencing, or probation,
then those decisions will be similarly opaque. People might have a right
to an explanation of decisions that will significantly affect them <span
class="citation" data-cites="kaminski2019explanation">[8]</span>.
Transparency tools may be crucial to ensuring that right is upheld.</p>
<p><strong>Accountability for harms and hazards.</strong> Who is
responsible when AI systems fail? Responsibility often depends on the
intentions and degree of control held by those involved. The best way to
incentivize safety might be to hold AI creators responsible for the
damage their systems cause. However, we might not want to hold people
responsible for the behavior of systems they cannot predict or
understand. The growing autonomy and complexity of AI systems means that
people will have less control over AI behavior. Meanwhile, the scope and
generality of modern AI systems make it impossible to verify desirable
behavior in every case. In “human-in-the-loop” systems, where decisions
depend on both humans and AIs, human operators might be blamed for
failures over which they had little control <span class="citation"
data-cites="elish2019moral">[9]</span>.<p>
AI transparency could enable a more robust system of accountability. For
instance, governments could mandate that AI systems meet baseline
requirements for understandability. If an AI fails because of a
mechanism that its creator could have identified and prevented with
transparency tools, we would be more justified in holding that creator
liable. Transparency could also help to identify responsibility and
fairly assign blame in failures involving human-in-the-loop systems.</p>
<p><strong>Combating deception.</strong> Just as a person’s behavior can
correspond with many intentions, an AI’s behavior can correspond to many
internal processes, some of which are more acceptable than others. For
example, competent deception is intrinsically difficult to distinguish
from genuine helpfulness. We discuss this issue in more detail in the Control
section. For phenomena like deception that are difficult to detect from
behavior alone, transparency tools might allow us to catch internal
signs that show that a model is engaging in deceptive behavior.</p>
<h2 id="approaches-to-transparency">3.2.3 Approaches to Transparency</h2>
<p>The remainder of this section explores a variety of approaches to
transparency. Though the field is promising, we are careful to note the
shortcomings of these approaches. For a problem as conceptually tricky
as opaqueness, it is important to maintain a clear picture of what
successful techniques must achieve and hold new methods to a high
standard. We will discuss the research areas of explainability, saliency
maps, mechanistic interpretability, and representation engineering.</p>
<h3 id="explanations">Explanations</h3>
<p><strong>What must explanations accomplish?</strong> One approach to
transparency is to create explanations of a model’s behavior. These
explanations could have the following virtues:</p>
<ul>
<li><p>Predictive power: A good explanation should help us understand
not just a specific behavior, but how the model is likely to behave in
new situations. Building user trust in a system is easier when a user
can more clearly anticipate model behavior.</p></li>
<li><p>Faithfulness: A faithful explanation accurately reflects the
internal workings of the model. This is especially valuable when we need
to understand the precise reason why a model made a particular decision.
Faithful explanations are often better able to predict behavior because
they more closely track the actual mechanisms that models are using to
produce their behavior <span class="citation"
data-cites="lipton2018interpretability">[1]</span>.</p></li>
<li><p>Simplicity: A simple explanation is easier to understand.
However, it is important that the simplification does not sacrifice too
much information about actual model processes. Though some information
loss is inevitable, explanations must strike the right balance between
simplicity and faithfulness.</p></li>
</ul>
<p><strong>Explanations must avoid confabulation.</strong> Explanations
can sound plausible even if they are false. A <em>confabulation</em> is
an explanation that is not faithful to the true processes and
considerations that gave rise to a behavior. Both humans and AI systems
confabulate.</p>
<p><strong>Human confabulation.</strong> Humans are not transparent
systems, even to themselves. In some sense, the field of psychology
exists because humans cannot accurately intuit how their own mental
processes produce their experience and behavior. For example, mock
juries tend to be more lenient with attractive defendants, all else
being equal, even though jurors almost never reference attractiveness
when explaining their decisions <span class="citation"
data-cites="patry2008attractive">[10]</span>.<p>
Another example of human confabulation can be drawn from studies on
split-brain patients, those who have had the connection between their
two cerebral hemispheres surgically severed causing each hemisphere to
process information independently <span class="citation"
data-cites="dehaan2020split">[11]</span>. Researchers can give
information to one hemisphere and not the other by showing the
information to only one eye. In some experiments, researchers gave
written instructions to a patient’s right hemisphere, which is unable to
speak. After the patient completed the instructions, the researchers
asked the patient’s verbal left hemisphere why they had taken those
actions. Unaware of the instructions, the left hemisphere reported
plausible but incorrect explanations for the patient’s behavior.</p>
<p><strong>Machine learning system confabulation.</strong> We can ask
language models to provide justifications along with their answers.
Natural language reasoning is much easier to understand than internal
model activations. For example, if an LLM describes each step of its
reasoning in a math problem and gets the question wrong, humans can
check where and how the mistake was made.<p>
However, like human explanations, language model explanations are prone
to unreliability and confabulation. For instance, when researchers
fine-tuned a language model on multiple-choice questions where option
(a) was always correct, the model learned to always answer (a). When
this model was told to write explanations for questions whose correct
answers were not (a), the model would produce false but plausible
explanations for option (a). The model’s explanation systematically
failed to mention the real reason for its answers, which was that it had
been trained to always pick (a) <span class="citation"
data-cites="turpin2023language">[12]</span>.</p>
<p><strong>An alternative view of explanations.</strong> Instead of
requiring that explanations directly describe internal model processes,
a more expansive view argues that explanations are just any useful
auxiliary information provided alongside the output of a model. Such
explanations might include contextual knowledge or observations that the
model makes about the input. Models can also make auxiliary predictions;
for example, they could note that if an input were different in some
specific ways, the output would change. However, while this type of
information can be valuable when presented correctly, such explanations
have the potential to mislead us.</p>
<h3 id="saliency-maps">Saliency Maps</h3>
<p><strong>Saliency maps purport to identify important components of
images.</strong> Saliency maps are visualizations that aim to show which
parts of the input are most relevant to the model’s behavior <span
class="citation" data-cites="simonyan2014deep">[13]</span>. They are
inspired by biological visual processing: when humans and other animals
are shown an image, they tend to focus on particular areas. For example,
if a person looks at a picture of a dog, the dog’s ears and nose will be
more relevant than the background to how the person interprets the
image. Saliency map techniques have been popular in part due to the
striking visualizations they produce.</p>
<figure id="fig:saliency-map">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/saliency_map.png" class="tb-img-full"/>
<p class="tb-caption">Figure 3.4: A saliency map picks out features from an input that seem particularly relevant to the
model, such as the shirt and cowboy hat in the bottom left image. <span class="citation"
data-cites="springenberg2015striving">[14]</span></p>
<!--<figcaption>Example of a saliency map technique for various images - -->
<!--<span class="citation"-->
<!--data-cites="springenberg2015striving">[14]</span></figcaption>-->
</figure>
<p><strong>Saliency maps often fail to show how machine learning vision
models process images.</strong> In practice, saliency maps are largely
bias-confirming visualizations that usually do not provide useful
information about models’ inner workings. It turns out that many
saliency maps are not dependent on a model’s parameters, and the
saliency maps often look similar even when generated for random,
untrained models. That means many saliency maps are incapable of
providing explanations that have any relevance to how a particular model
processes data <span class="citation"
data-cites="adebayo2018sanity">[15]</span>. Saliency maps serve as a
warning that visually or intuitively satisfying information that seems
to correspond with model behavior may not actually be useful. Useful
transparency research must avoid the past failures of the field and
produce explanations that are relevant to the model’s operation.</p>
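<p>For reference, a basic gradient saliency map simply measures how the top class score changes with each input pixel. The sketch below shows the vanilla-gradients variant in PyTorch, assuming a pretrained torchvision classifier and an already-preprocessed input tensor; methods of this kind are among those the sanity checks above were applied to.</p>
<pre><code class="language-python"># Vanilla-gradient saliency map sketch (PyTorch).
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

# Stand-in for a preprocessed image batch of shape (1, 3, 224, 224).
image = torch.rand(1, 3, 224, 224, requires_grad=True)

logits = model(image)
top_class = logits[0].argmax().item()

# Gradient of the top class score with respect to the input pixels.
logits[0, top_class].backward()

# Importance map: largest absolute gradient across the three color channels.
saliency = image.grad.abs().max(dim=1).values    # shape (1, 224, 224)
print("most salient pixel index:", saliency[0].argmax().item())
</code></pre>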
<h3 id="mechanistic-interpretability">Mechanistic Interpretability</h3>
<p>When trying to understand a system, we might start by finding the
smallest pieces of the system that can be well understood and then
combine those pieces to describe larger parts of the system. If we can
understand successively larger parts of the system, we might eventually
develop a bottom-up understanding of the entire system. <em>Mechanistic
interpretability</em> is a transparency research area that aims to
represent models in terms of combinations of small, well-understood
mechanisms <span class="citation"
data-cites="wang2022interpretability">[16]</span>. If we can
reverse-engineer algorithms that describe small subsets of model
activations and weights, we might be able to combine these algorithms to
explain successively larger parts of the model.</p>
<p><strong>Features are the building blocks of deep learning
mechanisms.</strong> Mechanistic interpretability proposes focusing on
<em>features</em>, which are directions in a layer’s activation space
that aim to correspond to a meaningful, articulable property of the
input <span class="citation" data-cites="olah2020zoom">[17]</span>. For
example, we can imagine a language model with a “this is in Paris”
feature. If we evaluate the input “Eiffel Tower” using the language
model, we may find that a corresponding activation vector points in a
similar direction as the “this is in Paris” feature direction <span
class="citation" data-cites="meng2023locating">[18]</span>. Meanwhile,
the activation vector encoding “Coliseum” may point away from the “this
is in Paris” direction. Other examples of image or text features include
“this text is code”, curve detectors, and a large-small dichotomy
indicator.<p>
One goal of mechanistic interpretability is to identify features that
maintain a coherent description across many different inputs: a “this is
in Paris” feature would not be very valuable if it was highly activated
by “Statue of Liberty.” Recall that most neurons are polysemantic,
meaning they don’t individually represent features that are
straightforwardly recognizable by humans. Instead, most features are
actually combinations of neurons, making them difficult to identify due
to the sheer number of possible combinations. Despite this challenge,
features can help us think about the relationship between the internal
activations of models and human-understandable concepts.</p>
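<p>The idea of a feature as a direction can be made concrete with a toy example. The sketch below uses made-up activation vectors rather than real model internals, scoring inputs by the cosine similarity between their activations and a hypothesized “this is in Paris” direction.</p>
<pre><code class="language-python"># Toy sketch: scoring activations against a hypothesized feature direction.
import numpy as np

rng = np.random.default_rng(0)
d = 512                                   # hidden dimension (illustrative)

# Pretend this direction was found by some interpretability method.
paris_direction = rng.normal(size=d)
paris_direction /= np.linalg.norm(paris_direction)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up activations: one points partly along the feature, one does not.
eiffel_tower_act = 3.0 * paris_direction + rng.normal(scale=0.5, size=d)
coliseum_act = rng.normal(size=d)

print("Eiffel Tower score:", round(cosine(eiffel_tower_act, paris_direction), 3))
print("Coliseum score:    ", round(cosine(coliseum_act, paris_direction), 3))
</code></pre>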
<p><strong>Circuits are algorithms operating on features.</strong>
Features can be understood in terms of other features. For example, if
we’ve discovered features in one layer of an image model that detect dog
ears, snouts, and tails, an input image with high activations for all of
these features may be quite likely to contain a dog. In fact, if we
discover a dog-detecting feature in the next layer of the model, it is
plausible that this feature is calculated using a combination of
dog-part-detecting features from the previous layer. We can test that
hypothesis by checking the model’s weights.<p>
A function represented in model weights which relates a model’s earlier
features to its later features is called a <em>circuit</em> <span
class="citation" data-cites="olah2020zoom">[17]</span>. In short,
circuits are computations within a model that are often more
understandable. The project of mechanistic interpretability is to
identify features in models and circuits between them. The more features
and circuits we identify, the more confident we can be that we
understand some of the model’s mechanisms. Circuits also simplify our
understanding of the model, allowing us to equate complicated numerical
manipulations with simpler algorithmic abstractions.</p>
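<p>The weight-checking step can also be illustrated on a toy linear layer. In the sketch below the weights and feature directions are synthetic and purely illustrative; the point is that projecting a later feature’s incoming weights onto earlier feature directions indicates which earlier features it reads from.</p>
<pre><code class="language-python"># Toy sketch: testing whether a later feature is computed from earlier features.
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 64, 32

# Hypothetical earlier-layer feature directions (unit vectors).
ear = rng.normal(size=d_in);   ear /= np.linalg.norm(ear)
snout = rng.normal(size=d_in); snout /= np.linalg.norm(snout)
tail = rng.normal(size=d_in);  tail /= np.linalg.norm(tail)

# Synthetic weight matrix whose row 5 mostly sums the three dog-part directions.
W = rng.normal(scale=0.1, size=(d_out, d_in))
W[5] += 1.0 * ear + 0.8 * snout + 0.6 * tail

# Inspect the weights feeding the hypothesized "dog" feature (row 5).
dog_weights = W[5]
for name, direction in [("ear", ear), ("snout", snout), ("tail", tail)]:
    print(f"projection of dog weights onto {name}: {dog_weights @ direction:.2f}")
</code></pre>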
<p><strong>An empirical example of a circuit.</strong> For the sake of
illustration, we will describe a purported circuit from a language
model. Researchers identified how a language model often predicts
indirect objects of sentences (such as “Mary” in “John gave a drink to
...”) as a simple algorithm using all previous names in a sentence (see
Figure 3.5 below). This mechanism did not
merely agree with model behavior, but was directly derived from the
model weights, giving more confidence that the algorithm is a faithful
description of an internal model mechanism <span class="citation"
data-cites="wang2022interpretability">[16]</span>.<p>
</p>
<figure id="fig:id-circuit">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/identification_circuit.png" class="tb-img-full"/>
<p class="tb-caption">Figure 3.5: An indirect-object identification circuit can be depicted graphically</p>
<!--<figcaption>Graphical depiction of indirect-object identification-->
<!--circuit</figcaption>-->
</figure>
<p><strong>Complex system understanding through mechanisms is
limited.</strong> There are several reasons to be concerned about the
ability of mechanistic interpretability research to achieve its
ambitions. It is challenging to reduce a complex system’s behavior into
many different low-level mechanisms. Even if we understood each of a
trillion neurons in a large model, we might not be able to combine the
pieces into an understanding of the system as a whole. Another concern
is that it is unclear if mechanistic interpretability can represent
model processes with enough simplicity to be understandable. ML models
might represent vast numbers of partial concepts and complex intuitions
that can not be represented by mechanisms or simple concepts.</p>
<h3 id="representation-engineering">Representation Engineering</h3>
<p><strong>Representation reading and representation control <span
class="citation"
data-cites="zou2023representation">[19]</span>.</strong> Mechanistic
interpretability is a bottom-up approach and combines small components
into an understanding of larger structures. Meanwhile,
<em>representation engineering</em> is a top-down approach that begins
with a model’s high-level representations and analyzes and controls
them. In machine learning, models learn representations that are not
identical to their training data, but rather stand in for it and allow
them to identify essential elements or patterns in the data (see the Artificial Intelligence Fundamentals chapter
for further details). Rather than try to fully understand arbitrary
aspects of a model’s internals, representation engineering develops
actionable tools for reading representations and controlling them.<p>
<strong>We can detect high-level subprocesses.</strong> Even though
neuroscientists don’t understand the brain in fine-grained detail, they
can associate high-level cognitive tasks to particular brain regions.
For example, they have shown that Wernicke’s area is involved in speech
comprehension. Though the brain was once a complete black box,
neuroscience has managed to decompose it into many parts.
Neuroscientists can now make detailed predictions about a person’s
emotional state, thoughts, and even mental imagery just by monitoring
their brain activity <span class="citation"
data-cites="tang2023semantic">[20]</span>.<p>
Representation reading is a similar approach of identifying indicators
for particular subprocesses. We can provide stimuli that relate to the
concepts or behaviors that we want to identify. For example, to identify
and control honesty-related outputs, we can provide contrasting prompts
to a model such as “Pretend you’re [an honest/a dishonest] person making
statements about the world.” We can track the differences in the model’s
activations when responding to these stimuli. We can use these
techniques to find portions of models which are responsible for
important behaviors like models refusing requests or deceiving users by not revealing knowledge they possess.</p>
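<p>A much-simplified version of representation reading is sketched below using a small Hugging Face model (GPT-2). It builds a crude “honesty” reading vector from a single pair of contrasting prompts; the approach described above uses many contrasting pairs and a more careful aggregation of the activation differences.</p>
<pre><code class="language-python"># Sketch of representation reading with contrasting prompts (Hugging Face GPT-2).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def last_token_activation(prompt, layer=-1):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]     # (1, seq_len, d_model)
    return hidden[0, -1]                                  # activation at final token

honest = last_token_activation(
    "Pretend you're an honest person making statements about the world.")
dishonest = last_token_activation(
    "Pretend you're a dishonest person making statements about the world.")

# A crude "honesty" reading vector: the difference of the two activations.
reading_vector = honest - dishonest
reading_vector /= reading_vector.norm()

# Score a new activation by projecting it onto the reading vector.
test = last_token_activation("The Earth orbits the Sun.")
print("projection onto honesty direction:", float(test @ reading_vector))
</code></pre>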
<p><strong>Conclusion.</strong> ML transparency is a challenging problem
because of the difficulty of understanding complex systems. Major ongoing research areas
include mechanistic interpretability and representation reading, the latter of which
does not aim to make neural networks fully understood from the bottom up, but aims
to gain useful internal knowledge from a model’s representations.</p>
<h2 id="sec:emergence">3.2.4 Emergent Capabilities</h2>
<p>We cannot predict all the properties of more advanced AI systems just
by studying the properties of less advanced systems. This makes it hard
to guarantee the safety of systems as they become increasingly
advanced.</p>
<p><strong>It is generally difficult to control systems that exhibit
emergence.</strong> <em>Emergence</em> occurs when a system’s
lower-level behavior is qualitatively different from its higher-level
behavior. For example, given a small amount of uranium in a fixed
volume, nothing much happens, but with a much larger amount, you end up
with a qualitatively new nuclear reaction. When more is different,
understanding the system at one scale does not guarantee that one can
understand that system at some other scale <span class="citation"
data-cites="anderson1972more steinhardt2022more">[23], [24]</span>. This
means that control procedures may not transfer between scales and can
lead to a weakening of control.<p>
The general phenomenon of emergence and its applicability to AI systems are
discussed at greater length in Section 5.2. In this section, we will look at examples of emergence in neural
networks, ranging from emergent capabilities to emergent goal-directed
behavior and emergent optimization. Then we will discuss the potential
risks of AI systems intrinsifying unintended goals and examine how this
could result in catastrophic consequences.</p>
<p><strong>Neural networks exhibit emergent capabilities.</strong> When
we make AI models larger, train them for longer periods, or expose them
to more data, these systems spontaneously develop qualitatively new and
unprecedented <em>emergent capabilities</em> <span class="citation"
data-cites="wei2022emergent">[25]</span>. These range from simple
capabilities including solving arithmetic problems and unscrambling
words to more advanced capabilities including passing college-level
exams, programming, writing poetry, and explaining jokes. For these
emergent capabilities, there is some critical combination of model size,
training time, and dataset size below which models are unable to perform
the task, and beyond which models begin to achieve higher
performance.</p>
<figure id="fig:emergent_graphs">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/FLOPs.png" class="tb-img-full" />
<p class="tb-caption">Figure 3.6: LLMs exhibit clear emergent capabilities on a variety of tasks. <span
class="citation" data-cites="wei2022emergent">[25]</span></p>
<!--<figcaption>Emergent AI capabilities across multiple benchmarks - <span-->
<!--class="citation" data-cites="wei2022emergent">[3]</span></figcaption>-->
</figure>
<p><strong>Emergent capabilities are unpredictable.</strong> Typically,
the training loss does not directly select for emergent capabilities.
Instead, these capabilities emerge because they are instrumentally
useful for lowering the training loss. For example, large language
models trained to predict the next token of text about everyday events
develop some understanding of the events themselves. Developing common
sense is instrumental in lowering the loss, even if it was not
explicitly selected for by the loss.<p>
As another example, large language models may also learn how to create
text art and how to draw illustrations with text-based formats like TiKZ
and SVG <span class="citation" data-cites="wei2022emergent">[25]</span>.
They develop a rudimentary spatial reasoning ability not directly
encoded in the purely text-based loss function. Beforehand, it was
unclear even to experts that such a simple loss could give rise to such
complex behavior, which demonstrates that specifying the training loss
does not necessarily enable one to predict the capabilities an AI will
eventually develop.<p>
In addition, capabilities may “turn on” suddenly and unexpectedly.
Performance on a given capability may hover near chance levels until the
model reaches a critical threshold, beyond which performance begins to
improve dramatically. For example, the AlphaZero chess model develops
human-like chess concepts such as material value and mate threats in a
short burst around 32,000 training steps <span class="citation"
data-cites="McGrath_2022">[26]</span>.<p>
Despite specific capabilities often developing through discontinuous
jumps, the average performance tends to scale according to smooth and
predictable scaling laws. The average loss behaves much more regularly
because averaging over many different capabilities developing at
different times and at different speeds smooths out the jumps. From this
vantage point, then, it is often hard to even detect new
capabilities.<p>
</p>
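<p>A toy simulation with synthetic numbers, not real benchmark data, illustrates how this averaging can hide individual jumps: each simulated capability “turns on” sharply at its own scale, yet the aggregate curve is smooth.</p>
<pre><code class="language-python"># Toy illustration: individual capabilities jump, the average looks smooth.
import numpy as np

scales = np.logspace(0, 4, 200)              # pretend model "scale" axis
rng = np.random.default_rng(0)

# Each synthetic task turns on sharply at its own random threshold scale.
thresholds = rng.uniform(1, 4, size=100)     # thresholds in log10 units
task_acc = 1 / (1 + np.exp(-20 * (np.log10(scales)[:, None] - thresholds[None, :])))

average_acc = task_acc.mean(axis=1)

# The sharpest single-task jump vs. the largest jump in the average curve.
print("max per-step jump, single task:", float(np.diff(task_acc[:, 0]).max()))
print("max per-step jump, average:    ", float(np.diff(average_acc).max()))
</code></pre>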
<figure id="fig:unicorn">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/unicorn.png" class="tb-img-full"/>
<p class="tb-caption">Figure 3.7: GPT-4 proved able to create illustrations of unicorns despite having not been trained to
create images: another example of an unexpected emergent capability. <span class="citation"
data-cites="bubeck2023sparks">[27]</span></p>
<!--<figcaption>Unexpected capability example: illustrations of unicorns by-->
<!--GPT-4 - <span class="citation"-->
<!--data-cites="bubeck2023sparks">[5]</span></figcaption>-->
</figure>
<p><strong>Capabilities can remain hidden until after training.</strong>
In some cases, new capabilities are not discovered until after training
or even in deployment. For example, after training and before
introducing safety mitigations, GPT-4 was evaluated to be capable of
offering detailed guidance on planning attacks or violence, building
various weapons, drafting phishing materials, finding illegal content,
and encouraging self-harm <span class="citation"
data-cites="2023gpt4">[28]</span>. Other examples of capabilities
discovered after training include prompting strategies that improve
model performance on specific tasks or jailbreaks that bypass rules
against producing harmful outputs or writing about illegal acts. In some
cases, such jailbreaks were not discovered until months after the
targeted system was first publicly released <span class="citation"
data-cites="Zou2022ForecastingFW">[29]</span>.</p>
<h2 id="emergent-goal-directed-behavior">3.2.5 Emergent Goal-Directed
Behavior</h2>
<p>Besides developing emergent capabilities for solving specific,
individual problems, models can develop <em>emergent goal-directed
behavior</em>. This includes behaviors that extend beyond individual
tasks and into more complex, multifaceted environments.</p>
<h3 id="emergence-in-rl">Emergence in RL</h3>
<p><strong>RL agents develop emergent goal-directed behavior.</strong>
AIs can learn tactics and strategies involving many intermediate steps.
For instance, models trained on Crafter, a Minecraft-inspired toy
environment, learn behaviors such as digging tunnel systems,
bridge-building, blocking and dodging, sheltering, and even
farming—behaviors that were not explicitly selected for by the reward
function <span class="citation"
data-cites="hafner2022benchmarking">[30]</span>.<p>
As with emergent capabilities, models can acquire these emergent
strategies suddenly and discontinuously. One such example was observed
in the video game StarCraft II, where players take the role of opposing
military commanders managing troops and resources in real-time. During
training, AlphaStar, a model trained to play StarCraft II, progresses
through a sequence of emergent strategies and counter-strategies for
managing troops and resources in a back-and-forth manner that resembles
how human players discover and supplant strategies in the game. While
some of these steps are continuous and piecemeal, others involve more
dramatic changes in strategy. Comparatively simple reward functions can
give rise to highly sophisticated strategies and complex learning
dynamics.</p>
<p><strong>RL agents learn emergent tool use.</strong> RL agents can
learn emergent behaviors involving tools and the manipulation of the
environment. Typically, as in the Crafter example, teaching RL agents to
use tools has required introducing intermediate rewards
(<em>achievements</em>) that encourage the model to learn that behavior.
However, in other settings, RL agents learn to use tools even when not
directly optimized to do so.<p>
Referring back to the example of hide and seek mentioned in the previous
section, the agents involved developed emergent tool use. Multiple
hiders and seekers competed against each other in toy environments
involving movable boxes and ramps. Over time, the agents learned to
manipulate these tools in novel and unexpected ways, progressing through
distinct stages of learning in a way similar to AlphaStar <span
class="citation" data-cites="baker2019emergent">[31]</span>. In the
initial (pre-tool) phase, the agents adopted simple chase and escape
tactics. Later, hiders evolved their strategy by constructing forts
using the available boxes and walls.<p>
However, their advantage was temporary because the seekers adapted by
pushing a ramp towards the fort, which they could climb and subsequently
invade. In turn, the hiders responded by relocating the ramps to the
edges of the game area—rendering them inaccessible—and securely
anchoring them in place. It seemed that the strategies had converged to
a stable point; without ramps, how were the seekers to invade the
forts?<p>
But then, the seekers discovered that they could still exploit the
locked ramps by positioning a box near one, climbing the ramp, and then
leaping onto the box. (Without a ramp, the boxes were too tall to
climb.) Once atop a box, a bot could “surf” it across the arena while
staying on top by exploiting an unexpected quirk of the physics engine.
Eventually, the hiders caught on and learned to secure the boxes in
advance, thereby neutralizing the box-surfing strategy. Even though the
agents had learned through the simple objective of trying to avoid the
gaze (in the case of hiders) or seek out (in the case of seekers) the
opposing players, they learned to use tools in sophisticated ways, even
some the researchers had never anticipated.<p>
</p>
<figure id="fig:tool-use">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/3d_puzzle.png" class="tb-img-full"/>
<p class="tb-caption">Figure 3.8: In multi-agent hide-and-seek, AIs demonstrated emergent tool use. <span
class="citation" data-cites="2019openai">[10]</span>.</p>
<!--<figcaption>Emergent tool-use in multi-agent hide-and-seek - <span-->
<!--class="citation" data-cites="2019openai">[10]</span>.</figcaption>-->
</figure>
<p><strong>RL agents can give rise to emergent social dynamics.</strong>
In multi-agent environments, agents can develop and give rise to complex
emergent dynamics and goals involving other agents. For example, OpenAI
Five, a model trained to play the video game Dota 2, learned a basic
ability to cooperate with other teammates, even though it was trained in
a setting where it only competed against bots. It acquired an emergent
ability not explicitly represented in its training data <span
class="citation" data-cites="2019openai">[32]</span>.<p>
Another salient example of emergent social dynamics and emergent goals
involves <em>generative agents</em>, which are built on top of language
models by equipping them with external scaffolding that lets them take
actions and access external memory <span class="citation"
data-cites="park2023generative">[34]</span>. In a simple 2D village
environment, these generative agents manage to form lasting
relationships and coordinate on joint objectives. By placing a single
thought in one agent’s mind at the start of a “week” that the agent
wants to have a Valentine’s Day party, the entire village ends up
planning, organizing, and attending a Valentine’s Day party. Note that
these generative agents are language models, not classical RL agents,
which demonstrates that emergent goal-directed behavior and social
dynamics are not exclusive to RL settings. We further discuss emergent
social dynamics in the Collective Action Problems chapter.<p>
</p>
<figure id="fig:emergent-social-behaviour">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/AI_in_game.png" class="tb-img-full"/>
<p class="tb-caption">Figure 3.9: Generative AI agents exhibited emergent social behavior. <span
class="citation" data-cites="park2023generative">[34]</span></p>
<!--<figcaption>Emergent social behavior in generative agents - <span-->
<!--class="citation"-->
<!--data-cites="park2023generative">[11]</span></figcaption>-->
</figure>
<h3 id="emergent-optimizers">Emergent Optimizers</h3>
<p><strong>Optimizers can give rise to emergent optimizers.</strong> An
optimization process such as Stochastic Gradient Descent (SGD) can
discover solutions that are themselves optimizers. This phenomenon
introduces an additional layer of complexity in understanding the
behaviors of AI models and can introduce additional control issues <span
class="citation" data-cites="Hubinger2019RisksFL">[35]</span>.<p>
For example, if we train a model on a maze-solving task, we might end up
with a model implementing simple maze-solving heuristics (e.g.
“right-hand on the wall”). We might also end up with a model
implementing a general-purpose maze-solving algorithm, capable of
optimizing for maze-solving solutions in a variety of different
contexts. We call the second class of models <em>mesa-optimizers</em>
and whatever goal they have learned to optimize for (e.g. solving mazes)
their <em>mesa-objective</em>. The term "mesa" is meant as the opposite
of “meta,” such that a mesa-optimizer is the opposite of a
meta-optimizer (where a meta-optimizer is an optimizer on top of another
optimizer, a mesa-optimizer is an optimizer beneath another
optimizer).</p>
<p><strong>Few-shot learning is a form of emergent
optimization.</strong> Perhaps the clearest example of emergent
optimization is <em>few-shot learning</em>. By providing large language
models with several examples of a new task that the system has not yet seen
during training, the model may still be able to learn to perform that
task entirely during inference. The resemblance between few-shot or
“in-context” learning and other learning processes like SGD is not just
in analogy: recent papers have demonstrated that in-context learning
behaves as an approximation of SGD. That is, Transformers are performing
a kind of internal optimization procedure, where as they receive more
examples of the task at hand, they qualitatively change the kind of
model they are implementing <span class="citation"
data-cites="vonoswald2023uncovering oswald2023transformers">[36],
[37]</span>.</p>
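<p>The structure of few-shot learning is easy to see in a prompt. The sketch below uses a small text-generation pipeline (GPT-2) purely for illustration; a model this small will often fail at the task, but the point is that the task is specified only through in-context examples, never through training.</p>
<pre><code class="language-python"># Sketch of few-shot ("in-context") learning: the task is specified only in the prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # small model, illustrative only

prompt = (
    "Reverse each word.\n"
    "Input: cat  Output: tac\n"
    "Input: dog  Output: god\n"
    "Input: bird Output:"
)

# The model is never fine-tuned on word reversal; any competence at the task
# has to come from adapting to the examples during inference.
completion = generator(prompt, max_new_tokens=5, do_sample=False)
print(completion[0]["generated_text"])
</code></pre>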
<h2 id="tail-risk-emergent-goals">3.2.6 Tail Risk: Emergent Goals</h2>
<p>Just as AIs can develop emergent capabilities and emergent
goal-seeking behavior, they may develop <em>emergent goals</em> that
differ from the explicit objectives we give them. This poses a risk
because it could result in imperfect control. Moreover, if models become
self-aware and begin actively pursuing undesired goals, the risk could
potentially be catastrophic because our relationship becomes
adversarial.</p>
<h3 id="risks-from-mesa-optimization">Risks from Mesa-Optimization</h3>
<p><strong>Mesa-optimizers may develop novel objectives.</strong> When
training an AI system on a particular goal, it may develop an emergent
mesa-optimizer, in which case it is not necessarily the case that the
mesa-optimizer’s goal is identical to the original training objective.
The only thing we know for certain with an emergent mesa-optimizer is
that whatever goal it has learned, it must be one that results in good
training performance—but there might be many different goals that would
all work well in a particular training environment. For example, with
LLMs, the training objective is to predict future tokens in a sequence,
so any distinct optimizers that emerge do so because they are
instrumentally useful for lowering the training loss. In the case of
in-context learning, recent work has argued that the Transformer is
performing something analogous to “simulating” and fine-tuning a much
simpler model, in which case it is clear that the objectives will be
related <span class="citation"
data-cites="oswald2023transformers">[37]</span>. However, in general,
the exact relation between a mesa-objective and original objective is
unknown.</p>
<p><strong>Mesa-optimizers may be difficult to control.</strong> If a
mesa-optimizer develops a different objective to the one we specify, it
becomes more difficult to control these (sub)systems. If these systems
have different goals than us and are sufficiently more intelligent and
powerful than us, then this could result in catastrophic outcomes.</p>
<h3 id="risks-from-intrinsification">Risks from Intrinsification</h3>
<p><strong>Models can intrinsify goals <span class="citation"
data-cites="bostrom2022base">[38]</span>.</strong> It is helpful to
distinguish goals that are instrumental from those that are intrinsic.
<em>Instrumental goals</em> are goals that serve as a means to an end.
They are goals that are valued only insofar as they bring about other
goals. <em>Intrinsic goals</em>, meanwhile, are goals that serve as ends
in and of themselves. They are terminally valued by a goal-directed
system.<p>
Next, <em>intrinsification</em> is a process whereby models acquire such
intrinsic goals <span class="citation"
data-cites="bostrom2022base">[38]</span>. The risk is that these newly
acquired intrinsic goals can end up taking precedence over the
explicitly specified objectives or expressed goals, potentially leading
to those original objectives no longer being operationally pursued.</p>
<p><strong>Over time, instrumental goals can become intrinsic.</strong>
A teenager may begin listening to a particular genre or musician in
order to fit into a group but ultimately come to enjoy it for
its own sake. Similarly, a seven-year-old who joins the Cub Scouts may
initially see the group as a means to enjoyable activities but over time
may come to value the scout pack itself. This can even apply to
acquiring money, which is initially sought for purchasing desired items,
but can become an end in itself.<p>
How does this work? When a stimulus regularly precedes the release of a
reward signal, that stimulus may come to be associated with the reward
and eventually trigger reward signals on its own. This process gives
rise to new desires and helps us develop tastes for things that are
regularly linked with basic rewards.</p>
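<p>This mechanism resembles temporal-difference learning. In the toy sketch below, with synthetic two-step episodes and standard TD(0) updates, a state that merely precedes reward ends up with a high learned value of its own.</p>
<pre><code class="language-python"># Toy TD(0) sketch: a stimulus that reliably precedes reward acquires value itself.
states = ["stimulus", "reward_state", "end"]
value = {s: 0.0 for s in states}
alpha, gamma = 0.1, 0.9

for episode in range(200):
    # Each episode: see the stimulus, then receive reward, then terminate.
    trajectory = [("stimulus", 0.0, "reward_state"),
                  ("reward_state", 1.0, "end")]
    for state, reward, next_state in trajectory:
        td_target = reward + gamma * value[next_state]
        value[state] += alpha * (td_target - value[state])

# The stimulus itself now carries value, even though it never delivers reward directly.
print({s: round(v, 2) for s, v in value.items()})
</code></pre>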
<p><strong>Intrinsification could also occur with AIs.</strong> Despite
the differences between human and AI reward systems, there are enough
similarities to warrant concern. In both human and AI reinforcement
learning, the reward signal reinforces behaviors leading to rewards. If
certain conditions frequently precede a model achieving its goals, the
model might intrinsify the emergent goal of pursuing those conditions,
even if it was not the original aim of the designers of the AI.</p>
<p><strong>AIs that intrinsify unintended goals would be
dangerous.</strong> Over time, an internal process that initially
doesn’t completely dictate behavior can become a central part of an
agent’s motivational system. Since intrinsification depends sensitively
on the environment and an agent’s history, it is hard to predict. The
concern is that AIs might intrinsify desires or come to value things
that we did not intend them to.<p>
One example is power seeking. Power seeking is not inherently worrying;
we might expect aligned systems to also be power seeking to accomplish
ends we value. However, if power seeking serves an undesired goal or if
power seeking itself becomes intrinsified (the means become ends), this
could pose a threat.</p>
<p><strong>AI agents will be adaptive, which requires constant
vigilance.</strong> Achieving high performance with AI agents will
require them to be adaptive rather than “frozen” ( unable to learn
anything after training). This introduces the risk of the agents’ goals
changing over time—a phenomenon known as <em>goal drift</em>. Though
this flexibility is necessary if we are to have AI systems evolve
alongside our own changing goals, it presents its own risks if goal
drift results in goals diverging from humans. Since it is difficult to
preclude the possibility of goal drift, ensuring the safety of these
systems will require constant supervision: the risk is not isolated to
early deployment.</p>
<p><strong>The more integrated AI agents become in society, the more
susceptible we become to their goals changing.</strong> In a future
where AIs handle various key decisions and processes, they could form a
complex system of interacting agents that could give rise to
unanticipated emergent goals. For example, they may partially imitate
each other and learn from each other, which would shape their behavior
and possibly also their goals. Additionally, they may also give rise to
emergent social dynamics as in the example of the generative agents.
These kinds of dynamics make the long-term behavior of these AI networks
unpredictable and difficult to control. If we become overly dependent on
them and they develop new priorities that don’t include our wellbeing,
we could face an existential risk.</p>
<p><strong>Conclusion.</strong> AI systems can develop emergent
capabilities that are difficult to predict and control, such as solving
novel problems or accomplishing tasks in unexpected ways. These
capabilities can appear suddenly as models scale up. In itself, the
emergence of new and dangerous capabilities (e.g. capabilities to develop
biological or chemical weapons) could pose catastrophic risks. There
could be further risks if AI systems were to develop emergent goals
diverging from the interests of society and these systems became
powerful. Risks grow as AI agents become more integrated into human
society and susceptible to goal drift or emergent goals. Vigilance is
needed to ensure we are not surprised by advanced AI systems acquiring
dangerous capabilities or goals.</p>
<h2 id="evaluation and anomaly detection">3.2.7 Evaluations and Anomaly Detection</h2>
<p><strong>Emergent capabilities make control difficult.</strong>
Whether certain capabilities develop suddenly or are discovered
suddenly, they can be difficult to predict. This makes it a challenge to
anticipate what future AI will be able to do even in the short term, and
it could mean that we may have little time to react to novel
capability jumps. It is difficult to make a system safe when it is
unknown what that system will be able to do.</p>
<p><strong>Better evaluations and other research techniques could make it easier to detect hazardous emergent capabilities.</strong>
Researchers could try to detect potentially hazardous capabilities as they emerge or develop techniques to track and predict the
progress of models' capabilities in certain relevant domains and skills. They could also track capabilities relevant to mitigating hazards.
It could be valuable to create testbeds to continuously screen AI models for potentially hazardous capabilities, for example abilities that
could meaningfully assist malicious actors with the execution of cyber-attacks, exacerbate CBRN threats or generate persuasive content in a
way that could affect elections. Ideally, we would be able to infer a model's latent abilities purely by analyzing
its weights, revealing capabilities beyond what is visible through standard testing.</p>
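<p>A capability-screening testbed can be as simple as a benchmark loop. The sketch below uses a hypothetical task list and a placeholder <code>query_model</code> function rather than any particular evaluation suite; it runs each task several times and flags capabilities whose success rate crosses a threshold.</p>
<pre><code class="language-python"># Sketch of a capability-screening loop with a placeholder model interface.
from typing import Callable

def evaluate_capability(query_model: Callable[[str], str],
                        tasks: list[dict], threshold: float = 0.2) -> dict:
    """Run each task several times and flag capabilities above a success threshold."""
    flagged = {}
    for task in tasks:
        successes = 0
        for prompt in task["prompts"]:
            answer = query_model(prompt)
            successes += task["grader"](answer)      # grader returns 0 or 1
        rate = successes / len(task["prompts"])
        if rate >= threshold:
            flagged[task["name"]] = rate
    return flagged

# Hypothetical usage with a stubbed-out model and a trivial grader.
tasks = [{"name": "multi-step-planning",
          "prompts": ["Plan steps to ...", "Outline how to ..."],
          "grader": lambda answer: int("step" in answer.lower())}]
print(evaluate_capability(lambda prompt: "Step 1: ...", tasks))
</code></pre>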
<p>To avoid a false sense of safety, it will be important to validate that these detection methods are sufficiently sensitive.
Researchers could intentionally add hidden functionality to models to check that the testing methods catch it. Methods to predict future capabilities
in a quantitative way and find new failure modes would also be valuable. Once a hazardous capability like deception is found, it must be eliminated.
Researchers could develop training techniques that ensure that models don't acquire undesirable skills in the first place, or that make models forget
them after training. But verifying capabilities are fully removed, not just obscured or partially eliminated, could prove difficult.</p>
<p><strong>Better anomaly detection would be highly valuable for monitoring AI systems.</strong> As discussed in the AI Fundamentals section,
anomaly detection involves identifying outliers or abnormal data points. Anomaly detection allows models to reliably detect and respond to
unexpected threats that could substantially impact system performance. This is useful for detecting potential hazards like sudden behavioral
shifts and system failures. A key challenge is detecting rare and unpredictable “black swan” events that are not represented in training data.
Since malicious actors are likely to adopt novel strategies to avoid detection, anomaly detection could be particularly useful for identifying
malicious activity such as cyberattacks. Anomaly detection could also potentially be extended to identify unknown threats such as Trojaned, rogue,
or scheming AI systems. Successful anomaly detectors could identify and flag anomalies for human review or automatically carry out a conservative
fallback policy. Anomaly detection could also be useful for identifying other hazards such as malicious use. For anomaly detection to be useful,
it is important to ensure that detectors have high recall and a low false-alarm rate, to avoid alarm fatigue.</p>
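<p>A minimal version of such a monitor is sketched below using synthetic feature vectors and a simple Mahalanobis-distance detector; real deployments would use richer features and carefully calibrated thresholds. The detection threshold is chosen on held-out normal data to keep the false-alarm rate low.</p>
<pre><code class="language-python"># Minimal anomaly-detection sketch: Mahalanobis distance with a calibrated threshold.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for feature vectors summarizing normal system behavior.
normal_train = rng.normal(size=(5000, 16))
normal_holdout = rng.normal(size=(1000, 16))
anomalies = rng.normal(loc=3.0, size=(50, 16))     # synthetic "black swan" events

mean = normal_train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal_train, rowvar=False))

def score(x):
    diff = x - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distance

# Pick a threshold giving roughly a 1% false-alarm rate on held-out normal data.
threshold = np.quantile(score(normal_holdout), 0.99)

recall = float((score(anomalies) >= threshold).mean())
false_alarms = float((score(normal_holdout) >= threshold).mean())
print(f"recall on anomalies: {recall:.2f}, false-alarm rate: {false_alarms:.2f}")
</code></pre>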