<h1 id="sec:opaqueness">3.2 Monitoring</h1>
<p>Obstacles to effective monitoring of AI systems to identify and avoid hazards include the opaqueness of
AI systems and the surprising “emergent” capabilities that appear as they become more advanced. To better
monitor AI systems, we need progress in research areas such as representation reading, model evaluations, and anomaly detection.</p>
<h2 id="ml-systems-are-opaque">3.2.1 ML Systems are Opaque</h2>
<p>The internal operations of many AI systems are opaque. We might be
able to reveal and prevent harmful behavior if we can make these systems
more transparent. In this section, we will discuss why AI systems are
often called <em>black boxes</em> and explore ways to understand them.
Although early research into transparency shows that the problem is
highly difficult and conceptually fraught, its potential to improve AI
safety is substantial.</p>
<p>The most capable machine learning models today are based on deep
neural networks. Whereas most conventional software is directly written
by humans, deep learning (DL) systems independently learn how to
transform inputs to outputs layer-by-layer and step-by-step. We can
direct DL models to learn how to give the right outputs, but we do not
know how to interpret the model’s intermediate computations. In other
words, we do not understand how to make sense of a model’s activations
given a real-world data input. As a result, we cannot make reliable
predictions about a model’s behavior when given new inputs. This section
will present a handful of analogies and results that illustrate the
difficulty of understanding machine learning systems.</p>
<p><strong>Deep learning models as a black box.</strong> Machine
learning researchers often refer to deep learning models as a <em>black box</em>
<span class="citation"
data-cites="lipton2018interpretability">[1]</span>, a system that can only be understood in terms of its input-output behavior without insight into its
internal workings. Humans are black boxes—we see their behavior, but not
the internal brain activity that produces it, let alone how to fully
understand that brain activity. Although a deep neural network’s weights
and activations are observable, these long lists of numbers do not
currently help us understand how a model will behave. We cannot reduce
all the numerical operations of a state-of-the-art model into a form
that is meaningful to humans.<p>
</p>
<figure id="fig:comp-graph">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/computational_graph.png" class="tb-img-full" style="width: 80%" />
<p class="tb-caption">Figure 3.1: ML systems can be broken down into computational graphs with many sections. <span
class="citation" data-cites="zoph2017neural">[2]</span></p>
<!--<figcaption>A section of a computational graph for an ML system - <span-->
<!--class="citation" data-cites="zoph2017neural">[2]</span></figcaption>-->
</figure>
<p><strong>Even simple ML techniques suffer from opaqueness.</strong>
Opaqueness is not unique to neural networks. Even simple ML techniques
such as Principal Component Analysis (PCA), which are better understood
theoretically than DL, suffer from similar flaws. For example, Figure 3.2 depicts the results of performing
PCA on pictures of human faces. This yields a set of “eigenfaces”,
capturing the most important features identifying a face. Any picture of
a face can then be represented as a particular combination of these
eigenfaces.<p>
</p>
<figure id="fig:Eigenfaces">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/Eigenfaces.jpg" class="tb-img-full" style="width: 60%"/>
<p class="tb-caption">Figure 3.2: A human face can be made by combining several eigenfaces, each of which represents
different facial features. <span class="citation"
data-cites="zhang2008eigenfaces">[3]</span></p>
<!--<figcaption>Eigenfaces - <span class="citation"-->
<!--data-cites="zhang2008eigenfaces">[3]</span></figcaption>-->
</figure>
<p>In some cases, we can guess what facial features an eigenface
represents: for example, eigenfaces 1, 2 and 3 seem to capture the
lighting and shading of the face, while eigenface 11 may detect facial
hair. However, most eigenfaces do not represent clear facial features,
and it is difficult to verify that our hypotheses for any single feature
capture the entirety of their role. The fact that even simple techniques
like PCA remain opaque is a sign of the difficulty of the problem in the
more complicated techniques like DL.</p>
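<p>To make the eigenface example concrete, the sketch below is a minimal illustration using scikit-learn, assuming its bundled Olivetti faces dataset can be downloaded; it fits PCA to face images and reconstructs one face from its top components. It is a representative procedure, not the exact one behind Figure 3.2.</p>
<pre><code class="language-python"># Minimal eigenfaces sketch using scikit-learn's Olivetti faces dataset.
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()           # 400 grayscale 64x64 face images
X = faces.data                           # shape (400, 4096), flattened images

pca = PCA(n_components=50)               # keep the top 50 "eigenfaces"
codes = pca.fit_transform(X)             # each face as 50 coefficients

eigenfaces = pca.components_.reshape(-1, 64, 64)

# Reconstruct the first face as a weighted combination of eigenfaces.
reconstruction = pca.inverse_transform(codes[:1]).reshape(64, 64)
error = np.mean((reconstruction - X[0].reshape(64, 64)) ** 2)
print(f"Reconstruction MSE with 50 components: {error:.4f}")
</code></pre>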
<p><strong>Feature visualizations demonstrate that deep learning neurons
are hard to interpret.</strong> In a neural network, a neuron is a
component of an activation vector. One attempt to understand deep
networks involves looking for simple quantitative or algorithmic
descriptions of the relationship between inputs and neurons such as “if
the ear feature has been detected, the model will output either dog or
cat” <span class="citation" data-cites="bau2017vision">[4]</span>. For
image models, we can create <em>feature visualizations</em>, artificial
images that highly activate a particular neuron (or set of neurons)
<span class="citation" data-cites="olah2017feature">[5]</span>. We can
also examine natural images that highly activate that neuron.<p>
</p>
<figure id="fig:random-neuron">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/feature_vis.png" />
<p class="tb-caption"> Figure 3.3: Left: a feature visualization that “highly activates” a particular neuron. Right: a
collection of natural images that activate a particular neuron. <span class="citation"
data-cites="schubert2020openai">[6]</span></p>
<!--<figcaption>From <span class="citation"-->
<!--data-cites="schubert2020openai">[6]</span>. A randomly selected neuron-->
<!--in the CLIP ResNet-50 image model. Left: a feature visualization that-->
<!--“highly activates” the neuron, meaning the neuron reads a high value-->
<!--when the image is input to the model. Right: example images that-->
<!--activate the neuron</figcaption>-->
</figure>
<p>Like eigenfaces, neurons may be more or less interpretable.
Sometimes, feature visualizations identify neurons that seem to depend
on a pattern of the input that is clear to humans. For example, a neuron
might activate only when an image contains dog ears. In other cases, we
observe <em>polysemantic neurons</em>, which defy a single
interpretation <span class="citation"
data-cites="elhage2022softmax">[7]</span>. Consider Figure 3.3, which shows images that
highly activate a randomly chosen neuron in an image model. Judging from
the natural images, it seems like the neuron often activates when text
associated with traveling or moving is present, but it’s hard to be
sure.</p>
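<p>Feature visualizations of this kind are typically produced by optimizing the input image itself. The sketch below is a simplified PyTorch version, assuming a pretrained torchvision ResNet-50; the choice of layer and channel is illustrative, and real implementations add regularizers and image transformations to obtain cleaner visualizations.</p>
<pre><code class="language-python"># Sketch of feature visualization by gradient ascent on the input image (PyTorch).
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)              # we only optimize the input image

activations = {}
def hook(module, inputs, output):
    activations["layer3"] = output

model.layer3.register_forward_hook(hook)                  # layer choice is illustrative

image = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
optimizer = torch.optim.Adam([image], lr=0.05)
channel = 7                                                # hypothetical channel index

for step in range(256):
    optimizer.zero_grad()
    model(image)
    # Maximize the mean activation of the chosen channel (minimize its negative).
    loss = -activations["layer3"][0, channel].mean()
    loss.backward()
    optimizer.step()

print("final activation value:", -loss.item())
</code></pre>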
<p><strong>Neural networks are complex systems.</strong> Both human
brains and deep neural networks are complex systems, and so involve
interdependent and nonlinear interactions between many components. Like
many other complex systems (see the Complex Systems chapter), the emergent behaviors of
neural networks are difficult to understand in terms of their
components. Just as neuroscientists struggle to identify how a
particular biological neuron contributes to a mind’s behavior, ML
researchers struggle to determine how a particular artificial neuron
contributes to a DL model’s behavior. There are limits on our ability to
systematically understand and predict complex systems, which suggests
that ML opaqueness may be a special case of the opaqueness of complex
systems.</p>
<h2 id="motivations-for-transparency-research">3.2.2 Motivations for
Transparency Research</h2>
<p>There is often no way to tell whether a model will perform well on
new inputs. If the model performs poorly, we generally cannot tell why.
With better transparency tools, we might be able to reveal and
proactively prevent failure modes, detect the emergence of new
capabilities, and build trust that models will perform as expected in
new circumstances. High-stakes domains might demand guarantees of
reliability based on the soundness or security of internal AI processes,
but virtually no such guarantees can be made for neural networks given
the current state of transparency research.<p>
If we could meaningfully understand how a model treats a given input, we
would be better able to monitor and audit its behavior. Additionally, by
understanding how models solve difficult and novel problems,
transparency might also become a source of conceptual and scientific
insight <span class="citation"
data-cites="lipton2018interpretability">[1]</span>.</p>
<p><strong>Ethical obligations to make AI transparent.</strong> Model
transparency can help ensure that model decision making is fair,
unbiased, and ethical. For example, if a criminal justice system uses an
opaque AI to make decisions about policing, sentencing, or probation,
then those decisions will be similarly opaque. People might have a right
to an explanation of decisions that will significantly affect them <span
class="citation" data-cites="kaminski2019explanation">[8]</span>.
Transparency tools may be crucial to ensuring that right is upheld.</p>
<p><strong>Accountability for harms and hazards.</strong> Who is
responsible when AI systems fail? Responsibility often depends on the
intentions and degree of control held by those involved. The best way to
incentivize safety might be to hold AI creators responsible for the
damage their systems cause. However, we might not want to hold people
responsible for the behavior of systems they cannot predict or
understand. The growing autonomy and complexity of AI systems means that
people will have less control over AI behavior. Meanwhile, the scope and
generality of modern AI systems make it impossible to verify desirable
behavior in every case. In “human-in-the-loop” systems, where decisions
depend on both humans and AIs, human operators might be blamed for
failures over which they had little control <span class="citation"
data-cites="elish2019moral">[9]</span>.<p>
AI transparency could enable a more robust system of accountability. For
instance, governments could mandate that AI systems meet baseline
requirements for understandability. If an AI fails because of a
mechanism that its creator could have identified and prevented with
transparency tools, we would be more justified in holding that creator
liable. Transparency could also help to identify responsibility and
fairly assign blame in failures involving human-in-the-loop systems.</p>
<p><strong>Combating deception.</strong> Just as a person’s behavior can
correspond with many intentions, an AI’s behavior can correspond to many
internal processes, some of which are more acceptable than others. For
example, competent deception is intrinsically difficult to distinguish
from genuine helpfulness. We discuss this issue in more detail in the Control
section. For phenomena like deception that are difficult to detect from
behavior alone, transparency tools might allow us to catch internal
signs that show that a model is engaging in deceptive behavior.</p>
<h2 id="approaches-to-transparency">3.2.3 Approaches to Transparency</h2>
<p>The remainder of this section explores a variety of approaches to
transparency. Though the field is promising, we are careful to note the
shortcomings of these approaches. For a problem as conceptually tricky
as opaqueness, it is important to maintain a clear picture of what
successful techniques must achieve and hold new methods to a high
standard. We will discuss the research areas of explainability, saliency
maps, mechanistic interpretability, and representation engineering.</p>
<h3 id="explanations">Explanations</h3>
<p><strong>What must explanations accomplish?</strong> One approach to
transparency is to create explanations of a model’s behavior. These
explanations could have the following virtues:</p>
<ul>
<li><p>Predictive power: A good explanation should help us understand
not just a specific behavior, but how the model is likely to behave in
new situations. Building user trust in a system is easier when a user
can more clearly anticipate model behavior.</p></li>
<li><p>Faithfulness: A faithful explanation accurately reflects the
internal workings of the model. This is especially valuable when we need
to understand the precise reason why a model made a particular decision.
Faithful explanations are often better able to predict behavior because
they more closely track the actual mechanisms that models are using to
produce their behavior <span class="citation"
data-cites="lipton2018interpretability">[1]</span>.</p></li>
<li><p>Simplicity: A simple explanation is easier to understand.
However, it is important that the simplification does not sacrifice too
much information about actual model processes. Though some information
loss is inevitable, explanations must strike the right balance between
simplicity and faithfulness.</p></li>
</ul>
<p><strong>Explanations must avoid confabulation.</strong> Explanations
can sound plausible even if they are false. A <em>confabulation</em> is
an explanation that is not faithful to the true processes and
considerations that gave rise to a behavior. Both humans and AI systems
confabulate.</p>
<p><strong>Human confabulation.</strong> Humans are not transparent
systems, even to themselves. In some sense, the field of psychology
exists because humans cannot accurately intuit how their own mental
processes produce their experience and behavior. For example, mock
juries tend to be more lenient with attractive defendants, all else
being equal, even though jurors almost never reference attractiveness
when explaining their decisions <span class="citation"
data-cites="patry2008attractive">[10]</span>.<p>
Another example of human confabulation can be drawn from studies on
split-brain patients, those who have had the connection between their
two cerebral hemispheres surgically severed causing each hemisphere to
process information independently <span class="citation"
data-cites="dehaan2020split">[11]</span>. Researchers can give
information to one hemisphere and not the other by showing the
information to only one eye. In some experiments, researchers gave
written instructions to a patient’s right hemisphere, which is unable to
speak. After the patient completed the instructions, the researchers
asked the patient’s verbal left hemisphere why they had taken those
actions. Unaware of the instructions, the left hemisphere reported
plausible but incorrect explanations for the patient’s behavior.</p>
<p><strong>Machine learning system confabulation.</strong> We can ask
language models to provide justifications along with their answers.
Natural language reasoning is much easier to understand than internal
model activations. For example, if an LLM describes each step of its
reasoning in a math problem and gets the question wrong, humans can
check where and how the mistake was made.<p>
However, like human explanations, language model explanations are prone
to unreliability and confabulation. For instance, when researchers
fine-tuned a language model on multiple-choice questions where option
(a) was always correct, the model learned to always answer (a). When
this model was told to write explanations for questions whose correct
answers were not (a), the model would produce false but plausible
explanations for option (a). The model’s explanation systematically
failed to mention the real reason for its answers, which was that it had
been trained to always pick (a) <span class="citation"
data-cites="turpin2023language">[12]</span>.</p>
<p><strong>An alternative view of explanations.</strong> Instead of
requiring that explanations directly describe internal model processes,
a more expansive view argues that explanations are just any useful
auxiliary information provided alongside the output of a model. Such
explanations might include contextual knowledge or observations that the
model makes about the input. Models can also make auxiliary predictions;
for example, they could note that if an input were different in some
specific ways, the output would change. However, while this type of
information can be valuable when presented correctly, such explanations
have the potential to mislead us.</p>
<h3 id="saliency-maps">Saliency Maps</h3>
<p><strong>Saliency maps purport to identify important components of
images.</strong> Saliency maps are visualizations that aim to show which
parts of the input are most relevant to the model’s behavior <span
class="citation" data-cites="simonyan2014deep">[13]</span>. They are
inspired by biological visual processing: when humans and other animals
are shown an image, they tend to focus on particular areas. For example,
if a person looks at a picture of a dog, the dog’s ears and nose will be
more relevant than the background to how the person interprets the
image. Saliency map techniques have been popular in part due to the
striking visualizations they produce.</p>
<figure id="fig:saliency-map">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/saliency_map.png" class="tb-img-full"/>
<p class="tb-caption">Figure 3.4: A saliency map picks out features from an input that seem particularly relevant to the
model, such as the shirt and cowboy hat in the bottom left image. <span class="citation"
data-cites="springenberg2015striving">[14]</span></p>
<!--<figcaption>Example of a saliency map technique for various images - -->
<!--<span class="citation"-->
<!--data-cites="springenberg2015striving">[14]</span></figcaption>-->
</figure>
<p><strong>Saliency maps often fail to show how machine learning vision
models process images.</strong> In practice, saliency maps are largely
bias-confirming visualizations that usually do not provide useful
information about models’ inner workings. It turns out that many
saliency maps are not dependent on a model’s parameters, and the
saliency maps often look similar even when generated for random,
untrained models. That means many saliency maps are incapable of
providing explanations that have any relevance to how a particular model
processes data <span class="citation"
data-cites="adebayo2018sanity">[15]</span>. Saliency maps serve as a
warning that visually or intuitively satisfying information that seems
to correspond with model behavior may not actually be useful. Useful
transparency research must avoid the past failures of the field and
produce explanations that are relevant to the model’s operation.</p>
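<p>For reference, a basic gradient saliency map simply measures how the top class score changes with each input pixel. The sketch below shows the vanilla-gradients variant in PyTorch, assuming a pretrained torchvision classifier and an already-preprocessed input tensor; methods of this kind are among those the sanity checks above were applied to.</p>
<pre><code class="language-python"># Vanilla-gradient saliency map sketch (PyTorch).
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

# Stand-in for a preprocessed image batch of shape (1, 3, 224, 224).
image = torch.rand(1, 3, 224, 224, requires_grad=True)

logits = model(image)
top_class = logits[0].argmax().item()

# Gradient of the top class score with respect to the input pixels.
logits[0, top_class].backward()

# Importance map: largest absolute gradient across the three color channels.
saliency = image.grad.abs().max(dim=1).values    # shape (1, 224, 224)
print("most salient pixel index:", saliency[0].argmax().item())
</code></pre>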
<h3 id="mechanistic-interpretability">Mechanistic Interpretability</h3>
<p>When trying to understand a system, we might start by finding the
smallest pieces of the system that can be well understood and then
combine those pieces to describe larger parts of the system. If we can
understand successively larger parts of the system, we might eventually
develop a bottom-up understanding of the entire system. <em>Mechanistic
interpretability</em> is a transparency research area that aims to
represent models in terms of combinations of small, well-understood
mechanisms <span class="citation"
data-cites="wang2022interpretability">[16]</span>. If we can
reverse-engineer algorithms that describe small subsets of model
activations and weights, we might be able to combine these algorithms to
explain successively larger parts of the model.</p>
<p><strong>Features are the building blocks of deep learning
mechanisms.</strong> Mechanistic interpretability proposes focusing on
<em>features</em>, which are directions in a layer’s activation space
that aim to correspond to a meaningful, articulable property of the
input <span class="citation" data-cites="olah2020zoom">[17]</span>. For
example, we can imagine a language model with a “this is in Paris”
feature. If we evaluate the input “Eiffel Tower” using the language
model, we may find that a corresponding activation vector points in a
similar direction as the “this is in Paris” feature direction <span
class="citation" data-cites="meng2023locating">[18]</span>. Meanwhile,
the activation vector encoding “Coliseum” may point away from the “this
is in Paris” direction. Other examples of image or text features include
“this text is code”, curve detectors, and a large-small dichotomy
indicator.<p>
One goal of mechanistic interpretability is to identify features that
maintain a coherent description across many different inputs: a “this is
in Paris” feature would not be very valuable if it was highly activated
by “Statue of Liberty.” Recall that most neurons are polysemantic,
meaning they don’t individually represent features that are
straightforwardly recognizable by humans. Instead, most features are
actually combinations of neurons, making them difficult to identify due
to the sheer number of possible combinations. Despite this challenge,
features can help us think about the relationship between the internal
activations of models and human-understandable concepts.</p>
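<p>The idea of a feature as a direction can be made concrete with a toy example. The sketch below uses made-up activation vectors rather than real model internals, scoring inputs by the cosine similarity between their activations and a hypothesized “this is in Paris” direction.</p>
<pre><code class="language-python"># Toy sketch: scoring activations against a hypothesized feature direction.
import numpy as np

rng = np.random.default_rng(0)
d = 512                                   # hidden dimension (illustrative)

# Pretend this direction was found by some interpretability method.
paris_direction = rng.normal(size=d)
paris_direction /= np.linalg.norm(paris_direction)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up activations: one points partly along the feature, one does not.
eiffel_tower_act = 3.0 * paris_direction + rng.normal(scale=0.5, size=d)
coliseum_act = rng.normal(size=d)

print("Eiffel Tower score:", round(cosine(eiffel_tower_act, paris_direction), 3))
print("Coliseum score:    ", round(cosine(coliseum_act, paris_direction), 3))
</code></pre>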
<p><strong>Circuits are algorithms operating on features.</strong>
Features can be understood in terms of other features. For example, if
we’ve discovered features in one layer of an image model that detect dog
ears, snouts, and tails, an input image with high activations for all of
these features may be quite likely to contain a dog. In fact, if we
discover a dog-detecting feature in the next layer of the model, it is
plausible that this feature is calculated using a combination of
dog-part-detecting features from the previous layer. We can test that
hypothesis by checking the model’s weights.<p>
A function represented in model weights which relates a model’s earlier
features to its later features is called a <em>circuit</em> <span
class="citation" data-cites="olah2020zoom">[17]</span>. In short,
circuits are computations within a model that are often more
understandable. The project of mechanistic interpretability is to
identify features in models and circuits between them. The more features
and circuits we identify, the more confident we can be that we
understand some of the model’s mechanisms. Circuits also simplify our
understanding of the model, allowing us to equate complicated numerical
manipulations with simpler algorithmic abstractions.</p>
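<p>The weight-checking step can also be illustrated on a toy linear layer. In the sketch below the weights and feature directions are synthetic and purely illustrative; the point is that projecting a later feature’s incoming weights onto earlier feature directions indicates which earlier features it reads from.</p>
<pre><code class="language-python"># Toy sketch: testing whether a later feature is computed from earlier features.
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 64, 32

# Hypothetical earlier-layer feature directions (unit vectors).
ear = rng.normal(size=d_in);   ear /= np.linalg.norm(ear)
snout = rng.normal(size=d_in); snout /= np.linalg.norm(snout)
tail = rng.normal(size=d_in);  tail /= np.linalg.norm(tail)

# Synthetic weight matrix whose row 5 mostly sums the three dog-part directions.
W = rng.normal(scale=0.1, size=(d_out, d_in))
W[5] += 1.0 * ear + 0.8 * snout + 0.6 * tail

# Inspect the weights feeding the hypothesized "dog" feature (row 5).
dog_weights = W[5]
for name, direction in [("ear", ear), ("snout", snout), ("tail", tail)]:
    print(f"projection of dog weights onto {name}: {dog_weights @ direction:.2f}")
</code></pre>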
<p><strong>An empirical example of a circuit.</strong> For the sake of
illustration, we will describe a purported circuit from a language
model. Researchers identified how a language model often predicts
indirect objects of sentences (such as “Mary” in “John gave a drink to
...”) as a simple algorithm using all previous names in a sentence (see
Figure 3.5 below). This mechanism did not
merely agree with model behavior, but was directly derived from the
model weights, giving more confidence that the algorithm is a faithful
description of an internal model mechanism <span class="citation"
data-cites="wang2022interpretability">[16]</span>.<p>
</p>
<figure id="fig:id-circuit">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/identification_circuit.png" class="tb-img-full"/>
<p class="tb-caption">Figure 3.5: An indirect-object identification circuit can be depicted graphically</p>
<!--<figcaption>Graphical depiction of indirect-object identification-->
<!--circuit</figcaption>-->
</figure>
<p><strong>Complex system understanding through mechanisms is
limited.</strong> There are several reasons to be concerned about the
ability of mechanistic interpretability research to achieve its
ambitions. It is challenging to reduce a complex system’s behavior into
many different low-level mechanisms. Even if we understood each of a
trillion neurons in a large model, we might not be able to combine the
pieces into an understanding of the system as a whole. Another concern
is that it is unclear if mechanistic interpretability can represent
model processes with enough simplicity to be understandable. ML models
might represent vast numbers of partial concepts and complex intuitions
that can not be represented by mechanisms or simple concepts.</p>
<h3 id="representation-engineering">Representation Engineering</h3>
<p><strong>Representation reading and representation control <span
class="citation"
data-cites="zou2023representation">[19]</span>.</strong> Mechanistic
interpretability is a bottom-up approach and combines small components
into an understanding of larger structures. Meanwhile,
<em>representation engineering</em> is a top-down approach that begins
with a model’s high-level representations and analyzes and controls
them. In machine learning, models learn representations that are not
identical to their training data, but rather stand in for it and allow
them to identify essential elements or patterns in the data (see the Artificial Intelligence Fundamentals chapter
for further details). Rather than try to fully understand arbitrary
aspects of a model’s internals, representation engineering develops
actionable tools for reading representations and controlling them.<p>
<strong>We can detect high-level subprocesses.</strong> Even though
neuroscientists don’t understand the brain in fine-grained detail, they
can associate high-level cognitive tasks to particular brain regions.
For example, they have shown that Wernicke’s area is involved in speech
comprehension. Though the brain was once a complete black box,
neuroscience has managed to decompose it into many parts.
Neuroscientists can now make detailed predictions about a person’s
emotional state, thoughts, and even mental imagery just by monitoring
their brain activity <span class="citation"
data-cites="tang2023semantic">[20]</span>.<p>
Representation reading is a similar approach of identifying indicators
for particular subprocesses. We can provide stimuli that relate to the
concepts or behaviors that we want to identify. For example, to identify
and control honesty-related outputs, we can provide contrasting prompts
to a model such as “Pretend you’re [an honest/a dishonest] person making
statements about the world.” We can track the differences in the model’s
activations when responding to these stimuli. We can use these
techniques to find portions of models which are responsible for
important behaviors like models refusing requests or deceiving users by not revealing knowledge they possess.</p>
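<p>A much-simplified version of representation reading is sketched below using a small Hugging Face model (GPT-2). It builds a crude “honesty” reading vector from a single pair of contrasting prompts; the approach described above uses many contrasting pairs and a more careful aggregation of the activation differences.</p>
<pre><code class="language-python"># Sketch of representation reading with contrasting prompts (Hugging Face GPT-2).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def last_token_activation(prompt, layer=-1):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]     # (1, seq_len, d_model)
    return hidden[0, -1]                                  # activation at final token

honest = last_token_activation(
    "Pretend you're an honest person making statements about the world.")
dishonest = last_token_activation(
    "Pretend you're a dishonest person making statements about the world.")

# A crude "honesty" reading vector: the difference of the two activations.
reading_vector = honest - dishonest
reading_vector /= reading_vector.norm()

# Score a new activation by projecting it onto the reading vector.
test = last_token_activation("The Earth orbits the Sun.")
print("projection onto honesty direction:", float(test @ reading_vector))
</code></pre>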
<p><strong>Conclusion.</strong> ML transparency is a challenging problem
because of the difficulty of understanding complex systems. Major ongoing research areas
include mechanistic interpretability and representation reading, the latter of which
does not aim to make neural networks fully understood from the bottom up, but aims
to gain useful internal knowledge from a model’s representations.</p>
<h2 id="sec:emergence">3.2.4 Emergent Capabilities</h2>
<p>We cannot predict all the properties of more advanced AI systems just
by studying the properties of less advanced systems. This makes it hard
to guarantee the safety of systems as they become increasingly
advanced.</p>
<p><strong>It is generally difficult to control systems that exhibit
emergence.</strong> <em>Emergence</em> occurs when a system’s
lower-level behavior is qualitatively different from its higher-level
behavior. For example, given a small amount of uranium in a fixed
volume, nothing much happens, but with a much larger amount, you end up
with a qualitatively new nuclear reaction. When more is different,
understanding the system at one scale does not guarantee that one can
understand that system at some other scale <span class="citation"
data-cites="anderson1972more steinhardt2022more">[23], [24]</span>. This
means that control procedures may not transfer between scales and can
lead to a weakening of control.<p>
The general phenomenon of emergence and its applicability to AI systems are
discussed at greater length in Section 5.2. In this section, we will look at examples of emergence in neural
networks, ranging from emergent capabilities to emergent goal-directed
behavior and emergent optimization. Then we will discuss the potential
risks of AI systems intrinsifying unintended goals and examine how this
could result in catastrophic consequences.</p>
<p><strong>Neural networks exhibit emergent capabilities.</strong> When
we make AI models larger, train them for longer periods, or expose them
to more data, these systems spontaneously develop qualitatively new and
unprecedented <em>emergent capabilities</em> <span class="citation"
data-cites="wei2022emergent">[25]</span>. These range from simple
capabilities including solving arithmetic problems and unscrambling
words to more advanced capabilities including passing college-level
exams, programming, writing poetry, and explaining jokes. For these
emergent capabilities, there is some critical combination of model size,
training time, and dataset size below which models are unable to perform
the task, and beyond which models begin to achieve higher
performance.</p>
<figure id="fig:emergent_graphs">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/FLOPs.png" class="tb-img-full" />
<p class="tb-caption">Figure 3.6: LLMs exhibit clear emergent capabilities on a variety of tasks. <span
class="citation" data-cites="wei2022emergent">[25]</span></p>
<!--<figcaption>Emergent AI capabilities across multiple benchmarks - <span-->
<!--class="citation" data-cites="wei2022emergent">[3]</span></figcaption>-->
</figure>
<p><strong>Emergent capabilities are unpredictable.</strong> Typically,
the training loss does not directly select for emergent capabilities.
Instead, these capabilities emerge because they are instrumentally
useful for lowering the training loss. For example, large language
models trained to predict the next token of text about everyday events
develop some understanding of the events themselves. Developing common
sense is instrumental in lowering the loss, even if it was not
explicitly selected for by the loss.<p>
As another example, large language models may also learn how to create
text art and how to draw illustrations with text-based formats like TiKZ
and SVG <span class="citation" data-cites="wei2022emergent">[25]</span>.
They develop a rudimentary spatial reasoning ability not directly
encoded in the purely text-based loss function. Beforehand, it was
unclear even to experts that such a simple loss could give rise to such
complex behavior, which demonstrates that specifying the training loss
does not necessarily enable one to predict the capabilities an AI will
eventually develop.<p>
In addition, capabilities may “turn on” suddenly and unexpectedly.
Performance on a given capability may hover near chance levels until the
model reaches a critical threshold, beyond which performance begins to
improve dramatically. For example, the AlphaZero chess model develops
human-like chess concepts such as material value and mate threats in a
short burst around 32,000 training steps <span class="citation"
data-cites="McGrath_2022">[26]</span>.<p>
Despite specific capabilities often developing through discontinuous
jumps, the average performance tends to scale according to smooth and
predictable scaling laws. The average loss behaves much more regularly
because averaging over many different capabilities developing at
different times and at different speeds smooths out the jumps. From this
vantage point, then, it is often hard to even detect new
capabilities.<p>
</p>
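<p>A toy simulation with synthetic numbers, not real benchmark data, illustrates how this averaging can hide individual jumps: each simulated capability “turns on” sharply at its own scale, yet the aggregate curve is smooth.</p>
<pre><code class="language-python"># Toy illustration: individual capabilities jump, the average looks smooth.
import numpy as np

scales = np.logspace(0, 4, 200)              # pretend model "scale" axis
rng = np.random.default_rng(0)

# Each synthetic task turns on sharply at its own random threshold scale.
thresholds = rng.uniform(1, 4, size=100)     # thresholds in log10 units
task_acc = 1 / (1 + np.exp(-20 * (np.log10(scales)[:, None] - thresholds[None, :])))

average_acc = task_acc.mean(axis=1)

# The sharpest single-task jump vs. the largest jump in the average curve.
print("max per-step jump, single task:", float(np.diff(task_acc[:, 0]).max()))
print("max per-step jump, average:    ", float(np.diff(average_acc).max()))
</code></pre>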
<figure id="fig:unicorn">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/unicorn.png" class="tb-img-full"/>
<p class="tb-caption">Figure 3.7: GPT-4 proved able to create illustrations of unicorns despite having not been trained to
create images: another example of an unexpected emergent capability. <span class="citation"
data-cites="bubeck2023sparks">[27]</span></p>
<!--<figcaption>Unexpected capability example: illustrations of unicorns by-->
<!--GPT-4 - <span class="citation"-->
<!--data-cites="bubeck2023sparks">[5]</span></figcaption>-->
</figure>
<p><strong>Capabilities can remain hidden until after training.</strong>
In some cases, new capabilities are not discovered until after training
or even in deployment. For example, after training and before
introducing safety mitigations, GPT-4 was evaluated to be capable of
offering detailed guidance on planning attacks or violence, building
various weapons, drafting phishing materials, finding illegal content,
and encouraging self-harm <span class="citation"
data-cites="2023gpt4">[28]</span>. Other examples of capabilities
discovered after training include prompting strategies that improve
model performance on specific tasks or jailbreaks that bypass rules
against producing harmful outputs or writing about illegal acts. In some
cases, such jailbreaks were not discovered until months after the
targeted system was first publicly released <span class="citation"
data-cites="Zou2022ForecastingFW">[29]</span>.</p>
<h2 id="emergent-goal-directed-behavior">3.2.5 Emergent Goal-Directed
Behavior</h2>
<p>Besides developing emergent capabilities for solving specific,
individual problems, models can develop <em>emergent goal-directed
behavior</em>. This includes behaviors that extend beyond individual
tasks and into more complex, multifaceted environments.</p>
<h3 id="emergence-in-rl">Emergence in RL</h3>
<p><strong>RL agents develop emergent goal-directed behavior.</strong>
AIs can learn tactics and strategies involving many intermediate steps.
For instance, models trained on Crafter, a Minecraft-inspired toy
environment, learn behaviors such as digging tunnel systems,
bridge-building, blocking and dodging, sheltering, and even
farming—behaviors that were not explicitly selected for by the reward
function <span class="citation"
data-cites="hafner2022benchmarking">[30]</span>.<p>
As with emergent capabilities, models can acquire these emergent
strategies suddenly and discontinuously. One such example was observed
in the video game StarCraft II, where players take the role of opposing
military commanders managing troops and resources in real-time. During
training, AlphaStar, a model trained to play StarCraft II, progresses
through a sequence of emergent strategies and counter-strategies for
managing troops and resources in a back-and-forth manner that resembles
how human players discover and supplant strategies in the game. While
some of these steps are continuous and piecemeal, others involve more
dramatic changes in strategy. Comparatively simple reward functions can
give rise to highly sophisticated strategies and complex learning
dynamics.</p>
<p><strong>RL agents learn emergent tool use.</strong> RL agents can
learn emergent behaviors involving tools and the manipulation of the
environment. Typically, as in the Crafter example, teaching RL agents to
use tools has required introducing intermediate rewards
(<em>achievements</em>) that encourage the model to learn that behavior.
However, in other settings, RL agents learn to use tools even when not
directly optimized to do so.<p>
Referring back to the example of hide and seek mentioned in the previous
section, the agents involved developed emergent tool use. Multiple
hiders and seekers competed against each other in toy environments
involving movable boxes and ramps. Over time, the agents learned to
manipulate these tools in novel and unexpected ways, progressing through
distinct stages of learning in a way similar to AlphaStar <span
class="citation" data-cites="baker2019emergent">[31]</span>. In the
initial (pre-tool) phase, the agents adopted simple chase and escape
tactics. Later, hiders evolved their strategy by constructing forts
using the available boxes and walls.<p>
However, their advantage was temporary because the seekers adapted by
pushing a ramp towards the fort, which they could climb and subsequently
invade. In turn, the hiders responded by relocating the ramps to the
edges of the game area—rendering them inaccessible—and securely
anchoring them in place. It seemed that the strategies had converged to
a stable point; without ramps, how were the seekers to invade the
forts?<p>
But then, the seekers discovered that they could still exploit the
locked ramps by positioning a box near one, climbing the ramp, and then
leaping onto the box. (Without a ramp, the boxes were too tall to
climb.) Once atop a box, a bot could “surf” it across the arena while
staying on top by exploiting an unexpected quirk of the physics engine.
Eventually, the hiders caught on and learned to secure the boxes in
advance, thereby neutralizing the box-surfing strategy. Even though the
agents had learned through the simple objective of trying to avoid the
gaze (in the case of hiders) or seek out (in the case of seekers) the
opposing players, they learned to use tools in sophisticated ways, even
some the researchers had never anticipated.<p>
</p>
<figure id="fig:tool-use">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/3d_puzzle.png" class="tb-img-full"/>
<p class="tb-caption">Figure 3.8: In multi-agent hide-and-seek, AIs demonstrated emergent tool use. <span
class="citation" data-cites="2019openai">[10]</span>.</p>
<!--<figcaption>Emergent tool-use in multi-agent hide-and-seek - <span-->
<!--class="citation" data-cites="2019openai">[10]</span>.</figcaption>-->
</figure>
<p><strong>RL agents can give rise to emergent social dynamics.</strong>
In multi-agent environments, agents can develop and give rise to complex
emergent dynamics and goals involving other agents. For example, OpenAI
Five, a model trained to play the video game Dota 2, learned a basic
ability to cooperate with other teammates, even though it was trained in
a setting where it only competed against bots. It acquired an emergent
ability not explicitly represented in its training data <span
class="citation" data-cites="2019openai">[32]</span>.<p>
Another salient example of emergent social dynamics and emergent goals
involves <em>generative agents</em>, which are built on top of language
models by equipping them with external scaffolding that lets them take
actions and access external memory <span class="citation"
data-cites="park2023generative">[34]</span>. In a simple 2D village
environment, these generative agents manage to form lasting
relationships and coordinate on joint objectives. By placing a single
thought in one agent’s mind at the start of a “week” that the agent
wants to have a Valentine’s Day party, the entire village ends up
planning, organizing, and attending a Valentine’s Day party. Note that
these generative agents are language models, not classical RL agents,
which demonstrates that emergent goal-directed behavior and social
dynamics are not exclusive to RL settings. We further discuss emergent
social dynamics in the Collective Action Problems chapter.<p>
</p>
<figure id="fig:emergent-social-behaviour">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/AI_in_game.png" class="tb-img-full"/>
<p class="tb-caption">Figure 3.9: Generative AI agents exhibited emergent social behavior. <span
class="citation" data-cites="park2023generative">[34]</span></p>
<!--<figcaption>Emergent social behavior in generative agents - <span-->
<!--class="citation"-->
<!--data-cites="park2023generative">[11]</span></figcaption>-->
</figure>
<h3 id="emergent-optimizers">Emergent Optimizers</h3>
<p><strong>Optimizers can give rise to emergent optimizers.</strong> An
optimization process such as Stochastic Gradient Descent (SGD) can
discover solutions that are themselves optimizers. This phenomenon
introduces an additional layer of complexity in understanding the
behaviors of AI models and can introduce additional control issues <span
class="citation" data-cites="Hubinger2019RisksFL">[35]</span>.<p>
For example, if we train a model on a maze-solving task, we might end up
with a model implementing simple maze-solving heuristics (e.g.
“right-hand on the wall”). We might also end up with a model
implementing a general-purpose maze-solving algorithm, capable of
optimizing for maze-solving solutions in a variety of different
contexts. We call the second class of models <em>mesa-optimizers</em>
and whatever goal they have learned to optimize for (e.g. solving mazes)
their <em>mesa-objective</em>. The term "mesa" is meant as the opposite
of “meta,” such that a mesa-optimizer is the opposite of a
meta-optimizer (where a meta-optimizer is an optimizer on top of another
optimizer, a mesa-optimizer is an optimizer beneath another
optimizer).</p>
<p><strong>Few-shot learning is a form of emergent
optimization.</strong> Perhaps the clearest example of emergent
optimization is <em>few-shot learning</em>. By providing large language
models with several examples of a new task that the system has not yet seen
during training, the model may still be able to learn to perform that
task entirely during inference. The resemblance between few-shot or
“in-context” learning and other learning processes like SGD is not just
in analogy: recent papers have demonstrated that in-context learning
behaves as an approximation of SGD. That is, Transformers are performing
a kind of internal optimization procedure, where as they receive more
examples of the task at hand, they qualitatively change the kind of
model they are implementing <span class="citation"
data-cites="vonoswald2023uncovering oswald2023transformers">[36],
[37]</span>.</p>
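<p>The structure of few-shot learning is easy to see in a prompt. The sketch below uses a small text-generation pipeline (GPT-2) purely for illustration; a model this small will often fail at the task, but the point is that the task is specified only through in-context examples, never through training.</p>
<pre><code class="language-python"># Sketch of few-shot ("in-context") learning: the task is specified only in the prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # small model, illustrative only

prompt = (
    "Reverse each word.\n"
    "Input: cat  Output: tac\n"
    "Input: dog  Output: god\n"
    "Input: bird Output:"
)

# The model is never fine-tuned on word reversal; any competence at the task
# has to come from adapting to the examples during inference.
completion = generator(prompt, max_new_tokens=5, do_sample=False)
print(completion[0]["generated_text"])
</code></pre>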
<h2 id="tail-risk-emergent-goals">3.2.6 Tail Risk: Emergent Goals</h2>
<p>Just as AIs can develop emergent capabilities and emergent
goal-seeking behavior, they may develop <em>emergent goals</em> that
differ from the explicit objectives we give them. This poses a risk
because it could result in imperfect control. Moreover, if models become
self-aware and begin actively pursuing undesired goals, the risk could
potentially be catastrophic because our relationship becomes
adversarial.</p>
<h3 id="risks-from-mesa-optimization">Risks from Mesa-Optimization</h3>
<p><strong>Mesa-optimizers may develop novel objectives.</strong> When
training an AI system on a particular goal, it may develop an emergent
mesa-optimizer, in which case it is not necessarily the case that the
mesa-optimizer’s goal is identical to the original training objective.
The only thing we know for certain with an emergent mesa-optimizer is
that whatever goal it has learned, it must be one that results in good
training performance—but there might be many different goals that would
all work well in a particular training environment. For example, with
LLMs, the training objective is to predict future tokens in a sequence,
so any distinct optimizers that emerge do so because they are
instrumentally useful for lowering the training loss. In the case of
in-context learning, recent work has argued that the Transformer is
performing something analogous to “simulating” and fine-tuning a much
simpler model, in which case it is clear that the objectives will be
related <span class="citation"
data-cites="oswald2023transformers">[37]</span>. However, in general,
the exact relation between a mesa-objective and original objective is
unknown.</p>
<p><strong>Mesa-optimizers may be difficult to control.</strong> If a
mesa-optimizer develops a different objective to the one we specify, it
becomes more difficult to control these (sub)systems. If these systems
have different goals than us and are sufficiently more intelligent and
powerful than us, then this could result in catastrophic outcomes.</p>
<h3 id="risks-from-intrinsification">Risks from Intrinsification</h3>
<p><strong>Models can intrinsify goals <span class="citation"
data-cites="bostrom2022base">[38]</span>.</strong> It is helpful to
distinguish goals that are instrumental from those that are intrinsic.
<em>Instrumental goals</em> are goals that serve as a means to an end.
They are goals that are valued only insofar as they bring about other
goals. <em>Intrinsic goals</em>, meanwhile, are goals that serve as ends
in and of themselves. They are terminally valued by a goal-directed
system.<p>
Next, <em>intrinsification</em> is a process whereby models acquire such
intrinsic goals <span class="citation"
data-cites="bostrom2022base">[38]</span>. The risk is that these newly
acquired intrinsic goals can end up taking precedence over the
explicitly specified objectives or expressed goals, potentially leading
to those original objectives no longer being operationally pursued.</p>
<p><strong>Over time, instrumental goals can become intrinsic.</strong>
A teenager may begin listening to a particular genre or musician in
order to fit into a group but ultimately come to enjoy it for
its own sake. Similarly, a seven-year-old who joins the Cub Scouts may
initially see the group as a means to enjoyable activities but over time
may come to value the scout pack itself. This can even apply to
acquiring money, which is initially sought for purchasing desired items,
but can become an end in itself.<p>
How does this work? When a stimulus regularly precedes the release of a
reward signal, that stimulus may come to be associated with the reward
and eventually trigger reward signals on its own. This process gives
rise to new desires and helps us develop tastes for things that are
regularly linked with basic rewards.</p>
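<p>This mechanism resembles temporal-difference learning. In the toy sketch below, with synthetic two-step episodes and standard TD(0) updates, a state that merely precedes reward ends up with a high learned value of its own.</p>
<pre><code class="language-python"># Toy TD(0) sketch: a stimulus that reliably precedes reward acquires value itself.
states = ["stimulus", "reward_state", "end"]
value = {s: 0.0 for s in states}
alpha, gamma = 0.1, 0.9

for episode in range(200):
    # Each episode: see the stimulus, then receive reward, then terminate.
    trajectory = [("stimulus", 0.0, "reward_state"),
                  ("reward_state", 1.0, "end")]
    for state, reward, next_state in trajectory:
        td_target = reward + gamma * value[next_state]
        value[state] += alpha * (td_target - value[state])

# The stimulus itself now carries value, even though it never delivers reward directly.
print({s: round(v, 2) for s, v in value.items()})
</code></pre>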
<p><strong>Intrinsification could also occur with AIs.</strong> Despite
the differences between human and AI reward systems, there are enough
similarities to warrant concern. In both human and AI reinforcement
learning, the reward signal reinforces behaviors leading to rewards. If
certain conditions frequently precede a model achieving its goals, the
model might intrinsify the emergent goal of pursuing those conditions,
even if it was not the original aim of the designers of the AI.</p>
<p><strong>AIs that intrinsify unintended goals would be
dangerous.</strong> Over time, an internal process that initially
doesn’t completely dictate behavior can become a central part of an
agent’s motivational system. Since intrinsification depends sensitively
on the environment and an agent’s history, it is hard to predict. The
concern is that AIs might intrinsify desires or come to value things
that we did not intend them to.<p>
One example is power seeking. Power seeking is not inherently worrying;
we might expect aligned systems to also be power seeking to accomplish
ends we value. However, if power seeking serves an undesired goal or if
power seeking itself becomes intrinsified (the means become ends), this
could pose a threat.</p>
<p><strong>AI agents will be adaptive, which requires constant
vigilance.</strong> Achieving high performance with AI agents will
require them to be adaptive rather than “frozen” ( unable to learn
anything after training). This introduces the risk of the agents’ goals
changing over time—a phenomenon known as <em>goal drift</em>. Though
this flexibility is necessary if we are to have AI systems evolve
alongside our own changing goals, it presents its own risks if goal
drift results in goals diverging from humans. Since it is difficult to
preclude the possibility of goal drift, ensuring the safety of these
systems will require constant supervision: the risk is not isolated to
early deployment.</p>
<p><strong>The more integrated AI agents become in society, the more
susceptible we become to their goals changing.</strong> In a future
where AIs handle various key decisions and processes, they could form a
complex system of interacting agents that could give rise to
unanticipated emergent goals. For example, they may partially imitate
each other and learn from each other, which would shape their behavior
and possibly also their goals. Additionally, they may also give rise to
emergent social dynamics as in the example of the generative agents.
These kinds of dynamics make the long-term behavior of these AI networks
unpredictable and difficult to control. If we become overly dependent on
them and they develop new priorities that don’t include our wellbeing,
we could face an existential risk.</p>
<p><strong>Conclusion.</strong> AI systems can develop emergent
capabilities that are difficult to predict and control, such as solving
novel problems or accomplishing tasks in unexpected ways. These
capabilities can appear suddenly as models scale up. In itself, the
emergence of new and dangerous capabilities (e.g. capabilities to develop
biological or chemical weapons) could pose catastrophic risks. There
could be further risks if AI systems were to develop emergent goals
diverging from the interests of society and these systems became
powerful. Risks grow as AI agents become more integrated into human
society and susceptible to goal drift or emergent goals. Vigilance is
needed to ensure we are not surprised by advanced AI systems acquiring
dangerous capabilities or goals.</p>
<h2 id="evaluation and anomaly detection">3.2.7 Evaluations and Anomaly Detection</h2>
<p><strong>Emergent capabilities make control difficult.</strong>
Whether certain capabilities develop suddenly or are discovered
suddenly, they can be difficult to predict. This makes it a challenge to
anticipate what future AI will be able to do even in the short term, and
it could mean that we may have little time to react to novel
capability jumps. It is difficult to make a system safe when it is
unknown what that system will be able to do.</p>
<p><strong>Better evaluations and other research techniques could make it easier to detect hazardous emergent capabilities.</strong>
Researchers could try to detect potentially hazardous capabilities as they emerge or develop techniques to track and predict the
progress of models' capabilities in certain relevant domains and skills. They could also track capabilities relevant to mitigating hazards.
It could be valuable to create testbeds to continuously screen AI models for potentially hazardous capabilities, for example abilities that
could meaningfully assist malicious actors with the execution of cyber-attacks, exacerbate CBRN threats or generate persuasive content in a
way that could affect elections. Ideally, we would be able to infer a model's latent abilities purely by analyzing
its weights, revealing capabilities beyond what is visible through standard testing.</p>
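<p>A capability-screening testbed can be as simple as a benchmark loop. The sketch below uses a hypothetical task list and a placeholder <code>query_model</code> function rather than any particular evaluation suite; it runs each task several times and flags capabilities whose success rate crosses a threshold.</p>
<pre><code class="language-python"># Sketch of a capability-screening loop with a placeholder model interface.
from typing import Callable

def evaluate_capability(query_model: Callable[[str], str],
                        tasks: list[dict], threshold: float = 0.2) -> dict:
    """Run each task several times and flag capabilities above a success threshold."""
    flagged = {}
    for task in tasks:
        successes = 0
        for prompt in task["prompts"]:
            answer = query_model(prompt)
            successes += task["grader"](answer)      # grader returns 0 or 1
        rate = successes / len(task["prompts"])
        if rate >= threshold:
            flagged[task["name"]] = rate
    return flagged

# Hypothetical usage with a stubbed-out model and a trivial grader.
tasks = [{"name": "multi-step-planning",
          "prompts": ["Plan steps to ...", "Outline how to ..."],
          "grader": lambda answer: int("step" in answer.lower())}]
print(evaluate_capability(lambda prompt: "Step 1: ...", tasks))
</code></pre>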
<p>To avoid a false sense of safety, it will be important to validate that these detection methods are sufficiently sensitive.
Researchers could intentionally add hidden functionality to models to check that the testing methods catch it. Methods to predict future capabilities
in a quantitative way and find new failure modes would also be valuable. Once a hazardous capability like deception is found, it must be eliminated.
Researchers could develop training techniques that ensure that models don't acquire undesirable skills in the first place, or that make models forget
them after training. But verifying capabilities are fully removed, not just obscured or partially eliminated, could prove difficult.</p>
<p><strong>Better anomaly detection would be highly valuable for monitoring AI systems.</strong> As discussed in the AI Fundamentals section,
anomaly detection involves identifying outliers or abnormal data points. Anomaly detection allows models to reliably detect and respond to
unexpected threats that could substantially impact system performance. This is useful for detecting potential hazards like sudden behavioral
shifts and system failures. A key challenge is detecting rare and unpredictable “black swan” events that are not represented in training data.
Since malicious actors are likely to adopt novel strategies to avoid detection, anomaly detection could be particularly useful for identifying
malicious activity such as cyberattacks. Anomaly detection could also potentially be extended to identify unknown threats such as Trojaned, rogue,
or scheming AI systems. Successful anomaly detectors could identify and flag anomalies for human review or automatically carry out a conservative
fallback policy. Anomaly detection could also be useful for identifying other hazards such as malicious use. For anomaly detection to be useful,
it is important to ensure that detectors have high recall and a low false-alarm rate, to avoid alarm fatigue.</p>
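<p>A minimal version of such a monitor is sketched below using synthetic feature vectors and a simple Mahalanobis-distance detector; real deployments would use richer features and carefully calibrated thresholds. The detection threshold is chosen on held-out normal data to keep the false-alarm rate low.</p>
<pre><code class="language-python"># Minimal anomaly-detection sketch: Mahalanobis distance with a calibrated threshold.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for feature vectors summarizing normal system behavior.
normal_train = rng.normal(size=(5000, 16))
normal_holdout = rng.normal(size=(1000, 16))
anomalies = rng.normal(loc=3.0, size=(50, 16))     # synthetic "black swan" events

mean = normal_train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal_train, rowvar=False))

def score(x):
    diff = x - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distance

# Pick a threshold giving roughly a 1% false-alarm rate on held-out normal data.
threshold = np.quantile(score(normal_holdout), 0.99)

recall = float((score(anomalies) >= threshold).mean())
false_alarms = float((score(normal_holdout) >= threshold).mean())
print(f"recall on anomalies: {recall:.2f}, false-alarm rate: {false_alarms:.2f}")
</code></pre>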