Summary
Large language models (LLMs) have demonstrated performance approaching human levels in tasks such as long-text comprehension and mathematical reasoning, but they remain black-box systems. Understanding the reasoning bottlenecks of LLMs remains a critical challenge, as these limitations are deeply tied to their internal architecture. Attention heads play a pivotal role in reasoning and are thought to share similarities with human brain functions. In this review, we explore the roles and mechanisms of attention heads to help demystify the internal reasoning processes of LLMs. We first introduce a four-stage framework inspired by the human thought process. Using this framework, we review existing research to identify and categorize the functions of specific attention heads. Additionally, we analyze the experimental methodologies used to discover these special heads and further summarize relevant evaluation methods and benchmarks. Finally, we discuss the limitations of current research and propose several potential future directions.
Keywords: attention head, mechanistic interpretability, large language model, LLM, cognitive neuroscience
The bigger picture
In recent years, with the rapid advancement of artificial intelligence (AI) technologies, applications built upon large language models (LLMs) have become indispensable tools in daily life, akin to smartphones. Efforts to enhance the general capabilities of LLMs have primarily focused on two areas: creating high-quality training datasets and improving model architectures. For the latter, prior studies have shown that the reasoning processes of LLMs exhibit notable similarities to human cognition. Therefore, understanding the specific behavioral mechanisms within mainstream LLMs, namely their interpretability mechanisms, is crucial for advancing and refining these models. This study focuses on the attention heads of LLMs, offering a systematic review of their behavior and associated experimental methods across various scenarios. By proposing a novel interpretive framework and highlighting directions worthy of further exploration, we aim to inspire researchers investigating LLM inference abilities and to contribute to the advancement of this field.
Large language models (LLMs) have demonstrated performance approaching human levels in tasks such as long-text comprehension and mathematical reasoning. At the core of their architecture, attention heads play a pivotal role in reasoning and are thought to share similarities with human brain functions. This systematic review introduces a four-stage framework inspired by human cognitive processes, summarizes the functional mechanisms of different attention heads, and highlights current limitations in attention head interpretability research while proposing future directions for exploration.
Introduction
The transformer architecture1 has demonstrated outstanding performance across various tasks, such as natural language inference and natural language generation. However, it still retains the black-box nature inherent to deep neural networks (DNNs).2,3 As a result, many researchers have dedicated efforts to understanding the internal reasoning processes within these models, aiming to uncover the underlying mechanisms.4 This line of research provides a theoretical foundation for models such as bidirectional encoder representations from transformers (BERT)5 and generative pre-trained transformer (GPT)6 to perform well in downstream applications. Additionally, in the current era where large language models (LLMs) are widely applied, interpretability mechanisms can guide researchers in intervening in specific stages of LLM inference, thereby enhancing their problem-solving capabilities.7,8,9 Among the components of LLMs, attention heads play a crucial role in the reasoning process. In recent years, attention heads within LLMs have drawn growing interest, as illustrated in Figure 1. Numerous studies have explored attention heads with specific functions. This paper consolidates these research efforts, organizing and analyzing the potential mechanisms of different types of attention heads. Additionally, we summarize the methodologies employed in these investigations.
Figure 1.
The global Google Trends popularity of the keywords “attention head” and “model interpretability”
The data retrieval date is December 4, 2024.
The logical structure and classification method of this paper are illustrated in Figure 2. We begin with the background of the problem in Background, where we present a simplified representation of the LLMs’ structures (Mathematical representation of LLMs) and explain the related key terms (Glossary of key terms). In Overview of special attention heads, we first summarize the four stages of human thought processes from a cognitive neuroscience perspective and apply this framework to analyze the reasoning mechanisms of LLMs. Using this as our classification criterion, we categorize existing work on attention heads, identifying commonalities among heads that contribute to similar reasoning processes, from knowledge recalling (KR) to expression preparation (EP), and exploring the collaborative mechanisms of heads functioning at different stages (How do attention heads work together?).
Figure 2.
The framework of this survey
Investigating the internal mechanisms of models often requires extensive experiments to validate hypotheses. To provide a comprehensive understanding of these methods, we summarize the current experimental methodologies used to explore special attention heads in Unveiling the discovery of attention heads. We divide these methodologies into two main categories based on whether they require additional modeling: modeling free and modeling required.
In addition to the core sections shown in Figure 2, we summarize the evaluation tasks and benchmarks used in relevant studies in Evaluation. Furthermore, in Additional topics, we compile research on the mechanisms of feedforward networks (FFNs) and mechanistic interpretability to help deepen our understanding of LLM structures from multiple perspectives. Finally, in Discussion, we offer our insights on the current state of research in this field and outline several potential directions for future research.
In summary, the strengths of our work are as follows:
(1) Focus on the latest research. Although earlier researchers explored the mechanisms of attention heads in models such as BERT, many of these conclusions are now outdated. This paper primarily focuses on highly popular LLMs, such as large language model meta AI (LLaMA) and GPT, consolidating the latest research findings.
(2) An innovative four-stage framework for LLM reasoning. We have distilled key stages of human thought processes by integrating knowledge from cognitive neuroscience, psychology, and related fields. Furthermore, we have applied these stages as an analogy for LLM reasoning.
(3) Detailed categorization of attention heads. Based on the proposed four-stage framework, we classify different attention heads according to their functions within these stages, and we explain how heads operating at different stages collaborate to achieve alignment between humans and LLMs.
(4) Clear summarization of experimental methods. We provide a detailed categorization of the current methods used to explore attention head functions from the perspective of model dependency, laying a foundation for the improvement and innovation of experimental methods in future research.
Out-of-scope topics
First, we need to clarify the boundaries of the topic reviewed in this paper. In other words, some works fall outside the scope of our focus.
As the latest research on attention head interpretability is primarily based on LLMs, this paper focuses on the heads within current mainstream LLM architectures, specifically those with a decoder-only structure. As such, we do not discuss early studies related to the transformer, such as those focusing on attention heads in BERT-based models.10,11,12
Some studies on mechanistic interpretability propose holistic operational principles that encompass embeddings, attention heads, and multilayer perceptrons (MLPs). However, this paper focuses exclusively on attention heads. Consequently, we do not cover the roles of other components within the transformer architecture from Overview of special attention heads to Unveiling the discovery of attention heads; these are only briefly summarized in Additional topics.
Background
Mathematical representation of LLMs
As mentioned in Out-of-scope topics, to facilitate the discussion in the subsequent sections, we first define the mathematical notations for the transformer layer of an LLM. Note that there are two main layer normalization methods in LLMs: pre-norm and post-norm.13,14 However, since these are not the focus of this paper, we will omit layer normalization in this section.
As shown in Figure 3, a model $\mathcal{M}$ consists of an embedding layer, $L$ identical transformer layers, and an unembedding layer. The input to $\mathcal{M}$ is a sequence of one-hot sentence tokens with a shape of $\{0,1\}^{N \times |\mathcal{V}|}$, where $N$ is the length of the token sequence and $|\mathcal{V}|$ represents the vocabulary size.
Figure 3.
The overall structure of decoder-only LLMs
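After passing through the embedding layer, which applies semantic embedding and positional encoding (e.g., RoPE15), the one-hot matrix is transformed into the input $X_{1,0} \in \mathbb{R}^{N \times d}$ for the first transformer layer, where $d$ represents the dimension of the token embedding (or latent vector).
In the $\ell$-th transformer layer, there are two residual blocks. The first residual block takes the input matrix $X_{\ell,0}$ and combines it with the output $X_{\ell,1}$ (produced by a multi-head attention mechanism with $H$ attention heads) to compute $X_{\ell,2}$ (as shown in Equation 1). Subsequently, $X_{\ell,2}$ serves as the input for the second residual block. Here, $\mathrm{Attn}_{\ell}^{h}(\cdot)$ represents the computation function of the $h$-th attention head in the $\ell$-th layer (see Equation 3), where $1 \le h \le H$.
| $X_{\ell,1} = \sum_{h=1}^{H} \mathrm{Attn}_{\ell}^{h}(X_{\ell,0}), \quad X_{\ell,2} = X_{\ell,0} + X_{\ell,1}$ | (Equation 1) |
Similarly, as shown in Equation 2, the second residual block combines $X_{\ell,2}$ with the output $X_{\ell,3}$ obtained after passing through the FFN, yielding the final output $X_{\ell+1,0}$ of the $\ell$-th decoder block. This output also serves as the input for the $(\ell+1)$-th decoder block. Here, $\mathrm{FFN}_{\ell}(\cdot)$ consists of linear layers (and activation functions) such as gated linear units (GLU), Swish GLU (SwiGLU),16 or mixture of experts (MoEs).17,18
| $X_{\ell,3} = \mathrm{FFN}_{\ell}(X_{\ell,2}), \quad X_{\ell+1,0} = X_{\ell,2} + X_{\ell,3}$ | (Equation 2) |
Here, we will concentrate on the details of $\mathrm{Attn}_{\ell}^{h}(\cdot)$. This function can be expressed using matrix operations. Specifically, each layer's $\mathrm{Attn}_{\ell}^{h}(\cdot)$ function corresponds to four low-rank matrices: $W_Q^{\ell,h}$, $W_K^{\ell,h}$, $W_V^{\ell,h}$, and $W_O^{\ell,h}$. By multiplying $X_{\ell,0}$ with $W_Q^{\ell,h}$, the query matrix $Q_{\ell}^{h} = X_{\ell,0} W_Q^{\ell,h}$ is obtained. Similarly, the key matrix $K_{\ell}^{h} = X_{\ell,0} W_K^{\ell,h}$ and the value matrix $V_{\ell}^{h} = X_{\ell,0} W_V^{\ell,h}$ can be derived. The function $\mathrm{Attn}_{\ell}^{h}(\cdot)$ can then be expressed as Equation 3.1
| $\mathrm{Attn}_{\ell}^{h}(X_{\ell,0}) = \mathrm{softmax}\left(Q_{\ell}^{h} (K_{\ell}^{h})^{\top}\right) V_{\ell}^{h} W_O^{\ell,h}$ | (Equation 3) |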
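As a rough illustration of Equations 1 to 3, the following pure-Python sketch computes one transformer layer on small list-of-lists matrices. All helper names (`matmul`, `add`, `causal_softmax`, `attn_head`, `transformer_layer`) are hypothetical. Two details are assumptions that go beyond the simplified notation above: the causal mask used by decoder-only models and the usual division of the scores by the square root of the head dimension. Layer normalization is omitted, as in the text.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def causal_softmax(scores):
    """Row-wise softmax in which token i only attends to tokens j <= i (assumed mask)."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out

def attn_head(x, w_q, w_k, w_v, w_o):
    """One attention head in the spirit of Equation 3, with assumed score scaling."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(w_q[0]))               # sqrt of the head dimension
    k_t = [list(col) for col in zip(*k)]         # transpose of K
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul(matmul(causal_softmax(scores), v), w_o)

def transformer_layer(x0, heads, ffn):
    """Equations 1 and 2: attention residual block followed by FFN residual block."""
    x1 = [[0.0] * len(x0[0]) for _ in x0]
    for w_q, w_k, w_v, w_o in heads:             # sum over the H heads
        x1 = add(x1, attn_head(x0, w_q, w_k, w_v, w_o))
    x2 = add(x0, x1)                             # first residual block
    x3 = ffn(x2)
    return add(x2, x3)                           # second residual block

# Toy usage: N = 3 tokens, d = 2, H = 2 heads with head dimension 1.
x0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head = ([[1.0], [0.0]], [[0.0], [1.0]], [[1.0], [1.0]], [[0.5, -0.5]])
relu_ffn = lambda m: [[max(v, 0.0) for v in row] for row in m]
x_next = transformer_layer(x0, [head, head], relu_ffn)
```

The output `x_next` has the same shape as `x0`, which is what allows the layer to be stacked $L$ times.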
Glossary of key terms
This paper mentions several specialized terms that are fundamental to understanding and analyzing the reasoning mechanisms of LLMs. These terms are organized into two categories: conceptual frameworks, which provide theoretical abstractions for modeling LLM reasoning, and empirical analysis methods, which offer practical tools for experimentally probing and validating these frameworks. Below, we provide explanations for these key terms.
Conceptual frameworks
Circuits
Circuits are abstractions of the reasoning logic in deep models. The model $\mathcal{M}$ is viewed as a computational graph. There are two main approaches to modeling circuits. One approach treats the features in the latent space of $\mathcal{M}$ as nodes and the transitions between features as edges.19,20 The other approach views different components of $\mathcal{M}$, such as attention heads and neurons, as nodes, and the interactions between these components, such as residual connections, as edges.21 A circuit is a subgraph of $\mathcal{M}$. Researchers have discovered many important circuits, such as the bias circuit22 and knowledge circuit.23
Residual stream
As shown in Figure 4, each row in the figure can be viewed as a residual stream. The residual stream after layer $\ell$ is the sum of the embedding and the outputs of all layers up to layer $\ell$, serving as the input to layer $\ell+1$. Elhage et al.24 conceptualized the residual stream as a shared bandwidth through which information can flow. Different layers (or tokens) utilize this shared bandwidth, with lower layers (or previous tokens) writing information and higher layers (or subsequent tokens) reading it.
Figure 4.
The diagram of residual streams
From the perspective of residual streams, the inference process of LLMs can be understood at a micro level where attention heads access latent state matrices from several residual streams, as indicated by the gray arrows across layers in the diagram. At a macro level, different residual streams control the information flow through attention heads, as shown by the gray wavy lines in the diagram.
QK matrix and OV matrix
We expand Equation 3, as shown in Figure 5. According to the study by Elhage et al.,24 $W_Q^{\ell,h} (W_K^{\ell,h})^{\top}$, the product of the query and key weight matrices, is referred to as the QK matrix (QK circuit), while $W_V^{\ell,h} W_O^{\ell,h}$, the product of the value and output weight matrices, is referred to as the OV matrix (OV circuit). Specifically, the QK matrix enables the computation of attention scores between the $N$ tokens in $X_{\ell,0}$, thereby facilitating the reading of information from certain residual streams. Meanwhile, the OV matrix is responsible for writing the processed information back into the corresponding residual streams.
Figure 5.
Annotation of the expanded form of Equation 3
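To make the QK/OV decomposition concrete, the sketch below checks on a toy example that attention scores computed through the combined QK matrix match the scores computed from separate query and key matrices, and likewise for the OV matrix. The function names and all numeric values are hypothetical and chosen only so the arithmetic stays exact.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def qk_matrix(w_q, w_k):
    """Combined d x d matrix that determines where a head reads from."""
    return matmul(w_q, transpose(w_k))

def ov_matrix(w_v, w_o):
    """Combined d x d matrix that determines what a head writes back."""
    return matmul(w_v, w_o)

# Toy input: N = 3 tokens, d = 2, head dimension 1.
x = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
w_q, w_k = [[1.0], [2.0]], [[0.0], [1.0]]
w_v, w_o = [[1.0], [1.0]], [[2.0, 0.0]]

# Pre-softmax attention scores, computed two equivalent ways.
scores_via_qk = matmul(matmul(x, qk_matrix(w_q, w_k)), transpose(x))
scores_direct = matmul(matmul(x, w_q), transpose(matmul(x, w_k)))

# Per-token written content (before attention weighting), two equivalent ways.
written_via_ov = matmul(x, ov_matrix(w_v, w_o))
written_direct = matmul(matmul(x, w_v), w_o)
```

Because both products are $d \times d$, they can be analyzed directly in the residual stream's basis, which is one reason this view is convenient for interpretability work.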
Empirical analysis methods
Activation patching
Activation patching analyzes how modifying internal activations affects the model's final decisions. It involves substituting activation values in specific layers of a model with alternatives, such as activations from different inputs, baseline values, or perturbed versions. Specifically, three types of effects are considered: direct effect, indirect effect, and total effect, as illustrated in Figure 6.
Figure 6.
Three different types of calculating effects
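The following toy sketch illustrates activation patching on a hand-written two-component "model". It is not taken from any of the surveyed papers: the component names (`head_a`, `head_b`, `mlp`), the `run_model` function, and the arithmetic are all hypothetical. The definitions of the three effects follow one common causal-mediation reading, which is an assumption here: the total effect lets the patched activation influence everything downstream, the direct effect freezes downstream components at their clean values, and the indirect effect lets only the downstream path see the patch.

```python
def run_model(x, patch=None):
    """Toy model. `patch` maps a component name to a replacement activation."""
    patch = patch or {}
    acts = {}
    acts["head_a"] = patch.get("head_a", 2.0 * x)
    acts["head_b"] = patch.get("head_b", x + 1.0)
    # The MLP reads both heads (downstream path).
    acts["mlp"] = patch.get("mlp", acts["head_a"] * acts["head_b"])
    # head_a also writes directly to the output (direct path).
    logit = acts["mlp"] + 0.5 * acts["head_a"]
    return logit, acts

clean_logit, clean_acts = run_model(1.0)       # clean input
_, corrupt_acts = run_model(3.0)               # corrupted input, cached

# Total effect: patch head_a and let everything downstream recompute.
total_logit, total_acts = run_model(1.0, {"head_a": corrupt_acts["head_a"]})
total_effect = total_logit - clean_logit

# Direct effect: patch head_a but freeze the MLP at its clean activation.
direct_logit, _ = run_model(
    1.0, {"head_a": corrupt_acts["head_a"], "mlp": clean_acts["mlp"]}
)
direct_effect = direct_logit - clean_logit

# Indirect effect: only the MLP sees the patched head_a.
indirect_logit, _ = run_model(1.0, {"mlp": total_acts["mlp"]})
indirect_effect = indirect_logit - clean_logit
```

In this toy case the output is additive in the two paths, so the total effect equals the sum of the direct and indirect effects; in a real nonlinear model that decomposition generally does not hold exactly.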
Ablation study
An ablation study is conceptually related to activation patching but differs in how it operates. Instead of replacing activations, an ablation study removes specific components of the LLM to observe how the output is affected.25 The key distinction lies in the mechanism: activation patching modifies activations to simulate the logical replacement of a component, whereas an ablation study removes the component entirely.
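A minimal sketch of head ablation, assuming a toy residual stream that is simply the sum of an embedding and per-head output vectors. The head names (`L1H0`, `L1H1`), the `final_logit` function, and all values are hypothetical. Zero-ablation drops a head's contribution; passing a `replacement` vector (for example a mean output over a reference dataset) is a commonly used softer variant, included here as an assumption rather than as a method named in the text.

```python
def final_logit(embedding, head_outputs, unembed_dir, ablated=(), replacement=None):
    """Sum embedding and head outputs, then project onto one token direction."""
    stream = list(embedding)
    for name, out in head_outputs.items():
        if name in ablated:
            # Zero-ablation by default; otherwise use the supplied replacement vector.
            out = replacement[name] if replacement else [0.0] * len(out)
        stream = [s + o for s, o in zip(stream, out)]
    return sum(s * u for s, u in zip(stream, unembed_dir))

embedding = [1.0, 0.0]
head_outputs = {"L1H0": [0.0, 2.0], "L1H1": [1.0, 0.0]}
unembed_dir = [0.0, 1.0]                 # direction of the answer token

baseline = final_logit(embedding, head_outputs, unembed_dir)
drop_h0 = baseline - final_logit(embedding, head_outputs, unembed_dir, ablated=("L1H0",))
drop_h1 = baseline - final_logit(embedding, head_outputs, unembed_dir, ablated=("L1H1",))
mean_ablated = final_logit(
    embedding, head_outputs, unembed_dir,
    ablated=("L1H0",), replacement={"L1H0": [0.0, 0.5]},
)
```

A large drop for one head and a negligible drop for another is the kind of evidence typically used to argue that the first head matters for the behavior under study.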
Logit lens
The logit lens can quantify effects such as those shown in Figure 6. It is often used in conjunction with activation patching or ablation studies. Specifically, it uses the unembedding layer to map an intermediate representation vector to logit values over the vocabulary, allowing for the comparison of logit differences or other metrics. More details are in the Colab notebook.
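The idea can be sketched in a few lines, assuming a toy unembedding matrix with one column per vocabulary token. The vocabulary, the matrix values, and the `logit_lens` function name are hypothetical. Practical implementations usually apply the model's final layer normalization before unembedding; that step is left out here, in line with the simplified notation used in this paper.

```python
def logit_lens(hidden, w_u, vocab):
    """Project a d-dimensional hidden vector through W_U (d x |V|) to per-token logits."""
    logits = [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w_u)]
    return dict(zip(vocab, logits))

vocab = ["Paris", "London", "the"]
w_u = [
    [1.0, 0.0, 0.5],   # first hidden dimension
    [0.0, 1.0, 0.5],   # second hidden dimension
]

# An intermediate residual-stream vector read out at some layer.
hidden_mid = [2.0, 0.5]
logits_mid = logit_lens(hidden_mid, w_u, vocab)

# Logit difference between a correct and an incorrect candidate token.
logit_diff = logits_mid["Paris"] - logits_mid["London"]
```

Tracking `logit_diff` before and after a patch or ablation gives a scalar measure of how much the intervention changed the model's preference.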
Existing related surveys
To the best of our knowledge, there is no survey focused on the mechanisms of LLMs’ attention heads. Specifically, Räuker et al.26 mainly discussed non-transformer architectures, with little focus on attention heads. The surveys by Gonçalves et al.,27 Santana and Colombini,28 Chaudhari et al.,29 and Brauwers and Frasincar30 cover older content, primarily focusing on the various attention computation methods that emerged during the early development of the transformer. However, current LLMs still use the original scaled dot-product attention, indicating that many of the derived attention forms have become outdated. Although Luo and Specia31 focused on the internal structure of LLMs, they only summarized experimental methodologies and overlooked research findings related to operational mechanisms.
Overview of special attention heads
Previous research has shown that the decoder-only architecture described in Background follows the scaling law, and it exhibits emergent abilities once the number of parameters reaches a certain threshold.32,33 Many LLMs that have emerged subsequently demonstrate outstanding performance in numerous tasks, in some cases approaching human performance. However, researchers still do not fully understand why these models are able to achieve such remarkable results. To address this question, recent studies have begun to delve into the internal mechanisms of LLMs, focusing on their fundamental structure: a neural network composed of multi-head attention and FFNs.
We have observed that many studies concentrate on the functions of attention heads, attempting to explain their reasoning processes. Additionally, several researchers have drawn parallels between the reasoning methods of LLMs and humans, as illustrated in Table 1. These findings suggest that certain research insights from studies of the human brain may be transferable to the study of attention heads. Therefore, in this section, we first summarize a four-stage framework inspired by human cognitive paradigms and use it as a guiding method to classify the functions of different attention heads.
Table 1.
Summary of the relationship between LLMs and human behaviors explored in existing studies
| Research paper | Viewpoints on the relationship between LLMs and humans (brains) |
|---|---|
| Liang et al.34 | “self-feedback” mechanism in LLMs mirrors human metacognition35 by enabling models to evaluate and refine their own reasoning |
| Dasgupta et al.36 | the language model can exhibit many of the varied, context-sensitive patterns of human reasoning behavior |
| Li et al.37 | different attention heads in LLMs exhibit specialized roles, analogous to the modular organization of human brain regions |
| Janik38 | LLMs exhibit some human-like memory characteristics, such as primacy and recency effects |
| Schrimpf et al.39 | representations in transformers show significant similarity to human brain neural activities during language tasks, particularly in terms of predictive processing (errors flow bottom-up to adjust the model) |
| Marjieh et al.40 | the attention distributions of LLMs for implicit semantic relations in language closely align with human response patterns in perceptual tasks |
| Mischler et al.41 | the attention mechanism may partially reflect the brain’s predictive coding theory42 |
How do the human brain and attention heads think?
As shown in Table 1, the role of an attention head, as its name suggests, is quite analogous to the functions of the human brain. In some representative earlier works, the object-attribute-relation (OAR) model abstracts human brain knowledge and information into a graph composed of objects, attributes, and relations.43 Based on this abstraction, Wang and Chiew44 proposed a mathematical model of problem-solving. Specifically, the solver’s brain first utilizes its own OAR model to identify the content of the problem, distinguishing the objects and attributes within it, and constructs a sub-OAR model accordingly. Then, the solver combines their knowledge to search for potential solution goals and solution paths, evaluating these candidate solutions. If the evaluation results are unsatisfactory, the solver iteratively explores and evaluates solutions until suitable ones are found. Ultimately, the result of problem-solving is represented as a part of the relations in the sub-OAR model.
Similarly, the ACT-R model, which consists of five modules—perception (P), working memory (WM), procedural memory (PM), declarative memory (DM), and motor (M)—highlights the interaction between various modules in human cognition.45 The P module receives environmental inputs (e.g., visual or auditory information) and transmits them to the WM module. WM retrieves condition-action rules stored in PM in an if-then format to generate the next action. If additional knowledge is required, WM retrieves it from DM. Finally, the action is executed through the M module.46,47
In summary, these studies center on how humans retrieve knowledge, perceive and understand problems or environments, and conceive and execute actions. Inspired by these works, we propose a more universally applicable four-stage framework for describing the process by which the human brain solves specific problems: KR, in-context identification (ICI), latent reasoning (LR), and EP. These four stages can interact with and transition between one another, as illustrated in Figure 7.
Figure 7.
The four-stage framework of human thinking and LLM reasoning
The relationship between these four stages is not a linear progression but rather a graph-like transformation. Both humans and LLMs iteratively retrieve internal knowledge, observe the problem, and reason to arrive at the final answer.
When solving a problem, humans first need to recall the knowledge they have learned that is relevant to the issue at hand. This process is known as KR. During this stage, the hippocampus integrates memories into the brain’s network48 and activates different types of memories as needed with the help of dynamic associations.49,50 Confronted with the specific text of the problem, humans need to perform ICI. This means that the brain not only focuses on the overall structural content of the text51 but also parses the syntactic52 and semantic53 information embedded within it.
Once the brain has acquired the aforementioned textual and memory information, it attempts to integrate this information to derive conclusions, a process known as LR. This stage primarily includes arithmetic operations54 and logical inference.55 Finally, the brain needs to translate the reasoning results into natural language, forming an answer that can be expressed verbally. This is the EP stage. At this point, the brain bridges the gap between knowing and saying.56
As indicated by the arrows in Figure 7, these four stages are not executed in a strictly unidirectional fashion when humans solve problems; rather, they can jump and switch between each other. For example, the brain may “cycle” through the identification of contextual content (the ICI stage) and then retrieve relevant knowledge based on the current context (the KR stage). Similarly, if LR cannot proceed further due to missing information, the brain may return to the KR and ICI stages to gather more information.
We will now draw an analogy between these four steps and the mechanisms of attention heads, as depicted in Figure 8. Previous research has shown that LLMs possess strong contextual learning abilities and have many practical applications.57 As a result, much of the work on interpretability has focused on the ability of LLMs to capture and reason about contextual information. Consequently, the functions of currently known special attention heads are primarily concentrated in the ICI and LR stages, while fewer attention heads operate in the KR and EP stages.
Figure 8.
Taxonomy of special attention heads in language models
The icons before each head indicate the specific LLM architectures where the head was discovered.
KR
For LLMs, most knowledge is learned during the training or fine-tuning phases and is embedded in the model’s parameters. This form of knowledge is often referred to as LLMs’ “parametric knowledge.” Similar to humans, certain attention heads in LLMs recall this internally stored knowledge, such as common sense or domain-specific expertise, to be used in subsequent reasoning. These heads typically retrieve knowledge by making initial guesses or forming associations based on specific content within the context, injecting the memory information into the residual stream as initial data or supplementary information. A brief summary of their functionalities is shown in Table 2.
Table 2.
Key attention heads in KR
| Head name | Input feature | Output feature | Layer distribution |
|---|---|---|---|
| Memory head | user context and intermediate results | relevant parametric knowledge injected | shallow/middle |
| Constant head | all options in multiple-choice tasks | uniformly distributed attention scores | middle |
| Single Letter head | answer options | focused attention on a single candidate | middle |
| Negative head | binary decision task context | bias attention scores toward negative expressions | middle |
In general tasks, Bietti et al.58 identified that certain attention heads can give rise to “associative memories,” progressively storing and retrieving knowledge during the model’s training phase. The weight matrices of these heads can be viewed as a weighted sum of the outer products of various vectors (e.g., input-output vectors or key-value vectors). Through their processing, these heads filter out noise from a superposed activation state while preserving essential features. Furthermore, as the embedding dimension d increases, they become more adept at refining relevant information and linking it to useful memories.59 The so-called memory head is capable of retrieving content related to the current problem from the model’s parametric knowledge.60 This retrieved content could be knowledge learned during pre-training or experience accumulated during previous reasoning processes. Specifically, shallow FFNs enrich the semantics of entities present in the problems. Based on this enriched information, the memory head recalls attributes associated with these entities and writes them back into the residual stream.
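The "weighted sum of outer products" view can be illustrated with a generic linear associative memory. This is a textbook-style sketch and not the specific construction of Bietti et al.; the function names, keys, values, and weights are all hypothetical. With orthonormal keys, multiplying the memory matrix by a stored key returns its associated value exactly.

```python
def build_memory(pairs, weights):
    """W = sum_i weights[i] * outer(value_i, key_i), stored as a list of rows."""
    d_out, d_in = len(pairs[0][1]), len(pairs[0][0])
    w = [[0.0] * d_in for _ in range(d_out)]
    for (key, value), alpha in zip(pairs, weights):
        for i in range(d_out):
            for j in range(d_in):
                w[i][j] += alpha * value[i] * key[j]
    return w

def recall(w, key):
    """Retrieve the value associated with `key` by a matrix-vector product."""
    return [sum(wij * kj for wij, kj in zip(row, key)) for row in w]

# Two (key, value) associations with orthonormal keys in a 3-d input space.
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0, 0.0], [3.0, 4.0]),
]
memory = build_memory(pairs, weights=[1.0, 1.0])

exact = recall(memory, [1.0, 0.0, 0.0])      # query with a stored key
mixed = recall(memory, [1.0, 0.1, 0.0])      # superposed query: mostly key 0, a bit of key 1
```

The `mixed` query returns mostly the first value with a small contribution from the second, a simple picture of retrieval from a superposed activation state.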
In specific task scenarios, such as when LLMs tackle multiple-choice question answering (MCQA) problems, the answer is typically an option letter (e.g., A/B/C/D) rather than a short text. In these cases, they may initially use constant head to evenly distribute attention scores across all options. Alternatively, they might use single letter head to assign a higher attention score to one option while giving lower scores to others, thereby capturing all potential answers.61 In addition, in the context of binary decision tasks (BDT), which are problems where the solution space is discrete and contains only two options, such as yes-no questions or answer verification, Yu et al.62 found that LLMs often exhibit a negative bias when handling such tasks. This could be because the model has learned a significant amount of negative expressions related to similar tasks from prior knowledge during training. Consequently, when the model identifies a given text as a binary task, a negative head may “preemptively” choose the negative answer due to this prior bias.
ICI
Understanding the in-context nature of a problem is one of the most critical steps in addressing it effectively. Just as humans read a problem statement and quickly pick up on various key pieces of information, some attention heads in LLMs also focus on these elements. Specifically, attention heads that operate during the ICI stage use their QK matrices to focus on and identify overall structural, syntactic, and semantic information within the context. This information is then written into the residual stream via OV matrices.
Overall structural information identification
Identifying the overall structural information within a context mainly involves LLMs attending to content in special positions or with unique occurrences in the text. Previous head63,64 and positional head65,66,67 attend to the positional relationships within the token sequence. They capture the embedding information of the current token and the previous token. Rare words head focuses on tokens that appear with the lowest frequency, emphasizing rare or unique tokens.66 Duplicate head excels at capturing repeated content within the context, giving more attention to tokens that appear multiple times.21
In addition, the growing ability of LLMs to handle long texts is related to the “needle-in-a-haystack” capability of attention heads. (Global) retrieval head can accurately locate specific tokens in long texts.68,69,70 These heads enable LLMs to achieve excellent reading and in-context retrieval capabilities.
Syntactic information identification
For syntactic information identification, sentences primarily consist of subjects, predicates, objects, and clauses. Syntactic head can distinctly identify and label nominal subjects, direct objects, adjectival modifiers, and adverbial modifiers.66,71 Some words in the original sentence may get split into multiple subwords because of the tokenizer (e.g., “happiness” might be split into “happi” and “ness”). The subword merge head focuses on these subwords and merges them into one complete word.65,72
Additionally, Yao et al.23 proposed the mover head cluster, which can be considered as “argument parsers.” These heads often copy or transfer a sentence’s important information (such as the subject’s position) to the [END] position. The [END] position refers to the last token’s position in the sentence being decoded by the LLM, and many studies indicate that summarizing contextual information at this position facilitates subsequent reasoning and next-token prediction. Name mover head and backup name mover head move the names in the text to the [END] position. Letter mover head extracts the first letters of certain words in the context and aggregates them at the [END] position.73 Conversely, negative name mover head prevents name information from being transferred to the [END] position.21,74
Semantic information identification
As for semantic information identification, context head extracts information from the context that is related to the current task.60 Further, content-gatherer head “moves” tokens related to the correct answer to the [END] position, preparing to convert them into the corresponding option letter for output.61,75 The sentiment summarizer proposed by Tigges et al.76 can summarize adjectives and verbs that express sentiment in the context near the [SUM] position. The [SUM] position is located directly before the [END] position and enables subsequent heads to effectively read and reason.
Capturing the message about relationships is also important for future reasoning. Semantic induction head captures semantic relationships within sentences, such as part-whole, usage, and category-instance relationships.77 Subject head and relation head focus on subject attributes and relation attributes, respectively, and then inject these attributes into the residual stream.78
LR
The KR and ICI stages focus on gathering information, while LR is where all the collected information is synthesized and logical reasoning occurs. Whether in humans or LLMs, the LR stage is the core of problem-solving. Specifically, QK matrices of a head perform implicit reasoning based on information read from the residual stream, and then the reasoning results or signals are written back into the residual stream through OV matrices.
In-context learning
In-context learning is one of the most widely discussed areas. Building on the work of Pan et al.,79 it primarily includes two types: task recognition (TR) and task learning (TL). Both aim to infer the solution based on the context; however, they differ fundamentally in their reliance on pre-trained priors. TR leverages the prior knowledge of LLMs to interpret demonstrations. For instance, sentiment classification tasks often involve labels with clear semantic meanings, such as “positive” and “negative,” which LLMs are likely to have internalized during pre-training. In contrast, TL requires the model to learn a novel mapping function between input-output pairs, where the examples and labels lack an inherent semantic connection.
For TR, summary reader can read the information summarized at the [SUM] position during the ICI stage and use this information to infer the corresponding sentiment label.76 Todd et al.80 proposed that the output of certain mid-layer attention heads can combine into a function vector. These heads abstract the core features and logical relationships of a task, based on the semantic information identified during ICI, and thereby trigger task execution.
For TL, the essence of solving these problems is enabling LLMs to inductively find patterns. Induction heads are among the most widely studied attention heads.63,81,82,83,84,85,86 They capture patterns such as “… [A][B] … [A]” where token [B] follows token [A], and predict that the next token of this sequence should be [B]. Specifically, the induction head in the residual stream of the second [A] can access information from that of all preceding tokens. This mainly includes information about what the previous token is for each token, which is provided by the previous head. The induction head then matches this information with the information in the current residual stream; i.e., it matches the second [A] with the [A] preceding [B], to perform further reasoning.
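The two-step mechanism described above (a previous head supplying "what came before me" information, followed by an induction head matching on it) can be caricatured with discrete tokens. This is a hard-attention sketch under simplifying assumptions: real heads use soft attention over continuous vectors, and the function names here are hypothetical.

```python
def previous_token_info(tokens):
    """Previous-head step: each position records which token precedes it."""
    return [None] + tokens[:-1]

def induction_predict(tokens):
    """Induction-head step: find a position whose previous token equals the
    current (last) token, then copy the token at that position as the prediction."""
    prev = previous_token_info(tokens)
    current = tokens[-1]
    # Scan from the most recent position backwards; position 0 has no previous token.
    for j in range(len(tokens) - 1, 0, -1):
        if prev[j] == current:
            return tokens[j]
    return None  # no earlier occurrence of the pattern

prediction = induction_predict(["A", "B", "C", "A"])
no_pattern = induction_predict(["x", "y"])
```

For the sequence `[A][B][C][A]`, the position holding `[B]` is the one whose previous token is `[A]`, so `[B]` is copied forward as the predicted next token.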
Induction heads tend to strictly follow a pattern once it is identified and complete fill-in-the-blank reasoning. However, in most cases, the real problem will not be identical to the examples, just as students’ exam papers will not be exactly the same as their homework. To address this, Yu and Ananiadou87 identified the in-context head, whose QK matrix calculates the similarity between information at the [END] position and each label. The OV matrix then extracts label features and weights them according to the similarity scores to determine the final answer, taking all labels into consideration rather than only one.
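A minimal sketch of this similarity-weighted aggregation is given below, under stated assumptions: the vectors are toy values, the scaling by the square root of the dimension follows standard scaled dot-product attention, and the function name `in_context_head` and the one-hot label features are illustrative choices rather than details taken from Yu and Ananiadou.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_context_head(query, label_keys, label_values):
    """Weight label features by query-key similarity.

    query:        vector at the [END] position (QK side)
    label_keys:   one key vector per demonstration label
    label_values: one feature vector per demonstration label (OV side)
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in label_keys])
    dim = len(label_values[0])
    output = [sum(w * v[d] for w, v in zip(weights, label_values))
              for d in range(dim)]
    return weights, output


# Three demonstrations; labels encoded as one-hot (positive, negative).
weights, output = in_context_head(
    query=[1.0, 0.0],
    label_keys=[[2.0, 0.0], [0.0, 2.0], [1.5, 0.0]],
    label_values=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
)
print(weights, output)
```

Because every label contributes in proportion to its similarity, the output reflects all demonstrations at once instead of copying a single one.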
Effective reasoning
Some studies have identified heads related to reasoning effectiveness. Truthfulness head88,89 and accuracy head90,91 are heads highly correlated with the truthfulness and accuracy of answers. They help the model infer truthful and correct results in question answering (QA) tasks, and modifying the model along their activation directions can enhance LLMs’ reasoning abilities. Similarly, the consistency head ensures the internal consistency of LLMs when asked the same question in different ways.92
However, not all heads positively impact reasoning. For example, vulnerable head is overly sensitive to certain specific input forms, making it susceptible to irrelevant information and leading to incorrect results.93 During reasoning, it is advisable to minimize the influence of such heads.
Task-specific reasoning
Finally, some heads are specialized for specific tasks. In MCQA tasks, correct-letter head can complete the matching between the answer text and option letters in order to determine the final answer choice.61 When dealing with tasks related to sequential data, iteration head can iteratively infer the next intermediate state based on the current state and input.94 For arithmetic problems, successor head can perform increment operations on ordinal numbers.95
In tasks such as syllogistic reasoning and information extraction, the inhibition head (also referred to as the suppression head) can aggregate outputs from other heads and suppress certain information. For example, it can suppress a subject or a middle term in order to reduce their associated logit values after unembedding.21,96
These examples illustrate how various attention heads specialize in different aspects of reasoning, contributing to the overall problem-solving capabilities of LLMs.
EP
During the EP stage, LLMs need to align their reasoning results with the content that needs to be expressed verbally. As shown in Table 3, EP heads may first aggregate information from various stages. Chughtai et al.78 proposed the mixed head, which can linearly combine and aggregate information passed along by heads from the ICI and LR stages (such as subject heads, relation heads, and induction heads). The aggregated results are then written back into the residual stream and ultimately mapped onto the vocabulary logits via the unembedding layer.
Table 3.
Key attention heads in EP
| Head name | Input feature | Output feature | Layer distribution |
|---|---|---|---|
| Mixed head | outputs of subject and relation heads | integrated and concise final representation | deep |
| Amplification head | correct answer signals | amplified attention on correct tokens | deep |
| Correct head | hidden states of different options | focused attention on final output tokens | deep |
| Coherence head | contextualized reasoning outputs | tokens of fluent and coherent text | middle/deep |
| Faithfulness head | reasoning results and instructions | selected faithful contexts | deep |
Some EP heads have a signal-amplification function. Specifically, they read information about the context or reasoning results from the residual stream, then enhance the information that needs to be expressed as output and write it back into the stream. Amplification head61 and correct head97 amplify the signal of the correct choice letter in MCQA problems near the [END] position. This amplification ensures that, after passing through the unembedding layer and softmax calculation, the correct choice letter has the highest probability.
In addition to information aggregation and signal amplification, some EP heads are used to align the model’s reasoning results with the user’s instructions. In multilingual tasks, the model may sometimes fail to respond in the target language desired by the user. Coherence head ensures linguistic consistency in the generated content.90 It helps LLMs maintain consistency between the output language and the language of the user’s query when dealing with multilingual inputs. Faithfulness head is strongly associated with the faithfulness of chain of thought (CoT), which refers to whether the model’s generated response accurately reflects its internal reasoning process and behavior, i.e., the consistency between output and internal reasoning.98 Enhancing the activation of these heads allows LLMs to better align their internal reasoning with the output, making the CoT results more robust and consistent.
However, for some simple tasks, LLMs might not require special EP heads to refine language expression. In this situation, the information written back into the residual stream during the ICI and LR stages may be directly suitable for output (i.e., skip the EP stage and select the token with the highest probability).
How do attention heads work together?
As illustrated in Figure 9, if we divide the layers of an LLM (e.g., GPT-2 Small) into three segments based on their order—shallow (e.g., layers 1–4), middle (e.g., layers 5–8), and deep (e.g., layers 9–12)—we can map the relationship between the stages where heads act and the layers they are in, according to the content above. However, when combined with Figure 7, this pattern reflects only the majority of cases; there are instances where LLMs return to the KR or ICI stage at deeper layers—for example, in the MCQA and indirect object identification (IOI) cases discussed below.
Figure 9.
Diagram of the relationship between the stages where heads act and the layers they are in, as described from KR to EP
To gain an enhanced understanding of the relationships between these heads, researchers have investigated the potential semantic meanings embedded in the query vector and key vector.61,75 For example, when solving an MCQA problem, the model first infers the correct answer in text form. It then needs to map this text to the corresponding option letter based on the list of choices. At this point, during the ICI stage, the content-gatherer head moves the tokens of the inferred answer text to the [END] position. Then, in the LR stage, the correct-letter head uses the information passed by the content-gatherer head to identify the correct option. The query vector in this context effectively asks, “Are you the correct label?” while recalling the gathered correct answer text. The key vector represents, “I’m choice [A/B/C/D], with the corresponding text […].” After matching the right key vector to the query vector, we can get the correct answer choice.
Consider the parity problem, which involves determining whether the sum of an input sequence $x_1, x_2, \dots, x_t$, consisting of only 0s and 1s, is odd or even. Let the parity state sequence $s_1, s_2, \dots, s_t$ be such that $s_i$ denotes the parity (odd or even) of the sum of the first $i$ digits in the sequence, as defined in Equation 4. For example, writing 1 for odd and 0 for even, the input sequence $1, 0, 1, 1$ has the parity state sequence $1, 1, 0, 1$. When querying the LLM with the prompt “$x_1 \dots x_t$ [EOI] $s_1 \dots s_{t-1}$ [END],” where [EOI] represents the end-of-input token, the expected response is the final parity state $s_t$.
Under these settings, during the ICI stage, a mover head transmits information from the [EOI] position to the [END] position. In the LR stage, an iteration head first reads the [EOI]’s position index from [END] and uses its query vector to ask, “Are you position $t$?” The key vector for each token responds, “I’m position $i$.” This querying process identifies the last digit in the input sequence, $x_t$, which, combined with $s_{t-1}$, allows the model to calculate $s_t$.
$$s_i = \left(\sum_{j=1}^{i} x_j\right) \bmod 2, \qquad i = 1, \dots, t \qquad \text{(Equation 4)}$$
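The iterative computation described for the parity problem can be sketched in a few lines. This is a plain-Python illustration of the state update, not a model of the attention computation itself. The function names `parity_states` and `next_state`, and the encoding of odd as 1 and even as 0, are assumptions made for the example.

```python
def parity_states(bits):
    """Return the running parity (1 = odd, 0 = even) after each digit."""
    states, state = [], 0
    for bit in bits:
        state = (state + bit) % 2  # one iteration step
        states.append(state)
    return states


def next_state(bits, previous_states):
    """Emulate a single reasoning step.

    The position of the digit to read is recovered from how many states
    have already been produced (the role played by the position query),
    and that digit is combined with the most recent state.
    """
    position = len(previous_states)            # index of the digit to read
    last = previous_states[-1] if previous_states else 0
    return (last + bits[position]) % 2


bits = [1, 0, 1, 1]
print(parity_states(bits))            # prints [1, 1, 0, 1]
print(next_state(bits, [1, 1, 0]))    # prints 1
```

The second function mirrors the division of labor in the text: locating the right input digit is a retrieval problem, while combining it with the previous state is the actual reasoning step.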
Further research has explored integrating multiple special attention heads into a cohesive working mechanism.21,75,96,99 Wang et al.,21 Merullo et al.,75 and Kim et al.96 have independently identified the collaborative mechanisms of attention heads, such as mover heads, induction heads, and inhibition heads, in different task scenarios, namely object identification and syllogistic reasoning. Their studies, all conducted on the GPT-2 model,6 have yielded remarkably similar conclusions regarding the information transfer patterns among several key attention heads. Here we take the IOI task, which tests the model’s ability to deduce the indirect object in a sentence, as an example. Figure 10 outlines the main collaboration process.
- (1) In the KR stage, the subject head and the relation head focus on “Mary” and “bought flowers for,” respectively, triggering the model to recall that the answer should be a human name.78
- (2) Then, in the ICI stage, the duplicate head identifies that “John” appears multiple times, while the name-mover head focuses on both “John” and “Mary” and moves them to the [END] position.
- (3) During the iterative stages of ICI and LR, the previous head and the induction head work together to attend to “John.” All this information is passed to the inhibition head, which suppresses the logit value of “John.”
- (4) Finally, in the EP stage, the amplification head boosts the logit value for “Mary.”
Figure 10.
Schematic diagram of the collaborative mechanism of different attention heads in IOI task
Each oval represents a specific attention head, and the color indicates the depth of the layer where the head is located.21 These colors are aligned with those in Figures 4 and 9.
In summary, attention heads in LLMs work collaboratively across stages such as KR, ICI, LR, and EP. This structured cooperation enables the model to solve complex tasks by effectively aligning and propagating relevant information through layers, further reflecting similarities between the working mechanisms of attention heads and the human brain.
Unveiling the discovery of attention heads
How can we uncover the specific functions of those special heads mentioned in Overview of special attention heads? In this section, we will unveil the discovery methods. Current research primarily employs experimental methods to validate the working mechanisms of those heads. We categorize the mainstream experimental approaches into two types based on whether they require the construction of new models: modeling free and modeling required. The classification scheme and method examples are shown in Figure 11.
Figure 11.
Pie chart of methods for exploring special attention heads and diagram of various methods
Modeling free
Modeling-free methods do not require setting up new models, making them widely applicable in interpretability research. These methods typically involve altering a latent state computed during the LLMs’ reasoning process and then using logit lens to map the intermediate results to token logits or probabilities. By calculating the logit (or probability) difference, researchers can infer the impact of the change. Modeling-free methods primarily include activation patching and ablation study. However, due to the frequent interchange of these terms in the literature, a new perspective is required to distinguish them. This paper further divides these methods into modification-based and replacement-based methods based on how the latent state representation is altered, as summarized in Table 4.
Table 4.
Brief summarization of modeling-free methods
| Type | Specific method | Core operation | Representative works |
|---|---|---|---|
| Modification based | directional addition | adding extra information to a specific component’s latent state | Tigges et al.,76 Yu et al.,62 and Turner et al.100 |
| | directional subtraction | subtracting part of the information from a specific component’s latent state | Tigges et al.76 and Geiger et al.101 |
| Replacement based | zero ablation | the component’s latent state is replaced with zero vectors | Wang et al.,21 Yu and Ananiadou,98 Jin et al.,60 Yao et al.,23 and Mohebbi et al.102 |
| | mean ablation | the component’s latent state is replaced with the mean state across all samples passing through it | McDougall et al.,74 Wang et al.,21 Kim et al.,96 and Hanna et al.103 |
| | naive activation patching | the component’s activation is replaced with the corresponding activation from a run on a corrupted prompt | Merullo et al.,75 Todd et al.,80 Wang et al.,21 Lieberum et al.,61 and Wiegreffe et al.97 |
Modification-based methods involve altering the values of a specific latent state while retaining some of the original information under the hypothesis that concepts are encoded as linear directions in the representation space.104 Directional addition retains part of the information in the original state and then directionally adds some additional information.
For instance, Tigges et al.76 input texts containing positive and negative sentiments into LLMs, obtaining positive and negative representations from the latent state. The difference between these two representations can be seen as a sentiment direction in the latent space. By adding this sentiment direction vector to the activation of the attention head, the effect on the output can be analyzed to determine whether the head has the ability to summarize sentiment. Similarly, Ortu et al.99 explored the competitive relationships between different mechanisms. They directionally amplified the attention score of one token toward another, allowing the latent representation to include more information about that token.
Conversely, directional subtraction retains part of the original state information while directionally removing some of it.105 This method can be used to investigate whether removing specific information from a latent state affects the model’s output in a significant way, thereby revealing whether certain attention heads can back up or fix the deleted information.
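Both modification-based operations can be sketched with toy vectors. The sketch assumes that the direction is the normalized difference of mean activations, and that subtraction is implemented by projecting the direction out of the activation. These are common choices, but they are assumptions here rather than the exact procedures of the cited works, and all function names are hypothetical.

```python
import math


def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def concept_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean negative to the mean positive activation."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def add_direction(activation, direction, alpha):
    """Directional addition: keep the original state and push it along the direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]


def remove_direction(activation, direction):
    """Directional subtraction: remove the component lying along the direction."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


direction = concept_direction(
    pos_acts=[[1.0, 2.0], [3.0, 2.0]],    # toy activations on positive texts
    neg_acts=[[-1.0, 2.0], [-3.0, 2.0]],  # toy activations on negative texts
)
head_activation = [0.5, 1.0]
print(add_direction(head_activation, direction, alpha=2.0))
print(remove_direction(head_activation, direction))
```

In an actual experiment, the modified activation would be written back into the forward pass and the change in output logits compared against the unmodified run.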
Replacement-based methods, in contrast to modification-based methods, discard all information in a specific latent state and replace it with other values. Zero ablation and mean ablation replace the original latent state with zero values or the mean value of latent states across all samples from a dataset, respectively. This can logically “eliminate” the head or cause it to lose its special function, allowing researchers to assess its importance.
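The following toy sketch illustrates zero and mean ablation together with the logit-difference measure. It makes a strong simplifying assumption: each head writes its contribution directly in logit space and the contributions are summed, so no forward pass needs to be re-run. All names and numbers are illustrative.

```python
def readout(head_outputs):
    """Toy readout: sum per-head contributions to get token logits."""
    return [sum(column) for column in zip(*head_outputs)]


def logit_diff(logits, correct, wrong):
    return logits[correct] - logits[wrong]


def ablate(head_outputs, head, replacement):
    """Return a copy of the run with one head's output replaced."""
    patched = list(head_outputs)
    patched[head] = replacement
    return patched


def mean_output(samples, head):
    """Mean output of one head across all samples."""
    outputs = [sample[head] for sample in samples]
    return [sum(values) / len(outputs) for values in zip(*outputs)]


# Two samples, two heads, two-token vocabulary (index 0 is the correct token).
samples = [
    [[2.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
]
run = samples[0]
baseline = logit_diff(readout(run), correct=0, wrong=1)
zeroed = logit_diff(readout(ablate(run, 0, [0.0, 0.0])), 0, 1)
averaged = logit_diff(readout(ablate(run, 0, mean_output(samples, 0))), 0, 1)
print(baseline, zeroed, averaged)
```

A large drop from the baseline after ablation suggests that the head matters for the behavior. Mean ablation is usually the gentler intervention because it keeps the activation within its typical range.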
Naive activation patching is the traditional patching method. It involves using a latent state obtained from a corrupted prompt to replace the original latent state at the corresponding position. For example, consider the original prompt “John and Mary went to the store.” Replacing “Mary” with “Alice” results in a corrupted prompt. By systematically replacing the latent state obtained under the original prompt with the one obtained under the corrupted prompt across each head, researchers can preliminarily determine which head has the ability to focus on names based on the magnitude of the impact.25,106 Alternatively, we can also replace the latent state obtained from the corrupted run with the original one. By doing so, we can observe how the head’s behavior shifts back toward the performance on the original prompt.
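The head-by-head replacement loop can be sketched as follows. As a simplification, heads are treated as independent contributors whose outputs are summed into logits. In a real model, patching one head changes the inputs of later components, so the forward pass must be re-run with a hook. The function names and values are hypothetical.

```python
def readout(head_outputs):
    """Toy readout: sum per-head contributions to get token logits."""
    return [sum(column) for column in zip(*head_outputs)]


def patching_effects(clean, corrupted, correct, wrong):
    """For each head, swap in its corrupted-run output and measure how much
    the clean logit difference (correct minus wrong) drops."""
    def diff(run):
        logits = readout(run)
        return logits[correct] - logits[wrong]

    baseline = diff(clean)
    effects = []
    for head in range(len(clean)):
        patched = list(clean)
        patched[head] = corrupted[head]  # replace only this head
        effects.append(baseline - diff(patched))
    return effects


# Vocabulary index 0 stands for "Mary", index 1 for "Alice".
clean_run = [[3.0, 0.0], [1.0, 1.0], [0.5, 0.2]]
corrupted_run = [[0.0, 3.0], [1.0, 1.0], [0.4, 0.2]]
effects = patching_effects(clean_run, corrupted_run, correct=0, wrong=1)
print(effects)
print(max(range(len(effects)), key=lambda h: effects[h]))  # prints 0
```

Here the first head carries almost all of the name information, so patching it produces by far the largest effect. The reverse direction, restoring clean activations into a corrupted run, uses the same loop with the two runs swapped.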
Modeling required
Modeling-required methods involve explicitly constructing models to delve deeper into the functions of specific heads. Based on whether the newly constructed models require training, we further categorize modeling-required methods into training-required and training-free methods, as summarized in Table 5.
Table 5.
Brief summarization of modeling-required methods
| Type | Specific method | Core operation | Representative works |
|---|---|---|---|
| Training required | probing | train a classifier to distinguish heads with different functions using activation values | Li et al.,88 Hoscilowicz et al.,89 Gould et al.,95 Guo et al.,90 Yang et al.,92 and Jin et al.107 |
| | simplified model training | train an approximate simplified model (e.g., a two-layer transformer or an attention-only model) | Edelman et al.,81 Cabannes et al.,94 Reddy,85 and Elhage et al.24 |
| Training free | scoring | calculate the score that reflects the relationship between the component’s attributes and LLM features | Jin et al.,60 Wu et al.,68 Crosbie,84 Yu et al.,62 and Ji-An et al.83 |
| | others | new methods that have not yet been widely adopted | Ferrando and Voita65 and Conmy et al.108 |
Training-required methods necessitate training the newly established models to explore mechanisms. Probing is a common training-based method. This approach extracts activation values from different heads as features and categorizes heads into different classes as labels. A classifier is then trained on these data to learn the relationship between the activation patterns and the head’s function. Subsequently, the trained classifier can serve as a probe to detect which heads within the LLMs possess which functions.88,92
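A probe of this kind can be as small as a logistic-regression classifier. The sketch below trains one with plain stochastic gradient descent on toy two-dimensional "activations". The data, the hyperparameters, and the function names `train_probe` and `probe_predict` are assumptions for illustration and are not taken from the cited works.

```python
import math


def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with per-sample gradient steps."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of class 1
            grad = p - y                     # gradient of the log loss w.r.t. z
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def probe_predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Toy activations: class 1 and class 0 differ mainly along the first dimension.
activations = [[2.0, 0.1], [1.8, -0.2], [-1.9, 0.3], [-2.1, 0.0]]
classes = [1, 1, 0, 0]
w, b = train_probe(activations, classes)
print([probe_predict(w, b, x) for x in activations])
```

High probe accuracy on held-out activations indicates that the relevant information is linearly readable from the head. It does not by itself show that the model uses that information, which is why probing is often paired with intervention experiments.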
Another approach involves training a simplified transformer model on a clean dataset for a specific task. Researchers investigate whether the heads in this simplified model exhibit certain functionalities and then extrapolate whether similar heads in the original model possess the same capabilities. This method reduces computational costs during training and analysis, while the constructed model remains simple and highly controllable.94
Training-free methods primarily involve designing scores that reflect specific phenomena. These scores can be viewed as mathematical models that construct an intrinsic relationship between the attributes of components and certain model characteristics or behaviors. For instance, when investigating retrieval heads, Wu et al.68 defined a retrieval score. This score represents the frequency with which a head assigns the highest attention score to the token it aims to retrieve across a sample set, as shown in Equation 5. A high retrieval score indicates that the head possesses a strong needle-in-a-haystack ability.
Similarly, when exploring negative heads, Yu et al.62 introduced the negative attention score (NAS), as shown in Equation 6. Here, i denotes the i-th token in the input prompt, and the two position indices refer to the positions of “Yes” and “No” in the prompt, respectively. A high NAS suggests that the head focuses more on negative tokens during decision making, making it prone to generating negative signals.
| (Equation 5) |
| (Equation 6) |
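One way to read the verbal description of the retrieval score is as a hit rate over samples. The sketch below computes it for toy attention rows. It formalizes only the sentence above and is not the exact definition from Wu et al.; the function name `retrieval_score` and the input layout are assumptions.

```python
def retrieval_score(attention_rows, target_positions):
    """Fraction of samples in which the head's largest attention weight
    falls on the token it is supposed to retrieve.

    attention_rows:   one list of attention weights per sample
    target_positions: index of the token to retrieve in each sample
    """
    hits = 0
    for weights, target in zip(attention_rows, target_positions):
        top = max(range(len(weights)), key=weights.__getitem__)
        if top == target:
            hits += 1
    return hits / len(attention_rows)


rows = [
    [0.1, 0.7, 0.2],  # top weight on position 1 (the target): hit
    [0.6, 0.3, 0.1],  # top weight on position 0, target is 1: miss
    [0.2, 0.2, 0.6],  # top weight on position 2 (the target): hit
]
print(retrieval_score(rows, target_positions=[1, 1, 2]))
```

Scores of this type need no training, which makes them cheap to compute for every head in a model and easy to compare across heads.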
In addition to scoring, researchers have proposed other novel training-free modeling methods. Ferrando and Voita65 introduced the concept of an information flow graph (IFG), where nodes represent tokens and edges represent information transfer between tokens via attention heads or FFNs. By calculating and filtering the importance of each edge to the node it points to, key edges can be selected to form a subgraph. This subgraph can then be viewed as the primary internal mechanism through which LLMs perform reasoning.
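The edge-filtering idea can be sketched with a dictionary of weighted edges. The sketch assumes that an edge's importance is its share of the total importance flowing into its destination node, and that edges below a fixed threshold are dropped. The importance values, the threshold, and the function name `key_subgraph` are illustrative and do not reproduce the procedure of Ferrando and Voita.

```python
def key_subgraph(edges, threshold):
    """Keep edges whose share of their destination's incoming importance
    is at least `threshold`.

    edges: dict mapping (source, destination) to a non-negative importance
    Returns a dict mapping each kept edge to its normalized share.
    """
    incoming_total = {}
    for (_, dst), importance in edges.items():
        incoming_total[dst] = incoming_total.get(dst, 0.0) + importance

    kept = {}
    for (src, dst), importance in edges.items():
        total = incoming_total[dst]
        share = importance / total if total > 0 else 0.0
        if share >= threshold:
            kept[(src, dst)] = share
    return kept


toy_edges = {
    ("Mary", "[END]"): 0.6,
    ("John", "[END]"): 0.3,
    ("went", "[END]"): 0.1,
    ("John", "Mary"): 0.2,
}
print(sorted(key_subgraph(toy_edges, threshold=0.25)))
```

Raising the threshold yields a sparser subgraph that is easier to inspect, at the cost of possibly discarding edges that contribute in aggregate.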
Evaluation
This section summarizes the benchmarks and datasets used in the interpretability research of attention heads. Based on the different evaluation goals during the mechanism exploration process, we categorize them into two types: mechanism exploration evaluation and common evaluation. The former is designed to evaluate the working mechanisms of specific attention heads, while the latter assesses whether enhancing or suppressing the functions of certain special heads can improve the overall performance of LLMs.
Mechanism exploration evaluation
To delve deeper into the internal reasoning paths of LLMs, many researchers have synthesized new datasets based on existing benchmarks. The primary feature of these datasets is the simplification of problem difficulty, with elements unrelated to interpretability, such as problem length and query format, being standardized. As shown in Table 6, these datasets essentially evaluate the model’s knowledge reasoning and KR capabilities, but they simplify the answers from the paragraph level to the token level.
Table 6.
Selected benchmarks for mechanism exploration evaluation
| Benchmark | Type | Main task | Source | Release date |
|---|---|---|---|---|
| LRE109 | knowledge recalling | infer object entities given subject-entity prompts | Massachusetts Institute of Technology | 2023.09 |
| ToyMovieReview76 | sentiment analysis | infer positive/negative sentiment | EleutherAI | 2023.10 |
| ToyMoodStory76 | sentiment analysis | infer positive/negative sentiment | EleutherAI | 2023.10 |
| FV capitalize80 | token-level reasoning | infer the capital letter given some words | Northeastern University | 2023.10 |
| ICL-MC81 | token-level reasoning | infer next state based on in-context | Harvard | 2024.02 |
| Succession95 | arithmetic reasoning | infer next number in an incremental sequence | Cambridge | 2023.12 |
| Iteration synthetic94 | arithmetic reasoning | infer the next state of an iterative process | Meta | 2024.06 |
| Omniglot110 | word-level reasoning | infer label from few samples | New York University (NYU) | 2015.12 |
| IOI21 | word-level reasoning | infer the indirect object | University of California, Berkeley (UCB) | 2022.11 |
| Colored object75 | word-level reasoning | infer the correct color of a material | Brown University | 2023.10 |
| World capital60 | word-level reasoning | infer the capital city given a country | University of Chinese Academy of Sciences | 2024.02 |
Taking the exploration of sentiment-related heads as an example, Tigges et al.76 created the ToyMovieReview and ToyMoodStory datasets, with specific prompt templates illustrated in Figure 12. Using these datasets, researchers employed sampling methods to calculate the activation differences of each head for positive and negative sentiments. This allowed them to recognize heads with significant differences as potential candidates for the role of sentiment summarizers.
Figure 12.
Prompt template for ToyMovieReview and ToyMoodStory dataset
For example, the adjective in the figure could be “fantastic”/“horrible,” and the verb could be “like”/“dislike.”
Common evaluation
The exploration of attention-head mechanisms is ultimately aimed at improving the comprehensive capabilities of LLMs. Many researchers, upon identifying a head with a specific function, have attempted to modify that type of head—such as by enhancing or diminishing its activation—to observe whether the LLMs’ responses become more accurate and useful. We classify these common evaluation benchmarks based on their evaluation focus, as shown in Table 7. The special attention heads discussed in this paper are closely related to improving LLMs’ abilities in five key areas: knowledge reasoning, logic reasoning, sentiment analysis, long context retrieval, and text comprehension.
Table 7.
Selected benchmarks for common evaluation
| Benchmark | Type | Main task | Source | Release date |
|---|---|---|---|---|
| MMLU111 | knowledge reasoning | solve problems with widespread knowledge | UCB | 2020.09 |
| TruthfulQA112 | knowledge reasoning | answer questions that span 38 categories | Oxford | 2021.09 |
| LogiQA113 | logic reasoning | deduce the answer of logical problems | Fudan University | 2020.07 |
| MQuAKE114 | logic reasoning | deduce the answer via multi-hop reasoning | Princeton | 2023.05 |
| SST/SST2115 | sentiment analysis | infer positive/negative sentiment | Stanford | 2013.10 |
| ETHOS116 | sentiment analysis | detect hate speech in online comments | Aristotle University of Thessaloniki | 2020.06 |
| needle in a haystack | long context retrieval | retrieve content from long context | GitHub | 2023.11 |
| AG News117 | text comprehension | infer the category of news | NYU | 2015.02 |
| TriviaQA118 | text comprehension | answer questions based on documents | University of Washington (UoW) | 2017.05 |
| AGENDA119 | text comprehension | generate the abstract of a passage | UoW | 2019.04 |
Additional topics
In this section, we summarize various works related to LLM interpretability. Although these works may not identify new special heads as discussed in Overview of special attention heads, they delve into the underlying mechanisms of LLMs from other perspectives. We elaborate on these studies under two categories: FFN interpretability and machine psychology.
FFN interpretability
As discussed in Background, apart from attention heads, FFNs also play a significant role in the LLM reasoning process. This section primarily summarizes research focused on the mechanisms of FFNs and the collaborative interactions between attention heads and FFNs.
One of the primary functions of FFNs is to store knowledge acquired during the pre-training phase. Dai et al.120 proposed that factual knowledge stored within the model is often concentrated in a few neurons of the MLP, reflecting the sparsity of the model.121 Geva et al.122 observed that the neurons in the FFN of GPT models can be likened to key-value pairs, where specific keys can retrieve corresponding values, i.e., knowledge. Lv et al.123 discovered a hierarchical storage of knowledge within the model’s FFN, with lower layers storing syntactic and semantic information and higher layers storing more concrete factual content.
FFNs effectively complement the capabilities of attention heads across the four stages described in Overview of special attention heads. The collaboration between FFNs and attention heads enhances the overall capabilities of LLMs. Geva et al.124 proposed that attention heads and FFNs can work together to enrich the representation of a subject and then extract its related attributes, thus facilitating factual information retrieval during the KR stage. Stolfo et al.125 found that, unlike attention heads, which focus on global information and perform aggregation, FFNs focus only on a single representation and perform local updates. This complementary functionality allows them to explore textual information both in breadth (attention heads) and depth (FFNs).
In summary, each component of LLMs plays a crucial role in the reasoning process. The individual contributions of these components, combined with their interactions, accomplish the entire process from KR to expression.
Machine psychology
Current research on LLM interpretability often draws parallels between the reasoning processes of these models and human thinking. This suggests the need for a more unified framework that connects LLMs with human cognition. The concept of machine psychology has emerged to fill this gap,126 exploring the cognitive activities of AI through psychological paradigms.
Recently, Hagendorff127 and Johansson et al.128 have proposed different approaches to studying machine psychology. Hagendorff’s work focuses on using psychological methods to identify new abilities in LLMs, such as heuristics and biases, social interactions, language understanding, and learning. His research suggests that LLMs display human-like cognitive patterns, which can be analyzed to improve AI interpretability and performance.
Johansson’s framework integrates principles of operant conditioning129 with AI systems, emphasizing adaptability and learning from environmental interactions. This approach aims to bridge gaps in artificial general intelligence (AGI) research by combining insights from psychology, cognitive science, and neuroscience.
Overall, machine psychology provides a new perspective for analyzing LLMs. Psychological experiments and behavioral analyses may lead to new discoveries about these models. As LLMs are increasingly applied across various domains of society, understanding their behavior through a psychological lens becomes increasingly important, which offers valuable insights for developing more intelligent AI systems.
Discussion
Limitations in existing studies
Although substantial progress has been made in uncovering the internal mechanisms of LLMs, several key limitations persist in existing research. These can be summarized as follows:
- (1) Lack of task generalizability. Current research primarily explores simple application scenarios that are limited to specific types of tasks. For example, Wang et al.21 and Merullo et al.75 have identified reasoning circuits in LLMs through tasks such as the IOI task and the color object task. However, these circuits have not been validated across other tasks, making it challenging to determine whether these mechanisms are universally applicable.
- (2) Lack of mechanism transferability. As shown in Figure 8, many discovered special heads have only been explored within a few specific LLMs, or even on custom-built toy models. This raises a critical question: does a specialized head identified in one LLM exhibit the same functionality in another LLM? Current research lacks investigations into the transferability of such mechanisms across different model series.
- (3) Limited focus on multi-head collaboration. Most studies investigate the mechanisms of individual attention heads, with only a few researchers studying the collaborative relationships among multiple heads. Consequently, existing work lacks a comprehensive framework for understanding the coordinated functioning of all attention heads in LLMs and for drawing analogies to the human brain.
- (4) Absence of theoretical support. Many studies propose hypotheses about circuits based on observed phenomena and validate these hypotheses through experiments. However, this approach cannot establish the theoretical soundness of the mechanisms, nor can it determine whether the observed mechanisms are merely coincidental.
Future directions and challenges
Building on the limitations discussed above and the content presented earlier, this paper outlines several potential research directions for the future:
- (1) Exploring mechanisms in more complex tasks. Investigate whether certain attention heads possess special functions in more complex tasks, such as open-ended question answering,130,131 math problems,132,133 and tool-using tasks.134
- (2) Robustness of mechanisms to prompts. Research has shown that current LLMs are highly sensitive to prompts, with slight changes potentially leading to opposite responses.135 Future work could analyze this phenomenon through the lens of attention-head mechanisms and propose solutions to mitigate this issue.
- (3) Developing new experimental methods. Explore new experimental approaches, such as designing experiments to verify whether a particular mechanism is indivisible or whether it has universal applicability.
- (4) Building a comprehensive interpretability framework. This framework should encompass both the independent and collaborative functioning mechanisms of most attention heads and other components.
- (5) Integrating machine psychology. Incorporate insights from machine psychology to construct an internal mechanism framework for LLMs from an anthropomorphic perspective, understanding the gaps between current LLMs and human cognition and guiding targeted improvements.
Limitation
Current research on the interpretability of LLMs’ attention heads is relatively scattered, primarily focusing on the functions of individual heads, while lacking a rigorous definition of the overall framework. As a result, the categorization of attention head functions from the perspective of human cognitive behavior in this paper may not be perfectly orthogonal, potentially leading to some overlap between different stages.
Resource availability
Lead contact
Additional information, questions, and requests should be directed to the lead contact, Dr. Zhiyu Li (lizy@iaar.ac.cn).
Materials availability
Not applicable, as no new unique reagents were generated.
Data and code availability
Our reference list is available at GitHub (https://github.com/IAAR-Shanghai/Awesome-Attention-Heads) and has been archived at Zenodo.136
Acknowledgments
This work was supported by the funding from the Research Institute of China Telecom, Beijing, China. We extend our gratitude to all team members and partners involved in this study for their support and contributions. Furthermore, we sincerely appreciate the valuable comments and suggestions provided by the reviewers of this paper.
Author contributions
Conceptualization, Z.Z., Y.W., and S.S.; planning, Z.Z.; investigation, Z.Z. and Y.W.; original draft, Z.Z. and Y.H.; visualization, Z.Z. and Y.H.; review & editing, all authors; project administration, M.Y., B.T., F.X., and Z.L.; supervision, Z.L.
Declaration of interests
The authors declare no competing interests.
References
- 1. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017;30:1–11.
- 2. Gilpin L.H., Bau D., Yuan B.Z., Bajwa A., Specter M., Kagal L. 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). IEEE; 2018. Explaining explanations: An overview of interpretability of machine learning; pp. 80–89.
- 3. Lipton Z.C. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue. 2018;16:31–57.
- 4. Montavon G., Samek W., Müller K.-R. Methods for interpreting and understanding deep neural networks. Digit. Signal Process. 2018;73:1–15.
- 5. Devlin J., Chang M.-W., Lee K., Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint at arXiv. 2018 doi: 10.48550/arXiv.1810.04805.
- 6. Radford A., Wu J., Child R., Luan D., Amodei D., Sutskever I. Language models are unsupervised multitask learners. OpenAI blog. 2019;1:9.
- 7. Chuang Y.-S., Qiu L., Hsieh C.-Y., Krishna R., Kim Y., Glass J. Lookback lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2407.07071.
- 8. Li H., Chi H., Liu M., Yang W. Look within, why llms hallucinate: A causal perspective. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2407.10153.
- 9. Ji Z., Chen D., Ishii E., Cahyawijaya S., Bang Y., Wilie B., Fung P. Llm internal states reveal hallucination risk faced with a query. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2407.03282.
- 10. Kovaleva O., Romanov A., Rogers A., Rumshisky A. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics; 2019. Revealing the dark secrets of BERT; pp. 4365–4374.
- 11. Wang A., Cho K. Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation. Association for Computational Linguistics; 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model; pp. 30–36.
- 12. Pande M., Budhraja A., Nema P., Kumar P., Khapra M.M. Vol. 35. 2021. The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in bert; pp. 13613–13621. (Proceedings of the AAAI Conference on Artificial Intelligence).
- 13. Liu L., Liu X., Gao J., Chen W., Han J. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2020. Understanding the difficulty of training transformers; pp. 5747–5763.
- 14. Xiong R., Yang Y., He D., Zheng K., Zheng S., Xing C., Zhang H., Lan Y., Wang L., Liu T. International Conference on Machine Learning. PMLR; 2020. On layer normalization in the transformer architecture; pp. 10524–10533.
- 15. Su J., Ahmed M., Lu Y., Pan S., Bo W., Liu Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing. 2024;568.
- 16. Shazeer N. Glu variants improve transformer. Preprint at arXiv. 2020 doi: 10.48550/arXiv.2002.05202.
- 17. Fedus W., Zoph B., Shazeer N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 2022;23:1–39.
- 18. Cai W., Jiang J., Wang F., Tang J., Kim S., Huang J. A survey on mixture of experts. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2407.06204.
- 19. Olah C., Cammarata N., Schubert L., Goh G., Petrov M., Carter S. Zoom in: An introduction to circuits. Distill. 2020;5.
- 20. Geiger A., Lu H., Icard T., Potts C. Vol. 34. Curran Associates, Inc.; 2021. Causal abstractions of neural networks; pp. 9574–9586. (Advances in Neural Information Processing Systems).
- 21. Wang K.R., Variengien A., Conmy A., Shlegeris B., Steinhardt J. The Eleventh International Conference on Learning Representations. 2023. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small; pp. 1–21.
- 22. Vig J., Gehrmann S., Belinkov Y., Qian S., Nevo D., Singer Y., Shieber S. Investigating gender bias in language models using causal mediation analysis. Adv. Neural Inf. Process. Syst. 2020;33:12388–12401.
- 23. Yao Y., Zhang N., Xi Z., Wang M., Xu Z., Deng S., Chen H. Knowledge circuits in pretrained transformers. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2405.17969.
- 24. Elhage N., Nanda N., Olsson C., Henighan T., Joseph N., Mann B., Askell A., Bai Y., Chen A., Conerly T., et al. A mathematical framework for transformer circuits. Transformer Circuits Thread. 2021;1:12.
- 25. Heimersheim S., Nanda N. How to use and interpret activation patching. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2404.15255.
- 26. Räuker T., Ho A., Casper S., Hadfield-Menell D. 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE; 2023. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks; pp. 464–483.
- 27. Gonçalves T., Rio-Torto I., Teixeira L.F., Cardoso J.S. A survey on attention mechanisms for medical applications: are we moving toward better algorithms? IEEE Access. 2022;10:98909–98935.
- 28. Santana A., Colombini E. Neural attention models in deep learning: Survey and taxonomy. Preprint at arXiv. 2021 doi: 10.48550/arXiv.2112.05909.
- 29. Chaudhari S., Mithal V., Polatkan G., Ramanath R. An attentive survey of attention models. ACM Trans. Intell. Syst. Technol. 2021;12:1–32.
- 30. Brauwers G., Frasincar F. A general survey on attention mechanisms in deep learning. IEEE Trans. Knowl. Data Eng. 2023;35:3279–3298.
- 31. Luo H., Specia L. From understanding to utilization: A survey on explainability for large language models. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2401.12874.
- 32. Kaplan J., McCandlish S., Henighan T., Brown T.B., Chess B., Child R., Gray S., Radford A., Wu J., Amodei D. Scaling laws for neural language models. Preprint at arXiv. 2020 doi: 10.48550/arXiv.2001.08361.
- 33. Bommasani R., Hudson D.A., Adeli E., Altman R., Arora S., von Arx S., Bernstein M.S., Bohg J., Bosselut A., Brunskill E., et al. On the opportunities and risks of foundation models. Preprint at arXiv. 2021 doi: 10.48550/arXiv.2108.07258.
- 34. Liang X., Song S., Zheng Z., Wang H., Yu Q., Li X., Li R.-H., Xiong F., Li Z. Internal consistency and self-feedback in large language models: A survey. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2407.14507.
- 35. Rouault M., McWilliams A., Allen M.G., Fleming S.M. Human metacognition across domains: insights from individual differences and neuroimaging. Personal. Neurosci. 2018;1 doi: 10.1017/pen.2018.16.
- 36. Dasgupta I., Lampinen A.K., Chan S.C., Creswell A., Kumaran D., McClelland J.L., Hill F. Language models show human-like content effects on reasoning. Preprint at arXiv. 2022 doi: 10.48550/arXiv.2207.07051.
- 37. Li Y., Michaud E.J., Baek D.D., Engels J., Sun X., Tegmark M. The geometry of concepts: Sparse autoencoder feature structure. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2410.19750.
- 38. Janik R.A. Aspects of human memory and large language models. Preprint at arXiv. 2023 doi: 10.48550/arXiv.2311.03839.
- 39. Schrimpf M., Blank I.A., Tuckute G., Kauf C., Hosseini E.A., Kanwisher N., Tenenbaum J.B., Fedorenko E. The neural architecture of language: Integrative modeling converges on predictive processing. Proc. Natl. Acad. Sci. USA. 2021;118 doi: 10.1073/pnas.2105646118.
- 40. Marjieh R., Sucholutsky I., van Rijn P., Jacoby N., Griffiths T.L. Large language models predict human sensory judgments across six modalities. Sci. Rep. 2024;14 doi: 10.1038/s41598-024-72071-1.
- 41. Mischler G., Li Y.A., Bickel S., Mehta A.D., Mesgarani N. Contextual feature extraction hierarchies converge in large language models and the brain. Nat. Mach. Intell. 2024;6:1467–1477.
- 42. Millidge B., Seth A., Buckley C.L. Predictive coding: a theoretical and experimental review. Preprint at arXiv. 2021 doi: 10.48550/arXiv.2107.12979.
- 43. Wang Y. The oar model of neural informatics for internal knowledge representation in the brain. Int. J. Cognit. Inf. Nat. Intell. 2007;1:66–77.
- 44. Wang Y., Chiew V. On the cognitive process of human problem solving. Cognit. Syst. Res. 2010;11:81–92.
- 45. Anderson J.R. Psychology Press; 2014. Rules of the Mind.
- 46. Whitehill J. Understanding act-r-an outsider’s perspective. Preprint at arXiv. 2013 doi: 10.48550/arXiv.1306.0125.
- 47. Laird J.E. An analysis and comparison of act-r and soar. Preprint at arXiv. 2022 doi: 10.48550/arXiv.2201.09305.
- 48. Squire L.R. Memory and the hippocampus: a synthesis from findings with rats, monkeys, and humans. Psychol. Rev. 1992;99:195–231. doi: 10.1037/0033-295x.99.2.195.
- 49. Tulving E. Organization of Memory. Academic Press; 1972. Episodic and semantic memory; p. 381.
- 50. Sartori G., Orrù G. Language models and psychological sciences. Front. Psychol. 2023;14 doi: 10.3389/fpsyg.2023.1279317.
- 51. Kintsch W. The role of knowledge in discourse comprehension: a construction-integration model. Psychol. Rev. 1988;95:163–182. doi: 10.1037/0033-295x.95.2.163.
- 52. Chomsky N. Vol. 11. MIT Press; 2014. (Aspects of the Theory of Syntax).
- 53. Jackendoff R.S. Vol. 18. MIT Press; 1992. (Semantic Structures).
- 54. Dehaene S. OUP; 2011. The Number Sense: How the Mind Creates Mathematics.
- 55. Johnson-Laird P. Towards a cognitive science of language, inference, and consciousness. Harvard University Press; 1983. Mental models.
- 56. Levelt W.J. Models of word production. Trends Cognit. Sci. 1999;3:223–232. doi: 10.1016/s1364-6613(99)01319-4.
- 57. Winata G.I., Madotto A., Lin Z., Liu R., Yosinski J., Fung P. Proceedings of the 1st Workshop on Multilingual Representation Learning. Association for Computational Linguistics; 2021. Language models are few-shot multilingual learners; pp. 1–15.
- 58. Bietti A., Cabannes V., Bouchacourt D., Jegou H., Bottou L. Birth of a transformer: A memory viewpoint. Adv. Neural Inf. Process. Syst. 2024;36:1560–1588.
- 59. Dana L., Pydi M.S., Chevaleyre Y. Memorization in attention-only transformers. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2411.10115.
- 60. Jin Z., Cao P., Yuan H., Chen Y., Xu J., Li H., Jiang X., Liu K., Zhao J. Findings of the Association for Computational Linguistics ACL 2024. Association for Computational Linguistics; 2024. Cutting off the head ends the conflict: A mechanism for interpreting and mitigating knowledge conflicts in language models; pp. 1193–1215.
- 61. Lieberum T., Rahtz M., Kramár J., Irving G., Shah R., Mikulik V. Does circuit analysis interpretability scale? evidence from multiple choice capabilities in chinchilla. Preprint at arXiv. 2023 doi: 10.48550/arXiv.2307.09458.
- 62. Yu S., Song J., Hwang B., Kang H., Cho S., Choi J., Joe S., Lee T., Gwon Y.L., Yoon S. Correcting negative bias in large language models through negative attention score alignment. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2408.00137.
- 63. Olsson C., Elhage N., Nanda N., Joseph N., DasSarma N., Henighan T., Mann B., Askell A., Bai Y., Chen A., et al. In-context learning and induction heads. Preprint at arXiv. 2022 doi: 10.48550/arXiv.2209.11895.
- 64. Nanda N., Rajamanoharan S., Kramár J., Shah R. Fact finding: Attempting to reverse-engineer factual recall on the neuron level. Alignment Forum. 2023 https://www.alignmentforum.org/s/hpWHhjvjn67LJ4xXX/p/iGuwZTHWb6DFY3sKB.
- 65. Ferrando J., Voita E. Information flow routes: Automatically interpreting language models at scale. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2403.00824.
- 66. Voita E., Talbot D., Moiseev F., Sennrich R., Titov I. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned; pp. 5797–5808.
- 67. Raganato A., Tiedemann J. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics; 2018. An analysis of encoder representations in transformer-based machine translation; pp. 287–297.
- 68. Wu W., Wang Y., Xiao G., Peng H., Fu Y. Retrieval head mechanistically explains long-context factuality. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2404.15574.
- 69. Tang H., Lin Y., Lin J., Han Q., Hong S., Yao Y., Wang G. Razorattention: Efficient kv cache compression through retrieval heads. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2407.15891.
- 70. Fu T., Huang H., Ning X., Zhang G., Chen B., Wu T., Wang H., Huang Z., Li S., Yan S., et al. Moa: Mixture of sparse attention for automatic large language model compression. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2406.14909.
- 71. Chen A., Shwartz-Ziv R., Cho K., Leavitt M.L., Saphra N. The Twelfth International Conference on Learning Representations. 2024. Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in MLMs; pp. 1–32.
- 72. Correia G.M., Niculae V., Martins A.F.T. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics; 2019. Adaptively sparse transformers; pp. 2174–2184.
- 73. García-Carrasco J., Maté A., Trujillo J.C. International Conference on Artificial Intelligence and Statistics. PMLR; 2024. How does gpt-2 predict acronyms? extracting and understanding a circuit via mechanistic interpretability; pp. 3322–3330.
- 74. McDougall C., Conmy A., Rushing C., McGrath T., Nanda N. Copy suppression: Comprehensively understanding an attention head. Preprint at arXiv. 2023 doi: 10.48550/arXiv.2310.04625.
- 75. Merullo J., Eickhoff C., Pavlick E. The Twelfth International Conference on Learning Representations. 2024. Circuit component reuse across tasks in transformer language models; pp. 1–29.
- 76. Tigges C., Hollinsworth O.J., Geiger A., Nanda N. Linear representations of sentiment in large language models. Preprint at arXiv. 2023 doi: 10.48550/arXiv.2310.15154.
- 77. Ren J., Guo Q., Yan H., Liu D., Zhang Q., Qiu X., Lin D. Findings of the Association for Computational Linguistics ACL 2024. Association for Computational Linguistics; 2024. Identifying semantic induction heads to understand in-context learning; pp. 6916–6932.
- 78. Chughtai B., Cooney A., Nanda N. Summing up the facts: Additive mechanisms behind factual recall in llms. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2402.07321.
- 79. Pan J., Gao T., Chen H., Chen D. What in-context learning “learns” in-context: Disentangling task recognition and task learning. Preprint at arXiv. 2023 doi: 10.48550/arXiv.2305.09731.
- 80. Todd E., Li M., Sharma A.S., Mueller A., Wallace B.C., Bau D. The Twelfth International Conference on Learning Representations. 2024. Function vectors in large language models; pp. 1–52.
- 81. Edelman B.L., Edelman E., Goel S., Malach E., Tsilivis N. The evolution of statistical induction heads: In-context learning markov chains. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2402.11004.
- 82. Singh A.K., Moskovitz T., Hill F., Chan S.C., Saxe A.M. Vol. 235. PMLR; 2024. What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation; pp. 45637–45662. (Proceedings of the 41st International Conference on Machine Learning).
- 83. Ji-An L., Zhou C.Y., Benna M.K., Mattar M.G. Linking in-context learning in transformers to human episodic memory. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2405.14992.
- 84. Crosbie J. Induction heads as an essential mechanism for pattern matching in in-context learning. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2407.07011.
- 85. Reddy G. The Twelfth International Conference on Learning Representations. 2024. The mechanistic basis of data dependence and abrupt learning in an in-context classification task; pp. 1–14.
- 86. Akyürek E., Wang B., Kim Y., Andreas J. Vol. 235. PMLR; 2024. In-context language learning: Architectures and algorithms; pp. 787–812. (Proceedings of the 41st International Conference on Machine Learning).
- 87. Yu Z., Ananiadou S. How do large language models learn in-context? query and key matrices of in-context heads are two towers for metric learning. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2402.02872.
- 88. Li K., Patel O., Viégas F., Pfister H., Wattenberg M. Inference-time intervention: Eliciting truthful answers from a language model. Adv. Neural Inf. Process. Syst. 2024;36:41451–41530.
- 89. Hoscilowicz J., Wiacek A., Chojnacki J., Cieslak A., Michon L., Urbanevych V., Janicki A. Nl-iti: Optimizing probing and intervention for improvement of iti method. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2403.18680.
- 90. Guo P., Ren Y., Hu Y., Cao Y., Li Y., Huang H. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024. Steering large language models for cross-lingual information retrieval; pp. 585–596.
- 91. Yin F., Ye X., Durrett G. Lofit: Localized fine-tuning on llm representations. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2406.01563.
- 92. Yang J., Chen D., Sun Y., Li R., Feng Z., Peng W. Findings of the Association for Computational Linguistics ACL 2024. Association for Computational Linguistics; 2024. Enhancing semantic consistency of large language models through model editing: An interpretability-oriented approach; pp. 3343–3353.
- 93. García-Carrasco J., Maté A., Trujillo J. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24. International Joint Conferences on Artificial Intelligence Organization; 2024. Detecting and understanding vulnerabilities in language models via mechanistic interpretability; pp. 385–393.
- 94. Cabannes V., Arnal C., Bouaziz W., Yang A., Charton F., Kempe J. Iteration head: A mechanistic study of chain-of-thought. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2406.02128.
- 95. Gould R., Ong E., Ogden G., Conmy A. The Twelfth International Conference on Learning Representations. 2024. Successor heads: Recurring, interpretable attention heads in the wild; pp. 1–26.
- 96. Kim G., Valentino M., Freitas A. A mechanistic interpretation of syllogistic reasoning in auto-regressive language models. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2408.08590.
- 97. Wiegreffe S., Tafjord O., Belinkov Y., Hajishirzi H., Sabharwal A. Answer, assemble, ace: Understanding how transformers answer multiple choice questions. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2407.15018.
- 98. Tanneru S.H., Ley D., Agarwal C., Lakkaraju H. Trustworthy Multi-modal Foundation Models and AI Agents (TiFA). 2024. On the difficulty of faithful chain-of-thought reasoning in large language models; pp. 1–16.
- 99. Ortu F., Jin Z., Doimo D., Sachan M., Cazzaniga A., Schölkopf B. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2024. Competition of mechanisms: Tracing how language models handle facts and counterfactuals; pp. 8420–8436.
- 100. Turner A.M., Thiergart L., Leech G., Udell D., Vazquez J.J., Mini U., MacDiarmid M. Activation addition: Steering language models without optimization. Preprint at arXiv. 2023 doi: 10.48550/arXiv.2308.10248.
- 101. Geiger A., Wu Z., Potts C., Icard T., Goodman N. Causal Learning and Reasoning. 2024. Finding alignments between interpretable causal variables and distributed neural representations; pp. 160–187.
- 102. Mohebbi H., Zuidema W., Chrupała G., Alishahi A. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics; 2023. Quantifying context mixing in transformers; pp. 3378–3400.
- 103. Hanna M., Liu O., Variengien A. Vol. 36. Curran Associates, Inc.; 2023. How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model; pp. 76033–76060. (Advances in Neural Information Processing Systems).
- 104. Park K., Choe Y.J., Veitch V. Vol. 235. PMLR; 2024. The linear representation hypothesis and the geometry of large language models; pp. 39643–39666. (Proceedings of the 41st International Conference on Machine Learning).
- 105. Liang X., Wang H., Wang Y., Song S., Yang J., Niu S., Hu J., Liu D., Yao S., Xiong F., et al. Controllable text generation for large language models: A survey. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2408.12599.
- 106. Zhang F., Nanda N. The Twelfth International Conference on Learning Representations. 2024. Towards best practices of activation patching in language models: Metrics and methods; pp. 1–28.
- 107. Jin M., Yu Q., Huang J., Zeng Q., Wang Z., Hua W., Zhao H., Mei K., Meng Y., Ding K., et al. Exploring concept depth: How large language models acquire knowledge at different layers? Preprint at arXiv. 2024 doi: 10.48550/arXiv.2404.07066.
- 108. Conmy A., Mavor-Parker A., Lynch A., Heimersheim S., Garriga-Alonso A. Vol. 36. Curran Associates, Inc.; 2023. Towards automated circuit discovery for mechanistic interpretability; pp. 16318–16352. (Advances in Neural Information Processing Systems).
- 109. Hernandez E., Sharma A.S., Haklay T., Meng K., Wattenberg M., Andreas J., Belinkov Y., Bau D. The Twelfth International Conference on Learning Representations. 2024. Linearity of relation decoding in transformer language models; pp. 1–23.
- 110. Lake B.M., Salakhutdinov R., Tenenbaum J.B. Human-level concept learning through probabilistic program induction. Science. 2015;350:1332–1338. doi: 10.1126/science.aab3050.
- 111. Hendrycks D., Burns C., Basart S., Zou A., Mazeika M., Song D., Steinhardt J. International Conference on Learning Representations. 2021. Measuring massive multitask language understanding; pp. 1–27.
- 112. Lin S., Hilton J., Evans O. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2022. TruthfulQA: Measuring how models mimic human falsehoods; pp. 3214–3252.
- 113. Liu J., Cui L., Liu H., Huang D., Wang Y., Zhang Y. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20. International Joint Conferences on Artificial Intelligence Organization; 2020. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning; pp. 3622–3628.
- 114. Zhong Z., Wu Z., Manning C., Potts C., Chen D. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2023. MQuAKE: Assessing knowledge editing in language models via multi-hop questions; pp. 15686–15702.
- 115. Socher R., Perelygin A., Wu J., Chuang J., Manning C.D., Ng A.Y., Potts C. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013. Recursive deep models for semantic compositionality over a sentiment treebank; pp. 1631–1642.
- 116. Mollas I., Chrysopoulou Z., Karlos S., Tsoumakas G. Ethos: an online hate speech detection dataset. Preprint at arXiv. 2022;8:4663–4678. doi: 10.1007/s40747-021-00608-2.
- 117. Zhang X., Zhao J., LeCun Y. Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 2015;28:1–9.
- 118. Joshi M., Choi E., Weld D., Zettlemoyer L. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension; pp. 1601–1611.
- 119. Koncel-Kedziorski R., Bekal D., Luan Y., Lapata M., Hajishirzi H. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics; 2019. Text Generation from Knowledge Graphs with Graph Transformers; pp. 2284–2293.
- 120. Dai D., Dong L., Hao Y., Sui Z., Chang B., Wei F. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2022. Knowledge neurons in pretrained transformers; pp. 8493–8502.
- 121. Voita E., Ferrando J., Nalmpantis C. Findings of the Association for Computational Linguistics ACL 2024. Association for Computational Linguistics; 2024. Neurons in large language models: Dead, n-gram, positional; pp. 1288–1301.
- 122. Geva M., Schuster R., Berant J., Levy O. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2021. Transformer feed-forward layers are key-value memories; pp. 5484–5495.
- 123. Lv A., Zhang K., Chen Y., Wang Y., Liu L., Wen J.-R., Xie J., Yan R. Interpreting key mechanisms of factual recall in transformer-based language models. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2403.19521.
- 124. Geva M., Bastings J., Filippova K., Globerson A. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2023. Dissecting recall of factual associations in auto-regressive language models; pp. 12216–12235.
- 125. Stolfo A., Belinkov Y., Sachan M. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2023. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis; pp. 7035–7052.
- 126. Krichmar J.L., Edelman G.M. Machine psychology: autonomous behavior, perceptual categorization and conditioning in a brain-based device. Cerebr. Cortex. 2002;12:818–830. doi: 10.1093/cercor/12.8.818.
- 127. Hagendorff T. Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. Preprint at arXiv. 2023 doi: 10.48550/arXiv.2303.13988.
- 128. Johansson R., Hammer P., Lofthouse T. Functional equivalence with nars. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2405.03340.
- 129. Staddon J.E.R., Cerutti D.T. Operant conditioning. Annu. Rev. Psychol. 2003;54:115–144. doi: 10.1146/annurev.psych.54.101601.145124.
- 130. Narayan S., Cohen S.B., Lapata M. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. Preprint at arXiv. 2018 doi: 10.48550/arXiv.1808.08745.
- 131. Li M., Chen M.-B., Tang B., Hou S., Wang P., Deng H., Li Z., Xiong F., Mao K., Peng C., Luo Y. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2024. NewsBench: A systematic evaluation framework for assessing editorial capabilities of large language models in Chinese journalism; pp. 9993–10014.
- 132. Hendrycks D., Burns C., Kadavath S., Arora A., Basart S., Tang E., Song D., Steinhardt J. Measuring mathematical problem solving with the math dataset. Preprint at arXiv. 2021 doi: 10.48550/arXiv.2103.03874.
- 133. Cobbe K., Kosaraju V., Bavarian M., Chen M., Jun H., Kaiser L., Plappert M., Tworek J., Hilton J., Nakano R., et al. Training verifiers to solve math word problems. Preprint at arXiv. 2021 doi: 10.48550/arXiv.2110.14168.
- 134. Chen Z., Du W., Zhang W., Liu K., Liu J., Zheng M., Zhuo J., Zhang S., Lin D., Chen K., Zhao F. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2024. T-eval: Evaluating the tool utilization capability of large language models step by step; pp. 9510–9529.
- 135. Yu Q., Zheng Z., Song S., Li Z., Xiong F., Tang B., Chen D. xfinder: Robust and pinpoint answer extraction for large language models. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2405.11874.
- 136. Zheng Z., Wang Y., Huang Y., Song S., Yang M., Tang B., Xiong F., Li Z. Reference list for the paper “attention heads of large language models”. Zenodo. 2024 doi: 10.5281/zenodo.14601922.