Proceedings of the National Academy of Sciences of the United States of America
2023 Aug 14; 120(34): e2219150120. doi: 10.1073/pnas.2219150120

Building transformers from neurons and astrocytes

Leo Kozachkov a,b,1, Ksenia V Kastanenka c, Dmitry Krotov a,1
PMCID: PMC10450673  PMID: 37579149

Significance

Transformers have become the default choice of neural architecture for many machine learning applications. Their success across multiple domains such as language, vision, and speech raises the question: How can one build Transformers using biological computational units? At the same time, in the glial community, there is gradually accumulating evidence that astrocytes, formerly believed to be passive housekeeping cells in the brain, in fact play an important role in the brain’s information processing and computation. In this work we hypothesize that neuron–astrocyte networks can naturally implement the core computation performed by the Transformer block in AI. The omnipresence of astrocytes in almost every brain area may explain the success of Transformers across a diverse set of information domains and computational tasks.

Keywords: neuroscience, astrocytes, Transformers, glia, artificial intelligence

Abstract

Glial cells account for between 50% and 90% of all human brain cells, and serve a variety of important developmental, structural, and metabolic functions. Recent experimental efforts suggest that astrocytes, a type of glial cell, are also directly involved in core cognitive processes such as learning and memory. While it is well established that astrocytes and neurons are connected to one another in feedback loops across many timescales and spatial scales, there is a gap in understanding the computational role of neuron–astrocyte interactions. To help bridge this gap, we draw on recent advances in AI and astrocyte imaging technology. In particular, we show that neuron–astrocyte networks can naturally perform the core computation of a Transformer, a particularly successful type of AI architecture. In doing so, we provide a concrete, normative, and experimentally testable account of neuron–astrocyte communication. Because Transformers are so successful across a wide variety of task domains, such as language, vision, and audition, our analysis may help explain the ubiquity, flexibility, and power of the brain’s neuron–astrocyte networks.


Astrocytes, one kind of glia, are a ubiquitous cell type in the central nervous system. It is empirically well established that astrocytes and neurons communicate with one another via feedback loops that span many spatial and temporal scales (1–3). These communications underlie a variety of important physiological processes, such as regulating blood flow to neurons (4) and eliminating debris (5). A rapidly growing body of evidence suggests that astrocytes also play an active and flexible role in behavior (6–12). However, a firm computational interpretation of neuron–astrocyte communication is missing.

Transformers, a particular type of artificial intelligence (AI) architecture, have become influential in machine learning (13) and, increasingly, in computational neuroscience (14–20). They are currently the model of choice for tasks across many disparate domains, including natural language processing, vision, and speech (21). Interestingly, several recent reports have suggested architectural similarities between Transformers and the hippocampus (15, 19) and cerebellum (18), as well as representational similarities with human brain recordings (14, 16, 20). However, unlike more traditional neural networks, such as convolutional networks (22) or Hopfield networks (23), which have a long tradition of biological implementations, Transformers are only at the beginning of their interpretation in terms of known biological processes.

We hypothesize that biological neuron–astrocyte networks can perform the core computations of a Transformer. In support of this hypothesis, we explicitly construct an artificial neuron–astrocyte network whose internal mechanics and outputs approximate those of a Transformer with high probability. The main computational element of our network is the tripartite synapse, the ubiquitous three-factor connection between an astrocyte, a presynaptic neuron, and a postsynaptic neuron (24). We argue that tripartite synapses can perform the role of normalization in the Transformer’s self-attention operation. As such, neuron–astrocyte networks are natural candidates for the biological “hardware” that can be used for computing with Transformers.

The organization of this paper is as follows. We begin with two primers, which introduce the core concepts and notation: one on astrocyte biology and the other on Transformers. Then, we describe our neuron–astrocyte network in detail and demonstrate its correspondence to Transformers through theory and simulations. We first establish the correspondence for models with shared weights and then treat the general case. For completeness, we also derive a nonastrocytic mechanism for implementing Transformers biologically. Although it should ultimately be decided through experiments which of the two mechanisms is closer to biological reality, from a theoretical perspective we argue that astrocytes provide a more natural and parsimonious hypothesis for how Transformers might be implemented in the brain. We conclude with a discussion of the intrinsic timescales of our biological Transformers, as well as potential future work.

Primer on Astrocyte Biology.

Glial cells are the other major cell type in the brain besides neurons. The exact ratio of glia to neurons is disputed, but it is somewhere between 1:1 and 10:1 (25). The most well-studied type of glial cell is the astrocyte. A defining feature of astrocytes is that a single astrocyte cell forms connections with thousands to millions of nearby synapses (26). For example, a single human astrocyte can cover between 270,000 and 2 million synapses within a single domain (27). Astrocytes are mostly electrically silent, encoding information in the dynamics of intracellular calcium ions (Ca2+). In most parts of the brain, neurons and astrocytes are closely intertwined. For example, in the hippocampus as many as 60% of all axon–dendrite synapses are wrapped by astrocyte cell membranes called processes (28). In the cerebellum, the number is even higher. This three-way arrangement (presynaptic axon, postsynaptic dendrite, astrocyte process) is so common that it has been given a name: the tripartite synapse (24).

Astrocyte processes contain receptors corresponding to the neurotransmitters released at the synaptic sites they ensheathe. For example, astrocytes in the basal ganglia are sensitive to dopamine, whereas in the cortex astrocytes are sensitive to glutamate (29). Despite being affected by the same presynaptic neurotransmitters, postsynaptic neurons and astrocytes respond very differently: Neurons primarily encode information using action potentials, whereas astrocytes encode information via elevations in free intracellular calcium. Importantly, neuron-to-astrocyte signaling can trigger a response in the opposite astrocyte-to-neuron direction, thus establishing a feedback loop between neurons and astrocytes. Astrocytes can either depress or facilitate synapses, depending on the situation (30). For example, astrocytes in the hypothalamus have been observed to multiplicatively scale the excitatory synapses they ensheathe by the same common factor (31).

Interestingly, there is also extensive astrocyte-to-astrocyte communication in the brain. Astrocytes form large-scale networks with one another (26). These networks are spatially tiled, with regular intercellular spacing of a few tens of micrometers (32). Unlike neurons, which communicate primarily with spikes, astrocytes communicate via calcium waves that propagate between their cell bodies, processes, and endfeet (33). These waves have speeds of a few tens of micrometers per second. It is thought that these waves could be used to synchronize neural populations and coordinate important neural processes (34).

Among this plethora of biological phenomena, the following four points will be important for our mathematical model:

  • Most synapses in the brain are tripartite (presynaptic neuron, postsynaptic neuron, astrocyte process).

  • There is a feedback loop between astrocyte processes and synapses. Astrocyte processes respond to presynaptic neural activity with an elevation in intracellular calcium ions (Ca2+) and, in turn, release gliotransmitters which modulate synapses. This modulation can be either facilitating or depressing.

  • The neuron–astrocyte signaling pathway is plastic.

  • Nearby astrocyte processes can spatially average their Ca2+ levels.

Next, we introduce Transformers from the AI perspective, before proposing their biological implementation with astrocytes.

Primer on Transformers.

Transformers (13) are a popular neural architecture used in many of the recent innovations in AI, including Foundation Models (35), Generative Pre-trained Transformer-3 (GPT-3) (36), Chat Generative Pre-trained Transformer (ChatGPT) (37), etc. Originally developed for natural language processing tasks, Transformers are taking over the leaderboards in other domains too, including vision (38), speech, and audio processing (21). Initially, Transformers were developed as a means to overcome the shortcomings of recurrent neural networks (13). A major difference between these two architectures is as follows: while recurrent neural networks process inputs one at a time, Transformers have direct access to all past inputs. Through their self-attention mechanism (described in detail shortly), Transformers can learn long-range dependencies between words in a sentence without having to recurrently maintain a hidden state over long time intervals. Among other computational benefits, this allows for more efficient parallelization during the training process and avoids the vanishing/exploding gradient problem (39–41). In the vision domain, Transformers have also achieved state-of-the-art results (38), surpassing convolutional neural networks. While the latter use hard-coded inductive biases that enable them to learn local correlations between pixels in the image plane, Transformers form long-range learnable dependencies across the image plane right away, starting from the early layers of processing (42).

Although recurrent and convolutional neural networks admit straightforward biological interpretations, Transformers presently do not. The reason has to do with the Transformer’s self-attention mechanism. In particular, the so-called self-attention matrix is computed by a) calculating all pairwise dot products between “tokens” (e.g., words in a sentence, patches in an image, etc.), b) exponentiating these dot product terms, and then c) normalizing the rows of this matrix so that they sum to one. These operations are fundamentally nonlocal in time and space, which makes them difficult to interpret in biological terms. Later on, we will show how astrocyte biology offers a biologically plausible solution to this dilemma.

Transformers are typically a composition of many Transformer “blocks.” A typical Transformer block uses four basic operations: self-attention, a feed-forward neural network, layer normalization, and skip connections. These operations are arranged in a particular way so that the entire block can learn relationships between the tokens, which represent the data. More formally, consider a sequence of N token embeddings. Each token can correspond to a word (or part of a word) if the Transformer is used in the language domain, or to a patch of an image in the vision domain. Each embedding is of dimension d. The tokens are streamed into the network one by one (online setting), and the time of the token’s presentation is denoted by t. The t-th embedding is given by a vector x_t ∈ ℝ^d. Going forward, it will be helpful to collect these tokens into a single matrix, X:

X \equiv \begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_N \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{d \times N}. \qquad [1]

In the Transformer block, each token is converted to a key, query, and value vector via a corresponding linear transformation: W_K, W_Q ∈ ℝ^{D×d} and W_V ∈ ℝ^{d×d}. Here, D is the internal size of the attention operation. These transformations are optimized during training. The key, value, and query vectors are then collected into matrices, similarly to Eq. 1:

k_t = W_K x_t, \quad v_t = W_V x_t, \quad q_t = W_Q x_t; \qquad K = W_K X, \quad V = W_V X, \quad Q = W_Q X. \qquad [2]

After computing the key, value, and query matrices, the next major step in a Transformer is the self-attention operation, which allows the tokens to exchange information with each other. The output of this operation, SelfAttn(X), is a d×N matrix whose columns mix the value vectors according to an N×N matrix of attention weights containing information about all the pairwise interactions between tokens. At the core of the self-attention mechanism is the softmax function. Recall that the softmax function exponentiates the elements of a vector and then divides each element by the sum of these exponentials. Denoting column t of the self-attention matrix by attn(t), we have that

\mathrm{attn}(t) = \sum_{i=1}^{N} \alpha_i(t)\, v_i \quad \text{with} \quad \alpha_i(t) = \frac{e^{k_i^T q_t}}{\sum_{j=1}^{N} e^{k_j^T q_t}}.

Due to the softmax normalization, each column of the self-attention matrix can be interpreted as a convex combination of the value vectors. Given this definition as well as Eq. 2, we can write the self-attention matrix compactly as:

\mathrm{SelfAttn}(X) = V\, \mathrm{softmax}(K^T Q), \qquad [3]

where the softmax normalization is computed along the columns of K^T Q. The output of this self-attention operation is then passed along to a LayerNorm operation and a feed-forward neural network (FFN), both of which act separately on each token (each column of their input); see Fig. 1. Recall that a LayerNorm normalizes each element of a vector using the mean and variance of all elements in the vector (43) and can be implemented in a biologically plausible manner (44). Without loss of generality, we study a single-headed attention Transformer. In this case, the output of the full Transformer block may be written as a two-step process:

Y = \mathrm{LayerNorm}\big(\mathrm{SelfAttn}(X) + X\big), \qquad \mathrm{Transformer}(X) = \mathrm{LayerNorm}\big(\mathrm{FFN}(Y) + Y\big), \qquad [4]

Fig. 1.

(A) A high-level overview of the proposed neuron–astrocyte network. The Transformer block is approximated by a feed-forward network with an astrocyte unit that ensheathes the synapses between the hidden and last layers (matrix H). Data are constantly streamed into the network. (B) During the writing phase, the neuron-to-neuron weights are updated using a Hebbian learning rule and the neuron-to-astrocyte weights are updated using a presynaptic plasticity rule. During the reading phase, the data are forwarded through the network, and the astrocyte modulates the synaptic weights H.

where FFN refers to a feed-forward network, applied to each token (i.e., each column of Y) separately and identically.
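For concreteness, here is a minimal NumPy sketch of Eqs. 1–4 for a single attention head without masking; the toy dimensions, weight scalings, and helper names are our own illustrative choices and are not taken from the paper's code.

```python
# Minimal NumPy sketch of a single-headed Transformer block (Eqs. 1-4).
# Dimensions, scalings, and helper names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, D, N = 8, 8, 5                        # embedding dim, attention dim, number of tokens

X = rng.standard_normal((d, N))          # tokens stacked column-wise (Eq. 1)
W_K = rng.standard_normal((D, d)) * 0.1
W_Q = rng.standard_normal((D, d)) * 0.1
W_V = rng.standard_normal((d, d)) * 0.1

def softmax_cols(A):
    """Softmax over each column of A, as required by Eq. 3."""
    A = A - A.max(axis=0, keepdims=True)           # numerical stability
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def layer_norm(Y, eps=1e-5):
    """LayerNorm applied to each token (column) separately."""
    return (Y - Y.mean(axis=0, keepdims=True)) / np.sqrt(Y.var(axis=0, keepdims=True) + eps)

def ffn(Y, W1, W2):
    """Token-wise two-layer feed-forward network with a ReLU nonlinearity."""
    return W2 @ np.maximum(0.0, W1 @ Y)

K, Q, V = W_K @ X, W_Q @ X, W_V @ X                # Eq. 2
self_attn = V @ softmax_cols(K.T @ Q)              # Eq. 3

W1 = rng.standard_normal((4 * d, d)) * 0.1
W2 = rng.standard_normal((d, 4 * d)) * 0.1
Y = layer_norm(self_attn + X)                      # Eq. 4, first step
out = layer_norm(ffn(Y, W1, W2) + Y)               # Eq. 4, second step
print(out.shape)                                   # (8, 5): one d-dimensional output per token
```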

Biological Implementation of a Transformer Block

In order to gain theoretical insight into Transformers, it is common to tie the weights (45, 46). This tying can be within a single Transformer block, between blocks, or both. In this section, we will tie the weights within a single block but not between blocks. We will relax this weight-sharing constraint in later sections. In particular, we tie W_Q, W_K, W_V as follows:

W_Q = W_K = W, \qquad W_V = I, \qquad [5]

for some arbitrary matrix W and the identity matrix I. In general, we will not require that d = D; we impose this constraint for now in order to analyze the simplest version of our model that captures the essential elements of our argument. Without loss of generality, we will ignore the layer normalization steps for now, returning to them in the section titled “General Case of Untied Weights.”

Neuron–Astrocyte Network.

A high-level overview of our circuit is shown in Fig. 1. The network consists of a perceptron with an input layer, a hidden layer, and an output layer (Fig. 1A). As in many associative memory systems, our network has distinct writing and reading operations (23, 47). In particular, our network alternates between writing and reading phases (Fig. 1B). The writing phase enables the circuit to store information about all the tokens; the reading phase enables any given token to interact with all the others. Recall that a difficulty with interpreting Transformers as biological circuits is that they require operations which are nonlocal in space and time. Having distinct writing and reading phases allows our network to resolve this temporal nonlocality. As we will see, the spatial nonlocality is resolved through the astrocyte unit.

The d-dimensional inputs are passed to the hidden layer with m units, as well as to the last layer via a skip connection (not shown in Fig. 1). The hidden layer applies a fixed nonlinearity to incoming inputs. The outputs of the hidden layer are passed to the last layer via a linear mapping H ∈ ℝ^{d×m}. The synapses in the matrix H are tripartite synapses, meaning that each of the md synapses is associated with an astrocyte process p_iα. The Latin indices i, j are used to enumerate neurons in the first and last layers, while the Greek indices α, β are reserved for the hidden neurons. The strength of the synapse between a hidden neuron α and the output neuron i is denoted by H_iα, and the activity of the astrocyte process that ensheathes this synapse is described by p_iα. The layers are denoted from left to right as f, h, ℓ (first, hidden, last), respectively. Our network is described by the following equations:

f = x \in \mathbb{R}^d, \qquad h = \phi(W f) \in \mathbb{R}^m, \qquad \ell = r\,(H \odot \tilde{P})\, h + f \in \mathbb{R}^d. \qquad [6]

The scalar r ∈ {0, 1} stands for ‘read’ and is zero during the writing phase and unity during the reading phase. The symbol ⊙ denotes the Hadamard product (element-wise multiplication) between two matrices. The matrix P̃ ∈ ℝ^{d×m} captures the effect of the astrocyte processes and is defined as follows:

\tilde{P}_{i\alpha} = \frac{1}{p_{i\alpha}}.

This inverse modulation of synaptic weights by astrocytes has been observed, for example, in studies involving tumor necrosis factor-alpha (TNF-α), wherein astrocytes upscale synaptic weights in response to low neural activity and downscale weights in response to high neural activity. More generally, many studies have observed that astrocytes can both depress and facilitate synapses, depending on the situation (1, 48–51).
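To make the circuit concrete, here is a minimal sketch of the forward pass in Eq. 6. The choice of ϕ, the sizes, and the placeholder value of p are our own assumptions; Eq. 8 below specifies how p is actually determined by presynaptic activity.

```python
# Sketch of the forward pass in Eq. 6: read gate r and Hadamard modulation of H by
# the astrocyte matrix P~ (here with all processes sharing a single placeholder value p).
# Sizes and the choice of phi are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
d, m = 8, 64
W = rng.standard_normal((m, d)) * 0.1
H = rng.standard_normal((d, m)) * 0.1            # tripartite synaptic weights
phi = lambda a: np.exp(a)                        # one valid choice of nonlinearity (cf. Eq. 14)

def forward(x, r, p=1.0):
    f = x                                        # first layer
    h = phi(W @ f)                               # hidden layer
    P_tilde = np.full_like(H, 1.0 / p)           # astrocyte modulation, P~_ia = 1 / p_ia
    l = r * (H * P_tilde) @ h + f                # last layer (Eq. 6); * is the Hadamard product
    return f, h, l

x = rng.standard_normal(d)
_, _, l_write = forward(x, r=0)                  # writing phase: output reduces to f = x
_, _, l_read = forward(x, r=1, p=3.7)            # reading phase: astrocyte-modulated output
```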

Neural Activation Function.

The neural activation function ϕ plays a special role in our circuit. In order to match the exponential dot product in the Transformer’s self-attention mechanism, we will require that ϕ be an approximate feature map for the exponential dot product kernel

\phi(x)^T \phi(y) \approx e^{\,x^T y}. \qquad [7]

There are many (indeed, infinitely many) activation functions which satisfy this condition. Several biologically plausible options come from the theory of random feature maps (52–54), and we will discuss them in detail later on. For now, we will simply assume that ϕ is chosen so that Eq. 7 holds. More generally, however, one can pick any ϕ such that ϕ(x)^T ϕ(y) ≥ 0 to yield a valid self-attention mechanism (55). Nevertheless, only particular choices of ϕ yield the softmax self-attention which is used in most Transformers at scale (13).

Astrocyte Process Dynamics.

As discussed in the introduction, astrocyte processes are sensitive to presynaptic neural activity. To capture this mathematically, we assume that the astrocyte process Ca2+ response is linearly proportional to the activation h_α of the presynaptic neuron α in layer h. The constant of proportionality between the astrocyte process activation and the presynaptic neural activity is denoted by g_iα. This constant is in general different for every astrocyte process. Upon presentation of an embedded token to the network, astrocyte process p_iα initially responds with a local calcium elevation g_iα h_α. This Ca2+ response is then spatially averaged with the responses of other nearby astrocyte processes, so that, after transients, the processes all share the same value once a token is presented:

p_{i\alpha} = \frac{1}{md} \sum_{j=1}^{d} \sum_{\beta=1}^{m} g_{j\beta}\, h_{\beta} = p. \qquad [8]

The neuron-to-astrocyte signaling pathway in our circuit is completely described by Eq. 8.
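As a quick sanity check (ours, not the paper's), the double sum in Eq. 8 can be written out directly; once the presynaptic plasticity introduced below makes g_jβ independent of j, every process relaxes to the single scalar p = gᵀh/m that reappears in Eq. 10.

```python
# Numerical check of the spatial averaging in Eq. 8: when g_jb does not depend on j
# (as enforced by the presynaptic plasticity rule below), every process relaxes to the
# same scalar p = g^T h / m, which is the value used in Eq. 10. Shapes are illustrative.
import numpy as np

rng = np.random.default_rng(2)
d, m = 8, 64
g = rng.random(m)                       # neuron-to-astrocyte weights g_b (same for every j)
h = rng.random(m)                       # hidden-layer activations h_b

p = sum(g[b] * h[b] for j in range(d) for b in range(m)) / (m * d)   # Eq. 8, written out
assert np.isclose(p, g @ h / m)                                      # simplified form (Eq. 10)
```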

Writing Phase.

During the writing phase, r is set to zero. Biologically, this condition could correspond to some global neuromodulator being released into the local environment, for example, acetylcholine, as suggested in refs. 17 and 56. Setting r = 0 in Eq. 6 gives

f_t = x_t, \qquad h_t = \phi(W f_t) = \phi(k_t), \qquad \ell_t = f_t = v_t, \qquad [9]

where we have substituted in the definitions of the key, query, and value vectors given by Eq. 2, as well as the temporary weight-tying assumption given by Eq. 5. As the embedded tokens are passed into Eq. 9 sequentially, the weight matrix H is updated via Hebbian plasticity with a learning rate of 1/m. Upon presentation of token t, the matrix H is

H_t = H_{t-1} + \frac{1}{m}\, \ell_t\, h_t^T \;\;\Longrightarrow\;\; H = \frac{1}{m}\, V\, \phi^T(K),

where we have assumed that H is initially the zero matrix and substituted in the equalities in Eq. 9. At the same time that the neuron-to-neuron weights are updated via Hebbian plasticity, the neuron-to-astrocyte weights are updated via presynaptic plasticity. Upon presentation of token t, these weights are

g_{t,i\alpha} = g_{t-1,i\alpha} + \phi(W x_t)_{\alpha} \;\;\Longrightarrow\;\; g_{i\alpha} = \sum_{j=1}^{N} \phi(k_j)_{\alpha}.

Note that, as a consequence of the presynaptic plasticity, the weight g_iα does not depend on the index i. Therefore, we will only refer to the vector g ∈ ℝ^m, which—through the presynaptic plasticity—is simply the sum of the hidden-layer activations over all token presentations:

g = \sum_{j=1}^{N} \phi(k_j).
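The two update rules can be checked with a short simulation. In the sketch below (our own toy sizes, with a Performer-style ϕ built from an explicit random projection Π, which the paper introduces only later in the Random Feature Activations section), streaming the tokens through Eq. 9 reproduces the closed-form expressions for H and g.

```python
# Sketch of the writing phase: tokens are streamed through Eq. 9 (tied weights, Eq. 5)
# while H is updated by Hebbian plasticity with rate 1/m and g by presynaptic plasticity.
# phi, the random projection Pi, and the toy dimensions are our own assumptions.
import numpy as np

rng = np.random.default_rng(3)
d, m, N = 8, 64, 5                               # here D = d (tied-weight setting)
W = rng.standard_normal((d, d)) * 0.3            # W_Q = W_K = W (Eq. 5), W_V = I
Pi = rng.standard_normal((m, d))                 # random projection inside phi (cf. Eq. 14)
phi = lambda a: np.exp(Pi @ a)                   # feature map from R^d to R^m
X = rng.standard_normal((d, N)) * 0.3            # embedded tokens

H = np.zeros((d, m))                             # neuron-to-neuron weights, initially zero
g = np.zeros(m)                                  # neuron-to-astrocyte weights

for t in range(N):                               # online presentation of the tokens
    f = X[:, t]                                  # f_t = x_t
    k = W @ f                                    # k_t = W x_t
    h = phi(k)                                   # h_t = phi(k_t)
    l = f                                        # l_t = f_t = v_t (W_V = I, r = 0)
    H += np.outer(l, h) / m                      # Hebbian update: H <- H + (1/m) l_t h_t^T
    g += h                                       # presynaptic update: g <- g + phi(k_t)

K = W @ X                                        # closed-form check of both rules:
phiK = np.exp(Pi @ K)                            # phi applied column-wise to K
assert np.allclose(H, X @ phiK.T / m)            # H = (1/m) V phi^T(K)
assert np.allclose(g, phiK.sum(axis=1))          # g = sum_j phi(k_j)
```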

Reading Phase.

During the reading phase, the read gate is set to r=1 in Eq. 6, and the inputs are forwarded through the network. The astrocyte process activation value p, which according to Eq. 8 does not depend on indices i and α, is given by

p = \frac{d}{md}\, g^T h = \frac{1}{m} \sum_{j=1}^{N} \phi(k_j)^T \phi(q_t). \qquad [10]

To obtain the last equality, we have used h_t = ϕ(W x_t) = ϕ(q_t). Plugging in all the steps of Eq. 6, we see that the last layer has the following output:

\ell_t = \frac{1}{p}\, H\, \phi(q_t) + x_t = \frac{V\, \phi^T(K)\, \phi(q_t)}{\phi(q_t)^T \sum_{j=1}^{N} \phi(k_j)} + x_t \approx \sum_{i=1}^{N} \frac{e^{k_i^T q_t}}{\sum_{j=1}^{N} e^{k_j^T q_t}}\, v_i + x_t = \mathrm{attn}(t) + x_t, \qquad [11]

where we have used the assumption that ϕ is an approximate feature map for the exponential dot product, given by Eq. 7. If we compute ℓ_t for every token x_t and stack the results column-wise into a matrix L, we can conclude that the output of our neuron–astrocyte circuit is approximately the output of the Transformer’s self-attention, plus the necessary residual connection:

L \approx \mathrm{SelfAttn}(X) + X. \qquad [12]
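Putting the writing and reading phases together gives a direct numerical check of Eq. 12. The sketch below uses the exponential feature map of Eq. 14 with tied weights; the seed, sizes, scalings, and the spherical normalization of the tokens are our own illustrative choices.

```python
# End-to-end sketch (tied weights, Eq. 5): write all N tokens, then read each one back
# and compare the circuit output (Eq. 11) with exact softmax self-attention plus the
# skip connection (Eq. 12). All settings are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
d, m, N = 16, 50_000, 6                          # large m for a good approximation
W = rng.standard_normal((d, d)) / np.sqrt(d)     # W_Q = W_K = W, W_V = I (Eq. 5)
X = rng.standard_normal((d, N))
X /= np.linalg.norm(X, axis=0, keepdims=True)    # spherical normalization of the tokens

Pi = rng.standard_normal((m, d))
def phi(A):                                      # Eq. 14, applied column-wise
    return np.exp(Pi @ A - np.sum(A * A, axis=0) / 2) / np.sqrt(m)

K = Q = W @ X                                    # keys equal queries under the tied weights
H = X @ phi(K).T / m                             # writing phase (Hebbian), with V = X
g = phi(K).sum(axis=1)                           # writing phase (presynaptic)

L = np.zeros((d, N))
for t in range(N):                               # reading phase, one token at a time
    h = phi(Q[:, [t]])[:, 0]                     # h_t = phi(q_t)
    p = g @ h / m                                # shared astrocyte value (Eq. 10)
    L[:, t] = H @ h / p + X[:, t]                # l_t = (1/p) H phi(q_t) + x_t (Eq. 11)

A = np.exp(K.T @ Q)
exact = X @ (A / A.sum(axis=0, keepdims=True)) + X   # SelfAttn(X) + X
print(np.max(np.abs(L - exact)))                 # small, and shrinks further as m grows (Eq. 12)
```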

Random Feature Activations.

As mentioned above, in order to approximate the softmax attention, we require that ϕ be a feature map for the exponential dot product. This is the idea behind linear Transformer architectures (55) such as Performers (53) and Random Feature Attention (54). We will now discuss two biologically plausible options for such a feature map. The first relies on a well-known result in kernel approximation theory (52), which is that the radial basis function (RBF) kernel can, with high probability, be approximated very well using random projections and cosines:

\phi(x) = \sqrt{\frac{2}{m}}\, \exp\!\left(\frac{\lVert x \rVert^2}{2}\right) \cos(\Pi x + b), \qquad [13]

where the elements of Π ∈ ℝ^{m×D} are drawn from a standard normal distribution, and the elements of b ∈ ℝ^m are drawn from the uniform distribution on [0, 2π]. A related but different random feature map was introduced in the context of Performers (53). There it was shown that, instead of cosines, one can just as well use exponential functions:

\phi(x) = \frac{1}{\sqrt{m}}\, \exp\!\left(-\frac{\lVert x \rVert^2}{2}\right) \exp(\Pi x). \qquad [14]

Note that due to the softmax normalization, any constant prefactors in Eq. 13 can be ignored (since they cancel in the numerator and denominator). If we assume an additional spherical normalization step before the random projection layer, so that all arguments to ϕ have constant norm, then the above activation functions may be written more plainly as

\phi(x) = \cos(\Pi x + b) \qquad \text{and} \qquad \phi(x) = \exp(\Pi x).

Cosine tuning curves appear ubiquitously in neuroscience, across many different organisms (e.g., crickets, cats, rhesus monkeys) and many different brain areas (e.g., cerebellum, motor cortex, and hippocampus) (57, 58). The function exp(·) is monotonic and positive, making it easy to implement from a biological perspective. For the exponential random feature function, the term exp(−‖x‖²/2) may be interpreted as a homeostatic mechanism to ensure that firing rates do not become too large. We stress that while the aforementioned random feature maps are sufficient for approximating the softmax self-attention mechanism, there are infinitely many other activation functions that lead to valid (though potentially nonsoftmax) self-attention matrices.
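For reference, here are the two feature maps written out in NumPy, together with a Monte Carlo check of the kernel condition in Eq. 7; the dimensions, seed, and unit-norm test vectors are our own assumptions.

```python
# The two random feature maps (Eqs. 13 and 14) and a Monte Carlo check that
# phi(x)^T phi(y) approximates exp(x^T y) (Eq. 7). Settings are illustrative.
import numpy as np

rng = np.random.default_rng(5)
D, m = 16, 100_000
Pi = rng.standard_normal((m, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

def phi_cos(x):                                  # Eq. 13 (random Fourier features)
    return np.sqrt(2.0 / m) * np.exp(x @ x / 2) * np.cos(Pi @ x + b)

def phi_exp(x):                                  # Eq. 14 (Performer-style positive features)
    return np.exp(-x @ x / 2) * np.exp(Pi @ x) / np.sqrt(m)

x = rng.standard_normal(D); x /= np.linalg.norm(x)
y = rng.standard_normal(D); y /= np.linalg.norm(y)

print(np.exp(x @ y),                             # target kernel value
      phi_cos(x) @ phi_cos(y),                   # cosine-feature estimate
      phi_exp(x) @ phi_exp(y))                   # exponential-feature estimate
```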

General Case of Untied Weights

In this section, we relax the weight-tying condition and generalize our construction to the case where D ≠ d. While in the previous sections r acted as a gatekeeper for the weight matrix H, we will now also have r act as a gatekeeper for a few other weight matrices. Using the same variable names, consider the following neuron–astrocyte forward equations:

f = x \in \mathbb{R}^d, \qquad h = \phi\big[(1-r)\, W_K f + r\, W_Q f\big] \in \mathbb{R}^m, \qquad \ell = r\,(H \odot \tilde{P})\, h + (1-r)\, W_V f + r f \in \mathbb{R}^d. \qquad [15]

When r = 0, we recover the writing phase of Eq. 9; when r = 1, we recover the reading phase equations of Eq. 11. When we impose the weight-tying constraint W_K = W_Q = W and W_V = I, we recover the original equations of Eq. 6. Eq. 15 describes the neuron–astrocyte implementation of the general Transformer block without the weight-sharing constraint imposed. The circuit diagram corresponding to Eq. 15 can be seen in Fig. 2A.

Fig. 2.

(A) Circuit diagram of the full neuron–astrocyte model (Eq. 15), which implements a general (i.e., untied) Transformer block. (B) Error vs. the number of hidden units (m) in our network. As m increases, the difference between the output of the neuron–astrocyte circuit and the AI Transformer block decreases.
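A minimal sketch of the forward equations in Eq. 15 is given below; the shapes, the Performer-style ϕ (with its random projection Π), and the placeholder astrocyte value p are illustrative assumptions rather than settings from the paper's code.

```python
# Sketch of the untied circuit (Eq. 15): the read gate r now also selects between the
# key and query pathways into the hidden layer and gates W_V on the skip path.
# Shapes, the feature map, and the placeholder p are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(6)
d, D, m = 8, 12, 64
W_K = rng.standard_normal((D, d)) * 0.3
W_Q = rng.standard_normal((D, d)) * 0.3
W_V = rng.standard_normal((d, d)) * 0.3
Pi = rng.standard_normal((m, D))                 # random projection inside phi (Eqs. 13-14)
phi = lambda a: np.exp(Pi @ a - a @ a / 2) / np.sqrt(m)
H = np.zeros((d, m))                             # filled in by the writing phase

def forward(x, r, p=1.0):
    f = x
    h = phi((1 - r) * (W_K @ f) + r * (W_Q @ f))             # phi(k_t) when writing, phi(q_t) when reading
    P_tilde = np.full_like(H, 1.0 / p)
    l = r * (H * P_tilde) @ h + (1 - r) * (W_V @ f) + r * f  # Eq. 15
    return f, h, l

x = rng.standard_normal(d)
_, h_write, l_write = forward(x, r=0)            # h = phi(k), l = W_V x  (cf. Eq. 9)
_, h_read, l_read = forward(x, r=1, p=2.0)       # h = phi(q), l = (1/p) H phi(q) + x
```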

Numerical Validation

The results derived above have also been checked numerically. In Fig. 2B, one can see the error between the proposed neuron–astrocyte network and the actual AI Transformer block as a function of the ratio of the width of the hidden layer to the size of the token embedding. As expected from the theoretical analysis, the error between the two networks rapidly decreases as the hidden layer becomes wider. In practice, once the width of the hidden layer is 5 to 10 times the embedding dimension, the two networks produce very similar outputs. In Fig. 3A, we use the parameters of the ALBERT-base (59, 60) Transformer to generate a corresponding neuron–astrocyte model. In particular, we extracted the word embedding matrix, the encoder matrix, and the W_Q, W_K, W_V matrices from the first block of ALBERT-base. We then embedded and encoded the first 200 words of the abstract of this paper. We plugged these weights into two neuron–astrocyte networks (Eq. 15)—one with m = 10³ hidden neurons and one with m = 10⁵ hidden neurons—and passed the tokens through the network. We extracted the astrocyte responses during the reading phase and plotted them along with the actual softmax normalization terms of the ALBERT-base model. In Fig. 3B, we performed a similar “weight transfer” from a Vision Transformer model that was pretrained on ImageNet-21K (61, 62). In this case, the tokens were patches of an image instead of words in a sentence. As expected from the theoretical derivation, for a sufficiently large number of hidden units, neuron–astrocyte networks accurately describe the computation performed by the Transformer models. The code to reproduce Fig. 3 is available in the following GitHub repository: https://github.com/kozleo/neuron-astrocyte-transformer.

Fig. 3.

(Left) Astrocyte traces for m = 10³ and m = 10⁵ compared against the “exact” softmax normalization terms of the ALBERT-base model. The tokens used were the first 200 words of the abstract of this paper. See “Numerical Validation” for details. (Right) Similar plot as on the left, but for a Vision Transformer. Instead of using embedded words as tokens, the model uses patches from an image.
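For readers who want a quick, self-contained version of the numerical check, the sketch below reproduces the qualitative behavior of Fig. 2B with randomly drawn weights rather than the ALBERT or ViT weights used above; all settings are our own assumptions, and the actual weight-transfer experiments are in the linked repository.

```python
# Toy version of the error-vs-width sweep in Fig. 2B: as the hidden width m grows,
# the untied circuit of Eq. 15 approaches exact softmax self-attention.
# Weights, scalings, and sizes here are our own assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(7)
d, D, N = 16, 16, 8
X = rng.standard_normal((d, N)) * 0.2
W_K = rng.standard_normal((D, d)) / np.sqrt(d)
W_Q = rng.standard_normal((D, d)) / np.sqrt(d)
W_V = rng.standard_normal((d, d)) / np.sqrt(d)
K, Q, V = W_K @ X, W_Q @ X, W_V @ X

A = np.exp(K.T @ Q)
exact = V @ (A / A.sum(axis=0, keepdims=True)) + X        # SelfAttn(X) + X

for m in (10**3, 10**4, 10**5):
    Pi = rng.standard_normal((m, D))
    feat = lambda B: np.exp(Pi @ B - np.sum(B * B, axis=0) / 2) / np.sqrt(m)   # Eq. 14
    phiK, phiQ = feat(K), feat(Q)
    H = V @ phiK.T / m                            # writing phase (Hebbian)
    g = phiK.sum(axis=1)                          # writing phase (presynaptic)
    p = g @ phiQ / m                              # one astrocyte value per read token (Eq. 10)
    L = (H @ phiQ) / p + X                        # reading phase (Eq. 11), all tokens at once
    print(m, np.abs(L - exact).mean())            # mean error decreases as m grows
```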

Do We Need Astrocytes?

Although we are interested in addressing the scientific problem of how astrocytes participate in behavior, a natural question when positing any new brain mechanism is as follows: “Can the same behavior be achieved without this mechanism?” This section demonstrates that a Transformer circuit can also be constructed using neurons and bipartite synapses, together with a specialized divisive normalization achieved via shunting inhibition. The circuit is similar to Eq. 6:

f = x \in \mathbb{R}^d, \qquad h = \phi(W f) \in \mathbb{R}^m, \qquad R = g^T h \in \mathbb{R}, \qquad \ell = \frac{r}{R}\, H h + f \in \mathbb{R}^d. \qquad [16]

The only difference between Eqs. 16 and 6 is the addition of a new element, R, and the removal of the astrocyte processes. Here, R is an inhibitory neuron that divisively normalizes the feed-forward inputs into layer ℓ. However, it does not inhibit all feed-forward inputs equally. Despite both h and f being feed-forward inputs to layer ℓ, the divisive inhibition is only implemented on the inputs coming from layer h. This can happen, for example, if the feed-forward synaptic inputs coming from layer h arrive at the dendritic tree close to where inhibitory inputs from neuron R shunt current flow, while the feed-forward inputs coming from layer f synapse far away from the shunting site (63). Leaving the reading and writing phases untouched, the circuit in Eq. 16 implements the same forward pass as Eq. 6.
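A minimal sketch of the circuit in Eq. 16 is shown below; the sizes, the nonlinearity, and the placeholder weights standing in for the writing phase are our own assumptions.

```python
# Sketch of the astrocyte-free alternative (Eq. 16): an inhibitory unit R = g^T h
# divisively normalizes only the pathway coming from the hidden layer h, leaving the
# skip input f untouched. Sizes and the nonlinearity are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(8)
d, m = 8, 64
W = rng.standard_normal((d, d)) * 0.3
Pi = rng.standard_normal((m, d))
phi = lambda a: np.exp(Pi @ a)
H = rng.random((d, m)) / m                       # stands in for the weights learned during writing
g = rng.random(m)                                # weights onto the shunting inhibitory neuron

def forward_shunting(x, r=1):
    f = x
    h = phi(W @ f)
    R = g @ h                                    # inhibitory neuron activity, R = g^T h
    l = (r / R) * (H @ h) + f                    # divisive (shunting) normalization of the h pathway only
    return l

print(forward_shunting(rng.standard_normal(d)))
```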

While the proposed nonastrocytic circuit can, in theory, also implement a Transformer forward pass, it should be noted that there exists a controversy about the capability of shunting inhibition to implement divisive normalization (63, 64). Thus, the biological plausibility of this circuit is questionable. Additionally—as we will discuss in the next section—the comparatively slower timescale of astrocytes provides a natural memory buffer when, e.g., accumulating and storing words in a sentence. Finally, it is possible that there are many ways to implement Transformers biologically, each with relative pros and cons. Different brain areas may implement Transformer-like computation using different circuitries. It is ultimately an experimental question to validate these theoretical hypotheses.

Timescales

One aspect of our model which we have yet to discuss is its timescale. Our circuit operates in two distinct phases: a reading phase and a writing phase. The reading phase does not involve any plasticity, so the only relevant timescale is how long it takes to traverse the neuron–astrocyte–synapse pathway. Recent data indicate that astrocytes can sense and respond to neural activity on the order of a few hundred milliseconds (9, 65). The speed of the writing phase is limited by the speed of plasticity. There are two types of plasticity used in our model during the writing phase: 1) Hebbian plasticity between neurons and 2) presynaptic plasticity between neurons and astrocytic processes. In the case of neuron–neuron plasticity, experimental studies report a vast range of relevant timescales. These include Hebbian plasticity (66–68), behavioral timescale plasticity (69–71), etc. The induction timescales for these plasticity mechanisms range from hundreds of milliseconds (70) to tens of minutes (67). In computational modeling studies of STDP, it is typically assumed that synaptic weights are adjusted instantaneously, by an amount determined by the timing difference between pre- and postsynaptic spikes (72, 73). The neuron–astrocyte plasticity timescale is harder to establish, due to limitations in calcium recording technology. While fast calcium transients in astrocyte processes have recently been recorded (9), and neuron–astrocyte plasticity has been experimentally observed (74), fast (e.g., <1 s) neuron–astrocyte plasticity has not yet been observed.

Discussion

Here, we have built a computational neuron–astrocyte model which is functionally equivalent to an important AI architecture: the Transformer. This model serves a dual purpose. The first purpose is to provide a concrete, normative, computational account of how the communication between astrocytes and neurons subserves brain function. The second purpose is to provide a biologically plausible account of how Transformers might be implemented in the brain. While the feedback loop between neurons and astrocytes is well studied from an experimental perspective, there is comparatively little work studying it from a computational perspective (7). Astrocyte modeling studies tend to focus either on the biophysics of neuron–astrocyte or astrocyte signaling (75, 76) or on the emergent computational properties of detailed neuron–astrocyte models (77–79). Fewer studies have focused on simpler, normative models of neuron–astrocyte networks (51, 80, 81).

An important feature of our model is that it is flexible enough to approximate any Transformer. In other words, we do not only show how to model a particular Transformer (i.e., one with weights that have already been trained for some specific task)—rather, we show how to approximate all possible Transformers using neurons and astrocytes. Given the demonstrated power and flexibility of Transformers, this generality can help to explain why astrocytes are so prevalent across disparate brain areas and species. Our model has several immediate implications. First, as calcium imaging technologies improve, it will become increasingly feasible to explicitly compare artificial representations in AI networks to representations in biological astrocyte networks—as is already done when comparing AI networks to biological neural networks (16, 22, 82). Given that astrocyte activity is thought to be tightly coupled to fMRI responses (83), natural language processing contexts such as those of refs. 16 and 84 are already a promising place to look for astrocytic contributions to brain function. Additionally, we propose that our hypothesis could be refuted through studies involving targeted astrocyte manipulations. The brain is clearly sensitive to disruptions of normal astrocyte function. For instance, prior experimental studies have demonstrated that hippocampal astrocyte activation positively influences memory-related behaviors (85), whereas striatal astrocyte activation impairs attention (86). To challenge our hypothesis, we could train both a Transformer model and an animal subject to perform the same hippocampus-based memory task, such as one requiring path integration. Based on previous research, we anticipate a strong correlation between Transformer and hippocampal activations (87). If we could then selectively silence or modify hippocampal astrocytes in the animal subject and demonstrate that the representational similarity to the Transformer model remains unaffected, our hypothesis would be undermined. The main constraint of this approach lies in the present challenge of selectively inactivating astrocytes in a controlled and reversible fashion (1). Nevertheless, we anticipate that advancements in the field of astrocyte biology will eventually overcome these limitations.

Despite the exciting potential links between Transformers and the brain, it is worth noting that humans learn quite differently from Transformers. Transformers are extremely data-hungry, and consequently, training them requires a massive amount of energy (88). By contrast, the human brain runs on a smaller energy budget than a common laptop and does not require internet-scale training datasets to learn a language (89). In view of this fact, it may be more appropriate to view training a large Transformer as analogous to learning over evolutionary timescales, rather than the lifetime of a single individual (90).

Finally, a major roadblock in accepting Transformers as models of natural language processing (or, more generally, sequential processing) in the brain is that they require a memory buffer to store the tokens as they are presented. This is because the self-attention matrix is computed over all the tokens. Our paper proposes that neuron–astrocyte networks can perform this buffering naturally through spatial and temporal integration. More speculatively, since astrocytes are implicated in many brain disorders and diseases, our work suggests that causal manipulations of Transformers could be used to generate putative hypotheses for how astrocyte function goes astray in brain disorders and diseases (91, 92).

Acknowledgments

We thank Dan Gutfreund, John Hopfield, Martin Schrimpf, and Mriganka Sur for helpful comments and feedback. This work was completed while L.K. was an MIT-IBM Watson AI Lab Summer 2022 Intern. K.V.K. acknowledges funding from the following sources: BrightFocus Foundation Grant A2020833S, and National Institutes of Health Grant R01AG066171.

Author contributions

L.K. and D.K. designed research; L.K. and D.K. performed research; and L.K., K.V.K., and D.K. wrote the paper.

Competing interests

L.K. did his summer internship at IBM.

Footnotes

This article is a PNAS Direct Submission.

Contributor Information

Leo Kozachkov, Email: leokoz8@mit.edu.

Dmitry Krotov, Email: krotov@ibm.com.

Data, Materials, and Software Availability

There are no data underlying this work. The code used in this work is available in the following GitHub repository: https://github.com/kozleo/neuron-astrocyte-transformer (93).

References

  • 1. Kofuji P., Araque A., Astrocytes and behavior. Annu. Rev. Neurosci. 44, 49–67 (2021).
  • 2. Lind B. L., Brazhe A. R., Jessen S. B., Tan F. C. C., Lauritzen M. J., Rapid stimulus-evoked astrocyte Ca2+ elevations and hemodynamic responses in mouse somatosensory cortex in vivo. Proc. Natl. Acad. Sci. U.S.A. 110, E4678–E4687 (2013).
  • 3. Pinto-Duarte A., Roberts A. J., Ouyang K., Sejnowski T. J., Impairments in remote memory caused by the lack of Type 2 IP3 receptors. Glia 67, 1976–1989 (2019).
  • 4. MacVicar B. A., Newman E. A., Astrocyte regulation of blood flow in the brain. Cold Spring Harb. Perspect. Biol. 7, a020388 (2015).
  • 5. Chung W.-S., Allen N. J., Eroglu C., Astrocytes control synapse formation, function, and elimination. Cold Spring Harb. Perspect. Biol. 7, a020370 (2015).
  • 6. Kol A., Goshen I., The memory orchestra: The role of astrocytes and oligodendrocytes in parallel to neurons. Curr. Opin. Neurobiol. 67, 131–137 (2021).
  • 7. Kastanenka K. V., et al., A roadmap to integrate astrocytes into systems neuroscience. Glia 68, 5–26 (2020).
  • 8. M. López-Hidalgo, V. Kellner, J. Schummers, Astrocyte subdomains respond independently in vivo. bioRxiv [Preprint] (2019). 10.1101/675769 (Accessed 20 June 2019).
  • 9. Stobart J. L., et al., Cortical circuit activity evokes rapid astrocyte calcium signals on a similar timescale to neurons. Neuron 98, 726–735 (2018).
  • 10. Yu M., et al., Glia accumulate evidence that actions are futile and suppress unsuccessful behavior. Cell 178, 27–43 (2019).
  • 11. Nagai J., et al., Behaviorally consequential astrocytic regulation of neural circuits. Neuron 109, 576–596 (2021).
  • 12. Lin Z., et al., Entrainment of astrocytic and neuronal Ca2+ population dynamics during information processing of working memory in mice. Neurosci. Bull. 38, 474–488 (2022).
  • 13. A. Vaswani et al., “Attention is all you need” in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2017), vol. 30. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (Accessed 6 December 2017).
  • 14. Toneva M., Wehbe L., Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain). Adv. Neural. Inf. Process. Syst. 32, 14954–14964 (2019).
  • 15. D. Krotov, J. J. Hopfield, “Large associative memory problem in neurobiology and machine learning” in International Conference on Learning Representations (OpenReview.net, 2021).
  • 16. Schrimpf M., et al., The neural architecture of language: Integrative modeling converges on predictive processing. Proc. Natl. Acad. Sci. U.S.A. 118, e2105646118 (2021).
  • 17. D. Tyulmankov, C. Fang, A. Vadaparty, G. R. Yang, “Biological learning in key-value memory networks” in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2021), vol. 34, pp. 22247–22258.
  • 18. Bricken T., Pehlevan C., Attention approximates sparse distributed memory. Adv. Neural. Inf. Process. Syst. 34, 15301–15315 (2021).
  • 19. J. C. R. Whittington, J. Warren, T. E. J. Behrens, Relating transformers to models and neural representations of the hippocampal formation. arXiv [Preprint] (2022). http://arxiv.org/abs/2112.04035 (Accessed 15 March 2022).
  • 20. Caucheteux C., King J.-R., Brains and algorithms partially converge in natural language processing. Commun. Biol. 5, 1–10 (2022).
  • 21. T. Lin, Y. Wang, X. Liu, X. Qiu, A survey of transformers. arXiv [Preprint] (2021). http://arxiv.org/abs/2106.04554 (Accessed 15 June 2021).
  • 22. Yamins D. L. K., et al., Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci. U.S.A. 111, 8619–8624 (2014).
  • 23. Hopfield J. J., Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554–2558 (1982).
  • 24. Perea G., Navarrete M., Araque A., Tripartite synapses: Astrocytes process and control synaptic information. Trends Neurosci. 32, 421–431 (2009).
  • 25. Von Bartheld C. S., Bahney J., Herculano-Houzel S., The search for true numbers of neurons and glial cells in the human brain: A review of 150 years of cell counting. J. Comp. Neurol. 524, 3865–3895 (2016).
  • 26. Halassa M. M., Fellin T., Takano H., Dong J.-H., Haydon P. G., Synaptic islands defined by the territory of a single astrocyte. J. Neurosci. 27, 6473–6477 (2007).
  • 27. Oberheim N. A., et al., Uniquely hominid features of adult human astrocytes. J. Neurosci. 29, 3276–3287 (2009).
  • 28. Semyanov A., Verkhratsky A., Astrocytic processes: From tripartite synapses to the active milieu. Trends Neurosci. 44, 781–792 (2021).
  • 29. Verkhratsky A., Butt A., Glial Neurobiology: A Textbook (John Wiley & Sons, 2007).
  • 30. Newman E. A., New roles for astrocytes: Regulation of synaptic transmission. Trends Neurosci. 26, 536–542 (2003).
  • 31. Gordon G. R. J., et al., Astrocyte-mediated distributed plasticity at hypothalamic glutamate synapses. Neuron 64, 391–403 (2009).
  • 32. Sul J.-Y., Orosz G., Givens R. S., Haydon P. G., Astrocytic connectivity in the hippocampus. Neuron Glia Biol. 1, 3–11 (2004).
  • 33. Kuga N., Sasaki T., Takahara Y., Matsuki N., Ikegaya Y., Large-scale calcium waves traveling through astrocytic networks in vivo. J. Neurosci. 31, 2607–2614 (2011).
  • 34. Scemes E., Giaume C., Astrocyte calcium waves: What they are and what they do. Glia 54, 716–725 (2006).
  • 35. R. Bommasani et al., On the opportunities and risks of foundation models. arXiv [Preprint] (2021). http://arxiv.org/abs/2108.07258 (Accessed 12 June 2022).
  • 36. Brown T., et al., Language models are few-shot learners. Adv. Neural. Inf. Process. Syst. 33, 1877–1901 (2020).
  • 37. OpenAI, ChatGPT: Optimizing language models for dialogue (2022). https://openai.com/blog/chatgpt/ (Accessed 12 May 2022).
  • 38. A. Dosovitskiy et al., An image is worth 16x16 words: Transformers for image recognition at scale. arXiv [Preprint] (2020). http://arxiv.org/abs/2010.11929 (Accessed 3 June 2021).
  • 39. S. Hochreiter, Untersuchungen zu Dynamischen Neuronalen Netzen (Diploma, Technische Universität München, 1991), vol. 91.
  • 40. Bengio Y., Simard P., Frasconi P., Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks 5, 157–166 (1994).
  • 41. R. Pascanu, T. Mikolov, Y. Bengio, “On the difficulty of training recurrent neural networks” in International Conference on Machine Learning (PMLR, 2013), pp. 1310–1318.
  • 42. Raghu M., Unterthiner T., Kornblith S., Zhang C., Dosovitskiy A., Do vision transformers see like convolutional neural networks? Adv. Neural. Inf. Process. Syst. 34, 12116–12128 (2021).
  • 43. J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization. arXiv [Preprint] (2016). http://arxiv.org/abs/1607.06450 (Accessed 21 June 2016).
  • 44. Shen Y., Wang J., Navlakha S., A correspondence between normalization strategies in artificial and biological neural networks. Neural Comput. 33, 3179–3203 (2021).
  • 45. M. E. Sander, P. Ablin, M. Blondel, G. Peyré, “Sinkformers: Transformers with doubly stochastic attention” in International Conference on Artificial Intelligence and Statistics (PMLR, 2022), pp. 3515–3530.
  • 46. Y. Yang, Z. Huang, D. Wipf, Transformers from an optimization perspective. arXiv [Preprint] (2022). http://arxiv.org/abs/2205.13891 (Accessed 27 May 2022).
  • 47. Kanerva P., Sparse Distributed Memory (MIT Press, 1988).
  • 48. Perea G., Araque A., Astrocytes potentiate transmitter release at single hippocampal synapses. Science 317, 1083–1086 (2007).
  • 49. Perea G., Navarrete M., Araque A., Tripartite synapses: Astrocytes process and control synaptic information. Trends Neurosci. 32, 421–431 (2009).
  • 50. De Pittà M., Brunel N., Volterra A., Astrocytes: Orchestrating synaptic plasticity? Neuroscience 323, 43–61 (2016).
  • 51. Ivanov V., Michmizos K., Increasing liquid state machine performance with edge-of-chaos dynamics organized by astrocyte-modulated plasticity. Adv. Neural. Inf. Process. Syst. 34, 25703–25719 (2021).
  • 52. Rahimi A., Recht B., Random features for large-scale kernel machines. Adv. Neural. Inf. Process. Syst. 20 (2007).
  • 53. K. Choromanski et al., Rethinking attention with performers. arXiv [Preprint] (2020). http://arxiv.org/abs/2009.14794 (Accessed 30 September 2020).
  • 54. H. Peng et al., Random feature attention. arXiv [Preprint] (2021). http://arxiv.org/abs/2103.02143 (Accessed 19 March 2021).
  • 55. A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention” in International Conference on Machine Learning (PMLR, 2020), pp. 5156–5165.
  • 56. Rasmusson D. D., The role of acetylcholine in cortical synaptic plasticity. Behav. Brain Res. 115, 205–218 (2000).
  • 57. Georgopoulos A. P., Kalaska J. F., Caminiti R., Massey J. T., On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci. 2, 1527–1537 (1982).
  • 58. Salinas E., Abbott L. F., Vector reconstruction from firing rates. J. Comput. Neurosci. 1, 89–107 (1994).
  • 59. Z. Lan et al., ALBERT: A Lite BERT for self-supervised learning of language representations. arXiv [Preprint] (2019). http://arxiv.org/abs/1909.11942 (Accessed 9 February 2020).
  • 60. T. Wolf et al., Huggingface’s transformers: State-of-the-art natural language processing. arXiv [Preprint] (2019). http://arxiv.org/abs/1910.03771 (Accessed 14 July 2020).
  • 61. B. Wu et al., Visual transformers: Token-based image representation and processing for computer vision. arXiv [Preprint] (2020). 10.48550/arXiv.2006.03677 (Accessed 20 November 2020).
  • 62. J. Deng et al., “Imagenet: A large-scale hierarchical image database” in 2009 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2009), pp. 248–255.
  • 63. Chance F. S., Abbott L. F., Divisive inhibition in recurrent networks. Netw. Comput. Neural Syst. 11, 119 (2000).
  • 64. Holt G. R., Koch C., Shunting inhibition does not have a divisive effect on firing rates. Neural Comput. 9, 1001–1013 (1997).
  • 65. Semyanov A., Henneberger C., Agarwal A., Making sense of astrocytic calcium signals—From acquisition to interpretation. Nat. Rev. Neurosci. 21, 551–564 (2020).
  • 66. Markram H., Lübke J., Frotscher M., Sakmann B., Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science 275, 213–215 (1997).
  • 67. Bi G., Poo M., Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci. 18, 10464–10472 (1998).
  • 68. Erickson M. A., Maramara L. A., Lisman J., A single brief burst induces GluR1-dependent associative short-term potentiation: A potential mechanism for short-term memory. J. Cogn. Neurosci. 22, 2530–2540 (2010).
  • 69. Bittner K. C., Milstein A. D., Grienberger C., Romani S., Magee J. C., Behavioral time scale synaptic plasticity underlies CA1 place fields. Science 357, 1033–1036 (2017).
  • 70. Magee J. C., Grienberger C., Synaptic plasticity forms and functions. Annu. Rev. Neurosci. 43, 95–117 (2020).
  • 71. Fan L. Z., et al., All-optical physiology resolves a synaptic basis for behavioral timescale plasticity. Cell 186, 543–559.e19 (2023).
  • 72. Song S., Miller K. D., Abbott L. F., Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci. 3, 919–926 (2000).
  • 73. Sjöström J., et al., Spike-timing dependent plasticity. Scholarpedia 35, 1362 (2010).
  • 74. Croft W., Dobson K. L., Bellamy T. C., “Equipping glia for long-term integration of network activity” in Neural Plasticity, Plasticity of Neuron-Glial Transmission (Hindawi, 2015).
  • 75. Witthoft A., Karniadakis G. E., A bidirectional model for communication in the neurovascular unit. J. Theor. Biol. 311, 80–93 (2012).
  • 76. Savtchenko L. P., Rusakov D. A., Regulation of rhythm genesis by volume-limited, astroglia-like signals in neural networks. Philos. Trans. R. Soc. B: Biol. Sci. 369, 20130614 (2014).
  • 77. De Pittà M., Brunel N., Multiple forms of working memory emerge from synapse–astrocyte interactions in a neuron–glia network model. Proc. Natl. Acad. Sci. U.S.A. 119, e2207912119 (2022).
  • 78. Becker S., Nold A., Tchumatchenko T., Modulation of working memory duration by synaptic and astrocytic mechanisms. PLoS Comput. Biol. 18, e1010543 (2022).
  • 79. Gordleeva S. Y., et al., Modeling working memory in a spiking neuron network accompanied by astrocytes. Front. Cell. Neurosci. 15, 631485 (2021).
  • 80. G. Tang, I. E. Polykretis, V. A. Ivanov, A. Shah, K. P. Michmizos, “Introducing astrocytes on a neuromorphic processor: Synchronization, local plasticity and edge of chaos” in Proceedings of the 7th Annual Neuro-inspired Computational Elements Workshop (Association for Computing Machinery, New York, NY, 2019), pp. 1–9.
  • 81. E. J. Peterson, What can astrocytes compute? bioRxiv [Preprint] (2021). 10.1101/2021.10.20.465192 (Accessed 1 December 2022).
  • 82. M. Schrimpf et al., Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv [Preprint] (2020). 10.1101/407007 (Accessed 9 May 2018).
  • 83. Figley C. R., Stroman P. W., The role(s) of astrocytes and astrocyte activity in neurometabolism, neurovascular coupling, and the production of functional neuroimaging signals. Eur. J. Neurosci. 33, 577–588 (2011).
  • 84. S. Kumar et al., Reconstructing the cascade of language processing in the brain using the internal computations of a transformer-based language model. bioRxiv [Preprint] (2022). 10.1101/2022.06.08.495348 (Accessed 9 May 2018).
  • 85. Adamsky A., Goshen I., Astrocytes in memory function: Pioneering findings and future directions. Neuroscience 370, 14–26 (2018).
  • 86. Nagai J., et al., Hyperactivity with disrupted attention by activation of an astrocyte synaptogenic cue. Cell 177, 1280–1292 (2019).
  • 87. J. C. R. Whittington, J. Warren, T. E. J. Behrens, Relating transformers to models and neural representations of the hippocampal formation. arXiv [Preprint] (2021). http://arxiv.org/abs/2112.04035 (Accessed 15 March 2022).
  • 88. D. Patterson et al., Carbon emissions and large neural network training. arXiv [Preprint] (2021). http://arxiv.org/abs/2104.10350 (Accessed 21 May 2021).
  • 89. Balasubramanian V., Brain power. Proc. Natl. Acad. Sci. U.S.A. 118, e2107022118 (2021).
  • 90. F. Geiger, M. Schrimpf, T. Marques, J. J. DiCarlo, Wiring up vision: Minimizing supervised synaptic updates needed to produce a primate ventral stream. bioRxiv [Preprint] (2020). 10.1101/2020.06.08.140111 (Accessed 6 August 2020).
  • 91. Escartin C., et al., Reactive astrocyte nomenclature, definitions, and future directions. Nat. Neurosci. 24, 312–325 (2021).
  • 92. Volman V., Bazhenov M., Sejnowski T. J., Computational models of neuron–astrocyte interaction in epilepsy. Front. Comput. Neurosci. 6, 58 (2012).
  • 93. L. Kozachkov, D. Krotov, Building Transformers from Neurons and Astrocytes. GitHub. https://github.com/kozleo/neuron-astrocyte-transformer. Deposited 15 February 2023.


