Journal of the Royal Society Interface. 2021 Apr 14;18(177):20210031. doi: 10.1098/rsif.2021.0031

On reaction network implementations of neural networks

David F Anderson 1, Badal Joshi 2, Abhishek Deshpande 1
PMCID: PMC8086923  PMID: 33849332

Abstract

This paper is concerned with the utilization of deterministically modelled chemical reaction networks for the implementation of (feed-forward) neural networks. We develop a general mathematical framework and prove that the ordinary differential equations (ODEs) associated with certain reaction network implementations of neural networks have desirable properties including (i) existence of unique positive fixed points that are smooth in the parameters of the model (necessary for gradient descent) and (ii) fast convergence to the fixed point regardless of initial condition (necessary for efficient implementation). We do so by first making a connection between neural networks and fixed points for systems of ODEs, and then by constructing reaction networks with the correct associated set of ODEs. We demonstrate the theory by constructing a reaction network that implements a neural network with a smoothed ReLU activation function, though we also demonstrate how to generalize the construction to allow for other activation functions (each with the desirable properties listed previously). As there are multiple types of ‘networks’ used in this paper, we also give a careful introduction to both reaction networks and neural networks, in order to disambiguate the overlapping vocabulary in the two settings and to clearly highlight the role of each network’s properties.

Keywords: neural networks, reaction networks, ReLU

1. Introduction

There is a growing interest in synthetic chemical reaction networks that carry out some pre-determined task [1–13]. The field that develops and analyses these networks often goes by the name ‘computation with chemical reaction networks’. The tasks being carried out can range from the pedestrian, such as determining the minimum or sum of two numbers, to the complex. The goal of this style of work is not to devise methods that can match or exceed silicon-based computers in terms of speed; instead, it is to develop methods of computation for environments in which silicon-based computers cannot currently go, for instance the cellular environment. A particular type of (complex) computation now found ubiquitously in our daily technology is machine learning via neural networks, and so it is no surprise that there has been recent work on the development of chemical reaction network implementations of neural networks with a fixed set of parameters [8,14–20]. More generally, work focused in this context on understanding the connection between biochemical models and the physical mechanisms of information processing stretches back at least through the 1960s [21–29].

The papers we are aware of in the literature pertaining to chemical reaction network implementations of neural networks focus on particular constructions. Hence, there is currently little mathematical theory developed that can be utilized in a general manner. (An exception is [8], which develops the necessary theory for chemical Boltzmann machines to be implemented via stochastic models of chemical reaction networks.) Moreover, it is often simulation that is put forth as evidence to demonstrate the validity of a construction as opposed to rigorous proof. Thus, these works are not mathematical in nature. (This should not be taken as a criticism, as these papers were not meant to focus on the mathematics.) The major goal of this work, therefore, is to begin the development of a mathematical framework for the construction of deterministically modelled reaction networks that implement neural networks and machine learning. In particular, the mathematical framework will allow us to prove that the dynamical system associated with the constructed chemical reaction network will (i) implement a given neural network and (ii) have certain desirable properties, briefly outlined below.

Some further details are called for before proceeding. In order to devise deterministically modelled chemical reaction networks that implement neural networks, the following broad strategy may be employed:

  • 1.

    Fix a neural network with some choice of activation function, φ, and parameters (biases and weights), P. Denote the output values of the neural network via Ψ(d), where d is an input (data).

  • 2.
    Determine a chemical reaction network {S,C,R} for which the associated mass-action ODE system
    $$\dot{x}(t) = f(x(t)), \qquad x(0) = d, \tag{1.1}$$
    satisfies F(x) = Ψ(d), where F is some functional of the solution, x, to (1.1) (note that the solution x depends on d, the initial value). In particular, it is natural to take the output to be the limiting steady-state values of some ordered subset of the species,
    $$F(x) = \Bigl(\lim_{t\to\infty} x_i(t)\Bigr)_{i\in I},$$
    where I is some index set.

The above is the basic strategy of [14], in which they design a reaction network to learn the XOR function, and of [19]. We note that a different modelling framework is used in [20], in which limiting values are found when certain counts go to zero (and remain there).

The basic strategy outlined above, i.e. using the limiting values of an initial value problem (1.1) to represent the output of a neural network, is quite natural, but it leaves open a number of questions that need to be addressed for a given construction:

  • 1.

    When will the constructed reaction network admit limiting steady states?

  • 2.

    Assuming limiting steady-state values exist, under what conditions will they be unique for a given choice of model parameters and for a given initial condition?

  • 3.

    Assuming there are unique limiting steady states, when will they be smooth in the parameters (which is important for gradient descent and other optimization procedures)?

  • 4.

    How long will it take the model to converge? In particular, could the time required to determine the output of the system depend strongly on the initial conditions?

We note that these are highly non-trivial questions in the present context as mass-action models of chemical systems are polynomial dynamical systems, and are known to exhibit myriad behaviours including chaotic behaviour [30].

In this article, we develop a mathematical framework that is capable of resolving the questions posed above. Moreover, we use our framework to develop a chemical reaction network implementation of an arbitrarily sized neural network with a smoothed ReLU activation function (see equation (3.2) and figure 4). Using our mathematical framework, we prove that this construction leads to a system that is exponentially reliable (i.e. the output of the system is unique and is smooth with respect to the parameters of the model, and the process converges exponentially fast) and converges from infinity in finite time (so the convergence time is uniformly bounded over all initial conditions). See definitions 4.7 and 5.3 for the precise meaning of these terms.

Figure 4.

The function $(1/2)\bigl(y+\sqrt{y^2+4h}\bigr)$ for h = 0.3, 0.1, 0. Note that the h = 0 case is the ReLU.

The applications possible from chemical reaction network implementations of neural networks seem nearly limitless. However, it is the view of these authors that this potential can only be achieved once a solid mathematical foundation is created upon which to build the necessary theory and, eventually, physical implementations, perhaps via DNA strand displacement [9,31,32]. We therefore view this work as a starting point, with follow-up work focused on implementations of neural networks that can perform gradient descent autonomously, allowing us to relax the assumption of a fixed set of parameters, in both supervised and unsupervised settings. Finally, while the focus of the current paper is on implementations of neural networks via deterministically modelled reaction networks, stochastic variants are possible as well. In particular, stochastically modelled reaction networks will be the more natural choice whenever the goal is the approximation of distributions as opposed to functions [8]. Study of such implementations is therefore another exciting avenue of future research.

We end this section with a brief collection of some notation that will be used throughout this paper. We denote the empty set by Ø. We denote an arbitrary index set by $I$. We use the notation $\dot{\bigcup}_{i\in I}A_i$ to mean the union $\bigcup_{i\in I}A_i$ where $A_i\cap A_j=\emptyset$ for all $i,j\in I$ such that $i\ne j$. By partition of a set S, we mean a collection of non-empty subsets of S, $\{A_i\ne\emptyset : i\in I\}$, such that $S=\dot{\bigcup}_{i\in I}A_i$. For two vectors u, v, we will denote the Hadamard product, which is simply term-wise multiplication, via $\odot$. That is, we have

$$(u\odot v)_i = u_i v_i.$$

For a function $f:\mathbb{R}^c\to\mathbb{R}$ and a vector $u=(u_1,\dots,u_c)$ we denote by $\nabla_u f$ the vector whose ith component is $\partial f/\partial u_i$. For a vector valued function f, we denote by f ′(x) the vector whose ith component is $f_i'(x)$.

The remainder of the paper is organized as follows. Sections 2 and 3 give primers, including notation used in this paper, on reaction networks and neural networks, respectively. As there are two distinct notions of networks in this paper, it is important to carefully separate the two. In §4, we present our main theoretical results pertaining to ODE implementations of neural networks. In §5, we demonstrate how to utilize our theoretical results to construct a reaction network that implements a given neural network with a fixed set of parameters and a smoothed ReLU activation function. In §6, we provide a detailed example, including a demonstration of how to utilize our theory to implement neural networks with different activation functions.

2. Reaction networks

Reaction networks are graphical representations of interactions between different ‘species’. In this context, the word species may refer to different organisms (for example, if you are modelling the interactions among foxes and hares) or to different chemical compounds (for example, if you are modelling the dynamics of a biochemical process within a cell). In this paper, we are primarily interested in the latter context and will also refer to reaction networks as ‘chemical reaction networks’, as is common.

Definition 2.1. —

A reaction network, or chemical reaction network, consists of a nonempty and finite set of species S and a directed graph with vertices C and directed edges R satisfying the following conditions:

  • each vertex is a linear combination of the species over the non-negative integers;

  • every species appears with a positive coefficient in at least one vertex;

  • no two vertices are the same linear combination of the species;

  • each vertex is connected by a directed edge to at least one other vertex;

  • there are no directed edges from a vertex to itself.

Vertices of the reaction network are called complexes, and directed edges are called reactions. If $Y,\hat{Y}\in\mathcal{C}$ are two complexes and there is a directed edge from $Y$ to $\hat{Y}$, we will write $Y\to\hat{Y}\in\mathcal{R}$. We will often denote a reaction network via $G=(\mathcal{S},\mathcal{C},\mathcal{R})$. △

When considering general/theoretical systems, we will typically denote the species as $\mathcal{S}=\{X_1,\dots,X_n\}$, in which case our vertices/complexes are of the form

$$Y=b_1X_1+\cdots+b_nX_n,\quad\text{where } b_i\in\mathbb{Z}_{\ge 0}\ \text{for each } i\in\{1,\dots,n\}.$$

We will use the common slight abuse of notation by also associating a complex $Y\in\mathcal{C}$ with the vector in $\mathbb{Z}^n_{\ge 0}$ whose ith component is $b_i$. Using this convention, we define the reaction vector for a reaction $Y\to\hat{Y}\in\mathcal{R}$ as

$$\zeta_{Y\to\hat{Y}}=\hat{Y}-Y\in\mathbb{Z}^n.$$

When considering specific examples, we will use more suggestive notation for our species. We present two examples to solidify the notation. It is a common practice, which we use here, to specify a reaction network by writing all the reactions, since the sets S, C and R are contained in this description.

Example 2.2. —

Consider the following reaction network with two species, $\mathcal{S}=\{X_1,X_2\}$:

$$X_1+X_2\to 2X_2$$

and

$$X_2\to X_1.$$

Here the set of complexes/vertices is {X1 + X2, 2X2, X2, X1}. For example, it could be that X1 is an active form of a protein and X2 is the inactive form and two actions can take place: (i) an inactive protein can catalyse the inactivation of an active protein and (ii) an inactive protein can spontaneously become active. For another example, we could use the network to model disease spread, with X1 representing healthy/susceptible individuals and X2 representing those that are infected.

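Whatever the modelling scenario is, the network is the same and consists of two species, four complexes (vertices) and two reactions. The associated reaction vectors are

$$\zeta_{X_1+X_2\to 2X_2}=\begin{bmatrix}-1\\ 1\end{bmatrix}\quad\text{and}\quad\zeta_{X_2\to X_1}=\begin{bmatrix}1\\ -1\end{bmatrix}.$$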

Example 2.3. —

Consider the following reaction network with three species, $\mathcal{S}=\{X_1,X_2,X_3\}$:

$$0\to X_1+X_2,\qquad X_1\rightleftarrows X_3\leftarrow X_1+X_3.$$

In this example, molecules of $X_1$ and $X_2$ enter the system from outside of it via $0\to X_1+X_2$, $X_1$ can spontaneously convert to $X_3$ and vice versa via the two reactions $X_1\rightleftarrows X_3$, and $X_3$ catalyses the removal of $X_1$ molecules via the reaction $X_1+X_3\to X_3$.

The reaction network tells us the constituent species of a model, the counts of each of the species required for each of the reactions to take place and the counts of the products of each reaction. Moreover, the reaction vectors give the net changes in the counts of the species due to the occurrence of the different reactions. However, the reaction network does not determine the rates at which the different reactions take place.

A common modelling choice is to assume that the vector of concentrations of the species at time t ≥ 0, denoted by $x(t)\in\mathbb{R}^n_{\ge 0}$, satisfies a system of the form

$$\dot{x}(t)=\sum_{Y\to\hat{Y}\in\mathcal{R}}\lambda_{Y\to\hat{Y}}(x(t))\,\zeta_{Y\to\hat{Y}}, \tag{2.1}$$

where the enumeration is over all of the reactions and $\lambda_{Y\to\hat{Y}}:\mathbb{R}^n_{\ge 0}\to\mathbb{R}_{\ge 0}$ is some function. The set of functions $\Lambda=\{\lambda_{Y\to\hat{Y}}\}$ is called the kinetics of the model, and the most common form of kinetics, and the one we use throughout, is termed mass-action kinetics in which

$$\lambda_{Y\to\hat{Y}}(x)=\kappa_{Y\to\hat{Y}}\prod_{i=1}^{n}x_i^{Y_i},$$

for some choice of rate constant $\kappa_{Y\to\hat{Y}}>0$ and where $Y_i$ is the ith component of $Y$ viewed as a vector in $\mathbb{Z}^n_{\ge 0}$. When Λ is mass-action kinetics, we say that (G,Λ) is a mass-action system. When mass-action kinetics is used, it is common to place the reaction rate constant next to the associated arrow in the graph, $Y\xrightarrow{\kappa_{Y\to\hat{Y}}}\hat{Y}$.
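As a rough illustration of (2.1) under mass-action kinetics, the following sketch assembles the right-hand side for the network of example 2.2 directly from its complexes. The rate constants (2.0 and 1.0) and all function names are assumptions made for this illustration; they do not come from the paper.

```python
from math import prod

# Each reaction is (source complex, product complex, rate constant).
# Complexes are coefficient vectors over the species (X1, X2).
# The rate constants are illustrative assumptions, not values from the paper.
REACTIONS = [
    ((1, 1), (0, 2), 2.0),  # X1 + X2 -> 2 X2
    ((0, 1), (1, 0), 1.0),  # X2 -> X1
]

def reaction_vector(source, product):
    """Net change in the species: product complex minus source complex."""
    return tuple(p - s for s, p in zip(source, product))

def mass_action_rate(kappa, source, x):
    """Mass-action rate: kappa * prod_i x_i ** Y_i for source complex Y."""
    return kappa * prod(xi ** yi for xi, yi in zip(x, source))

def mass_action_rhs(x, reactions=REACTIONS):
    """Right-hand side of (2.1): sum of rate * reaction vector over reactions."""
    dx = [0.0] * len(x)
    for source, product, kappa in reactions:
        rate = mass_action_rate(kappa, source, x)
        for i, change in enumerate(reaction_vector(source, product)):
            dx[i] += rate * change
    return dx
```

Because both reaction vectors of example 2.2 sum to zero, the two components returned by `mass_action_rhs` always cancel, which reflects conservation of the total concentration of X1 and X2 in that network.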

3. Neural networks

We give a basic introduction to the type of neural networks we consider in this paper: feed-forward networks. For more on neural networks, see [33–37]. Loosely, a neural network is a graph that gives a visual depiction of a certain type of mathematical function. The class of functions they can represent, which will be detailed below, has many parameters, and is ‘universal’ in that it can be used to approximate any continuously differentiable function arbitrarily well [38,39]. The power of neural networks comes from the fact that they can be ‘trained’ from data, which simply means that the parameters of the function can be calibrated algorithmically so as to produce a final function capable of carrying out some pre-determined task (such as image recognition).

Below, we will first introduce the basic structure of a neural network. Next, we will explain how each such graph, when combined with a choice of parameters and an ‘activation function’, is simply a representation for a particular function. We will call such a network, in which all parameters, together with the activation function, are fixed, a ‘hardwired’ neural network. Finally, we will discuss how neural networks can be trained by finding parameters for the network that minimize (at least locally) a desired cost function. This minimization is often performed by a version of gradient descent and is termed backpropagation in the field.

3.1. Structure of a neural network

Formally, a feed-forward neural network G = (V, D) is a directed graph on a set of nodes V and a set of directed edges $D\subseteq V\times V$, such that there is a partition of V into layers $L_\ell$, $V=\dot{\bigcup}_{\ell=0}^{m}L_\ell$, with the property that $(\tilde{X}',\tilde{X})\in D$ if and only if $\tilde{X}'\in L_\ell$ and $\tilde{X}\in L_{\ell+1}$ for some ℓ ∈ {0, …, m − 1}. We will refer to the set $L_\ell$ as the ℓth layer of G, so G has m + 1 layers, and each $L_\ell$, with 0 ≤ ℓ ≤ m, contains $c_\ell>0$ nodes. The nodes in $L_0$ are referred to as input nodes, while those in $L_m$ are referred to as the output nodes. All nodes in $\bigcup_{\ell=1}^{m-1}L_\ell$ are referred to as hidden nodes or intermediate nodes. We use input layer, output layer, and hidden layer to refer to each layer that contains the corresponding nodes. Note that we can partition D as follows:

$$D=\dot{\bigcup_{\ell:\,1\le\ell\le m}}\ \dot{\bigcup_{\tilde{X}\in L_\ell}}\ \dot{\bigcup_{\tilde{X}':\,(\tilde{X}',\tilde{X})\in D}}\ \{(\tilde{X}',\tilde{X})\}. \tag{3.1}$$

For the sake of brevity, for the remainder of the paper we will refer to feed-forward neural networks simply as neural networks.

Indices can often become burdensome when working with neural networks. Thus, we minimized their use in the preceding explanation, and will continue to do so when possible. That said, it will be useful to have an enumeration and so we will denote the jth node in layer ℓ by $\tilde{X}_j^\ell$. See figure 1.

Figure 1.

The graphical structure of a neural network. The red, blue and green nodes are input nodes, hidden nodes and output nodes, respectively. An arrow from one node to another is a representation of the direction of influence, i.e. an edge in D. The value of the ‘tail’ node is input for computation of the value of the ‘head’ node.

3.2. A neural network as a mathematical function

We label each non-input node and each directed edge with a real number. A label for a non-input node is termed a bias, whereas a label for an edge is termed a weight. Moreover, we associate an activation function with each non-input node, which will be described fully below. We will call a neural network with such a labelling and a choice of activation function a hardwired neural network. For each ℓ ∈ {1, …, m}, we will denote by $\beta^\ell\in\mathbb{R}^{c_\ell}$ the vector whose ith component gives the bias for node $\tilde{X}_i^\ell$, and will denote by $W^\ell\in\mathbb{R}^{c_\ell\times c_{\ell-1}}$ the matrix whose (i, j)th entry represents the weight of the edge between nodes $\tilde{X}_j^{\ell-1}$ and $\tilde{X}_i^\ell$. Note that the ordering of the indices of $W^\ell$ seems backwards at first glance. However, this ordering will make certain expressions slightly cleaner later, and is standard in the field.

We will use the notation $\mathcal{B}$ for the assignment of node labels (biases) and $\mathcal{W}$ for the assignment of edge labels (weights). That is, for each of ℓ ∈ {1, …, m}, we have $\mathcal{B}(\ell)=\beta^\ell$ and $\mathcal{W}(\ell)=W^\ell$. Collectively, $\mathcal{P}=(\mathcal{B},\mathcal{W})$ is an assignment of labels to G = (V, D). So long as we have also chosen an activation function φ, which will be described directly below, we may denote the resulting hardwired neural network via $(G,\mathcal{P},\varphi)$.

Let $\varphi:\mathbb{R}\to\mathbb{R}_{\ge 0}$ be a continuous, monotonic function, which is then extended to $\varphi:\mathbb{R}^c\to\mathbb{R}^c_{\ge 0}$ for $c\in\{2,3,\dots\}$ by letting $(\varphi(y))_i=\varphi(y_i)$. We present a few examples of some so-called activation functions $\varphi:\mathbb{R}\to\mathbb{R}_{\ge 0}$.

  • 1.

    $\varphi_1(y)=1/(1+e^{-y})$. This sigmoid function is a bijection onto the interval (0, 1), and is used quite commonly. See figure 2.

  • 2.

    φ2(y) = max(0, y). This function is termed the ReLU function (rectified linear units). See figure 3.

  • 3.
    Let h ≥ 0 and define
    $$\varphi_3(y)=\frac{1}{2}\Bigl(y+\sqrt{y^2+4h}\Bigr). \tag{3.2}$$
    This function is a smoothed version of the ReLU function, while remaining strictly monotonic, and will play a key role in the present work. See figure 4.

Figure 2.

The function $1/(1+e^{-y})$.

Figure 3.

The ReLU activation function.

A pair of consecutive layers $L_{\ell-1}$ and $L_\ell$, along with all edges between the two layers, encodes a function $\psi^\ell:\mathbb{R}^{c_{\ell-1}}\to\mathbb{R}^{c_\ell}$ which is defined via

$$\psi^\ell(y)=\varphi(W^\ell y+\beta^\ell). \tag{3.3}$$

Taking compositions, a hardwired neural network is then simply a visual representation for the function $\Psi_{(G,\mathcal{P},\varphi)}:\mathbb{R}^{c_0}\to\mathbb{R}^{c_m}_{\ge 0}$ defined via

$$\Psi_{(G,\mathcal{P},\varphi)}=\psi^m\circ\psi^{m-1}\circ\cdots\circ\psi^1.$$

Thus, the function associated with a neural network is simply a sequence of compositions that alternates between linear functions (via matrix multiplication and vector addition) and nonlinear functions (via application of the activation function).

It is useful to provide a bit more notation before moving on. Suppose that $d\in\mathbb{R}^{c_0}$ is the input to the function $\Psi_{(G,\mathcal{P},\varphi)}$ (or, equivalently, the function $\psi^1$). We then define $a^0=d$ and for 1 ≤ ℓ ≤ m we define

$$z^\ell(d)=W^\ell a^{\ell-1}(d)+\beta^\ell \tag{3.4}$$

and

$$a^\ell(d)=\varphi(z^\ell(d)), \tag{3.5}$$

recursively, where we recall that the ith component of $\varphi(z^\ell(d))$ is $\varphi(z_i^\ell(d))$. The vector $a^\ell(d)$ is said to give the activations of the nodes in the ℓth layer. With these definitions, we have that for any $\ell\in\{1,\dots,m\}$

$$\Psi_{(G,\mathcal{P},\varphi)}(d)=\psi^m\circ\cdots\circ\psi^\ell(a^{\ell-1}(d)).$$

Moreover, note that $\Psi_{(G,\mathcal{P},\varphi)}=a^m$, which is a useful compact notation for $\Psi_{(G,\mathcal{P},\varphi)}$.
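The recursion (3.4) and (3.5) can be sketched in a few lines of plain Python. This is only an illustration: the 2-2-1 network at the bottom (`WEIGHTS`, `BIASES`) and the default h = 0.1 are arbitrary values chosen for the example, and the function names are not from the paper.

```python
import math

def smoothed_relu(y, h=0.1):
    """The activation (3.2); h = 0 recovers the ReLU max(0, y)."""
    return 0.5 * (y + math.sqrt(y * y + 4.0 * h))

def forward(weights, biases, d, phi=smoothed_relu):
    """Return [a^0, a^1, ..., a^m] computed via (3.4) and (3.5).

    weights[l] and biases[l] hold W^(l+1) and beta^(l+1) (0-based lists),
    with weights[l][i][j] the weight from node j of the previous layer
    to node i of the next layer.
    """
    activations = [list(d)]
    for W, beta in zip(weights, biases):
        a_prev = activations[-1]
        # Linear step (3.4): z = W a_prev + beta.
        z = [sum(w * a for w, a in zip(row, a_prev)) + b
             for row, b in zip(W, beta)]
        # Nonlinear step (3.5): apply the activation componentwise.
        activations.append([phi(zi) for zi in z])
    return activations

# Hypothetical network with c_0 = 2, c_1 = 2, c_2 = 1; the numbers are arbitrary.
WEIGHTS = [[[1.0, -1.0], [0.5, 2.0]], [[1.0, 0.5]]]
BIASES = [[0.0, -1.0], [0.5]]
```

Calling `forward(WEIGHTS, BIASES, [1.0, 2.0])[-1]` returns the output $a^m(d)$, while the intermediate list entries are the layer activations $a^\ell(d)$.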

3.3. Learning from data

Suppose now that we are given N pieces of data of the form $(d,\tau(d))\in\mathbb{R}^{c_0}\times\mathbb{R}^{c_m}$. To take a common example, $d\in\mathbb{R}^{784}$ could be the values of the 28 × 28 = 784 pixels in a greyscale image of a hand-drawn digit, and $\tau(d)\in\mathbb{R}^{10}_{\ge 0}$ could be the vector $e_i$ (the vector with a 1 in the ith component and zeros elsewhere) if the image is that of a hand-drawn i − 1. Here d is considered the input data and τ(d) is considered the ‘truth’. We could then construct a neural network with $c_0=784$ and $c_m=10$ simply by choosing (i) the number of hidden layers, and how many nodes per layer, (ii) biases and weights, $\mathcal{P}=(\mathcal{B},\mathcal{W})$, for the nodes and directed edges, and (iii) an activation function φ. In such a manner, our hardwired function $\Psi_{(G,\mathcal{P},\varphi)}$ is determined.

At this point, we could ask how closely our function matches the ‘truth’ by looking at some cost function. Therefore, assume that we have a cost function of the form

$$\mathrm{Cost}(\mathcal{P})=\frac{1}{N}\sum_{d}C(d,\mathcal{P})=\frac{1}{N}\sum_{d}C(d), \tag{3.6}$$

where the sum is over all the data and C is a function giving a measure of how closely $\Psi_{(G,\mathcal{P},\varphi)}(d)=a^m(d)$ approximates τ(d). The second equality above points out that for notational convenience we will typically suppress the dependence of the parameters $\mathcal{P}=(\mathcal{B},\mathcal{W})$ in C. Some of the most commonly used cost functions are given below:

  • 1.
    The quadratic cost function, in which case
    $$C(d)=\frac{1}{2}\bigl(\Psi_{(G,\mathcal{P},\varphi)}(d)-\tau(d)\bigr)^2=\frac{1}{2}\bigl(a^m(d)-\tau(d)\bigr)^2. \tag{3.7}$$
  • 2.
    The one-norm cost function, in which case
    $$C(d)=|a^m(d)-\tau(d)|.$$
  • 3.
    The cross-entropy cost function, in which case
    $$C(d)=-\bigl[\tau(d)\ln(a^m(d))+(1-\tau(d))\ln(1-a^m(d))\bigr].$$

In this paper, we will take C to be given by the quadratic cost function (3.7). This choice of cost function does not play a significant role in the present work.

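Of course, we did not specify how we chose our parameters $\mathcal{P}=(\mathcal{B},\mathcal{W})$ for the model. Supposing we choose them randomly somehow, there is no reason our function $\Psi_{(G,\mathcal{P},\varphi)}$ should be a good approximation for τ for the given data. Therefore, we would like to find those parameters $\mathcal{P}$ that minimize the cost function and to do so it is natural to use gradient descent. Thus, we need to be able to efficiently compute $\nabla_{\beta^\ell}\mathrm{Cost}$ and $\partial\mathrm{Cost}/\partial W^\ell_{ij}$ for each appropriate value of ℓ, i, and j. Because of the sum in (3.6), it is sufficient to compute the gradient of C(d), and these can be computed as follows [36]:

$$\left.\begin{aligned}
\delta^m(d)&=\nabla_{a^m}C(d)\odot\varphi'(z^m(d)),\\
\delta^\ell(d)&=\bigl((W^{\ell+1})^{T}\delta^{\ell+1}(d)\bigr)\odot\varphi'(z^\ell(d)),\\
\nabla_{\beta^\ell}C(d)&=\delta^\ell(d)\\
\text{and}\quad\frac{\partial C(d)}{\partial W^\ell_{ij}}&=\delta_i^\ell(d)\,a_j^{\ell-1}(d),
\end{aligned}\right\} \tag{3.8}$$

where $\nabla_{a^m}C(d)$ is the gradient of C(d) with respect to $a^m$. For example, if C is given by the quadratic cost function (3.7), we have

$$\nabla_{a^m}C(d)=(a^m(d)-\tau(d)).$$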
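The backpropagation recursion (3.8) can be sketched as follows for the quadratic cost (3.7) and the smoothed ReLU (3.2). This is a minimal pure-Python sketch, not an optimized implementation; the value `H = 0.1` and all function names are assumptions made for illustration. Lists are 0-based, so `weights[l]` stores $W^{\ell}$ with ℓ = l + 1.

```python
import math

H = 0.1  # smoothing parameter h of (3.2); an illustrative choice

def phi(y):
    """Smoothed ReLU (3.2)."""
    return 0.5 * (y + math.sqrt(y * y + 4.0 * H))

def dphi(y):
    """Derivative of the smoothed ReLU."""
    return 0.5 * (1.0 + y / math.sqrt(y * y + 4.0 * H))

def forward(weights, biases, d):
    """Return ([z^1..z^m], [a^0..a^m]) following (3.4) and (3.5)."""
    zs, acts = [], [list(d)]
    for W, beta in zip(weights, biases):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + b
             for row, b in zip(W, beta)]
        zs.append(z)
        acts.append([phi(v) for v in z])
    return zs, acts

def cost(weights, biases, d, tau):
    """Quadratic cost (3.7) for a single data point."""
    _, acts = forward(weights, biases, d)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(acts[-1], tau))

def backprop(weights, biases, d, tau):
    """Gradients of C(d) via (3.8); returns (grad_beta, grad_W) per layer."""
    zs, acts = forward(weights, biases, d)
    m = len(weights)
    # Output layer: delta^m = (a^m - tau) * phi'(z^m), componentwise.
    delta = [(a - t) * dphi(z) for a, t, z in zip(acts[-1], tau, zs[-1])]
    grad_b, grad_W = [None] * m, [None] * m
    for l in range(m - 1, -1, -1):
        grad_b[l] = delta
        grad_W[l] = [[di * aj for aj in acts[l]] for di in delta]
        if l > 0:
            # Propagate back: ((W^{l+1})^T delta^{l+1}) * phi'(z^l).
            W = weights[l]
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(acts[l]))]
            delta = [b * dphi(z) for b, z in zip(back, zs[l - 1])]
    return grad_b, grad_W
```

A simple way to gain confidence in such a sketch is to compare an entry of `grad_b` or `grad_W` against a finite-difference quotient of `cost`.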

4. Neural networks and ODEs

Fix a hardwired neural network G = (V, D) with parameters $\mathcal{P}=(\mathcal{B},\mathcal{W})$, whose ℓth layer contains $c_\ell$ nodes, in which each node has activation function φ. Let $W^\ell$, $a^\ell$ and $\beta^\ell$ be as in the previous section.

Now consider a system of ODEs defined recursively via

$$x_i^0(t)\equiv d_i,\quad\text{for some fixed } d\in\mathbb{R}^{c_0}_{\ge 0} \tag{4.1}$$

and

$$\frac{d}{dt}x_i^\ell(t)=f_i^\ell\bigl(x^{\ell-1}(t),x_i^\ell(t)\bigr),\quad\text{for } \ell\in\{1,\dots,m\}, \tag{4.2}$$

where $x^\ell\in\mathbb{R}^{c_\ell}_{\ge 0}$. Here we use $d\in\mathbb{R}^{c_0}_{\ge 0}$ to denote our initial condition as it represents the input ‘data’ to the system. Note that $x^{\ell-1}$ is acting as an external ‘forcing function’ on $x^\ell$. In particular, the system above has a natural feed-forward structure. For $r\in\{0,1,\dots,m\}$, we denote by $\mathcal{F}_r$ the subsystem of (4.1) and (4.2) consisting of only those terms $x_i^\ell$ for which ℓ ≤ r. Note that for any 1 ≤ r ≤ m, $\mathcal{F}_r$ contains $\mathcal{F}_{r-1}$ and that $\mathcal{F}_m$ is all of (4.1) and (4.2).

Definition 4.1. —

Suppose that for each fixed choice of $d\in\mathbb{R}^{c_0}_{\ge 0}$ the system (4.1) and (4.2) has a unique solution $\{x^\ell:1\le\ell\le m\}$ that satisfies

$$\lim_{t\to\infty}x^\ell(t)=\varphi\bigl(W^\ell a^{\ell-1}(d)+\beta^\ell\bigr)=a^\ell(d)\in\mathbb{R}^{c_\ell}_{\ge 0}$$

for any choices of $x_i^\ell(0)\in\mathbb{R}_{\ge 0}$ for ℓ ≥ 1. Then we say that the system (4.1) and (4.2) implements the neural network $(G,\mathcal{P},\varphi)$. ▵

Note that in order for a system to implement a neural network according to the above definition, it is not enough for the system to simply convert inputs, d, to the correct outputs, $a^m(d)=\Psi_{(G,\mathcal{P},\varphi)}(d)$. Instead, we require that the system calculates the activations for each node in the network, i.e. $a^\ell(d)$ for all ℓ ≤ m, and does so for any choice of initial condition in layers 1 through m.

Example 4.2. —

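Consider a system (4.1) and (4.2) with

$$f_i^\ell\bigl(x^{\ell-1},x_i^\ell\bigr)=h+\rho_i^\ell(x^{\ell-1})\,x_i^\ell-(x_i^\ell)^2, \tag{4.3}$$

where

$$\rho_i^\ell(x^{\ell-1})=\bigl(W^\ell x^{\ell-1}+\beta^\ell\bigr)_i=\sum_{j=1}^{c_{\ell-1}}W_{ij}^\ell x_j^{\ell-1}+\beta_i^\ell. \tag{4.4}$$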

We claim that the system (4.1) and (4.2) with this choice of fi implements a neural network with the smoothed ReLU function (3.2). This statement will be proved rigorously below once we have some additional mathematical machinery.
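Ahead of the proof, the claim can be checked numerically. The sketch below integrates (4.1) and (4.2) with the choice (4.3) by a plain forward-Euler scheme and returns the final state of every layer, which can then be compared with the activations $a^\ell(d)$ of the smoothed ReLU network. The step size, time horizon, starting value `x_init` and function names are all illustrative assumptions; forward Euler is used only for simplicity and is not part of the paper's construction.

```python
import math

def smoothed_relu(y, h):
    """The activation (3.2)."""
    return 0.5 * (y + math.sqrt(y * y + 4.0 * h))

def simulate(weights, biases, d, h=0.1, x_init=5.0, dt=1e-3, t_end=30.0):
    """Forward-Euler integration of (4.1), (4.2) with f given by (4.3).

    weights[l][i][j] is the weight from node j in layer l to node i in
    layer l + 1 (0-based). Layer 0 is held fixed at the input d, and every
    other node starts from the same value x_init.
    Returns the state of every layer at time t_end.
    """
    x = [list(d)] + [[x_init] * len(beta) for beta in biases]
    for _ in range(int(t_end / dt)):
        new_x = [x[0]]
        for l, (W, beta) in enumerate(zip(weights, biases), start=1):
            layer = []
            for i, xi in enumerate(x[l]):
                # rho from (4.4), driven by the current state of layer l - 1.
                rho = sum(w * xj for w, xj in zip(W[i], x[l - 1])) + beta[i]
                layer.append(xi + dt * (h + rho * xi - xi * xi))
            new_x.append(layer)
        x = new_x
    return x
```

In experiments with a small network, starting values as different as `x_init=0.0` and `x_init=50.0` should lead to the same final state, in line with the requirement in definition 4.1 that the limit not depend on the initial condition in layers 1 through m.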

For a particular choice of ℓ and i, we can think of the one-dimensional system (4.2) as simultaneously implementing both the linear updating step (3.4) and evaluation with the activation function (3.5) for node i in layer ℓ. This observation motivates the following.

Definition 4.3. —

If the system (4.1) and (4.2) implements the neural network (G,P,φ), then (4.2) is termed the activation system for node i in layer ℓ. ▵

The following definition is added for completeness.

Definition 4.4. —

We will say that $y:\mathbb{R}_{\ge 0}\to\mathbb{R}^n$ converges exponentially to $\hat{y}\in\mathbb{R}^n$, and will write $y(t)\xrightarrow{\exp}\hat{y}$, if there are c, h > 0 for which $|y(t)-\hat{y}|\le c\,e^{-ht}$ for all t ≥ 0. ▵

The following definition characterizes some nice properties that activation systems (4.2) can have.

Definition 4.5. —

Consider the following one-dimensional system in which $y:\mathbb{R}_{\ge 0}\to\mathbb{R}^p$ is some forcing function:

$$\frac{d}{dt}x(t)=f(y(t),x(t)). \tag{4.5}$$
  • 1.
    Let q > 0. The system (4.5) is said to have q-polynomial decay if for any compact set $K\subset\mathbb{R}^p$ there is an M > 0 and a constant c > 0 such that when $y\in K$ and x > M we have
    $$f(y,x)\le -c\,x^q.$$
  • 2.

    System (4.5) is said to be exponentially feed-forward if for each $\hat{y}\in\mathbb{R}^p$ there is an $\hat{x}\in\mathbb{R}$ such that $y(t)\xrightarrow{\exp}\hat{y}$ implies $x(t)\xrightarrow{\exp}\hat{x}$, assuming x(t) exists for all t ≥ 0.

Thus, the system (4.5) has q-polynomial decay if it decays faster than the solution to $\dot{u}=-c\,u^q$ when (i) the forcing function takes values that are not too large (quantified by K) and (ii) the current value of the process is large (quantified by M). Note that for u(0) > 0, the solution to $\dot{u}=-c\,u^q$ converges from infinity in finite time if q > 1. For completeness, we have proven this in proposition A.1 in appendix A.
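To see why q > 1 is the relevant threshold, the comparison equation can be solved explicitly by separation of variables. The following is a sketch of that standard calculation, given here for orientation only; the paper's own argument is the one in proposition A.1.

```latex
\dot{u} = -c\,u^{q},\quad u(0) = u_0 > 0,\quad q > 1
\;\Longrightarrow\;
\frac{d}{dt}\,u^{1-q} = (1-q)\,u^{-q}\,\dot{u} = c\,(q-1)
\;\Longrightarrow\;
u(t) = \bigl(u_0^{\,1-q} + c\,(q-1)\,t\bigr)^{-1/(q-1)}
\le \bigl(c\,(q-1)\,t\bigr)^{-1/(q-1)} .
```

The final bound does not depend on $u_0$, so by any fixed time t > 0 every solution, however large its initial value, has entered the same bounded interval. For example, with q = 2 and c = 1 the bound reads u(t) ≤ 1/t.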

The usefulness of a system of the form (4.5) being exponentially feed-forward comes from the fact that we would like to be able to understand the long-term behaviour of $\dot{x}=f(y(t),x(t))$ via an understanding of the long-term behaviour of $\dot{x}=f(\hat{y},x(t))$. We note with the following simple example that one is not always able to do so.

Example 4.6. —

Consider the system of the form (4.5) with

$$f(y,x)=\begin{cases}1 & \text{if } y>1\\ -x & \text{if } y\le 1.\end{cases}$$

The system with $y(t)=1+e^{-t}$ satisfies $y(t)\xrightarrow{\exp}1$. However, for this particular choice of y(t), we have y(t) > 1 for all t ≥ 0. Thus, x(t) = x(0) + t, which does not converge to the fixed point of $\dot{x}=f(1,x)$, which is zero regardless of x(0).

Given the discussion above, it will be useful to consider dynamical systems of the form

$$\dot{x}(t)=f(y,x(t)),$$

where y should be thought of as a (time-independent) collection of parameters, but now x is allowed to be higher-dimensional.

Definition 4.7. —

Suppose that $\dot{x}(t)=f(y,x(t))$ with $x(t)\in\mathbb{R}^n_{\ge 0}$ and $y\in\mathbb{R}^p_{>0}$ is a parametrized dynamical system such that for any choice of $x(0)\in\mathbb{R}^n_{\ge 0}$ and $y\in\mathbb{R}^p_{>0}$ the system has a unique solution. We will say that the system

  • 1.

    is reliable if there is a continuously differentiable function $X:\mathbb{R}^p_{>0}\to\mathbb{R}^n_{>0}$ such that for any choice of $x(0)\in\mathbb{R}^n_{\ge 0}$, we have $\lim_{t\to\infty}x(t)=X(y)$;

  • 2.

    converges from infinity in finite time if there is a compact set $K\subset\mathbb{R}^n_{\ge 0}$ and a $T:\mathbb{R}^p_{>0}\to\mathbb{R}_{>0}$ such that $x(t)\in K$ for any $t\ge T(y)$ and $x(0)\in\mathbb{R}^n_{\ge 0}$;

  • 3.
    is exponentially reliable if it is reliable and there is a $\lambda:\mathbb{R}^p_{>0}\to\mathbb{R}_{>0}$ such that
    $$|x(t)-X(y)|\le|x(0)-X(y)|\,e^{-\lambda(y)t}.$$

Note that the definition of reliable does not rule out the existence of fixed points outside of $\mathbb{R}^n_{\ge 0}$.

The main question we have is the following: when can we conclude that the fully parametrized system (4.1) and (4.2) has our desirable properties (reliability, convergence from infinity in finite time, and exponential reliability)? The following theorem shows that these properties follow from easily checked conditions on the functions $f_i^\ell$. In the theorem below, the vector of parameters y should be thought of as a steady state value for $x^{\ell-1}(t)$.

Theorem 4.8. —

Consider the system (4.1) and (4.2). Suppose that for each $\ell\in\{1,\dots,m\}$ and $i\in\{1,\dots,c_\ell\}$ the dynamical system

$$\frac{d}{dt}x(t)=f_i^\ell(y,x(t)),\qquad x(t)\in\mathbb{R},\quad y\in\mathbb{R}^{c_{\ell-1}}_{\ge 0},$$

is reliable. Moreover, assume that

$$\frac{d}{dt}x(t)=f_i^\ell(y,x(t))$$

has q-polynomial decay for some q > 1 and is exponentially feed-forward. Then the system (4.1) and (4.2) converges from infinity in finite time and is exponentially reliable.

Proof. —

The proof proceeds by induction on r for the systems $\mathcal{F}_r$, where we remind the reader that the systems $\mathcal{F}_r$ are defined below (4.1) and (4.2). Consider the case ℓ = 1, where we have

$$\frac{d}{dt}x_i^1(t)=f_i^1\bigl(x^0,x_i^1(t)\bigr),\quad\text{for } i\in\{1,\dots,c_1\}.$$

Here, reliability of $x_i^1$ follows by our assumption. The convergence of $x_i^1$ from infinity in finite time follows by the assumption of q-polynomial decay (compare with $\dot{u}=-c\,u^q$). Finally, the exponential reliability of $x_i^1$ follows from the exponential feed-forward assumption (here $x^0\xrightarrow{\exp}x^0$ trivially). Hence, the system $\mathcal{F}_1$ satisfies all the desired properties.

Now suppose the result holds for $\mathcal{F}_r$ with r < m. Then there is a compact set $K\subset\mathbb{R}^{c_r}_{\ge 0}$ and a time T > 0 so that $x^r(t)\in K$ for all $t\ge T$, and moreover $x^r\xrightarrow{\exp}\hat{x}^r$. Hence, by the assumption of q-polynomial decay, $x^{r+1}(t)$ converges from infinity in finite time, and we may conclude that the system $\mathcal{F}_{r+1}$ does as well. Finally, by the exponential feed-forward assumption on layer r + 1, together with the assumption that $\dot{x}_i=f_i^{r+1}(y,x_i(t))$ is reliable, we may conclude that $\mathcal{F}_{r+1}$ is exponentially reliable, and the proof is complete. ▪

We return to the activation system presented in example 4.2.

Proposition 4.9. —

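Consider the hardwired neural network $(G,\mathcal{P},\varphi)$ and the system (4.1) and (4.2) with

$$f_i^\ell\bigl(x^{\ell-1},x_i^\ell\bigr)=h+\rho_i^\ell(x^{\ell-1})\,x_i^\ell-(x_i^\ell)^2,$$

where

$$\rho_i^\ell(x^{\ell-1})=\bigl(W^\ell x^{\ell-1}+\beta^\ell\bigr)_i=\sum_{j=1}^{c_{\ell-1}}W_{ij}^\ell x_j^{\ell-1}+\beta_i^\ell,$$

and h > 0. This system implements, in the sense of definition 4.1, the hardwired feed-forward neural network $(G,\mathcal{P},\varphi)$ where φ is given as the smoothed ReLU function (3.2). Moreover, the system converges from infinity in finite time and is exponentially reliable.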

Proof. —

The fixed points of $\dot{x}=f_i^\ell(z,x(t))$ are

$$\frac{\rho_i^\ell(z)\pm\sqrt{\rho_i^\ell(z)^2+4h}}{2},$$

which satisfy

$$\frac{\rho_i^\ell(z)-\sqrt{\rho_i^\ell(z)^2+4h}}{2}<0<\frac{\rho_i^\ell(z)+\sqrt{\rho_i^\ell(z)^2+4h}}{2}.$$

Note that the strict inequalities follow from h > 0. The positive equilibrium is continuously differentiable in the argument z. Moreover, for any $x(0)\in\mathbb{R}_{\ge 0}$, asymptotic stability follows from standard methods. Hence, each of $\dot{x}=f_i^\ell(z,x(t))$ is reliable.

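For each ℓ and i, the system $\dot{x}(t)=f_i^\ell(y(t),x(t))$ has 2-polynomial decay. Hence, to apply theorem 4.8 and complete the proof we simply need to show that $\dot{x}(t)=f_i^\ell(y(t),x(t))$ is exponentially feed-forward.

Thus, consider $\dot{x}(t)=f_i^\ell(y(t),x(t))$ and suppose that $y(t)\xrightarrow{\exp}\hat{y}$. Denote

$$x_+=\frac{\rho_i^\ell(\hat{y})+\sqrt{\rho_i^\ell(\hat{y})^2+4h}}{2}\quad\text{and}\quad x_-=\frac{\rho_i^\ell(\hat{y})-\sqrt{\rho_i^\ell(\hat{y})^2+4h}}{2}$$

and let

$$V(x)=\frac{1}{2}(x-x_+)^2.$$

Then, by adding and subtracting appropriately,

$$\begin{aligned}
\frac{d}{dt}V(x(t))&=(x(t)-x_+)\bigl(h+\rho_i^\ell(y(t))\,x(t)-x(t)^2\bigr)\\
&=(x(t)-x_+)\bigl(h+\rho_i^\ell(\hat{y})\,x(t)-x(t)^2\bigr)+(x(t)-x_+)\bigl(\rho_i^\ell(y(t))-\rho_i^\ell(\hat{y})\bigr)x(t)\\
&=-(x(t)-x_-)(x(t)-x_+)^2+(x(t)-x_+)\bigl(\rho_i^\ell(y(t))-\rho_i^\ell(\hat{y})\bigr)x(t).
\end{aligned}$$

By assumption $y(t)\xrightarrow{\exp}\hat{y}$, and so by linearity we have that $\rho_i^\ell(y(t))\xrightarrow{\exp}\rho_i^\ell(\hat{y})$. Moreover, standard methods can be used to show that x(t) is uniformly bounded in time. Combining the above allows us to conclude that

$$\frac{d}{dt}V(x(t))\le a(t)-M\,V(x(t)),$$

where $0\le a(t)\le c\,e^{-ht}$ for some c, h > 0. Hence, by Gronwall’s inequality, see appendix A,

$$\begin{aligned}
\frac{1}{2}(x(t)-x_+)^2=V(x(t))&\le\frac{1}{2}(x(0)-x_+)^2e^{-Mt}+c\int_0^t e^{-M(t-s)}e^{-hs}\,ds\\
&=\frac{1}{2}(x(0)-x_+)^2e^{-Mt}+\frac{c}{M-h}\bigl(e^{-ht}-e^{-Mt}\bigr),
\end{aligned}$$

where we can select $h\ne M$ by taking h slightly smaller if need be. Taking square roots shows that $x(t)\xrightarrow{\exp}x_+$ as desired. ▪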

5. Reaction network implementation of a hard-wired neural network with a smoothed ReLU activation function

This section is split into two parts. In §5.1, we give some preliminary definitions and concepts. In §5.2, we give the explicit construction.

5.1. Preliminaries

Consider a reaction network $\mathcal{G} = (\mathcal{S}, \mathcal{C}, \mathcal{R})$. It is convenient to separate the species set $\mathcal{S}$ into a disjoint union of dynamic and enzymatic species.

Definition 5.1. —

$X_i \in \mathcal{S}$ is said to be an enzymatic species if $(\zeta_{Y \to \hat{Y}})_i = 0$ for all $Y \to \hat{Y} \in \mathcal{R}$. A species is said to be a dynamic species if it is not an enzymatic species. ▴

Thus, an enzymatic species is one whose concentration is fixed for all time to its initial value, regardless of the initial value of the system. Enzymatic species are referred to as such because they facilitate reactions, just like biological enzymes; higher availability of enzymes results in a proportional speedup of reactions. We will use the notation $\mathcal{S}_{\mathrm{dyn}}$ and $\mathcal{S}_{\mathrm{enz}}$ for the set of dynamic species and enzymatic species, respectively, and since any species can only be one or the other, $\mathcal{S} = \mathcal{S}_{\mathrm{dyn}} \mathbin{\dot\cup} \mathcal{S}_{\mathrm{enz}}$.

Example 5.2. —

Consider the reaction network

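$$X + Y + E \xrightarrow{k_1} 2Y + E$$

and

$$Y + F \xrightarrow{k_2} X + F.$$

Here $\mathcal{S}_{\mathrm{dyn}} = \{X, Y\}$ and $\mathcal{S}_{\mathrm{enz}} = \{E, F\}$.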

The concentrations of enzymatic species are time-invariant by definition, and so they satisfy the trivial ODE de/dt = 0, where e refers to the concentration of some enzyme E. This ODE obviously has the solution e(t) = e(0) for all t ≥ 0, independent of the dynamics of the other variables, and so it is without any loss of information that we can withhold the ODEs for the enzymes from our description. We simply regard the initial values of the enzymes as parameters in the dynamical system. Thus, we would say that the parametrized mass-action dynamical system associated to the network in example 5.2 is

$$\dot{x} = -k_1 e\,x y + k_2 f\,y \quad\text{and}\quad \dot{y} = k_1 e\,x y - k_2 f\,y, \tag{5.1}$$

where we regard e and f as positive parameters similar to $k_1$ and $k_2$.

An alternative approach is to remove all enzymatic species and to ‘absorb’ their time-invariant concentration into the rate constant of the reaction. For instance, the network in example 5.2 is dynamically equivalent to the following network:

$$X + Y \xrightarrow{k_1 e} 2Y$$

and

$$Y \xrightarrow{k_2 f} X,$$

in the sense that both give rise to an identical system of differential equations (5.1).

Even though the former construction, in which enzymatic species are included in the model description, may seem superfluous, it offers flexibility that will be found to be useful later when we construct reaction networks modularly, and then take unions of them. In these situations species that were once enzymatic for one of the subnetworks can be dynamic for the resulting larger network. This perspective will also be useful in later work when we change our outlook from a reaction network implementation of a hardwired neural network to a neural network capable of learning. For a preview, suppose that we add a reaction to the network in example 5.2, so the resulting network is

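$$X + Y + E \xrightarrow{k_1} 2Y + E, \quad Y + F \xrightarrow{k_2} X + F \quad\text{and}\quad Z + F \xrightarrow{k_3} Z. \tag{5.2}$$

Then F has lost its status as an enzyme and has been moved to the set of dynamic species. The species partition for the new network is $\mathcal{S}_{\mathrm{dyn}} = \{X, Y, F\}$ and $\mathcal{S}_{\mathrm{enz}} = \{E, Z\}$. Addition of the reaction $Z + F \to Z$ allows us to modulate the concentration of F and therefore also the rate at which the reaction $Y + F \to X + F$ occurs.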

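The example is illustrative of some general properties, which we now state. A subnetwork of a reaction network $\mathcal{G} = (\mathcal{S}, \mathcal{C}, \mathcal{R})$ is a reaction network $\mathcal{G}' = (\mathcal{S}', \mathcal{C}', \mathcal{R}')$ such that $\mathcal{R}' \subseteq \mathcal{R}$. It necessarily follows that $\mathcal{S}' \subseteq \mathcal{S}$, since every species in $\mathcal{S}'$ must participate in some reaction in $\mathcal{R}'$ and therefore also in $\mathcal{R}$, and by similar reasoning $\mathcal{C}' \subseteq \mathcal{C}$. While $\mathcal{S}'_{\mathrm{dyn}} \subseteq \mathcal{S}_{\mathrm{dyn}}$, the containment for enzymes runs backwards, i.e. $\mathcal{S}_{\mathrm{enz}} \cap \mathcal{S}' \subseteq \mathcal{S}'_{\mathrm{enz}}$. For example, the reaction network in example 5.2 is a subnetwork of the reaction network (5.2). The above-mentioned containments are easily checked to hold for this particular example.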
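The partition into dynamic and enzymatic species, and the two containments for subnetworks, can be checked mechanically. The sketch below is illustrative only: reactions are encoded as pairs of dictionaries mapping species names to stoichiometric coefficients (an encoding chosen here for convenience, not taken from the paper), and `species_partition` is a hypothetical helper name.

```python
def species_partition(reactions):
    """Split the species of a network into (dynamic, enzymatic) sets.

    A species is enzymatic when its net stoichiometric change is zero
    in every reaction, and dynamic otherwise.
    """
    species = set()
    dynamic = set()
    for reactants, products in reactions:
        for s in set(reactants) | set(products):
            species.add(s)
            if products.get(s, 0) - reactants.get(s, 0) != 0:
                dynamic.add(s)
    return dynamic, species - dynamic


# Network of example 5.2.
EXAMPLE_5_2 = [
    ({"X": 1, "Y": 1, "E": 1}, {"Y": 2, "E": 1}),  # X + Y + E -> 2Y + E
    ({"Y": 1, "F": 1}, {"X": 1, "F": 1}),          # Y + F -> X + F
]
# Network (5.2): the same reactions plus Z + F -> Z.
NETWORK_5_2 = EXAMPLE_5_2 + [({"Z": 1, "F": 1}, {"Z": 1})]

sub_dyn, sub_enz = species_partition(EXAMPLE_5_2)
dyn, enz = species_partition(NETWORK_5_2)
sub_species = sub_dyn | sub_enz

print("subnetwork:", sorted(sub_dyn), sorted(sub_enz))
print("full network:", sorted(dyn), sorted(enz))
# Dynamic species of the subnetwork stay dynamic in the larger network,
# while enzymes of the larger network that appear in the subnetwork
# are enzymes of the subnetwork.
print(sub_dyn <= dyn, (enz & sub_species) <= sub_enz)
```

Adding the reaction that consumes F is what moves F from the enzymatic set to the dynamic set in this encoding.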

With the assumption of mass-action kinetics, and for any particular choice of reaction rate constants, a reaction network can be translated into a system of ODEs via (2.1). Given this mapping, it is natural to say that a dynamical property of the parametrized ODE system is a property of the underlying reaction network itself. We proceed by fixing some notation, which will allow us to translate definition 4.7 to the reaction network setting. Let $\mathcal{G} = (\mathcal{S}, \mathcal{C}, \mathcal{R})$ be a reaction network with $\mathcal{S} = \mathcal{S}_{\mathrm{dyn}} \mathbin{\dot\cup} \mathcal{S}_{\mathrm{enz}}$, and fix some (arbitrary) ordering of the dynamic species set $\mathcal{S}_{\mathrm{dyn}}$. Let $n := |\mathcal{S}_{\mathrm{dyn}}|$ and let $x(t) \in \mathbb{R}^n_{\ge 0}$ denote the vector of concentrations of the dynamic species with respect to the ordering. We also arbitrarily order the set of parameters, which includes the reaction rate constants and the initial concentrations of enzymes. With the ordering, the parameters can be identified with a vector $y \in \mathbb{R}^p_{>0}$, where $p := |\mathcal{R}| + |\mathcal{S}_{\mathrm{enz}}|$.

Definition 5.3. —

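Suppose that $\dot{x}(t) = f(y, x(t))$ with $x(t) \in \mathbb{R}^n_{\ge 0}$ and $y \in \mathbb{R}^p_{>0}$ is a parametrized dynamical system obtained by applying mass-action kinetics to $\mathcal{G} = (\mathcal{S}, \mathcal{C}, \mathcal{R})$ with $\mathcal{S} = \mathcal{S}_{\mathrm{dyn}} \mathbin{\dot\cup} \mathcal{S}_{\mathrm{enz}}$, $n := |\mathcal{S}_{\mathrm{dyn}}|$ and $p := |\mathcal{R}| + |\mathcal{S}_{\mathrm{enz}}|$. Suppose that for any choice of $x(0) \in \mathbb{R}^n_{\ge 0}$ and $y \in \mathbb{R}^p_{>0}$ the system $\dot{x}(t) = f(y, x(t))$ has a unique solution. We say that $\mathcal{G}$ is reliable, converges from infinity in finite time, or is exponentially reliable if the parametrized system has the respective property according to definition 4.7.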

5.2. Construction

We will give the construction of a reaction network that implements a neural network with the smoothed ReLU activation function. We will specifically design the network so that the ODE system has $f_i^\ell$ as given in example 4.2.

The construction will proceed in the following manner. First, we build an explicit reaction network implementation of only a single edge, as depicted in figure 5, of the neural network, and describe the resulting parametrized ODE system. Second, we build on the previous step by giving a reaction network implementation of a single fixed node $\tilde{X}$ in the neural network along with all of its inputs, and again describe the resulting parametrized ODE system. Finally, we describe the reaction network implementation of the entire neural network, which results in the parametrized ODE system in (4.1) and (4.2). For the sake of readability, we will limit the amount of enumeration used in our construction.

Figure 5.

A single edge in a neural network, with one input and one output node.

Step 1. The first step of the process, producing the reactions necessary for the implementation of a single edge, is carried out in table 1. The species sets for this particular reaction network are $\mathcal{S}_{\mathrm{dyn}} = \{X\}$ and $\mathcal{S}_{\mathrm{enz}} = \{H, W^+, W^-, B^+, B^-, X'\}$. The associated mass-action ODE system is one-dimensional, in the variable x, and if we assume all reactions occur with a rate constant of 1, is

$$\frac{d}{dt}x(t) = h + \big((b^+ - b^-) + (w^+ - w^-)\,x'\big)\,x(t) - (q - 1)\,x(t)^q. \tag{5.3}$$

We will assume from here on that q = 2 and note that it is easy to make the necessary changes in the description for a general value different from 2. Note that when q ≠ 2 the resulting activation function will be different from the smoothed ReLU.

Table 1.

Components of an elementary reaction network: chemical implementation of a single directed edge $(\tilde{X}', \tilde{X})$ along with nodes $\tilde{X}'$ and $\tilde{X}$ of the neural network. A neural network is naturally viewed as a disjoint union of its edges, which allows putting together a chemical implementation as an appropriate union of elementary reaction networks.

| Aspect of the neural network implemented chemically | Chemical implementation for the single directed edge $(\tilde{X}', \tilde{X})$ | Resulting term in the ODE for the species $X$ |
| --- | --- | --- |
| closeness to ReLU | $H \to H + X$ | $h$ |
| input $\tilde{X}'$ and weight of the edge $(\tilde{X}', \tilde{X})$ | $X' + W^+ + X \to X' + W^+ + 2X$ and $X' + W^- + X \to X' + W^-$ | $(w^+ - w^-)\,x'x$, where $w := w^+ - w^-$ implements the edge weight |
| additive node bias of $\tilde{X}$ | $B^+ + X \to B^+ + 2X$ and $B^- + X \to B^-$ | $(b^+ - b^-)\,x$, where $b := b^+ - b^-$ implements the node bias |
| q-polynomial decay, stability/convergence from ∞ | $qX \to X$ $(q > 1)$ | $-(q - 1)\,x^q$ |
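The passage from the reactions of table 1 to the ODE (5.3) can be reproduced by summing mass-action rates. The following sketch assumes unit rate constants, as in Step 1, and uses hypothetical species labels (`Xin` for the input species, `Wp`/`Wm` and `Bp`/`Bm` for the positive and negative weight and bias species); it evaluates the mass-action right-hand side for X and compares it with the closed form in (5.3).

```python
def mass_action_rate(reactants, conc, k=1.0):
    """Mass-action rate: k times the product of reactant concentrations."""
    rate = k
    for species, count in reactants.items():
        rate *= conc[species] ** count
    return rate


def mass_action_rhs(reactions, conc, target):
    """Time derivative of `target` under mass-action kinetics."""
    total = 0.0
    for reactants, products in reactions:
        net = products.get(target, 0) - reactants.get(target, 0)
        total += net * mass_action_rate(reactants, conc)
    return total


def single_edge_network(q=2):
    """Reactions of table 1 for one edge (labels are illustrative)."""
    return [
        ({"H": 1}, {"H": 1, "X": 1}),                                # H -> H + X
        ({"Xin": 1, "Wp": 1, "X": 1}, {"Xin": 1, "Wp": 1, "X": 2}),  # weight, positive part
        ({"Xin": 1, "Wm": 1, "X": 1}, {"Xin": 1, "Wm": 1}),          # weight, negative part
        ({"Bp": 1, "X": 1}, {"Bp": 1, "X": 2}),                      # bias, positive part
        ({"Bm": 1, "X": 1}, {"Bm": 1}),                              # bias, negative part
        ({"X": q}, {"X": 1}),                                        # qX -> X
    ]


def closed_form_rhs(c, q=2):
    """Right-hand side of (5.3) written out directly."""
    rho = (c["Bp"] - c["Bm"]) + (c["Wp"] - c["Wm"]) * c["Xin"]
    return c["H"] + rho * c["X"] - (q - 1) * c["X"] ** q


# Arbitrary illustrative concentrations.
conc = {"H": 1.0, "Wp": 0.7, "Wm": 0.2, "Bp": 0.3, "Bm": 0.5, "Xin": 2.0, "X": 1.5}
for q in (2, 3):
    print(q, mass_action_rhs(single_edge_network(q), conc, "X"), closed_form_rhs(conc, q))
```

Only X has a nonzero net change in any of these reactions, which is why the other six species act as enzymes for this elementary network.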

Step 2. For the second step, we implement via reaction network the neural network depicted in figure 6, which now simply consists of $\tilde{X}$ along with all its inputs. For this particular node, we assume there are c > 0 inputs. The construction proceeds by simply taking the union over the c edges $(\tilde{X}'_i, \tilde{X})$ of the reaction networks described in Step 1. After this union, we once again have that X is the only dynamic species and the mass-action ODE for its concentration is given by

$$\frac{d}{dt}x(t) = h + \Big((b^+ - b^-) + \sum_{i=1}^{c}\big(w^+_{x'_i,x} - w^-_{x'_i,x}\big)\,x'_i\Big)\,x(t) - x(t)^2 = h + \rho_x\,x(t) - x(t)^2, \tag{5.4}$$

where $\rho_x$ is defined by the equation above (and is analogous to $\rho_i^\ell$ from (4.4)). Note that the above corresponds with the equations in example 4.2.

Figure 6.

Step two: node $\tilde{X}$ along with all its c inputs.

Step 3. The third step is to construct the final network by taking the union of the construction described in the second step over all non-input nodes $\tilde{X}$. In terms of dynamical systems, this constitutes taking a union of the systems of ODEs given by (5.4), with the appropriate indices applied to the variables and parameters. The final system of equations appears in (4.1) and (4.2) and in example 4.2. The entire system is repeated here for convenience of the reader:

$$x_i^0 = d_i, \quad\text{for some fixed } d \in \mathbb{R}^{c_0}_{\ge 0},$$

and

$$\frac{d}{dt}x_i^\ell(t) = h + \Big(\sum_{j=1}^{c_{\ell-1}} W_{ij}^\ell\, x_j^{\ell-1}(t) + \beta_i^\ell\Big)\,x_i^\ell(t) - \big(x_i^\ell(t)\big)^2, \quad\text{for } \ell \in \{1, \dots, m\}.$$

Note that many species that were enzymatic in a particular network, for example the species associated with the terms $x'_i$ in the second step, are dynamic species in the final model.
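To see the full layered system in action, the sketch below integrates it for a very small network and compares the long-time concentrations with an ordinary feed-forward evaluation using the smoothed ReLU $\varphi(y) = (y + \sqrt{y^2 + 4h})/2$. Everything specific here is an assumption made for illustration: the layer sizes, the weights and biases, the initial concentrations, and the use of forward Euler with a small fixed step.

```python
import math


def smoothed_relu(y, h=1.0):
    return (y + math.sqrt(y * y + 4.0 * h)) / 2.0


def forward_pass(d, weights, biases, h=1.0):
    """Standard feed-forward evaluation; returns the activations of each non-input layer."""
    layers, prev = [], list(d)
    for W, beta in zip(weights, biases):
        prev = [smoothed_relu(sum(w * x for w, x in zip(row, prev)) + b, h)
                for row, b in zip(W, beta)]
        layers.append(prev)
    return layers


def ode_rhs(state, d, weights, biases, h=1.0):
    """Right-hand side for every non-input node; state[k] holds layer k+1."""
    out, prev = [], list(d)
    for W, beta, x in zip(weights, biases, state):
        rho = [sum(w * p for w, p in zip(row, prev)) + b for row, b in zip(W, beta)]
        out.append([h + r * xi - xi * xi for r, xi in zip(rho, x)])
        prev = x  # the next layer is driven by the current concentrations
    return out


def simulate(d, weights, biases, x0, h=1.0, t_end=30.0, dt=1e-3):
    """Forward Euler integration (an assumed, deliberately simple scheme)."""
    state = [list(layer) for layer in x0]
    for _ in range(int(round(t_end / dt))):
        deriv = ode_rhs(state, d, weights, biases, h)
        state = [[xi + dt * di for xi, di in zip(x, dx)]
                 for x, dx in zip(state, deriv)]
    return state


# Hypothetical network with 2 inputs, 3 hidden nodes and 2 output nodes.
D_EXAMPLE = [0.5, 1.0]
WEIGHTS = [[[0.4, -0.3], [-0.2, 0.5], [0.1, 0.1]],
           [[0.3, -0.5, 0.2], [-0.1, 0.4, 0.6]]]
BIASES = [[0.1, -0.4, 0.0], [0.2, -0.3]]
X0 = [[5.0, 0.0, 2.0], [1.0, 7.0]]

target = forward_pass(D_EXAMPLE, WEIGHTS, BIASES)
final = simulate(D_EXAMPLE, WEIGHTS, BIASES, X0)
for layer_target, layer_final in zip(target, final):
    print([round(v, 6) for v in layer_target], [round(v, 6) for v in layer_final])
```

The input layer is held fixed at `d`, while every hidden and output node is a dynamic variable, matching the role of the species in the final model.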

6. An example

In this section, we provide an example to visually demonstrate several aspects of our theory and our constructions. The focus of this paper was not on training a network; that will be the focus of our next work. Instead, in this paper we focused on the different qualitative properties of possible constructions, as detailed in definitions 4.7 and 5.3, and so this example will primarily share that focus. We will showcase how the limiting values of the ODE associated with a reaction network that implements the modified ReLU activation function, as detailed in §5, match precisely with the more standard implementation of the neural network via direct use of the activation function (3.2). Moreover, we will demonstrate the fast convergence of the ODE, a property we have proven to hold in proposition 4.9. Next, we will demonstrate the flexibility of the developed theory by chemically implementing a different activation function: one that grows like $\sqrt{y}$, as y → ∞ (as opposed to linear growth in the case of ReLU), and converges to 0, as y → −∞. This new implementation will still satisfy the conditions of theorem 4.8, and hence still enjoy the properties of definitions 4.7 and 5.3. Finally, we will explain how any activation function with growth of the form $y^{1/k}$, as y → ∞, for any integer k ≥ 1, can likewise be implemented chemically.

As it is a standard example in the field, we use the MNIST dataset of handwritten digits [40]. See figure 7 for four representative images from this dataset. These images have 784 = 28 × 28 pixels, and the task of the neural network is to take a greyscale image of such a hand-drawn digit, and correctly identify the digit. For example, we want the output to correctly identify the images in figure 7 as 8, 2, 6 and 7, respectively.

Figure 7.

Representative examples of hand-drawn images from the MNIST database [40].

The input to the neural network can be regarded as a single vector of size 784. Further, it is natural to choose the number of output nodes to be 10, with each node representing a different digit from the set {0, 1, …, 9}. To complete the specification of the structure of the neural network, we will, somewhat arbitrarily, choose to have a single hidden layer with 40 nodes. Therefore, our neural network has:

$$c_0 = 784, \quad c_1 = 40, \quad c_2 = 10.$$

As already mentioned, we will use the construction detailed in §5.2, yielding a smoothed ReLU (3.2) as our activation function, and we will choose h = 1 as our smoothing parameter.

We will now clearly specify how we implemented our neural network in Matlab. For the sake of reproducibility, we first set the seed of our random number generator by using the command ‘rng(1234)’. We used this seed for every computation we are reporting in this section. We then initialized our weights and biases randomly by using scaled Gaussians via the following commands:

W1 = (1/sqrt(c0))*randn(c1,c0);
W2 = (1/sqrt(c1))*randn(c2,c1);
beta1 = randn(c1,1);
beta2 = randn(c2,1);

In order to ensure that we use exactly the same random variables as does the reaction network implementation, we also defined an initial condition via the command

x00 = 10*rand([50,1]);

as that call is necessary in our reaction network implementation in which each of the hidden and output nodes is a dynamic variable and therefore uses an initial condition. While present in the code, this term is not used in the standard neural network implementation.

We used a quadratic cost function (3.7) in which the ‘truth’, denoted τ(d), was a vector with a 10 in the place of the true digit (i.e. if the digit represented by d is zero, then τ(d) has a 10 in the first component, if the digit represented by d is 1, then τ(d) has a 10 in the second component, etc.), and has ones in all other components. In order to implement gradient descent, we used a learning rate of η = 0.1 so that after each iteration of the neural network, we update our parameters via

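$$\beta^\ell \leftarrow \beta^\ell - \eta\,\nabla_{\beta^\ell}\mathrm{Cost}, \qquad W_{ij}^\ell \leftarrow W_{ij}^\ell - \eta\,\frac{\partial\,\mathrm{Cost}}{\partial W_{ij}^\ell},$$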

for appropriate ℓ, i and j. In order to estimate the derivatives above, we utilized stochastic gradient descent by using a batch of 300 randomly selected elements from the first 60 000 entries in the MNIST dataset. The specific call we used in our Matlab code was

Vals=randperm(60000,BatchSize);

where BatchSize had been set to 300. See figure 8 for (i) the estimate of the cost function and (ii) the number correctly predicted, out of the randomly chosen batch of 300, by the neural network over 1000 iterations of the learning process. Note that near the end of the 1000 iterations, the neural network is correctly identifying just over 95% of the digits. For the sake of comparison, in figure 9 we give similar plots for the standard ReLU activation function (i.e. taking h = 0). Now the neural network correctly identifies around 88% of the digits. The superiority of the smoothed version of the ReLU activation function was apparent in nearly all the seeds of the random number generator that we tried (data not shown). The precise reason for the superiority of the smoothed version of the ReLU activation function in the present setting is unclear to us, though perhaps the lack of a zero derivative for y < 0 is playing a role.
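The target vector, the parameter update and the batch selection described above can be sketched in Python. This is an illustrative translation, not the authors' Matlab code: the helper names `make_target`, `sgd_step` and `sample_batch` are hypothetical, the gradients are assumed to be supplied by the caller, and `random.sample` stands in for Matlab's `randperm(60000, BatchSize)` (with 0-based rather than 1-based indices).

```python
import random


def make_target(digit, n_classes=10, hit=10.0, miss=1.0):
    """Target vector: 10 in the slot of the true digit, ones elsewhere."""
    target = [miss] * n_classes
    target[digit] = hit
    return target


def sgd_step(params, grads, eta=0.1):
    """One gradient-descent update: p <- p - eta * dCost/dp for each parameter."""
    return [p - eta * g for p, g in zip(params, grads)]


def sample_batch(n_images=60000, batch_size=300, rng=random):
    """Draw distinct image indices uniformly at random."""
    return rng.sample(range(n_images), batch_size)


# Example usage with a seeded generator (the seed value mirrors the text,
# but the random streams of Python and Matlab differ).
batch = sample_batch(rng=random.Random(1234))
print(make_target(7))
print(sgd_step([1.0, 2.0], [0.5, -1.0]))
print(len(batch), len(set(batch)))
```

Because the two languages use different generators, this sketch reproduces the procedure rather than the specific numbers reported in the text.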

Figure 8.

Performance of the smoothed ReLU activation function with h = 1. (a) Estimate of the cost function over each iteration of the neural network (from 300 randomly selected elements from the MNIST dataset). (b) Total number of images from the 300 whose digits were correctly identified. For each image, the x-axis represents the iteration number of the learning process.

Figure 9.

Performance of the ReLU activation function (i.e. h = 0). (a) Estimate of the cost function over each iteration of the neural network (from 300 randomly selected elements from the MNIST dataset). (b) Total number of images from the 300 whose digits were correctly identified. For each image, the x-axis represents the iteration number of the learning process.

We now demonstrate the learning of the reaction network in a different manner: by visualizing the output trajectories of a subset of the nodes on a particular image from the database, but after a different number of iterations of the learning process. We arbitrarily chose the 30th image in the database, which is the 7 presented as the right-most image in figure 7. Note that since the image is that of a 7, we hope and expect that the equilibrium value associated with the 8th output node of our system will eventually converge towards 10, whereas the values of the other output nodes will converge towards 1. See figure 10 for trajectories of output nodes 1, 2, 6 and 8 (associated with the digits 0, 1, 5 and 7), and hidden nodes 1 and 32. As expected, the equilibrium value associated with the 8th output node does indeed separate from the others and moves towards 10, as the number of iterations increases, whereas the other output nodes remain near the value 1. Also of interest is that the equilibrium value associated with output node 2, which is associated with the digit 1, converges towards 1 slower than do the other output nodes. We assume this is because the digit, which is a seven, has characteristics similar to the digit 1. For the purposes of this particular calculation, our initial condition for all 50 nodes was chosen to be equal to one.

Figure 10.

Plots of trajectories from the ODEs associated with our reaction network construction after different numbers of iterations. Note how the equilibrium of the 8th output node, which is associated with the correct digit of 7, seems to be converging towards 10 as the number of iterations increases, whereas the equilibria associated with the other output nodes converge towards 1.

As mentioned above, the fact that a neural network using a ReLU activation function (smoothed or not) can be trained to identify the hand-drawn digits from the MNIST dataset is not the point of this paper, and is very well known. Instead, we now focus on the behaviour of the ODE associated with the reaction network implementation. Using the same set-up as detailed above (including the randomized initial conditions), but with both the BatchSize and the number of iterations set to 1, we may output the values of the activations a for the neural network with the smoothed ReLU activation function with h = 1. There are a total of 50 terms (one for each node) in these vectors, which is too many to visualize. We therefore arbitrarily selected the first and third nodes from the output layer and the first and 32nd nodes from the hidden layer. The resulting values are

$$a_1^2 = 1.54691661955827, \quad a_3^2 = 0.885658219979311, \quad a_1^1 = 1.06208572398989, \quad a_{32}^1 = 0.72187173499449.$$

Next, we solved the system of 50 ODEs associated with our reaction network construction, using the randomized initial conditions detailed above and exactly the same random variables as in the standard neural network implementation. Representative plots for the chosen four nodes are given in figure 11. We simulated until time 5 and found

$$x_1^2(5) = 1.54703516476441, \quad x_3^2(5) = 0.885627963885228, \quad x_1^1(5) = 1.06216260026126, \quad x_{32}^1(5) = 0.721901309123957.$$

As our theory guaranteed, the values match those of the standard neural network very well: the four reported values agree to within about $1.2 \times 10^{-4}$.

Figure 11.

Representative plots with ‘regularly sized’ initial conditions. We chose to visualize nodes 1 and 3 from the output layer and nodes 1 and 32 from the hidden layer.

Of course, it is impossible to ‘demonstrate’ convergence from infinity. Instead, we simply modified the initial conditions to

x00 = 1000*rand([c1,1]);

and performed the same ODE computations as detailed above. See figure 12 for plots of the solutions. The final values were

$$x_1^2(5) = 1.54686939850725, \quad x_3^2(5) = 0.885624161108907, \quad x_1^1(5) = 1.06213690351638, \quad x_{32}^1(5) = 0.721896022634959,$$

which, again, match the values from the previous iterations.

Figure 12.

These are plots of the same trajectories. Note the scales on both the x-axes and the y-axes. On the left we see the fast convergence down ‘from infinity’ to values in the single digits. On the right, we see exponential convergence to the limiting values.

6.1. Modifying the activation function

We slightly modify the reaction network construction of §5 by using q = 3 instead of q = 2 in the final row of table 1. Thus, for each of the hidden and output nodes we simply change the reaction network so that it includes the reaction $3X \to X$ instead of $2X \to X$. This change modifies the ODE for a particular node (hidden or output) to be

$$\dot{x} = h + \rho x - 2x^3, \tag{6.1}$$

with h > 0 and ρ as before. Note that the above is the analogue of $f_i^\ell$ in (4.3), and we are suppressing subscripts and superscripts for the sake of clarity (as we have done at times throughout the paper).

By Descartes’ rule of signs, for each particular choice of h and ρ the system (6.1) has precisely one positive fixed point. As can be shown by standard methods, this fixed point is stable. Fixing h > 0, the activation function for the resulting system is found by solving for the unique positive fixed point as a function of ρ. See figure 13 for a plot of this activation function when h = 1. We see that the function is monotonic, grows like $\sqrt{y/2}$, as y → ∞, and converges to zero, as y → −∞. Finally, it can be shown by similar arguments as in the proof of proposition 4.9 that the resulting chemical system implements, in the sense of definition 4.1, the hardwired feed-forward neural network (G,P,φ) where φ is given in figure 13, and that the system converges from infinity in finite time (due to it having 3-polynomial decay) and is exponentially reliable.

Figure 13.

Activation function φ implemented by the reaction network from table 1 with q = 3 and h = 1. This function is defined as the map between y and the unique positive fixed point of the polynomial $1 + yx - 2x^3$. A plot of $\sqrt{y/2}$ is added for the sake of comparison, which would be the corresponding activation function if h were taken to be zero while q = 3.

In order to implement this chemical system, we solved the associated ODEs, as detailed above. However, in the case when q = 2 we had a nice analytic formula for the activation function, φ, given by (3.2), which we could easily differentiate to find φ′(z), and plug that expression into the relevant terms in (3.8) for the purposes of gradient descent. In this case, we are not so fortunate. However, this derivative can be calculated in a straightforward manner. For a fixed value of z, we may denote the unique positive fixed point of (6.1), with z = ρ, via φ(z), in which case we have that φ(z) is defined implicitly via

$$0 = h + z\,\varphi(z) - 2\,\varphi(z)^3.$$

Differentiating with respect to z and solving yields

$$\varphi'(z) = \frac{\varphi(z)}{-z + 3 \cdot 2\,\varphi(z)^2}.$$

As φ(z) is the output from the ODE solver, we also get the derivative in a straightforward manner.
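The implicit derivative can be checked against a finite difference. In the sketch below the fixed point is found by bisection rather than by an ODE solver (a substitution made here for brevity), for the fixed-point equation $0 = h + z\varphi - (q-1)\varphi^q$ with q = 3 by default; the function names and tolerances are illustrative assumptions.

```python
import math


def activation(z, q=3, h=1.0, tol=1e-14):
    """Unique positive root of g(x) = h + z*x - (q-1)*x**q, found by bisection."""
    def g(x):
        return h + z * x - (q - 1) * x ** q

    lo, hi = 0.0, 1.0
    while g(hi) > 0.0:          # g(0) = h > 0, so expand until the sign changes
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def activation_derivative(z, q=3, h=1.0):
    """Derivative from implicit differentiation of the fixed-point equation."""
    phi = activation(z, q, h)
    return phi / (q * (q - 1) * phi ** (q - 1) - z)


# Compare with a central finite difference at a few points.
eps = 1e-5
for z in (-2.0, 0.0, 1.5, 4.0):
    fd = (activation(z + eps) - activation(z - eps)) / (2.0 * eps)
    print(z, activation(z), activation_derivative(z), fd)
```

With q = 2 the same routine returns the smoothed ReLU $(z + \sqrt{z^2 + 4h})/2$, which gives a quick consistency check on the root finder.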

With all the details in place, we can run the system and implement the neural network via our new chemical reaction network. In figure 14, we provide plots of the estimated cost and the number of images correctly identified, out of a batch of 300, using this new chemical system and activation function when all other variables (i.e. numbers of layers, hidden nodes, seed of the random number generator, etc.) are kept the same as above. We note that this activation function performs similarly to the ReLU activation function (figure 9).

Figure 14.

Performance of the new activation function (when q = 3). (a) Estimate of the cost function over each iteration of the neural network (from 300 randomly selected elements from the MNIST dataset). (b) Total number of images from the 300 whose digits were correctly identified. For each image, the x-axis represents the iteration number of the learning process.

Finally, we note that we could also select q to be any integer greater than 3, and a similar analysis can be carried out. In particular, when q is an integer greater than or equal to 2, we get an activation function that grows like $y^{1/(q-1)}$. Moreover, the derivative can be calculated as above, and found to satisfy

$$\varphi'(z) = \frac{\varphi(z)}{-z + q(q-1)\,\varphi(z)^{q-1}}.$$

These systems with different activation functions could be useful in different settings.
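For completeness, the derivative formula and the growth rate for general q can be derived together. The worked derivation below is a sketch that assumes the node ODE takes the form $\dot{x} = h + \rho x - (q-1)x^q$, as in (5.3) with $\rho$ collecting the weights and biases; $\varphi(z)$ denotes the unique positive fixed point when $\rho = z$.

```latex
\begin{align*}
0 &= h + z\,\varphi(z) - (q-1)\,\varphi(z)^{q}
  && \text{(fixed-point equation)}\\
0 &= \varphi(z) + z\,\varphi'(z) - q(q-1)\,\varphi(z)^{q-1}\,\varphi'(z)
  && \text{(differentiate with respect to } z\text{)}\\
\varphi'(z) &= \frac{\varphi(z)}{q(q-1)\,\varphi(z)^{q-1} - z}
  && \text{(solve for } \varphi'(z)\text{)}\\
z &= (q-1)\,\varphi(z)^{q-1} - \frac{h}{\varphi(z)}
  && \text{(fixed-point equation divided by } \varphi(z)\text{)}\\
\varphi(z) &\approx \Bigl(\frac{z}{q-1}\Bigr)^{1/(q-1)}
  && \text{(as } z \to \infty\text{, where } h/\varphi(z) \to 0\text{)}
\end{align*}
```

Substituting the fourth line into the denominator of the third gives $(q-1)^2\varphi(z)^{q-1} + h/\varphi(z) > 0$, so the derivative is well defined and positive. For q = 2 the last line gives linear growth, and for q = 3 it gives $\sqrt{z/2}$, in agreement with figure 13.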

Acknowledgements

We thank Erik Winfree for helpful comments with an early draft of this paper.

Appendix A

Proposition A.1. —

Consider the ODE $\dot{u} = -c u^q$ where c > 0, $u \in \mathbb{R}_{\ge 0}$, and $q \in \mathbb{R}$. If q > 1, then u(t) ≤ 1 for any $t > ((q-1)c)^{-1}$ and any $u(0) = u_0 \in \mathbb{R}_{\ge 0}$.

Proof. —

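For q > 1, the function $-c u^q$ is locally Lipschitz for $u \in \mathbb{R}_{\ge 0}$ and so the initial value problem with $u(0) = u_0 \in \mathbb{R}_{\ge 0}$ has a unique solution which can be found by separation of variables:

$$u(t) = \frac{1}{\big((q-1)\,c\,t + u_0^{-(q-1)}\big)^{1/(q-1)}}$$

for all $t \in \mathbb{R}_{\ge 0}$. Clearly, $u(t) \to 0$ as $t \to \infty$, monotonically, for any $u_0 \in \mathbb{R}_{\ge 0}$. It suffices to assume that $u_0 > 1$. Define $t_1$ to be the time for which $u(t_1) = 1$. Then, since $u_0 > 1$, we have $(q-1)\,c\,t_1 = 1 - u_0^{1-q} \in (0, 1)$, implying

$$t_1 = \frac{1}{(q-1)c}\big(1 - u_0^{1-q}\big) \in \Big(0, \frac{1}{(q-1)c}\Big).$$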

Noting the monotonicity of u(t) now finishes the proof. ▪
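The bound in proposition A.1 is uniform in the initial condition, which is easy to see numerically from the closed-form solution. The following sketch evaluates that solution at the time $1/((q-1)c)$ for a range of initial values; the function names and the chosen parameter values are illustrative assumptions.

```python
def decay_solution(t, u0, c=1.0, q=2.0):
    """Closed-form solution of u' = -c*u**q with u(0) = u0 > 0 and q > 1."""
    return ((q - 1.0) * c * t + u0 ** (-(q - 1.0))) ** (-1.0 / (q - 1.0))


def time_bound(c=1.0, q=2.0):
    """Time after which u(t) <= 1 holds for every initial condition."""
    return 1.0 / ((q - 1.0) * c)


# However large u0 is, the solution is at most 1 at the bounding time.
for q in (1.5, 2.0, 3.0):
    for c in (0.5, 2.0):
        for u0 in (2.0, 1e3, 1e9):
            print(q, c, u0, decay_solution(time_bound(c, q), u0, c, q))
```

This is the mechanism behind the phrase "converges from infinity in finite time": the q-polynomial decay term brings any initial concentration below a fixed level within a time that does not depend on where it started.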

We provide a version of Grönwall’s inequality [41].

Lemma A.2 (Grönwall’s inequality). —

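Consider the interval $I = [t_0, t]$. Let $\alpha : I \to \mathbb{R}$ and $\beta : I \to \mathbb{R}$ be continuous functions. Let $V : I \to \mathbb{R}$ be a continuously differentiable function satisfying

$$\frac{d}{dt}V(t) \le \alpha(t)\,V(t) + \beta(t) \quad\text{for } t \in I. \tag{A 1}$$

Let $V(t_0) = V_0$. Then,

$$V(t) \le V_0 \exp\Big(\int_{t_0}^{t} \alpha(s)\,ds\Big) + \int_{t_0}^{t} \exp\Big(\int_{s}^{t} \alpha(r)\,dr\Big)\beta(s)\,ds \quad\text{for } t \in I. \tag{A 2}$$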

Data accessibility

All code used to produce the images in this paper is available upon request.

Authors' contributions

This work was first conceptualized by Anderson in 2017 and 2018. Anderson and Joshi began the development of both the theory and the constructions during Joshi’s sabbatical visit to the University of Wisconsin-Madison in the spring of 2019. Anderson, Joshi and Deshpande finished the theory and writing during the spring and summer of 2020.

Competing interests

We declare we have no competing interests.

Funding

D.F.A. gratefully acknowledges support via the Army Research Office through grant no. W911NF18-1-0324, and via the William F. Vilas Trust Estate.

References

  • 1. Buisman H, ten Eikelder H, Hilbers P, Liekens A. 2009. Computing algebraic functions with biochemical reaction networks. Artif. Life 15, 5-19. (doi:10.1162/artl.2009.15.1.15101)
  • 2. Cappelletti D, Ortiz-Muñoz A, Anderson DF, Winfree E. 2020. Stochastic chemical reaction networks for robustly approximating arbitrary probability distributions. Theor. Comput. Sci. 801, 64-95. (doi:10.1016/j.tcs.2019.08.013)
  • 3. Condon A, Hajiaghayi M, Kirkpatrick D, Maňuch J. 2020. Approximate majority analyses using tri-molecular chemical reaction networks. Natural Comput. 19, 249-270. (doi:10.1007/s11047-019-09756-4)
  • 4. Cummings R, Doty D, Soloveichik D. 2016. Probability 1 computation with chemical reaction networks. Natural Comput. 15, 245-261. (doi:10.1007/s11047-015-9501-x)
  • 5. Doty D, Soloveichik D. 2018. Stable leader election in population protocols requires linear time. Distributed Comput. 31, 257-271. (doi:10.1007/s00446-016-0281-z)
  • 6. Gopalkrishnan M. 2016. A scheme for molecular computation of maximum likelihood estimators for log-linear models. In Int. Conf. on DNA-Based Computers, pp. 3–18. Berlin, Germany: Springer.
  • 7. Napp N, Adams R. 2013. Message passing inference with chemical reaction networks. Adv. Neural Inf. Process. Syst. 26, 2247-2255.
  • 8. Poole W, Ortiz-Munoz A, Behera A, Jones N, Ouldridge TE, Winfree E, Gopalkrishnan M. 2017. Chemical Boltzmann machines. In Int. Conf. on DNA-Based Computers, pp. 210–231. Berlin, Germany: Springer.
  • 9. Qian L, Soloveichik D, Winfree E. 2011. Efficient Turing-universal computation with DNA polymers. DNA Comput. Mol. Programming 16, 123-140. (doi:10.1007/978-3-642-18305-8_12)
  • 10. Singh A, Wiuf C, Behera A, Gopalkrishnan M. 2019. A reaction network scheme which implements inference and learning for hidden Markov models. In Int. Conf. on DNA Computing and Molecular Programming, pp. 54–79. Berlin, Germany: Springer.
  • 11. Soloveichik D, Seelig G, Winfree E. 2010. DNA as a universal substrate for chemical kinetics. Proc. Natl Acad. Sci. USA 107, 5393-5398. (doi:10.1073/pnas.0909380107)
  • 12. Virinchi V, Behera A, Gopalkrishnan M. 2017. A stochastic molecular scheme for an artificial cell to infer its environment from partial observations. In Int. Conf. on DNA-Based Computers, pp. 82–97. Berlin, Germany: Springer.
  • 13. Virinchi V, Behera A, Gopalkrishnan M. 2018. A reaction network scheme which implements the EM algorithm. In Int. Conf. on DNA Computing and Molecular Programming, pp. 189-207. Berlin, Germany: Springer.
  • 14. Blount D, Banda P, Teuscher C, Stefanovic D. 2017. Feedforward chemical neural network: an in silico chemical system that learns XOR. Artif. Life 23, 295-317. (doi:10.1162/ARTL_a_00233)
  • 15. Chiang K, Jiang J, Fages F. 2015. Reconfigurable neuromorphic computation in biochemical systems. In 2015 37th Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 25–29 August 2015, pp. 937–940. (doi:10.1109/EMBC.2015.7318517)
  • 16. Hjelmfelt A, Weinberger E, Ross J. 1991. Chemical implementation of neural networks and Turing machines. Proc. Natl Acad. Sci. USA 88, 10 983-10 987. (doi:10.1073/pnas.88.24.10983)
  • 17. Hopfield JJ. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl Acad. Sci. USA 81, 3088-3092. (doi:10.1073/pnas.81.10.3088)
  • 18. Kim J, Hopfield J, Winfree E. 2004. Neural network computation by in vitro transcriptional circuits. Adv. Neural Inf. Process. Syst. 17, 681-688.
  • 19. Moorman A, Samaniego CC, Maley C, Weiss R. 2019. A dynamical biomolecular neural network. In 2019 IEEE 58th Conf. on Decision and Control (CDC), Nice, France, 11–13 December 2019, pp. 1797–1802. (doi:10.1109/CDC40024.2019.9030122)
  • 20. Vasic M, Chalk C, Khurshid S, Soloveichik D. 2020. Deep molecular programming: a natural implementation of binary-weight ReLU neural networks. (https://arxiv.org/pdf/2003.13720.pdf)
  • 21. Benenson Y. 2012. Biomolecular computing systems: principles, progress and potential. Nat. Rev. Genet. 13, 455-468. (doi:10.1038/nrg3197)
  • 22. Bray D. 1990. Intracellular signalling as a parallel distributed process. J. Theor. Biol. 143, 215-231. (doi:10.1016/S0022-5193(05)80268-1)
  • 23. Buchler N, Gerland U, Hwa T. 2003. On schemes of combinatorial transcription logic. Proc. Natl Acad. Sci. USA 100, 5136-5141. (doi:10.1073/pnas.0930314100)
  • 24. Mestl T, Lemay C, Glass L. 1996. Chaos in high-dimensional neural and gene networks. Physica D 98, 33-52. (doi:10.1016/0167-2789(96)00086-3)
  • 25. Mjolsness E, Sharp DH, Reinitz J. 1991. A connectionist model of development. J. Theor. Biol. 152, 429-453. (doi:10.1016/S0022-5193(05)80391-1)
  • 26. Rössler OE. 1974. Chemical automata in homogeneous and reaction-diffusion kinetics. In Physics and mathematics of the nervous system (eds M Conrad, W Güttinger, M Dal Cin). Lecture Notes in Biomathematics, vol. 4, pp. 399–418. Berlin, Germany: Springer. (doi:10.1007/978-3-642-80885-2_23)
  • 27. Rössler OE. 1974. A synthetic approach to exotic kinetics (with examples). In Physics and mathematics of the nervous system (eds M Conrad, W Güttinger, M Dal Cin). Lecture Notes in Biomathematics, vol. 4, pp. 546–582. Berlin, Germany: Springer. (doi:10.1007/978-3-642-80885-2_34)
  • 28. Sugita M, Fukuda N. 1963. Functional analysis of chemical systems in vivo using a logical circuit equivalent: III. Analysis using a digital circuit combined with an analogue computer. J. Theor. Biol. 5, 412-425. (doi:10.1016/0022-5193(63)90087-0)
  • 29. Vohradsky J. 2001. Neural network model of gene expression. FASEB J. 15, 846-854. (doi:10.1096/fj.00-0361com)
  • 30. Di Cera E, Phillipson PE, Wyman J. 1989. Limit-cycle oscillations and chaos in reaction networks subject to conservation of mass. Proc. Natl Acad. Sci. USA 86, 142-146. (doi:10.1073/pnas.86.1.142)
  • 31. Cherry K, Qian L. 2018. Scaling up molecular pattern recognition with DNA-based winner-take-all neural networks. Nature 559, 370-376. (doi:10.1038/s41586-018-0289-6)
  • 32. Qian L, Winfree E, Bruck J. 2011. Neural network computation with DNA strand displacement cascades. Nature 475, 368-372. (doi:10.1038/nature10262)
  • 33. Bishop C. 1995. Neural networks for pattern recognition. Oxford, UK: Oxford University Press.
  • 34. Mitchell T. 1997. Machine learning, vol. 45, pp. 870-877. Burr Ridge, IL: McGraw Hill.
  • 35. Murphy K. 2012. Machine learning: a probabilistic perspective. Cambridge, MA: MIT Press.
  • 36. Nielsen MA. 2015. Neural networks and deep learning. San Francisco, CA: Determination Press.
  • 37. Shalev-Shwartz S, Ben-David S. 2014. Understanding machine learning: from theory to algorithms. Cambridge, UK: Cambridge University Press.
  • 38. Cybenko G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303-314. (doi:10.1007/BF02551274)
  • 39. Hornik K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Netw. 4, 251-257. (doi:10.1016/0893-6080(91)90009-T)
  • 40. LeCun Y. 1998. The MNIST database of handwritten digits. See http://yann.lecun.com/exdb/mnist/.
  • 41. Bellman R. 1943. The stability of solutions of linear differential equations. Duke Math. J. 10, 643-647. (doi:10.1215/S0012-7094-43-01059-2)

