[Preprint]. 2023 Jun 16:arXiv:2305.10699v2. Originally published 2023 May 18. [Version 2]

Dirichlet Diffusion Score Model for Biological Sequence Generation

Pavel Avdeyev 1, Chenlai Shi 1, Yuhao Tan 1, Kseniia Dudnyk 1, Jian Zhou 1
PMCID: PMC10246113  PMID: 37292476

Abstract

Designing biological sequences is an important challenge that requires satisfying complex constraints and thus is a natural problem to address with deep generative modeling. Diffusion generative models have achieved considerable success in many applications. The score-based generative stochastic differential equation (SDE) model is a continuous-time diffusion model framework that enjoys many benefits, but the originally proposed SDEs are not naturally designed for modeling discrete data. To develop generative SDE models for discrete data such as biological sequences, here we introduce a diffusion process defined in the probability simplex space whose stationary distribution is the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We refer to this approach as the Dirichlet diffusion score model. We demonstrate that this technique can generate samples that satisfy hard constraints using a Sudoku generation task. This generative model can also solve Sudoku, including hard puzzles, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that designed sequences share similar properties with natural promoter sequences.

1. Introduction

Diffusion probabilistic models are a family of models that reverse a diffusion process to generate data from noise. The score-based generative stochastic differential equation (SDE) is a type of continuous-time diffusion model that has many desirable properties, such as allowing likelihood evaluation through a connection to a probability flow ordinary differential equation (ODE) and flexibility in sampling approaches. However, the originally proposed generative SDEs are not directly suitable for modeling discrete data. Recent works have proposed methods for adapting diffusion models to discrete data (Campbell et al., 2022; Chen et al., 2022b; Sun et al., 2022; Austin et al., 2021; Hoogeboom et al., 2021b;a), including continuous-time diffusion in discrete space (Campbell et al., 2022), but no methods are formulated within the continuous-time SDE diffusion framework (Song et al., 2020), except for quantization-based methods. In this manuscript, we propose a general mechanism to extend this approach to discrete data, while allowing continuous-time diffusion in probability simplex space. Specifically, to exploit the natural connection between the Dirichlet distribution and discrete data, we consider continuous-time diffusion within the probability simplex whose stationary distribution is the Dirichlet distribution. Forward diffusion (data-to-noise) of discrete data starts from the vertices of the probability simplex and diffuses continuously within the same space; the continuous-discrete gap can be bridged with a latent variable interpretation.

While our intended application is in biological sequence generation, we evaluated our Dirichlet diffusion score model (DDSM)1 on a range of discrete data generation tasks to better understand its performance. In addition to demonstrating competitive performance on a small benchmark dataset, binarized MNIST, we applied it to generating Sudoku puzzles to test its ability to generate highly structured data with strong constraints. The model can not only generate but also solve Sudoku puzzles, including hard puzzles, which is the first time this has been achieved with a purely generative modeling approach. Finally, we applied DDSM to a real-world application in biological sequence generation. Specifically, we developed the first model for designing human promoter DNA sequences that drive gene expression, and demonstrate that it designs diverse sequences comparable to human genome promoter sequences.

2. Background

2.1. Score-based Generative Modeling with SDE

An Itô diffusion process is defined as

dx = f(x, t)\,dt + G(x, t)\,dw,

where w is the standard Wiener process (a.k.a. Brownian motion), f(\cdot, t): \mathbb{R}^n \to \mathbb{R}^n is the drift coefficient, and G(\cdot, t): \mathbb{R}^n \to \mathbb{R}^{n \times n} is the diffusion coefficient. Song et al. 2020 exploited the remarkable result by Anderson 1982 that the time-reversal of this diffusion process can be obtained by the following SDE:

dx = \left\{ f(x, t) - \nabla \cdot [G(x, t) G(x, t)^\top] - G(x, t) G(x, t)^\top \nabla_x \log p_t(x) \right\} dt + G(x, t)\,d\bar{w}, \quad (1)

where \nabla \cdot [G(x, t) G(x, t)^\top] denotes the vector whose i-th element is the row-sum of element-wise derivatives with respect to x, i.e., \sum_j \partial_{x_j} [G(x, t) G(x, t)^\top]_{ij}. The corresponding probability flow ODE is defined as

dx = \left\{ f(x, t) - \frac{1}{2} \nabla \cdot [G(x, t) G(x, t)^\top] - \frac{1}{2} G(x, t) G(x, t)^\top \nabla_x \log p_t(x) \right\} dt. \quad (2)

It gives the same distribution at time t as the reverse-time SDE. Both the reverse-time SDE and the probability flow ODE can be sampled from given \nabla_x \log p_t(x), the score of p_t(x). Therefore, learning the reverse diffusion becomes the problem of learning the score function, which is usually parameterized as a neural network known as the score model. The training loss is the score matching loss

\int_0^T \mathbb{E}_{p_t(x_t)} \left[ \lambda(t) \left\| \nabla_x \log p_t(x_t) - s_\theta(x_t, t) \right\|_2^2 \right] dt, \quad (3)

where s_\theta(x, t) is the score model to be trained, and \lambda(t) is a positive weighting function. The loss is equivalent to the denoising score matching loss

\int_0^T \mathbb{E}_{p_0(x_0)\, p(x_t \mid x_0)} \left[ \lambda(t) \left\| \nabla_{x_t} \log p(x_t \mid x_0) - s_\theta(x_t, t) \right\|_2^2 \right] dt, \quad (4)

which is used in practice because \nabla_{x_t} \log p(x_t \mid x_0) is usually easier to compute. Forward diffusion processes considered so far have Gaussian stationary distributions and are applicable to continuous data in \mathbb{R}^n.
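
To make the objective concrete, the following is a minimal sketch (not the authors' code) of a Monte Carlo estimate of the denoising score matching loss in Eq. 4. Here `sample_forward`, `transition_score`, and `weight_fn` are hypothetical helpers standing in for the forward-diffusion sampler, the analytic transition score, and the weighting function \lambda(t).

```python
# A minimal sketch of a Monte Carlo estimate of the denoising score
# matching loss (Eq. 4); helper functions are hypothetical placeholders.
import torch

def denoising_score_matching_loss(score_model, x0, sample_forward,
                                  transition_score, weight_fn, T=1.0):
    # Draw a random diffusion time t in (0, T] for each example.
    t = torch.rand(x0.shape[0], device=x0.device) * T
    # Sample from the forward transition density p(x_t | x_0).
    xt = sample_forward(x0, t)
    # Analytic score of the transition density (the regression target).
    target = transition_score(xt, x0, t)
    # Predicted score from the neural network.
    pred = score_model(xt, t)
    # Weighted squared error, averaged over the batch.
    sq_err = ((pred - target) ** 2).flatten(1).sum(dim=1)
    return (weight_fn(t) * sq_err).mean()
```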

2.2. Univariate Jacobi Diffusion Process

We consider the Jacobi diffusion process2 in the following form

dx = \frac{s}{2} \left[ a(1 - x) - b x \right] dt + \sqrt{s\, x (1 - x)}\, dw, \quad (5)

where 0 ≤ x ≤ 1, s > 0 is the speed factor, and a > 0, b > 0 determine the stationary distribution Beta(a, b). We usually use s = 1 or s = 2/(a + b) (see Appendix A.2 for discussion of the choices). Note that when x approaches 0 or 1, the diffusion coefficient converges to 0 and the drift coefficient converges to sa/2 > 0 or −sb/2 < 0, keeping the diffusion within [0, 1].
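
As an illustration of Eq. 5, here is a minimal sketch (not the authors' implementation) of simulating the univariate Jacobi diffusion with an Euler-Maruyama scheme; the clipping step and the choice t_end = 4.0 are our own illustrative assumptions.

```python
# A minimal sketch of simulating the univariate Jacobi diffusion (Eq. 5)
# with an Euler-Maruyama scheme.
import numpy as np

def simulate_jacobi(x0, a, b, s=1.0, t_end=1.0, n_steps=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    dt = t_end / n_steps
    for _ in range(n_steps):
        drift = 0.5 * s * (a * (1.0 - x) - b * x)
        diff = np.sqrt(np.clip(s * x * (1.0 - x), 0.0, None))
        x = x + drift * dt + diff * np.sqrt(dt) * rng.standard_normal(x.shape)
        x = np.clip(x, 0.0, 1.0)  # numerical safeguard: EM steps can overshoot [0, 1]
    return x

# Starting from the discrete value 1, the marginal approaches Beta(a, b) as t grows.
samples = simulate_jacobi(np.ones(10000), a=1.0, b=1.0, t_end=4.0)
```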

The spectral expansion of the transition density function was derived by Kimura 1955; 1957. Hence, the diffused density at any time t is computed by the following formula:

p_{a,b}(x_t \mid x_0) = \mathcal{B}_{a,b}(x_t) \sum_{n=0}^{\infty} e^{-\lambda_n t}\, d_n\, R_n^{(a,b)}(x_0)\, R_n^{(a,b)}(x_t) = \mathcal{B}_{a,b}(x_t) \left( 1 + \sum_{n=1}^{\infty} e^{-\lambda_n t}\, d_n\, R_n^{(a,b)}(x_0)\, R_n^{(a,b)}(x_t) \right), \quad (6)

where \mathcal{B}_{a,b}(x_t) is the Beta(a, b) density, R_n^{(a,b)}(x) denotes the n-th order modified Jacobi polynomial, and d_n are the associated normalization constants. The modified Jacobi polynomials are eigenfunctions of the generator of the Jacobi diffusion process (Steinrücken et al., 2013a; Griffiths & Spano', 2010), with corresponding eigenvalues \lambda_n = \frac{1}{2} s\, n (n - 1 + a + b). The gradient of the log transition density function can be computed via automatic differentiation.

3. Diffusion Processes for Generative SDE Modeling of Discrete Data

3.1. Forward Diffusion SDE for Two-Category Data

Using the univariate Jacobi diffusion as the forward diffusion process provides a natural generalization of the score-based generative SDE approach (Section 2.1) to discrete data with two categories, encoded as 0 and 1. The forward diffusion starting from 0 or 1 at the initial timepoint will continuously diffuse in the [0, 1] interval and converge to a Beta stationary distribution (see Fig. 1a). If a = 1 and b = 1 in Eq. 5 and 6, then the stationary distribution is Beta(1, 1) (i.e., the uniform distribution on the interval [0, 1]).

Figure 1. Schematic overview of forward and reverse diffusion SDEs for the Dirichlet diffusion score model. Forward and reverse diffusion SDEs for 2-category data (a) and k-category data (b) by stick-breaking construction are shown.

The score-based generative SDE model can be trained via the denoising score matching objective (see Eq. 4) following the transition density formula (see Eq. 6). By combining equations 1 and 5, we have the following reverse-time SDE:

dx = \left\{ \frac{s}{2} \left[ a(1 - x) - b x \right] - s (1 - 2x) - s\, x (1 - x)\, \nabla_x \log p_t(x) \right\} dt + \sqrt{s\, x (1 - x)}\, d\bar{w}. \quad (7)

Replacing \nabla_x \log p_t(x) with the score model s_\theta(x_t, t) allows sampling from the trained model via reverse diffusion.
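
A minimal sketch of such reverse-diffusion sampling for the two-category case, discretizing Eq. 7 with Euler-Maruyama from t = T down to t = 0, is shown below; `score_model` stands for the trained network, and T = 4.0, the step count, and the clamping to [0, 1] are illustrative assumptions rather than the authors' settings.

```python
# A minimal sketch of reverse-diffusion sampling for 2-category data (Eq. 7).
import torch

@torch.no_grad()
def reverse_sample(score_model, n, a=1.0, b=1.0, s=1.0, T=4.0, n_steps=1000):
    # Initialize from the stationary Beta(a, b) distribution.
    x = torch.distributions.Beta(a, b).sample((n,))
    dt = T / n_steps
    for i in range(n_steps):
        t = torch.full((n,), T - i * dt)
        g2 = s * x * (1.0 - x)                       # squared diffusion coefficient
        drift = (0.5 * s * (a * (1.0 - x) - b * x)   # forward drift
                 - s * (1.0 - 2.0 * x)               # divergence correction
                 - g2 * score_model(x, t))           # score term
        # One Euler-Maruyama step backward in time.
        x = x - drift * dt + torch.sqrt(g2.clamp(min=0.0)) * dt ** 0.5 * torch.randn_like(x)
        x = x.clamp(0.0, 1.0)                        # numerical safeguard
    return x
```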

3.2. Forward Diffusion SDE for k-Category Data

To model discrete variables with k categories, e.g., DNA sequences (four bases: A, C, G, T) or protein sequences (20 amino acid residues), we need to consider diffusion in the probability simplex.

We seek to use a multivariate diffusion process over the probability simplex whose stationary distribution is the Dirichlet distribution (see Fig. 1b). The Jacobi diffusion process converges to a Beta stationary distribution, a univariate special case of the Dirichlet distribution. Using the connection between the Beta and Dirichlet distributions, we construct a multivariate diffusion process on the probability simplex that converges to the Dirichlet distribution from k − 1 independent univariate Jacobi diffusion processes via the classical stick-breaking construction

x_{1,t} = v_{1,t}, \quad x_{2,t} = (1 - v_{1,t})\, v_{2,t}, \quad x_{3,t} = (1 - v_{1,t})(1 - v_{2,t})\, v_{3,t}, \quad \ldots, \quad x_{k,t} = \prod_{i=1}^{k-1} (1 - v_{i,t}),

where v_{1,t}, v_{2,t}, \ldots, v_{k-1,t} are drawn from independent Jacobi diffusion processes at time t. Thus, we obtain a multivariate diffusion process with Dirichlet stationary distribution using x_{1,t}, x_{2,t}, \ldots, x_{k,t}.

For notational simplicity, we will use v and x to indicate the (k − 1)-dimensional and k-dimensional representations, respectively, for the rest of the manuscript. The conversion between v and x is done by the stick-breaking transform and its inverse.

To obtain an arbitrary Dirichlet stationary distribution, we parameterize the Jacobi diffusion processes accordingly. For example, for the stationary distribution to be the flat Dirichlet distribution Dir(1, 1, ..., 1) (i.e., the uniform distribution over the probability simplex), we need to choose Jacobi diffusion processes with stationary distributions Beta(1, k − 1), Beta(1, k − 2), ..., Beta(1, 1) for v1, v2, ..., vk−1 of the stick-breaking construction. This is simply achieved by choosing the parameters a, b in Eq. 5 to be (1, k − 1), (1, k − 2), ..., (1, 1).
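
The stick-breaking transform and its inverse are straightforward to implement. The following is a minimal sketch; the NaN entries returned by the inverse correspond to the undetermined dimensions discussed in Section 3.3.

```python
# A minimal sketch of the stick-breaking transform between the (k-1)-dimensional
# v space and the k-dimensional probability simplex, plus the Beta parameters
# that give the flat Dirichlet Dir(1, ..., 1) as the stationary distribution.
import numpy as np

def stick_breaking(v):
    """Map v in [0, 1]^(k-1) to a point x on the k-dimensional simplex."""
    stick_left = np.concatenate([[1.0], np.cumprod(1.0 - v)])  # stick remaining at each step
    return stick_left[:-1] * np.concatenate([v, [1.0]])        # last coordinate takes the rest

def inverse_stick_breaking(x):
    """Recover v from x; dimensions after the first v_i = 1 come out as NaN (undetermined)."""
    denom = 1.0 - np.concatenate([[0.0], np.cumsum(x[:-2])])   # 1 - sum of earlier coordinates
    with np.errstate(divide="ignore", invalid="ignore"):
        return x[:-1] / denom

k = 4  # e.g., the four DNA bases A, C, G, T
beta_params = [(1.0, k - 1 - i) for i in range(k - 1)]  # Beta(1,3), Beta(1,2), Beta(1,1)
x = stick_breaking(np.array([0.2, 0.5, 0.5]))           # -> [0.2, 0.4, 0.2, 0.2]
```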

We refer to this multivariate diffusion process as the multivariate Jacobi diffusion process by stick-breaking construction, and the generative modeling approach based on it as the Dirichlet diffusion score model. An infinite-dimensional form of this diffusion process with a more general distribution family has been proposed as the GEM process (Feng & Wang, 2007); the proposed process is a finite-dimensional version of the GEM process. We note that other forms of diffusion processes whose stationary distribution is a Dirichlet distribution exist (see Steinrücken et al. 2013b; Bakosi & Ristorcelli 2013). However, they are much more computationally expensive to use as forward diffusion processes, since they cannot be decomposed into independent univariate diffusion processes.

3.3. Score Matching Training for k-Category Discrete Data

Using the multivariate Jacobi diffusion by stick-breaking construction as the forward diffusion process allows us to train a score-based diffusion model for k-category discrete data.

The initial value of the diffusion in x space is set to the discrete data represented by a k-dimensional one-hot encoding such as (0, 0, 1, ..., 0). To sample from the forward diffusion, we first map the initial values of x to v space via the inverse stick-breaking transform. All dimensions of v after the first dimension equal to 1 are undetermined by the inverse stick-breaking transform given x; we consider them as drawn from the corresponding stationary Beta distributions. Explicitly drawing the Beta samples for initial values is not needed, since the density remains stationary and the samples and scores at any time can be computed directly from the Beta distributions.

The forward diffusion samples in v space at any time t are drawn from the Jacobi diffusion processes (for dimensions with deterministic initial values) and stationary Beta distributions (for dimensions with undetermined initial values). The scores of the transition density in the denoising score-matching loss (Eq. 4) are then computed from the corresponding Jacobi diffusion transition density function and Beta density function.

By applying the change-of-variable conversion, we can equivalently perform score matching in either v space or x space, since we can convert between the score of x and the score of v:

\frac{\partial \log p_x(x)}{\partial x} = \left( \frac{\partial \log p_v(v)}{\partial v} + \frac{\partial \log \left| \det \frac{\partial v}{\partial x} \right|}{\partial v} \right) \frac{\partial v}{\partial x}, \qquad \frac{\partial \log p_v(v)}{\partial v} = \frac{\partial \log p_x(x)}{\partial x} \frac{\partial x}{\partial v} - \frac{\partial \log \left| \det \frac{\partial v}{\partial x} \right|}{\partial v}.

The score model s_\theta(x_t, t) is more naturally formulated as a function of x, and we choose to compute the score-matching loss in v space because of the diagonal form of the diffusion coefficient.

Once the score model is learned, sampling from the reverse diffusion process in v space is nearly identical to sampling from multiple univariate reverse diffusion processes as in Eq. 7, except that the score model takes all dimensions of v as input (after converting to x space).

3.4. Weighting Function for Score Matching Loss of General SDE

The choice of the weighting function \lambda(t) in the score matching loss (Eq. 3 and 4) has previously been studied (Song et al., 2021; Huang et al., 2021). Under the assumption that the scalar diffusion coefficient g(v, t) of the forward diffusion process does not depend on the value v being diffused, \lambda(t) = g(t)^2 was shown to be the likelihood weighting (Song et al., 2021), and minimizing the loss function with this weighting is equivalent to maximizing the ELBO (Huang et al., 2021). However, this assumption about g(v, t) does not hold for the Jacobi diffusion process.

Here we motivate the use of

L(v, t) = \left\| \frac{\partial \log p_v(v)}{\partial v} - \frac{\partial \log q_v(v)}{\partial v} \right\|_{GG^\top}^2 = \left( \frac{\partial \log p_v(v)}{\partial v} - \frac{\partial \log q_v(v)}{\partial v} \right) G(v, t)\, G(v, t)^\top \left( \frac{\partial \log p_v(v)}{\partial v} - \frac{\partial \log q_v(v)}{\partial v} \right)^\top \quad (8)

to be the general form of the weighted score-matching loss for any SDE with matrix-form diffusion coefficient G(v, t), from the argument that the loss function should be invariant under change of variables, a property satisfied by the likelihood. In Appendix A.3, we show that this loss function is invariant to a change of variables by any bijective, differentiable function x = h(v), while the unweighted loss is not. If G is scalar or diagonal and does not depend on v, we recover the likelihood weighting of Song et al. 2021.
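
For the stick-breaking construction, G(v, t) is diagonal with entries \sqrt{s\, v_i (1 - v_i)}, so Eq. 8 reduces to a per-dimension reweighting of the squared score error. A minimal sketch, assuming score tensors of shape (batch, sequence, k − 1):

```python
# A minimal sketch of the weighted score-matching loss of Eq. 8 for the
# stick-breaking construction with diagonal G(v, t); `pred` and `target`
# are scores in v space.
import torch

def weighted_score_loss(pred, target, v, speed=1.0):
    # With diagonal G, the GG^T-weighted norm reduces to reweighting the
    # squared error by g_i^2 = s * v_i * (1 - v_i) in each dimension.
    g2 = speed * v * (1.0 - v)
    return (g2 * (pred - target) ** 2).flatten(1).sum(dim=1).mean()
```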

3.5. Likelihood Computation

After the model is trained with score matching, we can estimate likelihood using the probability flow ODE (see Eq. 2). Our formulation allows both computing the likelihood of the continuous distribution over the probability simplex and computing a variational lower bound on the likelihood of discrete data.

Likelihood Computation for Continuous Variable in Probability Simplex

We first estimate the likelihood in v space, which can be easily converted to the likelihood in x space using the probability change-of-variable formula (Chen et al., 2018).

Following Song et al. 2021, the probability flow ODE for a Jacobi diffusion process is:

dv = \left\{ \frac{s}{2} \left[ a(1 - v) - b v \right] - \frac{s}{2} (1 - 2v) - \frac{s}{2} v (1 - v)\, s_\theta(v, t) \right\} dt = \tilde{f}\, dt

By the instantaneous change-of-variable formula, we have:

p_0(v_0) = e^{\int_0^t \mathrm{tr}\left( \nabla_v \tilde{f}(v_\tau) \right) d\tau}\, p_t(v_t).

The trace of the Jacobian can be unbiasedly approximated by Hutchinson's estimator

\mathrm{tr}\left( \nabla_v \tilde{f}(v_t) \right) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ \epsilon^\top \nabla_v \tilde{f}(v_t)\, \epsilon \right],

and p_t(v_t) is computed with the stationary distribution. p_0(v_0) is converted to p_0(x_0) by applying the change-of-variable formula to obtain the likelihood.
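
Hutchinson's estimator is a standard autograd construction; a minimal sketch (assuming `f_tilde` maps a (batch, d) tensor to (batch, d) with each batch element depending only on its own input) follows:

```python
# A minimal sketch of Hutchinson's trace estimator for the Jacobian of the
# probability flow ODE drift, using vector-Jacobian products.
import torch

def hutchinson_trace(f_tilde, v, n_probes=1):
    v = v.detach().requires_grad_(True)
    out = f_tilde(v)
    est = torch.zeros(v.shape[0], device=v.device)
    for _ in range(n_probes):
        eps = torch.randn_like(v)  # Gaussian probe vector
        # One vector-Jacobian product gives eps^T (df/dv) without forming the Jacobian.
        vjp = torch.autograd.grad(out, v, grad_outputs=eps, retain_graph=True)[0]
        est = est + (vjp * eps).flatten(1).sum(dim=1)
    return est / n_probes  # per-example estimate of tr(d f_tilde / d v)
```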

Since this continuous-space likelihood is not directly comparable with discrete-space likelihood, we next derive an evidence lower bound (ELBO) that allows direct comparison with likelihood in the discrete data space.

Bounding Discrete Data Likelihood with a Variational Lower Bound

To obtain a variational lower bound for the discrete likelihood, we consider the continuous variable x in the probability simplex, drawn from the reverse diffusion process, as directly parameterizing categorical distributions. The discrete data y are drawn from these categorical distributions. Thus we obtain the discrete likelihood by marginalizing over x: p(y) = \int p_{\mathrm{Cat}}(y \mid x)\, p(x)\, dx.

While this marginalization is generally computationally intractable, we use the variational lower bound (ELBO)

\log p(y) \geq \mathbb{E}_{q_{\mathrm{Diff}}(x \mid y)} \left[ -\log q_{\mathrm{Diff}}(x \mid y) + \log p_{\mathrm{Cat}}(y \mid x) + \log p_{\mathrm{ODE}}(x) \right],

where q_{\mathrm{Diff}}(x \mid y) is the density of the forward diffusion from y at time \tilde{t}_0, with \tilde{t}_0 chosen to be close to 0; p_{\mathrm{Cat}}(y \mid x) is the categorical distribution likelihood; and p_{\mathrm{ODE}}(x) is the continuous-space likelihood of the probability flow ODE as described in the previous subsection, but with the lower end of time being \tilde{t}_0 instead of 0. This expectation is unbiasedly estimated by sampling from the forward diffusion process. This ELBO formulation is chosen so that the diffusion model training also minimizes the variational gap of this ELBO, which reduces to the KL divergence between the forward diffusion density and the reverse diffusion density, up to a constant, as \tilde{t}_0 \to 0. The bound can be tightened by choosing \tilde{t}_0 closer to zero, and is fairly tight in practice when \tilde{t}_0 is small; we performed an empirical analysis on a simple test case (see Appendix B.8).

3.6. Improving Sampling Efficiency and Sample Quality

Lastly, we introduce two techniques that can be applied to improve the efficiency of forward diffusion sampling during training or to improve sample quality post-training, both detailed in the Appendix. The sampling strategy for the forward diffusion process presented in Section 3.3 requires drawing samples from k − 1 Jacobi diffusion processes for k-category data, which can be demanding when k is large. In Appendix A.4, we describe a strategy that simplifies this, effectively requiring samples from only one univariate Jacobi diffusion process.

The second technique is designed to improve sample quality. Compared to unbiased sampling from the learned model distribution, it is often desirable to sample near high probability density regions, which often correspond to higher-quality samples in suitable applications. In Appendix A.5, we propose a simple technique, time dilation, applicable to reverse diffusion sampling without modifying the score model, when a flat distribution such as the flat Dirichlet distribution is the stationary distribution. In Appendix B.9, we compare the sample quality obtained by time dilation with other sampling strategies reported to improve sample quality (Song et al., 2020).

4. Results

4.1. Implementation Notes of Dirichlet Diffusion Score Model

Sampling from Jacobi diffusion processes is more expensive than from the commonly used SDEs with Gaussian stationary distributions (Song et al., 2020), as we need an SDE sampler such as the Euler-Maruyama sampler. However, we only need to generate samples from two starting points, 0 and 1, for any categorical data. Hence, we can pre-sample a dictionary of diffused samples at different time points t and sample from the dictionary during training. Similarly, the gradient of the log transition density at the samples can also be precomputed. This approach allows efficient training with little additional overhead. We discuss additional implementation details in Appendix B.
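
A minimal sketch of this pre-sampling scheme follows; `simulate_jacobi` is the forward sampler sketched in Section 2.2, `transition_score` is a hypothetical helper returning the gradient of the log transition density, and the time grid with nearest-time lookup is an illustrative assumption.

```python
# A minimal sketch of pre-sampling a dictionary of forward-diffused samples
# and transition scores, reusable because all one-hot data start at 0 or 1.
import numpy as np

def build_diffusion_dictionary(time_grid, n_per_time, a, b, s=1.0):
    table = {}
    for x0 in (0.0, 1.0):  # the only two starting points needed for one-hot data
        for t in time_grid:
            xt = simulate_jacobi(np.full(n_per_time, x0), a, b, s=s, t_end=t)
            table[(x0, t)] = (xt, transition_score(xt, x0, t))
    return table

def lookup(table, time_grid, x0, t, rng):
    # During training, reuse a pre-computed sample at the nearest tabulated time.
    t_near = time_grid[np.argmin(np.abs(np.asarray(time_grid) - t))]
    xt, scores = table[(x0, t_near)]
    i = rng.integers(len(xt))
    return xt[i], scores[i]
```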

4.2. Application to Binarized MNIST

We first applied the method to a benchmark dataset for generative modeling, the binarized MNIST dataset, and obtained competitive performance (Table 1). More details of all applications are included in Appendix C. Examples of samples are shown in Appendix D.

Table 1.

Binarized MNIST benchmark performance

Method NLL (nats) ↓
DDSM (ours) 78.04 ± 0.37
CR-VAE 76.93
Locally Masked PixelCNN 77.58
PixelRNN 79.20
PixelCNN 81.30
EoNADE 84.68
MADE 86.43
NADE 88.33

4.3. Sudoku Generation as a Constraint Satisfying Generation Test

To test the ability to generate highly structured data that satisfy hard constraints, we applied our method to the problem of generating and solving Sudoku. This problem has not been solved through generative modeling to the best of our knowledge.

For training, we used a Sudoku generator to continuously produce Sudoku puzzles. A fully-filled Sudoku puzzle can be encoded with 81 categorical variables with k = 9 categories. The model architecture we adopt is a 20-block transformer with a relative positional encoding designed for the Sudoku problem.

Specifically, a Sudoku puzzle is represented by a set of 81 elements and a 27-dimensional binary relative positional encoding (81 × 81 × 27), corresponding to whether two elements are in the same row, same column, or the same 3×3 block. The relative positional encoding is transformed by a linear layer and added to the transformer's attention logits prior to the softmax.
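
A minimal sketch of how such binary relations over the 81 cells can be constructed is shown below; for brevity, only the three basic indicators (same row, same column, same 3×3 block) are computed, whereas the paper's full encoding is 27-dimensional and presumably refines these relations further.

```python
# A minimal sketch of binary relative positional relations for the 81 Sudoku
# cells (simplified to 3 indicator channels; the paper uses 27).
import numpy as np

rows = np.arange(81) // 9
cols = np.arange(81) % 9
blocks = (rows // 3) * 3 + (cols // 3)

rel = np.stack([
    rows[:, None] == rows[None, :],      # same row
    cols[:, None] == cols[None, :],      # same column
    blocks[:, None] == blocks[None, :],  # same 3x3 block
], axis=-1).astype(np.float32)           # shape (81, 81, 3)
# A linear layer maps the last axis to per-head biases that are added to the
# attention logits before the softmax.
```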

The generative capability of the model is evaluated by the percentage of generated Sudoku puzzles that are correctly filled (Table 2). Only a completely correct Sudoku counts, with no partial credit given. Applying the time-dilation technique (Appendix B.3) to drive samples toward high-density areas improved sample quality up to 100%. In contrast, the heuristic algorithm that we used to generate the training data has only a 0.31% success rate. Even though no previous generative modeling approach has been applied to Sudoku, we also trained the Sudoku transformer with Bit Diffusion (Chen et al., 2022b) and D3PM-uniform/Multinomial Diffusion (Hoogeboom et al., 2021b; Austin et al., 2021), using the same model architecture. DDSM with time dilation achieved the best performance in comparison with these methods (Appendix C.3).

Table 2.

Sudoku generation and solving accuracies for single samples with DDSM. With multiple samples, all Sudoku puzzles we tested were solved. See Appendix C.3 for comparison with baseline diffusion methods.

Task Time dilation Accuracy (%)
Generation 8x 100
4x 99.88 ± 0.06
2x 98.87 ± 0.16
1x 95.08 ± 0.46
(Heuristic algorithm baseline) 0.31
Solving 8x 98.26 ± 0.18
4x 97.54 ± 0.18
2x 96.45 ± 0.32
1x 93.85 ± 0.42

Interestingly, similar to prior observations on image generation quality (Song et al., 2020), we also observe a trade-off between sample quality and computational budget: Sudoku generation accuracy improves with a higher number of sampling steps and more time dilation.

4.4. Solving Sudoku via Conditional Generation

We applied the Sudoku generative SDE model to solving Sudoku puzzles by conditional generation with the inpainting method (Song et al., 2020), clamping entries to the given clues of the puzzle. We evaluated the model on an easy Sudoku dataset with 36 clues on average (Wang et al., 2019) and a hard Sudoku dataset with as few as 17 clues (Palm et al., 2018), the minimum number of clues possible for Sudoku (McGuire et al., 2014).
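
A minimal sketch of this clamping scheme follows; `reverse_step` (one reverse-diffusion update, as in Eq. 7) and `forward_sample` (forward transition sampler) are hypothetical stand-ins for the samplers described earlier, and T and the step count are illustrative.

```python
# A minimal sketch of inpainting-style conditional sampling: at every reverse
# step, clue positions are overwritten with forward-diffused clue values, so
# only the unknown cells are generated.
import torch

@torch.no_grad()
def solve_by_inpainting(score_model, clue_x, clue_mask, T=4.0, n_steps=1000):
    # clue_x: (batch, 81, k) one-hot clues; clue_mask: (batch, 81, 1), 1 at clue cells.
    k = clue_x.shape[-1]
    # Initialize unknown cells from the flat Dirichlet stationary distribution.
    x = torch.distributions.Dirichlet(torch.ones(k)).sample(clue_x.shape[:-1])
    x = x.to(clue_x.device)
    dt = T / n_steps
    for i in range(n_steps):
        t = T - i * dt
        x = reverse_step(score_model, x, t, dt)             # unconditional reverse update
        x_clue = forward_sample(clue_x, max(t - dt, 0.0))   # clues diffused to the same time
        x = clue_mask * x_clue + (1.0 - clue_mask) * x      # clamp clue positions
    return x
```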

Even though no additional training is done for solving Sudoku, the generative SDE model trained with DDSM solved most puzzles in the easy dataset with a single sample (Table 2). In contrast, models trained with Bit Diffusion and D3PM-uniform both have difficulty with the Sudoku solving task, solving < 10% of easy puzzles with a single sample.

The Sudoku-solving performance of the DDSM model can be further improved by increasing time dilation: with 128x time dilation, a single sample solves 99.4% of the easy dataset and 42.4% of the hard dataset. When multiple samples are allowed, the model solves 100% of all puzzles. The number of samples required to solve a puzzle increases significantly when fewer than 25 clues are given, by roughly 2.3x per one fewer clue, which is still far better than random guessing (Appendix E). This allows us to solve 100% of the hard Sudoku puzzles in the dataset. Neither the model trained with Bit Diffusion nor the one trained with D3PM-uniform can solve hard puzzles. Previous state-of-the-art supervised models, SATNet (Wang et al., 2019) and the recurrent relational network (Palm et al., 2018), can solve most but not all of them (see Appendix E for more discussion).

4.5. Generation of Promoter DNA Sequences

Finally, we applied the method to designing human promoter DNA sequences (Figure 3). Promoters are key DNA sequences that drive the transcription of genes and partially determine gene expression levels. Designing promoter sequences has broad applications in biomedical research and bioengineering, such as controlling synthetic gene expression. Human promoter sequences are known to be highly diverse, and the rules that determine promoter sequence activity are not fully understood (Wang et al., 2017). Thus it is an ideal problem to address through deep generative modeling. No prior computational approach for designing human promoter sequences exists to our knowledge.

Figure 3. Designing human gene promoter sequences with a conditional Dirichlet diffusion score model trained to generate sequences from transcription initiation signal profiles. Example model-designed sequences based on the transcription initiation signal profile of a test set promoter are shown, with the corresponding human genome sequence for comparison. Known promoter motifs are annotated in the sequences. We provide an introduction for readers unfamiliar with this topic in Appendix F.1.

To enable human promoter sequence design, we trained a conditional Dirichlet diffusion score model to perform conditional generation of promoter sequences using the transcription initiation signal profile as an additional input to the score model (Figure 3). Transcription initiation signal profiles reflect the transcription initiation activity at every sequence position and are obtained from CAGE experiments (FANTOM Consortium et al., 2014). The conditional generation model allows controlling the transcription initiation signal profile produced by the sequence, including controlling the expression level. We constructed a human promoter sequence dataset containing 100,000 promoter sequences and corresponding transcription initiation signal profiles, with each sequence 1024 basepairs long and centered at the annotated transcription start site (Hon et al., 2017). This set of promoters spans the whole range of human promoter activity levels, from highly expressed protein-coding gene promoters to ncRNA gene promoters with very low expression.

4.6. Evaluation of Designed Promoter DNA Sequences

With a custom score-model architecture that we call Promoter Designer, the generative model obtained a conditional NLL estimate of ≤ 1.32 bits per basepair for promoters from test set chromosomes that are among the top 10,000 promoters, whereas a simple baseline using position-specific base composition achieves only 1.92 bits. Promoter sequences with higher activity levels also obtained better NLL estimates (Figure 4a), in line with the expectation that sequences of high-activity promoters are less random than those of low-activity promoters. Multiple sequence samples conditioned on the same transcription initiation signal profile are typically diverse (Figure 3) while sharing similar characteristics. The generated sequences return no hits when compared with the human genome using BLAST (Morgulis et al., 2008), indicating that the model does not simply memorize human genomic sequences.

Figure 4. Performance of the promoter design diffusion model. a) NLL evaluated on test chromosomes 8 and 9, grouped by the expression level of the promoter (90–100 percentile being most expressed). b-c) Model-designed sequences have position-specific nucleotide composition (b) and motif location distributions (c) comparable to human genome sequences. d) Designed sequences (blue) are compared with human genome sequences (gray) using promoter activity predicted from sequence by Sei (Chen et al., 2022a). Generated sequences are grouped by the targeted promoter activity level (x-axis). The y-axis shows predicted promoter activity (average H3K4me3 prediction across cell types), divided by the baseline prediction for average genomic sequences.

The sequence samples are observed to share highly similar properties with promoter sequences from the human genome (Figure 4b-d), such as position-specific nucleotide composition relative to the transcription start site (Figure 4b) as well as the distribution of known promoter-related motifs (Figure 4c).

To evaluate whether the generated sequences recapitulate more complex sequence rules of promoter activity, we applied Sei, a published deep learning sequence model that can predict active promoters from sequence (based on chromatin mark H3K4me3 predictions) (Chen et al., 2022a). The generated sequences have predicted promoter activity comparable to human genome sequences, for both high- and low-activity promoters (Figure 4d). Applying time dilation further increased predicted promoter activity (Appendix F).

We also trained models with baseline discrete-data diffusion approaches. The evaluation compares generated sequences with human genome promoter sequences (ground truth) on the test chromosomes, similar to Figure 4c. The metric SP-MSE is the MSE between the predicted promoter activities of generated sequences and human genome sequences (lower is better). Our model trained with DDSM outperforms models trained with the baseline approaches (Table 3).

Table 3.

Promoter design performance comparison for different models. We trained all models with the same Promoter Designer architecture and the same early stopping criterion for this comparison.

Model SP-MSE ↓
DDSM (time dilation 4x) 0.0334
DDSM (time dilation 2x) 0.0348
DDSM (time dilation 1x) 0.0363
D3PM-uniform / Multinomial Diffusion 0.0375
Bit Diffusion (one-hot encoding) 0.0395
Bit Diffusion (bit-encoding) 0.0414

5. Related Work

Diffusion models were first proposed in Sohl-Dickstein et al. 2015 and Ho et al. 2020, including their first application to discrete data using a binomial diffusion process for a binary dataset (Sohl-Dickstein et al., 2015). Song et al. 2020 proposed the score-based generative SDE diffusion model framework with continuous time for continuous data. Recent works generalizing diffusion models to discrete data have mostly considered discrete time (Hoogeboom et al., 2021b; Austin et al., 2021). More recent works (Campbell et al., 2022; Sun et al., 2022) proposed continuous-time approaches for discrete-space diffusion based on a continuous-time Markov chain formulation. Another direction is to apply existing continuous-time, continuous-space diffusion approaches to discrete data encodings, such as bit encoding (Chen et al., 2022b) or word embeddings (Li et al., 2022), and quantize the continuous samples. Concurrently with this work, Richemond et al. 2022 proposed to use k independent Cox-Ingersoll-Ross (CIR) processes, whose stationary distributions are Gamma distributions, as the forward SDE; normalizing the resulting k-dimensional vector to unit sum then yields points on the probability simplex. In addition, Lou & Ermon 2023 introduced reflected SDEs to score-based diffusion models, which can be applied to restrict the diffusion to the probability simplex.

Latent diffusion models also allow generative modeling of discrete data by modeling only the distribution of a continuous latent variable with diffusion (Vahdat et al., 2021; Lovelace et al., 2022). While the reverse diffusion process in our approach can also be interpreted as a latent variable that emits discrete data (Section 3.5), the relationship between latent and discrete variables is fixed rather than learned.

On generative modeling for biological sequence design, deep generative models have recently been applied to DNA sequence design (Killoran et al., 2017; Wang et al., 2020; Zrimec et al., 2022), though no diffusion model had been developed for this task. No prior method exists for designing human promoter sequences to our knowledge. On the protein sequence design problem, several deep generative models have been developed to generate sequences conditioned on protein structure (Ingraham et al., 2019; Dauparas et al., 2022), and diffusion models have been applied to generate protein structure (Trippe et al., 2022; Watson et al., 2022; Lee & Kim, 2022). Recent works also jointly generate structure and sequence with diffusion (Luo et al., 2022; Anand & Achim, 2022).

Our contribution is to propose an approach for discrete data modeling with continuous-time SDE diffusion in probability simplex space, and to apply it to develop the first method for human promoter sequence design as well as a novel application to generative modeling of Sudoku. All existing works using score-based generative SDEs are based on diffusion processes that converge to Gaussian stationary distributions; here we expand the generative SDE toolkit with processes that converge to a Dirichlet stationary distribution. In addition, we propose a simple and easily applicable technique, time dilation, to improve sample quality.

6. Discussion

We presented the Dirichlet diffusion score model (DDSM), a continuous-time framework for generative modeling of biological sequences that can also be used for other types of discrete data. The approach provides a plug-in substitute for Gaussian-stationary-distribution SDEs for discrete variables and expands the toolkit of generative diffusion models.

The size of the continuous diffusion state scales linearly with the number of categories and thus may not directly scale to a very high number of categories due to memory and computational constraints. A potential way to address this is to use bit encoding or hierarchical encoding schemes, which can reduce the required encoding dimensions to log2(C), where C is the number of categories. However, the current approach is well suited to modeling biological sequences, such as DNA and RNA sequences (4 bases) and protein sequences (20 amino acid residues), as well as other data that can be encoded with a moderate number of categories.

We are encouraged by the promising results in designing human promoter sequences and the strong constraint satisfaction capability demonstrated in generating and solving Sudoku, and look forward to further development in biological sequence design based on this approach.

Supplementary Material

Figure 2. Sudoku generation and solving through generative modeling with a diffusion model, as a test for constraint satisfaction.

Acknowledgements

This work is supported by Cancer Prevention and Research Institute of Texas grant RR190071, NIH grant DP2GM146336, and the UT Southwestern Endowed Scholars Program.

Footnotes

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

2 It is also known as the univariate Wright-Fisher diffusion.

References

  1. Anand N. and Achim T. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. arXiv preprint, 2022. doi: 10.48550/ARXIV.2205.15019.
  2. Anderson B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
  3. Austin J., Johnson D. D., Ho J., Tarlow D., and van den Berg R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
  4. Bakosi J. and Ristorcelli J. R. A stochastic diffusion process for the Dirichlet distribution. International Journal of Stochastic Analysis, 2013.
  5. Campbell A., Benton J., De Bortoli V., Rainforth T., Deligiannidis G., and Doucet A. A continuous time framework for discrete denoising models. arXiv preprint, 2022. doi: 10.48550/ARXIV.2205.14987.
  6. Castro-Mondragon J. A., Riudavets-Puig R., Rauluseviciute I., Lemma R. B., Turchi L., Blanc-Mathieu R., Lucas J., Boddie P., Khan A., Manosalva Perez N., Fornes O., Leung T. Y., Aguirre A., Hammal F., Schmelter D., Baranasic D., Ballester B., Sandelin A., Lenhard B., Vandepoele K., Wasserman W. W., Parcy F., and Mathelier A. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res., 50(D1):D165–D173, January 2022.
  7. Chen K. M., Wong A. K., Troyanskaya O. G., and Zhou J. A sequence-based global map of regulatory activity for deciphering human genetics. Nat. Genet., 54(7):940–949, July 2022a.
  8. Chen R. T. Q., Rubanova Y., Bettencourt J., and Duvenaud D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  9. Chen T., Zhang R., and Hinton G. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022b.
  10. The FANTOM Consortium, the RIKEN PMI, and CLST (DGT). A promoter-level mammalian expression atlas. Nature, 2014.
  11. Dauparas J., Anishchenko I., Bennett N., Bai H., Ragotte R. J., Milles L. F., Wicky B. I. M., Courbet A., de Haas R. J., Bethel N., Leung P. J. Y., Huddy T. F., Pellock S., Tischer D., Chan F., Koepnick B., Nguyen H., Kang A., Sankaran B., Bera A. K., King N. P., and Baker D. Robust deep learning-based protein sequence design using ProteinMPNN. Science, 378(6615):49–56, 2022.
  12. Feng S. and Wang F.-Y. A class of infinite-dimensional diffusion processes with connection to population genetics. Journal of Applied Probability, 44(4):938–949, 2007. doi: 10.1239/jap/1197908815.
  13. Griffiths R. C. and Spano' D. Diffusion processes and coalescent trees. arXiv preprint, 2010. doi: 10.48550/ARXIV.1003.4650.
  14. Ho J., Jain A., and Abbeel P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840–6851, 2020.
  15. Hon C.-C., Ramilowski J. A., Harshbarger J., Bertin N., Rackham O. J. L., Gough J., Denisenko E., Schmeier S., Poulsen T. M., Severin J., Lizio M., Kawaji H., Kasukawa T., Itoh M., Burroughs A. M., Noma S., Djebali S., Alam T., Medvedeva Y. A., Testa A. C., Lipovich L., Yip C.-W., Abugessaisa I., Mendez M., Hasegawa A., Tang D., Lassmann T., Heutink P., Babina M., Wells C. A., Kojima S., Nakamura Y., Suzuki H., Daub C. O., de Hoon M. J. L., Arner E., Hayashizaki Y., Carninci P., and Forrest A. R. R. An atlas of human long non-coding RNAs with accurate 5' ends. Nature, 543(7644):199–204, March 2017.
  16. Hoogeboom E., Gritsenko A. A., Bastings J., Poole B., Berg R. v. d., and Salimans T. Autoregressive diffusion models. arXiv preprint, 2021a. doi: 10.48550/ARXIV.2110.02037.
  17. Hoogeboom E., Nielsen D., Jaini P., Forré P., and Welling M. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454–12465, 2021b.
  18. Huang C.-W., Lim J. H., and Courville A. C. A variational perspective on diffusion-based generative models and score matching. In Advances in Neural Information Processing Systems, volume 34. Curran Associates, Inc., 2021.
  19. Ingraham J., Garg V., Barzilay R., and Jaakkola T. Generative models for graph-based protein design. Adv. Neural Inf. Process. Syst., 32, 2019.
  20. Killoran N., Lee L. J., Delong A., Duvenaud D., and Frey B. J. Generating and designing DNA with deep generative models. arXiv preprint, 2017. doi: 10.48550/ARXIV.1712.06148.
  21. Kimura M. Stochastic processes and distribution of gene frequencies under natural selection. Cold Spring Harb. Symp. Quant. Biol., 20:33–53, 1955.
  22. Kimura M. Some problems of stochastic processes in genetics. Ann. Math. Stat., 28(4):882–901, 1957.
  23. Lee J. S. and Kim P. M. ProteinSGM: Score-based generative modeling for de novo protein design. bioRxiv, pp. 2022.07.13.499967, July 2022.
  24. Li X. L., Thickstun J., Gulrajani I., Liang P., and Hashimoto T. B. Diffusion-LM improves controllable text generation. arXiv preprint, 2022. doi: 10.48550/ARXIV.2205.14217.
  25. Lou A. and Ermon S. Reflected diffusion models. arXiv preprint, 2023.
  26. Lovelace J., Kishore V., Wan C., Shekhtman E., and Weinberger K. Latent diffusion for language generation. arXiv preprint, 2022. doi: 10.48550/ARXIV.2212.09462.
  27. Luo S., Su Y., Peng X., Wang S., Peng J., and Ma J. Antigen-specific antibody design and optimization with diffusion-based generative models for protein structures. bioRxiv, pp. 2022.07.10.499510, October 2022.
  28. McGuire G., Tugemann B., and Civario G. There is no 16-clue sudoku: Solving the sudoku minimum number of clues problem via hitting set enumeration. Exp. Math., 23(2):190–217, April 2014.
  29. Morgulis A., Coulouris G., Raytselis Y., Madden T. L., Agarwala R., and Schäffer A. A. Database indexing for production MegaBLAST searches. Bioinformatics, 24(16):1757–1764, August 2008.
  30. Palm R., Paquet U., and Winther O. Recurrent relational networks. Adv. Neural Inf. Process. Syst., 31, 2018.
  31. Richemond P. H., Dieleman S., and Doucet A. Categorical SDEs with simplex diffusion. arXiv preprint, 2022.
  32. Sohl-Dickstein J., Weiss E., Maheswaranathan N., and Ganguli S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.
  33. Song Y., Sohl-Dickstein J., Kingma D. P., Kumar A., Ermon S., and Poole B. Score-based generative modeling through stochastic differential equations. arXiv preprint, 2020. doi: 10.48550/ARXIV.2011.13456.
  34. Song Y., Durkan C., Murray I., and Ermon S. Maximum likelihood training of score-based diffusion models. In Advances in Neural Information Processing Systems, volume 34, pp. 1415–1428. Curran Associates, Inc., 2021.
  35. Steinrücken M., Wang Y. R., and Song Y. S. An explicit transition density expansion for a multi-allelic Wright-Fisher diffusion with general diploid selection. Theoretical Population Biology, 83:1–14, 2013a.
  36. Steinrücken M., Wang Y. X. R., and Song Y. S. An explicit transition density expansion for a multi-allelic Wright-Fisher diffusion with general diploid selection. Theor. Popul. Biol., 83:1–14, February 2013b.
  37. Sun H., Yu L., Dai B., Schuurmans D., and Dai H. Score-based continuous-time discrete diffusion models. arXiv preprint, 2022. doi: 10.48550/ARXIV.2211.16750.
  38. Tancik M., Srinivasan P., Mildenhall B., Fridovich-Keil S., Raghavan N., Singhal U., Ramamoorthi R., Barron J., and Ng R. Fourier features let networks learn high frequency functions in low dimensional domains. Adv. Neural Inf. Process. Syst., 33:7537–7547, 2020.
  39. Trippe B. L., Yim J., Tischer D., Baker D., Broderick T., Barzilay R., and Jaakkola T. Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. arXiv preprint, 2022. doi: 10.48550/ARXIV.2206.04119.
  40. Vahdat A., Kreis K., and Kautz J. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.
  41. Wang P.-W., Donti P., Wilder B., and Kolter Z. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning, pp. 6545–6554, 2019.
  42. Wang Y., Wang H., Wei L., Li S., Liu L., and Wang X. Synthetic promoter design in Escherichia coli based on a deep generative network. Nucleic Acids Research, 48(12):6403–6412, 2020.
  43. Wang Y.-L., Kassavetis G. A., Kadonaga J. T., and others. The punctilious RNA polymerase II core promoter. Genes Dev., 31(13):1289–1301, 2017.
  44. Watson J. L., Juergens D., Bennett N. R., Trippe B. L., Yim J., Eisenach H. E., Ahern W., Borst A. J., Ragotte R. J., Milles L. F., Wicky B. I. M., Hanikel N., Pellock S. J., Courbet A., Sheffler W., Wang J., Venkatesh P., Sappington I., Torres S. V., Lauko A., De Bortoli V., Mathieu E., Barzilay R., Jaakkola T. S., DiMaio F., Baek M., and Baker D. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv, pp. 2022.12.09.519842, December 2022.
  45. Zrimec J., Fu X., Muhammad A. S., Skrekas C., Jauniskis V., Speicher N. K., Börlin C. S., Verendel V., Chehreghani M. H., Dubhashi D., Siewers V., David F., Nielsen J., and Zelezniak A. Controlling gene expression with deep generative design of regulatory DNA. Nat. Commun., 13(1):5099, August 2022.
