[Preprint]. 2024 Oct 2:arXiv:2406.12056v3. [Version 3]

Learning Molecular Representation in a Cell

Gang Liu 1, Srijit Seal 2, John Arevalo 2, Zhenwen Liang 1, Anne E Carpenter 2, Meng Jiang 1, Shantanu Singh 2
PMCID: PMC11213146  PMID: 38947938

Abstract

Predicting drug efficacy and safety in vivo requires information on biological responses (e.g., cell morphology and gene expression) to small molecule perturbations. However, current molecular representation learning methods do not provide a comprehensive view of cell states under these perturbations and struggle to remove noise, hindering model generalization. We introduce the Information Alignment (InfoAlign) approach to learn molecular representations through the information bottleneck method in cells. We integrate molecules and cellular response data as nodes into a context graph, connecting them with weighted edges based on chemical, biological, and computational criteria. For each molecule in a training batch, InfoAlign optimizes the encoder’s latent representation with a minimality objective to discard redundant structural information. A sufficiency objective decodes the representation to align with different feature spaces from the molecule’s neighborhood in the context graph. We demonstrate that the proposed sufficiency objective for alignment is tighter than existing encoder-based contrastive methods. Empirically, we validate representations from InfoAlign in two downstream applications: molecular property prediction against up to 27 baseline methods across four datasets, plus zero-shot molecule-morphology matching.

1. Introduction

Drug properties, e.g., toxicity and adverse effects [25], are induced by molecular initiating events—interactions between a molecule and a biological system—that first impact the cellular level and ultimately influence tissue or organ functions [32]. However, a molecule’s structure alone is insufficient to predict its impact on cells: each chemical interacts with multiple cells and genes and induces complex changes in gene expression and cell morphology, making predictions of downstream responses challenging [5, 33]. Hence, molecular representation learning should make use of information about cellular response, enhancing the representation of the mode of action and thereby improving predictions for downstream bioactivity tasks [25, 54].

Holistic molecular representations that integrate molecular structure, cell morphology, and gene expression remain largely unexplored [18, 60, 26, 54, 46]. For example, graph self-supervised methods only manipulate molecular structures to perturb or mask molecular graphs using contrastive or predictive losses [18, 60, 21]. Moshkov et al. [33] explored the ability of different data modalities, taken independently, to predict molecules’ assay activity in a diverse set of assays (tasks). They found (from [33]’s Fig. 2) that molecular structure supports highly accurate prediction (AUC > 90%) in 31% (16/52) of tasks, gene expression in 37% (19/52) and cell morphology in 54% (28/52). Similarly, in our experiments (Figure 3), we observe that molecular structure is not a one-size-fits-all solution.

Figure 2:

Molecular Representation Learning Using the Context Graph: (a) In Section 4.1, we construct the graph with various interaction, perturbation, and cosine similarities among molecules x, cell morphology profiles c, and genes e. Given a training batch of molecules, including x1 and x4, random walk extracts paths, for instance, of length four. (b) In Section 4.2, we aim to learn molecular representations based on the information bottleneck, preserving minimal information from the input molecule while ensuring sufficient information for decoding the target along the walk path 𝒫x.

Figure 3:

Percentage of Tasks Where Representations Excel: We compare the relative performance of three single representation (Single Rep.) approaches (molecular structure, cell morphology, and gene expression) and three aligned representations (Aligned Rep.): InfoAlign, CLOOME, InfoCORE.

Cells can be perturbed by treating them with chemicals or genetic reagents that disrupt a particular gene or pathway. These chemical and genetic perturbations in vitro naturally bridge molecules with cell morphology and gene expression, as illustrated in Figure 1 (b). However, multi-modal contrastive methods such as CLOOME [46] and InfoCORE [54], depicted in Figure 1 (a), focus primarily on aligning molecular representations with cell morphology [46, 54] or gene expression [54]. These approaches fall short in two ways. (1) They do not remove redundant information (the grey-colored area in Figure 1 (b)) that may harm representation generalization. The presence of redundant information [54] may induce spurious correlations, adversely affecting the generalization of molecular representations. For example, in small molecule perturbations [3, 6], batch identifiers can signify confounding technical factors, creating misleading associations between molecular structures and cell morphology [54]. (2) They treat molecules as the sole connectors between gene expression and cell morphology, ignoring the potential for genetic perturbations [6] to directly establish connections between these modalities. Genetic perturbations [6] not only enrich the feature space of gene expression and cell morphology but also help steer molecular representation learning toward the overlapped (bottleneck) area in Figure 1 (b).

Figure 1:

Comparison of Representation Learning Methods: (a) Existing contrastive methods use two encoders—one for molecules and another for cell morphology or gene expression features—without sharing the molecule encoders for different alignment targets. (b) InfoAlign removes redundant information from molecules, cell morphology, and gene expression based on the information bottleneck, resulting in more concise yet predictive molecular representations [1].

To address the aforementioned challenges, we conceptualize the cellular response processes as a context graph, capturing a more complete set of interactions among molecules, gene expression, and cell morphology. We identify the neighborhood of the molecule on the context graph and apply the information bottleneck [53] to optimize molecular representations, which aligns them with neighboring biological variables to remove redundant information and improve generalization.

We propose the Information Alignment (InfoAlign) approach, as presented in Figure 1 (b). InfoAlign uses one encoder and multiple decoders with information bottleneck for minimal sufficient statistics in representation learning. The minimality objective optimizes the encoder to learn the minimal informative representation from molecular structures by discarding redundant information. The sufficiency objective ensures the encoder retains sufficient information, allowing decoders to reconstruct features for biological variables in neighborhood areas of the context graph. We construct the context graph based on molecule and genetic perturbations [4, 6, 50] and introduce more biological (gene-gene interaction [16]) and computational (cosine similarity) criteria to increase edge connectivity. We conduct random walks on the context graph, beginning with the molecule in the training batch, to identify its neighborhood. Cumulative edge weights indicate similarity between the molecule and variables along the path. The molecule is encoded, and its latent representation is decoded to align with features identified in the random walk. Encoders and decoders are jointly optimized using an upper bound for the minimality objective and a lower bound for the sufficiency objective.

The sufficiency objective introduces a decoder-based bound for multi-modal alignment. We show its theoretical advantages by demonstrating that it provides a tighter bound than the encoder-based approaches used in previous contrastive methods [36, 40], as discussed in Section 4.3. In experiments, InfoAlign outperforms up to 27 baselines across three classification and one regression dataset, covering 685 tasks, with average improvements of up to 6.4%. InfoAlign also demonstrates strong zero-shot multi-modal matching on two molecule-morphology datasets.

2. Related Work

Representation Learning on Molecular Structure:

Representation learning approaches for molecules can be categorized into sequential-based [24, 45] or graph-based models [18, 60, 62, 27]. Sequential models, utilizing string formats of molecules like SMILES and SELFIES [24], have evolved from Recurrent Neural Networks (RNNs) to Transformers [7, 45]. These models typically follow pretraining strategies similar to language models such as BERT [9], RoBERTa [30, 7] and GPT [39]. The pretraining targets are thus often next-token prediction or masked language modeling [9, 7] on SMILES or SELFIES sequences [39]. Graph Neural Networks (GNNs) are the architectures for graph-based approaches [18, 60, 62, 29], where methods to pretrain GNNs often perturb or mask the atoms, edges, or substructures of molecular graphs with contrastive [18, 60] and predictive losses [62, 21]. Recent evidence highlights the challenges of developing universal molecular representations based solely on molecular structures without integrating domain knowledge [3, 47, 51, 48, 28]. Although using motifs is a common way to incorporate such knowledge [44, 21], the incorporation of information about molecules’ biological impacts is much less explored. We aim to enhance molecular representation learning by incorporating domain knowledge from cellular response data.

Representation Learning with Different Modalities:

Existing methods on multimodal alignment, such as CLIP [40], primarily address pairwise relationships between texts and images and use objectives like InfoNCE [36, 54, 46]. These approaches use separate encoders for different modalities to compute a contrastive loss whose mutual information estimate is upper bounded by the logarithm of the number of negative examples [38]. Subsequent research on molecules similarly focuses on pairwise alignment between molecules and cell images [46, 54], molecules and protein sequences [20], and molecules and text [10, 22]. Although BioBridge [57] handles multiple modalities, it leverages a knowledge graph for transforming representations between modalities rather than optimizing molecular representations.

Representation Learning with Cellular Response Data:

A primary goal of molecular representation learning is to predict molecular bioactivity. Likewise, emerging gene expression [50] and morphological profiling approaches [5, 49] that describe perturbed genetic or cellular states in cell cultures can also be used to predict bioactivity. In some datasets, molecules are the perturbations, and the perturbed cell states measured are gene expression values for a thousand or more genes [50] and/or microscopy Cell Painting images, which can be represented as a thousand or more morphology features [8]. Recently created large-scale perturbation datasets [3, 50, 6] could enrich molecular representation learning approaches. CLOOME [46] and MoCoP [34] contrast cellular images with molecules, and InfoCORE [54] contrasts molecules with either morphological profiles [4] or gene expression [54]. InfoCORE [54] aims to mitigate confounding batch identifiers, but its effectiveness depends on a batch classifier, which is impractical without batch identifiers. We integrate cellular response data and molecules into a context graph to capture cellular response patterns, focusing on learning molecular representations that remove nuisances [52].

3. Problem Definition

We denote x ∈ 𝒳 as a molecule from the space 𝒳. An encoder p_θ(z|x) with parameters θ maps x to a D-dimensional latent representation z ∈ ℝ^D. One may implement a Graph Neural Network (GNN) [58] as the encoder. The GNN first updates node representations and then performs a readout operation (e.g., summation) over the nodes to obtain the latent representation.

Existing research has extensively used structural features to pretrain the GNN encoder [18, 21]. However, incorporating more expressive features from the cellular context, such as cell morphology and gene expression, remains largely unexplored for improving molecular representations. In this work, we use these features as targets to optimize molecular representations.

4. Multi-modal Alignment with InfoAlign

We present the overall representation learning framework in Figure 2. In Section 4.1, we construct the context graph for cellular response data. In Section 4.2, we introduce representation learning methods based on the principle of minimal sufficiency for molecules and their related modalities. In Section 4.3, we demonstrate the theoretical advantages of the proposed method.

4.1. Random Walks on Cellular Context Graph

Node Construction:

We model the interactions of the molecule x with other molecules, cells c, and genes e using the context graph. They are nodes with different features y. Molecular features are vectors obtained using fingerprints [43]. Cell morphology features are vectors derived from CellProfiler [5] applied to Cell Painting microscopy images. Gene expression features are scalar values obtained with the L1000 method [50]. We further rescale the feature spaces to a range between 0 and 1.
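
To make the node construction concrete, the sketch below assembles molecule node features from Morgan fingerprints and rescales continuous profiles to [0, 1]; the 2048-bit, radius-2 fingerprint settings and the per-dimension min-max scaler are assumptions for illustration rather than the authors' exact configuration.

```python
# Minimal sketch of node feature construction (assumed settings, not the released pipeline).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_features(smiles_list, radius=2, n_bits=2048):
    """Binary Morgan fingerprint vectors for molecule nodes."""
    feats = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        feats.append(np.array(list(fp), dtype=np.float32))
    return np.stack(feats)

def min_max_scale(x, eps=1e-8):
    """Rescale each feature dimension to [0, 1], as done for morphology and expression profiles."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo + eps)

mol_feats = morgan_features(["CCO", "c1ccccc1O"])      # molecule node features
morph_feats = min_max_scale(np.random.rand(5, 700))    # e.g., CellProfiler-style profiles
```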

Edge Construction:

We link nodes using various chemical, biological, and computational criteria. For example, molecules can perturb cultured human cells, inducing changes in cell morphology [6] and gene expression [50], thus linking them to cell morphology and gene expression nodes. Genetic perturbations can also affect cells, inducing links between genes and cell morphology [6]. Additionally, we calculate cosine similarity within the same feature space and use biological criteria such as gene-gene interactions [16] to enrich the edge space. Each edge is assigned a weight w with a value between 0 and 1. We construct the context graph with details in Section 5. An example is provided in Figure 2 (a).

Random Walk Path Extraction:

The context graph identifies related cellular response patterns for input molecules in representation learning. Given an input molecule x, we extract its neighborhood through random walks starting from x. Specifically, we employ degree-based transition probabilities [37] and denote the walking path as $\mathcal{P}_x: x \xrightarrow{w_1} v_2 \xrightarrow{w_2} \cdots \xrightarrow{w_{L-1}} v_L$, where $v_2$ is a direct neighbor of x. To quantify the similarity between x and a node $v_i$ ($2 \le i \le L$) on $\mathcal{P}_x$, we compute the cumulative product of edge weights as $\alpha_{v_i}^{\mathcal{P}_x} = \prod_{j=1}^{i-1} w_j$.
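
As an illustration, a weighted random walk and the cumulative path weights might be computed as below; the adjacency-list format and the use of edge weights as (unnormalized) transition probabilities are assumptions made for this sketch.

```python
# Sketch: random walk on the context graph with cumulative edge-weight products
# alpha_{v_i} = prod_{j<i} w_j (Section 4.1). Data structures are illustrative.
import random

def weighted_random_walk(adj, start, length=4):
    """adj: {node: [(neighbor, edge_weight), ...]}. Returns (path, alphas)."""
    path, alphas = [start], []
    node, cum_w = start, 1.0
    for _ in range(length - 1):
        neighbors = adj.get(node, [])
        if not neighbors:
            break
        nbrs, weights = zip(*neighbors)
        idx = random.choices(range(len(nbrs)), weights=weights, k=1)[0]
        node = nbrs[idx]
        cum_w *= weights[idx]          # cumulative similarity between the start molecule and this node
        path.append(node)
        alphas.append(cum_w)
    return path, alphas

adj = {"x1": [("c1", 1.0), ("e2", 0.9)], "c1": [("c3", 0.85)], "e2": [], "c3": []}
print(weighted_random_walk(adj, "x1", length=4))
```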

4.2. Optimization for Representation with Information Bottleneck

The information bottleneck (IB) [53, 1] is an appealing method for defining concise representations with strong predictive power. For molecular representation, we extract minimal sufficient information from the random variable X of molecules. This is achieved by aligning the molecular representations Z with the targets Y, derived from node features along the walk path 𝒫. The IB has two principles based on mutual information (MI): (1) the minimality principle, which minimizes MI between molecules and their latent representations as I(X;Z), and (2) the sufficiency principle, which decodes latent representations to maximally reconstruct feature spaces for variables along the walk path I(Z;Y). Together, these form the optimization objectives:

\min_{p(z|x)} \Big[ -I(Z;Y) + \beta\, I(X;Z) \Big], \quad (1)

where β controls the trade-off between minimality and sufficiency. The exact computation of I(Z;Y) and I(X;Z) is intractable due to the unknown conditional distribution p(y|z) and the marginal p(z). We introduce the variational approximations q(y|z) and q(z) for them, respectively. This results in a lower-bound estimate I^D_LB for the first (decoding) term and an upper-bound estimate I^E_UB for the second (encoding) term [38].

I(Z;Y) \ge \underbrace{\mathbb{E}_{p(z,y)}\big[\log q(y \mid z)\big] + H(Y)}_{I^{D}_{\mathrm{LB}}}, \qquad I(X;Z) \le \underbrace{\mathbb{E}_{p(x)}\big[\mathrm{KL}\big(p(z \mid x) \,\|\, q(z)\big)\big]}_{I^{E}_{\mathrm{UB}}} \quad (2)

H(Y) is the differential entropy. Proofs are in appendix A.1. Together, I^D_LB and I^E_UB upper bound Eq. (1), forming a tractable objective -I^D_LB + I^E_UB to optimize the encoder. For the target Y, the I^D_LB objective requires decoders rather than the encoders typically used in prior work [46]. We use distinct decoders, denoted as q_φ with parameters φ, for the various targets, including molecular fingerprints, gene expression, and cell morphology features.

After ignoring the constant terms, one can formulate the loss function according to Eq. (2) for the molecule sample x, its latent representation z, and the targets $y_v$ from $\mathcal{P}_x$:

\mathcal{L} = -\frac{1}{L} \sum_{v \in \mathcal{P}_x} \alpha_{v}^{\mathcal{P}_x} \log q_{\phi}(y_v \mid z) + \beta\, \mathrm{KL}\big(p_{\theta}(z \mid x) \,\|\, \mathcal{N}(0, I)\big), \quad (3)

where the first term aligns the representation with the other features, and KL is the Kullback–Leibler divergence used for regularization. $\mathcal{N}(0, I)$ is the Gaussian prior. In this formulation, the encoder models a distribution instead of a single representation z, learning the mean and variance $\mu, \sigma \in \mathbb{R}^D$. One may use the reparameterization trick to sample z from the distribution [1]. The decoder then reconstructs $y_v$, the features of the neighboring node v on the context graph.
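
A minimal sketch of the Eq. (3) objective is shown below, assuming a Gaussian encoder that outputs (mu, log_var) and decoders whose reconstruction log-likelihood is approximated by a mean-squared error; the exact likelihood model and weighting are not specified in the text and are assumptions here.

```python
# Sketch of the Eq. (3) loss: weighted reconstruction along the walk path plus a
# KL term toward the standard Gaussian prior. Module and shape choices are illustrative.
import torch
import torch.nn.functional as F

def infoalign_loss(mu, log_var, decoders, targets, alphas, beta=1e-3):
    """mu, log_var: [D] encoder outputs; decoders/targets/alphas are aligned with the walk path."""
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)                 # reparameterization trick
    recon = torch.zeros(())
    for dec, y, a in zip(decoders, targets, alphas):
        recon = recon + a * F.mse_loss(dec(z), y)        # stands in for -log q_phi(y_v | z)
    recon = recon / max(len(targets), 1)
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl

# Toy usage: a length-3 path with a morphology-like and an expression-like target.
decs = [torch.nn.Linear(16, 8), torch.nn.Linear(16, 978)]
loss = infoalign_loss(torch.zeros(16), torch.zeros(16), decs,
                      [torch.zeros(8), torch.zeros(978)], alphas=[1.0, 0.9])
```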

InfoAlign uses multiple decoders for qϕ to align multi-modal features, while prior work relies on encoders with CLIP-like losses to align the latent space [40, 14, 54, 46]. Next, we provide the theoretical benefits of decoder-based alignment alongside the empirical advantages in Section 6.

4.3. Theoretical Motivation for Decoder-based Alignment

InfoNCE [36] is the contrastive loss used for most CLIP-like methods [40, 54]. In this work, we show that the MI lower bound based on InfoAlign is tighter than that based on InfoNCE.

Proposition 4.1.

For the molecular representation Z and target Y (from cell morphology, gene expression, or molecular fingerprints), the encoder-based MI lower bound I^E_LB for InfoNCE can be derived by incorporating K−1 additional samples, denoted as $y_{2:K}$, to build the Monte Carlo estimate $m(\cdot)$ of the partition function [35, 38]:

I^{E}_{\mathrm{LB}} = 1 + \mathbb{E}_{p(z,y)\,p(y_{2:K})}\left[\log \frac{e^{h(z,y)}}{m(z; y, y_{2:K})}\right] - \mathbb{E}_{p(z)\,p(y_{2:K})\,p(y)}\left[\frac{e^{h(z,y)}}{m(z; y, y_{2:K})}\right], \quad (4)

where h(z, y) is a neural-network-parameterized critic for density approximation with the energy-based variational family. The decoder-based lower bound I^D_LB is defined in Eq. (2); we then have that I^D_LB is tighter than I^E_LB, i.e., $I(Z;Y) \ge I^{D}_{\mathrm{LB}}(Z;Y) \ge I^{E}_{\mathrm{LB}}(Z;Y)$.

Proofs are in appendix A.2. The result aligns with empirical observations in previous studies such as DALL-E 2 [41], where a prior model was introduced to improve representations from CLIP [40] before decoding to another modality. In this work, we learn decodable latent representations from molecules to align with different biological features.
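
For comparison, the encoder-based alignment used by the CLIP-style baselines typically optimizes a symmetric InfoNCE loss over paired embeddings, sketched below in a standard form (not the baselines' released code); the temperature value is an assumption.

```python
# Sketch: symmetric InfoNCE over a batch of paired molecule/morphology embeddings,
# the encoder-based objective that Proposition 4.1 compares against.
import torch
import torch.nn.functional as F

def info_nce(z_mol, z_other, temperature=0.07):
    """z_mol, z_other: [B, D] embeddings from two modality encoders (positives on the diagonal)."""
    z_mol = F.normalize(z_mol, dim=-1)
    z_other = F.normalize(z_other, dim=-1)
    logits = z_mol @ z_other.t() / temperature
    labels = torch.arange(z_mol.size(0), device=z_mol.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```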

5. Implementation of Context Graph and Pretraining Setting

Data Source of Context Graph:

We create the context graph based on (1) two Cell Painting datasets [4, 6], containing around 140K molecule perturbations (molecule and cell morphology pairs) and 15K genetic perturbations (gene and cell morphology pairs) across 1.6 billion human cells; (2) Hetionet [16], which captures gene-gene and gene-molecule relationships from millions of biomedical studies; and (3) a dataset reporting differential gene expression values for 978 landmark genes [56] for chemical perturbations (molecule and gene expression pairs) [50].

Node Features:

Different profiling methods provide node features in different ways. Morgan fingerprints [43] are feature vectors extracted from each molecule’s structure, CellProfiler features [5] are computed from the image of each cell and represent cell morphology, and L1000 profiles [50] capture gene expression values on 978 landmark genes from cells treated with a chemical perturbation. Here are two practical considerations for the context graph: (1) Chandrasekaran et al. [6] provided one dataset that measured the cell morphology impacts of perturbing individual genes. The 15K genetic perturbations [6] provide gene-cell morphology pairs but lack corresponding gene expression profiles. Still, we keep the gene nodes from this dataset to account for potential gene-gene interactions and incorporate cell morphology features into them. (2) All 978 landmark genes have expression values linked to the molecules used in [56]. We create new gene expression nodes with 978-dimensional feature vectors that summarize all molecule-gene expression connections for a small-molecule perturbation. This approach efficiently reduces dense connections between landmark genes and molecules. We select the top 1% of gene-molecule expression values as new edges to enrich the context graph’s connectivity. We scale cell morphology and gene expression features to a range of 0 to 1 using a min-max scaler along each dimension.
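
The gene-expression handling above can be illustrated as follows: one 978-dimensional feature vector per perturbed molecule, plus sparsified molecule-gene edges from the largest absolute expression values. Array names and the use of a quantile cutoff are assumptions for this sketch.

```python
# Sketch: summarizing L1000 differential expression into gene-expression node features
# and keeping only the top 1% of |expression| values as molecule-gene edges.
import numpy as np

expr = np.random.randn(1000, 978)   # molecules x 978 landmark genes (differential expression)

# One 978-dimensional gene-expression node per perturbed molecule, scaled to [0, 1].
gene_expr_node_feats = (expr - expr.min(0)) / (expr.max(0) - expr.min(0) + 1e-8)

# Sparsify molecule-gene edges: keep only the top 1% of absolute expression values.
cutoff = np.quantile(np.abs(expr), 0.99)
mol_idx, gene_idx = np.nonzero(np.abs(expr) >= cutoff)
edges = list(zip(mol_idx.tolist(), gene_idx.tolist()))
```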

Edge Weights:

For edges based on chemical perturbations [4, 6], we assign an edge weight of 1. We also compute cosine similarity for nodes in the same feature space (such as two cell morphology/gene expression profiles, or Morgan fingerprints). To avoid noisy edges from these computations, we apply a 0.8 threshold on cosine similarity and additionally enforce 99.5% sparsity by keeping only the most similar edges.
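
A possible implementation of the computational-similarity edges, combining the 0.8 cosine threshold with the 99.5% sparsity cap, is sketched below; the exact tie-handling and the symmetric treatment of pairs are assumptions.

```python
# Sketch: cosine-similarity edges within one feature space (e.g., morphology profiles),
# keeping pairs that pass both the 0.8 threshold and the top-0.5% cutoff.
import numpy as np

def similarity_edges(feats, threshold=0.8, sparsity=0.995):
    """feats: [N, D] node features from a single modality. Returns (i, j, weight) tuples."""
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    iu = np.triu_indices(len(feats), k=1)                     # candidate undirected pairs
    cand = sim[iu]
    keep = max(1, int(round((1.0 - sparsity) * cand.size)))   # top 0.5% most similar pairs
    top_cut = np.sort(cand)[-keep]
    mask = (cand >= threshold) & (cand >= top_cut)
    return [(int(i), int(j), float(s)) for i, j, s in zip(iu[0][mask], iu[1][mask], cand[mask])]

edges = similarity_edges(np.random.rand(200, 64))
```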

Altogether, this results in a context graph of 276,855 nodes (129,592 molecules, 4,533 gene nodes plus 13,795 gene expression nodes, and 128,935 cell morphology nodes) and 366,384 edges.

Encoder and Decoder:

We use the Graph Isomorphism Network (GIN) [58] as the molecule encoder. All molecules on the context graph are used to pretrain the encoder. Because we extract feature vectors as the decoding targets in different modalities, we can efficiently use Multi-Layer Perceptrons (MLPs) as modality decoders. In each training batch, random walks start from the molecule node to extract the walk path. Then, decoders are pretrained to reconstruct the corresponding node features from nodes along the path. More details are in appendix B.
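
The encoder/decoder layout could look roughly like the sketch below: a GIN molecule encoder (here via PyTorch Geometric) whose pooled readout parameterizes the Gaussian posterior, and one MLP decoder per target modality. Hidden sizes, depths, and the two-layer decoders are assumptions rather than the reported architecture.

```python
# Sketch: GIN encoder with a sum readout producing (mu, log_var), plus per-modality MLP decoders.
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv, global_add_pool

class GINEncoder(nn.Module):
    def __init__(self, in_dim, hidden=300, latent=256, num_layers=5):
        super().__init__()
        self.convs = nn.ModuleList()
        for i in range(num_layers):
            mlp = nn.Sequential(nn.Linear(in_dim if i == 0 else hidden, hidden),
                                nn.ReLU(), nn.Linear(hidden, hidden))
            self.convs.append(GINConv(mlp))
        self.mu = nn.Linear(hidden, latent)
        self.log_var = nn.Linear(hidden, latent)

    def forward(self, x, edge_index, batch):
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index))
        h = global_add_pool(x, batch)           # readout over atoms -> one vector per molecule
        return self.mu(h), self.log_var(h)

def make_decoder(latent, target_dim, hidden=512):
    """One MLP decoder per modality (fingerprint, cell morphology, gene expression)."""
    return nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(), nn.Linear(hidden, target_dim))
```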

6. Experiments

We focus on three research questions (RQs) regarding InfoAlign’s representation for molecular property prediction, molecule-morphology matching, and performance analysis.

6.1. RQ1: Molecular Property Prediction

6.1.1. Experimental Setting

Dataset and Evaluation:

We select datasets for important tasks, including activity classification for various assays in ChEMBL2K [13] and Broad6K [33], drug toxicity classification using ToxCast [42], and absorption, distribution, metabolism, and excretion (ADME) regression using Biogen3K [11]. The dataset statistics are in Table 1, covering 685 tasks. We apply scaffold splitting for all datasets, following [19] for ToxCast and a 0.6:0.15:0.25 train/validation/test ratio for the other datasets. We use the area under the curve (AUC) for classification and mean absolute error (MAE) for regression. Means and standard deviations are reported over ten runs.
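
A common recipe for the scaffold split described above groups molecules by Bemis-Murcko scaffold and fills the train/validation/test sets greedily at the 0.6:0.15:0.25 ratio; the sketch below follows that recipe under stated assumptions and may differ from the authors' exact splitter.

```python
# Sketch: Bemis-Murcko scaffold splitting at a 0.6/0.15/0.25 ratio (greedy fill, largest groups first).
from collections import defaultdict
from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles

def scaffold_split(smiles_list, frac=(0.6, 0.15, 0.25)):
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[MurckoScaffoldSmiles(smiles=smi, includeChirality=False)].append(idx)
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles_list)
    train_cut, valid_cut = frac[0] * n, (frac[0] + frac[1]) * n
    train, valid, test = [], [], []
    for grp in ordered:
        if len(train) < train_cut:
            train.extend(grp)
        elif len(train) + len(valid) < valid_cut:
            valid.extend(grp)
        else:
            test.extend(grp)
    return train, valid, test

# Per-task metrics would then be, e.g., sklearn's roc_auc_score (classification) or MAE (regression).
```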

Table 1:

Datasets and task information. Classf. denotes classification and Regr. denotes regression.

Dataset Type # Task # Molecules # Atoms (Avg./Max) # Edges (Avg./Max) # Available Cell Morphology # Available Gene Expressions

ChEMBL2K Classf. 41 2355 23.7/61 25.6/68 2353 631
Broad6K Classf. 32 6567 34.1/74 36.8/82 2673 1138
ToxCast Classf. 617 8576 18.8/124 19.3/134 N.A. N.A.
Biogen3K Regr. 6 3521 23.2/78 25.3/84 N.A. N.A.
Baseline:

We include 27 baselines across six categories: (1) three molecular fingerprint (FP)-based methods [43]; (2) eleven pretrained GNNs; (3) four pretrained chemical language models; (4,5) six methods based on cell morphology and gene expression values from cells treated with each molecule; (6) CLOOME [46] and InfoCORE [54] for multi-modal alignment using structure, morphology, and gene expression data. We use MLPs, Random Forests (RF), and Gaussian Processes (GP) for methods in categories (1,4,5). We fine-tune MLPs on various representation learning approaches for predicting molecular properties. Setting details and all results are in appendices C.1 and C.3.

6.1.2. Results and Analysis

We present results across various assays in Tables 2 and 3 and Figure 3. Key observations include:

Table 2:

Results on ChEMBL2K and Broad6K. We report average AUC (Avg.), as well as the percentage of tasks achieving AUC above 80%, 85%, and 90%. We highlight the best and second-best mean overall, as well as the best mean in each category.

Dataset ChEMBL2k (AUC ↑) Broad6k (AUC ↑)
(# Molecule / # Task) (2355 /41) (6567/32)
Method Avg. >80% >85% >90% Avg. >80% >85% >90%

Morgan Fingerprints
MLP 76.8±2.2 48.8±3.9 34.6±6.3 21.9±5.7 63.3±0.3 6.3±0.0 4.4±1.7 3.1±0.0
RF 54.7±0.7 0.0±0.0 0.0±0.0 0.0±0.0 55.5±0.1 0.0±0.0 0.0±0.0 0.0±0.0
GP 51.0±0.0 0.0±0.0 0.0±0.0 0.0±0.0 50.6±0.0 0.0±0.0 0.0±0.0 0.0±0.0

Pretrained GNN
AttrMask [18] 73.9±0.5 46.8±2.7 31.2±4.4 14.6±1.7 59.8±0.2 3.1±0.0 3.1±0.0 3.1±0.0
ContextPred [18] 77.0±0.5 55.1±1.3 34.1±4.6 14.6±1.7 60.0±0.2 7.5±1.7 3.1±0.0 3.1±0.0
EdgePred [18] 75.6±0.5 54.2±4.0 34.6±7.2 12.2±2.4 59.9±0.2 3.1±0.0 3.1±0.0 3.1±0.0
GraphCL [60] 75.6±1.6 46.8±7.6 32.2±6.8 18.0±3.7 67.2±0.5 15.6±3.1 3.1±0.0 3.1±0.0
GROVER [44] 73.3±1.4 38.5±2.0 22.4±3.6 14.6±2.4 66.2±0.1 15.6±0.0 3.8±1.4 3.1±0.0
JOAO [60] 75.1±1.0 47.8±5.1 33.7±2.0 19.0±3.2 67.3±0.4 12.5±0.0 3.8±1.4 3.1±0.0
MGSSL [62] 75.1±1.1 39.0±4.6 29.3±3.0 10.3±3.2 66.9±0.5 13.8±2.8 3.1±0.0 3.1±0.0
GraphLoG [59] 73.5±0.7 41.9±2.0 29.3±3.4 15.6±2.8 62.9±0.4 4.4±1.7 0.0±0.0 0.0±0.0
GraphMAE [17] 74.7±0.1 33.2±1.3 27.8±1.3 12.2±1.7 66.8±0.3 14.4±1.7 3.1±0.0 3.1±0.0
DSLA [23] 69.3±1.0 23.9±4.7 14.6±5.5 6.8±1.1 63.3±0.3 6.3±0.0 3.1±0.0 3.1±0.0
UniMol [63] 76.8±0.4 46.8±2.0 33.7±1.1 24.9±2.0 65.4±0.1 7.5±1.7 3.1±0.0 3.1±0.0

Pretrained Chemical Language Models
Roberta [31] 74.7±1.9 46.3±3.4 35.1±4.4 22.9±1.3 59.8±0.7 5.0±1.7 3.1±0.0 3.1±0.0
GPT2 [31] 71.0±3.4 31.2±11.2 20.0±9.4 7.3±6.9 60.6±0.3 7.5±1.7 1.9±1.7 1.9±1.7
MolT5 [10] 69.9±0.8 32.2±2.0 21.0±4.1 8.8±1.3 56.4±0.8 3.8±1.4 2.5±1.4 2.5±1.4
ChemGPT [12] 65.0±1.1 16.1±2.8 11.2±3.3 5.4±1.1 55.1±0.9 3.1±0.0 3.1±0.0 1.3±1.7

Cell Morphology
MLP 64.3±2.4 15.6±6.6 8.3±3.7 4.9±3.9 51.9±1.0 0.0±0.0 0.0±0.0 0.0±0.0
RF 55.9±0.7 3.9±1.3 3.9±1.3 2.4±0.0 55.3±0.1 0.0±0.0 0.0±0.0 0.0±0.0
GP 50.1±0.0 0.0±0.0 0.0±0.0 0.0±0.0 54.7±0.0 0.0±0.0 0.0±0.0 0.0±0.0

Gene Expression
MLP 56.1±1.1 5.1±1.4 3.4±1.3 3.4±1.3 56.9±1.4 1.9±1.7 1.9±1.7 1.9±1.7
RF 52.8±0.3 0.0±0.0 0.0±0.0 0.0±0.0 55.2±0.2 0.0±0.0 0.0±0.0 0.0±0.0
GP Run out of time 50.1±0.0 0.0±0.0 0.0±0.0 0.0±0.0

Multi-modal Alignment
CLOOME 66.7±1.8 26.8±4.6 16.1±3.7 10.7±5.1 61.7±0.4 3.1±0.0 3.1±0.0 0.0±0.0
InfoCORE (GE) 79.3±0.9 62.4±2.8 46.3±3.0 30.3±2.2 60.2±0.2 3.1±0.0 0.0±0.0 0.0±0.0
InfoCORE (CP) 73.8±2.0 37.6±9.2 26.3±4.7 10.7±4.1 61.1±0.2 6.3±0.0 3.1±0.0 0.0±0.0
InfoAlign (Ours) 81.3±0.6 66.3±2.7 49.3±2.7 35.1±3.7 70.0±0.1 18.8±2.2 3.1±0.0 3.1±0.0
Table 3:

Results on ToxCast and Biogen3K. We report the average AUC and the percentage of AUC above 80% on ToxCast, and regression MAE (scaled by ×100) for Biogen3K. We highlight the best and second-best mean overall, as well as the best mean in each category.

Dataset ToxCast (AUC ↑) Biogen3K (MAE ×100 ↓ )
(# Molecule / # Task) (8576/617) (3521/6)
Method Avg. >80% Avg. hPPB rPPB RLM HLM ER Solubility

Morgan Fingerprints
MLP 57.6±1.0 1.6±0.3 66.2±2.4 66.1±2.6 56.8±2.3 56.5±4.2 74.6±6.2 73.7±7.3 69.5±3.0
RF 52.3±0.1 0.2±0.1 52.8±0.2 44.2±0.1 44.2±0.1 42.0±0.2 67.7±0.7 66.9±0.9 51.6±0.1
GP Run out of Time 60.0±0.0 51.3±0.0 59.5±0.0 49.7±0.0 68.8±0.0 69.3±0.0 61.6±0.0

Pretrained GNN
AttrMask [18] 63.1±0.8 3.2±1.2 67.3±0.3 82.4±1.1 49.8±0.7 51.7±1.0 57.9±0.6 62.6±0.5 99.1±1.2
ContextPred [18] 63.0±0.6 3.3±1.3 68.5±0.9 85.0±7.9 49.7±0.4 55.1±2.7 61.4±1.8 63.1±0.5 96.5±3.7
EdgePred [18] 63.5±1.1 4.8±3.0 67.8±0.9 81.2±10.2 48.0±0.5 53.5±2.8 62.2±1.8 62.9±0.7 99.1±6.9
GraphCL [60] 52.2±0.2 0.5±0.3 53.9±0.6 43.8±0.3 45.4±0.6 40.6±0.5 76.7±1.0 67.1±2.2 49.6±0.3
GROVER [44] 53.1±0.4 0.5±0.1 54.9±1.6 44.5±0.4 46.5±0.7 41.7±0.6 73.2±5.7 71.0±4.3 52.6±0.3
JOAO [61] 52.3±0.2 0.4±0.1 55.0±0.8 44.5±0.5 47.6±0.5 40.6±0.2 74.3±2.8 71.5±2.6 51.4±0.6
MGSSL [62] 64.2±0.2 4.0±0.4 53.2±0.3 44.8±0.6 49.7±0.3 41.5±0.2 65.6±1.8 64.6±0.5 52.7±0.5
GraphLoG [59] 58.6±0.4 2.5±0.3 56.9±0.4 49.3±0.3 54.8±0.5 42.6±0.3 66.8±1.7 69.0±1.3 58.8±0.5
GraphMAE [17] 53.3±0.1 0.6±0.1 52.8±0.8 43.3±0.9 51.2±0.8 40.9±0.3 64.4±2.7 65.9±3.8 50.9±1.4
DSLA [23] 57.8±0.5 0.7±0.1 57.9±0.7 50.4±0.7 53.6±1.7 43.3±0.9 68.6±1.2 70.8±2.0 60.9±0.6
UniMol [63] 64.6±0.2 4.8±1.0 55.8±2.8 50.1±5.2 49.9±5.6 43.6±1.1 65.4±4.9 65.8±1.2 59.9±6.6

Pretrained Chemical Language Models
Roberta [31] 64.2±0.8 3.1±1.8 69.0±2.6 71.4±14.5 65.1±19.2 63.7±24.6 67.5±5.2 69.9±4.9 76.7±13.2
GPT2 [31] 61.5±1.1 2.4±0.6 74.0±8.5 65.4±12.9 73.1±20.8 54.1±12.9 83.2±21.5 86.1±19.8 81.8±25.5
MolT5 [10] 64.7±0.9 3.6±1.1 65.1±0.5 76.7±2.1 55.9±1.1 49.2±1.0 70.3±0.8 73.1±1.0 65.3±1.7
ChemGPT [12] Token Error 75.7±8.5 59.5±7.3 88.8±32.3 76.1±11.8 84.0±20.6 77.2±8.5 68.6±7.1

Multi-modal Alignment
CLOOME 54.2±0.9 0.9±0.2 64.3±0.4 65.2±1.5 56.9±0.8 44.2±0.8 70.7±0.4 73.6±0.8 75.0±2.1
InfoCORE (GE) 65.3±0.2 5.4±1.7 69.9±1.2 79.9±3.6 51.6±1.8 51.3±2.1 78.6±0.3 77.8±1.9 80.3±0.9
InfoCORE (CP) 62.4±0.4 1.3±0.5 71.0±0.6 74.5±4.9 53.5±0.7 53.6±2.1 80.8±1.5 79.4±3.4 84.4±1.0
InfoAlign (Ours) 66.4±1.1 6.6±1.6 49.4±0.2 39.7±0.4 39.2±0.3 40.5±0.6 66.7±1.7 62.0±1.5 48.4±0.6
(1). Molecular structures are superior to cell morphology and gene expression features for predicting various molecular assays.

This is likely because the datasets and tasks we selected fundamentally involve predicting the binding affinity of a molecule to a protein [13]; furthermore, in these datasets, molecules with activity in a given assay tend to have highly related structures, rather than representing two or more structurally distinct classes of molecules with activity; together this implies that molecular structure alone will tend to yield strong results. When comparing the three popular structure-based representation approaches, no single method outperforms the others across all four datasets. Pretrained GNNs generally perform better than fingerprint-based methods and pretrained chemical language models, thanks to recent advancements. However, continued efforts in universal structural representation are still necessary.

(2). Cell morphology and gene expression features may complement molecular structures, yielding more generalizable representations.

As shown in Figure 3, cell morphology and gene expression outperform molecular structure in approximately 20% and 10% of tasks on the ChEMBL2K and Broad6K datasets, respectively. This suggests that incorporating cell context into representation learning would be beneficial. That said, existing multi-modal baselines (InfoCORE, CLOOME) only outperform molecular structure-based approaches on ChEMBL2K and ToxCast, as they do not construct molecular representations holistically by using all cell-related modalities.

(3). InfoAlign achieves the best average performance on all tasks compared to 27 baselines.

The improvements from InfoAlign range from 2.5% to 6.4% on average across four datasets compared to the second-best method. These gains are more significant when using the 80% AUC threshold on classification datasets. While InfoCORE (GE) performs best among baselines on the ChEMBL2K and ToxCast datasets, it struggles to align molecular representations with more than two modalities and sometimes leads to negative transfer, as seen in Broad6K and Biogen3K.

6.2. RQ2: Molecule-Morphology Cross-Modal Matching

6.2.1. Experimental Setting

We evaluated zero-shot matching performance of various methods for predicting cell morphology from query molecules, including baselines CLOOME and InfoCORE (CP) with pretrained encoders. For retrieval, we calculate the cosine similarity between the molecular representation and all cell morphology candidates, rank these candidates, and compute Normalized Discounted Cumulative Gain (NDCG) and HIT scores for the top-1 and top-10 candidates as metrics. To ensure a fair evaluation of zero-shot matching, we exclude the cell morphology data for molecules that were used to train the baseline encoders. Consequently, we have 80 molecule-cell morphology pairs from ChEMBL2K and 196 pairs from Broad6K. All the morphology data are used as candidates for matching.

For InfoAlign, we use the pretrained decoder from Section 5 to extract the morphology features of the encoded molecule and then calculate the likelihood of these decoded features against the candidate morphology data. We then rank the candidates in the decoding space based on their likelihood scores.
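
The retrieval protocol can be summarized with the short sketch below: cosine-similarity ranking for the encoder-based baselines and single-relevant-item NDCG@k / HIT@k scoring. The helper names are illustrative, and InfoAlign's likelihood-based ranking would replace the cosine scores with decoder likelihoods.

```python
# Sketch: zero-shot molecule-to-morphology retrieval with NDCG@k and HIT@k
# (one correct morphology profile per query molecule, as in the paired evaluation above).
import numpy as np

def rank_by_cosine(query_emb, candidate_embs):
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    c = candidate_embs / (np.linalg.norm(candidate_embs, axis=1, keepdims=True) + 1e-8)
    return np.argsort(-c @ q)                                  # candidate indices, best first

def ndcg_at_k(rank_of_correct, k):
    """Single-relevant-item NDCG: 1/log2(rank+2) if the match is in the top-k, else 0 (rank is 0-based)."""
    return 1.0 / np.log2(rank_of_correct + 2) if rank_of_correct < k else 0.0

def hit_at_k(rank_of_correct, k):
    return float(rank_of_correct < k)

order = rank_by_cosine(np.random.rand(64), np.random.rand(500, 64))
rank = int(np.where(order == 123)[0][0])                       # position of the correct profile
print(ndcg_at_k(rank, 10), hit_at_k(rank, 10))
```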

6.2.2. Results and Analysis

Cross-modal matching results are in Table 4. InfoAlign outperforms InfoCORE on ChEMBL2K and is comparable to InfoCORE on Broad6K, with both surpassing CLOOME. Additionally, we visualized the distribution of ranking positions for correct matching pairs to compare overall retrieval performance. The results show that InfoAlign and InfoCORE perform similarly, while CLOOME consistently ranks correct pairs lower.

Table 4:

Retrieval results on ChEMBL2K (top) and Broad6K (bottom): Left tables display ranking metrics for top candidates. Right figures visualize the distribution of rankings for the correct matching.

ChEMBL2K NDCG % (↑) HIT % (↑)

top-1 top-10 top-1 top-10

CLOOME 0 2.0 0 6.3
InfoCORE 0 4.5 0 11.3
InfoAlign 1.3 5.7 1.3 12.5

Broad6K NDCG % (↑) HIT % (↑)

top-1 top-10 top-1 top-10

CLOOME 0.5 0.9 0.5 1.5
InfoCORE 1.0 2.5 1.0 4.6
InfoAlign 0.5 2.3 0.5 5.1

6.3. RQ3: Performance Analysis

6.3.1. Ablation Studies

We perform ablation studies on Eq. (3) by pretraining encoders with different targets removed: (1) molecule-related, (2) cell morphology-related, and (3) gene expression-related features. The results in Table 5 cover all datasets. We observe that both cell morphology and gene expression features are crucial for achieving the best performance. Different biological targets have varying impacts across datasets: molecular structure has more influence on Broad6K and Biogen3K, while gene expression is more important for ChEMBL2K and ToxCast.

Table 5:

Ablation studies on the pretraining loss. Different node features are removed from the context graph to assess their impact on downstream tasks. Avg. AUC is reported.

ChEMBL2K (AUC ↑) Broad6K (AUC ↑) ToxCast (AUC ↑) Biogen3K (MAE ↓ ×100)

Default as Eq. (3) 81.3 ± 0.6 70.0 ± 0.1 66.4 ± 1.1 49.4 ± 0.2
w/o Cell Morphology 80.7 ± 0.6 68.6 ± 0.1 65.5 ± 1.1 51.7 ± 1.1
w/o Gene Expressions 78.3 ± 0.5 68.6 ± 0.2 64.7 ± 1.0 50.3 ± 0.5
w/o Molecular Features 79.1 ± 0.2 67.1 ± 0.4 65.8 ± 2.3 51.7 ± 0.6

6.3.2. Hyperparameter Analysis

Lastly, we analyze the hyperparameters: the strength β of the regularization toward the prior Gaussian distribution and the length L of the random walk paths. Results are presented in Figure 4. We observe a trade-off between the principles of minimality and sufficiency in Figure 4a: a too-high β value (minimal information) makes it challenging for the representation to be sufficiently expressive for molecular, gene expression, and cell morphology features, potentially degrading downstream performance. Conversely, a too-low β value weakens minimality and may impair generalization. The convergence of the pretraining loss can serve as a good indicator to balance these aspects. For the hyperparameter L, we observe in Figure 4b that downstream performance on ChEMBL2K is relatively robust across a wide range of walk lengths.

Figure 4:

Analysis on the hyperparameters: strength of β and random walk length L. AUC is computed on the test set of ChEMBL2K.

7. Conclusion

In this work, we proposed learning molecular representations in a cell context with three modalities: molecular structure, gene expression, and cell morphology. We introduced the information bottleneck approach, InfoAlign, using a molecular graph encoder and multiple MLP decoders. InfoAlign learned minimal sufficient molecular representations by reconstructing features along random walk paths on a cellular context graph. This context graph incorporated molecules, cell morphology, and gene expression information defined in scalar or vector spaces to construct nodes, and used various chemical, biological, and computational criteria to define their weighted edges. We demonstrated the theoretical and empirical advantages of the proposed method. InfoAlign outperformed other representation learning methods in various molecular property prediction and zero-shot molecule-morphology matching tasks.

Acknowledgments

This study was supported by National Institutes of Health (R35 GM122547 to AEC) and an internship funded by the Massachusetts Life Sciences Center (to GL).

A. Proof Details

A.1. Proof of Eq. (2)

For the input, latent, and target variables X, Z, and Y, the exact computation of the mutual information (MI) I(Z;Y) and I(X;Z) is intractable. For the molecule x, its latent representation z, and any biological target y from cellular responses, we introduce the variational approximation q(y|z) to obtain a lower bound on I(Z;Y):

I(Z;Y) = \mathbb{E}_{p(z,y)}\left[\log \frac{p(z,y)\, q(y \mid z)}{p(y)\, p(z)\, q(y \mid z)}\right] = \mathbb{E}_{p(z,y)}\left[\log \frac{q(y \mid z)}{p(y)}\right] + \mathbb{E}_{p(z)}\big[\mathrm{KL}\big(p(y \mid z) \,\|\, q(y \mid z)\big)\big] \ge \underbrace{\mathbb{E}_{p(z,y)}\big[\log q(y \mid z)\big] + H(Y)}_{I^{D}_{\mathrm{LB}}} \quad (5)

This is because $\mathrm{KL}\big(p(y \mid z) \,\|\, q(y \mid z)\big) \ge 0$. We introduce the variational approximation q(z) for an upper bound on I(X;Z):

I(X;Z) = \mathbb{E}_{p(x,z)}\left[\log \frac{p(x,z)\, q(z)}{p(x)\, p(z)\, q(z)}\right] = \mathbb{E}_{p(x,z)}\left[\log \frac{p(z \mid x)}{q(z)}\right] - \mathrm{KL}\big(p(z) \,\|\, q(z)\big) \le \underbrace{\mathbb{E}_{p(x)}\big[\mathrm{KL}\big(p(z \mid x) \,\|\, q(z)\big)\big]}_{I^{E}_{\mathrm{UB}}} \quad (6)

A.2. Proof of proposition 4.1

For the molecule x, its latent representation z, and any biological target y from cellular responses, we use the neural-network-parameterized critic h(z, y) with the energy-based variational family for density approximation [38]:

q(y \mid z) = \frac{p(y)\, e^{h(z,y)}}{\mathbb{E}_{p(y)}\big[e^{h(z,y)}\big]}.

Thus, we can rewrite I^D_LB based on the unnormalized distribution q(y|z):

I^{D}_{\mathrm{LB}} = \mathbb{E}_{p(z,y)}\big[\log q(y \mid z)\big] + H(Y) = \mathbb{E}_{p(z,y)}\left[\log \frac{p(y)\, e^{h(z,y)}}{\mathbb{E}_{p(y)}\big[e^{h(z,y)}\big]}\right] - \mathbb{E}_{p(y)}\big[\log p(y)\big] = \mathbb{E}_{p(z,y)}\big[h(z,y)\big] - \mathbb{E}_{p(z,y)}\Big[\log \mathbb{E}_{p(y)}\big[e^{h(z,y)}\big]\Big] = \mathbb{E}_{p(z,y)}\big[h(z,y)\big] - \mathbb{E}_{p(z)}\big[\log \tilde{Z}(z)\big], \quad (7)

where $\tilde{Z}(z) = \mathbb{E}_{p(y)}\big[e^{h(z,y)}\big]$ is the partition function.

Note that the log partition function is intractable. Poole et al. [38] introduced a new variational parameter $a(\cdot)$ to upper bound $\tilde{Z}(z)$, deriving a tractable lower bound for I^D_LB:

I^{D}_{\mathrm{LB}} \ge \mathbb{E}_{p(z,y)}\big[h(z,y)\big] - \mathbb{E}_{p(z)}\left[\frac{\mathbb{E}_{p(y)}\big[e^{h(z,y)}\big]}{a(z)} + \log a(z) - 1\right]. \quad (8)

This is because, for any $x, a > 0$, the inequality $\log(x) \le \frac{x}{a} + \log(a) - 1$ holds, which can be applied to the second term of Eq. (7). The I_NWJ bound [35] is a special case where $a(z) = e$:

I_{\mathrm{NWJ}} \triangleq \mathbb{E}_{p(z,y)}\big[h(z,y)\big] - \mathbb{E}_{p(z)}\left[\frac{\mathbb{E}_{p(y)}\big[e^{h(z,y)}\big]}{e} + \log(e) - 1\right] = \mathbb{E}_{p(z,y)}\big[h(z,y)\big] - e^{-1}\, \mathbb{E}_{p(z)}\big[\tilde{Z}(z)\big]. \quad (9)

I_NWJ has high variance due to the estimation of the upper bound on the log partition function. Based on I_NWJ and multiple examples, one can derive the encoder-based lower bound I^E_LB for InfoNCE.

Suppose there are K−1 additional examples, independently and identically sampled and denoted as $y_{2:K}$, and the critic with parameters $a(\cdot)$ is configured as $1 + \log \frac{e^{h(z,y)}}{a(z; y, y_{2:K})}$. Then, we can rewrite I_NWJ in its multi-sample version:

I_{\mathrm{NWJ}} = \mathbb{E}_{p(z,y)\,p(y_{2:K})}\left[1 + \log \frac{e^{h(z,y)}}{a(z; y, y_{2:K})}\right] - e^{-1}\, \mathbb{E}_{p(y)\,p(z)\,p(y_{2:K})}\left[e^{\,1 + \log \frac{e^{h(z,y)}}{a(z; y, y_{2:K})}}\right] = 1 + \mathbb{E}_{p(z,y)\,p(y_{2:K})}\left[\log \frac{e^{h(z,y)}}{a(z; y, y_{2:K})}\right] - \mathbb{E}_{p(y)\,p(y_{2:K})\,p(z)}\left[\frac{e^{h(z,y)}}{a(z; y, y_{2:K})}\right]. \quad (10)

Multiple samples can be utilized in the Monte Carlo estimate $m(z; y, y_{2:K})$ of the upper bound on the partition function $a(z; y, y_{2:K})$:

a(z; y, y_{2:K}) = m(z; y, y_{2:K}) = \frac{1}{K}\left(e^{h(z,y)} + \sum_{i=2}^{K} e^{h(z, y_i)}\right),

where K−1 independent samples are drawn from $\prod_i p(y_i)$ and one sample from p(z, y) for the term $\mathbb{E}_{p(z,y)\,p(y_{2:K})}[\cdot]$, or K samples from $\prod_{i=1}^{K} p(y_i)$ (we set $y_1 = y$) together with a p(z) sample for the $\mathbb{E}_{p(y)\,p(y_{2:K})\,p(z)}[\cdot]$ term. Therefore, we can derive $I^{E}_{\mathrm{LB}} \triangleq I_{\mathrm{NCE}}$:

I^{E}_{\mathrm{LB}} \triangleq I_{\mathrm{NCE}} = 1 + \mathbb{E}_{p(z,y)\,p(y_{2:K})}\left[\log \frac{e^{h(z,y)}}{m(z; y, y_{2:K})}\right] - \mathbb{E}_{p(y)\,p(y_{2:K})\,p(z)}\left[\frac{e^{h(z,y)}}{m(z; y, y_{2:K})}\right] = 1 + \mathbb{E}_{p(z,y)\,p(y_{2:K})}\left[\log \frac{e^{h(z,y)}}{\frac{1}{K}\sum_{i=1}^{K} e^{h(z,y_i)}}\right] - \mathbb{E}_{p(y)\,p(y_{2:K})\,p(z)}\left[\frac{e^{h(z,y)}}{\frac{1}{K}\sum_{i=1}^{K} e^{h(z,y_i)}}\right] = \mathbb{E}_{p(z,y)}\big[h(z,y)\big] - \mathbb{E}_{p(z)}\left[\log \frac{1}{K}\sum_{i=1}^{K} e^{h(z,y_i)}\right]. \quad (11)

Note that for $\mathbb{E}_{p(y)\,p(y_{2:K})\,p(z)}[\cdot]$, we average the bound over the K replicates as well to ensure that the last term in Eq. (10) equals the constant 1. Now, I^E_LB (i.e., I_NCE) is upper bounded by log K, rather than by $a(\cdot)$. Hence, the difference between I^D_LB and I^E_LB is

I^{D}_{\mathrm{LB}} - I^{E}_{\mathrm{LB}} = \mathbb{E}_{p(z)}\left[\log \frac{1}{K}\sum_{i=1}^{K} e^{h(z,y_i)}\right] - \mathbb{E}_{p(z)}\big[\log \tilde{Z}(z)\big] \ge 0. \quad (12)

When K is sufficiently large to estimate the partition function, the left term approaches $\mathbb{E}_{p(z)}\big[\log \mathbb{E}_{p(y)}[e^{h(z,y)}]\big]$, indicating that $I^{D}_{\mathrm{LB}} - I^{E}_{\mathrm{LB}} = 0$. Since I_NCE is upper bounded by log K [36], smaller values of K may result in a less tight I^E_LB, so $I^{D}_{\mathrm{LB}} - I^{E}_{\mathrm{LB}} \ge 0$ always holds. In particular, $I(Z;Y) > \log K$ implies that the bound I^E_LB will be loose.

B. Context Graph Details

B.1. Edge Construction

Edges represent similarity relationships between molecules, genes, and cells. According to the chemical or biological criteria, we have the following types of edges:

  1. Molecule-Cell Morphology Edges: These edges are introduced through molecule perturbation experiments from the Cell Painting datasets created by [3] and the JUMP dataset [6]. They link molecule nodes with cell morphology nodes. We use an edge weight of 1 for all these edges.

  2. Gene-Cell Morphology Edges: These edges are introduced by genetic perturbations from the JUMP dataset [6]. The perturbations are based on either gene overexpression (ORF) or gene knockout techniques (CRISPR). They link gene nodes and cell morphology nodes. However, the genes introduced by the genetic perturbations lack gene expression profiling from [50] as node features. We therefore did not implement gene-cell morphology edges from [6], due to the absence of differential gene expression profiling values [50]. Instead, we merged the gene nodes from [6] with their linked cell morphology nodes, creating single nodes. This approach enables a more efficient context graph, incorporating some gene nodes with cell morphology features.

  3. Molecule-Gene Edges: These edges could represent molecule-gene binding and regulation relationships, linking molecules to genes [16]. Some links can be sourced from [16], and we also retrieve gene-molecule links from [56] by selecting the top 5% absolute differential expression values.

  4. Gene-Gene Edges: These edges denote the relationships of gene-gene covariance and interaction and we use the links from [16].

Figure 5:

From the initial idea in Section 4 to the practical implementation of the context graph, we first display relations between molecules and all the landmark genes from [56] for the X1-E3 and X3-E2 relationships. E3 and E2 are landmark genes involved in small molecule perturbations and cell morphology perturbation; we display them separately for clarity. Next, we merge all landmark genes into new gene expression nodes and integrate genes from genetic perturbations in the JUMP dataset [6] with cell morphology features. Practical considerations are detailed in Section 5 and appendix B.

We enrich the edges in the context graph by incorporating computational-similarity edges, where cosine similarity is computed among nodes having the same type and feature space. We note that the cell morphology features from [3] and [6] have different dimensions, since the latter applied batch correction techniques [2] to the CellProfiler features [5]. Thus, we cannot compute the similarity between these two subsets of cell morphology nodes. We use (1) a 0.8 similarity threshold and (2) a minimal sparsity of 99.5%, selecting the top 0.5% most similar edges, to avoid excessive noise in the computational-similarity edges.

B.2. Dataset Sources of Nodes

Here are the datasets we used to create different types of nodes on the context graph:

Figure 6:

An overview of the representation’s predictive performance on all 41 bioactivity prediction tasks in ChEMBL2K. Results for molecular structure are obtained from the best method ContextPred. Results for cell morphology and gene expression come from the best method based on MLPs.

  • Molecule nodes: Molecular nodes are sourced from two Cell Painting datasets (one by Bray et al. [4] and the other the recently released JUMP dataset [6]) and a third source from Wang et al. [56], which is used to study adverse drug reactions.

  • Gene nodes: Gene nodes are from the landmark genes used by Wang et al. [56] in creating the LINCS L1000 profiling of drugs. Other gene nodes come from genetic perturbations in the JUMP dataset [6]. The gene nodes from [6] have cell morphology features as described in appendix B.1. The landmark gene nodes from [56] have scalar gene expression profiles, but these values are updated in the new gene expression nodes.

  • Cell morphology nodes: Cell nodes are sourced from the two cell painting datasets [3, 6].

  • Gene expression nodes: Based on landmark genes from [56], each gene expression node summarizes, as a feature vector, all gene expression profiles induced by a small molecule perturbation. Since Wang et al. [56] measured the same landmark genes for a set of molecules, we populate the new gene expression nodes with feature vectors over all these landmark genes. This approach efficiently constructs decoding targets from molecules to gene expression profiles and prevents redundant gene-molecule connections.

We present an example of the cellular context graph in Figure 5.

C. Experiment Details

C.1. Prediction Datasets

All experiments were run on a single 32 GB V100 GPU. Prediction dataset statistics are in Table 1:

  • ChEMBL2K [13]: The dataset is a subset of the ChEMBL dataset [13] that overlaps with the JUMP CP datasets [6]. We determined activity using the “activity_comment” field provided by ChEMBL; when it was unavailable, we labeled compounds with pChEMBL > 6.5 as active. We exclude all molecules in the dataset from the pretraining set to avoid data leakage. There are a total of 41 tasks related to protein binding affinity, which are converted to binary activity values. We filter the dataset to ensure that each task has at least one positive and five negative examples.

  • Broad6K [33]: The original version provided by Moshkov et al. [33] is a collection of 16,170 molecules tested in 270 assays, resulting in a total of 585,439 readouts. However, there are a large number of missing values, with 153 assays having a missing value percentage above 99%. To mitigate bias in the conclusions, we extract subsets where the percentage is less than 50%.

  • ToxCast [42]: The toxicology data is collected from the “Toxicology in the 21st Century” initiative, widely utilized in many graph machine learning models [19]. The dataset comprises 8,576 molecules and 617 binary classification tasks.

  • Biogen3K [11]: The dataset includes properties that describe the disposition of a drug in the body, including absorption, distribution, metabolism, and excretion (ADME). It is collected from 120 Biogen datasets across six ADME in vitro endpoints over 20 time points spanning about 2 years. The endpoints include human liver microsomal (HLM) stability reported as intrinsic clearance (Clint, mL/min/kg), MDR1-MDCK efflux ratio (ER), Solubility at pH 6.8 (μg/mL), rat liver microsomal (RLM) stability reported as intrinsic clearance (Clint, mL/min/kg), human plasma protein binding (hPPB) percent unbound, and rat plasma protein binding (rPPB) percent unbound.

Figure 7:

An overview of the representation’s predictive performance across five major task categories on Broad6K. Results for molecular structure are obtained from the best method based on JOAO. Results for cell morphology come from the best method based on RF. Results for gene expression are derived from the best method based on MLP.

We utilize scaffold-splitting with a ratio of 0.6:0.15:0.25 and follow [19] for the ToxCast dataset. We use the Area under the curve (AUC) score for classification and mean absolute error (MAE) for regression. We report the mean and standard deviations from ten runs.

C.2. Implementation and Baseline

We consider baselines from three representation sources: molecular structures, cell morphology, and gene expressions. Moreover, we have three different ways to represent molecular structures, including fingerprints based on domain knowledge, GNNs based on the graph structure of molecules, and chemical language models (ChemLM) based on SMILES-sequence structure of molecules.

  1. Molecular descriptors/fingerprints [43] (Structure only): We train MLPs, Random Forests (RF), and Gaussian Processes (GP) on these representations.

  2. Pretrained GNN representations [18] (Structure only): We consider AttrMask, ContextPred, and EdgePred with supervised pretraining [18]. We also include GraphCL [60], GROVER [44], JOAO [61], MGSSL [62], GraphLoG [59], GraphMAE [17], DSLA [23], and UniMol [63]. We implement GraphCL, GROVER, and JOAO based on [55]. Fine-tuned MLPs are applied on top of the pretrained representations.

  3. Pretrained ChemLM representations [12] (Structure only): We consider pretrained models such as 102M Roberta and 87M GPT2 implemented by [31]. We also include MolT5 [10] and 19M ChemGPT [12]. We apply fine-tuned MLPs on top of these pretrained representations.

  4. Cell Morphology [43] (Cell or Structure only): Cell morphology features are available for a subset of molecules in the ChEMBL2K and Broad6K datasets. We train MLPs, RF, and GP on these representations. Note that not all molecules have corresponding cell morphology feature vectors; in such cases, we replace the predictions for the missing features with predictions from the structure-based model.

  5. Gene Expression [43] (Gene or Structure only): Differential gene expression values are available for a subset of molecules in the ChEMBL2K and Broad6K datasets. We train MLPs, RF, and GP on these representations. Note that not all molecules have corresponding gene expression vectors over the landmark genes; in such cases, we replace the predictions for the missing features with predictions from the structure-based model.

  6. CLOOME [46] and InfoCORE [54] (Structure-Cell or Structure-Gene aligned): CLOOME utilizes ResNet [15] and descriptor-based MLP to align representation from cell morphology images with the molecular structure representation. We use their pretrained MLP to obtain molecular representations and fine-tune another MLP on top of these representations. InfoCORE has two versions, InfoCORE-CP and InfoCORE-GE, which align the molecular graph representation with cell morphology features or differential gene expression features, respectively. We use both versions as baselines and fine-tune another MLP on top of these representations.

C.3. More Results for Molecular Property Prediction

We present additional comparisons on the ChEMBL2K dataset between basic representation approaches and InfoAlign across all task dimensions in Figure 6. Similarly, results for the Broad6K dataset, comparing basic representations across five major task dimensions (Cell, Yeast, Viral, Biochem, and Bacterial-related targets), are shown in Figure 7. Combined with Tables 2 and 3, these detailed results lead to further observations:

  1. Different structure-based molecular representations vary in sensitivity to model architecture. Dramatic performance drops occur with Morgan FP when replacing the MLP architecture with RF or GP in the ChEMBL2K and Broad6K datasets. Conversely, in the Biogen3K dataset, RF and GP significantly outperform MLP. In contrast, pretrained GNN and ChemLM representations maintain more consistent performance across various datasets.

  2. Learning universal molecular representations solely from molecular structures remains challenging, even within the representation category. For pretrained GNN representations, ContextPred outperforms others on the ChEMBL2K dataset. JOAO excels on the Broad6K dataset. UniMol and GraphMAE are the best pretrained GNN representations on ToxCast and Biogen3K datasets, respectively. For ChemLM representations, MolT5 excels over other sequential-based models in the ToxCast and Biogen3K datasets, but this is not the case with the ChEMBL2K and Broad6K datasets. Different datasets may emphasize varied aspects of bioactivity classification or regression and pose generalization challenges for molecular representation learning.

  3. InfoAlign shows strong generalization for the targets of non-human cells, as shown in Figure 7. Although the context graph primarily uses data from small molecule and genetic perturbation datasets [3, 6] focused on human cell cultures, InfoAlign also exhibits robust generalization to bacterial and viral targets compared to basic representation approaches.

References

  • [1].Alemi Alexander A, Fischer Ian, Dillon Joshua V, and Murphy Kevin. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016. [Google Scholar]
  • [2].Arevalo John, Su Ellen, Dijk Robert van, Carpenter Anne E, and Singh Shantanu. Evaluating batch correction methods for image-based cell profiling. bioRxiv, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Bray Mark-Anthony, Singh Shantanu, Han Han, Davis Chadwick T, Borgeson Blake, Hartland Cathy, Kost-Alimova Maria, Gustafsdottir Sigrun M, Gibson Christopher C, and Carpenter Anne E. Cell painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nature protocols, 11(9):1757–1774, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [4].Bray Mark-Anthony, Gustafsdottir Sigrun M, Rohban Mohammad H, Singh Shantanu, Ljosa Vebjorn, Sokolnicki Katherine L, Bittker Joshua A, Bodycombe Nicole E, Dančík Vlado, Hasaka Thomas P, et al. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the cell painting assay. Gigascience, 6(12):giw014, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Carpenter Anne E, Jones Thouis R, Lamprecht Michael R, Clarke Colin, Kang In Han, Friman Ola, Guertin David A, Chang Joo Han, Lindquist Robert A, Moffat Jason, et al. Cellprofiler: image analysis software for identifying and quantifying cell phenotypes. Genome biology, 7: 1–11, 2006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Chandrasekaran Srinivas Niranj, Ackerman Jeanelle, Alix Eric, Michael Ando D, Arevalo John, Bennion Melissa, Boisseau Nicolas, Borowa Adriana, Boyd Justin D, Brino Laurent, et al. Jump cell painting dataset: morphological impact of 136,000 chemical and genetic perturbations. Biorxiv, pages 2023–03, 2023. [Google Scholar]
  • [7].Chithrananda Seyone, Grand Gabriel, and Ramsundar Bharath. Chemberta: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885, 2020. [Google Scholar]
  • [8].Cimini Beth A, Chandrasekaran Srinivas Niranj, Kost-Alimova Maria, Miller Lisa, Goodale Amy, Fritchman Briana, Byrne Patrick, Sakshi Garg, Nasim Jamali, Logan David J, et al. Optimizing the cell painting assay for image-based profiling. Nature protocols, 18(7):1981–2013, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. [Google Scholar]
  • [10].Edwards Carl, Lai Tuan, Ros Kevin, Honke Garrett, Cho Kyunghyun, and Ji Heng. Translation between molecules and natural language. arXiv preprint arXiv:2204.11817, 2022. [Google Scholar]
  • [11].Fang Cheng, Wang Ye, Grater Richard, Kapadnis Sudarshan, Black Cheryl, Trapa Patrick, and Sciabola Simone. Prospective validation of machine learning algorithms for absorption, distribution, metabolism, and excretion prediction: An industrial perspective. Journal of Chemical Information and Modeling, 63(11):3263–3274, 2023. [DOI] [PubMed] [Google Scholar]
  • [12].Frey Nathan C, Soklaski Ryan, Axelrod Simon, Samsi Siddharth, Gomez-Bombarelli Rafael, Coley Connor W, and Gadepally Vijay. Neural scaling of deep chemical models. Nature Machine Intelligence, 5(11):1297–1305, 2023. [Google Scholar]
  • [13].Gaulton Anna, Bellis Louisa J, Patricia Bento A, Chambers Jon, Davies Mark, Hersey Anne, Light Yvonne, McGlinchey Shaun, Michalovich David, Al-Lazikani Bissan, et al. Chembl: a large-scale bioactivity database for drug discovery. Nucleic acids research, 40(D1):D1100–D1107, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [14].Girdhar Rohit, El-Nouby Alaaeldin, Liu Zhuang, Singh Mannat, Alwala Kalyan Vasudev, Joulin Armand, and Misra Ishan. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023. [Google Scholar]
  • [15].He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [Google Scholar]
  • [16].Himmelstein Daniel Scott, Lizee Antoine, Hessler Christine, Brueggeman Leo, Chen Sabrina L, Hadley Dexter, Green Ari, Khankhanian Pouya, and Baranzini Sergio E. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife, 6:e26726, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [17].Hou Zhenyu, Liu Xiao, Cen Yukuo, Dong Yuxiao, Yang Hongxia, Wang Chunjie, and Tang Jie. Graphmae: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 594–604, 2022. [Google Scholar]
  • [18].Hu Weihua, Liu Bowen, Gomes Joseph, Zitnik Marinka, Liang Percy, Pande Vijay, and Leskovec Jure. Strategies for pre-training graph neural networks. In International Conference on Learning Representations (ICLR), 2020. [Google Scholar]
  • [19].Hu Weihua, Fey Matthias, Zitnik Marinka, Dong Yuxiao, Ren Hongyu, Liu Bowen, Catasta Michele, and Leskovec Jure. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020. [Google Scholar]
  • [20].Huang Kexin, Xiao Cao, Glass Lucas M, and Sun Jimeng. Moltrans: molecular interaction transformer for drug–target interaction prediction. Bioinformatics, 37(6):830–836, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [21].Inae Eric, Liu Gang, and Jiang Meng. Motif-aware attribute masking for molecular graph pre-training. arXiv preprint arXiv:2309.04589, 2023. [Google Scholar]
  • [22].Jin Bowen, Liu Gang, Han Chi, Jiang Meng, Ji Heng, and Han Jiawei. Large language models on graphs: A comprehensive survey. arXiv preprint arXiv:2312.02783, 2023. [Google Scholar]
  • [23].Kim Dongki, Baek Jinheon, and Hwang Sung Ju. Graph self-supervised learning with accurate discrepancy learning. Advances in Neural Information Processing Systems, 35:14085–14098, 2022. [Google Scholar]
  • [24].Krenn Mario, Ai Qianxiang, Barthel Senja, Carson Nessa, Frei Angelo, Frey Nathan C, Friederich Pascal, Gaudin Théophile, Gayle Alberto Alexander, Jablonka Kevin Maik, et al. Selfies and the future of molecular string representations. Patterns, 3(10), 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [25].Liu Anika, Seal Srijit, Yang Hongbin, and Bender Andreas. Using chemical and biological data to predict drug toxicity. SLAS Discovery, 28(3):53–64, 2023. [DOI] [PubMed] [Google Scholar]
  • [26].Liu Gang, Zhao Tong, Xu Jiaxin, Luo Tengfei, and Jiang Meng. Graph rationalization with environment-based augmentations. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1069–1078, 2022. [Google Scholar]
  • [27].Liu Gang, Zhao Tong, Inae Eric, Luo Tengfei, and Jiang Meng. Semi-supervised graph imbalanced regression. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1453–1465, 2023. [Google Scholar]
  • [28].Liu Gang, Inae Eric, Zhao Tong, Xu Jiaxin, Luo Tengfei, and Jiang Meng. Data-centric learning from unlabeled graphs with diffusion model. Advances in neural information processing systems, 36, 2024. [Google Scholar]
  • [29].Liu Gang, Xu Jiaxin, Luo Tengfei, and Jiang Meng. Graph diffusion transformer for multi-conditional molecular generation. arXiv preprint arXiv:2401.13858, 2024. [Google Scholar]
  • [30].Liu Yinhan, Ott Myle, Goyal Naman, Du Jingfei, Joshi Mandar, Chen Danqi, Levy Omer, Lewis Mike, Zettlemoyer Luke, and Stoyanov Veselin. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019. [Google Scholar]
  • [31].Mary Hadrien, Noutahi Emmanuel, DomInvivo, Zhu Lu, Moreau Michel, Pak Steven, Gilmour Desmond, Whitfield Shawn, t, Valence-JonnyHsu, Hounwanou Honoré, Kumar Ishan, Maheshkar Saurav, Nakata Shuya, Kovary Kyle M., Wognum Cas, Craig Michael, and Bot DeepSource. datamol-io/datamol: 0.12.3, January 2024. URL https://doi.org/10.5281/zenodo.10535844. [DOI] [Google Scholar]
  • [32].Mast Fred D, Ratushny Alexander V, and Aitchison John D. Systems cell biology. Journal of Cell Biology, 206(6):695–706, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [33].Moshkov Nikita, Becker Tim, Yang Kevin, Horvath Peter, Dancik Vlado, Wagner Bridget K, Clemons Paul A, Singh Shantanu, Carpenter Anne E, and Caicedo Juan C. Predicting compound activity from phenotypic profiles and chemical structures. Nature Communications, 14(1):1967, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [34].Nguyen Cuong Q, Pertusi Dante, and Branson Kim M. Molecule-morphology contrastive pretraining for transferable molecular representation. bioRxiv, pages 2023–05, 2023. [Google Scholar]
  • [35].Nguyen XuanLong, Wainwright Martin J, and Jordan Michael I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010. [Google Scholar]
  • [36].Oord Aaron van den, Li Yazhe, and Vinyals Oriol. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. [Google Scholar]
  • [37].Perozzi Bryan, Al-Rfou Rami, and Skiena Steven. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014. [Google Scholar]
  • [38].Poole Ben, Ozair Sherjil, Oord Aaron van den, Alemi Alex, and Tucker George. On variational bounds of mutual information. In International Conference on Machine Learning, pages 5171–5180. PMLR, 2019. [Google Scholar]
  • [39].Radford Alec, Wu Jeffrey, Child Rewon, Luan David, Amodei Dario, Sutskever Ilya, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. [Google Scholar]
  • [40].Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. [Google Scholar]
  • [41].Ramesh Aditya, Dhariwal Prafulla, Nichol Alex, Chu Casey, and Chen Mark. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. [Google Scholar]
  • [42].Richard Ann M, Judson Richard S, Houck Keith A, Grulke Christopher M, Volarath Patra, Thillainadarajah Inthirany, Yang Chihae, Rathman James, Martin Matthew T, Wambaugh John F, et al. Toxcast chemical landscape: paving the road to 21st century toxicology. Chemical research in toxicology, 29(8):1225–1251, 2016. [DOI] [PubMed] [Google Scholar]
  • [43].Rogers David and Hahn Mathew. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010. [DOI] [PubMed] [Google Scholar]
  • [44].Rong Yu, Bian Yatao, Xu Tingyang, Xie Weiyang, Wei Ying, Huang Wenbing, and Huang Junzhou. Self-supervised graph transformer on large-scale molecular data. Advances in neural information processing systems, 33:12559–12571, 2020. [Google Scholar]
  • [45].Ross Jerret, Belgodere Brian, Chenthamarakshan Vijil, Padhi Inkit, Mroueh Youssef, and Das Payel. Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 4(12):1256–1264, 2022. [Google Scholar]
  • [46].Sanchez-Fernandez Ana, Rumetshofer Elisabeth, Hochreiter Sepp, and Klambauer Günter. Cloome: contrastive learning unlocks bioimaging databases for queries with chemical structures. Nature Communications, 14(1):7339, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [47].Seal Srijit, Carreras-Puigvert Jordi, Trapotsi Maria-Anna, Yang Hongbin, Spjuth Ola, and Bender Andreas. Integrating cell morphology with gene expression and chemical structure to aid mitochondrial toxicity detection. Communications Biology, 5(1):858, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [48].Seal Srijit, Yang Hongbin, Trapotsi Maria-Anna, Singh Satvik, Carreras-Puigvert Jordi, Spjuth Ola, and Bender Andreas. Merging bioactivity predictions from cell morphology and chemical fingerprint models using similarity to training data. Journal of Cheminformatics, 15(1):56, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [49].Seal Srijit, Trapotsi Maria-Anna, Spjuth Ola, Singh Shantanu, Carreras-Puigvert Jordi, Greene Nigel, Bender Andreas, and Carpenter Anne E. A decade in a systematic review: The evolution and impact of cell painting. bioRxiv, pages 2024–05, 2024. [Google Scholar]
  • [50].Subramanian Aravind, Narayan Rajiv, Corsello Steven M, Peck David D, Natoli Ted E, Lu Xiaodong, Gould Joshua, Davis John F, Tubelli Andrew A, Asiedu Jacob K, et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell, 171(6):1437–1452, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [51].Sun Ruoxi, Dai Hanjun, and Yu Adams Wei. Does gnn pretraining help molecular representation? Advances in Neural Information Processing Systems, 35:12096–12109, 2022. [Google Scholar]
  • [52].Tian Yonglong, Sun Chen, Poole Ben, Krishnan Dilip, Schmid Cordelia, and Isola Phillip. What makes for good views for contrastive learning? Advances in neural information processing systems, 33:6827–6839, 2020. [Google Scholar]
  • [53].Tishby Naftali, Pereira Fernando C, and Bialek William. The information bottleneck method. arXiv preprint physics/0004057, 2000. [Google Scholar]
  • [54].Wang Chenyu, Gupta Sharut, Uhler Caroline, and Jaakkola Tommi S. Removing biases from molecular representations via information maximization. In The Twelfth International Conference on Learning Representations, 2024. [Google Scholar]
  • [55].Wang Hanchen, Kaddour Jean, Liu Shengchao, Tang Jian, Lasenby Joan, and Liu Qi. Evaluating self-supervised learning for molecular graph embeddings. Advances in Neural Information Processing Systems, 36, 2024. [Google Scholar]
  • [56].Wang Zichen, Clark Neil R, and Ma’ayan Avi. Drug-induced adverse events prediction with the lincs l1000 data. Bioinformatics, 32(15):2338–2345, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [57].Wang Zifeng, Wang Zichen, Srinivasan Balasubramaniam, Ioannidis Vassilis N, Rangwala Huzefa, and Anubhai Rishita. Biobridge: Bridging biomedical foundation models via knowledge graph. arXiv preprint arXiv:2310.03320, 2023. [Google Scholar]
  • [58].Xu Keyulu, Hu Weihua, Leskovec Jure, and Jegelka Stefanie. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. [Google Scholar]
  • [59].Xu Minghao, Wang Hang, Ni Bingbing, Guo Hongyu, and Tang Jian. Self-supervised graph-level representation learning with local and global structure. In International Conference on Machine Learning, pages 11548–11558. PMLR, 2021. [Google Scholar]
  • [60].You Yuning, Chen Tianlong, Sui Yongduo, Chen Ting, Wang Zhangyang, and Shen Yang. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33:5812–5823, 2020. [Google Scholar]
  • [61].You Yuning, Chen Tianlong, Shen Yang, and Wang Zhangyang. Graph contrastive learning automated. In International Conference on Machine Learning, pages 12121–12132. PMLR, 2021. [Google Scholar]
  • [62].Zhang Zaixi, Liu Qi, Wang Hao, Lu Chengqiang, and Lee Chee-Kong. Motif-based graph self-supervised learning for molecular property prediction. Advances in Neural Information Processing Systems, 34:15870–15882, 2021. [Google Scholar]
  • [63].Zhou Gengmo, Gao Zhifeng, Ding Qiankun, Zheng Hang, Xu Hongteng, Wei Zhewei, Zhang Linfeng, and Ke Guolin. Uni-mol: A universal 3d molecular representation learning framework. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=6K2RM6wVqKu. [Google Scholar]
