Author manuscript; available in PMC: 2026 Apr 9.
Published in final edited form as: Knowl Based Syst. 2025 Nov 12;331:114810. doi: 10.1016/j.knosys.2025.114810

MVGFormer: Multi-view perspective with graph-guided transformer for cryo-ET segmentation

Haoran Li a,b,c, Xingjian Li d, Huan Wang a, Jiahua Shi c, Huaming Chen e, Yizhou Zhao d, Bo Du f, Johan Barthelemy g, Daisuke Kihara h, Jun Shen a,*, Min Xu d,*
PMCID: PMC13061321  NIHMSID: NIHMS2138841  PMID: 41959693

Abstract

Cryo-Electron Tomography (cryo-ET) is a cutting-edge 3D imaging technology that enables detailed examination of biological macromolecular structures at near-atomic resolution. Recent deep learning applications on cryo-ET, such as cryo-ET segmentation, have drawn widespread interest for their potential to improve particle alignment, classification, and other tasks. However, current methods heavily rely on convolutional architectures, which prioritize local information while neglecting the global structural information inherent in cryo-ET data. Transformer-based models, known for their large receptive field, have become the de-facto design for 2D vision tasks due to their ability to effectively capture global information. This approach is also well-suited for 3D tasks, given the complex nature of 3D objects. Based on this, we extend 2D vision transformers into 3D and propose a novel transformer-based framework for cryo-ET segmentation, named MVGFormer. MVGFormer introduces a multi-view perspective fusion transformer encoder, which captures rich global structural information from multiple perspectives using unique positional embeddings. To enhance contextual awareness, we design a parallel context encoder that builds a visual graph to guide attention. We further introduce two complementary 3D decoders: multi-level feature fusion (MF) and parallel atrous convolutions (P3DA), which together capture multi-scale structural cues for precise segmentation. Furthermore, we introduce a view-masked self-supervised learning strategy to reinforce the effectiveness of the multi-view design and improve the model’s representation capability. To our knowledge, MVGFormer is the first transformer-based model for cryo-ET segmentation. We empirically evaluate MVGFormer on six cryo-ET datasets across three different tasks. Extensive experimental results demonstrate its superiority over state-of-the-art 3D segmentation methods.

Keywords: Cryo-electron tomography, Volumetric image segmentation, Deep learning

1. Introduction

Cryo-Electron Tomography (cryo-ET) enables the rapid freezing and thinning of biological samples to an appropriate thickness, facilitating imaging through an electron microscope [1]. Recent technical advances in cryo-ET allow it to capture high-resolution structures of macromolecular complexes in their near-native state [2], which is of great significance for studying aspects of cell biology, biochemistry, and biomedicine. The analysis of cryo-ET plays an important role in facilitating the understanding of virus infection mechanisms, drug discovery, and disease treatment. However, due to the small size of each particle in the tomogram, it is difficult to fully observe particles with the naked eye, leading to low efficiency in the field. Therefore, developing an automatic detection or segmentation model as an aiding approach or tool becomes crucial.

Different from 2D images, cryo-ET tomograms are characterized by volumetric data, which contain an extra dimension. Fig. 1 provides the orthographic projection of a cryo-ET tomogram. A cryo-ET tomogram is a grey-scale 3D image, typically represented in voxel space. Inspired by recent developments in deep neural network (DNN)-based approaches, some methods have been applied to processing cryo-ET images [3–6], especially in the field of cryo-ET segmentation. Cryo-ET segmentation can be divided into two sub-tasks: subtomogram segmentation and tomogram segmentation. Subtomogram segmentation aims to identify a target structure, typically a single macromolecule, within a subtomogram. Since most subtomograms in cryo-ET datasets contain one primary target of interest, the task is often formulated as a binary segmentation problem. In contrast, tomogram segmentation seeks to segment and classify all particles and other structures (e.g., membranes, filaments, compartments) within a whole tomogram, which usually contains thousands of particles and heterogeneous cellular components, making it similar to a multi-class semantic segmentation task. Although not originally designed for tomogram segmentation, several studies have applied existing deep learning models to cryo-ET tomograms and subtomograms [7], for example, for particle picking [8] or as part of simulation pipelines [9]. More recently, dedicated methods have been proposed for macromolecular segmentation [10] as well as for tomogram segmentation [11–13], the latter addressing a broader range of structures beyond macromolecules, such as membranes, filaments, and compartments. However, these methods are generally built upon 2D network architectures or conventional 3D U-Net backbones, which may limit their ability to fully capture the complex 3D spatial relationships in cryo-ET data.

Fig. 1.

Fig. 1.

Example of the orthographic projection of a cryo-ET tomogram.

Regarding the DNN-based approaches for 3D image segmentation, the existing models can be broadly categorized into two types based on their structural characteristics: convolutional neural network (CNN)-based and transformer-based. CNNs leverage their powerful inductive bias to gather local information at low levels using 3D convolutional kernels, extract global information at high levels, and integrate information from different levels through hierarchical structures to obtain segmentation results [14,15]. However, compared to local information, global information is more crucial for three-dimensional data like tomograms, as their data distribution is more extensive. Different from CNN approaches, transformer-based approaches [16,17] capture global information through multi-head self-attention, which enlarges the receptive field of the framework at an early stage. However, existing transformer-based architectures are primarily designed for 2D images. Even methods designed for 3D images, such as Magnetic Resonance Imaging (MRI), often involve dividing the input images into 2D slices for processing and deploying 3D reconstruction as post-processing. This tends to overlook the structural characteristics of the 3D data. Although a few other methods [18–20] have attempted to address this issue by focusing on voxel-level inputs, they all use only the 'XY' view of the 3D image as the input, unfortunately omitting the multi-perspective observation information of the 3D data. In fact, orthographic projections along the three principal axes provide complementary cues: the 'XY' view preserves the in-plane spatial organization, the 'XZ' view reveals vertical continuity along the depth axis, and the 'YZ' view captures lateral structures. Each individual view also suffers from limitations, e.g., the 'XY' view lacks depth information, while the 'XZ' and 'YZ' views may distort in-plane spatial relations.
By integrating these complementary perspectives, our multi-view design mitigates the ambiguities of single-view analysis and enables a more complete representation of cryo-ET data. Additionally, the recently developed Segment Anything Model (SAM) [21] for 2D images can serve as a foundation model for various tasks. Similarly, we aim to propose a 3D transformer architecture to provide a base framework for future cryo-ET foundation models.

With this insight, in this paper, we propose the Multi-View Perspective with Graph-guided Transformer (MVGFormer), a multi-view perspective fusion transformer for cryo-ET segmentation. Our approach aims to fuse feature embeddings from multiple observation perspectives to fully leverage the characteristics of three-dimensional data. To further leverage contextual information to enhance segmentation performance, we employ a context encoder to generate a visual graph, which is then used to guide the attention process of the transformer. Additionally, inspired by recent 2D transformer-based segmentation approaches [22,23], we design two types of 3D decoders to exploit the feature representations: the multi-level feature fusion segmentor (MF) and the parallel 3D atrous convolution segmentor (P3DA). MF obtains the final segmentation result by aggregating multi-level feature outputs from multiple stages of the transformer encoder, while P3DA restores the feature output from the last stage of the transformer encoder to its original size and utilizes parallel 3D atrous convolutions with different dilation rates to gather multi-scale features. These features are then aggregated to generate the final segmentation mask. Furthermore, we conduct view-masked self-supervised learning to further validate the rationality and effectiveness of our multi-view design. To the best of our knowledge, this is the first work to extend vision transformers to the cryo-ET segmentation task, and we believe the proposed MVGFormer can serve as a basic framework for cryo-ET foundation models.

In a nutshell, our contributions are listed as follows:

  • We propose MVGFormer, a multi-view perspective fusion framework with a dual-stream encoder that integrates contextual information through a visual graph to guide the transformer’s attention process.

  • We design two different decoder variants to support MVGFormer: a multi-level feature fusion decoder (MF) that aggregates hierarchical features from multiple encoder stages, and a parallel 3D atrous convolution decoder (P3DA) that enhances multi-scale feature representation.

  • To further enhance the effectiveness of the multi-view design, we introduce a view-masked self-supervised learning strategy, enabling the model to reconstruct masked views from remaining observations.

  • We conduct extensive experiments to demonstrate the superior performance of MVGFormer over existing state-of-the-art cryo-ET segmentation methods.

2. Related work

2.1. Deep learning in cryo-ET

Deep learning approaches and their potential applications in cryo-electron tomography (cryo-ET) have been increasingly capturing the attention of the bioinformatics community. Some efforts have approached computer vision (CV) tasks and developed numerous new algorithms on cryo-ET data, e.g., segmentation [24–26], classification [27–29], and data augmentation [6,30]. Additionally, some researchers integrate cryo-ET with AI algorithms to address practical challenges in processing cryo-ET data. Zeng et al. [2] propose an unsupervised clustering approach for homogeneous structure mining and modeling. Liu et al. [31] introduce a dual-flow framework, which combines information from both tomograms and the generated mask for isotropic reconstruction. REST [32] introduces a U-Net-based network to mine the relationship between the volumetric input and the ground-truth mask to enhance the model's performance for applications in cryo-ET (i.e., particle picking and subtomogram averaging). However, the aforementioned approaches are all based on convolutional structures, which fail to effectively exploit the complex spatial information of the three-dimensional structure of cryo-ET. Hence, in this paper, we propose the first transformer-based framework (termed MVGFormer) for cryo-ET segmentation. Built on a multi-view perspective transformer encoder and a multi-dilation-rate atrous convolution decoder, our proposed MVGFormer can fully exploit the three-dimensional spatial information of cryo-ET.

2.2. Vision transformers

2.2.1. 2D vision transformers

Transformers were first introduced to computer vision by Dosovitskiy et al. [16], who proposed the Vision Transformer (ViT), which transforms an image into a sequence of tokens and applies self-attention layers for classification. Due to its large receptive field and ability to capture global context at an early stage [33,34], ViT has achieved excellent results in many visual recognition tasks [35]. Inspired by ViT, many novel approaches have been proposed for downstream 2D vision tasks, e.g., classification [36–38], segmentation [22,23,39–41], and object detection [42–44]. To enhance data diversity, Wang et al. [45] replace the fixed positional embeddings commonly used in ViT-based models with shuffled position embeddings, improving generalization across datasets.

2.2.2. 3D vision transformers

With the success of transformers on 2D images, researchers have attempted to extend them to 3D volumetric data. Some studies [46–49] cut 3D images into multiple slices and treat them as inputs to deploy 2D transformers on 3D images. However, these approaches leave a significant portion of the spatial information in the third dimension uncomputed. To overcome this limitation, other methods [18,19] directly use voxel-level inputs to extend 2D transformers into 3D architectures. Although these methods have achieved promising results, they all assume a fixed observation perspective as the input view for the 3D image, neglecting the rich information observable from other perspectives. Motivated by this limitation, our work proposes a multi-view perspective fusion transformer encoder to fuse features from different observation perspectives to improve segmentation performance.

2.3. Multi-view fusion

Although a single-view image can provide sufficient information for model learning in 2D scenarios, in many complex situations, relying solely on the semantic information from one view is insufficient for the model to capture rich and comprehensive representations. PixelFusion-Net [50] fuses misaligned photographs of the same scene, significantly improving the network’s ability to learn large disparities. MVMP [51] achieves more efficient and faster person re-identification by integrating multi-view surveillance images. MFFN [52] fuses multi-view images obtained through flipping and enhancement of the input to more accurately identify the boundaries of camouflaged objects. EditSplat [53] leverages 3D Gaussian Splatting to fuse multi-view projections of objects in a 3D scene, enabling more view-consistent 3D scene editing. Unlike existing approaches, our method operates in the voxel space, processing different views of the same object to enable the model to learn richer three-dimensional contextual representations.

3. Problem definition

We tackle the cryo-ET segmentation task, which aims to predict the 3D mask for each particle in a voxel-level input image. The cryo-ET segmentation task can be divided into two sub-tasks: tomogram segmentation and subtomogram segmentation. For tomogram segmentation, the input is usually a large-scale 3D image (512³ in our experimental data) in voxel space, and each input usually contains hundreds or thousands of particles. The goal of tomogram segmentation is to segment and classify every particle contained in the input tomogram, similar to semantic segmentation in 2D images. For subtomogram segmentation, the input is usually a small 3D image (32³ in our case) in voxel space, and each input contains only a few particles, all belonging to the same category. The goal of subtomogram segmentation is to generate a 3D binary mask to localize each particle contained in the subtomogram.

Given an input cryo-ET tomogram/subtomogram $x \in \mathbb{R}^{H \times W \times D \times 1}$ and the corresponding ground-truth segmentation mask $y \in \mathbb{R}^{H \times W \times D \times N}$ ($N$ denotes the number of classes, and $N = 1$ for subtomogram input), the goal of the cryo-ET segmentation task is to train a segmentation model that obtains the most accurate predicted segmentation mask through

$\theta^{*} = \arg\min_{\theta} \mathcal{L}_{CE}\big(f_{\theta}(x), y\big)$, (1)

where $\theta^{*}$ denotes the optimal model parameters, $\mathcal{L}_{CE}$ denotes the cross-entropy loss, $y$ denotes the one-hot encoding of the label, and $f_{\theta}(x)$ denotes the prediction probability from the model $f_{\theta}$.
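As a concrete illustration of Eq. (1), the following is a minimal pure-Python sketch of the voxel-wise cross-entropy term; the function name and nested-list layout are our own illustrative choices, not the paper's implementation:

```python
import math

def voxel_cross_entropy(probs, onehot):
    """Mean cross-entropy over voxels, illustrating the L_CE term of Eq. (1).

    probs:  nested list [H][W][D][N] of predicted class probabilities f_theta(x)
    onehot: nested list [H][W][D][N] one-hot encoding of the label y
    (toy stand-in; real pipelines use framework-provided losses)
    """
    total, count = 0.0, 0
    for h in range(len(probs)):
        for w in range(len(probs[h])):
            for d in range(len(probs[h][w])):
                p, y = probs[h][w][d], onehot[h][w][d]
                # only the true class contributes, since y is one-hot
                total += -sum(yc * math.log(pc) for pc, yc in zip(p, y) if yc)
                count += 1
    return total / count
```

In practice this is minimized over mini-batches of patches with stochastic gradient descent, as in the optimization of Section 4.5.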

4. Method

As shown in Fig. 2, MVGFormer consists of a multi-view perspective fusion transformer encoder, a context encoder, and a decoder that produces the segmentation mask. Given an input cryo-ET tomogram of size $H_L \times W_L \times D_L \times 1$, to reduce the computational load and increase the quantity of data, we first apply pre-processing to cut it into non-overlapping patches of size $H \times W \times D \times 1$ and send each patch as the input to the context encoder and the transformer encoder. The decoder then takes the output feature embeddings from the encoders for the final prediction. We propose two different decoder designs: the multi-level feature fusion segmentor (MF) and the parallel 3D atrous convolution segmentor (P3DA). Next, we present the details of the proposed encoder and decoders.

Fig. 2.

Fig. 2.

Overview of our proposed MVGFormer. (a) We cut the whole tomogram into patches and send each patch to the MVGFormer as the input. For each input, multi-view transform and linear projection are applied to obtain sequence-level feature embeddings from three different observation perspectives. Each feature embedding is added with its unique corresponding position embedding and then sent to transformer encoder. Each input is also sent to the context encoder to generate the visual graph as the attention-guidance. (b) is the structure of the transformer layer. To obtain voxel-level segmentation, we design two different decoders: (c) multi-level feature fusion segmentor and (d) parallel 3D atrous convolution segmentor. Best viewed in color.

4.1. Context encoder

Given an input $x \in \mathbb{R}^{H \times W \times D \times 1}$, we first send it to three convolutional blocks to obtain the context visual feature $f_c \in \mathbb{R}^{C \times W_p \times H_p \times D_p}$. Each convolutional block comprises a $3 \times 3 \times 3$ 3D convolution layer, a batch normalization layer, and a ReLU activation. To facilitate the use of context features in guiding the transformer's attention process, we construct a visual graph via k-means clustering on $f_c$, aiming to select more informative and representative graph nodes:

$F_c = \{f_1, \ldots, f_j\} \in \mathbb{R}^{C \times N}, \quad N = W_p H_p D_p$, (2)

$\min_{\mu_k, R} \sum_{j=1}^{N} \sum_{k=1}^{K} R_{jk} \lVert f_j - \mu_k \rVert_2^2, \quad \text{s.t.}\ R_{jk} \in \{0, 1\},\ \sum_{k=1}^{K} R_{jk} = 1$, (3)

where $\mu_k \in \mathbb{R}^{C}$ denotes the $k$th cluster center, $K$ is a hyperparameter denoting the number of clusters, and $R_{jk}$ denotes the assignment of feature $f_j$ to cluster $k$:

$R_{jk} = \begin{cases} 1, & \text{if } k = \arg\min_{k'} \lVert f_j - \mu_{k'} \rVert_2^2, \\ 0, & \text{otherwise.} \end{cases}$ (4)

Hence, the graph nodes can be obtained through

$g_k = \dfrac{\sum_{j=1}^{N} R_{jk} f_j}{\sum_{j=1}^{N} R_{jk}}, \quad k = 1, \ldots, K$. (5)

The graph nodes $\mathcal{V} = \{g_k\}_{k=1}^{K}$ are further sent to the transformer encoder for graph-guided attention.
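The graph-node construction of Eqs. (2)–(5) can be sketched with a few Lloyd iterations of k-means; the function name, random initialization, and iteration count below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def build_graph_nodes(fc, K, iters=10, seed=0):
    """Graph nodes from context features via k-means, as in Eqs. (2)-(5).

    fc: context feature map of shape (C, Wp, Hp, Dp); the N = Wp*Hp*Dp
    feature vectors f_j are clustered, and each cluster mean g_k is a node.
    A minimal Lloyd's-iteration sketch, not the paper's implementation.
    """
    C = fc.shape[0]
    F = fc.reshape(C, -1).T                              # (N, C) features f_j
    rng = np.random.default_rng(seed)
    centers = F[rng.choice(len(F), K, replace=False)]    # random init mu_k
    for _ in range(iters):
        # hard assignment R_jk: nearest center, Eq. (4)
        d = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        R = d.argmin(1)
        # update g_k as the mean of assigned features, Eq. (5)
        for k in range(K):
            if (R == k).any():
                centers[k] = F[R == k].mean(0)
    return centers                                       # (K, C) graph nodes V
```

The returned rows play the role of the node set $\mathcal{V}$ that serves as the Query in the graph-guided attention of Section 4.2.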

4.2. Multi-view perspective fusion transformer encoder

Existing 3D transformer works [18,19] only consider the 'XY' view of the input and set a single position embedding from this perspective, which overlooks the spatiality of 3D images. As shown in Fig. 1(a–c), the observed information varies across perspectives. Hence, we propose the multi-view perspective transformer to fuse the features obtained from different observation angles. Different from traditional 2D transformers, which treat images as "isotropic" and use a single positional embedding, we regard 3D images as "anisotropic", which indicates that the order of information acquisition differs depending on the perspective.

Given an input $x \in \mathbb{R}^{H \times W \times D \times 1}$, we perform high-dimensional transposes to obtain the transposed input from the 'YZ' view, $x_{YZ} \in \mathbb{R}^{H \times D \times W \times 1}$, and from the 'XZ' view, $x_{XZ} \in \mathbb{R}^{D \times W \times H \times 1}$. Each input is then divided into $L = \frac{H}{H_p} \times \frac{W}{W_p} \times \frac{D}{D_p}$ patches, where each patch has the shape $x_p \in \mathbb{R}^{H_p \times W_p \times D_p \times 1}$. These patches are further flattened and sent to a linear projection layer to obtain the sequence-level input $x_{seq} \in \mathbb{R}^{C \times H_p W_p D_p}$, where $C$ is the hidden dimension of the linear projection layer. As aforementioned, the information obtained from different observation perspectives does not share the same value; thus, we assign a different learnable position embedding $p_i$ to the $i$th position of the $x_{seq}$ obtained from each perspective. Hence, the input $I^{obv}$ of the transformer can be formulated as:

$I^{obv} = \big[x_1^{obv} + p_1^{obv},\ x_2^{obv} + p_2^{obv},\ \ldots,\ x_C^{obv} + p_C^{obv}\big], \quad obv \in \{XY, XZ, YZ\}$, (6)

where $x_i^{obv}$ denotes the input token embedding and $p_i^{obv}$ denotes the $i$th position embedding of the relevant observation perspective.
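The three observation views can be obtained by simple axis permutations following the shapes given above; a minimal NumPy sketch (with the channel dimension omitted for clarity, and the function name our own):

```python
import numpy as np

def multi_view(x):
    """Generate the three observation views of a tomogram (Section 4.2).

    x: voxel grid of shape (H, W, D). The axis permutations follow the
    view shapes stated in the paper: YZ view (H, D, W), XZ view (D, W, H);
    the XY view is the input itself.
    """
    return {
        "XY": x,                     # (H, W, D)
        "YZ": x.transpose(0, 2, 1),  # (H, D, W)
        "XZ": x.transpose(2, 1, 0),  # (D, W, H)
    }
```

Each view is then patchified and linearly projected independently, with its own learnable positional embeddings as in Eq. (6).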

Different from existing 2D vision transformers [16,22], we set the graph nodes $\mathcal{V}$ obtained from the context encoder as the Query to guide the attention process. Given an input embedding sequence $I^{obv}$, an $L_n$-layer transformer containing Multi-head Self-Attention (MSA) and Multilayer Perceptron (MLP) blocks (Fig. 2(b)) is used to obtain the output feature embedding:

$F^{L_1} = \mathrm{MSA}\big(Q = \mathcal{V},\ K = \mathrm{LN}(I^{obv}),\ V = \mathrm{LN}(I^{obv})\big) + I^{obv}$,
$F^{L_n} = \mathrm{MSA}\big(\mathrm{LN}(F^{L_{n-1}})\big) + F^{L_{n-1}}, \quad n \geq 2$,
$F^{L_n} = \mathrm{MLP}\big(\mathrm{LN}(F^{L_n})\big) + F^{L_n}$, (7)

where $\mathrm{LN}(\cdot)$ represents layer normalization. We denote $F_{obv}^{L_n}$ as the output sequence-level features of each layer from different observation perspectives.

4.3. Decoder designs

As mentioned in Section 1, two different decoder designs are introduced for voxel-level segmentation. We set the cross-entropy loss $\mathcal{L}_{CE}$ as the segmentation loss $\mathcal{L}_{seg}$ [54–56] for both decoder designs.

4.3.1. Multi-level feature fusion segmentor (MF)

As shown in Fig. 2(c), we propose a multi-level feature fusion segmentor to aggregate $N$ feature embeddings obtained from different encoder layers. Given an $L_n$-layer transformer encoder, MF takes the output features from the $\frac{L_n}{N}$th, $\frac{2L_n}{N}$th, $\ldots$, $\frac{N L_n}{N}$th layers as input, and reshapes each feature to $F_N \in \mathbb{R}^{C \times H_p \times W_p \times D_p}$. Each $F_N$ is further sent to a separate decoder block, which consists of two $3 \times 3 \times 3$ convolution layers, two batch normalization (BN) layers, two rectified linear unit (ReLU) layers, and one up-sampling layer. We then concatenate all $N$ outputs from the decoder blocks and use a $1 \times 1 \times 1$ convolution layer to generate the segmentation mask $\hat{y}^{obv}$ from each observation perspective. Consequently, we add these predicted masks together to obtain the final prediction mask:

$\hat{y} = \hat{y}^{XY} + \hat{y}^{XZ} + \hat{y}^{YZ}$. (8)

It should be noted that the predicted masks from the 'YZ' view, $\hat{y}^{YZ} \in \mathbb{R}^{H \times D \times W}$, and the 'XZ' view, $\hat{y}^{XZ} \in \mathbb{R}^{D \times W \times H}$, are rearranged into the canonical $H \times W \times D$ grid before the adding operation, so that they are spatially aligned with $\hat{y}^{XY} \in \mathbb{R}^{H \times W \times D}$. Since all views share the same spatial resolution and are normalized in the joint latent space, their features are comparable in scale and semantics. We therefore adopt element-wise addition for fusion, which reinforces complementary cues across views while remaining more lightweight and effective than concatenation + MLP, cross-attention, or FiLM-style gating.
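The rearrangement and element-wise addition of Eq. (8) can be sketched as follows; the helper name is our own, and the channel/class dimension is omitted for brevity:

```python
import numpy as np

def fuse_views(y_xy, y_xz, y_yz):
    """Fuse per-view predictions into the final mask, as in Eq. (8).

    y_xy: (H, W, D); y_xz: (D, W, H); y_yz: (H, D, W). The XZ and YZ
    predictions are permuted back to the canonical (H, W, D) grid, then
    summed element-wise. Minimal sketch; the paper fuses per-class logit
    volumes in the same way.
    """
    y_xz_c = y_xz.transpose(2, 1, 0)  # (D, W, H) -> (H, W, D)
    y_yz_c = y_yz.transpose(0, 2, 1)  # (H, D, W) -> (H, W, D)
    return y_xy + y_xz_c + y_yz_c
```

Because each permutation is its own inverse here, a view prediction that agrees voxel-for-voxel with the XY prediction simply reinforces it after rearrangement.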

4.3.2. Parallel 3D atrous convolution segmentor (P3DA)

We present our proposed P3DA in Fig. 2(d). P3DA takes the output feature $F^{L_n}$ from the final layer of the transformer encoder and reshapes it to the original size $H \times W \times D$ as the input 3D feature map $F \in \mathbb{R}^{C \times H \times W \times D}$. $F$ is further sent to four parallel atrous decoder blocks. Each block consists of a $3 \times 3 \times 3$ convolution layer (except the first block, which uses a $1 \times 1 \times 1$ convolution) with a different dilation rate to enlarge the receptive field [57,58], a BN layer, and a ReLU layer. The dilation rates for the four blocks are set to 1, 6, 12, and 18. Then, through a concatenation operation, the parallel outputs of the decoder blocks are fused and sent to a 2-layer convolution head (both layers with kernel size $1 \times 1 \times 1$) to produce the segmentation mask. Similar to MF, we perform the same adding operation on the predicted masks $\hat{y}^{obv}$ from different observation perspectives to obtain the final result $\hat{y}$.
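The per-axis receptive field of a dilated convolution is $k + (k-1)(d-1)$, so the four P3DA branches (a 1×1×1 convolution, then 3×3×3 convolutions with dilation 6, 12, and 18) cover effective extents of 1, 13, 25, and 37 voxels per axis; a quick check of this standard formula:

```python
def effective_kernel(k, d):
    """Effective per-axis extent of a kernel of size k with dilation rate d."""
    return k + (k - 1) * (d - 1)

# P3DA branches: a 1x1x1 conv, then 3x3x3 convs with dilation 6, 12, 18
branches = [(1, 1), (3, 6), (3, 12), (3, 18)]
fields = [effective_kernel(k, d) for k, d in branches]
```

The largest branch thus spans more than a full 32³ input patch along each axis, which is why the parallel design captures multi-scale context at negligible parameter cost.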

4.4. View-masked self-supervised learning (VSL)

To facilitate a deeper understanding of inter-view relationships, we introduce a novel self-supervised learning strategy aimed at improving segmentation performance. In detail, during each training step, we randomly select one input view as the masked view and reconstruct it from the remaining two views. Taking $I^{XY}$ as an example, we randomly mask $I^{XY}$ with a rate $\eta$ and then train the remaining part together with the other two views, following the standard procedure described above. Then, following [59], we introduce an additional lightweight transformer as the decoder to reconstruct the masked regions, and use the mean squared error loss $\mathcal{L}_{MSE}$ as the reconstruction loss $\mathcal{L}_{recon}$.
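The masking step (mask rate η = 50%) and the MSE-on-masked-positions objective can be sketched as below; the token layout, zero-filling of masked tokens, and seeding are illustrative assumptions rather than details from the paper:

```python
import numpy as np

def mask_view(tokens, eta=0.5, seed=0):
    """Randomly mask a fraction eta of one view's patch tokens (Section 4.4).

    tokens: (L, C) sequence of patch embeddings for the selected view.
    Returns the masked sequence and the boolean mask of hidden positions.
    Sketch of the masking step only; reconstruction uses a lightweight
    transformer decoder in the paper.
    """
    L = tokens.shape[0]
    rng = np.random.default_rng(seed)
    n_mask = int(round(L * eta))
    idx = rng.choice(L, n_mask, replace=False)
    mask = np.zeros(L, dtype=bool)
    mask[idx] = True
    masked = tokens.copy()
    masked[mask] = 0.0            # zero-fill hidden tokens (illustrative)
    return masked, mask

def recon_loss(pred, target, mask):
    """MSE computed on masked positions only, in MAE style [59]."""
    return ((pred[mask] - target[mask]) ** 2).mean()
```

The loss is then added to the segmentation objective as the $\mathcal{L}_{recon}$ term of Eq. (9).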

4.5. Optimization

We separately calculate the segmentation loss for the predicted mask $\hat{y}^{obv}$ from each observation perspective, as well as for the fused segmentation mask $\hat{y}$. Hence, the target function for training can be formulated as

$\mathcal{L} = \mathcal{L}_{seg}\big(y, \hat{y}^{XY}\big) + \mathcal{L}_{seg}\big(y, \hat{y}^{YZ}\big) + \mathcal{L}_{seg}\big(y, \hat{y}^{XZ}\big) + \mathcal{L}_{seg}(y, \hat{y}) + \mathcal{L}_{recon}$, (9)

where $y$ denotes the ground-truth mask. To help better understand the proposed framework, we also include a concise overview of each component in Table 1.

Table 1.

Summary of the proposed framework.

Dimension | Component | Function / Output
Input | 3D input with XY, XZ and YZ projections | Orthogonal cryo-ET views providing complementary cues
Encoder | Transformer encoder with multi-view tokens | Captures cross-view semantic consistency
Fusion | Graph-based aggregation module | Models spatial and frequency-level relationships
Decoder | Multi-scale convolutional layers | Produces voxel-wise segmentation output
Learning Objective | View-masked SSL + CE loss | Jointly optimizes reconstruction and segmentation

5. Experiments

5.1. Experimental settings

5.1.1. Datasets

We employ two types of datasets: cryo-ET tomogram dataset and cryo-ET subtomogram dataset.

Tomogram dataset.

We chose the tomogram dataset used in SHREC2021 [60] as the tomogram dataset. This dataset contains 10 cryo-ET tomograms simulated from 13 proteins (see details in Appendix A) of known structure with varying sizes, shapes, and functions. Together with vesicles and fiducial markers, each tomogram contains 15 types of particles and has a shape of 512³. Following [60], we set the first nine tomograms as the training set and the last tomogram as the test set. Since each input tomogram is too large for training, we cut each tomogram into multiple non-overlapping patches of size 32³ and use each patch as the input to the model. Hence, there are 40,960 samples in total in the SHREC dataset (36,864 samples in the training set and 4096 samples in the test set).
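The sample counts follow directly from the tiling: each 512³ tomogram yields (512/32)³ = 4096 non-overlapping 32³ patches, as a quick check confirms (the helper name is our own):

```python
def n_patches(vol, patch):
    """Number of non-overlapping cubic patches: (vol // patch) per axis, cubed."""
    per_axis = vol // patch
    return per_axis ** 3

# nine training tomograms and one test tomogram, each 512^3, tiled into 32^3
train = 9 * n_patches(512, 32)
test = 1 * n_patches(512, 32)
```

This matches the 36,864 training and 4096 test samples reported above.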

To further validate the generalization ability of our method on real tomograms, we note that, due to data scarcity, there is currently no publicly available real cryo-ET tomogram dataset with voxel-level annotations. Hence, we employed the particle picking task, an instance of voxel-level segmentation, to quantitatively evaluate the performance of MVGFormer on real cryo-ET tomograms from the EMPIAR-10499 dataset [61] and CZII [62]. EMPIAR-10499 contains 65 tilt series of native M. pneumoniae cells with annotated ribosomes. CZII is an open challenge dataset containing 6 particle types with different levels of prediction difficulty.

Subtomogram dataset.

Both simulated and real datasets are used for the experimental studies. The simulated dataset is generated following the same process as [24,64], which explicitly incorporates the tomographic reconstruction procedure with missing wedge effects and the contrast transfer function (CTF) [65,66]. For illustration, an example visualization of the input simulated data is provided in Fig. 3. The whole dataset contains 50 macromolecules (see details in Appendix A), and each macromolecule is simulated at three different noise levels, with SNR of 0.03, 0.05, and infinity. Each noise level contains 500 samples. The simulated dataset contains 75,000 subtomograms in total. For the real datasets, the public datasets Poly-GA [67] and Erwinia [68] are chosen. Poly-GA contains 1033 samples in total (66 26S subtomograms, 66 TRiC subtomograms, and 901 ribosome subtomograms). Following [69,70], all input subtomograms are resized to 32³, and the dataset is randomly split into training and test sets with a ratio of 0.85:0.15. The ground-truth segmentation masks of Poly-GA are provided by [24,69], and we extracted the ground-truth segmentation mask for each subtomogram based on the coordinate locations provided in Guo et al. [67]. Erwinia contains 10 samples in total, with the size of each subtomogram ranging from 72³ to 84³. Following [71], we resize each subtomogram to 64³ and partition it into multiple non-overlapping patches of size 32³. Each patch is then used as an input to the model, resulting in a total of 80 samples from the Erwinia dataset.

Fig. 3.

Fig. 3.

Example visualizations of the input cryo-ET subtomogram using UCSF Chimera with different filtering ranges.

5.1.2. Implementation details

We train our model on two NVIDIA A100 Tensor Core GPUs with 80 GB of memory per card. For training, the layer number $L_n$, hidden dimension $C$, attention head number, and mask rate $\eta$ are set to 12, 256, 16, and 50%, respectively. The cluster number $K$ is set to 16. The patch size $H_p \times W_p \times D_p$ is set to $4 \times 4 \times 4$. We choose the Adam optimizer [72] with the initial learning rate set to 1e-3. The model is trained for 200 epochs with batch size 72, and the learning rate is decayed by 90% every 100 epochs.

5.1.3. Evaluation metrics

Following existing volumetric segmentation approaches, we choose the mean intersection over union (mIoU) and the Dice similarity coefficient (Dice) as the core evaluation metrics. For the particle picking task, precision, recall, and F1 score are used as the evaluation metrics.
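For a single class, the two metrics reduce to simple set overlaps between predicted and ground-truth voxel sets; a minimal NumPy sketch (mIoU then averages the per-class IoU, and the function name is our own):

```python
import numpy as np

def iou_dice(pred, gt):
    """Per-class IoU and Dice for binary voxel masks.

    pred, gt: boolean arrays of the same shape.
    IoU = |P ∩ G| / |P ∪ G|; Dice = 2|P ∩ G| / (|P| + |G|).
    Empty-on-empty is scored as a perfect match (a common convention).
    """
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    iou = inter / union if union else 1.0
    dice = 2 * inter / denom if denom else 1.0
    return float(iou), float(dice)
```

Note that Dice is always at least as large as IoU for the same masks, which is consistent with the Dice columns exceeding the mIoU columns throughout Tables 2 and beyond.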

5.1.4. Baselines

For tomogram segmentation, following [60], we set URFinder [60], U-CLSTM [73], MCDSNet [60], YOPO [60], CFN [60], TM-F [74], TM [74], and DeepFinder [8] as baselines. Besides, we add a commonly used voxel-level segmentation method, VoxResNet [75], and three recent methods, MedNeXt [18], Swin UNETR [20], and SwiFT [19], as additional baselines. VoxResNet is a 3D convolution-based model designed for volumetric image segmentation. MedNeXt is a hierarchical transformer architecture for CT and MRI modalities. Swin UNETR and SwiFT both extend the 2D Swin transformer into higher dimensions for segmentation. For tomogram particle picking, we set DeepFinder, crYOLO [76], EMAN2 [77], VoxResNet, and SwiFT as the baselines. For subtomogram segmentation on the simulated dataset, we use U-CLSTM, DeepFinder, VoxResNet, Swin UNETR, MedNeXt, and SwiFT as baselines. For subtomogram segmentation on the real dataset, due to its small scale, we utilize VoxResNet pre-trained on the simulated subtomogram dataset as the baseline.

5.2. Comparisons with state-of-the-art methods

5.2.1. Tomogram segmentation

The comparisons on SHREC2021 dataset are shown in Table 2. We report the results using both MVGFormer (MF) and MVGFormer (P3DA). For a fair comparison, we use the same training set for the training of VoxResNet, MedNeXt, Swin UNETR and SwiFT. All the results reported are the average results of five training runs, avoiding the occurrence of the randomness in results. We also include the standard deviation in the table. The results of other baselines are directly cited from their papers. As can be seen, our MVGFormer (MF) and MVGFormer (P3DA) perform better than all baselines. Compared with VoxResNet, our methods excels on both mIoU (i.e., 83.7% → 85.7% for MVGFormer (MF) and 83.7% → 86.9% for MVGFormer (P3DA)) and Dice (i.e., 91.1% → 92.6% for MVGFormer (MF) and 91.1% → 93.1% for MVGFormer (P3DA)). This is because our transformer encoder enables our model to gather global information at early stage to acquire more abundant structural information of the 3D data. Compared with SOTA transformer-based 3D segmentation approaches, MedNeXt and SwiFT, our methods still outperforms on all metrics (i.e., mIoU increased by 5.1% and Dice increased by 3.7%). This is due to the fact that they only consider the input from the ‘XY’ perspective, overlooking the spatiality of the information contained in 3D images. And we also compare the two decoder designs we proposed in Table 2. Because of the atrous convolutions can enlarge the receptive field and better capture contextual information, MVGFormer (P3DA) achieves better performance compared to MVGFormer (MF) (i.e., 84.7% → 86.9% in mIoU and 81.7% → 93.1% in Dice). Although the MF variant itself already outperforms existing baselines, it is mainly included as an ablation design, while our final model consistently adopts P3DA as the segmentation decoder due to its stronger and more robust performance. 
To break the strict voxel-wise correspondence among views, we construct a permutation variant by introducing Gaussian noise (σ=0.05) into the XZ-view inputs and repeat the experiment on the SHREC2021 dataset. As shown in Table 2, the performance of this variant remains comparable to the unperturbed setting, indicating that the model captures high-level semantic relationships rather than simple spatial re-indexing.

Table 2.

Experimental results on cryo-ET tomogram segmentation.

Method mIoU Dice
URFinder [SHREC21] 66.8 80.1
U-CLSTM [SHREC21] 77.6 87.4
MCDSNet [SHREC21] 78.4 87.9
YOPO [SHREC21] 68.1 81.0
CFN [SHREC21] 75.6 86.1
TM [SHREC21] 50.0 66.6
TM-F [SHREC21] 51.8 68.2
DeepFinder [SHREC21] 83.5 91.0
VoxResNet [NeuroImage18] 83.7±0.4 91.1±0.3
Swin UNETR [CVPR22] 79.4±0.2 88.5±0.1
MedNeXt [miccai23] 82.7±0.6 90.5±0.3
SwiFT [NeurIPS23] 81.5±0.1 89.8±0.1
MVGFormer(MLP) 84.6±0.3 91.6±0.2
MVGFormer(MF) 85.7 ±0.1 92.6 ±0.2
MVGFormer(P3DA) 86.9 ±0.2 93.1 ±0.4
MVGFormer(permutation) 86.8±0.1 93.0±0.3

Bold and underlined values denote the best results and the runner-ups, respectively.

Fig. 4 shows the tomogram segmentation results on the SHREC2021 dataset. To mitigate potential inconsistencies at patch borders, we employ an overlap-tile inference strategy rather than non-overlapping tiling. Specifically, during inference the tomogram is divided into 32×32×32 subvolumes with a 50% overlap in each dimension. The network outputs per-voxel class probabilities for each subvolume, and predictions in overlapping regions are fused by weighted averaging using a smooth window function. For better visibility, we enlarge two local parts as close-ups and outline them with yellow and red boxes in the original figures. We provide the segmentation results from MVGFormer (P3DA) as ours. Compared to the baselines, our proposed method captures segmentation details more faithfully. In contrast, the yellow boxes show that both VoxResNet and SwiFT incorrectly segment some proteins of type "5MRC" as type "1BXN" (the color of the specific part on the mask should be blue instead of yellow), and the red boxes show that both VoxResNet and SwiFT inaccurately classify some proteins of type "5MRC" as type "4V94" (in the center part, where blue is wrongly marked as red). To better showcase the segmentation results, we also provide magnified close-ups in Fig. 5.
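The overlap-tile inference above can be sketched as follows. Here `predict_fn` stands in for the trained network, and the Hann window and epsilon values are illustrative choices, since the text specifies only "a smooth window function".

```python
import numpy as np

def hann3d(size):
    """Separable 3D Hann window that down-weights patch borders."""
    w = np.maximum(np.hanning(size), 1e-3)   # avoid zero weight at the edges
    return w[:, None, None] * w[None, :, None] * w[None, None, :]

def overlap_tile_predict(volume, predict_fn, patch=32):
    """Fuse per-patch probability maps with 50% overlap and Hann weighting."""
    stride = patch // 2
    acc = np.zeros(volume.shape, dtype=float)
    weight = np.zeros(volume.shape, dtype=float)
    win = hann3d(patch)
    for z in range(0, volume.shape[0] - patch + 1, stride):
        for y in range(0, volume.shape[1] - patch + 1, stride):
            for x in range(0, volume.shape[2] - patch + 1, stride):
                sub = volume[z:z+patch, y:y+patch, x:x+patch]
                prob = predict_fn(sub)       # per-voxel probability map
                acc[z:z+patch, y:y+patch, x:x+patch] += prob * win
                weight[z:z+patch, y:y+patch, x:x+patch] += win
    return acc / np.maximum(weight, 1e-12)

# Sanity check: with an identity "network", fusion reproduces the input.
rng = np.random.default_rng(0)
vol = rng.random((64, 64, 64))
fused = overlap_tile_predict(vol, lambda s: s, patch=32)
```

The weighted average means each voxel's prediction is dominated by the patches in which it lies near the center, suppressing border artifacts.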

Fig. 4.

Visualization of cryo-ET segmentation results. The boxes in the first row indicate the areas that the close-ups on the bottom display. The 3D cryo-ET visualization results are obtained through UCSF Chimera [63].

Fig. 5.

Magnified close-ups of the cryo-ET segmentation results. A ‘1QVR’ protein particle is located within the yellow box.

The classical transformer architecture inherently lacks the strong inductive bias of convolutional structures, and thus requires a large amount of training data to achieve superior performance (e.g., 2D transformers are typically trained on ImageNet, which contains approximately 1.3 million training samples). However, in cryo-ET, such large amounts of annotated image data are scarce. The 40k-sample cryo-ET tomogram dataset SHREC and the 75k-sample simulated subtomogram dataset used in our research are the largest known tomogram and subtomogram datasets with reliable annotations. Subtomograms and local regions of tomograms share the same imaging characteristics (voxel size, CTF, noise distribution, and missing-wedge geometry) and exhibit highly similar structural patterns for the target particles; pre-training on subtomograms therefore enables the model to learn particle-level structural and feature representations that transfer effectively to particle segmentation within whole tomograms. Hence, to further prove the effectiveness of MVGFormer on larger-scale data, we conduct an additional experiment by pre-training the proposed network on the simulated subtomogram dataset and then fine-tuning the pre-trained encoder on the tomogram dataset. In detail, the model is pre-trained for 100 epochs on the simulated dataset with a batch size of 72 and a learning rate of 1e-3. After pre-training, we fully fine-tune the best-performing checkpoint on the SHREC2021 dataset. The results are shown in Table 4. As can be seen from the table, pre-training on the large-scale simulated subtomogram dataset effectively increases the diversity and amount of data seen by the model, leading to improved performance of our proposed method. Compared to the baseline methods, the proposed MVGFormer shows a substantial improvement, with mIoU increasing by 5.5% and Dice rising from 91.4% to 95.1%.

Table 4.

Experimental results on cryo-ET tomogram segmentation (SHREC dataset) after pre-training on a simulated subtomogram dataset.

Method mIoU Dice
DeepFinder 83.6 91.3
MedNeXt 83.1 90.8
VoxResNet 85.3 91.4
SwiFT 82.8 91.0
MVGFormer(MF) 89.0 94.2
MVGFormer(P3DA) 91.1 95.1

5.2.2. Tomogram particle picking

Due to the absence of voxel-level annotations, we follow the standard practice in cryo-ET tomogram analysis [78–80], where the predicted particle center positions are compared with the known center positions. A particle is considered correctly detected (a true positive) if the Euclidean distance between the predicted and known centers is less than the particle radius. We train all the baselines and the proposed MVGFormer on the PolyGA dataset, as it contains voxel-level mask annotations of ribosome particles. Subsequently, we test all the methods on the EMPIAR-10499 and CZII datasets, and report the average results in Table 5. As shown in the table, our proposed method outperforms a wide range of state-of-the-art models. For instance, MVGFormer (P3DA) achieves 61.2%, 78.4% and 69.8% in terms of Precision, Recall and F1 score, surpassing VoxResNet by 7.2%, 9.8% and 7.9% in relative terms on the EMPIAR-10499 dataset. On the CZII dataset, MVGFormer (P3DA) likewise demonstrates outstanding performance (i.e., Precision increased by 15.8% compared to SwiFT and F1 by 8.7% compared to VoxResNet). The results demonstrate that the proposed MVGFormer exhibits excellent generalization and adaptability on real cryo-ET tomograms.
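The detection criterion can be sketched as below; the greedy one-to-one matching is our assumption, since the text states only the distance-to-radius rule.

```python
import numpy as np

def match_particles(pred_centers, gt_centers, radius):
    """Greedy one-to-one matching: a prediction is a true positive when it
    lies within `radius` of a still-unmatched ground-truth center."""
    gt = np.asarray(gt_centers, dtype=float)
    used = np.zeros(len(gt), dtype=bool)
    tp = 0
    for p in np.asarray(pred_centers, dtype=float):
        d = np.linalg.norm(gt - p, axis=1)
        d[used] = np.inf                 # each GT center is matched at most once
        j = int(np.argmin(d))
        if d[j] < radius:
            used[j] = True
            tp += 1
    fp = len(pred_centers) - tp
    fn = len(gt_centers) - tp
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1

# Two of the three predictions fall within the particle radius of a target.
pred = [(10, 10, 10), (30, 30, 30), (90, 90, 90)]
gt = [(11, 10, 10), (31, 30, 29), (60, 60, 60)]
p, r, f = match_particles(pred, gt, radius=5.0)
```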

Table 5.

Experimental results for particle picking using EMPIAR-10499 and CZII dataset.

EMPIAR-10499 CZII
Method Precision Recall F1 Precision Recall F1
DeepFinder [Nat. Methods21] 50.5 51.7 52.7 72.1 71.7 71.9
EMAN2 [J. Struct. Biol07] 26.1 55.3 35.5 54.7 64.7 59.3
crYOLO [Commun. Biol.19] 47.8 56.8 52.0 63.8 65.6 64.7
VoxResNet [NeuroImage18] 57.1 71.4 64.7 74.0 73.0 73.5
SwiFT [NeurIPS23] 54.0 69.4 63.5 67.6 58.1 62.5
MVGFormer(MLP) 58.7 57.7 58.2 75.6 74.4 75.0
MVGFormer(MF) 59.4 75.0 66.8 76.0 77.8 76.9
MVGFormer(P3DA) 61.2 78.4 69.8 78.3 81.6 79.9

Bold and underlined values denote the best results and the runner-ups, respectively.

To further explore the generalization ability of our model on zero-shot tasks, we directly apply the model pre-trained on the SHREC dataset to the particle picking task on the EMPIAR-10499 and CZII datasets, and report the results in Table 6. While the P3DA decoder offers larger receptive fields and richer contextual representations, its higher parameterization makes it prone to overfitting the source domain. In contrast, the lightweight MF design provides stable feature refinement with fewer parameters, which is particularly beneficial under zero-shot conditions where domain gaps are prominent. As shown in the table, MVGFormer (MF) achieves 53.2%, 63.5% and 57.9% in terms of Precision, Recall and F1 score on the EMPIAR-10499 dataset, and 69.6% Precision, 63.7% Recall and 66.5% F1 on the CZII dataset, outperforming the baseline methods and demonstrating promising generalization to unseen tasks.

Table 6.

Zero-shot learning performance for particle picking using EMPIAR-10499 and CZII dataset.

EMPIAR-10499 CZII
Method Precision Recall F1 Precision Recall F1
VoxResNet 48.5 46.3 47.4 64.3 56.6 60.2
SwiFT 43.8 62.8 51.6 54.7 64.7 59.3
MVGFormer(MF-ZS) 53.2 63.5 57.9 69.6 63.7 66.5
MVGFormer(P3DA-ZS) 50.3 57.6 53.7 67.1 60.8 63.8
MVGFormer(MF-Fully) 59.4 75.0 66.8 76.0 77.8 76.9
MVGFormer(P3DA-Fully) 61.2 78.4 69.8 78.3 81.6 79.9

Bold and underlined values denote the best results and the runner-ups, respectively.

5.2.3. Subtomogram segmentation

Table 3 presents the segmentation results on both the simulated and real subtomogram datasets. As with tomogram segmentation, we report the average results of five training runs. As can be seen, both MVGFormer (MF) and MVGFormer (P3DA) achieve SOTA performance compared to the other baselines. On the simulated dataset, MVGFormer (P3DA) yields substantial gains in mIoU (i.e., 80.1% → 87.1% compared to SwiFT) and Dice (i.e., 91.4% → 93.2% compared to VoxResNet). We also provide visualizations of subtomogram segmentation results at SNRs of 0.03, 0.05 and Infinity in Figs. 6, 7 and 8. As can be seen, MVGFormer segments certain details better: at the upper-right corner particle of '2ane', our segmentation result clearly adheres more closely to the ground-truth mask, and for '4fb9', MVGFormer fully segments the particle on the left side, whereas VoxResNet leaves a gap in the middle of the particle.

Table 3.

Experimental results on cryo-ET subtomogram segmentation using simulated and real datasets.

Simulated dataset
Method mIoU Dice
U-CLSTM [SHREC21] 76.8±0.4 87.2±0.1
DeepFinder [SHREC21] 81.0±0.1 89.5±0.1
VoxResNet [NeuroImage18] 84.3±0.4 91.4±0.2
Swin UNETR [CVPR22] 78.8±0.9 88.1±0.6
MedNeXt [miccai23] 82.7±0.5 90.8±0.3
SwiFT [NeurIPS23] 80.1±0.1 88.9±0.7
MVGFormer(MLP) 83.9±0.4 91.2±0.2
MVGFormer(MF) 84.9 ±0.3 91.9 ±0.2
MVGFormer(P3DA) 87.1 ±0.2 93.2 ±0.2
PolyGA dataset
Method mIoU Dice
VoxResNet (w/o pre) 46.0±1.5 61.6±0.9
3D-UNet [JBCB21] 49.4±0.2 66.1±0.1
DeepFinder [SHREC21] 53.4±0.3 69.6±0.4
MedNeXt [miccai23] 55.2±0.5 71.1±0.2
VoxResNet [NeuroImage18] 59.8±0.6 74.8±0.4
SwiFT [NeurIPS23] 57.0±0.2 72.6±0.3
MVGFormer(MLP) 61.5±0.1 76.2±0.3
MVGFormer(MF) 62.3 ±0.1 76.8 ±0.4
MVGFormer(P3DA) 63.7 ±0.2 78.5 ±0.3
Erwinia dataset
Method mIoU Dice
VoxResNet (w/o pre) 72.4±0.7 80.3±1.1
3D-UNet [JBCB21] 76.7±0.2 86.8±0.3
DeepFinder [SHREC21] 88.0±0.3 93.6±0.2
MedNeXt [miccai23] 89.4±1.1 94.4±0.1
VoxResNet [NeuroImage18] 95.4±0.1 97.6±0.1
SwiFT [NeurIPS23] 91.7±0.3 95.7±0.1
MVGFormer(MF) 96.8 ±0.2 98.4 ±0.2
MVGFormer(P3DA) 97.9 ±0.4 98.9 ±0.1

Bold and underlined values denote the best results and the runner-ups, respectively.

Fig. 6.

Visualization of cryo-ET subtomogram segmentation results at an SNR of 0.03. Regions of disagreement between the prediction and the ground truth are highlighted in yellow as error annotations. Best viewed in color.

Fig. 7.

Visualization of cryo-ET subtomogram segmentation results at an SNR of 0.05. Regions of disagreement between the prediction and the ground truth are highlighted in yellow as error annotations. Best viewed in color.

Fig. 8.

Visualization of cryo-ET subtomogram segmentation results at an SNR of Infinity (noise-free). Regions of disagreement between the prediction and the ground truth are highlighted in yellow as error annotations. Best viewed in color.

For the results on the real datasets, because transformers lack the strong inductive bias of convolutional networks, they cannot directly achieve good performance on small datasets [16]. As mentioned in Section 5.1.2, we use the MVGFormer (P3DA) pre-trained on the simulated dataset as the base network and fine-tune it on the real dataset before evaluating on the test set. We provide both the pre-trained VoxResNet and VoxResNet without pre-training (VoxResNet (w/o pre)) as baselines. As shown in Table 3, our method yields 63.7% in mIoU and 78.5% in Dice, surpassing both baselines on both metrics. The experimental results on the Erwinia dataset in Table 3 demonstrate the strong and consistent performance of our method (i.e., mIoU increased by 6.7% compared to SwiFT and Dice by 1.3% compared to VoxResNet), highlighting its robustness across different data domains.

5.2.4. Computational cost

We report the computational cost of the proposed MVGFormer and the three most comparable baselines in Table 7. Compared to the purely convolutional VoxResNet, the proposed transformer-based approach indeed has more parameters and requires somewhat longer training time. However, almost all transformer methods involve a trade-off, achieving performance improvements at the expense of higher training cost. Moreover, during inference, our method is only slightly slower than VoxResNet while being faster than MedNeXt and SwiFT, and achieves a significant performance improvement.

Table 7.

Inference time and computational complexity comparison.

Method Param. Speed (fps) secs/Epoch Training hours
VoxResNet 6.8M 31fps 182 5.06h
MedNeXt 17.6M 9fps 1320 36.67h
SwiFT 10.7M 15fps 269 7.47h
MVGFormer(P3DA) 13.7M 25fps 908 25.22h

5.3. Ablation studies

5.3.1. Hyper-parameter selection

We further evaluate the hyper-parameters of our approach. As shown in Table 8, we evaluate the size of the hidden dimension C, the number of transformer layers Ln and the patch size Hp×Wp×Dp. As can be seen, the performance degrades when the hidden dimension C deviates from 256 (e.g., mIoU drops by 4.6% at C=512 and Dice by 3.9% at C=128, relative to C=256). Although the model achieves decent performance at C=1024, it comes with an increase in computational complexity. Similarly, the other two parameters, Ln and Hp×Wp×Dp, affect both the performance and the computational cost of the model, and moving them higher or lower does not improve performance (e.g., mIoU drops from 86.9% to 76.8% at Ln=6 and Dice from 93.1% to 70.8% at Hp×Wp×Dp=16×16×16). Hence, the most competitive performance is obtained by setting C=256, Ln=12 and Hp×Wp×Dp=4×4×4. We further evaluate the effect of the number of cluster centers K, as reported in Table 9. To verify the generality of this choice, additional ablation studies are conducted on four datasets covering tomogram segmentation, subtomogram segmentation and particle picking, with the results summarized in Table 10. The experimental results show that deviations in the number of cluster centers negatively impact the model's performance: decreasing or increasing K lowers performance, and increasing it also raises the computational complexity.
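The interaction between patch size and computational cost can be made concrete with a quick token-count calculation, assuming the 32×32×32 subvolumes used at inference and three orthogonal views (self-attention cost grows quadratically with the token count, which is why very small patches are expensive and very large ones too coarse):

```python
# Token budget per configuration, assuming a 32^3 input subvolume and
# three orthogonal views (both assumptions taken from the experimental setup).
def tokens_per_view(vol=32, patch=4):
    n = vol // patch          # patches along each axis
    return n ** 3             # tokens for one view

# 4x4x4 patches on a 32^3 subvolume give 8^3 = 512 tokens per view,
# i.e. 1536 tokens across the three views.
for patch in (2, 4, 8, 16):
    t = tokens_per_view(patch=patch)
    print(patch, t, 3 * t)
```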

Table 8.

Ablation study on the hyper-parameters of our proposed model on SHREC dataset.

C mIoU Dice Ln mIoU Dice Hp×Wp×Dp mIoU Dice
128 80.5 89.2 6 76.8 86.9 2×2×2 82.3 90.2
256 86.9 93.1 12 86.9 93.1 4×4×4 86.9 93.1
512 82.3 90.2 18 78.7 88.1 8×8×8 71.5 83.3
1024 84.1 91.3 24 81.6 89.9 16×16×16 54.8 70.8

C,Ln and Hp×Wp×Dp denote the size of the hidden dimension, the number of the transformer layer and the patch size, respectively.

Table 9.

Ablation study on the cluster number K, graph construction methods (G.C.M.), fusion schemes and loss function of our proposed model on SHREC dataset.

K mIoU G.C.M. mIoU Fusion Schemes mIoU Loss Function mIoU
8 83.6 spectral 85.3 norm + sum 85.7 CE+boundary 86.5
16 86.9 k-means 86.9 sum 86.9 CE 86.9
32 84.7 MinCutPool 83.9 learnable-weight 86.4 CE+CL 84.1
Table 10.

Ablation study on the cluster number K of our proposed model on SHREC2021, PolyGA and EMPIAR-10499 dataset.

SHREC2021 PolyGA EMPIAR-10499
K mIoU Dice mIoU Dice Precision Recall F1
8 83.6 91.2 59.1 74.3 58.7 75.2 65.9
12 84.7 91.7 60.2 75.9 59.1 76.0 66.5
16 86.9 93.1 63.7 78.5 61.2 78.4 69.8
20 86.5 92.7 62.0 76.5 60.3 77.3 67.8
24 85.3 92.0 61.4 76.1 59.5 76.1 66.8
28 84.6 91.6 61.5 76.3 59.7 76.2 67.0
32 84.7 91.7 60.9 75.7 58.6 74.6 65.7

5.3.2. Effectiveness of the multi-view perspective fusion strategy

We analyze the effectiveness of the proposed multi-view perspective fusion strategy (MVP). As mentioned above, existing transformer-based methods underperform because they only consider a single observation perspective of the 3D image as the input, and the superiority of our approach partly comes from considering inputs from different perspectives and aggregating the output features. Hence, we apply the MVP strategy to three recent 3D transformer-based approaches, Swin UNETR, MedNeXt and SwiFT, and re-conduct the experiments on the SHREC dataset. We report the results both with and without MVP in Table 11. After incorporating the proposed MVP strategy, all three methods improve on both evaluation metrics (i.e., 81.5% → 83.4% in mIoU for SwiFT, and 90.5% → 91.2% in Dice for MedNeXt). This indicates that the designed MVP strategy indeed enhances model performance by aggregating multi-perspective information.
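The three observation perspectives can be obtained from a single volume by axis permutation; a sketch follows (which permutation corresponds to the paper's XY/YZ/XZ naming is an assumption here, the point being that all three views share the same voxels rearranged):

```python
import numpy as np

def three_views(vol):
    """Return three orthogonal views of one volume via axis permutation
    (the XY/YZ/XZ naming below is an illustrative assumption)."""
    xy = vol                             # axes ordered (z, y, x)
    yz = np.transpose(vol, (2, 0, 1))    # axes ordered (x, z, y)
    xz = np.transpose(vol, (1, 2, 0))    # axes ordered (y, x, z)
    return xy, yz, xz

vol = np.arange(2 * 2 * 2).reshape(2, 2, 2)
xy, yz, xz = three_views(vol)
```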

Table 11.

Ablation study of the proposed multi-view perspective fusion strategy (MVP) in our framework.

Method mIoU Dice
Swin UNETR + MVP 81.6 ↑ (79.4) 89.8 ↑ (88.5)
MedNeXt + MVP 83.9 ↑ (82.7) 91.2 ↑ (90.5)
SwiFT + MVP 83.4 ↑ (81.5) 91.0 ↑ (89.8)

We apply the proposed strategy on three recent 3D transformer-based approaches and re-conduct the experiments on SHREC dataset. The values inside the parentheses are the original results w/o MVP.

Additionally, to demonstrate that cryo-ET data is indeed anisotropic, we train each perspective as an independent primary viewpoint and report the results in Table 12. Since each embedding is optimized separately, the embeddings learn different spatial priors that reflect the inherent anisotropy and spatial context of each view. As shown in the table, the proposed MVP strategy outperforms the use of any individual viewpoint (i.e., 82.7% → 86.9% in mIoU over the best single view), indicating that incorporating different views with distinct positional embeddings effectively captures diverse spatial priors. Hence, the results confirm that, unlike 2D images, 3D images exhibit anisotropic characteristics, as discussed in Section 4.2.

Table 12.

Ablation study of the proposed multi-view perspective fusion strategy (MVP) using each perspective as an independent primary viewpoint.

Method mIoU Dice
MVGFormer(XY) 82.2 90.2
MVGFormer(YZ) 82.7 90.8
MVGFormer(XZ) 81.9 90.0
MVGFormer(P3DA) 86.9 93.1

We further provide an ablation study using repeated single-view inputs with distinct positional embeddings in Table 13, and measure the feature diversity across different views in Table 14 using cosine similarity (CosSim) and normalized feature diversity scores (FDS) as evaluation metrics.

\mathrm{CosSim}(v_1, v_2) = \frac{f(v_1) \cdot f(v_2)}{\lVert f(v_1) \rVert \, \lVert f(v_2) \rVert}, \quad (10)
\mathrm{FDS}(v_1, v_2) = \frac{1}{N} \sum_{i=1}^{N} \lVert f_i(v_1) - f_i(v_2) \rVert_2^2. \quad (11)
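The two metrics in Eqs. (10) and (11) can be computed as follows; the normalization used to obtain the "normalized" FDS scores is not specified in the text, so raw token features are used in this sketch.

```python
import numpy as np

def cos_sim(f1, f2):
    """Eq. (10): cosine similarity between two flattened view features."""
    f1, f2 = f1.ravel(), f2.ravel()
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def fds(f1, f2):
    """Eq. (11): mean squared L2 distance over the N token features,
    each row of f1/f2 being one token's feature vector."""
    diff = f1 - f2                      # shape (N, D)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Orthogonal toy features: similarity 0, positive diversity.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.0, 1.0], [1.0, 0.0]])
sim = cos_sim(a, b)
div = fds(a, b)
```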

As can be seen from Table 13, simply increasing the input dimensionality by duplicating a single view does not lead to significant performance improvements. In contrast, our multi-view perspective fusion (P3DA) achieves clear gains in both mIoU and Dice, confirming that the improvements stem from the integration of complementary information across views rather than from trivial input expansion. The results in Table 14 show lower cosine similarity and higher diversity scores across the different view pairs, indicating that the views provide complementary structural cues. Together, these findings validate the effectiveness of our multi-view design in leveraging structurally diverse perspectives for improved representation learning.

Table 13.

Ablation study of the proposed multi-view perspective fusion strategy (MVP) using repeated single-view inputs with distinct positional embeddings.

Method mIoU Dice
MVGFormer(XY*3) 82.4 90.3
MVGFormer(YZ*3) 83.1 90.5
MVGFormer(XZ*3) 82.6 89.9
MVGFormer(P3DA) 86.9 93.1
Table 14.

Feature diversity across different views.

view-pair Cosine Similarity ↓ Feature Diversity ↑
XY & YZ 0.61 0.81
XY & XZ 0.67 0.75
XZ & YZ 0.74 0.73

5.3.3. Effectiveness of the design choices

We conduct comprehensive ablation experiments to assess the influence of different design choices, including graph construction methods, fusion schemes, loss function design and the decoder design, and report the results in Table 9 and Fig. 9.

Fig. 9.

Ablation studies of (a) The proposed decoder designs. We replace the decoders of three recent 3D transformer-based approaches with our proposed decoders. (b) Different positional encoding design. (c) Loss balancing. We set different weight λ1 and λ2 for different losses.

Effectiveness of the graph design.

To validate the effectiveness of our proposed graph-guided attention mechanism and the contextual graph construction, we perform an ablation study on the module design and report the results in Fig. 10. Specifically, we extract the feature map output from the encoder of our model, compute the voxel-wise activation strength by taking the channel-wise L2 norm, and normalize the resulting 3D heatmap to [0, 1]. As shown in Fig. 10, removing the graph-guided attention (Fig. 10(d)) leads to weak and diffuse activations, where the encoder fails to capture the particle boundaries. When the graph is used to guide attention (Fig. 10(c)), the encoder becomes more sensitive to boundary regions. Furthermore, by incorporating contextual graph construction to enhance cross-view understanding, the boundaries become even more distinct and well localized (Fig. 10(b)).
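The heatmap computation described above (channel-wise L2 norm, normalized to [0, 1]) can be sketched as:

```python
import numpy as np

def activation_heatmap(feat):
    """Per-voxel activation strength of an encoder feature map.
    `feat` has shape (C, D, H, W): channel-wise L2 norm, min-max normalized."""
    strength = np.linalg.norm(feat, axis=0)          # (D, H, W)
    lo, hi = strength.min(), strength.max()
    return (strength - lo) / max(hi - lo, 1e-8)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8, 8))            # toy 16-channel feature map
heat = activation_heatmap(feat)
```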

Fig. 10.

Heatmap visualization of feature activations in the ablation study of our graph design.

Graph construction methods.

We compare our k-means-based construction with two alternative methods, spectral clustering and MinCutPool. As shown in Table 9, k-means clustering achieves the highest mIoU (86.9%), while spectral clustering and MinCutPool perform slightly worse, validating that our simple clustering strategy is already effective. We also provide qualitative visualizations of the learned feature activations, processed in the same way as in Fig. 10 and overlaid on the corresponding grayscale input slices. In Fig. 11, (a) displays the original grayscale input slice; (b) shows our method, where the feature-map activations (computed as the voxel-wise L2 norm across channels) are normalized and mapped back to the input space, with strong activations clearly concentrated on the structural boundary of the target object. In contrast, the spectral clustering (c) and MinCutPool (d) baselines produce noisy and less localized activations.
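A minimal sketch of the k-means-based graph construction follows. Lloyd's algorithm with first-k initialization stands in for the paper's clustering step (whose initialization is not specified), and connecting tokens that share a cluster is likewise an illustrative assumption about how the cluster assignments define the visual graph.

```python
import numpy as np

def kmeans(feats, k, iters=20):
    """Plain Lloyd's k-means over token features; deterministic first-k init."""
    centers = feats[:k].astype(float).copy()
    assign = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):      # skip empty clusters
                centers[c] = feats[assign == c].mean(axis=0)
    return assign

def cluster_graph(assign):
    """Adjacency matrix: tokens in the same cluster are connected,
    which is then used to guide attention toward related tokens."""
    return (assign[:, None] == assign[None, :]).astype(float)

# Two clearly separated groups of token features -> block-diagonal graph.
feats = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
                  [10, 10, 10], [11, 10, 10], [10, 11, 10], [10, 10, 11]],
                 dtype=float)
adj = cluster_graph(kmeans(feats, k=2))
```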

Fig. 11.

Heatmap visualization of feature activations in the ablation study comparing our k-means graph construction with spectral and mincutpool methods.

Fusion scheme.

To validate our fusion scheme choice, we replace the direct summation used in our method with two alternative fusion methods as comparison baselines. As can be seen from Table 9, direct summation of the per-view logits provides a strong baseline (86.9% mIoU), outperforming both normalization-then-summation and learnable weighting, which suggests that complex weighting is unnecessary.
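Direct summation of the per-view logits, the best-performing scheme in Table 9, amounts to the following (the logit values are made up purely for illustration):

```python
import numpy as np

# Per-voxel class logits from the three views (rows: 2 voxels, cols: 2 classes).
logits_xy = np.array([[2.0, 0.5], [0.1, 1.0]])
logits_yz = np.array([[1.5, 1.0], [0.2, 2.0]])
logits_xz = np.array([[1.0, 0.4], [0.3, 1.5]])

# Fusion is a plain elementwise sum; the class prediction is the argmax.
fused = logits_xy + logits_yz + logits_xz
pred = fused.argmax(axis=1)
```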

Loss function design.

As can be seen from Table 9, using cross-entropy loss alone yields the best performance (86.9% mIoU). We set the Boundary Dice Loss as the boundary-aware loss, but its combination with CE does not bring further improvements (86.5%), while adding contrastive regularization even degrades the results (84.1%). This indicates that the proposed model, when trained with CE alone, already captures sufficient structural and boundary cues, and that additional loss terms may introduce redundant or noisy constraints that hinder optimization.

Decoder designs.

We validate the effectiveness of each decoder design in our method. In Tables 2, 3 and 5, we additionally provide results using a simple MLP as the decoder (denoted "MVGFormer (MLP)"), from which it can be observed that both of our proposed decoder designs achieve consistently superior performance. We also replace the decoders of Swin UNETR, MedNeXt and SwiFT with our proposed decoder designs. As Swin UNETR and MedNeXt share a similar decoder design, for a fair comparison, we also combine our transformer encoder with the decoder proposed in Swin UNETR as the "Base" framework and re-conduct the experiments on the SHREC dataset. Fig. 9(a) presents the mIoU scores for different combinations of methods and decoders: "Base" denotes the method's original framework without any changes, "MF" denotes the method with its decoder replaced by MF, and "P3DA" denotes the method with its decoder replaced by P3DA. Since our proposed MF is similar to the existing decoder designs, which are all based on multi-layer feature fusion, the improvement it brings is not significant (refer to the blue and red dashed lines in Fig. 9(a)). Our proposed P3DA significantly improves the performance of all methods, demonstrating that P3DA effectively expands the receptive field and enhances the performance of 3D transformer encoders (refer to the green dashed lines in Fig. 9(a)).

Positional encoding design.

In addition to our default multi-view positional encoding, we conducted ablations with three common alternatives: learned 3D positional encodings, sinusoidal 3D encodings, and rotary encodings. As shown in Fig. 9(b), these variants yield only marginal differences compared to our design. This demonstrates that assigning distinct encodings to each orthogonal view already provides sufficient cross-view geometry, validating that our proposed encoding strategy is both effective and efficient without introducing extra parameterization.

Loss balancing.

Our training objective combines two terms: a segmentation loss (CE) and a reconstruction loss (MSE). Both are defined as voxel-wise averages, so their magnitudes are naturally comparable and no additional weighting coefficients are required. To verify robustness, we further conduct a sensitivity analysis by introducing two coefficients, λ1 and λ2, to balance the two losses. As shown in Fig. 9(c), the model achieves the best performance when λ1=λ2=1.0, while maintaining stable results across a wide range of values, confirming that our optimization is not sensitive to the precise weighting scheme.
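The balanced objective can be sketched as below; NumPy stands in for the training framework, and the voxel-wise averaging mirrors the description above, which is what keeps the two terms on comparable scales.

```python
import numpy as np

def combined_loss(probs, target, recon, recon_target, lam1=1.0, lam2=1.0):
    """L = lam1 * CE + lam2 * MSE, both as voxel-wise averages
    (lam1 = lam2 = 1.0 is the default setting from Fig. 9(c))."""
    n = target.size
    # Cross-entropy over the probability assigned to each voxel's true class.
    ce = -np.mean(np.log(probs[np.arange(n), target.ravel()] + 1e-12))
    # Mean squared error of the reconstruction branch.
    mse = np.mean((recon - recon_target) ** 2)
    return lam1 * ce + lam2 * mse

# Perfect predictions on both terms give a (near) zero loss.
probs = np.array([[1.0, 0.0], [0.0, 1.0]])   # (n_voxels, n_classes)
target = np.array([0, 1])
recon = np.zeros(4)
recon_target = np.zeros(4)
loss = combined_loss(probs, target, recon, recon_target)
```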

6. Conclusion

In this work, we introduce the first transformer-based framework for cryo-ET segmentation, MVGFormer (multi-view perspective with graph-guided transformer). Different from current CNN-based cryo-ET segmentation methods, we use a multi-view perspective fusion transformer encoder that drives the model to focus more on the global information of the input data, thereby acquiring richer spatial information. Unlike other 3D vision transformer encoders, which use only a single view of the 3D image as input and omit multi-perspective observation information, our transformer encoder fuses feature embeddings from multiple observation perspectives to fully leverage the characteristics of 3D inputs. Additionally, two decoder designs are proposed to evaluate the effectiveness of MVGFormer's encoder feature representation. Extensive experimental results demonstrate that MVGFormer sets a new state-of-the-art on both cryo-ET tomogram and subtomogram datasets. For future work, we will curate more cryo-ET data, extend 2D prompting and clustering techniques, and propose a segmentation foundation model based on MVGFormer for cryo-ET to address biological challenges in real-world scenarios.

Limitations of our work

The main limitation of our approach is the increase in computational complexity relative to the gain in effectiveness. This is because the transformer structure lacks a strong inductive bias and thus requires larger amounts of data (for example, ViT [16] is pre-trained on ImageNet-22k). We therefore believe that dataset size limits the performance of our method. In future work, we will focus on collecting more data to train the model, and will also attempt to improve the model to enhance efficiency and reduce computational cost.

Acknowledgments

The authors acknowledge NVIDIA and its research support team for the help provided to conduct this work. This work was partially supported by the Australian Research Council (ARC) Industrial Transformation Training Centres (ITTC) for Innovative Composites for the Future of Sustainable Mining Equipment under Grant IC220100028. The joint collaboration between authors was also supported by UOW's internal funding for strategic development of competitive grant applications in the 2023 round. This work was also supported in part by U.S. NIH grants R35GM158094 and R01GM134020, NSF grants DBI2238093, DBI2422619, IIS2211597, and MCB2205148. DK acknowledges funding from NIH (R01GM133840, R35GM158267, R21AI187928) and from NSF (IIS2211598, DBI2146026, and DBI2422620).

Appendix A. Dataset description

In this section, we provide more details about the dataset we used to conduct the experiments.

A.1. Tomogram dataset

As aforementioned, each tomogram contains particles from 13 proteins plus vesicle and fiducial. Here we list all the categories of the proteins: 4V94, 4CR2, 1QVR, 1BXN, 3CF3, 1U6G, 3D2F, 2CG9, 3H84, 3DL1, 3QM1, 1S3X and 5MRC. Other than particles, the background consists mostly of water and structural noise.

A.2. Subtomogram dataset

Cryo-ET subtomogram segmentation is a binary segmentation task. The simulation process we used, however, can only generate grey-scale ground-truth masks. Hence, following [24], we use a threshold to turn the grey-scale mask into a binary one. The threshold we used in our paper is set to 300. As mentioned in Section 5.1.1, the simulated subtomogram dataset is simulated from 50 macromolecules. We provide all the macromolecules used for simulation in Table A.15.
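The thresholding step can be sketched as follows (whether the comparison at exactly 300 is strict or inclusive is our assumption):

```python
import numpy as np

def binarize_mask(grey_mask, threshold=300):
    """Turn a simulated grey-scale ground-truth mask into a binary one:
    voxels above the threshold become foreground (strict '>' is assumed)."""
    return (grey_mask > threshold).astype(np.uint8)

grey = np.array([[0, 150], [301, 800]])
binary = binarize_mask(grey)
```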

Table A.15.

Macromolecules contained in simulated cryo-ET subtomogram dataset. Each macromolecule is simulated with three different noise levels.

Macromolecules
1bxn 1c5u 1cf5 1e0p 1f1b 1hak 1hwn 1hwo 1o6c 1qoq
1vim 1viy 1yg6 1z30 2ane 2byu 2fuv 2h12 21db 2oqa
2pcr 2qes 2wiz 2z04 2zzw 3g11 3hhb 3ku0 3le7 4d4r
4dxy 4fb9 4leh 4yov 5ddz 5wde 5wvm 5x1i 5xls 6afk
6i7c 6jaw 6jhh 6jhi 6jvz 6lov 6loy 6oo1 6t3e 7cgt

Footnotes

CRediT authorship contribution statement

Haoran Li: Writing – review & editing, Writing – original draft, Visualization, Software, Methodology, Investigation, Formal analysis, Data curation; Xingjian Li: Writing – review & editing, Methodology, Data curation; Huan Wang: Writing – review & editing, Methodology; Jiahua Shi: Writing – review & editing, Supervision, Funding acquisition; Huaming Chen: Writing – review & editing, Supervision; Yizhou Zhao: Writing – review & editing; Bo Du: Writing – review & editing, Supervision; Johan Barthelemy: Writing – review & editing, Resources; Daisuke Kihara: Writing – review & editing, Funding acquisition; Jun Shen: Writing – review & editing, Supervision; Min Xu: Writing – review & editing, Supervision, Funding acquisition, Data curation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

  • [1]. Doerr A, Cryo-electron tomography, Nat. Methods 14 (1) (2017) 34.
  • [2]. Zeng X, Kahng A, Xue L, Mahamid J, Chang Y-W, Xu M, High-throughput cryo-ET structural pattern mining by unsupervised deep iterative subtomogram clustering, Proc. Natl. Acad. Sci. 120 (15) (2023) e2213149120.
  • [3]. Han R, Xu M, The advance of computational methods in cryo-electron tomography, in: Frontiers in Bioimage Informatics Methodology, World Scientific, 2024, pp. 3–54.
  • [4]. Wang T, Li B, Zhang J, Zeng X, Uddin MR, Wu W, Xu M, Deep active learning for cryo-electron tomography classification, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), IEEE, 2022, pp. 1611–1615.
  • [5]. Zhu H, Wang C, Wang Y, Fan Z, Uddin MR, Gao X, Zhang J, Zeng X, Xu M, Unsupervised multi-task learning for 3D subtomogram image alignment, clustering and segmentation, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), IEEE, 2022, pp. 2751–2755.
  • [6]. Bandyopadhyay H, Deng Z, Ding L, Liu S, Uddin MR, Zeng X, Behpour S, Xu M, Cryo-shift: reducing domain shift in cryo-electron subtomograms with unsupervised domain adaptation and randomization, Bioinformatics 38 (4) (2022) 977–984.
  • [7]. Luo Z, Zeng X, Bao Z, Xu M, Deep learning-based strategy for macromolecules classification with imbalanced data from cellular electron cryotomography, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), IEEE, 2019, pp. 1–8.
  • [8]. Moebel E, Martinez-Sanchez A, Lamm L, Righetto RD, Wietrzynski W, Albert S, Larivière D, Fourmentin E, Pfeffer S, Ortiz J, et al., Deep learning improves macromolecule identification in 3D cellular cryo-electron tomograms, Nat. Methods 18 (11) (2021) 1386–1394.
  • [9]. Purnell C, Heebner J, Swulius MT, Hylton R, Kabonick S, Grillo M, Grigoryev S, Heberle F, Waxham MN, Swulius MT, Rapid synthesis of cryo-ET data for training deep learning models, bioRxiv (2023).
  • [10]. Rice G, Wagner T, Stabrin M, Sitsel O, Prumbaum D, Raunser S, TomoTwin: generalized 3D localization of macromolecules in cryo-electron tomograms with structural data mining, Nat. Methods 20 (6) (2023) 871–880.
  • [11]. Kiewisz R, Fabig G, Conway W, Johnston J, Kostyuchenko VA, Bařinka C, Clarke O, Magaj M, Yazdkhasti H, Vallese F, et al., Accurate and fast segmentation of filaments and membranes in micrographs and tomograms with TARDIS, bioRxiv (2024).
  • [11].Kiewisz R, Fabig G, Conway W, Johnston J, Kostyuchenko VA, Bařinka C, Clarke O, Magaj M, Yazdkhasti H, Vallese F, et al. , Accurate and fast segmentation of filaments and membranes in micrographs and tomograms with TARDIS, bioRxiv (2024). [Google Scholar]
  • [12].Lamm L, Zufferey S, Righetto RD, Wietrzynski W, Yamauchi KA, Burt A, Liu Y, Zhang H, Martinez-Sanchez A, Ziegler S, et al. , MemBrain v2: an end-to-end tool for the analysis of membranes in cryo-electron tomography, bioRxiv (2024) 2024–01. [Google Scholar]
  • [13].de Teresa-Trueba I, Goetz SK, Mattausch A, Stojanovska F, Zimmerli CE, Toro-Nahuelpan M, Cheng DWC, Tollervey F, Pape C, Beck M, et al. , Convolutional networks for supervised mining of molecular patterns within cellular context, Nat. Methods 20 (2) (2023) 284–294. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [14].Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O, 3D U-Net: learning dense volumetric segmentation from sparse annotation, in: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2016, pp. 424–432. [Google Scholar]
  • [15].Hara K, Kataoka H, Satoh Y, Can spatiotemporal 3D cnns retrace the history of 2d cnns and imagenet?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555. [Google Scholar]
  • [16].Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N, An image is worth 16×16 words: transformers for image recognition at scale, in: Proceedings of the International Conference on Learning Representations, 2021. [Google Scholar]
  • [17].Han K, Wang Y, Chen H, Chen X, Guo J, Liu Z, Tang Y, Xiao A, Xu C, Xu Y, et al. , A survey on vision transformer, IEEE Trans. Pattern Anal. Mach. Intell 45 (1) (2022) 87–110. [Google Scholar]
  • [18].Roy S, Koehler G, Ulrich C, Baumgartner M, Petersen J, Isensee F, Jaeger PF, Maier-Hein KH, Mednext: transformer-driven scaling of convnets for medical image segmentation, in: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2023, pp. 405–415. [Google Scholar]
  • [19].Kim P, Kwon J, Joo S, Bae S, Lee D, Jung Y, Yoo S, Cha J, Moon T, SwiFT: swin 4D fMRI transformer, Adv. Neural Inf. Process. Syst 36 (2023), 42015–42037 [Google Scholar]
  • [20].Tang Y, Yang D, Li W, Roth HR, Landman B, Xu D, Nath V, Hatamizadeh A, Self-supervised pre-training of swin transformers for 3D medical image analysis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20730–20740. [Google Scholar]
  • [21].Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, White-head S, Berg AC, Lo W-Y, et al. , Segment anything, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026. [Google Scholar]
  • [22].Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, Fu Y, Feng J, Xiang T, Torr PHS, et al. , Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890. [Google Scholar]
  • [23].Xie E, Wang W, Yu Z, Anandkumar A, Alvarez JM, Luo P, SegFormer: simple and efficient design for semantic segmentation with transformers, Proc. Adv. Neural Inf. Process. Syst 34 (2021) 12077–12090. [Google Scholar]
  • [24].Zhu X, Chen J, Zeng X, Liang J, Li C, Liu S, Behpour S, Xu M, Weakly supervised 3D semantic segmentation using cross-image consensus and inter-voxel affinity relations, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2834–2844. [Google Scholar]
  • [25].Heebner JE, Purnell C, Hylton RK, Marsh M, Grillo MA, Swulius MT, Deep learning-based segmentation of cryo-electron tomograms, JoVE (J. Vis. Exp.) (2022) e64435. [Google Scholar]
  • [26].Khosrozadeh A, Seeger R, Witz G, Radecke J, Sorensen JB, Zuber B, CryoVes-Net: A Dedicated Framework for Synaptic Vesicle Segmentation in Cryo Electron Tomograms, bioRxiv (2024) 2024–02. [Google Scholar]
  • [27].Liu S, Du X, Xi R, Xu F, Zeng X, Zhou B, Xu M, Semi-supervised macromolecule structural classification in cellular electron cryo-tomograms using 3D autoencoding classifier, in: Proceedings of the British Machine Vision Conference (BMVC), 30, 2019. [Google Scholar]
  • [28].Gao S, Han R, Zeng X, Cui X, Liu Z, Xu M, Zhang F, Dilated-densenet for macromolecule classification in cryo-electron tomography, in: Proceedings of the International Symposium on Bioinformatics Research and Applications, Springer, 2020, pp. 82–94. [Google Scholar]
  • [29].Gupta T, He X, Uddin MR, Zeng X, Zhou A, Zhang J, Freyberg Z, Xu M, Self-supervised learning for macromolecular structure classification based on cryo-electron tomograms, Front. Physiol 13 (2022) 957484. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [30].Zeng Y, Howe G, Yi K, Zeng X, Zhang J, Chang Y-W, Xu M, Unsupervised domain alignment based open set structural recognition of macromolecules captured by cryo-electron tomography, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), IEEE, 2021, pp. 106–110. [Google Scholar]
  • [31].Liu Y-T, Zhang H, Wang H, Tao C-L, Bi G-Q, Zhou ZH, Isotropic reconstruction for electron tomography with deep learning, Nat. Commun 13 (1) (2022) 6482. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [32].Zhang H, Li Y, Liu Y, Li D, Wang L, Song K, Bao K, Zhu P, A method for restoring signals and revealing individual macromolecule states in cryo-ET, REST, Nat. Commun 14 (1) (2023) 2937. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [33].Xiao Y, Yuan Q, Jiang K, He J, Lin C-W, Zhang L, TTST: A top-k token selective transformer for remote sensing image super-resolution, IEEE Trans. Image Process 33 (2024) 738–752. [DOI] [PubMed] [Google Scholar]
  • [34].Hou Z, Shang Y, Yan Y, FBPT: A Fully Binary Point Transformer, (2024). arXiv:2403.09998
  • [35].Xiao Y, Yuan Q, Jiang K, Jin X, He J, Zhang L, C.-w. Lin, Local-global temporal difference learning for satellite video super-resolution, IEEE Trans. Circuits Syst. Video Technol 34 (4) (2023) 2789–2802. [Google Scholar]
  • [36].Yao J, Zhang B, Li C, Hong D, Chanussot J, Extended vision transformer (ExViT) for land use and land cover classification: a multimodal deep learning framework, IEEE Trans. Geosci. Remote Sens 61 (2023), 1–15. [Google Scholar]
  • [37].Roy SK, Deria A, Hong D, Rasti B, Plaza A, Chanussot J, Multimodal fusion transformer for remote sensing image classification, IEEE Trans. Geosci. Remote Sens 61 (2023), 1–20. [Google Scholar]
  • [38].Zhou W, Kamata S-I, Wang H, Xue X, Multiscanning-based rnn-transformer for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens 61 (2023), 1–19 [Google Scholar]
  • [39].Cheng B, Misra I, Schwing AG, Kirillov A, Girdhar R, Masked-attention mask transformer for universal image segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1290–1299. [Google Scholar]
  • [40].Jain J, Li J, Chiu MT, Hassani A, Orlov N, Shi H, OneFormer: one transformer to rule universal image segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2989–2998. [Google Scholar]
  • [41].Yuan F, Zhang Z, Fang Z, An effective CNN and transformer complementary network for medical image segmentation, Pattern Recognit. 136 (2023) 109228. [Google Scholar]
  • [42].Yu H, Tian Y, Ye Q, Liu Y, Spatial transform decoupling for oriented object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, 38, 2024, pp. 6782–6790. [Google Scholar]
  • [43].Li F, Zhang H, Xu H, Liu S, Zhang L, Ni LM, Shum H-Y, Mask DINO: towards a unified transformer-based framework for object detection and segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 3041–3050. [Google Scholar]
  • [44].Yan J, Liu Y, Sun J, Jia F, Li S, Wang T, Zhang X, Cross modal transformer: towards fast and robust 3D object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 18268–18278. [Google Scholar]
  • [45].Wang Z, Chetouani A, Jarraya M, Hans D, Jennane R, Transformer with selective shuffled position embedding and key-patch exchange strategy for early detection of knee osteoarthritis, Expert Syst. Appl 255 (2024) 124614. [Google Scholar]
  • [46].Dhinagar NJ, Thomopoulos SI, Laltoo E, Thompson PM, Efficiently training vision transformers on structural MRI scans for Alzheimer’s disease detection, in: Proceedings of the Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), IEEE, 2023, pp. 1–6. [Google Scholar]
  • [47].Alp S, Akan T, Bhuiyan MS, Disbrow EA, Conrad SA, Vanchiere JA, Kevil CG, Bhuiyan MAN, Joint transformer architecture in brain 3D MRI classification: its application in Alzheimer’s disease classification, Sci. Rep 14 (1) (2024) 8996. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [48].Yan Q, Liu S, Xu S, Dong C, Li Z, Shi JQ, Zhang Y, Dai D, 3D Medical image segmentation using parallel transformers, Pattern Recognit. 138 (2023) 109432. [Google Scholar]
  • [49].Wu Y, Heng Y, Niranjan M, Kim H, SliceFormer: deep dense depth estimation from a single indoor omnidirectional image using a slice-based transformer, in: Proceedings of the International Conference on Electronics, Information, and Communication (ICEIC), IEEE, 2024, pp. 1–4. [Google Scholar]
  • [50].Trinidad MC, Brualla RM, Kainz F, Kontkanen J, Multi-view image fusion, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4101–4110. [Google Scholar]
  • [51].Xu Y, Jiang Z, Men A, Wang H, Luo H, Multi-view feature fusion for person re-identification, Knowledge-Based Syst. 229 (2021) 107344. [Google Scholar]
  • [52].Zheng D, Zheng X, Yang LT, Gao Y, Zhu C, Ruan Y, Mffn: multi-view feature fusion network for camouflaged object detection, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 6232–6242. [Google Scholar]
  • [53].Lee DI, Park H, Seo J, Park E, Park H, Baek HD, Shin S, Kim S, Kim S, Editsplat: multi-view fusion and attention-guided optimization for view-consistent 3D scene editing with 3D Gaussian splatting, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 11135–11145. [Google Scholar]
  • [54].Xu M, Zhang Z, Wei F, Hu H, Bai X, Side adapter network for open-vocabulary semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2945–2954. [Google Scholar]
  • [55].Li J, Dai H, Han H, Ding Y, Mseg3d: multi-modal 3D semantic segmentation for autonomous driving, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21694–21704. [Google Scholar]
  • [56].Chen J, Lu J, Zhu X, Zhang L, Generative semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7111–7120. [Google Scholar]
  • [57].Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H, Encoder-decoder with atrous separable convolution for semantic image segmentation, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818. [Google Scholar]
  • [58].Li H, Shi J, Chen H, Du B, Maksour S, Phillips G, Dottori M, Shen J, FDNet: frequency domain denoising network for cell segmentation in astrocytes derived from induced pluripotent stem cells, in: 2024 IEEE International Symposium on Biomedical Imaging (ISBI), IEEE, 2024, pp. 1–5. [Google Scholar]
  • [59].He K, Chen X, Xie S, Li Y, Dollár P, Girshick R, Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009. [Google Scholar]
  • [60].Gubins I, Chaillet ML, White T, Bunyak F, Papoulias G, Gerolymatos S, Zacharaki EI, Moustakas K, Zeng X, Liu S, Xu M, Wang Y, van der SG, Chen C, Cui X, Zhang F, Trueba MC, Veltkamp RC, Förster F, Wang X, Kihara D, Moebel E, Nguyen NP, SHREC 2021: Classification in cryo-electron tomograms, in: Biasotti S, Dyke RM, Lai Y, Rosin PL, Veltkamp RC (Eds.), Eurographics Workshop on 3D Object Retrieval, The Eurographics Association, 2021. [Google Scholar]
  • [61].Tegunov D, Xue L, Dienemann C, Cramer P, Mahamid J, Multi-particle cryo-EM refinement with M visualizes ribosome-antibiotic complex at 3.5 Å in cells, Nat. Methods 18 (2) (2021) 186–193. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [62].Ermel U, Cheng A, Ni JX, Gadling J, Venkatakrishnan M, Evans K, Asuncion J, Sweet A, Pourroy J, Wang ZS, et al. , A data portal for providing standardized annotations for cryo-electron tomography, Nat. Methods 21 (12) (2024) 2200–2202. [DOI] [PubMed] [Google Scholar]
  • [63].Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE, UCSF Chimera—a visualization system for exploratory research and analysis, J. Comput. Chem 25 (13) (2004) 1605–1612. [DOI] [PubMed] [Google Scholar]
  • [64].Galaz-Montoya JG, Flanagan J, Schmid MF, Ludtke SJ, Single particle tomography in EMAN2, J. Struct. Biol 190 (3) (2015) 279–290. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [65].Liu S, Ban X, Zeng X, Zhao F, Gao Y, Wu W, Zhang H, Chen F, Hall T, Gao X, et al. , A unified framework for packing deformable and non-deformable subcellular structures in crowded cryo-electron tomogram simulation, BMC Bioinform. 21 (1) (2020a) 399. [Google Scholar]
  • [66].Liu S, Ma Y, Ban X, Zeng X, Nallapareddy V, Chaudhari A, Xu M, Efficient cryo-electron tomogram simulation of macromolecular crowding with application to sars-cov-2, in: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2020b, pp. 80–87. [Google Scholar]
  • [67].Guo Q, Lehmer C, Martínez-Sánchez A, Rudack T, Beck F, Hartmann H, Pérez-Berlanga M, Frottin F, Hipp MS, Hartl FU, et al. , In situ structure of neuronal C9orf72 poly-GA aggregates reveals proteasome recruitment, Cell 172 (4) (2018) 696–705. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [68].Prichard A, Lee J, Laughlin TG, Lee A, Thomas KP, Sy AE, Spencer T, Asavavimol A, Cafferata A, Cameron M, et al. , Identifying the core genome of the nucleus-forming bacteriophage family and characterization of Erwinia phage RAY, Cell Rep. 42 (5) (2023), 112432. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [69].Zeng X, Xu M, Gum-net: unsupervised geometric matching for fast and accurate 3D subtomogram image alignment and averaging, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4073 −4084. [Google Scholar]
  • [70].Liao X, Li W, Xu Q, Wang X, Jin B, Zhang X, Wang Y, Zhang Y, Iteratively=-refined interactive 3D medical image segmentation with multi-agent reinforcement learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9394–9402. [Google Scholar]
  • [71].Li H, Li X, Shi J, Chen H, Du B, Kihara D, Barthelemy J, Shen J, Xu M, Vox-UDA: voxel-wise unsupervised domain adaptation for cryo-electron subtomogram segmentation with denoised pseudo-labeling, in: Proceedings of the AAAI Conference on Artificial Intelligence, 39, 2025, pp. 406–414. [Google Scholar]
  • [72].Kingma DP, Ba J, Adam: A method for stochastic optimization, (2014). arXiv:1412.6980
  • [73].Jeong JG, Choi S, Kim YJ, Lee W-S, Kim KG, Deep 3D attention CLSTM U-Net based automated liver segmentation and volumetry for the liver transplantation in abdominal CT volumes, Sci. Rep 12 (1) (2022) 6370. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [74].Hrabe T, Chen Y, Pfeffer S, Cuellar LK, Mangold A-V, Förster F, PyTom: a python-based toolbox for localization of macromolecules in cryo-electron tomograms and subtomogram analysis, J. Struct. Biol 178 (2) (2012) 177–188. [DOI] [PubMed] [Google Scholar]
  • [75].Chen H, Dou Q, Yu L, Qin J, Heng P-A, VoxResNet: deep voxelwise residual networks for brain segmentation from 3D MR images, NeuroImage 170 (2018) 446–455. [DOI] [PubMed] [Google Scholar]
  • [76].Wagner T, Merino F, Stabrin M, Moriya T, Antoni C, Apelbaum A, Hagel P, Sitsel O, Raisch T, Prumbaum D, et al. , SPHIRE-CrYOLO is a fast and accurate fully automated particle picker for cryo-EM, Commun. Biol 2 (1) (2019) 218. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [77].Tang G, Peng L, Baldwin PR, Mann DS, Jiang W, Rees I, Ludtke SJ, EMAN2: an extensible image processing suite for electron microscopy, J. Struct. Biol 157 (1) (2007) 38–46. [DOI] [PubMed] [Google Scholar]
  • [78].Al-Azzawi A, Ouadou A, Max H, Duan Y, Tanner JJ, Cheng J, DeepCryoPicker: fully automated deep neural network for single protein particle picking in cryo-EM, BMC Bioinform. 21 (2020) 1–38. [Google Scholar]
  • [79].Liu G, Niu T, Qiu M, Zhu Y, Sun F, Yang G, DeepETPicker: fast and accurate 3D particle picking for cryo-electron tomography using weakly supervised deep learning, Nat. Commun 15 (1) (2024) 2090. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [80].Dhakal A, Gyawali R, Wang L, Cheng J, Artificial intelligence in cryo-EM protein particle picking: recent advances and remaining challenges, Brief. Bioinform 26 (1) (2025) bbaf011. [Google Scholar]
