Table 1. Summary of the proposed framework.
| Dimension | Component | Function / Output |
|---|---|---|
| Input | 3D input with XY, XZ and YZ projections | Orthogonal cryo-ET views providing complementary cues |
| Encoder | Transformer encoder with multi-view tokens | Captures cross-view semantic consistency |
| Fusion | Graph-based aggregation module | Models spatial and frequency-level relationships |
| Decoder | Multi-scale convolutional layers | Produces voxel-wise segmentation output |
| Learning Objective | View-masked self-supervised learning (SSL) + cross-entropy (CE) loss | Jointly optimizes reconstruction and segmentation |
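
The learning objective in the last row combines a reconstruction term with a segmentation term. The following is a minimal sketch of how such a joint loss could be computed, not the framework's actual implementation. It assumes that the reconstruction term is a mean squared error restricted to the masked positions and that the two terms are combined with a scalar weight `lam`; the table specifies neither choice, and all function names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one voxel's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(voxel_logits, voxel_labels):
    """Mean cross-entropy over voxels.

    voxel_logits: list of per-voxel class-logit lists.
    voxel_labels: list of integer class indices, one per voxel.
    """
    losses = [
        -math.log(softmax(logits)[label])
        for logits, label in zip(voxel_logits, voxel_labels)
    ]
    return sum(losses) / len(losses)


def masked_reconstruction_mse(reconstructed, original, mask):
    """Mean squared error over masked positions only.

    mask[i] == 1 marks a value that was hidden from the encoder.
    MSE is an assumption; the source only names "view-masked SSL".
    """
    errors = [
        (r - o) ** 2
        for r, o, m in zip(reconstructed, original, mask)
        if m
    ]
    return sum(errors) / len(errors) if errors else 0.0


def joint_loss(voxel_logits, voxel_labels, reconstructed, original, mask, lam=1.0):
    """Segmentation CE plus weighted masked reconstruction error.

    lam is a hypothetical weighting factor, not taken from the source.
    """
    seg = cross_entropy(voxel_logits, voxel_labels)
    rec = masked_reconstruction_mse(reconstructed, original, mask)
    return seg + lam * rec
```

With this formulation, setting `lam=0.0` reduces the objective to purely supervised segmentation, which gives a simple baseline for checking how much the masked reconstruction term contributes.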