Fig. 2. Overview of our proposed MVGFormer. (a) We cut the whole tomogram into patches and feed each patch to MVGFormer as input. For each input, a multi-view transform and linear projection are applied to obtain sequence-level feature embeddings from three different observation perspectives. A unique position embedding is added to each feature embedding, which is then fed into the transformer encoder. Each input is also passed to the context encoder, which generates a visual graph that serves as attention guidance. (b) The structure of the transformer layer. To obtain voxel-level segmentation, we design two different decoders: (c) the multi-level feature fusion segmentor and (d) the parallel 3D atrous convolution segmentor. Best viewed in color.
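To make the two components described above concrete, the snippet below is a minimal PyTorch sketch of the multi-view embedding path in (a) and the parallel 3D atrous convolution segmentor in (d). It assumes cubic patches sliced along the three axes and ASPP-style dilated branches; the class names, `patch_size`, `embed_dim`, and the dilation rates `(1, 2, 4)` are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn


class MultiViewEmbedding(nn.Module):
    """Sketch of the multi-view transform + linear projection in Fig. 2(a).

    A cubic tomogram patch is viewed along each of its three axes; every
    view is flattened into a sequence of slice tokens, linearly projected,
    and offset by a view-specific position embedding. All dimensions here
    are hypothetical.
    """

    def __init__(self, patch_size: int = 32, embed_dim: int = 256):
        super().__init__()
        # Each view yields `patch_size` slices of patch_size**2 voxels.
        self.proj = nn.Linear(patch_size * patch_size, embed_dim)
        # One learnable position embedding per observation perspective.
        self.pos = nn.ParameterList(
            [nn.Parameter(torch.zeros(1, patch_size, embed_dim))
             for _ in range(3)]
        )

    def forward(self, x: torch.Tensor):
        # x: (B, D, H, W) cubic patch with D == H == W == patch_size.
        views = [x,                        # axial slices
                 x.permute(0, 2, 3, 1),    # coronal slices
                 x.permute(0, 3, 1, 2)]    # sagittal slices
        # Flatten each slice into a token, project, and add positions.
        return [self.proj(v.flatten(2)) + p
                for v, p in zip(views, self.pos)]


class ParallelAtrousSegmentor(nn.Module):
    """Sketch of the parallel 3D atrous convolution segmentor in Fig. 2(d).

    Parallel 3D convolutions with increasing dilation rates capture
    multi-scale context; the concatenated branch outputs are fused by a
    1x1x1 convolution into per-voxel class logits.
    """

    def __init__(self, in_ch: int = 256, num_classes: int = 2,
                 rates=(1, 2, 4)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for k=3.
        self.branches = nn.ModuleList(
            [nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=r, dilation=r)
             for r in rates]
        )
        self.fuse = nn.Conv3d(in_ch * len(rates), num_classes, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, D, H, W) volumetric feature map from the encoder.
        multi_scale = torch.cat([b(feat) for b in self.branches], dim=1)
        return self.fuse(multi_scale)  # voxel-level segmentation logits
```

Under these assumptions, a 32-cubed patch yields three 32-token sequences (one per viewing axis) for the transformer encoder, while the atrous decoder trades a single large receptive field for several parallel dilated ones, which preserves voxel resolution while still aggregating multi-scale context.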