2023 Jul 4;17:1219363. doi: 10.3389/fnins.2023.1219363

Figure 3.

Architecture of the proposed approach. The end-to-end framework contains three modules: (1) feature extraction from the input images, (2) projection of the image features to bird's-eye-view (BEV) features, and (3) prediction of the BEV semantic representation. A Transformer-based network, powered by multi-head attention, encodes global-local spatial features. Image-plane features are transformed into BEV-plane features while losing as little feature information as possible. A generative network then processes the BEV global-local spatial features and predicts the final classification probabilities.
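
The data flow through the three modules can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the caption does not specify how the projection is computed, so the sketch assumes one common choice, cross-attention from a grid of BEV queries to flattened image features. All names (`multi_head_cross_attention`, `bev_pipeline`, `bev_queries`), the random weights standing in for learned parameters, and the plain linear classifier standing in for the generative network are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(queries, features, n_heads, rng):
    """Attend from queries (Nq, d) to features (Nk, d).

    Random projection matrices stand in for learned weights (assumption).
    Returns the attended output (Nq, d) and the attention weights
    (n_heads, Nq, Nk).
    """
    n_q, d = queries.shape
    assert d % n_heads == 0, "feature size must be divisible by n_heads"
    d_head = d // n_heads
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)
    )

    def split(x):
        # (N, d) -> (n_heads, N, d_head)
        return x.reshape(x.shape[0], n_heads, d_head).transpose(1, 0, 2)

    q = split(queries @ w_q)
    k = split(features @ w_k)
    v = split(features @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (n_heads, Nq, d_head)

    # Concatenate the heads and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(n_q, d)
    return merged @ w_o, weights


def bev_pipeline(image_features, bev_h, bev_w, n_classes, n_heads=4, seed=0):
    """Toy version of modules (2) and (3): image features -> BEV class probabilities.

    image_features: (n_pixels, d) array, assumed to come from module (1).
    """
    rng = np.random.default_rng(seed)
    d = image_features.shape[1]

    # Module (2): one query vector per BEV grid cell (hypothetical design).
    bev_queries = rng.standard_normal((bev_h * bev_w, d))
    bev_features, attn = multi_head_cross_attention(
        bev_queries, image_features, n_heads, rng
    )

    # Module (3): a linear classifier stands in for the generative network.
    w_cls = rng.standard_normal((d, n_classes)) / np.sqrt(d)
    probs = softmax(bev_features @ w_cls, axis=-1)
    return probs.reshape(bev_h, bev_w, n_classes), attn


if __name__ == "__main__":
    # Stand-in for module (1): a 6x8 feature map with 32 channels, flattened.
    feats = np.random.default_rng(1).standard_normal((6 * 8, 32))
    probs, attn = bev_pipeline(feats, bev_h=5, bev_w=5, n_classes=3)
    print(probs.shape, attn.shape)
```

Each BEV cell ends up with a probability distribution over classes, and the attention weights record which image locations contributed to each cell. In a trained model the queries and projection matrices would be learned rather than sampled at random.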