2024 Aug 2;4:156. doi: 10.1038/s43856-024-00581-0

Fig. 4. Schematic network architecture of the three-stage model.

The architecture comprises three feature-extractor backbones whose outputs are concatenated into the feature vector F. A temporal multi-frame model, consisting of multiple stages of transformer-based LTContext blocks, refines F to anticipate the next required instrument and the corresponding surgical phase. A subsequent informed model then validates the instrument prediction against a phase compatibility matrix and adjusts uncertain predictions. The focal loss is calculated after each stage and aggregated for joint training of the model.
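
The last two steps of the caption (per-stage focal loss and the phase-informed adjustment) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the caption does not say how the stage losses are aggregated, how "uncertain" is defined, or how a prediction is adjusted. The sketch assumes a plain sum over stages, a hypothetical confidence threshold, and a mask-and-renormalize rule driven by a binary compatibility matrix. All function and parameter names (`focal_loss`, `aggregate_stage_losses`, `adjust_with_phase`, `gamma`, `threshold`) are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    With gamma = 0 this reduces to the ordinary cross-entropy.
    """
    p_t = max(probs[target], 1e-12)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def aggregate_stage_losses(stage_logits, target, gamma=2.0):
    """Aggregate the focal loss over the outputs of all stages.

    ASSUMPTION: aggregation is a plain sum; the caption only says the
    per-stage losses are "aggregated".
    """
    return sum(focal_loss(softmax(l), target, gamma) for l in stage_logits)


def adjust_with_phase(instr_probs, phase, compat, threshold=0.6):
    """Adjust an uncertain instrument prediction using the predicted phase.

    compat[phase][i] is 1 if instrument i may occur in that phase, else 0.
    ASSUMPTION: a prediction counts as uncertain when its top probability
    is below `threshold`; incompatible instruments are then masked out and
    the remaining probabilities renormalized.
    """
    if max(instr_probs) >= threshold:
        return list(instr_probs)  # confident prediction: leave unchanged
    masked = [p * c for p, c in zip(instr_probs, compat[phase])]
    total = sum(masked)
    if total == 0.0:
        return list(instr_probs)  # nothing compatible: keep the original
    return [p / total for p in masked]


# Hypothetical example: 2 phases, 3 instruments.
compat = [[1, 1, 0],
          [0, 1, 1]]
adjusted = adjust_with_phase([0.5, 0.3, 0.2], phase=1, compat=compat)
```

In this toy example the top-ranked instrument is incompatible with phase 1, so its probability is removed and the mass is redistributed over the two compatible instruments. A confident prediction (top probability at or above the threshold) passes through unchanged.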