Table 3.
Model | Potential advantages | Potential limitations |
---|---|---|
2D CNN | Simple to use, useful when region of interest is unique to a single frame. Leverages large-scale ImageNet dataset | Lacks temporal dependency |
3D CNN | Captures full spatial and temporal dependency, useful when the number of frames is large. Leverages large-scale Kinetics video dataset | Typically larger in model size than an equivalent 2D model due to the added kernel dimension |
Stacked 2D CNN model | Simple to use (with limited frames), easy to interpret DSA over a few individual frames. Leverages ImageNet pretraining on 2D feature extractors | Feature-level temporal dependency only; no joint spatial and temporal dependencies |
2D vision transformer | Robust to frame-level distortions, relaxed inductive bias | 2D features only. Limited training dataset size may limit the final performance |
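
The model-size limitation listed for the 3D CNN can be illustrated with a small parameter count. The sketch below is not taken from the models compared in the table: the helper `conv_params`, the channel width of 64, and the kernel size of 3 are hypothetical values chosen only to show how the extra kernel dimension scales the weight count of a single dense convolution layer.

```python
def conv_params(in_channels, out_channels, kernel_size, dims, bias=True):
    """Count learnable parameters in one dense convolution layer.

    dims=2 assumes a k x k kernel (2D CNN), dims=3 a k x k x k kernel (3D CNN).
    """
    weights = in_channels * out_channels * kernel_size ** dims
    return weights + (out_channels if bias else 0)


# Illustrative layer sizes (assumed, not from the table's models).
p2d = conv_params(64, 64, 3, dims=2)
p3d = conv_params(64, 64, 3, dims=3)

print(f"2D layer: {p2d} parameters")
print(f"3D layer: {p3d} parameters")
print(f"ratio 3D/2D: {p3d / p2d:.2f}")
```

With a cubic kernel of size k, the weight count of each layer grows by roughly a factor of k relative to its 2D counterpart, which is why a 3D network with the same depth and channel widths is typically larger.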