2026 Jan 21;16:5925. doi: 10.1038/s41598-026-36095-z

Table 1.

Cross-modal feature dimension comparison.

Modality type	Original feature dimension	Reduced feature dimension	Semantic correlation with knowledge graph entities
Visual (video)	2048 × T (T = number of frames)	512	0.763
Textual description	768 (BERT-base)	512	0.821
Motion sequence	3J × T (J = number of joints)	512	0.795
Integrated features	–	768	0.879
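
To make the dimension bookkeeping in Table 1 concrete, the following is a minimal shape-only sketch in NumPy. The table does not say how the reduction or integration is performed, so everything operational here is an assumption made for illustration: mean pooling over the T frames, random linear projections standing in for learned layers, fusion by concatenation followed by a projection, and the hypothetical values T = 16 and J = 17. Only the dimensions 2048, 768, 3J, 512 and 768 come from the table.

```python
import numpy as np

rng = np.random.default_rng(0)

T, J = 16, 17                  # hypothetical frame and joint counts (not from the table)
D_SHARED, D_FUSED = 512, 768   # reduced and integrated dimensions from Table 1


def project(x, out_dim):
    """Random linear projection; a stand-in for a learned layer (assumption)."""
    w = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return x @ w


# Raw features with the original dimensions listed in Table 1.
visual = rng.standard_normal((T, 2048))    # 2048 × T, stored as (T, 2048)
text = rng.standard_normal(768)            # BERT-base sentence embedding size
motion = rng.standard_normal((T, 3 * J))   # 3J × T, stored as (T, 3J)

# Assumed reduction: pool over time, then project each modality to 512.
visual_512 = project(visual.mean(axis=0), D_SHARED)
text_512 = project(text, D_SHARED)
motion_512 = project(motion.mean(axis=0), D_SHARED)

# Assumed integration: concatenate (3 × 512 = 1536) and project to 768.
fused = project(np.concatenate([visual_512, text_512, motion_512]), D_FUSED)

print(visual_512.shape, text_512.shape, motion_512.shape, fused.shape)
```

The sketch reproduces only the shapes in the table. The correlation values depend on the trained model and the knowledge graph embeddings, so they cannot be recovered from random projections like these.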