The architecture of the proposed hierarchical attention network. It contains three main components: a) a convolutional feature extractor, b) a beat encoder, and c) a window encoder. The input Doppler signal is divided into windows of 3.75 s. The scalogram of each window is calculated before it is fed to the network, and the window's length in time samples is measured after the time-frequency feature construction.
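
The preprocessing named in the caption (windowing followed by a per-window scalogram) can be sketched as below. This is a minimal illustration, not the authors' implementation: the sampling rate, the use of non-overlapping windows, the Morlet-style wavelet, the `n_cycles` setting, and the analysis frequencies are all assumptions, and the function names `split_into_windows` and `morlet_scalogram` are hypothetical. Only the 3.75 s window length comes from the text.

```python
import cmath
import math

WINDOW_SECONDS = 3.75  # window length stated in the caption


def split_into_windows(signal, sample_rate_hz, window_seconds=WINDOW_SECONDS):
    """Split a 1-D signal into non-overlapping windows (assumption: no overlap).

    Any incomplete window at the end of the signal is dropped.
    """
    samples_per_window = int(round(window_seconds * sample_rate_hz))
    n_windows = len(signal) // samples_per_window
    return [
        signal[i * samples_per_window:(i + 1) * samples_per_window]
        for i in range(n_windows)
    ]


def morlet_scalogram(window, sample_rate_hz, frequencies_hz, n_cycles=6.0):
    """Return scalogram magnitudes as a list of rows, one row per frequency.

    Assumption: a Morlet-style kernel (Gaussian envelope times a complex
    exponential); the wavelet actually used in the paper is not given here.
    """
    n = len(window)
    rows = []
    for freq in frequencies_hz:
        sigma = n_cycles / (2.0 * math.pi * freq)  # envelope width in seconds
        half = int(math.ceil(3.0 * sigma * sample_rate_hz))  # +/- 3 sigma support
        kernel = [
            math.exp(-0.5 * (k / sample_rate_hz / sigma) ** 2)
            * cmath.exp(2j * math.pi * freq * k / sample_rate_hz)
            for k in range(-half, half + 1)
        ]
        norm = sum(abs(c) for c in kernel)  # normalise by the envelope sum
        row = []
        for t in range(n):
            acc = 0j
            for j, c in enumerate(kernel):
                idx = t + j - half
                if 0 <= idx < n:  # samples outside the window count as zero
                    acc += window[idx] * c.conjugate()
            row.append(abs(acc) / norm)
        rows.append(row)
    return rows


if __name__ == "__main__":
    fs = 64.0  # hypothetical sampling rate in Hz
    # 10 s synthetic 4 Hz tone standing in for a Doppler signal
    signal = [math.sin(2.0 * math.pi * 4.0 * i / fs) for i in range(640)]
    windows = split_into_windows(signal, fs)
    scalogram = morlet_scalogram(windows[0], fs, [2.0, 4.0, 8.0])
    print(len(windows), len(windows[0]), len(scalogram), len(scalogram[0]))
```

With these assumed settings, each window yields a frequency-by-time array whose time axis has the same number of samples as the window. The direct convolution is written for readability rather than speed; an FFT-based or library transform would be the practical choice for real recordings.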