Abstract
We describe an automatic natural language processing (NLP)-based image captioning method to describe fetal ultrasound video content by modelling the vocabulary commonly used by sonographers and sonologists. The generated captions are similar to the words spoken by a sonographer when describing the scan experience in terms of visual content and performed scanning actions. Using full-length second-trimester fetal ultrasound videos and text derived from accompanying expert voice-over audio recordings, we train deep learning models consisting of convolutional neural networks and recurrent neural networks in merged configurations to generate captions for ultrasound video frames. We evaluate different model architectures using established general metrics (BLEU, ROUGE-L) and application-specific metrics. Results show that the proposed models can learn joint representations of image and text to generate relevant and descriptive captions for anatomies, such as the spine, the abdomen, the heart, and the head, in clinical fetal ultrasound scans.
Keywords: Image Description, Image Captioning, Deep Learning, Natural Language Processing, Recurrent Neural Networks, Fetal Ultrasound
1. Introduction
Automatic image captioning combines computer vision with natural language processing to generate a textual statement, called a caption, that represents image content. Image captioning has been widely explored for natural images with benchmark datasets [1]; however, most established image-captioning datasets do not include medical images. Preparing medical image captioning benchmarks is challenging for two reasons: (a) describing medical images with specific terminology requires the expert knowledge of medical professionals; and (b) the sensitive nature of medical images prevents wide-scale annotation, for instance, through crowd-sourcing services. Consequently, automatic image captioning has rarely been studied on ultrasound images, a challenge compounded by: (1) more subtle differences between ultrasound images than between natural ones; and (2) the lack of readily available large datasets of ultrasound images with captions. To the best of our knowledge, this is the first attempt to perform automatic image captioning on fetal ultrasound video frames using sonographer spoken words that describe their scanning experience.
As part of routine care, pregnant women are offered a detailed fetal anomaly ultrasound scan at approximately 20 weeks of gestation to identify any fetal malformations. While developing an ultrasound image-captioning model, we can analyse the vocabulary used by sonographers to describe the scans, reflecting their experience during the scan process in terms of visual content and performed scanning actions. The aim of our work is to learn joint image-text representations to describe ultrasound images with a rich vocabulary consisting of nouns, verbs, and adjectives. A potential application of the work is as an educational tool that communicates descriptions of anatomical views of interest to the scanned subjects and to sonography trainees. An example of an image and its caption is shown in Fig. 1a. The word cloud in Fig. 1b shows the most common words spoken by sonographers to describe fetal ultrasound scans in our work.
Fig. 1.
(a) Example of a fetal ultrasound image with sonographer description: “we can see the midline of the brain where we can see the cavum septum pellucidum” (b) Word cloud of most frequently occurring words in sonographer vocabulary. Red, green, and blue represent nouns, adjectives, and verbs, respectively. The size of a word in the word cloud is proportional to its frequency of use.
Related Work
There are two main approaches to image captioning [1,20]: (a) text retrieval, where descriptions are stored beforehand and retrieved based on similarity scores between stored and query images [15]; and (b) text generation, where novel textual descriptions are generated. The latter is achieved using top-down or bottom-up approaches [23]. In the top-down approach, an image is described by translating visual representations to text; in the bottom-up approach, constituent objects and concepts in an image are described with words that are then combined into sentences using language models [3]. In both approaches, deep learning methods typically use convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to represent images and text, respectively [20,22].
We are aware of only two previous ultrasound image captioning works [12,24]. In [12], captions are generated with a top-down, deep-learning-based text generation approach for the ImageCLEF dataset, which includes radiology images of several modalities. In our work, complexity is reduced by using a merged configuration in which image feature vectors are not included as part of the input sequence to the recurrent network. In [24], the image captioning task is performed on adult abdominal ultrasound with a focus on diseases of the kidney and gallbladder, where a structure and an associated disease are classified before generating a description with an RNN trained specifically on words of that structure. In contrast, we propose models where representations are jointly learned in a single step. Both [12] and [24] use text reports as a raw source of data. We use sonographer voice-over recordings that describe the videos in real-time, thereby providing a richer description of the spatiotemporal video contents.
2. Methods
Data Acquisition and Processing
Full-length routine fetal anomaly ultrasound scan videos were acquired by an expert sonographer. The sonographer retrospectively recorded voice-overs in English for five anonymised videos with a mean duration of 37 minutes (range: 20-56 minutes), giving a total of 160 minutes of audiovisual content. From the full-length videos, freeze frames were automatically detected; sonographers freeze a frame when they find a suitable view of interest for diagnostic examination, typically an anatomical standard plane. The display frame was automatically cropped to include only the anatomical view. The speech recordings were pre-processed for anonymisation and then transcribed using the Google Cloud Speech (GCS) API [5]. GCS is designed for natural language, but the recordings contain additional medical vocabulary, and the sonographer is a non-native English speaker; hence, the transcriptions contained a few errors, which were corrected by manual post-processing. ELAN, a multimedia annotator for audiovisual content, was used to synchronise video contents with the generated transcriptions and to correct erroneous text [19]. After the transcribed words were manually checked, grouped, and synced, a file containing the captions with start and end times was produced to automatically align video frames with captions. The process of creating image-caption pairs is shown in Fig. 2. The raw text was cleaned by removing punctuation, replacing numeric characters with their word equivalents, and removing stall words (e.g. 'so yeah', 'well'). Special tokens denoted the start and end of a caption. The resulting caption lengths varied between 1 and 22 words, with a vocabulary of 158 unique words; the distribution of adjectives, determiners, nouns, and verbs was 12.7%, 22.2%, 28.0%, and 16.0%, respectively. The remaining 21.1% were prepositions, pronouns, adverbs, and other parts-of-speech.
Hence, the combined dataset was composed of real-world fetal anomaly ultrasound video freeze frames and their associated captions.
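The text-cleaning steps described above can be sketched as follows; the stall-word list and digit-to-word map below are illustrative stand-ins, not the actual lists used in this work:

```python
import re

# Illustrative stall phrases and number map (hypothetical, for demonstration only)
STALL_PHRASES = ["so yeah", "well"]
NUM_WORDS = {"2": "two", "3": "three", "4": "four", "20": "twenty"}

def clean_caption(raw: str) -> str:
    text = raw.lower()
    for phrase in STALL_PHRASES:            # remove stall words
        text = text.replace(phrase, " ")
    text = re.sub(r"[^\w\s]", " ", text)    # strip punctuation
    # replace numeric characters with their word equivalents
    tokens = [NUM_WORDS.get(t, t) for t in text.split()]
    # add special tokens denoting the start and end of the caption
    return " ".join(["<start>"] + tokens + ["<end>"])

# clean_caption("So yeah, we can see the 4 chamber view.")
# -> "<start> we can see the four chamber view <end>"
```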
Fig. 2.
Data acquisition and processing pipeline
Model Architecture
Image captioning often involves a CNN to encode image information followed by an RNN as a decoder to generate text [22]. However, to reduce computational complexity, we use the RNN solely as a textual feature extractor and combine the encoded image information from a CNN with the textual features in merged configurations [20,21]. The model diagram is shown in Fig. 3. One branch of the model is a CNN based on the VGGNet16 [18] architecture, pre-trained on the ImageNet dataset and fine-tuned on fetal ultrasound standard planes of the abdomen, face, brain, femur, heart, spine, and placenta. The other branch is the text encoding part of the model: an embedding layer, which maps each word in the sequence to a vector, followed by an RNN.
Fig. 3.
Image captioning model (concatenation configurations)
Features are extracted from the ultrasound video frames using the fine-tuned CNN. A textual caption is encoded by an embedding layer followed by a recurrent layer. The branches are merged, followed by a fully connected layer and a decision-making layer. At every step, the model generates the next word in a caption as a probability distribution over the words in the vocabulary. We comparatively evaluated different embeddings, namely, word2vec embedding trained on the Google News corpus [6], GloVe embedding trained on the Wikipedia-2014 corpus [17], and plain random initialisation. Word2vec is a shallow neural network trained to predict the context around a given word in a skip-gram model [14]. GloVe incorporates word co-occurrence probabilities, the idea being that words occurring together sufficiently often are likely to hold underlying semantic meaning. The embedded word vectors are processed by an RNN consisting of a Long Short-Term Memory (LSTM) unit [8] or a Gated Recurrent Unit (GRU) [2]. GRUs have fewer trainable parameters than LSTMs and require fewer operations, which makes them more efficient to train and better suited to smaller datasets. The two branches produce tensors of different lengths (200 and 300, respectively) that are joined by merging. We compare two merge methods, namely, concatenation and addition. In concatenation configurations, text and image feature vectors of unequal length are combined, deliberately biasing the model towards the text branch when generating the next word so that the generated caption is textually well structured. In addition configurations, both branches output vectors of equal length (300), which are summed element-wise.
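The two merge modes can be illustrated with plain vector arithmetic; the projection matrix `W` used to equalise lengths for the addition mode is a hypothetical stand-in for a trained dense layer:

```python
import numpy as np

rng = np.random.default_rng(0)
img_feat = rng.standard_normal(200)  # one branch output (illustrative 200-d)
txt_feat = rng.standard_normal(300)  # other branch output (300-d)

# Concatenation: branch lengths may differ; the merged vector is 500-d
merged_cat = np.concatenate([img_feat, txt_feat])

# Addition: branches must agree in length, so the 200-d features are first
# projected to 300-d (W stands in for a trained dense layer)
W = 0.01 * rng.standard_normal((300, 200))
merged_add = W @ img_feat + txt_feat  # element-wise sum of equal-length vectors
```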
Training Process
Sixty-five percent of the total data was used for training and thirty-five percent for testing. To address class imbalance, an equal number of unique captions was included for each anatomical class, namely, abdomen, head, heart, and spine. In addition to class imbalance, caption imbalance is observed, as some captions correspond to more than one video frame because sonographers spend different amounts of time looking at different fetal structures. For training the deep learning models, we excluded captions that do not describe one of the four anatomical structures of highest interest, i.e., head, heart, spine, and abdomen. These anatomical classes were selected because they are the best represented in the collected dataset, forming 40%, 22%, 20%, and 18% of the data, respectively. From caption pre-processing, vocabularies for each of the four anatomical classes of interest were obtained. Lexical diversity scores, specifically MTLD [13], which measures word variety in a vocabulary, were 21.2, 20.3, 17.9, and 21.8, respectively.
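For reference, a simplified, one-directional MTLD sketch is given below (the published measure averages a forward and a backward pass over the token sequence [13]); it illustrates how lower lexical diversity yields lower scores:

```python
def mtld_forward(tokens, ttr_threshold=0.72):
    """One-directional MTLD sketch: count how many times the running
    type-token ratio (TTR) falls below the threshold, then divide the
    token count by the number of such 'factors'."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < ttr_threshold:
            factors += 1                    # a full factor is complete; reset
            types, count = set(), 0
    if count:                               # partial factor for the remainder
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - ttr_threshold)
    return len(tokens) / factors if factors else float(len(tokens))
```

A highly repetitive sequence scores low (e.g. fifty repeats of one word give 2.0), while a sequence of all-distinct words scores as high as its length.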
The training set consisted of 2,240 image-caption pairs and the validation set of 560 image-caption pairs. The images were resized to 224×224 pixels. Each image in the dataset was augmented twice: first by rotating by an angle between -30° and 30°, and second by horizontally flipping the image. The pre-trained VGGNet16 was first fine-tuned on ultrasound images. During training of the image captioning models, 'teacher forcing' was applied, whereby ground truth sequences of increasing length were used at every step rather than the words generated by the model at previous steps [4]. During inference, the model was called recursively: the sequence of words generated so far was fed back into the model at every time-step, along with the corresponding image, until the model generated the special end token or the maximum caption length was reached. Adam optimisation [9] and categorical cross-entropy loss were applied during training. Early stopping halted training when the validation loss did not improve for more than five epochs. Dropout (rate between 0.4 and 0.5) was used to reduce overfitting.
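The recursive inference loop can be sketched as follows, with a toy stub in place of the trained network; `stub_model` is a hypothetical next-word predictor that simply replays a canned caption so the loop is runnable:

```python
def stub_model(image, seq):
    # Stand-in for the trained network: given the image and the words
    # generated so far, return the next word (here, from a fixed caption).
    canned = ["this", "is", "the", "fetal", "head", "<end>"]
    return canned[min(len(seq) - 1, len(canned) - 1)]

def generate_caption(image, model, max_len=22):
    # Start from the special start token; feed the growing sequence back
    # into the model together with the image at every time-step.
    seq = ["<start>"]
    for _ in range(max_len):
        word = model(image, seq)
        if word == "<end>":        # stop at the special end token
            break
        seq.append(word)
    return " ".join(seq[1:])       # drop the start token for display
```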
Evaluation Metrics
Different model configurations were compared using the established general metrics BLEU (Bilingual Evaluation Understudy) and ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence), a grammar score GB (GrammarBot) [7], a classification metric Class. F1 (F1 score), and an anatomical description metric ARS (Anatomical Relevance Score).
The objective metrics BLEU [16] and ROUGE-L [11] are calculated between the ground truth captions and the generated captions. These two metrics are commonly used to evaluate image captioning models, but they can yield low values when a pair of captions conveys the same meaning without exact n-gram matches. Hence, for our caption generation task, grammar-based, classification- and description-based, and subjective metrics were additionally evaluated. To evaluate captions grammatically, the average number of grammatical mistakes in a generated caption was calculated. Classification F1 scores were measured by assigning each generated caption to the class whose vocabulary has the highest overlap with it. We devised an anatomical relevance score (ARS) by matching words in a generated caption with the terminology of the anatomical class of interest. For example, an image of an abdomen may have a ground truth caption about the ribs, but if the generated caption describes the stomach, it is not an erroneous caption. ARS is calculated using Equations 1, 2, and 3
$CS_k = \frac{1}{L} \sum_{i=1}^{L} p_i \, \mathbb{1}_{V_k}(w_i)$ | (1) |
$SS_c = \sum_{k \in K} CS_k \, \mathbb{1}(k = GT_c)$ | (2) |
$ARS = \frac{1}{C} \sum_{c=1}^{C} SS_c$ | (3) |
where $CS_k$ is the score that a caption has in relation to anatomical class $k$, $K$ is the set of four anatomical classes, $V_k$ is the vocabulary set of class $k$, $L$ is the length of a caption $W_c$ consisting of words $w_i$ with softmax probabilities $p_i$, $\mathbb{1}_{V_k}(\cdot)$ is an indicator function which returns 1 if $w_i$ is in $V_k$ and 0 otherwise, $SS_c$ is a score that only considers $CS_k$ for the ground truth anatomical class $GT_c$, and $C$ is the total number of captions in the set.
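The ARS computation and the overlap-based class assignment can be sketched as follows; the vocabularies and probabilities in the usage comment are illustrative:

```python
def class_score(words, probs, vocab_k):
    # CS_k (Eq. 1): average softmax probability of caption words that
    # belong to the vocabulary of class k
    return sum(p for w, p in zip(words, probs) if w in vocab_k) / len(words)

def predict_class(words, vocabs):
    # Classification helper: the class whose vocabulary has the highest
    # overlap with the generated caption
    return max(vocabs, key=lambda k: len(set(words) & vocabs[k]))

def ars(captions, probs_per_caption, gt_classes, vocabs):
    # Eqs. 2-3: keep only CS_k for the ground truth class of each caption
    # (SS_c), then average over all C captions
    total = sum(
        class_score(w, p, vocabs[gt])
        for w, p, gt in zip(captions, probs_per_caption, gt_classes)
    )
    return total / len(captions)

# Example (illustrative vocabulary): a two-word caption where only "heart"
# (probability 0.9) is in the heart vocabulary gives CS = 0.9 / 2 = 0.45.
```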
3. Results
Quantitative Evaluation
Table 1 shows the quantitative evaluation results for the different model configurations. An overall score is obtained by calculating the mean of the scores (GB, for which lower is better, was normalised and inverted). The overall best performing model was the Fine-tunable-Word2vec-LSTM-Concatenation configuration, which is used to demonstrate anatomical evaluation in Table 2 and Fig. 4. For a subjective measure, Likert scale [10] based evaluations were performed, where a medical professional was asked to give a score from 0 ('No') to 2 ('Yes') in response to the following statements about a generated caption, namely, it: (1) accurately describes the image; (2) has no incorrect information; (3) is grammatically correct; (4) is relevant for this image. For each caption, the responses were averaged. These scores are reported in Table 2 as LSS (Likert Scale Scores). Knowing the original image class and the resulting caption classes, we plot the confusion matrix for the best performing configuration in Fig. 4.
Table 1. Evaluation Results of Model Configurations.
| Word Embedding | RNN | Merge Mode | BLEU-4 | ROUGE-L | GB↓ | Class. F1 | ARS | Overall |
|---|---|---|---|---|---|---|---|---|
| Fine-Tunable GloVe | LSTM | Concatenation | 0.066 | 0.536 | 1.091 | 0.809 | 0.680 | 0.385 |
| | | Addition | 0.081 | 0.580 | 0.900 | 0.948 | 0.686 | 0.397 |
| | GRU | Concatenation | 0.081 | 0.585 | 0.889 | 0.502 | 0.455 | 0.261 |
| | | Addition | 0.094 | 0.561 | 0.923 | 0.529 | 0.449 | 0.268 |
| Fine-Tunable Word2vec | LSTM | Concatenation | 0.105 | 0.594 | 1.214 | 0.970 | 0.536 | 0.427 |
| | | Addition | 0.045 | 0.546 | 0.929 | 0.679 | 0.506 | 0.297 |
| | GRU | Concatenation | 0.080 | 0.523 | 1.200 | 0.764 | 0.594 | 0.376 |
| | | Addition | 0.086 | 0.539 | 1.077 | 0.609 | 0.476 | 0.307 |
| Pretrained Word2vec | LSTM | Concatenation | 0.085 | 0.574 | 1.200 | 0.921 | 0.567 | 0.413 |
| | | Addition | 0.063 | 0.529 | 1.267 | 0.641 | 0.537 | 0.348 |
| | GRU | Concatenation | 0.066 | 0.530 | 1.100 | 0.768 | 0.718 | 0.385 |
| | | Addition | 0.062 | 0.545 | 0.917 | 0.714 | 0.648 | 0.334 |
| Random Initialisation | LSTM | Concatenation | 0.075 | 0.560 | 1.222 | 0.975 | 0.564 | 0.422 |
| | | Addition | 0.091 | 0.536 | 1.188 | 0.805 | 0.539 | 0.362 |
| | GRU | Concatenation | 0.067 | 0.507 | 1.308 | 0.763 | 0.632 | 0.394 |
| | | Addition | 0.084 | 0.525 | 0.857 | 0.625 | 0.547 | 0.287 |
Table 2. Evaluation results for the different anatomical structures.
| Structure | BLEU-3 | BLEU-4 | ROUGE-L | GB↓ | Class. F1 | ARS | LSS |
|---|---|---|---|---|---|---|---|
| Abdomen | 0.000 | 0.000 | 0.533 | 0.667 | 0.886 | 0.316 | 0.625 |
| Head | 0.122 | 0.058 | 0.479 | 2.000 | 1.000 | 0.213 | 0.625 |
| Heart | 0.252 | 0.140 | 0.581 | 0.857 | 0.993 | 0.843 | 0.500 |
| Spine | 0.319 | 0.000 | 0.789 | 0.000 | 1.000 | 0.771 | 1.000 |
Fig. 4.
Confusion Matrix
Discussion
Table 1 does not show a single configuration that is superior on every metric, but overall the Fine-tunable-Word2vec-LSTM-Concatenation configuration performs best. Its generated captions are shown in the supplementary material. It is marginally outperformed in anatomical classification scores by the Random-Initialisation-LSTM-Concatenation configuration but scores higher in BLEU-4 and ROUGE-L, suggesting that pre-trained embeddings yield better-structured captions than randomly initialised vectors. Word2vec embeddings were found to be more effective than GloVe embeddings for the fetal ultrasound datasets. It is interesting to note that, in most cases, concatenation performed better than addition, and LSTM units outperformed GRUs, even for our limited datasets. Among the anatomical classes in Table 2, abdomen and head show low BLEU-4 and ROUGE-L scores due to having the highest lexical diversity. Spine does well in ROUGE-L and GB because of its lower lexical diversity; however, its BLEU-4 is zero due to the absence of 4-gram overlaps, although a BLEU-3 of 0.319 is achieved. From the LSS, we can see that the heart class is more challenging. In clinical practice, a fetal heart is typically identified by its beating motion (a video clip) rather than a still image. Further, the current captioning system is not trained to distinguish between the different heart views, yet the textual description can be heart-view specific. Adding more image-caption pairs of distinct heart views could address this problem. Fig. 4 shows that all classes are largely identified accurately; however, the model struggles with 11% of abdomen images, misclassifying them as heart. On investigation, it was found that in these specific images the stomach bubble has an elongated appearance that somewhat resembles a heart view or chamber.
4. Conclusions
We proposed an automatic image captioning method to describe fetal ultrasound video content from four types of anatomical structures using real-world sonographer vocabularies. The Fine-tunable-Word2vec-LSTM-Concatenation performed best among the different evaluated model configurations. Richer vocabularies and extensions to spatio-temporal models will be considered in future work.
Supplementary Material
Acknowledgement
We acknowledge ERC (ERC-ADG-2015 694581, project PULSE), EPSRC (EP/MO13774/1), Rhodes Trust, and NIHR Biomedical Research Centre funding scheme.
References
- 1.Bernardi R, et al. Automatic description generation from images: A survey of models, datasets, and evaluation measures. IJCAI; 2017. pp. 4970–4. [Google Scholar]
- 2.Cho K, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP; ACL; 2014. pp. 1724–34. [Google Scholar]
- 3.Elliott D, et al. Image description using visual dependency representations. EMNLP; 2013. pp. 1292–1302. [Google Scholar]
- 4.Goodfellow I, et al. Deep learning. 2016 [Google Scholar]
- 5.Google Cloud. Cloud Speech-to-Text. cloud.google.com/speech-to-text/
- 6.Google Code Archive. Word2Vec. 2013 code.google.com/archive/p/word2vec/
- 7.GrammarBot. Grammar Check API. https://www.grammarbot.io/
- 8.Hochreiter S, et al. Long short-term memory. NC. 1997;9(8):1735–80. doi: 10.1162/neco.1997.9.8.1735. [DOI] [PubMed] [Google Scholar]
- 9.Kingma DP, et al. Adam: A method for stochastic optimization. CoRR; 2015. abs/1412.6980. [Google Scholar]
- 10.Likert R. A technique for the measurement of attitudes. Archives of psychology. 1932 [Google Scholar]
- 11.Lin CY. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out. 2004 [Google Scholar]
- 12.Lyndon D, et al. Neural captioning for the ImageCLEF 2017 medical image challenges. CEUR Workshop Proceedings; 2017. [Google Scholar]
- 13.McCarthy PM, Jarvis S. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior research methods. 2010;42(2):381–92. doi: 10.3758/BRM.42.2.381. [DOI] [PubMed] [Google Scholar]
- 14.Mikolov T, et al. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems. 2013 [Google Scholar]
- 15.Ordonez V, et al. Im2text: Describing images using 1 million captioned photographs. Advances in NIPS. 2011:1143–51. [Google Scholar]
- 16.Papineni K, et al. BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting on ACL; ACL; 2002. pp. 311–8. [Google Scholar]
- 17.Pennington, et al. Glove: Global vectors for word representation. EMNLP; 2014. pp. 1532–43. [Google Scholar]
- 18.Simonyan K, et al. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations; 2015. [Google Scholar]
- 19.Sloetjes H, et al. Annotation by category-ELAN and ISO DCR. LREC; 2008. [Google Scholar]
- 20.Tanti M, et al. What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? ACL; 2017. pp. 51–60. [Google Scholar]
- 21.Tanti M, et al. Where to put the image in an image caption generator. Natural Language Engineering. 2018;24(3):467–89. [Google Scholar]
- 22.Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: A neural image caption generator. Proceedings of the IEEE conference on computer vision and pattern recognition; 2015. pp. 3156–3164. [Google Scholar]
- 23.You Q, et al. Image captioning with semantic attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 4651–9. [Google Scholar]
- 24.Zeng XH, et al. Understanding and generating ultrasound image description. Journal of Computer Science and Technology. 2018;33(5):1086–100. [Google Scholar]