Abstract
Cross-modal retrieval, particularly image-text matching, is crucial in multimedia analysis and artificial intelligence, with applications in intelligent search and human-computer interaction. Current methods often overlook the rich semantic relationships between labels, leading to limited discriminability. We introduce a Two-Layer Graph Convolutional Network (L2-GCN) to model label correlations and a hybrid loss function, Circle-Soft, to enhance alignment and discriminability. Extensive experiments on the NUS-WIDE, MIRFlickr, and MS-COCO datasets demonstrate the effectiveness of our approach. The results show that the proposed method consistently outperforms current baselines, achieving accuracy improvements of 0.5%, 0.5%, and 1.0%, respectively. The source code is accessible via https://github.com/buzzcut619/L2-GCN-CIRCLE-SOFT.
Keywords: Cross-modal retrieval, L2-GCN, Circle-Soft
Subject terms: Computational biology and bioinformatics, Engineering, Mathematics and computing
Introduction
The rapid advancement of artificial intelligence (AI) has led to the widespread deployment of AI-driven solutions for complex challenges1,2. A critical innovation in this domain is cross-modal retrieval, which allows for semantic-based search across different data formats (e.g., using text to find images or videos). By leveraging techniques from computer vision, natural language processing, and representation learning, this technology is revolutionizing applications in intelligent video surveillance, interactive multimedia platforms, and AI-assisted content creation, where understanding the semantic correlation between heterogeneous data is paramount.
To address the challenges in this field, extensive research has been conducted to enhance feature alignment and model discriminability. Representative methods, such as Stacked Cross Attention (SCAN)3, Iterative Matching with Recurrent Attention Memory (IMRAM)4, Semantically Supervised Maximal Correlation (S2MC)5, and the Unified Multi-level Cross-Modal Similarity (MCMS) framework6, have made significant strides in bridging the semantic gap. Additionally, self-supervised approaches like Self-Align7 have been proposed to force alignment at both concept and context levels. More recently, sophisticated frameworks leveraging contrastive learning and deep interaction have further pushed the boundaries of retrieval performance. For instance, IConE8 introduces an instance contrastive embedding method that combines instance loss with contrastive loss to extract fine-grained representations. To enhance information interaction, CDISA9 proposes a cross-modal deep interaction and semantic aligning module, utilizing bidirectional cosine matching to improve feature differentiation. Furthermore, to tackle inconsistent learning objectives, a multimodal embedding transfer approach10 has been developed, which employs a soft contrastive loss to realize selective optimization of multimodal pairs.
Despite these advancements, adapting retrieval models to complex, real-world scenarios remains a major challenge. Specifically, individual image-text pairs often exhibit multi-label characteristics due to the intricate interconnections of semantic concepts—a factor that most existing methods fail to fully capture. This “multi-label” issue manifests in two key aspects: first, a single image-text pair frequently contains multiple parallel or hierarchical semantic labels; second, distinct image-text pairs may share common semantic labels, thereby forming complex many-to-one relationships in the semantic space. For instance, as illustrated in Fig. 1, an image focusing on marine life might simultaneously contain labels such as “Whale,” “Sea,” “Wind,” and “Sky”. Meanwhile, another visually distinct image (e.g., a coastal sunset) could also share the “Sea” label. However, existing methods (e.g.3–10) primarily focus on optimizing cross-modal alignment while treating labels as semantically independent entities. This oversight prevents the construction of comprehensive semantic graphs and weakens the semantic bridges between heterogeneous modalities, limiting retrieval accuracy.
Fig. 1.

Illustration of the multi-label characteristics and semantic correlations in cross-modal retrieval. The original image on the left displays a complex natural scene, while the right side lists its corresponding semantic labels: “1. Sky,” “2. Whale,” “3. Sea,” and “4. Wind.” These labels are not isolated but share rich semantic relationships (e.g., the co-occurrence of “Whale” and “Sea,” where “Sea” can also co-occur with other labels), which collectively describe the image content.
To overcome this limitation, we propose a novel Two-Layer Graph Convolutional Network (L2-GCN) for modeling label correlations. In this framework, labels are represented as nodes in a semantic graph, with initial features obtained from semantic embeddings. By iteratively aggregating features from each node and its first-order neighbors, L2-GCN captures high-order semantic dependencies, thereby enhancing the structural consistency and discriminative power of cross-modal representations.
Moreover, single loss functions demonstrate limited effectiveness in addressing multidimensional challenges. Modal heterogeneity (i.e., distribution discrepancies between image and text features) and suboptimal feature discriminability (characterized by insufficient intra-class compactness and rigid inter-class distance adjustment) constitute two fundamental bottlenecks. To address these, we propose a hybrid loss function termed Circle-Soft, which integrates the adaptive margin property of Circle Loss with the sample weighting strategy of contrastive learning. A learnable parameter is incorporated to dynamically balance the contributions of these two components. This design effectively mitigates modal heterogeneity while enhancing intra-class compactness and inter-class separation in the embedding space.
The main contributions of this work are summarized as follows:
Proposing a Two-Layer Graph Convolutional Network (L2-GCN) for label relationship modeling. Unlike existing methods that treat labels as independent, L2-GCN explicitly models label semantics by constructing a label graph (nodes = labels, edges = semantic relationships). It first processes node features (labels) and their first-order neighbor features separately, then fuses these processed features to generate comprehensive label representations. This design enables the capture of high-order semantic dependencies between labels, strengthening semantic connections across modalities.
Developing a Circle-Soft Loss function for joint optimization of alignment and discriminability. Circle-Soft Loss integrates the dynamic margin adjustment mechanism of Circle Loss (to optimize intra-class/inter-class relationships) and the cross-modal alignment capability of Contrastive Loss (to mitigate modal heterogeneity). A learnable weight parameter balances the contribution of the two components, while an adversarial loss is further incorporated to refine feature consistency, addressing both challenges simultaneously.
Validating the proposed approach on three public benchmarks. Extensive experiments on the NUS-WIDE, MIRFlickr, and MS-COCO datasets demonstrate that our method consistently outperforms baselines (e.g., achieving 0.5%, 0.5%, and 1.0% improvements, respectively) and shows strong robustness, verifying the effectiveness of the proposed innovations.
The remainder of this paper is organized as follows: “Related works” reviews related work on cross-modal retrieval and label modeling; “Proposed method” details the proposed L2-GCN and Circle-Soft Loss; “Experiments” presents experimental settings and results; “Limitations and future work” discusses limitations and directions for future research; and “Conclusion” concludes the work.
Related works
Cross-modal retrieval methods
Generally, cross-modal retrieval approaches can be categorized into two main streams: traditional methods and deep learning-based methods. Traditional techniques primarily utilize linear projections to map multi-modal data into a common feature space, where similarity calculations are performed to achieve image-text matching11,12. Hardoon et al.13 utilized Canonical Correlation Analysis (CCA) and its kernelized version (KCCA) to map heterogeneous data into a common subspace by maximizing inter-modal correlations, thereby effectively handling both linear and non-linear alignments.
Sharma et al.14 proposed Generalized Multiview Analysis (GMA), a supervised extension of CCA. By leveraging class labels, this model effectively minimizes the distance between samples of the same class in the latent space, thereby enhancing the discriminability of cross-modal features. Similarly, Joint Representation Learning (JRL)15 utilizes both label information and correlation maximization within a unified optimization framework. Specifically, JRL incorporates semantic constraints into the subspace learning process, ensuring that the learned common representation preserves the underlying semantic structure of the multi-modal data.
With the rapid development of deep learning, various deep learning-based cross-modal retrieval methods have been proposed. In AFA16, single-modal features are adaptively aggregated by a Feature Enhancement Module, and the retrieval reliability is enhanced by an improved ranking loss function. In TGDT17, coarse and fine-grained representations are combined into a unified framework, and semantic consistency is ensured by a Consistent Multimodal Contrastive loss. In Deep Supervised Cross-modal Retrieval (DSCMR)18, discriminative features are supervised by minimizing losses in both label and common spaces, and cross-modal discrepancy is eliminated via a weight sharing strategy. However, the aforementioned methods largely overlook label correlations, which are pivotal for learning discriminative representations. To address this limitation, we propose the L2-GCN framework. Furthermore, we introduce a Circle-Soft Loss function designed to enhance intra-class compactness and inter-class separability, thereby significantly improving cross-modal matching accuracy.
Graph convolutional neural networks
Graph Convolutional Neural Networks (GCNs)19,20 are widely applied in image recognition21,22, social network analysis23,24, and image re-ranking25,26 because of their powerful graph feature expression capabilities. GCNs can effectively capture the relationships between different modalities and leverage the structure of the data to improve representation learning. By processing information through GCNs, the system can better understand the semantic relationships and shared meanings across modalities. Naturally, GCNs can also be applied to the processing of label data and modality data. Qian et al.27 proposed ALGCN, which employs GCNs on an adaptive correlation matrix to explicitly model label dependencies. This mechanism optimizes inter-dependent classifiers, thereby generating discriminative representations for cross-modal retrieval. Qian et al.28 proposed a new method called DAGNN, which utilizes dual generative adversarial networks to create a common representation space and employs multi-hop graph neural networks to capture label correlations; a layer aggregation mechanism leverages multi-hop propagation information to learn interdependent classifiers. Qian et al.29 proposed I-GNN. Compared to DAGNN, I-GNN utilizes an iterative approach to process label information, enabling it to adaptively learn a better graph and further capture fine-grained relationships between different labels. Wang et al.30 proposed FGAP, which utilizes graph attention and pooling networks to learn graph-level relevance and transform fragment-level features into a shared subspace. Similarly, Wang et al.31 introduced a framework combining Dynamic Semantic Graph Enhancement (DSGE) and Progressive Semantic Alignment (PSA), which adaptively adjusts graph connectivity to prevent over-connection and captures hierarchical cross-modal dependencies.
These methods have achieved good results; however, they overlook the relationships between nearest neighbors within the label graph, failing to explore the deeper, fine-grained relationships within the labels. Different from these methods, we propose a novel framework that explicitly models high-order label dependencies via L2-GCN and optimizes feature alignment using a hybrid Circle-Soft Loss. The details of our approach are presented in the following section.
Proposed method
This section provides a detailed explanation of the proposed framework, covering the processing of raw feature data, the generation of common space representations, the optimization of label representations via the L2-GCN model, and the enhancement of final matching through the proposed hybrid loss function.
Overview
As illustrated in Fig. 2, the proposed framework is designed to bridge the heterogeneity gap between vision and language by integrating adversarial learning with fine-grained label modeling. The overall architecture comprises two mutually reinforcing sub-modules: the Modality Discrimination Adversarial Network and the L2-GCN Label Set Optimization Model.
Fig. 2.
Schematic illustration of the proposed framework, comprising two key sub-processes: (a) Modal discrimination adversarial network: this component depicts the data processing flow for image-text pairs, illustrating how the adversarial network and Circle-Soft Loss align heterogeneous features into a shared embedding space to enhance cross-modal similarity. (b) L2-GCN Label Set Optimization Model: This component demonstrates the label refinement process via L2-GCN, followed by the calculation of classification loss.
First, in the feature extraction stage, raw image and text inputs are processed by their respective encoders to generate initial feature representations ($V$ and $T$). These features are then passed through a Feature Generator and subsequently challenged by a Modality Discriminator. Under the supervision of the Adversarial Loss, this adversarial process eliminates modality-specific patterns, rendering the features from both modalities more similar. Simultaneously, to enhance the discriminability of these features, we apply the Circle-Soft Loss to the initial feature representations, which optimizes intra-class compactness and inter-class separability.
Parallel to this, the L2-GCN module operates on the label space. By aggregating features from label nodes and their first-order neighbors, it explicitly models the high-order semantic dependencies between labels, resulting in a semantically enriched label set.
Finally, the optimized label representations are used to supervise the visual and textual features via a Classification Loss. This step ensures that the learned common representations align accurately with the refined semantic categories, completing the end-to-end training process.
To provide a high-level perspective of the workflow, Fig. 3 presents a simplified overview, encompassing the core components detailed in the subsequent subsections of “Proposed method”.
Fig. 3.
Detailed flowchart of the proposed method.
Feature extraction
We first introduce the notation used throughout the paper. A pair of networks (VGG-19 and BoW) coarsely extracts the raw features of images and text, denoted as $V$ and $T$, respectively. Additionally, we denote the training dataset as $S = \{(v_i, t_i)\}_{i=1}^{n}$, where $n$ represents the total number of image-text pairs in this dataset, $(v_i, t_i)$ denotes the $i$-th image-text pair in the dataset $S$, and $v_i \in \mathbb{R}^{d_v}$, $t_i \in \mathbb{R}^{d_t}$, with $d_v$ and $d_t$ the feature dimensions of the raw input data. $Y \in \{0, 1\}^{n \times m}$ is the label matrix for the corresponding training set $S$, where $m$ is the total number of categories in the label set. If the $i$-th image-text pair $(v_i, t_i)$ in $S$ belongs to the $m$-th category, then $Y_{im} = 1$, with all other entries set to 0.
Our approach consists of two main parts:
(a) the Modality Discriminator Adversarial Network, which will be explained in detail in the following sections, specifically addressing common feature extraction and the adversarial alignment process;
(b) the detailed process of L2-GCN, which will be explained in detail in the following sections, specifically addressing the processing of the label set.
After completing parts (a) and (b), a classification loss is applied using the common representations created in part (a) and the processed label set, which is also a classic operation in supervised learning.
Modality discriminator adversarial network
The main goal of this part is to extract common features and perform modality discrimination. By using VGG19, BoW, and MLP to extract the raw features of images and text, these features are projected into a shared space. In this shared space, the similarity between the two modalities can be directly computed, which is a significant branch of cross-modal retrieval and is fundamental to its implementation.
For images, we employ a VGG-19 network15 pre-trained on a public dataset32 to extract raw image features. The output of the seventh fully connected layer (fc7) yields a feature vector $f_v$ with a dimension of 1024.

$$V = \mathrm{MLP}_v(f_v) \in \mathbb{R}^{n \times d} \tag{1}$$

MLP refers to a Multilayer Perceptron consisting of several stacked fully connected linear layers. Here, $d$ represents the dimension of the image features after projection into the common space.
For text, similar to image feature extraction, we utilize the widely used Bag-of-Words (BoW) approach33 to obtain raw text features, denoted $f_t$. These features are then mapped into the common embedding space.

$$T = \mathrm{MLP}_t(f_t) \in \mathbb{R}^{n \times d} \tag{2}$$
Similar to image representation, MLP refers to a Multilayer Perceptron consisting of several stacked fully connected linear layers. Here, d represents the dimension of the text features after projection into the common space.
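As a concrete illustration of this projection step, the following is a minimal PyTorch sketch: raw image and text features are mapped into a shared space by separate MLPs. The layer widths, hidden sizes, and names here are our assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Hypothetical projection MLPs for the common-space mapping (cf. Eqs. 1-2).
# Input/hidden/output dimensions are illustrative only.
class ProjectionMLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

img_proj = ProjectionMLP(in_dim=1024, hidden_dim=512, out_dim=256)  # for f_v
txt_proj = ProjectionMLP(in_dim=1386, hidden_dim=512, out_dim=256)  # for f_t

f_v = torch.randn(8, 1024)  # a batch of raw image features
f_t = torch.randn(8, 1386)  # a batch of raw BoW text features
V, T = img_proj(f_v), txt_proj(f_t)
print(V.shape, T.shape)     # both land in the same 256-d common space
```

Once both modalities share a dimension, their similarity can be computed directly, which is the prerequisite for the adversarial alignment that follows.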
Modality discrimination: the common representation of images and text is fed into the respective decoders of the feature generator. After obtaining these generated representations, they are passed to the modality discriminator for modality classification. This adversarial interplay between the Feature Generator and Modality Discriminator is the essence of GAN networks, ensuring modality-invariant feature learning.
$$\tilde{V} = G_v(V) \tag{3}$$

$$\tilde{T} = G_t(T) \tag{4}$$

where $\tilde{V}$ and $\tilde{T}$ are the features of the image and text common representations obtained after passing through the generator. These features are later used in the GAN loss.
L2-GCN pre-optimization of the label set
In cross-modal datasets, labels are rarely independent; instead, they exhibit strong statistical co-occurrence and correlations. For instance, “sky” frequently co-occurs with “airplane” but rarely with “trees” (in certain contexts), yet it can also co-occur with “dolphin,” implying that a single object concept can appear across diverse thematic contexts.
The L2-GCN is designed to capture the topological structure of these semantic correlations. As illustrated in the L2-GCN module in Fig. 2, we construct a label graph where the adjacency matrix represents these co-occurrence and correlation probabilities (indicated by values such as 0.7 in the figure). By propagating information within this graph, the model enriches individual label representations using complementary context from correlated neighbors, thereby enhancing the accuracy of cross-modal retrieval.
The label graph is denoted as $G = (N, E)$, where $N$ is the set of nodes, each node representing a category with a corresponding feature vector; the feature matrix of all nodes is denoted as $F$. $E$ is the set of edges; the connections between nodes are collected into a matrix that links these relationships together, forming an adjacency matrix, denoted as $A$.

$$\mathrm{Diag} = J - I \tag{5}$$

$$A_{\mathrm{self}} = J - \mathrm{Diag} = I \tag{6}$$
First, the adjacency matrix $A$ is updated. $\mathrm{Diag}$ is a square matrix with zeros on the main diagonal and all other elements equal to one, and $J$ is the all-ones matrix. Subtracting $\mathrm{Diag}$ from $J$ yields a matrix with ones on the diagonal and zeros elsewhere, denoted as $A_{\mathrm{self}}$. Next, a weight matrix and bias are used to update the feature matrix, which is then multiplied by $A_{\mathrm{self}}$, resulting in $F_{\mathrm{self}}$. This feature matrix retains the features of all nodes themselves.
$$F' = FW + b \tag{7}$$

$$F_{\mathrm{self}} = A_{\mathrm{self}} F' \tag{8}$$

$$A_{\mathrm{neigh}} = A \odot \mathrm{Diag} \tag{9}$$

$$F_{\mathrm{neigh}} = \frac{A_{\mathrm{neigh}} F'}{\mathrm{neighbor}} \tag{10}$$
$A_{\mathrm{neigh}}$ is a square matrix with zeros on the diagonal, while retaining the original values of $A$ elsewhere. For $A_{\mathrm{neigh}}$, along the first dimension, any value greater than 0 is treated as indicating a relationship with a nearest neighbor, and these values are set to 1; the sum along the first dimension then gives the number of one-hop neighbors of each node, denoted as $\mathrm{neighbor}$. Next, $A_{\mathrm{neigh}}$ is multiplied by $F'$ and divided by $\mathrm{neighbor}$ to obtain the mean-pooled feature matrix $F_{\mathrm{neigh}}$. Finally, $F_{\mathrm{self}}$ and $F_{\mathrm{neigh}}$ are passed through one linear layer each, and the results are added to obtain the final feature matrix.
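Under our reading of the steps above, one L2-GCN layer can be sketched as follows: a "self" branch keeps each node's own transformed feature, a "neighbor" branch pools transformed features over one-hop neighbors, and the two branches are fused by separate linear layers. All class and variable names are ours, not the authors'.

```python
import torch
import torch.nn as nn

# A minimal sketch of one L2-GCN layer as described in the text.
class L2GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.transform = nn.Linear(in_dim, out_dim)  # F' = FW + b
        self.fuse_self = nn.Linear(out_dim, out_dim)
        self.fuse_neigh = nn.Linear(out_dim, out_dim)

    def forward(self, F, A):
        n = A.size(0)
        Fp = self.transform(F)                # updated node features F'
        F_self = Fp                           # A_self is the identity, so A_self @ F' = F'
        A_neigh = A * (1.0 - torch.eye(n))    # zero out the diagonal of A
        neighbor = (A_neigh > 0).float().sum(dim=1, keepdim=True).clamp(min=1)
        F_neigh = A_neigh @ Fp / neighbor     # neighbor aggregation, averaged per node
        return self.fuse_self(F_self) + self.fuse_neigh(F_neigh)

A = torch.tensor([[0.0, 0.7, 0.0],
                  [0.7, 0.0, 0.5],
                  [0.0, 0.5, 0.0]])           # toy label co-occurrence graph
F = torch.randn(3, 16)                        # 3 label nodes, 16-d features
layer = L2GCNLayer(16, 16)
out = layer(F, A)
print(out.shape)
```

Stacking several such layers (5 or 6 in the experiments) propagates information further, capturing higher-order label dependencies.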
Circle-Soft Loss: category loss and modality loss
Circle-Soft Loss is an innovative loss function that integrates the strengths of Circle Loss and Soft Contrastive Loss: the former excels at handling intra-class and inter-class relationships, while the latter addresses modality-specific relationships. By combining the advantages of both approaches and introducing a threshold for fine-tuning, Circle-Soft Loss achieves superior performance in global similarity matching.
Category loss: Circle Loss
One of the core essences of cross-modal retrieval is to make data from the same category more similar in a shared embedding space while preserving their intrinsic characteristics. To achieve this goal, we adopt Circle Loss. Circle Loss reweights each similarity score to emphasize those that are under-optimized and provides a unified formulation for both category-level and pair-wise labels. Through this method, Circle Loss demonstrates significant advantages in handling complex cross-modal data, particularly in tasks that require fine-grained differentiation between categories. Below is the process by which Circle Loss processes data:
Normalize the obtained image and text common features $V$ and $T$ from the previous context:

$$\hat{v}_i = \frac{v_i}{\lVert v_i \rVert_2}, \qquad \hat{t}_i = \frac{t_i}{\lVert t_i \rVert_2} \tag{11}$$

where $v_i$ and $t_i$ denote the raw feature vectors of the image and text for the $i$-th sample, respectively.
Based on the normalized features, we calculate the pairwise cosine similarity within each modality. The similarity matrices for the image modality $S^v$ and the text modality $S^t$ are defined as:

$$S^v_{ij} = \hat{v}_i^{\top} \hat{v}_j, \qquad S^t_{ij} = \hat{t}_i^{\top} \hat{t}_j \tag{12}$$

where $S^v_{ij}$ represents the cosine similarity between the $i$-th and $j$-th images.
We directly utilize the label adjacency matrix $A$ to determine positive and negative pairs. Here, $A_{ij} = 1$ indicates that sample $i$ and sample $j$ form a positive pair, while $A_{ij} = 0$ indicates they form a negative pair. Based on $A$, we define the set of positive pairs $P$ and negative pairs $N$ as:

$$P = \{(i, j) \mid A_{ij} = 1\} \tag{13}$$

$$N = \{(i, j) \mid A_{ij} = 0\} \tag{14}$$
Circle Loss simultaneously optimizes class relationships via an adaptive weighting mechanism. Taking the image modality as an example, the loss function $\mathcal{L}^v_{\mathrm{circle}}$ is formulated as:

$$\mathcal{L}^v_{\mathrm{circle}} = \log\Bigl[1 + \sum_{(i,j) \in N} \exp\bigl(\gamma\, \alpha_n (S^v_{ij} - \Delta_n)\bigr) \sum_{(i,j) \in P} \exp\bigl(-\gamma\, \alpha_p (S^v_{ij} - \Delta_p)\bigr)\Bigr] \tag{15}$$
where $\gamma$ is the scale factor, and $\Delta_p$ and $\Delta_n$ are the preset margins. The adaptive weighting coefficients $\alpha_p$ and $\alpha_n$ are calculated as:

$$\alpha_p = \bigl[O_p - S^v_{ij}\bigr]_{+}, \qquad \alpha_n = \bigl[S^v_{ij} - O_n\bigr]_{+} \tag{16}$$

where $O_p$ and $O_n$ are the optima of the positive and negative similarity scores, and $[\cdot]_{+}$ denotes truncation at zero.
Similarly, the loss for the text modality $\mathcal{L}^t_{\mathrm{circle}}$ is calculated in the same manner. The final total Circle Loss is the sum of the losses from both modalities:

$$\mathcal{L}_{\mathrm{circle}} = \mathcal{L}^v_{\mathrm{circle}} + \mathcal{L}^t_{\mathrm{circle}} \tag{17}$$
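As a hedged sketch of the pairwise Circle Loss for one modality, the following follows the formulation of Sun et al.; the scale `gamma`, margin `m`, and the choice of optima `O_p`/`O_n` are illustrative defaults from the original Circle Loss paper, not values stated in this work.

```python
import torch
import torch.nn.functional as Fn

# Pairwise Circle Loss sketch: under-optimized positives and hard negatives
# receive larger adaptive weights a_p / a_n.
def circle_loss(sim, pos_mask, neg_mask, gamma=32.0, m=0.25):
    O_p, O_n = 1 + m, -m                    # similarity optima
    d_p, d_n = 1 - m, m                     # margins Delta_p, Delta_n
    s_p, s_n = sim[pos_mask], sim[neg_mask]
    a_p = torch.clamp_min(O_p - s_p, 0)     # up-weight under-optimized positives
    a_n = torch.clamp_min(s_n - O_n, 0)     # up-weight hard negatives
    logit_p = -gamma * a_p * (s_p - d_p)
    logit_n = gamma * a_n * (s_n - d_n)
    # log(1 + sum_n exp(.) * sum_p exp(.)) in a numerically stable form
    return Fn.softplus(torch.logsumexp(logit_p, 0) + torch.logsumexp(logit_n, 0))

V = Fn.normalize(torch.randn(6, 32), dim=1)   # toy normalized image features
sim = V @ V.T                                 # within-modality cosine similarity
labels = torch.tensor([0, 0, 1, 1, 2, 2])
same = labels[:, None] == labels[None, :]
eye = torch.eye(6, dtype=torch.bool)
loss = circle_loss(sim, same & ~eye, ~same)   # exclude self-pairs from positives
print(float(loss))
```

The text-modality term is computed identically on $S^t$ and the two terms are summed, mirroring Eq. 17.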
Modality loss: Soft Contrastive Loss
Inspired by I-GNN, we utilize its contrastive loss as the modality loss. Specifically, Soft Contrastive Loss optimizes cross-modal alignment by exploiting the similarity matrix between views. Its normalization mechanism plays a critical role in enhancing the model’s discriminability across different modalities. Compared to traditional contrastive loss functions, the Soft Contrastive Loss places greater emphasis on the similarities between modalities, thereby offering stronger adaptability and robustness in multi-modal tasks.
$$S^{vt}_{ij} = \hat{v}_i^{\top} \hat{t}_j \tag{18}$$

$$p_{ij} = \frac{\exp\bigl(S^{vt}_{ij} / \tau\bigr)}{\sum_{k} \exp\bigl(S^{vt}_{ik} / \tau\bigr)} \tag{19}$$

$$q_{ij} = \frac{A_{ij}}{\sum_{k} A_{ik}} \tag{20}$$

$$\mathcal{L}^{v \to t}_{\mathrm{soft}} = -\frac{1}{n} \sum_{i} \sum_{j} q_{ij} \log p_{ij} \tag{21}$$

$$\mathcal{L}_{\mathrm{soft}} = \mathcal{L}^{v \to t}_{\mathrm{soft}} + \mathcal{L}^{t \to v}_{\mathrm{soft}} \tag{22}$$

where $\tau$ is a temperature parameter, $p_{ij}$ is the normalized cross-modal similarity, $q_{ij}$ is the soft target derived from the label adjacency matrix $A$, and $\mathcal{L}^{t \to v}_{\mathrm{soft}}$ is defined symmetrically.
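The following is an illustrative sketch of this style of soft contrastive loss: cross-modal cosine similarities are softmax-normalized and matched against soft targets derived from label affinity. The names (`tau`, `label_sim`) and the exact formulation are our assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as Fn

# Soft contrastive loss sketch: soft targets replace the usual one-hot
# positives, so partially related pairs still receive some probability mass.
def soft_contrastive(V, T, label_sim, tau=0.1):
    V, T = Fn.normalize(V, dim=1), Fn.normalize(T, dim=1)
    logits = V @ T.T / tau                                  # image-to-text scores
    q_i2t = label_sim / label_sim.sum(1, keepdim=True).clamp(min=1e-8)
    q_t2i = label_sim.T / label_sim.T.sum(1, keepdim=True).clamp(min=1e-8)
    i2t = -(q_i2t * Fn.log_softmax(logits, dim=1)).sum(1).mean()
    t2i = -(q_t2i * Fn.log_softmax(logits.T, dim=1)).sum(1).mean()
    return i2t + t2i

V, T = torch.randn(4, 64), torch.randn(4, 64)   # toy common-space features
label_sim = torch.eye(4) + 0.3                  # toy soft label affinity matrix
loss = soft_contrastive(V, T, label_sim)
print(float(loss))
```

Here the matched pair keeps the highest target weight (1.3 before normalization) while related pairs share the rest, which softens the alignment objective relative to a hard contrastive loss.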
Ultimately, the category loss and the modality loss are integrated to obtain the final Circle-Soft Loss, where $\eta$ is the learnable weight that balances the two components:

$$\mathcal{L}_{\mathrm{CS}} = \eta\, \mathcal{L}_{\mathrm{circle}} + (1 - \eta)\, \mathcal{L}_{\mathrm{soft}} \tag{23}$$
Classification loss
The classification loss serves to calculate the distance difference between the predicted labels and the actual labels. Classification loss is typically used to help the model learn semantic consistency between different modalities, ensuring that the same concept or entity can be correctly associated across modalities.
$$\mathcal{L}_{\mathrm{cls}} = \frac{1}{n} \lVert P - R \rVert_2 \tag{24}$$

where $P$ represents the predicted labels and $R$ represents the actual labels. For generality, we use the $\ell_2$ (Euclidean) norm to calculate the distance between them, leading to the final loss function.
Adversarial loss
Similar to AGCN, we made a small improvement by introducing a weight $\mu$ to merge the losses from both views, reducing redundancy and balancing the impact of different modalities on the final loss. We use the GAN loss to further update the features of both modalities. In the field of cross-modal retrieval, the GAN loss (the loss function of Generative Adversarial Networks) plays a crucial role: through the adversarial game between the generator and discriminator, it encourages the model to deeply learn the intrinsic connections between modalities, achieving effective mapping and transformation between different data types such as images and text.
$$\mathcal{L}_{\mathrm{adv}} = \mu\, \ell_{ce}\bigl(D(\tilde{V}),\, l_v\bigr) + (1 - \mu)\, \ell_{ce}\bigl(D(\tilde{T}),\, l_t\bigr) \tag{25}$$

where $\tilde{V}$ is the feature obtained from the common representation of images after passing through the decoder, $\tilde{T}$ is the feature obtained from the common representation of text after passing through the decoder, $l_v$ and $l_t$ are the modality labels, $D$ is the modality discriminator, and $\ell_{ce}$ denotes the cross-entropy loss function.
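A hedged sketch of the modality-adversarial step follows: a discriminator tries to classify whether a common-space feature came from the image or the text branch, and the two cross-entropy terms are merged by a weight `mu`. The discriminator architecture, dimensions, and `mu` value here are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy modality discriminator: predicts "image" (0) vs "text" (1).
disc = nn.Sequential(nn.Linear(256, 64), nn.LeakyReLU(0.2), nn.Linear(64, 2))
ce = nn.CrossEntropyLoss()

V_tilde = torch.randn(8, 256)               # decoded image features
T_tilde = torch.randn(8, 256)               # decoded text features
img_lbl = torch.zeros(8, dtype=torch.long)  # modality label 0 = image
txt_lbl = torch.ones(8, dtype=torch.long)   # modality label 1 = text

mu = 0.5
L_adv = mu * ce(disc(V_tilde), img_lbl) + (1 - mu) * ce(disc(T_tilde), txt_lbl)
print(float(L_adv))
```

In training, the discriminator minimizes this loss while the feature generator is updated adversarially against it, pushing the two modalities toward indistinguishable (modality-invariant) representations.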
Overall objective
Based on the above losses, the final objective loss is defined as follows, where the values of $\alpha$, $\beta$, and the threshold $\lambda$ will be discussed in “Experiments”.

$$\mathcal{L} = \mathcal{L}_{\mathrm{CS}} + \alpha\, \mathcal{L}_{\mathrm{cls}} + \beta\, \mathcal{L}_{\mathrm{adv}} \tag{26}$$
Retrieval process
While the training phase involves complex optimization objectives and label graph modeling (L2-GCN) to learn discriminative representations, the actual retrieval stage operates in a deterministic, feed-forward manner. To provide a clear understanding of how the trained model is deployed for cross-modal tasks, we illustrate the detailed retrieval workflow in Fig. 4.
Fig. 4.

Schematic illustration of the cross-modal retrieval process.
As illustrated in Fig. 4, the retrieval process operates in a deterministic, feed-forward manner. Specifically, taking the Image-to-Text task as an example, given a query image and a text database, we first extract their raw features using the pre-trained VGG-19 and BoW models, respectively. These features are then projected into the common embedding space via the trained Linear Layers, resulting in the final feature vectors. Subsequently, we calculate the semantic similarity between the query image vector and each vector in the text database using Cosine Similarity. Finally, all items in the database are sorted in descending order based on these similarity scores to generate the final ranked retrieval list.
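The ranking step described above can be sketched in a few lines; here we assume the features have already been projected into the common space, and the query and database sizes are illustrative.

```python
import torch
import torch.nn.functional as Fn

# Image-to-text retrieval sketch: L2-normalize, score with cosine
# similarity, and sort the database in descending order of similarity.
query = Fn.normalize(torch.randn(1, 256), dim=1)   # one projected query image
db = Fn.normalize(torch.randn(100, 256), dim=1)    # 100 projected text vectors
scores = (query @ db.T).squeeze(0)                 # cosine similarity per item
ranked = torch.argsort(scores, descending=True)    # final ranked retrieval list
print(ranked[:5].tolist())                         # indices of the top-5 texts
```

Because the features are unit-normalized, the dot product equals cosine similarity, so a single matrix multiplication scores the entire database.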
Experiments
To validate the effectiveness of our method, we tested it on three different and representative datasets. First, we introduce the evaluation metrics, then we describe the datasets used, and finally, we conduct ablation experiments and various parameter analyses.
Evaluation metrics
We use mAP34,35 (Mean Average Precision) as the evaluation metric. mAP is a widely used metric in information retrieval and recommendation systems to measure a model’s overall retrieval performance across multiple query tasks. After calculating the mAP values for both modalities, the final value is obtained by averaging them, which reflects the effectiveness of our method.
For a single query q, the formula for calculating average precision is as follows:
$$AP(q) = \frac{1}{R_q} \sum_{k=1}^{N} P(k)\, rel(k) \tag{27}$$

$$mAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q) \tag{28}$$

$N$ is the number of retrieved results, and $P(k)$ is the precision of the top $k$ retrieved results. $rel(k)$ is an indicator function that represents whether the $k$-th result is relevant (1 if relevant, 0 if not). $R_q$ is the total number of relevant results for the query $q$, and $Q$ denotes the set of queries in the test set.
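The two formulas can be sketched in plain Python; here `rel` is a 0/1 relevance list for one query's ranked results, and for simplicity we take $R_q$ to be the number of relevant items appearing in the ranking.

```python
# Average precision (Eq. 27) and mAP (Eq. 28) over a toy query set.
def average_precision(rel):
    hits, precisions = 0, []
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            precisions.append(hits / k)    # P(k) at each relevant position
    return sum(precisions) / max(hits, 1)  # divide by R_q

def mean_average_precision(rel_lists):
    return sum(average_precision(r) for r in rel_lists) / len(rel_lists)

queries = [[1, 0, 1, 0], [0, 1, 1, 0]]     # relevance flags for two queries
print(mean_average_precision(queries))
```

For the first query, relevant items at ranks 1 and 3 give an AP of (1 + 2/3)/2; averaging over the query set yields the reported mAP.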
Datasets
The experiments were conducted on three widely used datasets. To ensure a fair comparison with existing state-of-the-art methods (e.g., DSCMR18, DAGNN28, I-GNN29), we strictly follow the standard data partition protocol widely used in the literature; detailed descriptions of these datasets are as follows:
MIRFlickr: The MIRFlickr dataset contains 25,000 image-text pairs collected from the Flickr website, covering a wide range of scenes and subjects. Each image and its corresponding text are annotated with tags that describe the content of both the image and the text. These tags can be used for image category classification and retrieval tasks. The dataset includes a variety of images, such as natural landscapes, urban scenes, animals, people, and events. This diversity makes the dataset highly useful for various research tasks. We selected 2000 image-text pairs as the test set, with the remaining pairs used as the training set.
NUS-WIDE: NUS-WIDE is a large-scale dataset for multimedia information retrieval and image annotation, primarily provided by the National University of Singapore (NUS). This dataset is widely used in research on image classification, retrieval, and cross-modal retrieval. We selected 190,421 image-text pairs for our experiments. Among these, we randomly chose 2000 image-text pairs as the test set, with the remainder serving as the training set.
MS-COCO: This dataset contains over 320,000 images, which are richly annotated with information about objects, scenes, and their contextual relationships. The MS-COCO dataset covers 80 common object categories, including humans, animals, everyday items, vehicles, and more. The images are diverse in content, often containing multiple objects with complex interactions and background information. To obtain the mAP evaluation metric, 2,000 image-text pairs were randomly selected as the test set, with the remaining pairs used as the training set.
Baseline methods
We compared our method with seventeen baseline methods, including three traditional multivariate statistical methods: CCA36, ML-CCA37, and PLS-C2A38; three cross-modal hashing methods: DCDH39, GCH40, and MGCH41; and eleven deep learning-based cross-modal retrieval methods: DCCA42, ACMR43, DSCMR18, DAVAE44, ALGCN27, DAGNN28, IConE8, FGAP30, DSGE31, CDISA9, and I-GNN-CON29. We directly referenced the results reported in the respective papers or reproduced these methods from their released code to obtain comparative results.
Implementation details
All experiments were implemented using the PyTorch framework and conducted on a single NVIDIA GeForce RTX 4060 Ti GPU. We utilized the Adam optimizer to train the entire model. The learning rate was universally set to 0.00005 across all datasets, with a total training duration of 40 epochs. To accommodate different dataset scales, the batch sizes were set to 2048 for NUS-WIDE, 100 for MIRFlickr, and 512 for MS-COCO.
Regarding the specific hyperparameters used in our proposed method, the loss weights $\alpha$ and $\beta$ were set to 1.1 and 2.2, respectively, for all three datasets. The threshold parameter $\lambda$ was empirically set to 0.6. Additionally, the depth of the L2-GCN was configured to 5 layers for the MIRFlickr and MS-COCO datasets, whereas it was set to 6 layers for the NUS-WIDE dataset.
In terms of network architecture, both the image features (processed by VGG-19) and text features (processed by BoW) are mapped into a common embedding space with a dimension of 1,024 via their respective MLPs. The Image Decoder consists of two fully connected layers with 512 and 300 hidden units, respectively, activated by LeakyReLU with a negative slope of 0.2. Conversely, the Text Decoder comprises two fully connected layers with 2,048 and 4,096 hidden units, respectively, activated by Rectified Linear Unit (ReLU). For the L2-GCN module, the output dimension of each layer is fixed at 1,024. The construction of the adjacency matrix and input vertex features for the label graph follows the protocol described in26.
Results and discussion
All experimental results and comparison results are shown in Table 1. The first three methods are traditional multivariate statistical methods; DCDH, MGCH, and GCH are cross-modal hashing approaches; and the remaining methods are deep learning-based cross-modal retrieval techniques.
Table 1.
mAP scores on NUS-WIDE, MIRFlickr, and MS-COCO Datasets.
| Method | NUS-WIDE I2T | NUS-WIDE T2I | NUS-WIDE Avg. | MIRFlickr I2T | MIRFlickr T2I | MIRFlickr Avg. | MS-COCO I2T | MS-COCO T2I | MS-COCO Avg. |
|---|---|---|---|---|---|---|---|---|---|
| CCA | 0.656 | 0.664 | 0.660 | 0.712 | 0.722 | 0.717 | 0.652 | 0.656 | 0.654 |
| PLS-C2A | 0.632 | 0.631 | 0.631 | 0.730 | 0.740 | 0.735 | 0.643 | 0.637 | 0.640 |
| ML-CCA | 0.669 | 0.668 | 0.668 | 0.734 | 0.742 | 0.738 | 0.637 | 0.634 | 0.635 |
| DCCA | 0.637 | 0.649 | 0.643 | 0.736 | 0.746 | 0.741 | 0.635 | 0.630 | 0.632 |
| ACMR | 0.684 | 0.675 | 0.680 | 0.736 | 0.748 | 0.742 | 0.706 | 0.708 | 0.707 |
| DCDH | 0.684 | 0.680 | 0.682 | 0.742 | 0.758 | 0.750 | 0.610 | 0.604 | 0.607 |
| MGCH | 0.684 | 0.690 | 0.687 | 0.751 | 0.757 | 0.754 | 0.661 | 0.669 | 0.665 |
| GCH | 0.677 | 0.697 | 0.687 | 0.762 | 0.786 | 0.774 | 0.559 | 0.560 | 0.560 |
| DSCMR | 0.706 | 0.739 | 0.722 | 0.752 | 0.799 | 0.775 | 0.813 | 0.810 | 0.811 |
| DAVAE | 0.728 | 0.728 | 0.728 | 0.760 | 0.801 | 0.781 | 0.821 | 0.809 | 0.815 |
| IConE | 0.736 | 0.747 | 0.741 | 0.788 | 0.799 | 0.793 | 0.811 | 0.822 | 0.816 |
| ALGCN | 0.747 | 0.758 | 0.753 | 0.792 | 0.819 | 0.806 | 0.831 | 0.810 | 0.821 |
| FGAP | 0.750 | 0.766 | 0.758 | 0.791 | 0.830 | 0.810 | 0.833 | 0.828 | 0.830 |
| DSGE | 0.762 | 0.754 | 0.757 | 0.810 | 0.810 | 0.810 | 0.839 | 0.822 | 0.830 |
| DAGNN | 0.755 | 0.761 | 0.758 | 0.806 | 0.820 | 0.812 | 0.836 | 0.830 | 0.833 |
| CDISA | 0.766 | 0.782 | 0.773 | 0.793 | 0.799 | 0.796 | 0.834 | 0.854 | 0.844 |
| I-GNN-CON | 0.757 | 0.763 | 0.760 | 0.807 | 0.825 | 0.816 | 0.843 | 0.840 | 0.841 |
| L2-GCN (our) | 0.760 | 0.770 | 0.765 | 0.813 | 0.827 | 0.821 | 0.850 | 0.852 | 0.851 |
From the experimental results in Table 1, several observations can be made. First, among the three categories of methods, models based on deep learning achieve the best results, because deep learning strategies can capture both linear and nonlinear relationships in the data. Second, the three cross-modal hashing methods generally underperform the deep learning methods: hashing maps high-dimensional real-valued features to low-dimensional binary codes, which inevitably incurs information loss, reduces feature discriminability, and thus lowers retrieval accuracy. Third, DSCMR and DAVAE lag slightly behind the other deep learning methods, because ALGCN, DAGNN, and I-GNN-CON each use GCN models to exploit the semantic relationships between labels, yielding clear gains; this suggests that label modeling in cross-modal retrieval is far from exhausted and warrants further research. Although CDISA attains the highest accuracy on the NUS-WIDE dataset, its performance on the MIRFlickr and MS-COCO datasets is suboptimal. Overall, L2-GCN achieves the best average performance on the MIRFlickr and MS-COCO datasets and highly competitive results on NUS-WIDE.
For example, on the MIRFlickr dataset, Table 1 shows that our method (L2-GCN) achieves the best mAP scores of 0.813 and 0.827 for the Image2Text and Text2Image tasks, respectively. Compared to the traditional multivariate statistical method CCA, L2-GCN improves performance by 10.1% and 10.5% on the Image2Text and Text2Image tasks, respectively. Compared to the best cross-modal hashing method GCH, L2-GCN improves average performance by 4.7%. Compared to the best deep cross-modal retrieval method I-GNN-CON, our method improves performance by 0.6% and 0.2% on the Image2Text and Text2Image tasks, respectively.
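For reference, the mAP values in Table 1 follow the standard ranking-based definition: average precision is computed over each query's ranked retrieval list and then averaged across queries. A minimal sketch (the relevance lists below are hand-made toy data, not output of our retrieval pipeline):

```python
def average_precision(relevance):
    """AP of one ranked list; relevance[i] is 1 if the i-th
    retrieved item shares a label with the query, else 0."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(ranked_lists):
    """mAP over all queries."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# Toy example: two queries with hand-made relevance lists.
print(mean_average_precision([[1, 0, 1], [0, 1, 1]]))  # 17/24 ≈ 0.708
```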
To evaluate the effectiveness of the proposed L2-GCN module and the contribution of each loss component, we conducted comprehensive ablation studies comparing performance across different layer depths and loss combinations. Specifically, the number of L2-GCN layers was varied from 1 to 8, and five loss scenarios were evaluated: retaining all losses, retaining only the Circle-Soft loss, and removing, in turn, the Circle-Soft loss, the adversarial loss, and the classification loss. These five scenarios are denoted as conditions 1 to 5, respectively. Based on these combinations, we generated a 3D surface plot (Fig. 5) and a bar chart (Fig. 6) to visually illustrate the joint impact of layer depth and loss components on retrieval performance.
Fig. 5.

3D surface plot illustrating the joint impact of L2-GCN layer depth and different loss function configurations on retrieval precision.
Fig. 6.

Ablation study on the MIRFlickr dataset illustrating the impact of different loss component configurations on retrieval performance. The vertical axis represents the Precision score, while the horizontal axis denotes various ablation settings.
The results in Fig. 5 reveal distinct trends regarding both model depth and loss composition.
First, concerning the L2-GCN layer depth, the performance exhibits a clear rise-then-fall trend. With shallow networks (1–2 layers), the model fails to capture sufficient high-order semantic correlations, resulting in suboptimal precision. The accuracy peaks at 5 layers, indicating the optimal balance for feature aggregation. However, increasing the depth further (6–8 layers) degrades performance. This is likely due to the "over-smoothing" phenomenon common in deep GCNs, where repeated aggregations cause node features to converge and lose discriminability, while also increasing computational complexity. Consequently, we set the layer count to 5 for the MIRFlickr dataset.
Second, regarding the loss configurations, the plot demonstrates that the highest accuracy is consistently achieved when all losses are retained, regardless of the layer count. Notably, removing the Circle-Soft loss causes a sharp decline in performance. These findings underscore the pivotal role of the Circle-Soft loss in enhancing the overall data representation and retrieval accuracy.
Figure 6 further illustrates the impact of each loss component on accuracy. Removing the adversarial loss or the classification loss results in only a minor decline in overall accuracy, but removing our proposed Circle-Soft loss leads to a 2% decrease, demonstrating its indispensable role. For the MIRFlickr dataset, when the number of layers is 5 and all losses are retained, the accuracy reaches its highest value of 0.82.
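The over-smoothing phenomenon invoked above to explain the depth trend is easy to reproduce: repeatedly applying the symmetric-normalized adjacency D^(-1/2)(A + I)D^(-1/2) shrinks the spread of node features, eroding their discriminability. A toy NumPy illustration on a 4-node path graph (not our actual label graph):

```python
import numpy as np

# Path graph on 4 nodes with self-loops already added (A + I).
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
d = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(d, d))   # D^(-1/2) (A + I) D^(-1/2)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))           # 4 nodes, 8-dim features

def spread(X):
    """Mean standard deviation of each feature across nodes."""
    return X.std(axis=0).mean()

X1, X8 = A_hat @ X, X.copy()
for _ in range(8):                    # 8 propagation steps
    X8 = A_hat @ X8

print(spread(X), spread(X1), spread(X8))  # spread shrinks with depth
```

After eight hops, the node features are nearly indistinguishable, which is exactly the loss of discriminability that caps the useful L2-GCN depth at 5–6 layers.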
To further investigate the impact of L2-GCN on the experimental results, we replaced L2-GCN with GCN, GAT, and BaGFN to explore the semantic relationships between labels. Figures 7 and 8 show the experimental results of L2-GCN, GCN, GAT, and BaGFN45 on the MIRFlickr and NUS-WIDE datasets. L2-GCN outperforms both GAT and GCN on the Image2Text and Text2Image retrieval tasks. Although BaGFN achieves superior I2T performance on both datasets, L2-GCN holds a slight edge in the overall average, confirming its effectiveness.
Fig. 7.

Bar chart illustrating the retrieval accuracy of different label processing models on the MIRFlickr dataset. The horizontal axis represents the retrieval tasks, while the vertical axis represents the prediction precision (percentage). The colored bars distinguish between the baseline models—GCN (red), GAT (blue), and BaGFN (green)—and the proposed L2-GCN (purple).
Fig. 8.

Bar chart illustrating the retrieval accuracy of different label processing models on the NUS-WIDE dataset. The horizontal axis represents the retrieval tasks, while the vertical axis represents the prediction precision (percentage). The colored bars distinguish between the baseline models—GCN (red), GAT (blue), and BaGFN (green)—and the proposed L2-GCN (purple).
From Figs. 7 and 8, it can be seen that our L2-GCN model shows a clear improvement over the three baseline graph models, indicating notable progress in exploring the fine-grained relationships between labels.
Considering the critical role of the GAN loss, we analyzed the sensitivity of the retrieval accuracy to the hyperparameter λ2. To identify the optimal configuration, we jointly optimized λ1 and λ2 through repeated experiments on two datasets. The corresponding results are visualized in Figs. 9, 10, 11 and 12.
Fig. 9.

Parameter sensitivity analysis on the MIRFlickr dataset for the Image-to-Text (i2t) task. The 3D plot illustrates the joint impact of hyperparameters λ1 and λ2 on retrieval precision.
Fig. 10.

Parameter sensitivity analysis on the MIRFlickr dataset for the Text-to-Image (t2i) task. The 3D plot illustrates the joint impact of hyperparameters λ1 and λ2 on retrieval precision.
Fig. 11.

Parameter sensitivity analysis on the NUS-WIDE dataset for the Image-to-Text (i2t) task. The 3D plot illustrates the joint impact of hyperparameters λ1 and λ2 on retrieval precision.
Fig. 12.

Parameter sensitivity analysis on the NUS-WIDE dataset for the Text-to-Image (t2i) task. The 3D plot illustrates the joint impact of hyperparameters λ1 and λ2 on retrieval precision.
As shown in Figs. 9, 10, 11 and 12, for both datasets the accuracy drops sharply as λ2 is decreased from 2.2 towards 0, reaching its lowest value at λ2 = 0. For λ1, the accuracy curve is unimodal with a peak around 1.1, producing an arch-like surface. Consequently, the optimal accuracy is achieved when λ1 and λ2 are set to 1.1 and 2.2, respectively.
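The joint tuning of the two loss weights (denoted λ1 and λ2 here) reduces to a grid search. In the sketch below, a mock scoring surface stands in for the expensive train-and-evaluate step; the peak location mirrors our reported optimum, but the surface itself is hypothetical:

```python
import itertools

def evaluate(lam1, lam2):
    """Mock mAP surface peaking at lam1=1.1, lam2=2.2.
    In practice this would train the model with the given
    loss weights and return validation mAP."""
    return 0.82 - (lam1 - 1.1) ** 2 - 0.1 * (lam2 - 2.2) ** 2

grid1 = [0.0, 0.5, 1.1, 1.5, 2.0]
grid2 = [0.0, 0.5, 1.1, 1.5, 2.2]
best = max(itertools.product(grid1, grid2),
           key=lambda p: evaluate(*p))
print(best)  # (1.1, 2.2)
```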
As shown in Fig. 13, to better balance the trade-off between the Soft Contrastive loss and the Circle loss, we introduce a threshold τ between them using an empirical modulation strategy. Under otherwise optimal settings, we conducted experiments with τ set to 0.3, 0.4, 0.5, 0.6, and 0.7. The highest accuracy across all three datasets was achieved when τ = 0.6.
Fig. 13.

Parameter sensitivity analysis illustrating the impact of the threshold τ on retrieval accuracy across the three datasets. The horizontal axis represents the value of τ (ranging from 0.3 to 0.7), while the vertical axis denotes the retrieval accuracy.
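The Circle loss half of the Circle-Soft objective follows the pair-similarity formulation introduced by Sun et al. (CVPR 2020). A minimal sketch over cosine similarities is given below; the margin m and scale γ are illustrative defaults, and the τ-weighted combination with the Soft Contrastive loss is omitted:

```python
import math

def circle_loss(sp, sn, m=0.25, gamma=64.0):
    """Circle loss (Sun et al., 2020) over positive-pair
    similarities sp and negative-pair similarities sn."""
    delta_p, delta_n = 1.0 - m, m          # relaxed decision margins
    op, on = 1.0 + m, -m                   # similarity optima
    pos = sum(math.exp(-gamma * max(0.0, op - s) * (s - delta_p)) for s in sp)
    neg = sum(math.exp(gamma * max(0.0, s - on) * (s - delta_n)) for s in sn)
    return math.log(1.0 + pos * neg)

# Well-separated pairs incur a much smaller loss than overlapping ones.
good = circle_loss(sp=[0.95, 0.9], sn=[-0.8, -0.7])
bad = circle_loss(sp=[0.3, 0.2], sn=[0.6, 0.5])
print(good < bad)  # True
```

The self-paced weights max(0, op − s) and max(0, s − on) are what distinguish this loss from a plain margin loss: pairs far from their optimum are penalized more strongly, which is the property the ablation above credits for the 2% accuracy gap.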
To provide a more intuitive illustration of the superior performance of the L2-GCN model in retrieval tasks, we present qualitative examples from several randomly selected image-to-text queries. Figure 14 displays the top two retrieved results for each query, showing that our model achieves more accurate matches than the DSCMR baseline in almost every case.
Fig. 14.
Visualization of top two retrieval results for several randomly selected image queries.
To better demonstrate the convergence of the objective function, Figs. 15 and 16 show that during the training process on the MIRFlickr and NUS-WIDE datasets, the loss decreases rapidly and eventually stabilizes. We set the number of iterations to 40, with the final loss function value converging to approximately 0.153 for the MIRFlickr dataset and around 0.01 for the NUS-WIDE dataset.
Fig. 15.

Visualization of the training loss curve on the MIRFlickr dataset. The horizontal axis represents the number of training epochs, while the vertical axis represents the value of the objective function.
Fig. 16.

Visualization of the training loss curve on the NUS-WIDE dataset. The horizontal axis represents the number of training epochs, while the vertical axis represents the value of the objective function.
To demonstrate the practical deployability of the proposed L2-GCN and its balance between efficiency and performance, we conducted a comparative analysis of computational complexity against the baseline method (I-GNN-CON). The experiments were performed on a hardware platform equipped with a single NVIDIA RTX 4060 Ti GPU. We adopted parameter count, floating-point operations (FLOPs), and training time as the evaluation metrics.
The results are summarized in Table 2. Compared to the baseline, our method introduces a marginal increase in model size of 5.49 M parameters and a slight rise in computational cost of 0.07 GFLOPs. This slight overhead is attributed to the additional graph convolution layers and the enhanced feature aggregation mechanism designed to capture richer cross-modal correlations.
Table 2.
Comparison of computational complexity and performance on MIRFlickr.
| Method | Params (M) | FLOPs (G) | Training time | mAP gain |
|---|---|---|---|---|
| Baseline | 29.92 | 0.3010 | 7 m 19 s | – |
| L2-GCN | 35.41 | 0.3744 | 8 m 01 s | +0.5% |
However, it is worth noting that the total FLOPs of our method remain at a very low level of 0.3744 G, which is significantly lower than standard deep visual backbones. Furthermore, the increase in training time is negligible, amounting to less than 1 minute. Considering the 0.5% improvement in retrieval accuracy, we argue that L2-GCN achieves a favorable balance between efficiency and performance. It provides enhanced retrieval capabilities while keeping computational costs within an acceptable range for real-time applications.
Limitations and future work
While the proposed framework demonstrates promising cross-modal retrieval capabilities, we acknowledge certain limitations regarding computational complexity, potential overfitting risks, and adaptability that merit further investigation.
First, regarding computational complexity: The inherent matrix multiplications and feature aggregations of the L2-GCN inevitably increase computational overhead. Although operating in a low-dimensional label space mitigates computational demands, it may still pose a deployment bottleneck on resource-constrained edge devices. Furthermore, the composite objective function—integrating Circle-Soft, adversarial, and classification losses—constructs a complex optimization landscape, potentially leading to local optima or overfitting on small-scale datasets. Finally, its performance on ultra-long texts or high-resolution images has not been fully explored, and the current scope is limited to image-text retrieval, excluding scenarios like video-text or cross-lingual tasks.
To address these challenges, future work will focus on the following directions: We plan to employ knowledge distillation techniques and decouple static graph computations from dynamic training to reduce inference latency and enhance efficiency. To address overfitting risks, we intend to introduce an adaptive loss weighting strategy to ensure stable optimization. Finally, for complex inputs, we plan to integrate Transformer-based backbones and extend the framework to broader scenarios, such as video-text and cross-lingual retrieval, thereby facilitating its wider application.
Conclusion
In this paper, we addressed the limitations of existing cross-modal retrieval methods, particularly their oversight of complex label correlations and insufficient feature discriminability. To this end, we proposed a novel Two-Layer Graph Convolutional Network (L2-GCN), which explicitly models high-order semantic dependencies by aggregating neighbor information within the label graph, rather than treating labels as independent entities. Furthermore, we introduced a hybrid Circle-Soft Loss function that integrates the benefits of Circle Loss and Contrastive Learning. This design dynamically balances intra-class compactness and inter-class separability, effectively mitigating modal heterogeneity. Extensive experiments on the NUS-WIDE, MIRFlickr, and MS-COCO datasets validate the robustness of our framework. Experiments confirm that our method outperforms baselines, achieving accuracy improvements of 0.5%, 0.5%, and 1.0%, respectively.
Author contributions
W L coordinated the overall progress of the article and contributed to some of the innovative points. WCC proposed innovative ideas and was responsible for text and chart production. P conducted research for the article.
Funding
This work was supported by the National Natural Science Foundation of China (52477226).
Data availability
The source code and datasets generated during this study are available in the GitHub repository at https://github.com/buzzcut619/L2-GCN-CIRCLE-SOFT. As the manuscript is currently under submission, the full core code will be released upon its formal acceptance.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Chenchen Wang and Simin Peng contributed equally to this work.
References
- 1.Wang, Q., Dai, W., Zhang, C., Zhu, J., Ma, X.: A compact constraint incremental method for random weight networks and its application. IEEE Trans. Neural Netw. Learning Syst. 1–9 (2023). 10.1109/TNNLS.2023.3289798 [DOI] [PubMed]
- 2.Peng, S. et al. State of health estimation of lithium-ion batteries based on multi-health features extraction and improved long short-term memory neural network. Energy282, 128956. 10.1016/j.energy.2023.128956 (2023). [Google Scholar]
- 3.Lee, K.-H., Chen, X., Hua, G., Hu, H. & He, X. Stacked cross attention for image-text matching. arXiv arXiv:1803.08024 (2018).
- 4.Chen, H., Ding, G., Liu, X., Lin, Z., Liu, J. & Han, J. Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 12652–12660. 10.1109/CVPR42600.2020.01267 (2020).
- 5.Li, M., Li, Y., Huang, S.-L. & Zhang, L. Semantically supervised maximal correlation for cross-modal retrieval. In 2020 IEEE International Conference on Image Processing (ICIP). 2291–2295. 10.1109/ICIP40778.2020.9190873 (2020).
- 6.Huang, Y., Wang, Q., Zhang, Y. & Hu, B. A unified perspective of multi-level cross-modal similarity for cross-modal retrieval. In 2022 5th International Conference on Information Communication and Signal Processing (ICICSP). 466–471. 10.1109/ICICSP55539.2022.10050678 (2022).
- 7.Zhuang, J., Yu, J., Ding, Y., Qu, X. & Hu, Y. Towards fast and accurate image-text retrieval with self-supervised fine-grained alignment. IEEE Trans. Multimed.26, 1361–1372. 10.1109/TMM.2023.3280734 (2024). [Google Scholar]
- 8.Zeng, R., Ma, W., Wu, X., Liu, W. & Liu, J. Image text cross-modal retrieval with instance contrastive embedding. Electronics13(2), 300. 10.3390/electronics13020300 (2024). [Google Scholar]
- 9.Chen, R., Qiang, B., Yang, X., Zhang, S. & Xie, Y. Cross-modal deep interaction and semantic aligning for image-text retrieval. IEICE Trans. Inf. Syst.E108.D(10), 1230–1238. 10.1587/transinf.2024EDP7279 (2025).
- 10.Zeng, Z., He, S., Zhang, Y. & Mao, W. A multimodal embedding transfer approach for consistent and selective learning processes in cross-modal retrieval. Inf. Sci.704, 121974. 10.1016/j.ins.2025.121974 (2025). [Google Scholar]
- 11.Wang, K., He, R., Wang, L., Wang, W. & Tan, T. Joint feature selection and subspace learning for cross-modal retrieval. IEEE Trans. Pattern Anal. Mach. Intell.38(10), 2010–2023. 10.1109/TPAMI.2015.2505311 (2016). [DOI] [PubMed] [Google Scholar]
- 12.Rasiwasia, N. et al. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM International Conference on Multimedia. 251–260. 10.1145/1873951.1873987(2010).
- 13.Hardoon, D. R., Szedmak, S. & Shawe-Taylor, J. Canonical correlation analysis: An overview with application to learning methods. Neural Comput.16(12), 2639–2664. 10.1162/0899766042321814 (2004). [DOI] [PubMed] [Google Scholar]
- 14.Sharma, A., Kumar, A., Daume, H. & Jacobs, D.W. Generalized multiview analysis: A discriminative latent space. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. 2160–2167. 10.1109/CVPR.2012.6247923 (2012).
- 15.Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv arXiv:1409.1556 (2015).
- 16.Wang, Z., Yin, Y. & Ramakrishnan, I.V. Enhancing image-text matching with adaptive feature aggregation. arXiv arXiv:2401.09725 (2024).
- 17.Liu, C. et al. Efficient token-guided image-text retrieval with consistent multimodal contrastive training. IEEE Trans. Image Process.32, 3622–3633. 10.1109/TIP.2023.3286710 (2023). [DOI] [PubMed]
- 18.Zhen, L., Hu, P., Wang, X. & Peng, D. Deep supervised cross-modal retrieval. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10386–10395. 10.1109/CVPR.2019.01064 (2019).
- 19.Kipf, T.N. & Welling, M. Semi-supervised classification with graph convolutional networks. arXiv arXiv:1609.02907 (2017).
- 20.Bruna, J., Zaremba, W., Szlam, A. & LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv:1312.6203 (2014).
- 21.Jiang, H. et al. Dynamic multi-stream graph neural networks for efficient interactive action recognition. Vis. Comput.41(12), 10467–10480. 10.1007/s00371-025-04048-8 (2025). [Google Scholar]
- 22.Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell.39(6), 1137–1149. 10.1109/TPAMI.2016.2577031 (2017). [DOI] [PubMed] [Google Scholar]
- 23.Fan, W. et al. Graph neural networks for social recommendation. arXiv:1902.07243 (2019).
- 24.Wu, L., Sun, P., Hong, R., Fu, Y., Wang, X. & Wang, M. Socialgcn: An efficient graph convolutional network based model for social recommendation. arXiv:1811.02815 (2019).
- 25.Yu, J., Rui, Y. & Chen, B. Exploiting click constraints and multi-view features for image re-ranking. IEEE Trans. Multimed.16(1), 159–168. 10.1109/TMM.2013.2284755 (2014). [Google Scholar]
- 26.Yu, J., Rui, Y. & Tao, D. Click prediction for web image reranking using multimodal sparse coding. IEEE Trans. Image Process.23(5), 2019–2032. 10.1109/TIP.2014.2311377 (2014). [DOI] [PubMed] [Google Scholar]
- 27.Qian, S., Xue, D., Fang, Q. & Xu, C. Adaptive label-aware graph convolutional networks for cross-modal retrieval. IEEE Trans. Multimed.24, 3520–3532. 10.1109/TMM.2021.3101642 (2022).
- 28.Qian, S., Xue, D., Zhang, H., Fang, Q. & Xu, C. Dual adversarial graph neural networks for multi-label cross-modal retrieval. AAAI35(3), 2440–2448. 10.1609/aaai.v35i3.16345 (2021). [Google Scholar]
- 29.Qian, S., Xue, D., Fang, Q. & Xu, C. Integrating multi-label contrastive learning with dual adversarial graph neural networks for cross-modal retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 1–18. 10.1109/TPAMI.2022.3188547 (2022). [DOI] [PubMed]
- 30.Sun, H., Qin, X. & Liu, X. Flexible graph-based attention and pooling network for image-text retrieval. Multimed. Tools Appl.83(19), 57895–57912. 10.1007/s11042-023-17798-1 (2023). [Google Scholar]
- 31.Advanced intelligent computing technology and applications. In Lecture Notes in Computer Science (Huang, D.-S., Chen, H., Li, B., Zhang, Q. eds.). Vol. 2573. 10.1007/978-981-95-0009-3 (Springer, 2025).
- 32.Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. & Fei-Fei, L. Imagenet: A large-scale hierarchical image database.
- 33.Rumelhart, D.E., Hinton, G.E. & McClelland, L. A general framework for parallel distributed processing.
- 34.Zhen, L., Hu, P., Wang, X. & Peng, D. Deep supervised cross-modal retrieval. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10386–10395. 10.1109/CVPR.2019.01064 (2019).
- 35.Chen, Z.-D. et al. Scratch: A scalable discrete matrix factorization hashing framework for cross-modal retrieval. IEEE Trans. Circuits Syst. Video Technol.30(7), 2262–2275. 10.1109/TCSVT.2019.2911359 (2020). [Google Scholar]
- 36.Akaho, S. A kernel method for canonical correlation analysis. arXiv:cs/0609071 (2007).
- 37.Ranjan, V., Rasiwasia, N. & Jawahar, C.V. Multi-label cross-modal retrieval. In 2015 IEEE International Conference on Computer Vision (ICCV). 4094–4102. 10.1109/ICCV.2015.466 (2015).
- 38.Shen, Z., Hong, A. & Chen, A. Many-to-many comprehensive relative importance analysis and its applications to analysis of semiconductor electrical testing parameters. Adv. Eng. Inform.48, 101283. 10.1016/j.aei.2021.101283 (2021). [Google Scholar]
- 39.Wang, Z., Zhang, Z., Luo, Y., Huang, Z. & Shen, H. T. Deep collaborative discrete hashing with semantic-invariant structure construction. IEEE Trans. Multimed.23, 1274–1286. 10.1109/TMM.2020.2995267 (2021). [Google Scholar]
- 40.Xu, R., Li, C., Yan, J., Deng, C. & Liu, X. Graph convolutional network hashing for cross-modal retrieval. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. 982–988. 10.24963/ijcai.2019/138 (2019).
- 41.Shen, X., Zhang, H., Li, L., Yang, W. & Liu, L. Semi-supervised cross-modal hashing with multi-view graph representation. Inf. Sci.604, 45–60. 10.1016/j.ins.2022.05.006 (2022). [Google Scholar]
- 42.Andrew, G., Arora, R., Bilmes, J. & Livescu, K. Deep canonical correlation analysis.
- 43.Wang, B., Yang, Y., Xu, X., Hanjalic, A. & Shen, H.T. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM International Conference on Multimedia. 154–162. 10.1145/3123266.3123326 (2017).
- 44.Jing, M., Li, J., Zhu, L., Lu, K., Yang, Y. & Huang, Z. Incomplete cross-modal retrieval with dual-aligned variational autoencoders. In Proceedings of the 28th ACM International Conference on Multimedia. 3283–3291. 10.1145/3394171.3413676 (2020).
- 45.Xie, Z., Zhang, W., Sheng, B., Li, P. & Chen, C. L. P. Bagfn: Broad attentive graph fusion network for high-order feature interactions. IEEE Trans. Neural Netw. Learn. Syst.34(8), 4499–4513. 10.1109/TNNLS.2021.3116209 (2023). [DOI] [PubMed] [Google Scholar]