Computational Intelligence and Neuroscience. 2021 Jul 2;2021:9968716. doi: 10.1155/2021/9968716

Quadruplet-Based Deep Cross-Modal Hashing

Huan Liu 1, Jiang Xiong 1,, Nian Zhang 2, Fuming Liu 1, Xitao Zou 1
PMCID: PMC8270718  PMID: 34306059

Abstract

Recently, benefiting from the storage and retrieval efficiency of hashing and the powerful discriminative feature extraction capability of deep neural networks, deep cross-modal hashing retrieval has drawn increasing attention. To preserve the semantic similarities of cross-modal instances during the hash mapping procedure, most existing deep cross-modal hashing methods learn deep hashing networks with a pairwise loss or a triplet loss. However, these methods may not fully explore the similarity relations across modalities. To address this problem, in this paper, we introduce a quadruplet loss into deep cross-modal hashing and propose a quadruplet-based deep cross-modal hashing (termed QDCMH) method. Extensive experiments on two benchmark cross-modal retrieval datasets show that our proposed method achieves state-of-the-art performance and demonstrate the effectiveness of the quadruplet loss in cross-modal hashing.

1. Introduction

With the advent of the era of big data, massive amounts of multimedia data, such as images, videos, and texts, are surging onto the Internet. These data usually exist in diversified modalities; for example, a video or an image may be accompanied by a textual description and an audio track. As data from different modalities may have compact semantic relevance, cross-modal retrieval [1, 2] has been proposed to retrieve semantically similar data from one modality given a query from a distinct modality. Benefiting from its high efficiency and low cost, hashing-based cross-modal retrieval (cross-modal hashing) [3–6] has drawn extensive attention. The goal of cross-modal hashing is to map modality-heterogeneous data into a common binary space and to ensure that semantically similar/dissimilar cross-modal data have similar/dissimilar hash codes. Cross-modal hashing methods can usually achieve superior performance; nonetheless, most existing cross-modal hashing methods (such as cross-modal similarity sensitive hashing (CMSSH) [7], semantic correlation maximization (SCM) [8], semantics-preserving hashing (SePH) [9], and generalized semantic preserving hashing (GSPH) [10]) are based on handcrafted feature learning, which cannot effectively capture the heterogeneous relevance between different modalities and thus may result in inferior performance.

In the last decade, deep convolutional neural networks [11, 12] have been successfully utilized in many computer vision tasks, and therefore, some researchers have also deployed them in cross-modal hashing, such as deep cross-modal hashing (DCMH) [13], pairwise relationship guided deep hashing (PRDH) [14], self-supervised adversarial hashing (SSAH) [15], and triplet-based deep hashing (TDH) [16]. Cross-modal hashing methods with deep neural networks integrate hash representation learning and hash function learning into an end-to-end framework, which can capture heterogeneous cross-modal relevance more effectively and thus achieve better cross-modal retrieval performance.

To date, most deep cross-modal hashing methods utilize the pairwise loss (such as [13–15]) or the triplet loss (such as [16]) to preserve semantic relevance during the hash representation learning procedure. Nevertheless, pairwise loss- and triplet loss-based hashing methods suffer from a weak generalization capacity from the training set to the testing set [17, 18], as shown in Figure 1(a). In contrast, the quadruplet loss has been utilized in image hashing retrieval [17] and person reidentification [18], and these works show that quadruplet loss-based models can enhance the generalization capability. Therefore, combining the quadruplet loss with cross-modal hashing is a natural way to enhance cross-modal hashing performance, as shown in Figure 1(b).

Figure 1.


(a) Triplet loss-based cross-modal hashing methods suffer from a weak generalization capacity from the training set to the testing set: the test instances, which belong to the category marked in the figure, cannot be mapped into compact binary codes (see the lower-right corner). (b) Quadruplet loss-based cross-modal hashing methods can project the test instances of the same category into a compact binary space (see the lower-right corner).

To this end, in this paper, we introduce quadruplet loss into cross-modal hashing and propose a quadruplet-based deep cross-modal hashing method (QDCMH). Specifically, QDCMH firstly defines a quadruplet-based cross-modal semantic preserving module. Afterwards, QDCMH integrates this module, hash representation learning, and hash code generation into an end-to-end framework. Finally, experiments on two benchmark cross-modal retrieval datasets are conducted to validate the performance of the proposed method. The main contributions of our proposed QDCMH include the following:

  1. We introduce quadruplet loss into cross-modal retrieval and propose a novel deep cross-modal hashing method. To the best of our knowledge, this is the first work to introduce quadruplet loss into cross-modal hashing retrieval.

  2. We conduct extensive experiments on benchmark cross-modal retrieval datasets to investigate the performance of our proposed QDCMH.

The remainder of this paper is organized as follows. Section 2 elaborates our proposed quadruplet-based deep cross-modal hashing method. Section 3 presents the learning algorithm of QDCMH. Section 4 is the experimental results and the corresponding analysis. Section 5 concludes our work.

2. Proposed Method

In this section, we elaborate our proposed quadruplet-based deep cross-modal hashing (QDCMH) method in the following subsections: notations, the quadruplet-based cross-modal semantic preserving module, feature learning networks, and hash function learning. Figure 2 presents the flowchart of our proposed QDCMH, which incorporates the quadruplet-based cross-modal semantic preserving module, hash representation learning, and hash code generation into an end-to-end framework. In our proposed QDCMH method, we assume that each instance has two modalities, i.e., an image modality and a text modality, but the method can easily be extended to more modalities.

Figure 2.


Flowchart of the proposed quadruplet-based deep cross-modal hashing (QDCMH) method. QDCMH encompasses three components: (1) a quadruplet-based cross-modal semantic preserving module; (2) feature learning networks, where a classical convolutional neural network learns image-modality features and the TxtNet in SSAH [15] learns text-modality features; and (3) an intermodal quadruplet loss that efficiently captures the relevant semantic information during the feature learning process, together with a quantization loss that decreases the information loss during the hash code generation procedure. (a) Quadruplet (Vq, Tp, Tn1, Tn2), which uses an image instance Vq to retrieve three text instances Tp, Tn1, and Tn2. Vq and Tp have at least one common label, while Vq and Tn1, Vq and Tn2, and Tn1 and Tn2 are three instance pairs and the two instances in each pair have no common label. (b) Quadruplet (Tq, Vp, Vn1, Vn2), which uses a text instance Tq to retrieve three image instances Vp, Vn1, and Vn2. Tq and Vp have at least one common label, while Tq and Vn1, Tq and Vn2, and Vn1 and Vn2 are three instance pairs and the two instances in each pair have no common label.

2.1. Notations

Assume that the training data comprise n image-text pairs, i.e., the original image features V ∈ R^{n×dv} and the original text features T ∈ R^{n×dt}. Besides, there is a label vector associated with each image-text pair, and the label vectors of all training instances constitute a label matrix L ∈ R^{n×dl}. Here, dv and dt are the original dimensions of the image features and text features, respectively, and dl is the total number of class categories. If the image-text pair {Vi, Ti} belongs to the jth category, then Lij=1; otherwise, Lij=0. The quadruplet (Vq, Tp, Tn1, Tn2) denotes that Vq is a query instance from the image modality and Tp, Tn1, and Tn2 are three retrieval instances from the text modality, where Vq and Tp have at least one common category, while Vq and Tn1, Vq and Tn2, and Tn1 and Tn2 are three instance pairs and the two instances in each pair have no common label.

With the known quadruplet (Vq, Tp, Tn1, Tn2), the target of our proposed QDCMH is to learn the corresponding hash codes (BVq, BTp, BTn1, BTn2), where BVq, BTp, BTn1, BTn2 are the hash codes of instances Vq, Tp, Tn1, Tn2, respectively. To learn the above hash codes, we first learn the hash representations (FVq, GTp, GTn1, GTn2) from the quadruplet (Vq, Tp, Tn1, Tn2) with deep neural networks, where FVq=f(Vq, θV) and GTp=g(Tp, θT) are the hash representations of instances Vq and Tp, respectively. f(., θV) and g(., θT) are the hash representation learning functions for the image modality and the text modality, respectively. θV and θT are the parameters of the deep neural networks that extract features for the image modality and the text modality, respectively. Secondly, we can utilize the following sign function to approximately map the hash representations into the corresponding hash codes, i.e., BVq=sign(FVq) and BTp=sign(GTp). In the same way, we can learn the hash codes of quadruplet (Tq, Vp, Vn1, Vn2). For convenience, we denote the hash codes of all training image-text pairs, the hash representations of all training image instances, and the hash representations of all training text instances as B ∈ {−1, 1}^{n×k}, F ∈ R^{n×k}, and G ∈ R^{n×k}, respectively, where k is the length of the hash codes:

y = \begin{cases} 1, & \text{if } x \geq 0, \; x \in \mathbb{R}, \\ -1, & \text{if } x < 0, \; x \in \mathbb{R}. \end{cases} \quad (1)
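As a concrete illustration, the following minimal PyTorch sketch (our own, not the authors' released code) binarizes real-valued hash representations with the sign function of equation (1); since torch.sign returns 0 at exactly 0, non-negative entries are mapped to +1 explicitly.

```python
import torch

def binarize(h: torch.Tensor) -> torch.Tensor:
    """Map real-valued hash representations to {-1, +1} codes (equation (1)).

    torch.sign() would return 0 at exactly 0, so non-negative entries are
    explicitly sent to +1 to match the definition above.
    """
    return torch.where(h >= 0, torch.ones_like(h), -torch.ones_like(h))

# Example: a batch of 4 image-modality hash representations with k = 8.
F_Vq = torch.randn(4, 8)   # F_Vq = f(Vq; theta_V), real-valued
B_Vq = binarize(F_Vq)      # corresponding binary hash codes
```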

2.2. Quadruplet-Based Cross-Modal Semantic Preserving Module

In cross-modal hashing retrieval, given an image instance Vi and a text instance Tj, it is intractable to preserve their semantic relevance during the hash code learning procedure because of the huge semantic gap across modalities. To address this, DCMH [13] defines a pairwise loss to map similar/dissimilar image-text pairs into similar/dissimilar hash codes. TDH [16] utilizes a triplet loss to learn similar hash codes for similar cross-modal instances and distinct hash codes for semantically irrelevant cross-modal instances. Both the pairwise loss and the triplet loss can preserve the relevance of the original instance space; however, pairwise loss- and triplet loss-based hashing methods often suffer from a weak generalization capability from the training set to the testing set [17, 18]. To solve this problem, in this section, a quadruplet-based cross-modal semantic preserving module is proposed to boost the generalization capability and better preserve the semantic relevance for cross-modal hashing.

For a quadruplet (Vq, Tp, Tn1, Tn2), we should keep the semantic relevance unchanged during hash representation learning, i.e., FVq should be similar to GTp, FVq should be distinct from GTn1 and GTn2, and GTn1 should be dissimilar to GTn2. Thus, we can define the following quadruplet loss for cross-modal hashing:

J_{quadruplet}^{IT}(F_{Vq}, G_{Tp}, G_{Tn1}, G_{Tn2}) = \sum_{(Vq, Tp, Tn1)} \max\left(0, \|F_{Vq} - G_{Tp}\|_2^2 - \|F_{Vq} - G_{Tn1}\|_2^2 + \alpha_1\right) + \sum_{(Vq, Tp, Tn1, Tn2)} \max\left(0, \|F_{Vq} - G_{Tp}\|_2^2 - \|G_{Tn1} - G_{Tn2}\|_2^2 + \alpha_2\right), \quad (2)

where Vq is a query instance from the image modality; Tp, Tn1, and Tn2 are three retrieval instances from the text modality; Vq and Tp are semantically similar; and Vq and Tn1, Vq and Tn2, and Tn1 and Tn2 are three instance pairs in which the two instances have distinct semantics. Equation (2) requires the distance between the hash representations of a similar cross-modal pair to be smaller, by a positive margin (α1 or α2), than the distance between the representations of a dissimilar pair (both intermodal and intramodal). This ensures that similar cross-modal instances have similar hash representations while dissimilar instances have distinct hash representations. With this quadruplet loss, the cross-modal semantic relevance can be preserved during the hash representation learning stage.

Similarly, given a quadruplet (Tq, Vp, Vn1, Vn2), we can have the following cross-modal quadruplet loss:

J_{quadruplet}^{TI}(G_{Tq}, F_{Vp}, F_{Vn1}, F_{Vn2}) = \sum_{(Tq, Vp, Vn1)} \max\left(0, \|G_{Tq} - F_{Vp}\|_2^2 - \|G_{Tq} - F_{Vn1}\|_2^2 + \alpha_3\right) + \sum_{(Tq, Vp, Vn1, Vn2)} \max\left(0, \|G_{Tq} - F_{Vp}\|_2^2 - \|F_{Vn1} - F_{Vn2}\|_2^2 + \alpha_4\right), \quad (3)

where Tq is a query instance from the text modality, Vp, Vn1, and Vn2 are three retrieval instances from the image modality, GTq, FVp, FVn1, and FVn2 are the hash representations of instances Tq, Vp, Vn1, and Vn2, respectively, and α3 and α4 are two positive margins. Equation (3) differs from equation (2) only in that the query and retrieval modalities are swapped.
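To make the two losses concrete, the following PyTorch sketch implements the margin terms of equations (2) and (3) for a batch of quadruplets; the function name, the batching, and the default margin values are our own assumptions rather than details taken from the paper.

```python
import torch

def quadruplet_loss(h_q, h_p, h_n1, h_n2, alpha1=1.0, alpha2=0.5):
    """Cross-modal quadruplet loss of equations (2)/(3) for one batch.

    h_q:        (batch, k) representations of the query instances
    h_p:        (batch, k) representations of the positive retrieval instances
    h_n1, h_n2: (batch, k) representations of the two mutually dissimilar
                negative retrieval instances
    alpha1, alpha2: positive margins (placeholder values, not from the paper)
    """
    d_pos = (h_q - h_p).pow(2).sum(dim=1)     # ||query - positive||_2^2
    d_neg = (h_q - h_n1).pow(2).sum(dim=1)    # ||query - negative1||_2^2
    d_nn  = (h_n1 - h_n2).pow(2).sum(dim=1)   # ||negative1 - negative2||_2^2
    term1 = torch.clamp(d_pos - d_neg + alpha1, min=0.0)   # first sum in (2)/(3)
    term2 = torch.clamp(d_pos - d_nn + alpha2, min=0.0)    # second sum in (2)/(3)
    return (term1 + term2).sum()
```

Calling quadruplet_loss(F_Vq, G_Tp, G_Tn1, G_Tn2, α1, α2) corresponds to the image-to-text loss of equation (2), while quadruplet_loss(G_Tq, F_Vp, F_Vn1, F_Vn2, α3, α4) corresponds to the text-to-image loss of equation (3).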

2.3. Hash Representation Learning and Hash Code Learning

For each quadruplet from the training set, we can learn its hash representations while fully preserving the semantic similarity with the above quadruplet-based cross-modal semantic relevance preserving module, so we have the following hash representation learning loss:

J_{representation} = \frac{1}{n_{IT}} \sum_{IT} J_{quadruplet}^{IT}(F_{Vq}, G_{Tp}, G_{Tn1}, G_{Tn2}) + \frac{\beta}{n_{TI}} \sum_{TI} J_{quadruplet}^{TI}(G_{Tq}, F_{Vp}, F_{Vn1}, F_{Vn2}), \quad (4)

where nIT is the number of quadruplets that use an image to retrieve texts, nTI is the number of quadruplets that use a text to retrieve images, and β is a hyperparameter to balance the two terms.

Additionally, to learn high-quality hash codes, we generate hash codes from the learned hash representations with the sign function in equation (1), and the final hash codes matrix for all training image-text pairs are generated as follows:

B = \operatorname{sign}\left(\frac{F + G}{2}\right). \quad (5)

As F and G are real-valued features, to decrease the information loss from F and G to B in equation (5), it is necessary to force F and G to be as close as possible to B; thus, we introduce the following quantization loss:

J_{quantization} = \frac{\|B - F\|_2^2 + \|B - G\|_2^2}{2nk}. \quad (6)

Integrating the hash representation loss and the quantization loss together, the whole loss function is as follows:

J = J_{representation} + \gamma J_{quantization}, \quad (7)

where γ is a hyperparameter to balance the hash representation loss and the quantization loss.
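A minimal sketch of the overall objective is given below, reusing the quadruplet_loss helper above; the quadruplet sampling interface and the per-quadruplet loop are simplifying assumptions of ours, not the authors' implementation.

```python
import torch

def total_loss(F, G, B, quad_it, quad_ti, beta=1.0, gamma=0.2):
    """Overall objective J of equation (7), as a sketch.

    F, G: (n, k) real-valued image/text hash representations
    B:    (n, k) binary hash codes in {-1, +1}
    quad_it, quad_ti: lists of index tuples (q, p, n1, n2) for the
        image-to-text and text-to-image directions (sampling assumed)
    """
    n, k = F.shape
    # Equation (4): averaged quadruplet losses of both retrieval directions.
    j_it = sum(quadruplet_loss(F[q:q + 1], G[p:p + 1], G[n1:n1 + 1], G[n2:n2 + 1])
               for q, p, n1, n2 in quad_it) / max(len(quad_it), 1)
    j_ti = sum(quadruplet_loss(G[q:q + 1], F[p:p + 1], F[n1:n1 + 1], F[n2:n2 + 1])
               for q, p, n1, n2 in quad_ti) / max(len(quad_ti), 1)
    j_rep = j_it + beta * j_ti
    # Equation (6): quantization loss between relaxed and binary codes.
    j_quant = ((B - F).pow(2).sum() + (B - G).pow(2).sum()) / (2 * n * k)
    # Equation (7): weighted combination.
    return j_rep + gamma * j_quant
```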

2.4. Feature Extraction Networks

In QDCMH, feature extraction involves two deep neural networks: a classical convolutional neural network is used to extract the features of images, and a multiscale fusion model is utilized to learn features from texts. Specifically, for the image modality, we deploy AlexNet [11] pretrained on the ImageNet [19] dataset and replace its last fully connected layer with a new hash layer consisting of k hidden nodes, which is then fine-tuned. In this way, the learned deep features are embedded into a k-dimensional Hamming space. For the text modality, the TxtNet in SSAH [15] is used, which comprises a three-layer feedforward neural network and a multiscale (MS) fusion model (Input⟶MS⟶4096⟶512⟶k).
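The sketch below outlines the two feature extraction networks under stated assumptions: the image network replaces AlexNet's 1000-way classifier with a k-node hash layer, and a tanh activation is added as a common relaxation (the paper does not specify the output activation); the text network here keeps only the 4096⟶512⟶k head and omits the multiscale fusion block of SSAH's TxtNet.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImgNet(nn.Module):
    """AlexNet pretrained on ImageNet with a k-dimensional hash layer."""
    def __init__(self, k: int):
        super().__init__()
        alexnet = models.alexnet(pretrained=True)
        self.features = alexnet.features
        self.avgpool = alexnet.avgpool
        # Replace the final 1000-way layer with a k-node hash layer.
        head = list(alexnet.classifier.children())[:-1]
        head.append(nn.Linear(4096, k))
        self.classifier = nn.Sequential(*head)

    def forward(self, x):
        x = torch.flatten(self.avgpool(self.features(x)), 1)
        return torch.tanh(self.classifier(x))    # relaxed codes in (-1, 1)

class TxtNet(nn.Module):
    """Simplified text network: the multiscale (MS) fusion block of SSAH's
    TxtNet is omitted here; only the 4096 -> 512 -> k head is sketched."""
    def __init__(self, tag_dim: int, k: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(tag_dim, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 512), nn.ReLU(inplace=True),
            nn.Linear(512, k),
        )

    def forward(self, t):
        return torch.tanh(self.net(t))            # relaxed codes in (-1, 1)
```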

3. Learning Algorithm of QDCMH

For QDCMH, we utilize an alternating strategy to learn the parameters θV of the deep neural network for the image modality, the parameters θT of the deep neural network for the text modality, and the hash code matrix B for all training image-text pairs. When we learn one of θV, θT, and B, we keep the other two fixed. The specific procedure is depicted in Algorithm 1.

Algorithm 1.


QDCMH: quadruplet-based deep cross-modal hashing.

3.1. Update θV with θT and B Fixed

When θT and B are maintained fixed, we utilize stochastic gradient descent and backpropagation to optimize the deep neural network parameters θV.

3.2. Update θT with θV and B Fixed

When we fix the values of θV and B, we use stochastic gradient descent and backpropagation to learn the deep neural network parameters θT.

3.3. Update B with θT and θV Fixed

When the deep neural networks' parameters θT and θV are kept unchanged, the hash code matrix B can be updated directly with equation (5).
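Putting the three update steps together, a compact sketch of the alternating optimization (reusing the binarize and total_loss helpers from Section 2) might look as follows; the full-batch forward passes and the fixed quadruplet sampler are simplifications of ours, and mini-batching, the learning rate schedule, and other details of Algorithm 1 are omitted.

```python
import torch

def train_qdcmh(img_net, txt_net, images, texts, sample_quadruplets,
                epochs=500, beta=1.0, gamma=0.2, lr=10 ** -1.5):
    """Alternating optimization sketch of Algorithm 1 (simplified).

    sample_quadruplets: assumed callable returning (quad_it, quad_ti),
    i.e., index quadruplets for both retrieval directions.
    """
    opt_v = torch.optim.SGD(img_net.parameters(), lr=lr)
    opt_t = torch.optim.SGD(txt_net.parameters(), lr=lr)
    F = img_net(images).detach()
    G = txt_net(texts).detach()
    B = binarize((F + G) / 2)                    # equation (5)

    for epoch in range(epochs):
        quad_it, quad_ti = sample_quadruplets()

        # Step 1: update theta_V with theta_T and B fixed.
        F = img_net(images)
        loss_v = total_loss(F, G, B, quad_it, quad_ti, beta, gamma)
        opt_v.zero_grad(); loss_v.backward(); opt_v.step()
        F = img_net(images).detach()             # refresh after the update

        # Step 2: update theta_T with theta_V and B fixed.
        G = txt_net(texts)
        loss_t = total_loss(F, G, B, quad_it, quad_ti, beta, gamma)
        opt_t.zero_grad(); loss_t.backward(); opt_t.step()
        G = txt_net(texts).detach()              # refresh after the update

        # Step 3: update B with theta_V and theta_T fixed (equation (5)).
        B = binarize((F + G) / 2)
    return B
```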

4. Experiments

4.1. Datasets

To investigate the performance of QDCMH, we conduct experiments on two benchmark cross-modal retrieval datasets: MIRFLICKR-25K [20] and Microsoft COCO2014 [21], and the brief descriptions of the datasets are listed in Table 1.

Table 1.

Brief description of the experimental datasets.

Dataset | Total used | Training | Query | Retrieval | Tag dimension | Labels
MIRFLICKR-25K | 20,015 | 10,000 | 2,000 | 18,015 | 1,386 | 24
MS-COCO2014 | 122,218 | 10,000 | 5,000 | 117,218 | 2,026 | 80

4.2. Evaluation Metrics

In our experiments, we utilize mean average precision (MAP), top N-precision curves (top N curves), and precision-recall curves (PR curves) as evaluation metrics; detailed descriptions of these metrics can be found in [22, 23].
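For reference, a minimal NumPy sketch of MAP under Hamming ranking is given below (our own simplification; the exact protocol, e.g., the number of top-ranked items considered, follows [22, 23]).

```python
import numpy as np

def mean_average_precision(query_codes, retrieval_codes,
                           query_labels, retrieval_labels):
    """MAP over Hamming ranking: codes in {-1, +1}, labels as multi-hot rows.

    A retrieved item counts as relevant if it shares at least one label
    with the query.
    """
    k = query_codes.shape[1]
    aps = []
    for q_code, q_label in zip(query_codes, query_labels):
        hamming = 0.5 * (k - retrieval_codes @ q_code)   # Hamming distances
        order = np.argsort(hamming)                      # ascending distance
        relevant = (retrieval_labels[order] @ q_label) > 0
        if relevant.sum() == 0:
            continue
        ranks = np.arange(1, len(relevant) + 1)
        precision = np.cumsum(relevant) / ranks          # precision@rank
        aps.append((precision * relevant).sum() / relevant.sum())
    return float(np.mean(aps))
```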

4.3. Baselines and Implementation Details

We compare our proposed QDCMH method with eight state-of-the-art cross-modal hashing methods, including four handcrafted ones, i.e., the cross-modal similarity sensitive hashing (CMSSH) method [7], the semantics-preserving hashing (SePH) method [9], the semantic correlation maximization (SCM) method [8], and the generalized semantic preserving hashing (GSPH) method [10], and four deep feature-based ones, i.e., the deep cross-modal hashing (DCMH) method [13], the pairwise relationship guided deep hashing (PRDH) method [14], the self-supervised adversarial hashing (SSAH) method [15], and the triplet-based deep hashing (TDH) method [16]. Most baseline methods are carefully implemented based on the codes provided by the authors; the remaining ones are implemented by us following the suggestions and descriptions of the original papers.

All the experiments are executed using the open-source deep learning framework PyTorch and run on an NVIDIA GTX Titan XP GPU server. In our experiments, we set nIT=nTI=10000, max_epoch=500, and λ=10^{−5}, and the learning rate is initialized to 10^{−1.5} and gradually decreased to 10^{−6} over the 500 epochs. For the handcrafted feature-based baselines, each image in the two datasets is represented by a 512-dimensional bag-of-words (BoW) histogram feature vector. Throughout the experiments, we use I⟶T to denote using an image query to retrieve texts and T⟶I to denote using a text query to retrieve images.
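The exact learning rate decay schedule is not specified in the text; one plausible reading is a log-spaced schedule from 10^{−1.5} down to 10^{−6} over the 500 epochs, e.g.:

```python
import numpy as np

# Log-spaced learning rates: 10^(-1.5) at epoch 0 down to 10^(-6) at epoch 499.
learning_rates = np.logspace(-1.5, -6, num=500)

# Applied per epoch to an existing optimizer (hypothetical usage):
# for group in optimizer.param_groups:
#     group["lr"] = learning_rates[epoch]
```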

4.4. Performance Evaluation and Discussion

Firstly, we investigate the performance of QDCMH with different values of the hyperparameters β and γ. To this end, we experiment on MIRFLICKR-25K with hash code length k=64 and record the corresponding MAPs under different values of β and γ, as shown in Figure 3. We find that high performance can be achieved when β=1 and γ=0.2.

Figure 3.


A sensitivity analysis of the hyperparameters. (a) Hyperparameter β on MIRFLICKR-25K dataset. (b) Hyperparameter γ on MIRFLICKR-25K dataset.

Secondly, to validate the performance of QDCMH, we compare QDCMH with the baseline methods in terms of MAP on the MIRFLICKR-25K and MS-COCO2014 datasets. Table 2 presents the MAPs of each method for different hash code lengths, i.e., 16, 32, and 64 bits. DSePH represents the SePH method whose features of the original images are extracted by CNN-F. From Table 2, we can see the following. (1) The MAPs of our proposed QDCMH are higher than those of most baseline methods in most cases, which demonstrates the superiority of QDCMH. We also observe that SSAH outperforms our proposed QDCMH in most cases, which is partly because SSAH takes self-supervised learning and generative adversarial networks into account during the hash representation learning procedure. (2) The MAPs of QDCMH are always higher than those of TDH, which shows that the quadruplet loss can better preserve semantic relevance than the triplet loss in cross-modal hashing retrieval. (3) The MAPs of DSePH are always higher than those of SePH, which demonstrates the powerful feature learning capacity of deep neural networks. (4) Compared with the baselines, our proposed QDCMH performs relatively better on the MS-COCO2014 dataset than on the MIRFLICKR-25K dataset, which is partly because the instances in MS-COCO2014 belong to 80 categories while those in MIRFLICKR-25K belong to 24 categories, so the quadruplets generated from MS-COCO2014 have better generalization ability than those generated from MIRFLICKR-25K.

Table 2.

Comparison with the baselines in terms of MAP on two datasets: MIRFLICKR-25K and Microsoft COCO2014. The best accuracy is shown in boldface.

Task Methods MIRFlickr-25K (16 bits / 32 bits / 64 bits) MS-COCO (16 bits / 32 bits / 64 bits)
I⟶T Handcrafted methods CMSSH [7] 0.5600 0.5709 0.5836 0.5439 0.5450 0.5410
SePH [9] 0.6740 0.6813 0.6803 0.4295 0.4353 0.4726
SCM [8] 0.6354 0.6407 0.6556 0.4252 0.4344 0.4574
GSPH [10] 0.6068 0.6191 0.6230 0.4427 0.4733 0.4840
Deep methods DCMH [13] 0.7316 0.7343 0.7446 0.5228 0.5438 0.5419
PRDH [14] 0.6952 0.7072 0.7108 0.5238 0.5521 0.5572
SSAH [15] 0.7745 0.7882 0.7990 0.5127 0.5256 0.5067
TDH [16] 0.7423 0.7478 0.7512 0.5164 0.5222 0.5276
DSePH [9] 0.7128 0.7285 0.7422 0.4621 0.4958 0.5112
QDCMH 0.7635 0.7688 0.7713 0.5286 0.5313 0.5371

T⟶I Handcrafted methods CMSSH [7] 0.5726 0.5776 0.5753 0.3793 0.3876 0.3899
SePH [9] 0.7139 0.7258 0.7294 0.4348 0.4606 0.5195
SCM [8] 0.6340 0.6458 0.6541 0.4118 0.4183 0.4345
GSPH [10] 0.6282 0.6458 0.6503 0.5435 0.6039 0.6461
Deep methods DCMH [13] 0.7607 0.7737 0.7805 0.4883 0.4942 0.5145
PRDH [14] 0.7626 0.7718 0.7755 0.5122 0.5190 0.5404
SSAH [15] 0.7860 0.7974 0.7910 0.4832 0.4831 0.4922
TDH [16] 0.7516 0.7577 0.7634 0.5198 0.5332 0.5399
DSePH [9] 0.7422 0.7578 0.7760 0.4616 0.4882 0.5305
QDCMH 0.7762 0.7725 0.7859 0.5245 0.5398 0.5487

Thirdly, to further investigate the performance of QDCMH, we plot the precision-recall curves and top N-precision curves of QDCMH and the baseline methods with hash code length 64 on the MIRFLICKR-25K and Microsoft COCO2014 datasets, as presented in Figures 4 and 5. From these figures, we can see that the precision-recall curves and top N-precision curves are largely consistent with the MAPs in Table 2.

Figure 4.


Precision-recall curves on datasets MIRFLICKR-25K and Microsoft COCO2014.

Figure 5.


Top N-precision curves on datasets MIRFLICKR-25K and Microsoft COCO2014.

5. Conclusions

In this paper, we introduce a quadruplet loss into deep cross-modal hashing to fully preserve the semantic relevance of the original cross-modal quadruplet instances and propose a quadruplet-based deep cross-modal hashing (QDCMH) method. QDCMH integrates the quadruplet-based cross-modal semantic relevance preserving module, hash representation learning, and hash code generation into an end-to-end framework. Experiments on two benchmark cross-modal retrieval datasets demonstrate the effectiveness of our proposed QDCMH.

Data Availability

The experimental datasets and the related settings can be found at https://github.com/SWU-CS-MediaLab/MLSPH. The experimental codes used to support the findings of this study will be deposited in the GitHub repository after the publication of this paper or can be obtained from xitaozou@sanxiau.edu.cn.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  • 1.Peng Y., Huang X., Zhao Y. An overview of cross-media retrieval: concepts, methodologies, benchmarks, and challenges. IEEE Transactions on Circuits and Systems for Video Technology. 2017;28(9):2372–2385. doi: 10.1109/TCSVT.2017.2705068. [DOI] [Google Scholar]
  • 2.Wang K., Yin Q., Wang W., Wu S., Wang L. A comprehensive survey on cross-modal retrieval. Multimedia. 2016 https://arxiv.org/abs/1607.06215. [Google Scholar]
  • 3.Deng C., Yang E., Liu T., Tao D. Two-stream deep hashing with class-specific centers for supervised image search. IEEE Transactions on Neural Networks and Learning Systems. 2019;31(6):2189–2201. doi: 10.1109/TNNLS.2019.2929068. [DOI] [PubMed] [Google Scholar]
  • 4.Deng C., Yang E., Liu T., Liu W., Tao D. Unsupervised semantic-preserving adversarial hashing for image search. IEEE Transactions on Image Processing. 2019;28 doi: 10.1109/tip.2019.2903661. [DOI] [PubMed] [Google Scholar]
  • 5.Yang E., Deng C., Li C., Liu W., Li J., Tao D. Shared predictive cross-modal deep quantization. IEEE Transactions on Neural Networks and Learning Systems. 2018;29(11):5292–5303. doi: 10.1109/tnnls.2018.2793863. [DOI] [PubMed] [Google Scholar]
  • 6.Yang E., Liu T., Deng C., Tao D. Adversarial examples for hamming space search. IEEE Transactions on Cybernetics. 2018;50(4):1473–1484. doi: 10.1109/TCYB.2018.2882908. [DOI] [PubMed] [Google Scholar]
  • 7.Bronstein M. M., Bronstein A. M., Michel F., Paragios N. Data fusion through cross-modality metric learning using similarity-sensitive hashing. Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition; June 2010; San Francisco, CA, USA. pp. 3594–3601. [DOI] [Google Scholar]
  • 8.Zhang D., Li W.-J. Large-scale supervised multimodal hashing with semantic correlation maximization. Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence; July 2014; Québec City, Québec, Canada. pp. 2177–2183. [Google Scholar]
  • 9.Lin Z., Ding G., Hu M., Wang J. Semantics-preserving hashing for cross-view retrieval. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); June 2015; Boston, MA, USA. pp. 3864–3872. [DOI] [Google Scholar]
  • 10.Mandal D., Chaudhury K. N., Biswas S. Generalized semantic preserving hashing for n-label cross-modal retrieval. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); July 2017; Honolulu, HI, USA. pp. 4076–4084. [DOI] [Google Scholar]
  • 11.Krizhevsky A., Sutskever I., Hinton G. E. Imagenet classification with deep convolutional neural networks. Communications of the ACM. 2012;60(6):1097–1105. doi: 10.1145/3065386. [DOI] [Google Scholar]
  • 12.He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); June 2016; Las Vegas, NV, USA. pp. 770–778. [DOI] [Google Scholar]
  • 13.Jiang Q.-Y., Li W.-J. Deep cross-modal hashing. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); July 2017; Honolulu, HI, USA. pp. 3232–3240. [DOI] [Google Scholar]
  • 14.Yang E., Deng C., Liu W., Liu X., Tao D., Gao X. Pairwise relationship guided deep hashing for cross-modal retrieval. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence 2017; February 2017; San Francisco, CA, USA. [Google Scholar]
  • 15.Li C., Deng C., Li N., Liu W., Gao X., Tao D. Self-supervised adversarial hashing networks for cross-modal retrieval. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; June 2018; Salt Lake City, UT, USA. pp. 4242–4251. [DOI] [Google Scholar]
  • 16.Deng C., Chen Z., Liu X., Gao X., Tao D. Triplet-based deep hashing network for cross-modal retrieval. IEEE Transactions on Image Processing. 2018;27(8):3893–3903. doi: 10.1109/tip.2018.2821921. [DOI] [PubMed] [Google Scholar]
  • 17.Zhu J., Chen Z., Zhao L., Wu S. Quadruplet-based deep hashing for image retrieval. Neurocomputing. 2019;366:161–169. doi: 10.1016/j.neucom.2019.07.082. [DOI] [Google Scholar]
  • 18.Chen W., Chen X., Zhang J., Huang K. Beyond triplet loss: a deep quadruplet network for person re-identification. Computer Vision and Pattern Recognition. 2017:403–412. https://arxiv.org/abs/1704.01719. [Google Scholar]
  • 19.Deng J., Dong W., Socher R., et al. ImageNet: a large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; June 2009; Miami, FL, USA. pp. 248–255. [DOI] [Google Scholar]
  • 20.Huiskes M. J., Lew M. S. The mir flickr retrieval evaluation. Proceedings of the 1st ACM international conference on Multimedia information retrieval; October, 2008; New York, NY, USA. pp. 39–43. [DOI] [Google Scholar]
  • 21.Lin T.-Y., Maire M., Belongie S., et al. Microsoft COCO: common objects in context. Proceedings of the European Conference on Computer Vision ECCV 2014; September 2014; Zurich, Switzerland. pp. 740–755. [DOI] [Google Scholar]
  • 22.Wang X., Zou X., Bakker E. M., Wu S. Self-constraining and attention-based hashing network for bit-scalable cross-modal retrieval. Neurocomputing. 2020;400:255–271. doi: 10.1016/j.neucom.2020.03.019. [DOI] [Google Scholar]
  • 23.Zou X., Wang X., Bakker E. M., Wu S. Multi-label semantics preserving based deep cross-modal hashing. Signal Processing Image Communication. 2021;93(9) doi: 10.1016/j.image.2020.116131.116131 [DOI] [Google Scholar]


