Computational Intelligence and Neuroscience. 2021 Jul 2;2021:9968716. doi: 10.1155/2021/9968716

Quadruplet-Based Deep Cross-Modal Hashing

Huan Liu 1, Jiang Xiong 1,, Nian Zhang 2, Fuming Liu 1, Xitao Zou 1
PMCID: PMC8270718  PMID: 34306059

Abstract

Recently, benefiting from the storage and retrieval efficiency of hashing and the powerful discriminative feature extraction capability of deep neural networks, deep cross-modal hashing retrieval has drawn increasing attention. To preserve the semantic similarities of cross-modal instances during the hash mapping procedure, most existing deep cross-modal hashing methods learn deep hashing networks with a pairwise loss or a triplet loss. However, these methods may not fully explore the similarity relations across modalities. To address this problem, in this paper, we introduce a quadruplet loss into deep cross-modal hashing and propose a quadruplet-based deep cross-modal hashing (termed QDCMH) method. Extensive experiments on two benchmark cross-modal retrieval datasets show that our proposed method achieves state-of-the-art performance and demonstrate the effectiveness of the quadruplet loss in cross-modal hashing.

1. Introduction

With the advent of the era of big data, massive amounts of multimedia data, such as images, videos, and texts, are surging onto the Internet. These data usually exist in diversified modalities; for example, a video or an image may be accompanied by a textual description and an audio track. As data from different modalities may have compact semantic relevance, cross-modal retrieval [1, 2] has been proposed to retrieve semantically similar data from one modality given a query from a distinct modality. Benefiting from its high efficiency and low cost, hashing-based cross-modal retrieval (cross-modal hashing) [3–6] has drawn extensive attention. The goal of cross-modal hashing is to map modality-heterogeneous data into a common binary space and to ensure that semantically similar/dissimilar cross-modal data have similar/dissimilar hash codes. Cross-modal hashing methods can usually achieve superior performance; nonetheless, most existing cross-modal hashing methods (such as cross-modal similarity sensitive hashing (CMSSH) [7], semantic correlation maximization (SCM) [8], semantics-preserving hashing (SePH) [9], and generalized semantic preserving hashing (GSPH) [10]) are based on handcrafted feature learning, which cannot effectively capture the heterogeneous relevance between different modalities and thus may result in inferior performance.

In the last decade, deep convolutional neural networks [11, 12] have been successfully utilized in many computer vision tasks, and therefore, some researchers have also deployed them in cross-modal hashing, such as deep cross-modal hashing (DCMH) [13], pairwise relationship guided deep hashing (PRDH) [14], self-supervised adversarial hashing (SSAH) [15], and triplet-based deep hashing (TDH) [16]. Cross-modal hashing methods with deep neural networks integrate hash representation learning and hash function learning into an end-to-end framework, which can capture heterogeneous cross-modal relevance more effectively and thus achieve better cross-modal retrieval performance.

To date, most deep cross-modal hashing methods utilize the pairwise loss (such as [13–15]) or the triplet loss (such as [16]) to preserve semantic relevance during the hash representation learning procedure. Nevertheless, pairwise loss- and triplet loss-based hashing methods suffer from a weak generalization capacity from the training set to the testing set [17, 18], as shown in Figure 1(a). In contrast, the quadruplet loss has been utilized in image hashing retrieval [17] and person reidentification [18], and these works show that quadruplet loss-based models can enhance the generalization capability. Therefore, combining the quadruplet loss with cross-modal hashing is a natural way to enhance cross-modal hashing performance, as shown in Figure 1(b).

Figure 1.


(a) Triplet loss-based cross-modal hashing methods suffer from a weak generalization capacity from the training set to the testing set: the test instances, which belong to the category marked in the figure, cannot be mapped into compact binary codes (see the lower-right corner). (b) Quadruplet loss-based cross-modal hashing methods can project the test instances of the same category into a compact binary space (see the lower-right corner).

To this end, in this paper, we introduce quadruplet loss into cross-modal hashing and propose a quadruplet-based deep cross-modal hashing method (QDCMH). Specifically, QDCMH firstly defines a quadruplet-based cross-modal semantic preserving module. Afterwards, QDCMH integrates this module, hash representation learning, and hash code generation into an end-to-end framework. Finally, experiments on two benchmark cross-modal retrieval datasets are conducted to validate the performance of the proposed method. The main contributions of our proposed QDCMH include the following:

  1. We introduce quadruplet loss into cross-modal retrieval and propose a novel deep cross-modal hashing method. To the best of our knowledge, this is the first work to introduce quadruplet loss into cross-modal hashing retrieval.

  2. We conduct extensive experiments on benchmark cross-modal retrieval datasets to investigate the performance of our proposed QDCMH.

The remainder of this paper is organized as follows. Section 2 elaborates our proposed quadruplet-based deep cross-modal hashing method. Section 3 presents the learning algorithm of QDCMH. Section 4 is the experimental results and the corresponding analysis. Section 5 concludes our work.

2. Proposed Method

In this section, we elaborate our proposed quadruplet-based deep cross-modal hashing (QDCMH) method in the following subsections: notations, the quadruplet-based cross-modal semantic preserving module, feature learning networks, and hash function learning. Figure 2 presents the flowchart of our proposed QDCMH, which incorporates the quadruplet-based cross-modal semantic preserving module, hash representation learning, and hash code generation into an end-to-end framework. In our proposed QDCMH method, we assume that each instance has two modalities, i.e., an image modality and a text modality, but the method can easily be extended to more modalities.

Figure 2.


Flowchart of the proposed quadruplet-based deep cross-modal hashing (QDCMH) method. QDCMH encompasses three components: (1) a quadruplet-based cross-modal semantic preserving module; (2) feature learning networks, where a classical convolutional neural network learns image-modality features and the TxtNet in SSAH [15] learns text-modality features; and (3) an intermodal quadruplet loss that efficiently captures the relevant semantic information during the feature learning process, together with a quantization loss that decreases the information loss during the hash code generation procedure. (a) Quadruplet (Vq, Tp, Tn1, Tn2), which uses an image instance Vq to retrieve three text instances Tp, Tn1, and Tn2. Vq and Tp have at least one common label, while Vq and Tn1, Vq and Tn2, and Tn1 and Tn2 are three instance pairs and the two instances in each pair have no common label. (b) Quadruplet (Tq, Vp, Vn1, Vn2), which uses a text instance Tq to retrieve three image instances Vp, Vn1, and Vn2. Tq and Vp have at least one common label, while Tq and Vn1, Tq and Vn2, and Vn1 and Vn2 are three instance pairs and the two instances in each pair have no common label.

2.1. Notations

Assume that the training data comprise n image-text pairs, i.e., the original image features V ∈ R^{n×dv} and the original text features T ∈ R^{n×dt}. Besides, there is a label vector associated with each image-text pair, and the label vectors of all training instances constitute a label matrix L ∈ R^{n×dl}. Here, dv and dt are the original dimensions of the image features and text features, respectively, and dl is the total number of class categories. If the image-text pair {Vi, Ti} belongs to the jth category, then Lij=1; otherwise, Lij=0. The quadruplet (Vq, Tp, Tn1, Tn2) denotes that Vq is a query instance from the image modality and Tp, Tn1, and Tn2 are three retrieval instances from the text modality, where Vq and Tp have at least one common category, while Vq and Tn1, Vq and Tn2, and Tn1 and Tn2 are three instance pairs and the two instances in each pair have no common label.

With the known quadruplet (Vq, Tp, Tn1, Tn2), the target of our proposed QDCMH is to learn the corresponding hash codes (BVq, BTp, BTn1, BTn2), where BVq, BTp, BTn1, BTn2 are the hash codes of instances Vq, Tp, Tn1, Tn2, respectively. To learn the above hash codes, we first learn the hash representations (FVq, GTp, GTn1, GTn2) from the quadruplet (Vq, Tp, Tn1, Tn2) with deep neural networks, where FVq=f(Vq, θV) and GTp=g(Tp, θT) are the hash representations of instances Vq and Tp, respectively. f(., θV) and g(., θT) are the hash representation learning functions for the image modality and the text modality, respectively. θV and θT are the parameters of the deep neural networks that extract features for the image modality and the text modality, respectively. Secondly, we can utilize the following sign function to approximately map the hash representations into the corresponding hash codes, i.e., BVq=sign(FVq) and BTp=sign(GTp). In the same way, we can learn the hash codes of quadruplet (Tq, Vp, Vn1, Vn2). For convenience, we denote the hash codes of all training image-text pairs, the hash representations of all training image instances, and the hash representations of all training text instances as B ∈ {−1, 1}^{n×k}, F ∈ R^{n×k}, and G ∈ R^{n×k}, respectively, where k is the length of the hash codes:

y = \begin{cases} 1, & \text{if } x \geq 0, \; x \in \mathbb{R}, \\ -1, & \text{if } x < 0, \; x \in \mathbb{R}. \end{cases} \quad (1)
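As a concrete illustration, the following minimal PyTorch sketch (our own, not the authors' released code) binarizes real-valued hash representations with the sign function of equation (1); since torch.sign returns 0 at exactly 0, non-negative entries are mapped to +1 explicitly.

```python
import torch

def binarize(h: torch.Tensor) -> torch.Tensor:
    """Map real-valued hash representations to {-1, +1} codes (equation (1)).

    torch.sign() would return 0 at exactly 0, so non-negative entries are
    explicitly sent to +1 to match the definition above.
    """
    return torch.where(h >= 0, torch.ones_like(h), -torch.ones_like(h))

# Example: a batch of 4 image-modality hash representations with k = 8.
F_Vq = torch.randn(4, 8)   # F_Vq = f(Vq; theta_V), real-valued
B_Vq = binarize(F_Vq)      # corresponding binary hash codes
```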

2.2. Quadruplet-Based Cross-Modal Semantic Preserving Module

In cross-modal hashing retrieval, given an image instance Vi and a text instance Tj, it is intractable to preserve their semantic relevance during the hash code learning procedure because of the huge semantic gap across modalities. To address this, DCMH [13] defines a pairwise loss to map similar/dissimilar image-text pairs into similar/dissimilar hash codes. TDH [16] utilizes a triplet loss to learn similar hash codes for similar cross-modal instances and distinct hash codes for semantically irrelevant cross-modal instances. Both the pairwise loss and the triplet loss can preserve the relevance of the original instance space; however, pairwise loss- and triplet loss-based hashing methods often suffer from a weak generalization capability from the training set to the testing set [17, 18]. To solve this problem, in this section, a quadruplet-based cross-modal semantic preserving module is proposed to boost the generalization capability and better preserve the semantic relevance for cross-modal hashing.

For a quadruplet (Vq, Tp, Tn1, Tn2), we should keep the semantic relevance unchanged during hash representation learning, i.e., FVq should be similar to GTp, FVq should be distinct from GTn1 and GTn2, and GTn1 should be dissimilar to GTn2. Thus, we can define the following quadruplet loss for cross-modal hashing:

J_{quadruplet}^{IT}(F_{Vq}, G_{Tp}, G_{Tn1}, G_{Tn2}) = \sum_{(Vq, Tp, Tn1)} \max\left(0, \|F_{Vq} - G_{Tp}\|_2^2 - \|F_{Vq} - G_{Tn1}\|_2^2 + \alpha_1\right) + \sum_{(Vq, Tp, Tn1, Tn2)} \max\left(0, \|F_{Vq} - G_{Tp}\|_2^2 - \|G_{Tn1} - G_{Tn2}\|_2^2 + \alpha_2\right), \quad (2)

where Vq is a query instance from the image modality; Tp, Tn1, and Tn2 are three retrieval instances from the text modality; Vq and Tp are semantically similar; and Vq and Tn1, Vq and Tn2, and Tn1 and Tn2 are three instance pairs in which the two instances have distinct semantics. Equation (2) requires the distance between the hash representations of a similar cross-modal pair to be smaller, by a positive margin (α1 or α2), than the distance between the representations of a dissimilar pair (both intermodal and intramodal). This ensures that similar cross-modal instances have similar hash representations while dissimilar instances have distinct hash representations. With this quadruplet loss, the cross-modal semantic relevance can be preserved during the hash representation learning stage.

Similarly, given a quadruplet (Tq, Vp, Vn1, Vn2), we can have the following cross-modal quadruplet loss:

J_{quadruplet}^{TI}(G_{Tq}, F_{Vp}, F_{Vn1}, F_{Vn2}) = \sum_{(Tq, Vp, Vn1)} \max\left(0, \|G_{Tq} - F_{Vp}\|_2^2 - \|G_{Tq} - F_{Vn1}\|_2^2 + \alpha_3\right) + \sum_{(Tq, Vp, Vn1, Vn2)} \max\left(0, \|G_{Tq} - F_{Vp}\|_2^2 - \|F_{Vn1} - F_{Vn2}\|_2^2 + \alpha_4\right), \quad (3)

where Tq is a query instance from the text modality, Vp, Vn1, and Vn2 are three retrieval instances from the image modality, GTq, FVp, FVn1, and FVn2 are the hash representations of instances Tq, Vp, Vn1, and Vn2, respectively, and α3 and α4 are two positive margins. Equation (3) differs from equation (2) only in that the query and retrieval modalities are swapped.
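To make the two losses concrete, the following PyTorch sketch implements the margin terms of equations (2) and (3) for a batch of quadruplets; the function name, the batching, and the default margin values are our own assumptions rather than details taken from the paper.

```python
import torch

def quadruplet_loss(h_q, h_p, h_n1, h_n2, alpha1=1.0, alpha2=0.5):
    """Cross-modal quadruplet loss of equations (2)/(3) for one batch.

    h_q:        (batch, k) representations of the query instances
    h_p:        (batch, k) representations of the positive retrieval instances
    h_n1, h_n2: (batch, k) representations of the two mutually dissimilar
                negative retrieval instances
    alpha1, alpha2: positive margins (placeholder values, not from the paper)
    """
    d_pos = (h_q - h_p).pow(2).sum(dim=1)     # ||query - positive||_2^2
    d_neg = (h_q - h_n1).pow(2).sum(dim=1)    # ||query - negative1||_2^2
    d_nn  = (h_n1 - h_n2).pow(2).sum(dim=1)   # ||negative1 - negative2||_2^2
    term1 = torch.clamp(d_pos - d_neg + alpha1, min=0.0)   # first sum in (2)/(3)
    term2 = torch.clamp(d_pos - d_nn + alpha2, min=0.0)    # second sum in (2)/(3)
    return (term1 + term2).sum()
```

Calling quadruplet_loss(F_Vq, G_Tp, G_Tn1, G_Tn2, α1, α2) corresponds to the image-to-text loss of equation (2), while quadruplet_loss(G_Tq, F_Vp, F_Vn1, F_Vn2, α3, α4) corresponds to the text-to-image loss of equation (3).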

2.3. Hash Representation Learning and Hash Code Learning

For each quadruplet from the training set, we can learn its hash representations while fully preserving the semantic similarity with the above quadruplet-based cross-modal semantic relevance preserving module, so we have the following hash representation learning loss:

J_{representation} = \frac{1}{n_{IT}} \sum_{IT} J_{quadruplet}^{IT}(F_{Vq}, G_{Tp}, G_{Tn1}, G_{Tn2}) + \frac{\beta}{n_{TI}} \sum_{TI} J_{quadruplet}^{TI}(G_{Tq}, F_{Vp}, F_{Vn1}, F_{Vn2}), \quad (4)

where nIT is the number of quadruplets that use an image to retrieve texts, nTI is the number of quadruplets that use a text to retrieve images, and β is a hyperparameter to balance the two terms.

Additionally, to learn high-quality hash codes, we generate hash codes from the learned hash representations with the sign function in equation (1), and the final hash codes matrix for all training image-text pairs are generated as follows:

B = \operatorname{sign}\left(\frac{F + G}{2}\right). \quad (5)

As F and G are real-valued features, to decrease the information loss from F and G to B in equation (5), it is necessary to force F and G to be as close as possible to B; thus, we introduce the following quantization loss:

J_{quantization} = \frac{\|B - F\|_2^2 + \|B - G\|_2^2}{2nk}. \quad (6)

Integrating the hash representation loss and the quantization loss together, the whole loss function is as follows:

J = J_{representation} + \gamma J_{quantization}, \quad (7)

where γ is a hyperparameter to balance the hash representation loss and the quantization loss.
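A minimal sketch of the overall objective is given below, reusing the quadruplet_loss helper above; the quadruplet sampling interface and the per-quadruplet loop are simplifying assumptions of ours, not the authors' implementation.

```python
import torch

def total_loss(F, G, B, quad_it, quad_ti, beta=1.0, gamma=0.2):
    """Overall objective J of equation (7), as a sketch.

    F, G: (n, k) real-valued image/text hash representations
    B:    (n, k) binary hash codes in {-1, +1}
    quad_it, quad_ti: lists of index tuples (q, p, n1, n2) for the
        image-to-text and text-to-image directions (sampling assumed)
    """
    n, k = F.shape
    # Equation (4): averaged quadruplet losses of both retrieval directions.
    j_it = sum(quadruplet_loss(F[q:q + 1], G[p:p + 1], G[n1:n1 + 1], G[n2:n2 + 1])
               for q, p, n1, n2 in quad_it) / max(len(quad_it), 1)
    j_ti = sum(quadruplet_loss(G[q:q + 1], F[p:p + 1], F[n1:n1 + 1], F[n2:n2 + 1])
               for q, p, n1, n2 in quad_ti) / max(len(quad_ti), 1)
    j_rep = j_it + beta * j_ti
    # Equation (6): quantization loss between relaxed and binary codes.
    j_quant = ((B - F).pow(2).sum() + (B - G).pow(2).sum()) / (2 * n * k)
    # Equation (7): weighted combination.
    return j_rep + gamma * j_quant
```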

2.4. Feature Extraction Networks

In QDCMH, feature extraction involves two deep neural networks: a classical convolutional neural network is used to extract the features of images, and a multiscale fusion model is utilized to learn features from texts. Specifically, for the image modality, we deploy AlexNet [11] pretrained on the ImageNet [19] dataset and replace its last fully connected layer with a new hash layer consisting of k hidden nodes, which is then fine-tuned. In this way, the learned deep features are embedded into a k-dimensional Hamming space. For the text modality, the TxtNet in SSAH [15] is used, which comprises a three-layer feedforward neural network and a multiscale (MS) fusion model (Input⟶MS⟶4096⟶512⟶k).
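The sketch below outlines the two feature extraction networks under stated assumptions: the image network replaces AlexNet's 1000-way classifier with a k-node hash layer, and a tanh activation is added as a common relaxation (the paper does not specify the output activation); the text network here keeps only the 4096⟶512⟶k head and omits the multiscale fusion block of SSAH's TxtNet.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImgNet(nn.Module):
    """AlexNet pretrained on ImageNet with a k-dimensional hash layer."""
    def __init__(self, k: int):
        super().__init__()
        alexnet = models.alexnet(pretrained=True)
        self.features = alexnet.features
        self.avgpool = alexnet.avgpool
        # Replace the final 1000-way layer with a k-node hash layer.
        head = list(alexnet.classifier.children())[:-1]
        head.append(nn.Linear(4096, k))
        self.classifier = nn.Sequential(*head)

    def forward(self, x):
        x = torch.flatten(self.avgpool(self.features(x)), 1)
        return torch.tanh(self.classifier(x))    # relaxed codes in (-1, 1)

class TxtNet(nn.Module):
    """Simplified text network: the multiscale (MS) fusion block of SSAH's
    TxtNet is omitted here; only the 4096 -> 512 -> k head is sketched."""
    def __init__(self, tag_dim: int, k: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(tag_dim, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 512), nn.ReLU(inplace=True),
            nn.Linear(512, k),
        )

    def forward(self, t):
        return torch.tanh(self.net(t))            # relaxed codes in (-1, 1)
```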

3. Learning Algorithm of QDCMH

For QDCMH, we utilize an alternating strategy to learn the parameters θV of the deep neural network for the image modality, the parameters θT of the deep neural network for the text modality, and the hash code matrix B for all training image-text pairs. When we learn one of θV, θT, and B, we keep the other two fixed. The specific procedure is depicted in Algorithm 1.

Algorithm 1.


QDCMH: quadruplet-based deep cross-modal hashing.

3.1. Update θV with θT and B Fixed

When θT and B are maintained fixed, we utilize stochastic gradient descent and backpropagation to optimize the deep neural network parameters θV.

3.2. Update θT with θV and B Fixed

When we fix the values of θV and B, we use stochastic gradient descent and backpropagation to learn the deep neural network parameters θT.

3.3. Update B with θT and θV Fixed

When the deep neural networks' parameters θT and θV are kept unchanged, the hash code matrix B can be updated directly with equation (5).
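Putting the three update steps together, a compact sketch of the alternating optimization (reusing the binarize and total_loss helpers from Section 2) might look as follows; the full-batch forward passes and the fixed quadruplet sampler are simplifications of ours, and mini-batching, the learning rate schedule, and other details of Algorithm 1 are omitted.

```python
import torch

def train_qdcmh(img_net, txt_net, images, texts, sample_quadruplets,
                epochs=500, beta=1.0, gamma=0.2, lr=10 ** -1.5):
    """Alternating optimization sketch of Algorithm 1 (simplified).

    sample_quadruplets: assumed callable returning (quad_it, quad_ti),
    i.e., index quadruplets for both retrieval directions.
    """
    opt_v = torch.optim.SGD(img_net.parameters(), lr=lr)
    opt_t = torch.optim.SGD(txt_net.parameters(), lr=lr)
    F = img_net(images).detach()
    G = txt_net(texts).detach()
    B = binarize((F + G) / 2)                    # equation (5)

    for epoch in range(epochs):
        quad_it, quad_ti = sample_quadruplets()

        # Step 1: update theta_V with theta_T and B fixed.
        F = img_net(images)
        loss_v = total_loss(F, G, B, quad_it, quad_ti, beta, gamma)
        opt_v.zero_grad(); loss_v.backward(); opt_v.step()
        F = img_net(images).detach()             # refresh after the update

        # Step 2: update theta_T with theta_V and B fixed.
        G = txt_net(texts)
        loss_t = total_loss(F, G, B, quad_it, quad_ti, beta, gamma)
        opt_t.zero_grad(); loss_t.backward(); opt_t.step()
        G = txt_net(texts).detach()              # refresh after the update

        # Step 3: update B with theta_V and theta_T fixed (equation (5)).
        B = binarize((F + G) / 2)
    return B
```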

4. Experiments

4.1. Datasets

To investigate the performance of QDCMH, we conduct experiments on two benchmark cross-modal retrieval datasets: MIRFLICKR-25K [20] and Microsoft COCO2014 [21], and the brief descriptions of the datasets are listed in Table 1.

Table 1.

Brief description of the experimental datasets.

Dataset | Total used | Training | Query | Retrieval | Tag dimension | Labels
MIRFLICKR-25K | 20,015 | 10,000 | 2,000 | 18,015 | 1,386 | 24
MS-COCO2014 | 122,218 | 10,000 | 5,000 | 117,218 | 2,026 | 80

4.2. Evaluation Metrics

In our experiments, we utilize mean average precision (MAP), top N-precision curves (top N curves), and precision-recall curves (PR curves) as evaluation metrics; detailed descriptions of these metrics can be found in [22, 23].
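For reference, a minimal NumPy sketch of MAP under Hamming ranking is given below (our own simplification; the exact protocol, e.g., the number of top-ranked items considered, follows [22, 23]).

```python
import numpy as np

def mean_average_precision(query_codes, retrieval_codes,
                           query_labels, retrieval_labels):
    """MAP over Hamming ranking: codes in {-1, +1}, labels as multi-hot rows.

    A retrieved item counts as relevant if it shares at least one label
    with the query.
    """
    k = query_codes.shape[1]
    aps = []
    for q_code, q_label in zip(query_codes, query_labels):
        hamming = 0.5 * (k - retrieval_codes @ q_code)   # Hamming distances
        order = np.argsort(hamming)                      # ascending distance
        relevant = (retrieval_labels[order] @ q_label) > 0
        if relevant.sum() == 0:
            continue
        ranks = np.arange(1, len(relevant) + 1)
        precision = np.cumsum(relevant) / ranks          # precision@rank
        aps.append((precision * relevant).sum() / relevant.sum())
    return float(np.mean(aps))
```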

4.3. Baselines and Implementation Details

We compare our proposed QDCMH method with eight state-of-the-art cross-modal hashing methods, including four handcrafted ones, i.e., the cross-modal similarity sensitive hashing (CMSSH) method [7], the semantics-preserving hashing (SePH) method [9], the semantic correlation maximization (SCM) method [8], and the generalized semantic preserving hashing (GSPH) method [10], and four deep feature-based ones, i.e., the deep cross-modal hashing (DCMH) method [13], the pairwise relationship guided deep hashing (PRDH) method [14], the self-supervised adversarial hashing (SSAH) method [15], and the triplet-based deep hashing (TDH) method [16]. Most baseline methods are carefully implemented based on the codes provided by the authors; the remaining ones are implemented by us following the suggestions and descriptions of the original papers.

All the experiments are executed using the open-source deep learning framework PyTorch and run on an NVIDIA GTX Titan XP GPU server. In our experiments, we set nIT=nTI=10000, max_epoch=500, and λ=10^{−5}, and the learning rate is initialized to 10^{−1.5} and gradually decreased to 10^{−6} over the 500 epochs. For the handcrafted feature-based baselines, each image in the two datasets is represented by a 512-dimensional bag-of-words (BoW) histogram feature vector. Throughout the experiments, we use I⟶T to denote using an image query to retrieve texts and T⟶I to denote using a text query to retrieve images.
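The exact learning rate decay schedule is not specified in the text; one plausible reading is a log-spaced schedule from 10^{−1.5} down to 10^{−6} over the 500 epochs, e.g.:

```python
import numpy as np

# Log-spaced learning rates: 10^(-1.5) at epoch 0 down to 10^(-6) at epoch 499.
learning_rates = np.logspace(-1.5, -6, num=500)

# Applied per epoch to an existing optimizer (hypothetical usage):
# for group in optimizer.param_groups:
#     group["lr"] = learning_rates[epoch]
```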

4.4. Performance Evaluation and Discussion

Firstly, we investigate the performance of QDCMH with different values of the hyperparameters β and γ. To this end, we experiment on MIRFLICKR-25K with hash code length k=64 and record the corresponding MAPs under different values of β and γ, as shown in Figure 3. We find that high performance can be achieved when β=1 and γ=0.2.

Figure 3.


A sensitivity analysis of the hyperparameters. (a) Hyperparameter β on MIRFLICKR-25K dataset. (b) Hyperparameter γ on MIRFLICKR-25K dataset.

Secondly, to validate the performance of QDCMH, we compare QDCMH with the baseline methods in terms of MAP on the MIRFLICKR-25K and MS-COCO2014 datasets. Table 2 presents the MAPs of each method for different hash code lengths, i.e., 16, 32, and 64 bits. DSePH represents the SePH method whose features of the original images are extracted by CNN-F. From Table 2, we can see the following. (1) The MAPs of our proposed QDCMH are higher than those of most baseline methods in most cases, which demonstrates the superiority of QDCMH. We also observe that SSAH outperforms our proposed QDCMH in most cases, which is partly because SSAH takes self-supervised learning and generative adversarial networks into account during the hash representation learning procedure. (2) The MAPs of QDCMH are always higher than those of TDH, which shows that the quadruplet loss can better preserve semantic relevance than the triplet loss in cross-modal hashing retrieval. (3) The MAPs of DSePH are always higher than those of SePH, which demonstrates the powerful feature learning capacity of deep neural networks. (4) Compared with the baselines, our proposed QDCMH performs relatively better on the MS-COCO2014 dataset than on the MIRFLICKR-25K dataset, which is partly because the instances in MS-COCO2014 belong to 80 categories while those in MIRFLICKR-25K belong to 24 categories, so the quadruplets generated from MS-COCO2014 have better generalization ability than those generated from MIRFLICKR-25K.

Table 2.

Comparison with the baselines in terms of MAP on two datasets: MIRFLICKR-25K and Microsoft COCO2014. The best accuracy is shown in boldface.

Task Methods MIRFlickr-25K (16 bits / 32 bits / 64 bits) MS-COCO (16 bits / 32 bits / 64 bits)
I⟶T Handcrafted methods CMSSH [7] 0.5600 0.5709 0.5836 0.5439 0.5450 0.5410
SePH [9] 0.6740 0.6813 0.6803 0.4295 0.4353 0.4726
SCM [8] 0.6354 0.6407 0.6556 0.4252 0.4344 0.4574
GSPH [10] 0.6068 0.6191 0.6230 0.4427 0.4733 0.4840
Deep methods DCMH [13] 0.7316 0.7343 0.7446 0.5228 0.5438 0.5419
PRDH [14] 0.6952 0.7072 0.7108 0.5238 0.5521 0.5572
SSAH [15] 0.7745 0.7882 0.7990 0.5127 0.5256 0.5067
TDH [16] 0.7423 0.7478 0.7512 0.5164 0.5222 0.5276
DSePH [9] 0.7128 0.7285 0.7422 0.4621 0.4958 0.5112
QDCMH 0.7635 0.7688 0.7713 0.5286 0.5313 0.5371

T⟶I Handcrafted methods CMSSH [7] 0.5726 0.5776 0.5753 0.3793 0.3876 0.3899
SePH [9] 0.7139 0.7258 0.7294 0.4348 0.4606 0.5195
SCM [8] 0.6340 0.6458 0.6541 0.4118 0.4183 0.4345
GSPH [10] 0.6282 0.6458 0.6503 0.5435 0.6039 0.6461
Deep methods DCMH [13] 0.7607 0.7737 0.7805 0.4883 0.4942 0.5145
PRDH [14] 0.7626 0.7718 0.7755 0.5122 0.5190 0.5404
SSAH [15] 0.7860 0.7974 0.7910 0.4832 0.4831 0.4922
TDH [16] 0.7516 0.7577 0.7634 0.5198 0.5332 0.5399
DSePH [9] 0.7422 0.7578 0.7760 0.4616 0.4882 0.5305
QDCMH 0.7762 0.7725 0.7859 0.5245 0.5398 0.5487

Thirdly, to further investigate the performance of QDCMH, we plot the precision-recall curves and top N-precision curves of QDCMH and the baseline methods with hash code length 64 on the MIRFLICKR-25K and Microsoft COCO2014 datasets, as presented in Figures 4 and 5. From these figures, we can see that the precision-recall curves and top N-precision curves are largely consistent with the MAPs in Table 2.

Figure 4.


Precision-recall curves on datasets MIRFLICKR-25K and Microsoft COCO2014.

Figure 5.


Top N-precision curves on datasets MIRFLICKR-25K and Microsoft COCO2014.

5. Conclusions

In this paper, we introduce a quadruplet loss into deep cross-modal hashing to fully preserve the semantic relevance of the original cross-modal quadruplet instances and propose a quadruplet-based deep cross-modal hashing (QDCMH) method. QDCMH integrates the quadruplet-based cross-modal semantic relevance preserving module, hash representation learning, and hash code generation into an end-to-end framework. Experiments on two benchmark cross-modal retrieval datasets demonstrate the effectiveness of our proposed QDCMH.

Data Availability

The experimental datasets and the related settings can be found at https://github.com/SWU-CS-MediaLab/MLSPH. The experimental codes used to support the findings of this study will be deposited in the GitHub repository after the publication of this paper or can be obtained from xitaozou@sanxiau.edu.cn.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  • 1.Peng Y., Huang X., Zhao Y. An overview of cross-media retrieval: concepts, methodologies, benchmarks, and challenges. IEEE Transactions on Circuits and Systems for Video Technology. 2017;28(9):2372–2385. doi: 10.1109/TCSVT.2017.2705068. [DOI] [Google Scholar]
  • 2.Wang K., Yin Q., Wang W., Wu S., Wang L. A comprehensive survey on cross-modal retrieval. Multimedia. 2016 https://arxiv.org/abs/1607.06215. [Google Scholar]
  • 3.Deng C., Yang E., Liu T., Tao D. Two-stream deep hashing with class-specific centers for supervised image search. IEEE Transactions on Neural Networks and Learning Systems. 2019;31(6):2189–2201. doi: 10.1109/TNNLS.2019.2929068. [DOI] [PubMed] [Google Scholar]
  • 4.Deng C., Yang E., Liu T., Liu W., Tao D. Unsupervised semantic-preserving adversarial hashing for image search. IEEE Transactions on Image Processing. 2019;28 doi: 10.1109/tip.2019.2903661. [DOI] [PubMed] [Google Scholar]
  • 5.Yang E., Deng C., Li C., Liu W., Li J., Tao D. Shared predictive cross-modal deep quantization. IEEE Transactions on Neural Networks and Learning Systems. 2018;29(11):5292–5303. doi: 10.1109/tnnls.2018.2793863. [DOI] [PubMed] [Google Scholar]
  • 6.Yang E., Liu T., Deng C., Tao D. Adversarial examples for hamming space search. IEEE Transactions on Cybernetics. 2018;50(4):1473–1484. doi: 10.1109/TCYB.2018.2882908. [DOI] [PubMed] [Google Scholar]
  • 7.Bronstein M. M., Bronstein A. M., Michel F., Paragios N. Data fusion through cross-modality metric learning using similarity-sensitive hashing. Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition; June 2010; San Francisco, CA, USA. pp. 3594–3601. [DOI] [Google Scholar]
  • 8.Zhang D., Li W.-J. Large-scale supervised multimodal hashing with semantic correlation maximization. Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence; July 2014; Québec City, Québec, Canada. pp. 2177–2183. [Google Scholar]
  • 9.Lin Z., Ding G., Hu M., Wang J. Semantics-preserving hashing for cross-view retrieval. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); June 2015; Boston, MA, USA. pp. 3864–3872. [DOI] [Google Scholar]
  • 10.Mandal D., Chaudhury K. N., Biswas S. Generalized semantic preserving hashing for n-label cross-modal retrieval. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); July 2017; Honolulu, HI, USA. pp. 4076–4084. [DOI] [Google Scholar]
  • 11.Krizhevsky A., Sutskever I., Hinton G. E. Imagenet classification with deep convolutional neural networks. Communications of the ACM. 2012;60(6):1097–1105. doi: 10.1145/3065386. [DOI] [Google Scholar]
  • 12.He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); June 2016; Las Vegas, NV, USA. pp. 770–778. [DOI] [Google Scholar]
  • 13.Jiang Q.-Y., Li W.-J. Deep cross-modal hashing. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); July 2017; Honolulu, HI, USA. pp. 3232–3240. [DOI] [Google Scholar]
  • 14.Yang E., Deng C., Liu W., Liu X., Tao D., Gao X. Pairwise relationship guided deep hashing for cross-modal retrieval. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence 2017; February 2017; San Francisco, CA, USA. [Google Scholar]
  • 15.Li C., Deng C., Li N., Liu W., Gao X., Tao D. Self-supervised adversarial hashing networks for cross-modal retrieval. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; June 2018; Salt Lake City, UT, USA. pp. 4242–4251. [DOI] [Google Scholar]
  • 16.Deng C., Chen Z., Liu X., Gao X., Tao D. Triplet-based deep hashing network for cross-modal retrieval. IEEE Transactions on Image Processing. 2018;27(8):3893–3903. doi: 10.1109/tip.2018.2821921. [DOI] [PubMed] [Google Scholar]
  • 17.Zhu J., Chen Z., Zhao L., Wu S. Quadruplet-based deep hashing for image retrieval. Neurocomputing. 2019;366:161–169. doi: 10.1016/j.neucom.2019.07.082. [DOI] [Google Scholar]
  • 18.Chen W., Chen X., Zhang J., Huang K. Beyond triplet loss: a deep quadruplet network for person re-identification. Computer Vision and Pattern Recognition. 2017:403–412. https://arxiv.org/abs/1704.01719. [Google Scholar]
  • 19.Deng J., Dong W., Socher R., et al. ImageNet: a large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; June 2009; Miami, FL, USA. pp. 248–255. [DOI] [Google Scholar]
  • 20.Huiskes M. J., Lew M. S. The mir flickr retrieval evaluation. Proceedings of the 1st ACM international conference on Multimedia information retrieval; October, 2008; New York, NY, USA. pp. 39–43. [DOI] [Google Scholar]
  • 21.Lin T.-Y., Maire M., Belongie S., et al. Microsoft COCO: common objects in context. Proceedings of the European Conference on Computer Vision ECCV 2014; September 2014; Zurich, Switzerland. pp. 740–755. [DOI] [Google Scholar]
  • 22.Wang X., Zou X., Bakker E. M., Wu S. Self-constraining and attention-based hashing network for bit-scalable cross-modal retrieval. Neurocomputing. 2020;400:255–271. doi: 10.1016/j.neucom.2020.03.019. [DOI] [Google Scholar]
  • 23.Zou X., Wang X., Bakker E. M., Wu S. Multi-label semantics preserving based deep cross-modal hashing. Signal Processing Image Communication. 2021;93(9) doi: 10.1016/j.image.2020.116131.116131 [DOI] [Google Scholar]


