Threshold-based exploitation of noisy label in black-box unsupervised domain adaptation

Huiwen Xu; Jaeri Lee; U Kang

doi:10.1371/journal.pone.0321987

2025 May 12;20(5):e0321987. doi: 10.1371/journal.pone.0321987

Editor: Lei Chu²

PMCID: PMC12068613 PMID: 40354435

Abstract

How can we perform unsupervised domain adaptation when transferring a black-box source model to a target domain? Black-box Unsupervised Domain Adaptation focuses on transferring the labels derived from a pre-trained black-box source model to an unlabeled target domain. The problem setting is motivated by privacy concerns associated with accessing and utilizing source data or source model parameters. Recent studies typically train the target model by mimicking the labels derived from the black-box source model, which often contain noise due to domain gaps between the source and the target. Directly exploiting such noisy labels or disregarding them may lead to a decrease in the model’s performance. We propose Threshold-Based Exploitation of Noisy Predictions (TEN), a method to accurately learn the target model with noisy labels in Black-box Unsupervised Domain Adaptation. To ensure the preservation of information from the black-box source model, we employ a threshold-based approach to distinguish between clean labels and noisy labels, thereby allowing the transfer of high-confidence knowledge from both labels. We utilize a flexible thresholding approach to adjust the threshold for each class, thereby obtaining an adequate amount of clean data for hard-to-learn classes. We also exploit knowledge distillation for clean data and negative learning for noisy labels to extract high-confidence information. Extensive experiments show that TEN outperforms baselines with an accuracy improvement of up to 9.49%.

Introduction

How can we transfer the knowledge from a black-box source model to a target task? Unsupervised domain adaptation (UDA) has emerged as a crucial research topic in the field of machine learning and computer vision. The goal of UDA is to adapt a model trained on a source domain with labeled data to a target domain with only unlabeled data, where the target domain has similar but different statistical characteristics to the source domain. The ability to perform UDA is essential in many real-world applications, such as image classification, object recognition, and natural language processing, where the target domain may not have labeled data for training.

Unsupervised domain adaptation [1, 2] has been shown to have limitations, one of which involves the necessity to access either the source data or a white-box source model. Nonetheless, sharing the source data might not be suitable due to privacy concerns, particularly in sensitive domains such as medical records or financial data. Additionally, even transferring a pre-trained white-box source model to a target domain raises security concerns, as the source data could potentially be reconstructed using techniques like generative adversarial training [3].

Recent studies focus on a new problem setting known as Black-box Unsupervised Domain Adaptation (Black-box UDA), where the source domain provides only a black-box model without revealing its model parameters. In this scenario, the knowledge that can be transferred to the target domain is limited to the outputs produced by the black-box source model. However, the outputs contain noise due to the intrinsic dissimilarities between the source and target domains, which makes the domain adaptation process more challenging. DINE [4] adopts knowledge distillation and pseudo-labeling strategies, instructing the target model to distill the labels produced by the black-box source model. IterLNL [5] utilizes a noisy labeling technique to select clean instances from the target data, thereby training the model solely on clean data. Both algorithms may result in decreased performance, as DINE is susceptible to learning from mislabeled data, while IterLNL experiences information loss owing to its exclusion of noisy data during training.

In this paper, we propose Threshold-Based Exploitation of Noisy Predictions (TEN), a precise method for Black-box UDA that distills reliable high-confidence knowledge from the source labels. Owing to the presence of noise in the outputs from the source model, we partition the target data into distinct clean and noisy subsets, and apply distinct strategies to distill the high-confidence knowledge. Pseudo-labels associated with the clean subset align closely with the actual ground truths, whereas those linked to the noisy subset frequently deviate from them. In the process of data partitioning, a flexible threshold is determined for each class to ensure that hard-to-learn classes possess an adequate number of clean instances. We harness knowledge distillation on the clean subset to emulate the source model’s labels. Conversely, on the noisy subset, we exploit negative learning to discern which classes the instances do not pertain to. In addition, we employ consistency regularization coupled with entropy regularization techniques to learn the structural features of the target domain. Extensive experiments show that TEN surpasses baseline methods, with an accuracy increase of up to 9.49%.

Our contributions are summarized as follows:

Algorithm. We propose TEN, a precise method for distilling reliable high-confidence knowledge from the outputs of a black-box source model, even in the presence of noise.
Accuracy. Extensive experiments conducted on real-world datasets demonstrate that TEN outperforms baselines with up to 9.49% higher accuracy for single-source UDA, and 4.81% higher accuracy for multi-source UDA.
Ablation Study. We show that the performance of TEN exhibits an upward trend when more noisy labels are used for training.

Table 1 provides the definitions of symbols used in this paper.

Table 1. Table of symbols.

Symbol	Terminology	Description
f_s	source model	black-box neural network where only network labels are available
f_t	target model	neural network that classifies target inputs
D_s	source data	labeled source data
D_t	target data	unlabeled target data
$x_{s}^{i}$	source input	feature of i-th instance in source domain
$y_{s}^{i}$	source label	label of i-th instance in source domain
$x_{t}^{i}$	target input	feature of i-th instance in target domain
${\tilde{y}}_{t}^{i}$	target pseudo label	target label of i-th instance from source model
${\hat{y}}_{t}^{i}$	target label	target label of i-th instance from target model
${\tilde{y}}_{t, c}^{i}$	target pseudo probability	target probability of i-th instance belonging to the c-th class obtained from source model
${\hat{y}}_{t, c}^{i}$	target probability	target probability of i-th instance belonging to the c-th class obtained from target model
${\hat{y}}_{t, w}^{i}$	target weak label	target label of i-th weakly augmented instance from target model
${\hat{y}}_{t, w, c}^{i}$	target weak probability	target probability of i-th weakly augmented instance belonging to the c-th class obtained from target model
${\hat{y}}_{t, s, c}^{i}$	target strong probability	target probability of i-th strongly augmented instance belonging to the c-th class obtained from target model
n_s	source data size	number of source instances
n_t	target data size	number of target instances
$𝒳_{s}$	source input space	feature space in source domain
$𝒴_{s}$	source label space	label space in source domain
$𝒳_{t}$	target input space	feature space in target domain
$𝒴_{t}$	target label space	label space in target domain
c	class	class for source and target domains
$τ_{p}, τ_{n}$	predefined thresholds	predefined positive and negative thresholds
$T_{p} (\cdot)$	flexible threshold	adjusted positive threshold for clean data
$γ$	regularization threshold	constant threshold for consistency regularization
$λ_{1}, λ_{2}, λ_{3}$	balancing parameters	weights for losses
L_kd	knowledge distillation loss	knowledge distillation loss for clean subset
L_nce	negative learning loss	negative loss for noisy subset
L_er	entropy regularization loss	entropy regularization loss for all target data
L_cr	consistency regularization loss	consistency regularization loss for all target data
L_total	overall loss	total loss for target data
L_s	source loss	source loss for smoothed label vectors
q_s	source prediction	smoothed prediction in source domain
$ϵ$	smoothing parameter	parameter for label smoothing
C	number of classes	number of classes in source and target domains

Related works

Unsupervised domain adaptation

Unsupervised model adaptation, also known as source-free UDA, has garnered increasing attention due to its ability to operate without accessing the source domain, making it suitable for more practical scenarios. Early work [6] provides a theoretical analysis of transfer learning, which motivated deep domain adaptation without source data. Zhu et al. [7] enhance domain adaptation by leveraging high-order graphs and low-rank tensors. Zhu et al. [8] propose a multiview latent space framework for UDA and MultiDA with selective pseudo-labeling. However, these methods address only the standard UDA problem and therefore do not fundamentally resolve the privacy concerns.

In this paper, we tackle an even more challenging problem by leveraging only the predictions from a black-box model trained in the source domain for model adaptation. Few works have been conducted in this field. Zhang et al. [5] focus on selecting clean instances from noisy data, training the model only on these clean samples. However, this approach has a risk of information loss by excluding noisy data, potentially limiting the model’s generalization ability. Liang et al. [4] employ a knowledge distillation and pseudo-labeling strategy, where the target model distills labels from a black-box source model. Zhang et al. [9] use a bi-directional memorization mechanism to identify useful features and progressively correct noisy pseudo labels, improving generalization across visual recognition tasks. However, it is prone to learning from mislabeled data, which can reduce the model’s performance. In contrast, our TEN approach divides the target dataset into “clean” and “noisy” subsets, and distills high-confidence predictions from both, allowing the model to leverage information from both clean and noisy data. This strategy mitigates the risks of information loss and mislabeled data, offering a more robust learning process.

Semi-supervised learning with noisy labels

Noise can be easily accumulated during training when incorrect predictions are used in semi-supervised or unsupervised learning [10]. Such noise can cause the model to overfit to the noisy feature space, making it challenging to adapt to new domains [11]. In UDA, pseudo-labeling [12,13] and knowledge distillation [14,15] are effective techniques, but their performances can be degraded by noise. In particular, for transfer learning tasks involving distant domains, the pseudo labels for the target domain can be extremely noisy, resulting in a deterioration of subsequent training. Our proposed method in this work addresses the issue using (1) a flexible threshold technique which distills more instances for hard-to-learn classes, and (2) pseudo-labeling with negative learning which distills the information from noise.

Proposed method

Given a black-box source model f_s and unlabeled target data D_t, our objective is to train a target model f_t that performs well on the target data without accessing any source data or source model parameters. The target data $D_{t} = \{x_{t}^{i}\}_{i = 1}^{n_{t}}$ consists of n_t instances distributed across C categories, where $x_{t}^{i} \in 𝒳_{t}$ and $𝒳_{t}$ represents the target input space. The source model is pre-trained using labeled source data $D_{s} = \{(x_{s}^{i}, y_{s}^{i})\}_{i = 1}^{n_{s}}$, which contains n_s labeled instances also in C categories, where $x_{s}^{i} \in 𝒳_{s}$ and $y_{s}^{i} \in 𝒴_{s}$. $𝒳_{s}$ and $𝒴_{s}$ represent the source input and label spaces, respectively. We assume that the source label space $𝒴_{s}$ and the target label space $𝒴_{t}$ are identical, but the source and target input data have different distributions, i.e., $P (𝒳_{s}) \neq P (𝒳_{t})$. Unlike standard Unsupervised Domain Adaptation, which requires access to either the source data or the source model parameters, Black-box Unsupervised Domain Adaptation (Black-box UDA) trains the target model without access to either. Black-box UDA depends solely on the soft labels generated by the source model for target instances, denoted as ${\tilde{y}}_{t}^{i} = f_{s} (x_{t}^{i})$.

Overview

The challenge of Black-box UDA resides in distilling the knowledge from the outputs of the black-box source model. Due to the dissimilarities between the source and target domains, the outputs generated by the black-box source model contain noise. Such noise can yield erroneous results, further exacerbating the challenge of accurately labeling the target data. Consequently, it is imperative to train the target model effectively by utilizing soft labels even in the presence of such noise.

The following detailed challenges need to be addressed for the goal.

C1
How can we effectively divide the target data into clean and noisy subsets? Utilizing a fixed high threshold for data separation may lead to extreme cases where no training data are selected for hard-to-learn classes.
C2
How can we distill meaningful information from noisy labels? When the gap between the source and target domains is substantial, the amount of noisy labels increases, and a failure to effectively learn from them can significantly impede the target model’s performance.
C3
How can we learn the structural information about the target data? Insufficient exploration of hidden representations leads to diminished performance of the target model owing to the disregard of the target domain’s structure.

We address the aforementioned challenges with the following main ideas:

I1
Flexible Threshold. We design a flexible threshold for each class, thereby facilitating the allocation of a larger amount of data to those classes that are difficult to learn.
I2
Negative Learning. We distill the information that reflects the absence of certain classes from the noisy labels.
I3
Structural Regularization. We exploit entropy regularization and consistency regularization so that the target model learns the intrinsic structure of the target data.

We propose TEN, an accurate method for Black-box UDA. The overview of the proposed TEN is depicted in Fig. 1. The entire procedure comprises two distinct phases: division and training. In the division phase, given a predefined high threshold, we count the number of instances of each class whose confidences surpass the threshold, and subsequently adjust the threshold for each class based on these counts. The target data are divided into clean and noisy subsets in accordance with the adjusted thresholds. Throughout the training phase, soft labels of the target data are generated by leveraging the black-box source model. The target model mimics the soft labels of clean and noisy data via knowledge distillation and negative learning, respectively. In order to facilitate the acquisition of the structural information of the target, we exploit entropy regularization and consistency regularization. These ideas cohesively establish a comprehensive strategy for enhancing the performance of the target model by exploiting the strengths of the black-box source model and structural information in the target data.

Flexible threshold

How can we select clean labels from the target data so that reliable knowledge can be learned during knowledge distillation? The clean subset comprises instances whose soft labels generated by the black-box source model closely align with the ground truths of the target task. Conversely, the noisy subset primarily consists of instances whose soft labels tend to be inaccurate. Our goal is to train the target model to mimic only the clean labels of the black-box source model, since the noisy labels may mislead the target model.

A naive technique uses a predefined high threshold to split instances into (1) clean instances whose confidences surpass the threshold, and (2) noisy instances whose confidences fall below the threshold. Noisy instances are prone to wrong predictions due to the gap between the source and target domains. Thus, we consider only the soft labels of clean instances as teacher labels for knowledge distillation. Nonetheless, distinct classes have different properties for training; using an identical threshold for every class to select clean instances may result in an undesirable scenario where too few instances of hard-to-learn classes are selected for training the model, ultimately leading to inadequate performance.

We propose to use a flexible threshold to set a lower threshold for classes difficult to learn. When the threshold is high, the number of predictions that belong to a certain class and exceed the threshold represents the learning difficulty of the class [16]. We count the number of instances whose confidence exceeds the predefined threshold and belong to the class c:

$$α(c) = \sum_{i=1}^{n_{t}} \mathbb{1}\left[\max({\tilde{y}}_{t}^{i}) > τ_{p}\right] \cdot \mathbb{1}\left[\arg\max({\tilde{y}}_{t}^{i}) = c\right] \tag{1}$$

where ${\tilde{y}}_{t}^{i}$ is the soft label of i-th target instance generated by the black-box source model, i.e., ${\tilde{y}}_{t}^{i} = f_{s} (x_{t}^{i})$ , $τ_{p}$ represents a predefined positive threshold, and n_t represents the number of target instances. We scale the threshold for each class:

$$T_{p}(c) = β(c) \cdot τ_{p} \quad \text{and} \quad β(c) = \frac{α(c)}{\max_{c} α(c)} \tag{2}$$

where $T_{p} (c)$ is the flexible confidence threshold for class c, and $β (c) \in [0, 1]$ represents the scale factor for class c. The flexible threshold is computed by multiplying the predefined threshold by the scale factor, which lowers the threshold for classes with few high-confidence predictions and enables the selection of a greater quantity of clean labels for training. We train the target model by mimicking the labels of the source model on the selected clean instances whose confidence is at least the adjusted threshold:

$$L_{kd} = - \sum_{i=1}^{n_{t}} \mathbb{1}\left[\max({\tilde{y}}_{t}^{i}) \geq T_{p}\left(\arg\max({\tilde{y}}_{t}^{i})\right)\right] \sum_{c=1}^{C} {\tilde{y}}_{t,c}^{i} \log {\hat{y}}_{t,c}^{i} \tag{3}$$

where ${\hat{y}}_{t}^{i}$ represents the soft label of the i-th target instance obtained from the target model, i.e., ${\hat{y}}_{t}^{i} = f_{t} (x_{t}^{i})$. ${\tilde{y}}_{t, c}^{i}$ and ${\hat{y}}_{t, c}^{i}$ represent the probabilities of the i-th instance for class c, generated by the source and target models, respectively.

IterLNL [5] also adjusts class-wise thresholds using a noise rate technique, but it requires many hyperparameters to which the target performance is sensitive. We automatically adjust the threshold for each class according to the outputs of the black-box source model, which requires neither an extra validation set nor additional hyperparameters.
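The division step and the distillation loss in Eqs. (1) to (3) can be illustrated with a minimal pure-Python sketch. The function names (`flexible_thresholds`, `clean_mask`, `kd_loss`), the toy probabilities, and the fallback used when no prediction exceeds $τ_{p}$ are assumptions made for illustration; they are not taken from the authors' implementation. Target probabilities are assumed to be strictly positive, as softmax outputs are.

```python
import math


def flexible_thresholds(source_probs, tau_p):
    """Per-class thresholds T_p(c) following Eqs. (1)-(2)."""
    num_classes = len(source_probs[0])
    alpha = [0] * num_classes
    for probs in source_probs:
        conf = max(probs)
        if conf > tau_p:                    # first indicator of Eq. (1)
            alpha[probs.index(conf)] += 1   # second indicator: argmax class
    top = max(alpha)
    # Assumption (not in the paper): if no prediction exceeds tau_p,
    # keep the predefined threshold for every class.
    if top == 0:
        return [tau_p] * num_classes
    return [tau_p * a / top for a in alpha]  # Eq. (2): beta(c) * tau_p


def clean_mask(source_probs, thresholds):
    """True where max confidence >= T_p(argmax), the indicator of Eq. (3)."""
    mask = []
    for probs in source_probs:
        conf = max(probs)
        mask.append(conf >= thresholds[probs.index(conf)])
    return mask


def kd_loss(source_probs, target_probs, thresholds):
    """Cross-entropy between source and target soft labels on the clean subset."""
    loss = 0.0
    mask = clean_mask(source_probs, thresholds)
    for y_src, y_tgt, keep in zip(source_probs, target_probs, mask):
        if keep:
            loss -= sum(p * math.log(q) for p, q in zip(y_src, y_tgt))
    return loss


# Toy soft labels from a hypothetical source model (3 classes, 5 instances).
source_probs = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.30, 0.30, 0.40],
    [0.20, 0.50, 0.30],
]
thresholds = flexible_thresholds(source_probs, tau_p=0.7)
mask = clean_mask(source_probs, thresholds)
```

In this toy example the counts are α = (2, 1, 0), so the thresholds are approximately (0.7, 0.35, 0). The last two instances would be rejected by a fixed threshold of 0.7 but are accepted by the flexible one. Note that a class with no prediction above $τ_{p}$ receives a threshold of zero under Eq. (2), so every instance predicted as that class is treated as clean.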

Negative learning

The strategy of training the target model using only clean labels results in a reduction of available training data and information loss. Although the confidence of noisy labels may be too low for knowledge distillation, it does not imply that noisy labels lack learnable information. Indeed, the model’s confidence in data is reflected not only in the presence of certain classes, but also in their absence [17,18]. We select a subset of the noisy labels whose confidence is low enough, and employ negative learning on them.

Given a predefined negative threshold $τ_{n}$ , we apply negative cross-entropy for the selected labels of noisy instances whose probability falls beneath the negative confidence threshold.

$$L_{nce} = - \sum_{i=1}^{n_{t}} \mathbb{1}\left[\min({\tilde{y}}_{t}^{i}) \leq τ_{n}\right] \frac{1}{\sum_{c} \mathbb{1}[{\tilde{y}}_{t,c}^{i} \leq τ_{n}]} \sum_{c=1}^{C} \mathbb{1}[{\tilde{y}}_{t,c}^{i} \leq τ_{n}] \, (1 - {\tilde{y}}_{t,c}^{i}) \log (1 - {\hat{y}}_{t,c}^{i}) \tag{4}$$

where $\sum_{c} 1 [{\tilde{y}}_{t, c}^{i} \leq τ_{n}]$ represents the number of selected labels for the i-th instance.
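As a rough illustration of Eq. (4), the sketch below evaluates the negative learning loss on plain Python lists. It follows the equation as written, summing over every instance passed in; restricting the input to the noisy subset is left to the caller. The function name and the toy values are hypothetical, and target probabilities are assumed to be strictly below 1.

```python
import math


def negative_learning_loss(source_probs, target_probs, tau_n):
    """Negative cross-entropy of Eq. (4) over the given instances."""
    loss = 0.0
    for y_src, y_tgt in zip(source_probs, target_probs):
        # Classes the source model considers (almost) certainly absent.
        selected = [c for c, p in enumerate(y_src) if p <= tau_n]
        if not selected:  # the indicator 1[min(y_src) <= tau_n] is zero
            continue
        inner = sum(
            (1.0 - y_src[c]) * math.log(1.0 - y_tgt[c]) for c in selected
        )
        loss -= inner / len(selected)  # normalize by the number of selected labels
    return loss


# One instance: the source model assigns probability 0 to class 2.
source_probs = [[0.98, 0.02, 0.00]]
loss_high = negative_learning_loss(source_probs, [[0.60, 0.30, 0.10]], tau_n=0.01)
loss_low = negative_learning_loss(source_probs, [[0.60, 0.39, 0.01]], tau_n=0.01)
```

The loss shrinks as the target model moves probability mass away from the selected classes, which is the behavior negative learning is meant to encourage: here `loss_low` is smaller than `loss_high`.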

Structural regularization

Knowledge distillation and negative learning are designed to learn reliable high-confidence knowledge from soft labels. However, they may not capture the intrinsic data structure of the target data [4]. To address the issue, we exploit consistency regularization and entropy regularization strategies.

We assume that the target model yields similar labels when provided with diverse augmentations of the same target input [19]. The target model is trained by strongly augmented target instances, which are supervised by the pseudo-labels derived from weakly augmented target instances, as follows:

$$L_{cr} = - \sum_{i=1}^{n_{t}} \mathbb{1}\left[\max({\hat{y}}_{t,w}^{i}) \geq γ\right] \cdot \sum_{c=1}^{C} {\hat{y}}_{t,w,c}^{i} \log {\hat{y}}_{t,s,c}^{i} \tag{5}$$

where ${\hat{y}}_{t, w}^{i}$ denotes the target labels of the i-th instance under weak augmentations. ${\hat{y}}_{t, w, c}^{i}$ and ${\hat{y}}_{t, s, c}^{i}$ denote the probabilities of the i-th instance for class c under weak and strong augmentations, respectively. $γ$ represents a threshold to ensure that the target model is trained on high-confidence instances. We employ a flip-and-shift augmentation for the weak augmentation, while the strong augmentation is achieved by RandAugment [20].
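A minimal sketch of Eq. (5) follows. It takes precomputed target-model probabilities for the weakly and strongly augmented views and does not implement the augmentations themselves; the function name and inputs are illustrative assumptions, and the strong-view probabilities are assumed to be strictly positive.

```python
import math


def consistency_loss(weak_probs, strong_probs, gamma):
    """Eq. (5): weak-view predictions supervise strong-view predictions."""
    loss = 0.0
    for y_w, y_s in zip(weak_probs, strong_probs):
        if max(y_w) >= gamma:  # keep only confident weak-view predictions
            loss -= sum(p * math.log(q) for p, q in zip(y_w, y_s))
    return loss


weak_probs = [[0.9, 0.1], [0.6, 0.4]]
strong_probs = [[0.8, 0.2], [0.5, 0.5]]
# With gamma = 0.8 only the first instance contributes to the loss.
loss_cr = consistency_loss(weak_probs, strong_probs, gamma=0.8)
```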

Entropy regularization operates under the premise that the probability distribution of a well-trained model’s outputs resembles a one-hot vector across classes. The target model is trained under the supervision of the hard labels derived from the weakly augmented target instances:

$$L_{er} = - \sum_{i=1}^{n_{t}} \mathbb{1}\left[\max({\hat{y}}_{t,w}^{i}) \geq γ\right] \sum_{c=1}^{C} \mathbb{1}\left[\arg\max({\hat{y}}_{t,w}^{i}) = c\right] \cdot \log {\hat{y}}_{t,w,c}^{i} \tag{6}$$

Summarizing all the losses, the overall objective is formulated as:

$$L_{total} = L_{kd} + λ_{1} L_{nce} + λ_{2} L_{cr} + λ_{3} L_{er} \tag{7}$$

where $λ_{1}$ , $λ_{2}$ , and $λ_{3}$ are balancing hyperparameters. We present the complete algorithm for TEN in Algorithm 1.
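For illustration, Eq. (6) reduces to the negative log of the largest weak-view probability for each confident instance, and Eq. (7) is a weighted sum of the four losses. The sketch below encodes both; the function names and the numeric values in the example are placeholders, not the settings used in the experiments.

```python
import math


def entropy_regularization_loss(weak_probs, gamma):
    """Eq. (6): cross-entropy against the hard label of the weak view."""
    loss = 0.0
    for y_w in weak_probs:
        conf = max(y_w)  # probability of the argmax class
        if conf >= gamma:
            loss -= math.log(conf)
    return loss


def total_loss(l_kd, l_nce, l_cr, l_er, lam1, lam2, lam3):
    """Eq. (7): weighted sum of the four loss terms."""
    return l_kd + lam1 * l_nce + lam2 * l_cr + lam3 * l_er


loss_er = entropy_regularization_loss([[0.9, 0.1], [0.6, 0.4]], gamma=0.8)
# Placeholder loss values and weights, chosen only to show the combination.
overall = total_loss(1.0, 2.0, 3.0, 4.0, lam1=0.5, lam2=0.1, lam3=0.2)
```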

Experiments

We present experimental results to answer the following questions about TEN:

Q1
Classification accuracy. Does TEN show better accuracies than baselines on benchmarks?
Q2
Effect of flexible threshold. Does flexible threshold improve the target performance?
Q3
Effect of negative learning. Does TEN exhibit desirable performance when a large amount of noisy data are used for training?
Q4
Ablation study. Do our ideas, such as knowledge distillation on clean subset, negative learning, and structural information, improve the target performance?
Q5
Hyperparameter sensitivity. Is accuracy sensitive to the positive and negative thresholds?

Algorithm 1 Threshold-Based Exploitation of Noisy Predictions (TEN)

Input: black-box source model f_s, unlabeled target data $D_{t} = \{x_{t}^{i}\}_{i = 1}^{n_{t}}$, randomly initialized target model f_t, and predefined thresholds $τ_{p}$ and $τ_{n}$

Output: well-trained target model f_t

1: for each class do

2: Count the number of high-confidence instances for knowledge distillation using Eq. (1)

3: end for

4: Calculate positive threshold $T_{p} (c)$ using Eq. (2)

5: Divide the target data into clean and noisy subsets in accordance with $T_{p} (c)$ and $τ_{n}$

6: for each epoch do

7: for each batch do

8: Compute knowledge distillation loss L_kd for clean subset using Eq. (3)

9: Compute negative learning loss L_nce for noisy subset using Eq. (4)

10: Compute consistency regularization loss L_cr and entropy regularization loss L_er for all data using Eq. (5) and Eq. (6), respectively

11: Compute the overall loss L_total using Eq. (7), and update parameters of the target model f_t

12: end for

13: end for

Experimental setup

We present datasets, models, baselines, and hyperparameters for our experiments.

Datasets.

We use 5 image classification datasets summarized in Table 2. Office-31¹ comprises 3 domains, Amazon (A), DSLR (D), and Webcam (W), whereas Office-Home² [21] encompasses 4 domains, Art (A), Clipart (C), Product (P), and Real-World (R). The Image-CLEF³ dataset is composed of 4 domains, including Caltech-256 (C), ILSVRC2012 (I), PASCAL VOC2012 (P), and Bing (B). Adaptiope⁴ [22] contains 3 domains, Product (P), Real life (R), and Synthetic (S), while VisDA⁵ [23] comprises Synthetic (S) and Real (R) domains. The imbalance ratio in a multi-class dataset indicates the ratio of the number of instances in the least prevalent (minority) class to that in the most prevalent (majority) class.
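The imbalance ratio defined above can be computed directly from a list of class labels, as in the following minimal sketch. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Size of the smallest class divided by the size of the largest class."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Toy labels: 10 instances of "a", 4 of "b", 8 of "c".
labels = ["a"] * 10 + ["b"] * 4 + ["c"] * 8
ratio = imbalance_ratio(labels)  # 4 / 10
```

A perfectly balanced dataset has a ratio of 1 (reported as 100% in Table 2), and smaller values indicate stronger imbalance.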

Table 2. Summary of datasets.

Dataset	# of instances	# of classes	Imbalance ratio
Office-31¹	4,110	31	22.58%
Office-Home²	15,588	65	15.15%
Image-CLEF³	2,400	12	100%
Adaptiope⁴	36,900	123	100%
VisDA⁵	> 280,000	212	34.19%

¹ https://faculty.cc.gatech.edu/~judy/domainadapt/

² https://www.hemanthdv.org/officeHomeDataset.html

³ https://www.imageclef.org/2014/adaptation

⁴ https://gitlab.com/tringwald/adaptiope

⁵ https://ai.bu.edu/visda-2017/

Models.

We use ResNet50 as the source backbone and follow DINE [4] to train the source classifier. We add a fully-connected layer at the end of the backbone feature extractor and train the source model f_s with the label smoothing technique [24]. The loss of the source model is defined as $L_{s} = - 𝔼_{(x_{s}, y_{s}) \in 𝒳_{s} \times 𝒴_{s}} (q_{s})^{T} \log f_{s} (x_{s})$, where $q_{s} = (1 - ϵ) \cdot 1_{y_{s}} + ϵ / K$ represents the smoothed label vector. The smoothing parameter $ϵ$ is set empirically to a value of 0.1, and $1_{j}$ represents a one-hot encoding of K dimensions where only the j-th value is 1. For the target model, we also use ResNet50 as the backbone, and follow [2, 4] to replace the original classifier with a refined architecture that consists of a bottleneck layer with 256 units and a task-specific classifier. We place a batch-normalization layer after the fully-connected layer within the bottleneck layer, and a weight normalization [25] layer in the task-specific classifier.
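The label smoothing used for the source model can be sketched as follows. The code builds the smoothed vector q_s for a single instance and evaluates the corresponding cross-entropy term inside L_s. The function names are illustrative, K is taken to be the number of classes, and predicted probabilities are assumed to be strictly positive.

```python
import math


def smoothed_label(y, num_classes, eps=0.1):
    """q_s = (1 - eps) * one_hot(y) + eps / K for a single label y."""
    return [
        (1.0 - eps) * (1.0 if k == y else 0.0) + eps / num_classes
        for k in range(num_classes)
    ]


def smoothed_cross_entropy(probs, y, eps=0.1):
    """Per-instance term of L_s: -(q_s)^T log f_s(x_s)."""
    q = smoothed_label(y, len(probs), eps)
    return -sum(qk * math.log(pk) for qk, pk in zip(q, probs))


q = smoothed_label(1, num_classes=4, eps=0.1)
loss = smoothed_cross_entropy([0.25, 0.25, 0.25, 0.25], y=1)
```

With four classes and ϵ = 0.1, the true class receives a weight of about 0.925 and each remaining class about 0.025, so the smoothed vector still sums to 1.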

Baselines.

We compare our proposed TEN with two competitors: DINE [4] and IterLNL [5]. DINE trains the target model by distilling the soft labels derived from the black-box source model, which includes noisy data, while IterLNL focuses on training exclusively with clean data, which are selected based on the noise rate. Additionally, we train the target model using only the target predictions from the black-box source model, and the method is denoted as “No adapt.”

Hyperparameters.

We run each experiment five times and report the average accuracies. For all experiments, we use PyTorch on a GeForce RTX 3080. Following DINE [4], the models initialized from the pre-trained ImageNet model have their learning rate set to 1e-3, and those learned from scratch are set to a learning rate of 1e-2. Furthermore, we adopt a learning rate scheduler, momentum (0.9), weight decay (1e-3), bottleneck size (256), and batch size (64). The thresholds $τ_{p}$ and $τ_{n}$ are selected from the sets {0.5, 0.6, 0.7, 0.8} and {0.0001, 0.0005, 0.001, 0.005, 0.01}, respectively. The hyperparameters are optimized using Optuna.

Table 4. Accuracies (%) on Image-CLEF for black-box model adaptation. The best is in bold. TEN outperforms baselines with up to 3.67% higher accuracy.

Method	B $\to$ C	B $\to$ I	B $\to$ P	C $\to$ B	C $\to$ I	C $\to$ P	I $\to$ B	I $\to$ C	I $\to$ P	P $\to$ B	P $\to$ C	P $\to$ I
No adapt.	89.00	83.67	67.17	60.17	83.50	71.83	61.50	92.33	76.00	60.00	90.83	89.00
DINE	96.00	93.67	78.17	64.83	91.50	77.83	64.17	96.67	79.00	63.83	96.00	93.17
IterLNL	94.67	93.17	76.83	65.50	91.17	78.50	64.67	95.00	78.67	64.00	95.17	93.50
TEN (proposed)	97.00	94.83	80.00	66.33	93.67	80.50	66.50	97.67	79.67	67.67	97.50	95.00

Table 5. Accuracies (%) on Office-31 for black-box model adaptation. The best is in bold. TEN outperforms baselines with up to 7.59% higher accuracy.

Method	A $\to$ D	A $\to$ W	D $\to$ A	D $\to$ W	W $\to$ A	W $\to$ D
No adapt.	80.12	76.98	57.15	92.70	61.02	98.39
DINE	94.18	86.67	71.67	93.71	73.09	99.20
IterLNL	94.02	87.13	71.49	92.96	69.40	99.28
TEN (proposed)	95.38	94.72	74.83	98.24	76.22	99.80

Table 6. Accuracies (%) on Adaptiope for black-box model adaptation. The best is in bold. TEN outperforms baselines with up to 9.49% higher accuracy.

Method	P $\to$ R	P $\to$ S	R $\to$ P	R $\to$ S	S $\to$ P	S $\to$ R
No adapt.	67.03	33.62	87.54	29.61	10.28	2.10
DINE	78.85	44.55	91.71	40.69	18.50	4.84
IterLNL	76.36	48.76	88.64	41.79	15.97	4.07
TEN (proposed)	78.54	53.33	91.80	50.37	25.46	6.09

Classification accuracy (Q1)

We report the target accuracies on the five datasets in Tables 3 to 7. TEN achieves the highest average accuracy in most cases and surpasses the second-best method by up to 7.59%, 4.08%, 3.67%, 9.49%, and 3.62% for Office-31, Office-Home, Image-CLEF, Adaptiope, and VisDA, respectively. Despite the significant disparity between the synthetic domain (Synthetic) and the real-world domains (Product or Real life) in Adaptiope, TEN exhibits considerable improvement, which shows that TEN is an effective method for domain adaptation.

Table 3. Accuracies (%) on Office-Home for black-box model adaptation. The best is in bold. TEN outperforms baselines with up to 4.08% higher accuracy.

Method	A $\to$ C	A $\to$ P	A $\to$ R	C $\to$ A	C $\to$ P	C $\to$ R	P $\to$ A	P $\to$ C	P $\to$ R	R $\to$ A	R $\to$ C	R $\to$ P
No adapt.	44.24	66.68	74.20	53.23	62.65	64.82	52.90	40.96	73.61	66.38	47.06	77.20
DINE	52.88	78.01	81.91	64.44	75.72	78.82	62.34	49.28	82.26	71.03	55.95	84.25
IterLNL	50.47	78.48	78.34	65.95	77.15	75.92	61.84	49.01	78.09	67.58	53.56	84.58
TEN (proposed)	55.01	78.55	81.29	66.79	78.89	79.37	66.42	51.75	81.75	72.31	58.49	85.42

Table 7. Accuracies (%) on VisDA for black-box model adaptation. The best is in bold. TEN outperforms baselines with up to 3.62% higher accuracy.

Method	S $\to$ R	R $\to$ S
No adapt.	39.92	58.84
DINE	60.95	76.61
IterLNL	62.27	73.21
TEN (proposed)	65.73	80.23

We also evaluate the performance of Black-box UDA with multiple source models. The soft labels of the source models are aggregated through averaging to establish the initialized soft label of the source models. We conduct experiments on four multi-source datasets, and use ResNet101 as backbone. As shown in Table 9, TEN demonstrates competitive performance across all these datasets, surpassing competitors by up to 4.81%.
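The aggregation of soft labels from multiple source models by averaging can be sketched as below. The nested-list layout `per_source_probs[m][i][c]` (source model m, instance i, class c) and the function name are assumptions made for this example.

```python
def average_soft_labels(per_source_probs):
    """Average the soft labels of several source models per instance and class."""
    num_sources = len(per_source_probs)
    return [
        [sum(class_probs) / num_sources for class_probs in zip(*instance_preds)]
        for instance_preds in zip(*per_source_probs)
    ]


# Two hypothetical source models, two target instances, two classes.
per_source_probs = [
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.4, 0.6], [0.1, 0.9]],
]
averaged = average_soft_labels(per_source_probs)
```

Because each input row is a probability vector, each averaged row also sums to 1 and can be used in place of the single-source soft label.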

Table 9. Accuracies (%) on four datasets for multi-source model adaptation. The best is in bold. TEN outperforms baselines with up to 4.81% higher accuracy.

Method	Office-Home				Image-CLEF				Office-31			Adaptiope
	$\to$ A	$\to$ C	$\to$ P	$\to$ R	$\to$ B	$\to$ C	$\to$ I	$\to$ P	$\to$ A	$\to$ D	$\to$ W	$\to$ P	$\to$ R	$\to$ S
No adapt.	54.98	49.90	69.64	76.77	61.64	92.18	87.47	72.40	64.55	82.34	80.75	73.96	63.90	33.15
DINE	74.95	64.15	84.63	84.79	64.99	97.87	93.08	79.79	77.19	99.21	98.28	73.48	73.60	45.33
IterLNL	76.29	65.19	82.48	80.50	63.77	97.17	94.19	80.09	77.06	97.38	96.38	78.84	72.98	45.53
TEN (proposed)	76.09	65.41	86.74	86.08	66.29	98.32	94.51	81.47	79.27	99.27	98.76	83.65	75.25	46.25

Effect of flexible threshold (Q2)

To demonstrate the effectiveness of the flexible threshold, we compare the target performance against various division strategies as presented in Table 8. “No div.” skips the division phase, and leverages all soft labels including noise for knowledge distillation. “TEN (fixed)” chooses clean labels based on a predefined threshold to ensure that the labels possess high confidence. Notably, our proposed TEN surpasses the baselines, demonstrating that the flexible threshold is beneficial by fully harnessing clean labels with high confidence. We analyze the reason by plotting the instances (clean subset) utilized for knowledge distillation in Fig. 2. “Pos” and “Neg” represent instances that are labeled correctly and incorrectly, respectively. With a lower fixed threshold as in Fig. 2 (a), all instances including noise are incorporated into training, which potentially undermines the target performance. In contrast, a higher fixed threshold ensures the selection of high-confidence instances but might also curtail the overall number of “Pos” training instances as shown in Fig. 2 (b). Compared to the baselines, TEN uses a flexible threshold to diminish the number of “Neg” instances in the selection process while simultaneously ensuring the inclusion of a sufficient quantity of “Pos” instances as shown in Fig. 2 (c). This closely aligns with the ideal scenario where all “Pos” instances are employed for distillation, while all “Neg” instances are excluded.

Table 8. Accuracies (%) on Office-31 for various division strategies used for splitting the target data into clean and noisy subsets. The best is in bold. Our proposed TEN outperforms the baselines on most tasks, demonstrating that the flexible threshold is effective by taking advantage of high-confidence clean labels.

Method	A $\to$ D	A $\to$ W	D $\to$ A	D $\to$ W	W $\to$ A	W $\to$ D
No div.	94.62	96.77	74.19	97.85	75.27	97.85
TEN (fixed)	83.87	77.42	72.04	96.77	73.12	97.85
TEN (proposed)	95.38	94.72	74.83	98.24	76.22	99.80

Fig 2. The stacked bars indicate the number of instances used for knowledge distillation in each case. “Pos” and “Neg” represent instances that are labeled correctly and incorrectly, respectively. (a) When the fixed threshold is low, all instances, including noisy ones, are used for training, which may diminish the target performance. (b) Conversely, when the fixed threshold is high, high-confidence instances are selected, but a significant number of training instances are lost. (c) TEN reduces the number of “Neg” instances in the selection process while ensuring the inclusion of an adequate number of “Pos” instances. It closely approximates the optimal scenario in which all positive instances are used for distillation and all negative instances are excluded.

Effect of negative learning (Q3)

We conduct experiments on the Office-31 dataset with varying quantities of noisy instances for training. In Table 10, the percentage in the first column indicates the proportion of noisy data used for training relative to the total amount of noisy data available. We make two observations. First, TEN achieves the highest accuracy in almost all cases, with an improvement of up to 7.59%. Second, the performance advantage of TEN over its competitors increases as more noisy data are used for training. The reason is that TEN effectively extracts high-confidence information from the noisy data, mitigating information loss.
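
As a rough illustration of how noisy labels can still carry high-confidence information, the sketch below implements a generic negative-learning loss: classes that the source model considers very unlikely for a sample are treated as absent, and the target model is penalized for assigning probability to them. The selection rule (source probability below `tau_n`), the function name `negative_learning_loss`, and the toy numbers are assumptions for illustration, not the exact loss of TEN.

```python
import numpy as np

def negative_learning_loss(target_probs, source_probs, tau_n=0.005, eps=1e-12):
    """Average -log(1 - p) over (sample, class) pairs the source model marks as absent.

    A class counts as absent for a sample when its source probability is
    below tau_n (assumed selection rule for this sketch).
    """
    absent = source_probs < tau_n
    if not absent.any():
        return 0.0
    penalties = -np.log(np.clip(1.0 - target_probs, eps, 1.0))
    return float(penalties[absent].mean())

# One noisy sample: the source model cannot decide between classes 0 and 1,
# but it is confident that class 2 is absent.
source = np.array([[0.499, 0.498, 0.003]])

target_a = np.array([[0.40, 0.30, 0.30]])  # still puts mass on the absent class
target_b = np.array([[0.50, 0.49, 0.01]])  # respects the absence

loss_a = negative_learning_loss(target_a, source)
loss_b = negative_learning_loss(target_b, source)
print(loss_a, loss_b)
```

The loss is smaller for the second target prediction, which is the behavior negative learning is meant to encourage.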

Table 10. Accuracies (%) on Office-31 for different numbers of noisy instances used for negative learning. The best is in bold. The percentage in the first column indicates the proportion of noisy data used for training relative to the entire noisy data. Note that TEN outperforms competitors in almost all cases. Also note that the performance gap between TEN and its competitors increases as more noisy data are used for training.

Noisy data for training	Method	A $\to$ D	A $\to$ W	D $\to$ A	D $\to$ W	W $\to$ A	W $\to$ D
100%	No adapt.	80.12	76.98	57.15	92.70	61.02	98.39
	DINE	94.18	86.67	71.67	93.71	73.09	99.20
	IterLNL	94.02	87.13	71.49	92.96	69.40	99.28
	TEN (proposed)	95.38	94.72	74.83	98.24	76.22	99.80
66%	No adapt.	78.12	74.82	55.37	91.63	58.37	96.64
	DINE	93.17	85.29	69.74	91.73	70.19	97.90
	IterLNL	91.99	84.74	70.45	90.22	67.65	96.94
	TEN (proposed)	92.06	86.87	72.27	92.89	72.93	98.20
33%	No adapt.	75.39	73.00	53.31	87.39	55.94	95.64
	DINE	91.46	82.73	68.05	89.36	68.29	95.05
	IterLNL	90.39	82.30	68.45	87.75	65.04	95.85
	TEN (proposed)	91.82	83.95	69.57	88.09	68.58	96.06
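
The trend in Table 10 can be checked with simple arithmetic. The sketch below copies the accuracies of DINE, IterLNL, and TEN from the table as printed and computes, for each proportion of noisy data, the mean margin of TEN over the better of the two competitors and the number of tasks on which TEN is best. The names `table10`, `mean_gap`, and `wins` are introduced only for this illustration.

```python
import numpy as np

# Accuracies (%) copied from Table 10; columns: A->D, A->W, D->A, D->W, W->A, W->D.
table10 = {
    "100%": {
        "DINE":    [94.18, 86.67, 71.67, 93.71, 73.09, 99.20],
        "IterLNL": [94.02, 87.13, 71.49, 92.96, 69.40, 99.28],
        "TEN":     [95.38, 94.72, 74.83, 98.24, 76.22, 99.80],
    },
    "66%": {
        "DINE":    [93.17, 85.29, 69.74, 91.73, 70.19, 97.90],
        "IterLNL": [91.99, 84.74, 70.45, 90.22, 67.65, 96.94],
        "TEN":     [92.06, 86.87, 72.27, 92.89, 72.93, 98.20],
    },
    "33%": {
        "DINE":    [91.46, 82.73, 68.05, 89.36, 68.29, 95.05],
        "IterLNL": [90.39, 82.30, 68.45, 87.75, 65.04, 95.85],
        "TEN":     [91.82, 83.95, 69.57, 88.09, 68.58, 96.06],
    },
}

def margins(rows):
    """Per-task margin of TEN over the better competitor."""
    best_competitor = np.maximum(rows["DINE"], rows["IterLNL"])
    return np.array(rows["TEN"]) - best_competitor

mean_gap = {level: float(margins(rows).mean()) for level, rows in table10.items()}
wins = {level: int((margins(rows) > 0).sum()) for level, rows in table10.items()}

for level in table10:
    print(f"{level}: mean margin {mean_gap[level]:.2f}, TEN best on {wins[level]} of 6 tasks")
```

The mean margin grows from the 33% setting to the 100% setting, and the largest single margin is the 7.59 points on A $\to$ W with all noisy data.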

Ablation study (Q4)

We examine the contribution of the components of TEN in Table 11. The proposed flexible threshold, negative learning, entropy regularization, and consistency regularization loss consistently enhance the accuracy of TEN, by up to 21.28%, 1.90%, 3.93%, and 4.77%, respectively.
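
The per-component gains can be recomputed from Table 11 as the difference between the full model and the variant with one loss term removed. The sketch below uses the table values as printed; the mapping of the flexible threshold to the $L_{kd}$ column is an assumption based on the order of the components in the text, and the names `full`, `without`, and `gain` are hypothetical.

```python
import numpy as np

tasks = ["A->D", "A->W", "D->A", "D->W", "W->A", "W->D"]

# Full TEN (last row of Table 11).
full = np.array([95.38, 94.72, 74.83, 98.24, 76.22, 99.80])

# Variants with exactly one loss term removed (rows 1 to 4 of Table 11).
without = {
    "L_cr (consistency regularization)": [91.00, 90.53, 70.06, 93.95, 73.16, 97.39],
    "L_er (entropy regularization)":     [92.35, 91.28, 70.90, 94.42, 72.67, 96.56],
    "L_nce (negative learning)":         [93.49, 92.82, 73.52, 96.80, 75.21, 98.38],
    "L_kd (assumed: flexible threshold)": [76.89, 75.21, 56.49, 76.96, 56.95, 77.98],
}

# Gain of the full model over each variant, per task.
gain = {name: full - np.array(row) for name, row in without.items()}

for name, g in gain.items():
    best = int(np.argmax(g))
    print(f"{name}: largest gain {g[best]:.2f} on {tasks[best]}")
```

With the values as printed, the maxima for $L_{nce}$, $L_{er}$, and $L_{cr}$ are 1.90, 3.93, and 4.77. For the row without $L_{kd}$, D $\to$ W gives the 21.28 quoted in the text, while W $\to$ D gives a slightly larger difference of 21.82.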

Table 11. Ablation study for TEN. The best is in bold. TEN achieves the highest accuracy among its variants, demonstrating that the main ideas of TEN are effective for its superior performance.

$L_{kd}$	$L_{nce}$	$L_{er}$	$L_{cr}$	A $\to$ D	A $\to$ W	D $\to$ A	D $\to$ W	W $\to$ A	W $\to$ D
o	o	o		91.00	90.53	70.06	93.95	73.16	97.39
o	o		o	92.35	91.28	70.90	94.42	72.67	96.56
o		o	o	93.49	92.82	73.52	96.80	75.21	98.38
	o	o	o	76.89	75.21	56.49	76.96	56.95	77.98
o	o	o	o	95.38	94.72	74.83	98.24	76.22	99.80

Hyperparameter sensitivity (Q5)

Fig 3 shows how sensitive the accuracy is to the two hyperparameters $τ_{p}$ (positive threshold) and $τ_{n}$ (negative threshold) on the Office-31 dataset. The bottom horizontal axis represents $τ_{p}$, ranging from 0.50 to 0.80, while the top horizontal axis corresponds to $τ_{n}$, presented on a logarithmic scale from 10⁻⁴ to 10⁻². The vertical axis indicates accuracy. For $τ_{p}$, the accuracy is highest at 0.6 and drops when $τ_{p}$ is too high, because a higher $τ_{p}$ reduces the number of clean samples used to adjust the threshold, leading to an inaccurate estimate of the flexible thresholds. For $τ_{n}$, the accuracy is highest at 0.005 and remains stable across different values, with only minor variations.
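
A sensitivity study of this kind can be organized as a small sweep. The sketch below is only a scaffold: the grid points, the joint search over both thresholds, the `sweep` helper, and in particular `placeholder_score` are assumptions for illustration. The placeholder is a synthetic function that stands in for a full training and evaluation run; it does not reproduce any measured accuracy.

```python
import itertools
import math
import numpy as np

# Linear grid for the positive threshold, log-spaced grid for the negative one.
tau_p_grid = np.round(np.linspace(0.50, 0.80, 7), 2)
tau_n_grid = [1e-4, 5e-4, 1e-3, 5e-3, 1e-2]

def sweep(evaluate):
    """Evaluate every (tau_p, tau_n) pair and return the best pair and its score."""
    results = {
        (float(tp), float(tn)): evaluate(float(tp), float(tn))
        for tp, tn in itertools.product(tau_p_grid, tau_n_grid)
    }
    best = max(results, key=results.get)
    return best, results[best]

def placeholder_score(tau_p, tau_n):
    """Synthetic stand-in for 'train the target model, return accuracy'.

    It is shaped to peak at the optimum reported in the text and to vary
    only mildly with tau_n; it is NOT real experimental data.
    """
    return -(tau_p - 0.6) ** 2 - 0.01 * (math.log10(tau_n) - math.log10(0.005)) ** 2

best_pair, best_score = sweep(placeholder_score)
print("best (tau_p, tau_n):", best_pair)
```

In practice `placeholder_score` would be replaced by a function that trains the target model with the given thresholds and returns its accuracy on the target domain.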

Conclusion

We propose TEN, an accurate method for Black-box Unsupervised Domain Adaptation. TEN partitions the target data into clean and noisy subsets. The pseudo labels of the clean subset correspond closely to the ground truths of the target task, while those of the noisy subset are often inaccurate. High-confidence predictions in the clean subset reflect the presence of certain classes, while high-confidence predictions in the noisy subset reflect their absence. Considering this, we exploit knowledge distillation on clean labels and negative learning on noisy labels to learn their respective high-confidence predictions. Experimental results demonstrate that TEN outperforms the baseline methods by up to 9.49% in accuracy for single-source UDA and by up to 4.81% for multi-source UDA. The performance of our approach improves as more noisy data are used for training.

There are several possible future research directions. Our primary contribution in this work is a significant boost in accuracy; improving computational efficiency is another important goal. Adapting our method to non-image data by considering their unique characteristics would also be interesting. Finally, addressing the case where only a small amount of unlabeled target data is available is a promising direction.

Data Availability

All relevant data can be downloaded from the following URLs: https://faculty.cc.gatech.edu/~judy/domainadapt/, https://www.hemanthdv.org/officeHomeDataset.html, https://www.imageclef.org/2014/adaptation, https://gitlab.com/tringwald/adaptiope, and https://ai.bu.edu/visda-2017/. The authors do not own these datasets and had no special access privileges that others would not have.

Funding Statement

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No.2022-0-00641, XVoice: Multi-Modal Voice Meta Learning], [No.RS-2020-II200894, Flexible and Efficient Model Compression Method for Various Applications and Environments], [No.RS-2021-II211343, Artificial Intelligence Graduate School Program (Seoul National University)], and [No.RS-2021-II212068, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)]. The Institute of Engineering Research at Seoul National University provided research facilities for this work. The ICT at Seoul National University provided research facilities for this study.

PLoS One. 2025 May 12;20(5):e0321987. doi: 10.1371/journal.pone.0321987.r001

Author response to Decision Letter 0

22 Sep 2024

PLoS One. doi: 10.1371/journal.pone.0321987.r002

Decision Letter 0

Lei Chu

26 Nov 2024

PONE-D-24-42138

Threshold-Based Exploitation of Noisy Label in Black-box Unsupervised Domain Adaptation

PLOS ONE

Dear Dr. Kang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 10 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Lei Chu

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work.

Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

4. Thank you for stating the following financial disclosure:

“This work was supported by Institute of Information & communications Technology Planning & Evaluation(IITP) grant funded by the Korea government(MSIT) [No.2022-0-00641, XVoice: Multi-Modal Voice Meta Learning], [No.RS-2020-II200894, Flexible and Efficient Model Compression Method for Various Applications and Environments], [No.RS-2021-II211343, Artificial Intelligence Graduate School Program (Seoul National University)], and [NO.RS-2021-II212068, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)]. The Institute of Engineering Research at Seoul National University provided research facilities for this work. The ICT at Seoul National University provides research facilities for this study.”

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

5. We note that your Data Availability Statement is currently as follows:

“All relevant data are within the manuscript and its Supporting Information files.”

Please confirm at this time whether or not your submission contains all raw data required to replicate the results of your study. Authors must share the “minimal data set” for their submission. PLOS defines the minimal data set to consist of the data required to replicate all study findings reported in the article, as well as related metadata and methods (https://journals.plos.org/plosone/s/data-availability#loc-minimal-data-set-definition).

For example, authors should submit the following data:

- The values behind the means, standard deviations and other measures reported;

- The values used to build graphs;

- The points extracted from images for analysis.

Authors do not need to submit their entire data set if only a portion of the data was used in the reported study.

If your submission does not contain these data, please either upload them as Supporting Information files or deposit them to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of recommended repositories, please see https://journals.plos.org/plosone/s/recommended-repositories.

If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent. If data are owned by a third party, please indicate how others may request data access.

Additional Editor Comments:

The study shows promise, but certain aspects require further elaboration and clarification to enhance its rigor and impact. For example, the reviewers highlight the need for a more detailed analysis of hyperparameter sensitivity, particularly thresholds and regularization parameters critical to the method's performance. They recommend including explicit comparisons of computational efficiency against baseline methods to emphasize the practical benefits. Additionally, a discussion on potential failure cases and the method's generalizability to domains beyond image classification is necessary to clarify its broader applicability and limitations.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors proposed "Threshold-Based Exploitation of Noisy Label in Black-box Unsupervised Domain Adaptation". The article is well structured, but the authors should address the following comments.

1. Proofread the entire manuscript.

2. Draw a graphical abstract of your proposed approach.

3. Compare your approach with previous approaches.

4. Explain the features you are using in this study.

Reviewer #2: This paper tackles the significant challenge of Black-box Unsupervised Domain Adaptation (UDA), where source data and model parameters are inaccessible due to privacy constraints. The authors propose Threshold-Based Exploitation of Noisy Predictions (TEN), a method designed to adapt target models using noisy labels generated by a black-box source model. TEN introduces a flexible thresholding mechanism to classify data into clean and noisy subsets, effectively handling class imbalance and improving learning from high-confidence instances. It further employs knowledge distillation for clean labels, negative learning for noisy labels, and structural regularization techniques to enhance the adaptation process. Extensive experiments demonstrate that TEN achieves up to 9.49% higher accuracy than existing baselines, highlighting its robustness and practicality. The flexible threshold approach is particularly notable, addressing the imbalance and difficulty of learning from noisy labels effectively. Additionally, the integration of knowledge distillation, negative learning, and entropy regularization creates a well-rounded and efficient framework for improving target model performance. The authors support their claims with extensive experiments across multiple datasets and scenarios, including single-source and multi-source UDA, demonstrating consistent accuracy improvements. The inclusion of ablation studies further validates the significance of each component in the proposed method.

However, there are still some issues to be addressed before the paper is accepted:

1. There should be a deeper discussion of hyperparameter sensitivity, particularly the thresholds and regularization parameters critical to TEN's performance.

2. While the computational efficiency of the method is implied, explicit comparisons of overhead against baselines would strengthen the practical appeal.

3. The authors should discuss potential failure cases and the generalizability of the method to domains beyond image classification; without this discussion, the broader applicability of the findings is limited.

4. The authors should include references related to UDA and black-box UDA in related work, such as [1-3].

[1] Zhang, J., Huang, J., Jiang, X., & Lu, S. (2023). Black-box unsupervised domain adaptation with bi-directional atkinson-shiffrin memory. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 11771-11782).

[2] Zhu, C., Wang, Q., Xie, Y., & Xu, S. (2024). Multiview latent space learning with progressively fine-tuned deep features for unsupervised domain adaptation. Information Sciences, 662, 120223.

[3] Zhu, C., Zhang, L., Luo, W., Jiang, G., & Wang, Q. (2024). Tensorial multiview low-rank high-order graph learning for context-enhanced domain adaptation. Neural Networks, 106859.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2025 May 12;20(5):e0321987. doi: 10.1371/journal.pone.0321987.r003

Author response to Decision Letter 1

23 Feb 2025

We have carefully addressed all your comments and concerns in the “Response to Reviewers” file.

Attachment

Submitted filename: response to reviewers.pdf

pone.0321987.s001.pdf (64.6KB, pdf)

PLoS One. doi: 10.1371/journal.pone.0321987.r004

Decision Letter 1

Lei Chu

14 Mar 2025

Threshold-Based Exploitation of Noisy Label in Black-box Unsupervised Domain Adaptation

PONE-D-24-42138R1

Dear Dr. Kang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Lei Chu

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #2: The authors have addressed my previous concerns. I suggest that the manuscript be accepted in its current version.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

PLoS One. doi: 10.1371/journal.pone.0321987.r005

Acceptance letter

Lei Chu

PONE-D-24-42138R1

PLOS ONE

Dear Dr. Kang,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Lei Chu

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Attachment

Submitted filename: response to reviewers.pdf

pone.0321987.s001.pdf (64.6 KB, pdf)

Data Availability Statement

1. Kundu JN, Venkat N, V RM, Babu RV. Universal source-free domain adaptation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13–19, 2020. Computer Vision Foundation/IEEE; 2020. pp. 4543–52.

2. Liang J, Hu D, Feng J. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event. vol. 119 of Proceedings of Machine Learning Research. PMLR; 2020. pp. 6028–39.

3. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial networks. Commun ACM. 2020;63(11):139–44.

4. Liang J, Hu D, Feng J, He R. DINE: domain adaptation from single and multiple black-box predictors. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022. IEEE; 2022. pp. 7993–8003.

5. Zhang H, Zhang Y, Jia K, Zhang L. Unsupervised domain adaptation of black-box source models. In: 32nd British Machine Vision Conference 2021, BMVC 2021, Online, November 22–25, 2021. BMVA Press; 2021. p. 147.

6. Kuzborskij I, Orabona F. Stability and hypothesis transfer learning. In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, June 16–21, 2013. vol. 28 of JMLR Workshop and Conference Proceedings. JMLR.org; 2013. pp. 942–50.

7. Zhu C, Zhang L, Luo W, Jiang G, Wang Q. Tensorial multiview low-rank high-order graph learning for context-enhanced domain adaptation. Neural Netw. 2025;181:106859. doi: 10.1016/j.neunet.2024.106859

8. Zhu C, Wang Q, Xie Y, Xu S. Multiview latent space learning with progressively fine-tuned deep features for unsupervised domain adaptation. Inf Sci. 2024;662:120223. doi: 10.1016/j.ins.2024.120223

9. Zhang J, Huang J, Jiang X, Lu S. Black-box unsupervised domain adaptation with bi-directional Atkinson-Shiffrin memory. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1–6, 2023.

10. Tarvainen A, Valpola H. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA; 2017. pp. 1195–204.

11. Arazo E, Ortego D, Albert P, O’Connor NE, McGuinness K. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In: 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19–24, 2020. IEEE; 2020. pp. 1–8.

12. Morerio P, Volpi R, Ragonesi R, Murino V. Generative pseudo-label refinement for unsupervised domain adaptation. In: IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1–5, 2020. IEEE; 2020. pp. 3119–28.

13. Saito K, Ushiku Y, Harada T. Asymmetric tri-training for unsupervised domain adaptation. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017. vol. 70 of Proceedings of Machine Learning Research. PMLR; 2017. pp. 2988–97.

14. Kundu JN, Lakkakula N, Radhakrishnan VB. UM-Adapt: unsupervised multi-task adaptation using adversarial cross-task distillation. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27–November 2, 2019. IEEE; 2019. pp. 1436–45.

15. Zhou B, Kalra N, Krähenbühl P. Domain adaptation through task distillation. In: Computer Vision—ECCV 2020—16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI. vol. 12371 of Lecture Notes in Computer Science. Springer; 2020. pp. 664–80.

16. Zhang B, Wang Y, Hou W, Wu H, Wang J, Okumura M, et al. FlexMatch: boosting semi-supervised learning with curriculum pseudo labeling. In: Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6–14, 2021, virtual; 2021. pp. 18408–19.

17. Kim Y, Yim J, Yun J, Kim J. NLNL: negative learning for noisy labels. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27–November 2, 2019. IEEE; 2019. pp. 101–10.

18. Rizve MN, Duarte K, Rawat YS, Shah M. In defense of pseudo-labeling: an uncertainty-aware pseudo-label selection framework for semi-supervised learning. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021. OpenReview.net; 2021.

19. Sohn K, Berthelot D, Carlini N, Zhang Z, Zhang H, Raffel C, et al. FixMatch: simplifying semi-supervised learning with consistency and confidence. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual; 2020.

20. Cubuk ED, Zoph B, Shlens J, Le Q. RandAugment: practical automated data augmentation with a reduced search space. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual; 2020.

21. Venkateswara H, Eusebio J, Chakraborty S, Panchanathan S. Deep hashing network for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017.

22. Ringwald T, Stiefelhagen R. Adaptiope: a modern benchmark for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); 2021.

23. Peng X, Usman B, Kaushik N, Hoffman J, Wang D, Saenko K. VisDA: the visual domain adaptation challenge; 2017.

24. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016. IEEE Computer Society; 2016. pp. 2818–26.

25. Salimans T, Kingma DP. Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain; 2016. p. 901.
