Algorithm 1: The dynamic sampling technique training algorithm
Input: $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, $x_i \in V^n$, $y_i \subseteq \{l_1, l_2, \ldots, l_n\}$ is the multi-label dataset, where $n$ is the total number of labels; since NSCLC has four stages, $n = 4$ and the labels to be trained are $l_i \in \{I, II, III, IV\}$. The number of iterations is $E$ and the size of the training data block in each iteration is $M$.
Output: The prediction model $H_i(x)$.
Step 1: For each label $l_j$, the co-occurrence frequency $F(l_i, l_j)$ between the small-sample pathological stage label $l_i$ and $l_j$ is calculated, as shown in Equation (4). The parameters of the model trained on the large-sample case dataset are then selected and saved according to the label $l_{j_{max}}$ with the maximum $F(l_i, l_j)$ value, and they serve as the initialization of the small-sample pathological stage prediction model. Here $Q(\{l_i, l_j\} \subseteq y_n)$ is a binary indicator over the case samples of NSCLC patients: it equals 1 if $l_i$ and $l_j$ appear in the same case sample and 0 otherwise, as shown in Equation (5).
$F(l_i, l_j) = \sum_{(x_n, y_n) \in S} Q(\{l_i, l_j\} \subseteq y_n)$ (4)
$Q(\{l_i, l_j\} \subseteq y_n) = \begin{cases} 1, & l_i \in y_n \text{ and } l_j \in y_n \\ 0, & l_i \notin y_n \text{ or } l_j \notin y_n \end{cases}$ (5)
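The counting in Equations (4) and (5) can be sketched compactly. The following is a minimal sketch, assuming each label set $y_n$ is available as a Python set of stage labels; the function name co_occurrence and the default label tuple are illustrative, not taken from the paper.

```python
from collections import Counter
from itertools import combinations

def co_occurrence(samples, labels=("I", "II", "III", "IV")):
    """samples: iterable of (x, y) pairs, where y is the set of stage labels of one case.
    Returns a Counter mapping frozenset({l_i, l_j}) to F(l_i, l_j) from Equation (4)."""
    freq = Counter()
    for _, y in samples:
        for l_i, l_j in combinations(labels, 2):
            # Q({l_i, l_j} ⊆ y_n) = 1 only when both labels occur in the same case
            if l_i in y and l_j in y:
                freq[frozenset((l_i, l_j))] += 1
    return freq
```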
Step 2: $S$ is split into multiple binary datasets $\{S_1, S_2, \ldots, S_k\}$ of NSCLC pathological stages using the One-vs-Rest approach, where $S_i$ is the training set for the pathological stage label $l_i$. Then, the majority pathological stage label $l_k$ that co-occurs most frequently with the label $l_i$ is selected. The large-sample NSCLC pathological stage prediction model $H_k(x)$ is trained on the dataset $S_k$ with label $l_k$ (see Equation (6)), and the parameters of $H_k(x)$ are saved.
$H_k(x) = T(S_k)$ (6)
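A sketch of Step 2 under the same assumptions is given below: one_vs_rest_split builds one binary dataset per stage label, and pick_pretrain_label selects the majority label $l_k$ with the largest co-occurrence $F(l_i, l_k)$. Both function names, and the freq argument produced by the co_occurrence sketch above, are illustrative placeholders; the base learner $T$ of Equation (6) is left to whatever model the authors use.

```python
def one_vs_rest_split(samples, labels):
    """Return {l: [(x, 1 if l in y else 0), ...]}: one binary dataset S_l per label."""
    return {l: [(x, int(l in y)) for x, y in samples] for l in labels}

def pick_pretrain_label(l_i, freq, labels):
    """Choose the majority label l_k (k != i) with the maximum co-occurrence F(l_i, l_k)."""
    return max((l for l in labels if l != l_i),
               key=lambda l: freq[frozenset((l_i, l))])
```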
Step 3: The parameters of the large-sample NSCLC pathological stage prediction model $H_k(x)$ are loaded as the initialized model $H_{i,1}(x)$ for the pathological stage label $l_i$. The majority-class sample set of $l_i$ is $S_{i,neg}$ and the minority-class sample set is $S_{i,pos}$, with sample sizes $N_{neg}$ and $N_{pos}$, respectively, and a total sample size of $N$. The sampling probabilities $P_{i,1} = \{P_{i,1}(1), P_{i,1}(2), \ldots, P_{i,1}(N)\}$ are initialized as shown in Equation (7). Since the sampling probabilities of the positive samples and of the negative samples each sum to $M/2$, sampling every sample according to the method in Step 4 yields, on average, $M/2$ positive and $M/2$ negative samples, so the constructed training block is balanced.
$P_{i,1}(j) = \begin{cases} \frac{M}{2 \times N_{pos}}, & l_i \in y_j \\ \frac{M}{2 \times N_{neg}}, & l_i \notin y_j \end{cases}$ (7)
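The initialization in Equation (7) amounts to giving every positive sample probability $M/(2 N_{pos})$ and every negative sample probability $M/(2 N_{neg})$, so each class contributes an expected $M/2$ samples per block. A minimal sketch, assuming the binary labels of the dataset $S_i$ are given as a list of 0/1 flags:

```python
def init_probabilities(binary_labels, M):
    """binary_labels: list of 0/1 flags (1 means l_i in y_j). Returns P_{i,1} as a list."""
    n_pos = sum(binary_labels)
    n_neg = len(binary_labels) - n_pos
    return [M / (2 * n_pos) if b else M / (2 * n_neg) for b in binary_labels]
```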
Step 4: The positive and negative sample sets are sampled separately according to the sampling probabilities $P_{i,t}$. For each sample $(x_j, y_j)$ with sampling probability $P_{i,t}(j)$, a random value $R(x_j)$ uniformly distributed between 0 and 1 is generated by $R(\cdot)$. If $R(x_j) \le P_{i,t}(j)$, the sample $(x_j, y_j)$ is added to the new balanced sample set $S_{i,train}$: if $l_i \notin y_j$, it joins the partial majority-class sample set $S_{i,neg}^{sel}$; otherwise, if $l_i \in y_j$, it joins the minority-class sample set $S_{i,pos}^{sel}$, as shown in Equations (8) and (9). Because $R(x_j)$ is uniform on $[0,1]$, each sample $(x_j, y_j)$ is selected with probability exactly $P_{i,t}(j)$, i.e., the probability that the randomly generated number $R(x_j)$ does not exceed $P_{i,t}(j)$. It is therefore reasonable to control the composition of the balanced sample set by updating the sampling probabilities in this way.
$S_{i,neg}^{sel} = \{(x_j, y_j) \,|\, R(x_j) \le P_{i,t}(j),\ (x_j, y_j) \in S_{i,neg}\}$ (8)
$S_{i,pos}^{sel} = \{(x_j, y_j) \,|\, R(x_j) \le P_{i,t}(j),\ (x_j, y_j) \in S_{i,pos}\}$ (9)
Finally, $S_{i,neg}^{sel}$ and $S_{i,pos}^{sel}$ are combined to form the training set $S_{i,train}$:
$S_{i,train} = S_{i,neg}^{sel} \cup S_{i,pos}^{sel}$ (10)
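Step 4 can be read as one independent Bernoulli draw per sample. The sketch below, under the same list-based assumptions as above, compares each $P_{i,t}(j)$ with a uniform random number and merges the selected positives and negatives into $S_{i,train}$ (Equations (8)-(10)); the fixed random seed is only for reproducibility of the sketch.

```python
import random

def sample_training_block(samples, binary_labels, probs, rng=random.Random(0)):
    """Draw each sample with probability P_{i,t}(j) and return the balanced block S_{i,train}."""
    pos_sel, neg_sel = [], []
    for (x, _), b, p in zip(samples, binary_labels, probs):
        if rng.random() <= p:                    # R(x_j) <= P_{i,t}(j)
            (pos_sel if b else neg_sel).append((x, b))
    return pos_sel + neg_sel                     # S_sel_pos ∪ S_sel_neg
```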
Step 5: $H_{i,t-1}(x)$ is trained on the sampled data $S_{i,train}$ to generate the new model $H_{i,t}(x)$, as shown in Equation (11).
$H_{i,t}(x) = T(H_{i,t-1}(x); S_{i,train})$ (11)
Step 6: Let $\eta_{i,t}$ denote the probabilities predicted by the model $H_{i,t}(x)$ over the whole training set, where $\eta_{i,t}(j) \in [0,1]$ is the probability that the classifier assigns to sample $(x_j, y_j)$ being a positive sample. For positive samples a larger $\eta_{i,t}(j)$ is better, while for negative samples a smaller $\eta_{i,t}(j)$ is better. $\eta_{i,t}$ is then used to update the sampling probabilities $P_{i,t+1} = \{P_{i,t+1}(1), P_{i,t+1}(2), \ldots, P_{i,t+1}(N)\}$, as shown in Equation (12).
$P_{i,t+1}(j) = \begin{cases} P_{i,t}(j) \exp(1 - \eta_{i,t}(j)), & l_i \in y_j \\ P_{i,t}(j) \exp(\eta_{i,t}(j)), & l_i \notin y_j \end{cases}$ (12)
When the model $H_{i,t}(x)$ predicts a training sample incorrectly, or correctly but with low confidence, the sampling probability of that sample is increased, which increases the model's focus on it. Conversely, when the model predicts a sample correctly and with high confidence, its sampling probability is relatively reduced, which reduces the model's attention to it. This sharpens the model's ability to distinguish positive from negative samples and thus improves its prediction accuracy and confidence. Concretely, for a positive sample $(x_j, y_j)$, the closer $\eta_{i,t}(j)$ is to 0 (misclassified, or correctly classified with low confidence), the more its sampling probability increases after the update. For a negative sample, the closer $\eta_{i,t}(j)$ is to 1 (misclassified, or correctly classified with low confidence), the more its sampling probability increases after the update.
The sampling probabilities of the positive samples are normalized, where $Sum_{t,pos}$ is the sum of all positive sample sampling probabilities, as shown in Equations (13) and (14).
$P_{i,t+1}(j) = \frac{M \times P_{i,t+1}(j)}{2 \times Sum_{t,pos}}$ (13)
$Sum_{t,pos} = \sum_{(x_n, y_n) \in S_{i,pos}} P_{i,t+1}(n)$ (14)
Similarly, the sampling probabilities of the negative samples are normalized, where $Sum_{t,neg}$ is the sum of the sampling probabilities of all negative samples, as shown in Equations (15) and (16), respectively.
$P_{i,t+1}(j) = \frac{M \times P_{i,t+1}(j)}{2 \times Sum_{t,neg}}$ (15)
$Sum_{t,neg} = \sum_{(x_n, y_n) \in S_{i,neg}} P_{i,t+1}(n)$ (16)
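The exponential update of Equation (12) and the per-class rescaling of Equations (13)-(16) can be sketched together. Again the list-based representation and the function name are assumptions for illustration, not the authors' code.

```python
import math

def update_probabilities(probs, binary_labels, eta, M):
    """eta[j] is the predicted probability that sample j is positive, i.e. η_{i,t}(j).
    Returns P_{i,t+1}: exponential re-weighting followed by per-class normalization."""
    new = [p * math.exp(1 - e) if b else p * math.exp(e)
           for p, b, e in zip(probs, binary_labels, eta)]          # Eq. (12)
    sum_pos = sum(p for p, b in zip(new, binary_labels) if b)      # Sum_{t,pos}, Eq. (14)
    sum_neg = sum(p for p, b in zip(new, binary_labels) if not b)  # Sum_{t,neg}, Eq. (16)
    return [M * p / (2 * sum_pos) if b else M * p / (2 * sum_neg)  # Eqs. (13) and (15)
            for p, b in zip(new, binary_labels)]
```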
Step 7: Determine whether the specified number of iterations $E$ has been reached; if so, return the final classifier $H_i(x)$; otherwise, continue with Steps 4 to 7 using the new sampling probabilities $P_{i,t+1}$.
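Tying Steps 3-7 together, a compact loop sketch using the helper sketches above follows. Here fine_tune is a hypothetical placeholder for the training operator $T(H_{i,t-1}(x); S_{i,train})$ of Equation (11), assumed to return the updated model together with its predicted positive-class probabilities $\eta_{i,t}$ on the full training set.

```python
def train_minority_model(init_model, samples, binary_labels, M, E, fine_tune):
    """samples/binary_labels describe the binary dataset S_i for the minority label l_i."""
    probs = init_probabilities(binary_labels, M)                       # Step 3, Eq. (7)
    model = init_model                                                 # H_{i,1}(x), pre-trained on S_k
    for _ in range(E):
        block = sample_training_block(samples, binary_labels, probs)  # Step 4, Eqs. (8)-(10)
        model, eta = fine_tune(model, block)                           # Step 5, Eq. (11)
        probs = update_probabilities(probs, binary_labels, eta, M)     # Step 6, Eqs. (12)-(16)
    return model                                                       # Step 7: final H_i(x)
```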