Input: $D = \{(x_i, Y_i)\}_{i=1}^{N}$ is the multi-label data set, where $q$ is the total number of labels. NSCLC has four stages, so $q = 4$; the labels to be trained are $L = \{l_1, l_2, l_3, l_4\}$; the number of iterations is $T$; and the size of the training data block for each iteration is $m$. Output: the prediction model $F$.

Step 1: For any label $l_j \in L$, the co-occurrence frequency $C(l_j, l_k)$ of the small-sample pathological stage label $l_j$ with each other label $l_k$ is calculated, as shown in Equation (4). The parameters of the model trained on the large-sample case dataset are then selected and saved according to the label $l_k$ with the maximum co-occurrence value, and they serve as the initialization of the small-sample pathological stage prediction model. Here $\delta(\cdot)$ is a binary function over the case samples of NSCLC patients: its value is 1 if $l_j$ and $l_k$ appear in the same case sample, and 0 otherwise, as shown in Equation (5).
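A minimal sketch of the co-occurrence computation in Step 1 is given below, assuming each case's stage labels are stored as a Python set; the function names `co_occurrence` and `best_source_label` are illustrative, not the paper's.

```python
def co_occurrence(label_sets, label_j, label_k):
    """Count the case samples containing both label_j and label_k.

    label_sets: one set of stage labels per NSCLC case sample; the
    binary indicator of Equation (5) is 1 exactly when both labels
    appear in the same case sample.
    """
    return sum(1 for labels in label_sets
               if label_j in labels and label_k in labels)


def best_source_label(label_sets, target, candidates):
    """Return the label co-occurring most often with `target`; its
    saved large-sample model parameters initialize the small-sample
    pathological stage prediction model."""
    return max(candidates,
               key=lambda lk: co_occurrence(label_sets, target, lk))
```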
Step 2: $D$ is split into multiple binary data sets $D_j$ of NSCLC pathological stages using the One-vs-Rest approach, where $D_j$ is the training set of the pathological stage label $l_j$. Then, select the majority pathological stage label $l_k$ that co-occurs most frequently with the pathological stage label $l_j$. Train the large-sample NSCLC pathological stage prediction model $F_k$ on the training dataset $D_k$ of label $l_k$ (see Equation (6)), and save the parameters of $F_k$.
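The One-vs-Rest binarization of Step 2 might look like the following sketch, reusing the per-case label sets from above; treating $l_j$ as the positive (minority) class follows the paper's setup.

```python
def one_vs_rest(samples, label_sets, label_j):
    """Build the binary data set D_j for stage label l_j:
    y = 1 if the case carries l_j (positive/minority class),
    y = 0 otherwise (negative/majority class)."""
    X = list(samples)
    y = [1 if label_j in labels else 0 for labels in label_sets]
    return X, y
```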
Step 3: The parameters of the pathological stage prediction model $F_k$ trained on the large sample of NSCLC cases are read in as the initialization of the model for the pathological stage label $l_j$. The majority-class sample set of $D_j$ is $D_j^-$ and the minority-class sample set is $D_j^+$, with sample sizes $N^-$ and $N^+$, respectively, and a total sample size of $N = N^+ + N^-$. Initialize the sampling probability $p_i$ of each minority-class and majority-class sample, as shown in Equation (7). Since the sum of the sampling probabilities within each of the positive and negative classes is $m/2$, sampling each positive and negative sample according to the method in Step 4 yields, on average, $m/2$ positive and $m/2$ negative samples, so the sample set constructed by sampling is balanced.
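Under the reading that each class's sampling probabilities sum to $m/2$, Equation (7) can be realized as below; the exact functional form is an assumption consistent with the balance argument in Step 3.

```python
def init_sampling_probs(y, m):
    """Give each positive sample probability m / (2 * N_pos) and each
    negative sample m / (2 * N_neg): each class then sums to m/2, so a
    sampled block of expected size m is class-balanced on average."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    return [m / (2 * n_pos) if yi == 1 else m / (2 * n_neg) for yi in y]
```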
Step 4: The positive and negative sample sets are sampled separately based on the sampling probability $p_i$. For any sample $x_i$ with sampling probability $p_i$, a random value $r$ uniformly distributed between 0 and 1 is generated. If $r < p_i$, the sample is added to the new balanced sample set $D_t$: if $x_i$ belongs to the majority class, it is added to the partial majority-class sample set $D_t^-$; conversely, if $x_i$ belongs to the minority class, it is added to the minority-class sample set $D_t^+$, as shown in Equations (8) and (9). For each sample $x_i$, the probability that the randomly generated $r$ is smaller than $p_i$ is exactly $p_i$, so each sample enters the balanced sample set with probability $p_i$, and it is therefore reasonable to control the sampling by updating $p_i$ in this algorithm.

Finally, $D_t^+$ and $D_t^-$ are combined to form the training set $D_t$, as shown in Equation (10).
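Step 4's per-sample Bernoulli draw is sketched below; `random.random()` supplies the uniform value $r$, and the merge at the end corresponds to forming $D_t$ (presumed Equation (10)).

```python
import random

def sample_balanced_block(X, y, probs):
    """Include sample i iff a uniform draw r < p_i (Equations (8)-(9)),
    collecting the minority subset D_t^+ and the partial majority
    subset D_t^-, then merge them into the training block D_t."""
    d_pos, d_neg = [], []
    for xi, yi, pi in zip(X, y, probs):
        if random.random() < pi:
            (d_pos if yi == 1 else d_neg).append((xi, yi))
    return d_pos + d_neg  # D_t = D_t^+ ∪ D_t^-
```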
Step 5: The model is trained on the sampled data set $D_t$ to generate the new model $F_t$, as shown in Equation (11).
Step 6: The probability $P(x_i)$ that the model $F_t$ assigns to each sample $x_i$ of the overall training set being a positive sample is calculated, with $0 \le P(x_i) \le 1$; $P(x_i)$ is the probability that the classifier predicts the sample to belong to the positive class. A larger $P(x_i)$ is better for positive samples, and a smaller $P(x_i)$ is better for negative samples, so $P(x_i)$ can be used to update the sampling probability $p_i$, as shown in Equation (12).
When the model predicts a training sample incorrectly, or correctly but with low confidence, the sampling probability of that sample is increased, which increases the model's focus on it. Conversely, when the model predicts a sample correctly and with high confidence, its sampling probability is relatively reduced, which reduces the model's attention to it. This increases the model's ability to distinguish positive from negative samples, improving its prediction accuracy and confidence. Therefore, for a positive sample, the closer $P(x_i)$ is to 0, i.e., the prediction is incorrect or correct but with low confidence, the more the updated sampling probability increases. For a negative sample, the closer $P(x_i)$ is to 1, i.e., the prediction is incorrect or correct but with low confidence, the more the updated sampling probability increases. The sampling probabilities of the positive samples are regularized, where $Z^+$ is the sum of all positive-sample sampling probabilities, as shown in Equations (13) and (14).

Similarly, the sampling probabilities of the negative samples are regularized, where $Z^-$ is the sum of the sampling probabilities of all negative samples, as shown in Equations (15) and (16), respectively.
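One plausible realization of Equations (12)-(16), assuming the raw weight scales with $1 - P(x_i)$ for positives and with $P(x_i)$ for negatives before per-class regularization back to a sum of $m/2$; the paper's exact expressions may differ.

```python
def update_sampling_probs(probs, y, P, m):
    """Raise p_i for samples predicted wrongly or with low confidence:
    positives are reweighted by 1 - P(x_i), negatives by P(x_i)
    (Equation (12)); each class is then regularized by its sum Z so
    the class totals return to m/2 (Equations (13)-(16))."""
    raw = [pi * ((1 - Pi) if yi == 1 else Pi)
           for pi, yi, Pi in zip(probs, y, P)]
    z_pos = sum(r for r, yi in zip(raw, y) if yi == 1)  # Z^+
    z_neg = sum(r for r, yi in zip(raw, y) if yi == 0)  # Z^-
    return [(m / 2) * r / (z_pos if yi == 1 else z_neg)
            for r, yi in zip(raw, y)]
```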
Step 7: Determine whether the specified number of iterations $T$ has been reached; if so, return the final classifier. Otherwise, continue with Steps 4 to 7 using the updated sampling probabilities.
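Putting the pieces together, the per-label training loop could be sketched as follows; `init_model_from`, `train`, and `predict_proba` are hypothetical stand-ins for the transfer initialization (Step 3), the model fitting of Equation (11), and the classifier's positive-class probability output.

```python
def train_stage_model(X, y, params_k, m, T):
    """Steps 3-7: warm-start from the most co-occurring large-sample
    model, then alternate balanced sampling, training, and sampling-
    probability updates for T iterations."""
    model = init_model_from(params_k)   # hypothetical: load F_k's parameters
    probs = init_sampling_probs(y, m)   # Equation (7)
    for _ in range(T):                  # Step 7 loop
        block = sample_balanced_block(X, y, probs)      # Step 4
        model = train(model, block)                     # hypothetical fit, Eq. (11)
        P = predict_proba(model, X)                     # hypothetical P(x_i), Step 6
        probs = update_sampling_probs(probs, y, P, m)   # Equations (12)-(16)
    return model
```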