2SpamH: A Two-Stage Pre-Processing Algorithm for Passively Sensed mHealth Data

Sensors. 2024 Oct 31;24(21):7053. doi: 10.3390/s24217053

Algorithm 1. Pseudocode of the 2SpamH algorithm, which has two stages: (1) prototype selection in the feature space of device use and sensor activity levels, which labels data points as “missing” or “non-missing” with some confidence based on percentile thresholds, and (2) a k-nearest neighbors (KNN) step that labels the non-prototype data points in the feature space based on their proximity to the labeled prototypes. The algorithm returns a “missing” or “non-missing” label for every data point.
2SpamH Algorithm
Input: Sensor activity matrix $W$, Device usage matrix $Z$, Prototype selection percentiles $\{\theta_{\text{lower}}, \theta_{\text{upper}}\}$, Number of nearest neighbors $k$
Output: Missing label matrix $M$
Stage 1: Prototype Selection
1.	Perform PCA on $Z$ and $W$ to obtain the principal components:
	$C^{Z} = \mathrm{PCA}(Z), \quad C^{W} = \mathrm{PCA}(W)$
	where $C^{Z}$ and $C^{W}$ are vectors of length $T$ containing the first principal components of $Z$ and $W$. If ncol($Z$) = 1, then $C^{Z} = Z$; if ncol($W$) = 1, then $C^{W} = W$.
2.	Construct the feature space $F$ as the set of points $f_{t} = (C_{t}^{Z}, C_{t}^{W})$ for each t:
	$F = \{ f_{t} \mid t = 1, \dots, T \}$
	where $f_{t} = (C_{t}^{Z}, C_{t}^{W})$ represents the coordinates of the $t$-th data point in the constructed feature space.
3.	Compute the lower and upper quantiles for $C^{Z}$ and $C^{W}$:
	$q_{\text{lower}}^{Z} = Q(C^{Z}, \theta_{\text{lower}}), \quad q_{\text{upper}}^{Z} = Q(C^{Z}, \theta_{\text{upper}})$
	$q_{\text{lower}}^{W} = Q(C^{W}, \theta_{\text{lower}}), \quad q_{\text{upper}}^{W} = Q(C^{W}, \theta_{\text{upper}})$
4.	Identify the set of missing prototypes in the feature space $F$:
	$P_{\text{missing}} = \{ f_{t} \in F \mid C_{t}^{Z} < q_{\text{lower}}^{Z} \text{ and } C_{t}^{W} < q_{\text{lower}}^{W} \}$
5.	Identify the set of non-missing prototypes in the feature space $F$:
	$P_{\text{non-missing}} = \{ f_{t} \in F \mid C_{t}^{Z} > q_{\text{upper}}^{Z} \text{ and } C_{t}^{W} > q_{\text{upper}}^{W} \}$
6.	For each data point $f_{t} \in F$ :
7.	Assign labels to rows of $M$ based on whether data points fall within the prototype regions:
	$M_{t,:} = \begin{cases} \text{Missing} & \text{if } f_{t} \in P_{\text{missing}} \\ \text{Non-missing} & \text{if } f_{t} \in P_{\text{non-missing}} \\ \text{NA} & \text{otherwise} \end{cases}$
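Stage 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`first_pc`, `quantile`, `select_prototypes`) are hypothetical, the percentiles are assumed to be fractions in [0, 1], $Q$ is assumed to be an empirical quantile with linear interpolation, and the sign of each principal component is assumed to be oriented so that its loadings sum to a non-negative value. Non-missing prototypes are taken here to be the points above the upper quantile on both axes, and the sketch returns one label per time point $t$ rather than a full matrix $M$.

```python
import math

def first_pc(X):
    """Scores of each row of X on the first principal component.

    A single-column matrix is returned unchanged, as in step 1.
    Power iteration is used here only to keep the sketch dependency-free.
    """
    n, p = len(X), len(X[0])
    if p == 1:
        return [row[0] for row in X]
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    cov = [[sum(r[a] * r[b] for r in Xc) / max(n - 1, 1) for b in range(p)]
           for a in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(500):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Assumption: flip the direction so the loadings sum to a non-negative value.
    if sum(v) < 0:
        v = [-x for x in v]
    return [sum(r[j] * v[j] for j in range(p)) for r in Xc]

def quantile(values, theta):
    """Empirical quantile with linear interpolation; theta is a fraction in [0, 1]."""
    s = sorted(values)
    pos = theta * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def select_prototypes(cz, cw, theta_lower, theta_upper):
    """Label each time point as 'missing', 'non-missing', or None (not a prototype)."""
    qz_lo, qz_hi = quantile(cz, theta_lower), quantile(cz, theta_upper)
    qw_lo, qw_hi = quantile(cw, theta_lower), quantile(cw, theta_upper)
    labels = []
    for z, w in zip(cz, cw):
        if z < qz_lo and w < qw_lo:
            labels.append("missing")      # low device use and low sensor activity
        elif z > qz_hi and w > qw_hi:
            labels.append("non-missing")  # assumed: high on both axes
        else:
            labels.append(None)           # left for Stage 2
    return labels
```

A call such as `select_prototypes(first_pc(Z), first_pc(W), 0.2, 0.8)` would label the time points in the lowest and highest fifths of both axes and leave the rest as `None`; the values 0.2 and 0.8 are illustrative only.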
Stage 2: Labeling Unlabeled Data Using KNN
8.	For each unlabeled data point $f_{t'} \in F$ that was not assigned a label in Stage 1:
9.	Implement KNN with $K = k$ and Euclidean distance function $d(f_{t}, f_{t'}) = \sqrt{(C_{t}^{Z} - C_{t'}^{Z})^{2} + (C_{t}^{W} - C_{t'}^{W})^{2}}$ to label the remaining unlabeled data points: $M_{t',:} := \mathrm{KNN}(f_{t'} \mid P_{\text{missing}}, P_{\text{non-missing}}, k)$
Return: Missing label matrix $M$.
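Stage 2 can be sketched in the same spirit. The function name `knn_label` is hypothetical, the input is assumed to be a list of $(C_{t}^{Z}, C_{t}^{W})$ pairs with Stage 1 labels (`None` where no prototype label was assigned), and the vote is assumed to be a simple majority with ties going to "non-missing". The pseudocode does not specify a tie-breaking rule, so that choice is an assumption of this sketch.

```python
import math

def knn_label(points, labels, k):
    """Fill None labels by majority vote among the k nearest labeled prototypes.

    points: list of (cz, cw) coordinates in the feature space F.
    labels: 'missing', 'non-missing', or None for each point (Stage 1 output).
    """
    prototypes = [(p, lab) for p, lab in zip(points, labels) if lab is not None]
    if not prototypes:
        raise ValueError("Stage 1 produced no prototypes to vote with")
    out = list(labels)
    for i, (p, lab) in enumerate(zip(points, labels)):
        if lab is not None:
            continue  # prototypes keep their Stage 1 label
        # Euclidean distance in the two-dimensional feature space
        nearest = sorted(prototypes, key=lambda q: math.dist(p, q[0]))[:k]
        n_missing = sum(1 for _, l in nearest if l == "missing")
        # Assumption: a tie is resolved as "non-missing"
        out[i] = "missing" if 2 * n_missing > len(nearest) else "non-missing"
    return out
```

Using an odd `k` avoids ties altogether, so the tie-breaking assumption only matters for even `k` or when fewer than `k` prototypes are available.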