Can Hierarchical Transformers Learn Facial Geometry?

2023 Jan 13;23(2):929. doi: 10.3390/s23020929

Algorithm 1. Face geometry representation using a hierarchical shifted-window architecture

Input:
$P = \{p_{x}^{1}, p_{x}^{2}, \dots, p_{x}^{n}\}$, where $p_{x}^{i}$ is the $i$-th $4 \times 4$ patch of image $x$.

$E$ = learned linear embedding matrix.

$E_{pos}$ = positional encoding matrix.

MSA, SW-MSA = multi-head self-attention and shifted-window multi-head self-attention.

MLP = multi-layer perceptron.

LN = layer normalization.

Output:
Face Representation Classification
1: for $p \in P$ do ▹ For each patch
2:   $p \leftarrow \text{flatten}(p)$
3:   $p \leftarrow [p_{f}^{1} E; p_{f}^{2} E; \dots; p_{f}^{48} E]$
4:   $p \leftarrow p + E_{pos}$
5: end for
6: $X \leftarrow P$
7: for block pair in transformer blocks do
8:   $X \leftarrow \text{MSA}(\text{LN}(X)) + X$
9:   $X \leftarrow \text{MLP}(\text{LN}(X)) + X$
10:   $X \leftarrow \text{SW-MSA}(\text{LN}(X)) + X$
11:   $X \leftarrow \text{MLP}(\text{LN}(X)) + X$
  If at block pair 1, 2, or 11:
12:   for $x_{1, 1} \dots x_{M, N} \in X$ do
13:     $x_{m, n} \leftarrow \text{merge}(x_{2m, 2n}, x_{2m+1, 2n}, x_{2m, 2n+1}, x_{2m+1, 2n+1})$
14:   end for
15: end for
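
The patch embedding (steps 2 to 4) and the patch merge (step 13) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: step 3 is read as a linear projection of each 48-value flattened patch (4 × 4 pixels × 3 channels) by $E$; `merge` is taken to concatenate the four neighbours and apply a linear projection `W_merge`, as is usual in shifted-window transformers but not stated in the algorithm; the input size of 224 × 224, the embedding width `D = 96`, and the function names are hypothetical. The attention and MLP blocks (steps 8 to 11) are omitted.

```python
import numpy as np

PATCH = 4  # patch side length, from the algorithm's input definition


def patch_partition(image):
    """Steps 1-2: split an (H, W, C) image into flattened PATCH x PATCH patches.

    Returns an array of shape (H/PATCH, W/PATCH, PATCH*PATCH*C).
    """
    h, w, c = image.shape
    assert h % PATCH == 0 and w % PATCH == 0, "image size must be divisible by PATCH"
    grid = image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (H/4, W/4, 4, 4, C)
    return grid.reshape(h // PATCH, w // PATCH, PATCH * PATCH * c)


def embed(patches, E, E_pos):
    """Steps 3-4: project each flattened patch with E, then add E_pos."""
    return patches @ E + E_pos


def merge(x, W_merge):
    """Step 13: concatenate each 2x2 neighbourhood, then project (assumed).

    x has shape (M, N, D); the result has shape (M/2, N/2, W_merge.shape[1]).
    Neighbour order follows the algorithm: (2m, 2n), (2m+1, 2n),
    (2m, 2n+1), (2m+1, 2n+1), with zero-based indices.
    """
    m, n, _ = x.shape
    assert m % 2 == 0 and n % 2 == 0, "token grid must have even side lengths"
    neighbours = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )  # (M/2, N/2, 4D)
    return neighbours @ W_merge


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
D = 96
image = rng.random((224, 224, 3))
E = rng.normal(size=(PATCH * PATCH * 3, D)) * 0.02
E_pos = np.zeros((224 // PATCH, 224 // PATCH, D))
W_merge = rng.normal(size=(4 * D, 2 * D)) * 0.02

tokens = patch_partition(image)  # one 48-value vector per patch
X = embed(tokens, E, E_pos)      # token grid entering the first block pair
X_merged = merge(X, W_merge)     # token grid after the first merge
print(tokens.shape, X.shape, X_merged.shape)
```

Under these assumptions each merge halves the token grid along both axes while widening the feature dimension, which is what gives the representation its hierarchy.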