ViT-DCNN: Vision Transformer with Deformable CNN Model for Lung and Colon Cancer Detection

2025 Sep 15;17(18):3005. doi: 10.3390/cancers17183005

Algorithm 1: ViT-DCNN (Vision Transformer with Deformable Convolution) for Lung and Colon Cancer Classification

1: Input: D = {(X_i, Y_i)}, α, T, B, θ_ViT, θ_DConv, N

2: Initialize: θ_ViT, θ_DConv

3: for epoch = 1 to T do

4: for batch = 1 to N/B do

5: Extract mini-batch:

6: (X_{batch}, Y_{batch}) \leftarrow \{(X_{i}, Y_{i})\}_{i \in batch}

7: Apply Data Augmentation:

8: (X_{aug}, Y_{aug}) \leftarrow \mathrm{Augment}(X_{batch}, Y_{batch})

9: Vision Transformer (ViT) Forward Pass:

10: Patch Embedding:

11: P_{i} = \mathrm{Flatten}(X_{i}) \cdot W_{emb} + b_{emb}

12: Positional Encoding:

13: Z_i = P_i + PE_i

14: Multi-Head Self-Attention:

15: \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V

16: Feed-Forward Network:

17: FFN(Z) = max(0, ZW₁ + b₁) W₂ + b₂

18: Deformable Convolution Forward Pass:

19: Deformable Convolution:

20: y_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} x_{i + m + ∆m_{ij},\; j + n + ∆n_{ij}} \cdot w_{mn}

21: Offset Learning:

22: [∆m_{ij}, ∆n_{ij}] = \mathrm{Conv}(F_{ij}), \quad F_{ij} \in \mathbb{R}^{C}

23: Spatial Attention:

24: F_{refined} = F_{ij} \cdot A_{ij}

25: Hierarchical Feature Fusion (HFF):

26: Concatenate Vision Transformer and Deformable CNN Features:

27: F_concat = concat(F_ViT, F_DConv)

28: Squeeze-and-Excitation (SE) Block:

29: s = σ(\mathrm{MLP}(z))

30: Refined Feature Map:

31: F_{se} = F_{concat} \cdot s

32: Prediction and Softmax Activation:

33: Global Average Pooling:

34: z = \mathrm{GAP}(F_{concat}) = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{concat}(i, j)

35: Softmax Layer:

36: P_{c} = \mathrm{Softmax}(W \cdot z + b)

37: Predicted Class:

38: Y' \leftarrow \mathrm{argmax}_{c}\, P_{c}

39: Compute Loss:

40: Cross-Entropy Loss:

41: L_{cross} = -\sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(y'_{ij})

42: Gradient Computation:

43: \nabla_{θ_{ViT}} L_{cross}, \quad \nabla_{θ_{DConv}} L_{cross}

44: Parameter Update (Using AdamW optimizer):

45: Update the Vision Transformer parameters:

46: θ_{ViT} \leftarrow θ_{ViT} - α \cdot \nabla_{θ_{ViT}} L_{cross}

47: Update the Deformable CNN parameters:

48: θ_{DConv} \leftarrow θ_{DConv} - α \cdot \nabla_{θ_{DConv}} L_{cross}

49: end for

50: end for

51: Output:

52: Trained ViT-Deformable CNN model with updated parameters θ_ViT and θ_DConv
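
The attention and feed-forward formulas in steps 15 and 17 of Algorithm 1 can be illustrated with a small pure-Python sketch. This is not the authors' implementation: it handles a single attention head, sets Q = K = V to the token matrix (no learned projections), and the helper names (matmul, attention, ffn) and the toy input are assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Step 15 for one head: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def ffn(z, w1, b1, w2, b2):
    """Step 17: max(0, Z W1 + b1) W2 + b2, applied to each token row."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(z, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

# Toy input (hypothetical): 3 tokens with embedding width 2.
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(z, z, z)  # Q = K = V = Z, no learned projections
```

Each row of `weights` sums to 1, so every output token is a convex combination of the value rows. A full ViT block would add learned Q, K, V projections, several heads, residual connections and layer normalization, none of which are shown here.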
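
The deformable sampling in step 20 of Algorithm 1 reads the input at positions shifted by learned offsets, which are generally fractional. The sketch below is a single-channel, pure-Python illustration under stated assumptions: fractional positions are resolved by bilinear interpolation (the usual choice for deformable convolution, not spelled out in the algorithm), kernel indices are zero-based, positions outside the input read as zero, and one offset pair is used per output location, as the formula is written. The function names and toy values are hypothetical.

```python
import math

def bilinear(x, r, c):
    """Sample the 2-D list x at fractional (r, c); positions outside x read as 0."""
    r0, c0 = math.floor(r), math.floor(c)
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < len(x) and 0 <= cc < len(x[0]):
                weight = (1 - abs(r - rr)) * (1 - abs(c - cc))
                value += weight * x[rr][cc]
    return value

def deform_conv2d(x, w, dm, dn):
    """Step 20 for one channel, with zero-based kernel indices and no padding.

    y[i][j] = sum over (m, n) of w[m][n] * x[i + m + dm[i][j], j + n + dn[i][j]]
    """
    kh, kw = len(w), len(w[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    y = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for m in range(kh):
                for n in range(kw):
                    y[i][j] += w[m][n] * bilinear(x, i + m + dm[i][j], j + n + dn[i][j])
    return y

# Toy input, kernel and offsets (all hypothetical values).
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
half = [[0.5, 0.5], [0.5, 0.5]]
y_plain = deform_conv2d(x, w, zero, zero)  # zero offsets: ordinary convolution
y_shift = deform_conv2d(x, w, zero, half)  # sample half a pixel to the right
```

With zero offsets the routine reduces to an ordinary sliding-window convolution. In step 22 the offsets come from a convolution over the feature map; here they are passed in directly to keep the example short.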
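
The Squeeze-and-Excitation gate, global average pooling, softmax and cross-entropy steps of Algorithm 1 (steps 29 to 41) can be traced on a tiny feature map. The sketch below makes several assumptions that the algorithm does not state: the vector z fed to the MLP in step 29 is taken to be the pooled vector from step 34, the MLP has two layers with a ReLU in between (as in the standard SE block), and feature maps are stored channel-first as nested lists. All weights and inputs are hypothetical.

```python
import math

def gap(fmap):
    """Step 34: average each channel over its H x W positions (fmap[c][i][j])."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def linear(vec, w, b):
    """Return W vec + b, with w stored as a list of output rows."""
    return [sum(wi * vi for wi, vi in zip(row, vec)) + bi for row, bi in zip(w, b)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def se_gate(z, w1, b1, w2, b2):
    """Step 29: s = sigmoid(MLP(z)), assuming a two-layer MLP with ReLU."""
    hidden = [max(0.0, h) for h in linear(z, w1, b1)]
    return [sigmoid(o) for o in linear(hidden, w2, b2)]

def rescale(fmap, s):
    """Step 31: multiply every value in channel c by its gate s[c]."""
    return [[[v * sc for v in row] for row in ch] for ch, sc in zip(fmap, s)]

def softmax(logits):
    """Step 36: numerically stable softmax over class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_pred):
    """Step 41 for one sample: -sum_j y_j log(y'_j), with y_true one-hot."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# Toy feature map (hypothetical): 2 channels of size 2 x 2.
f_concat = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 2.0], [2.0, 4.0]]]
z = gap(f_concat)                                               # per-channel mean
s = se_gate(z, [[0.0, 0.0]], [0.0], [[0.0], [0.0]], [0.0, 0.0]) # all-zero MLP
f_se = rescale(f_concat, s)
probs = softmax([0.0, 0.0])                                     # two equal logits
loss = cross_entropy([1.0, 0.0], probs)
```

With an all-zero MLP every gate equals 0.5, so each channel is halved, and two equal logits give a cross-entropy of log 2. Trained weights would produce channel-specific gates and non-uniform class probabilities.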
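
Step 44 of Algorithm 1 names the AdamW optimizer, while steps 46 and 48 are written as a plain gradient step. The sketch below contrasts the two for a small parameter vector. The AdamW routine follows the standard formulation with bias-corrected moment estimates and decoupled weight decay; the default hyperparameters (beta1, beta2, eps, weight_decay) and the toy values are illustrative assumptions, not settings reported in the source.

```python
import math

def gradient_step(theta, grad, lr):
    """Update as written in steps 46 and 48: theta <- theta - lr * grad."""
    return [p - lr * g for p, g in zip(theta, grad)]

def adamw_step(theta, grad, state, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update; state holds the step count and both moment estimates."""
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for k, (p, g) in enumerate(zip(theta, grad)):
        # Exponential moving averages of the gradient and its square.
        state["m"][k] = beta1 * state["m"][k] + (1 - beta1) * g
        state["v"][k] = beta2 * state["v"][k] + (1 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = state["m"][k] / (1 - beta1 ** t)
        v_hat = state["v"][k] / (1 - beta2 ** t)
        # Decoupled weight decay, then the adaptive gradient step.
        p = p - lr * weight_decay * p
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(p)
    return new_theta

# Toy parameters and gradients (hypothetical values).
theta = [1.0, -2.0]
grad = [0.5, -4.0]
plain = gradient_step(theta, grad, lr=0.1)
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
adaptive = adamw_step(theta, grad, state, lr=0.1, weight_decay=0.0)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so with weight decay disabled AdamW moves every parameter by roughly the learning rate regardless of gradient magnitude, whereas the plain step scales with the gradient.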