Abstract
The rapid growth of mobile payment platforms has enhanced transactional convenience but also introduced critical security challenges, notably shoulder spoofing. This attack occurs when unauthorized individuals or surveillance devices visually intercept sensitive information, such as Mobile Personal Identification Numbers (MPINs), during payment. Existing security mechanisms, including PIN masking and screen dimming, fail to detect environmental threats or provide adaptive responses, leaving users vulnerable in public spaces. To address this gap, we propose a novel solution titled Gaze-Aware Threat Detection with Contextual Scene Analysis (GATCSA). GATCSA leverages the front-facing camera and on-device computer vision algorithms to monitor the surroundings during mobile transactions. The system identifies suspicious behavior such as gaze fixation by nearby individuals or the presence of surveillance equipment targeting the mobile screen. A risk evaluation module considers proximity, gaze direction, and focus duration to classify threat levels in real time. Upon detection, the system provides users with contextual alerts and actionable suggestions, such as changing the device angle, enabling a privacy screen, or halting the transaction, to safeguard against unauthorized visual access. Unlike traditional methods, GATCSA processes all data locally to ensure user privacy and operates efficiently on resource-constrained mobile devices. Preliminary testing in varied real-world conditions, differing in lighting, crowd density, and device orientation, demonstrates high accuracy in threat identification and user responsiveness. By integrating gaze tracking with environmental awareness, GATCSA represents a significant advancement in mobile payment security, enhancing user trust and privacy during digital transactions.
Keywords: Computer vision, Gaze detection, Mobile payment security, Real-Time threat analysis, Shoulder spoofing
Subject terms: Engineering, Mathematics and computing
Introduction
As financial transactions rapidly go digital, mobile payment platforms have entered everyday life. Consumers now rely on mobile devices for easy and safe monetary payments, from peer-to-peer money transfers to contactless payments in retail stores1. Although this shift toward mobile-based transactions has been a boon for convenience and accessibility, it has also introduced risks, especially in public and semi-public areas where users input sensitive data such as MPINs2. Shoulder spoofing, a form of visual snooping in which malicious actors or surveillance devices capture confidential transaction details by observing the user's screen, is among the least recognised yet most potent risks in this context3. Traditional security measures, such as dimming the screen, partially masking inputs, or using biometric verification, offer little protection against such physical surveillance attacks4. These measures mostly secure information within the application or device and were not designed to account for external threats in the user's surroundings5. Moreover, users often do not suspect that they are under visual surveillance, as shoulder spoofing may be conducted covertly, either by directly watching the user or through remote surveillance cameras6. This contextual blindness undermines the efficacy of mobile payment security and user trust7.
This paper introduces a novel, preventative solution to this problem, referred to as GATCSA. To further secure mobile payments, the platform uses the phone's front-facing camera to continuously monitor the surroundings during a transaction. With the help of advanced computer vision and gaze detection techniques, GATCSA identifies potential threats such as people looking at the screen or security cameras pointed at the phone. Unlike existing passive solutions, GATCSA operates in real time, reporting suspicious behavior to the user and proposing countermeasures as soon as it occurs. GATCSA is novel in that it integrates gaze tracking with contextual scene analysis. It does not merely detect the presence of faces; it also analyzes gaze direction, distance, and gaze duration to determine the threat level. It further identifies fixed threats in the environment, such as CCTV cameras or recorders, using object detection. Once a potential risk is detected, its severity is assessed from parameters such as proximity, viewing angle, number of people, and environmental conditions such as lighting and crowding.
To preserve usability and privacy, GATCSA is built around local processing: all data is processed on the device, and no video or personal data is transferred to remote servers. For mobile performance, the system relies on lightweight neural network models to avoid heavy energy and resource usage. Users are notified only when a tangible threat is detected, with recommendations such as changing the phone's position, enabling a screen filter, or suspending the transaction. This work aims to fill the existing gap in mobile payment security with a real-time, intelligent threat detection mechanism targeted at real-life scenarios. The solution is evaluated under varied environmental conditions to ensure reliability, adaptability, and effective response. By transforming mobile devices into context-aware systems, GATCSA also raises user awareness and builds trust in digital economic ecosystems. Ultimately, this work enables the development of smarter, safer mobile payment systems in the face of evolving physical-world threats.
Research motivation
The motivation for this work is the growing demand for secure mobile payment transactions in crowded places, where consumers are increasingly vulnerable to shoulder spoofing attacks. Despite advances in digital encryption and in-app authentication, mobile payment systems still cannot identify and counter real-world visual threats, such as people staring at a screen or video cameras capturing sensitive information. This situational-awareness gap between the digital security layer and the physical domain poses a severe risk to user privacy and trust. There is therefore an urgent need for a context-aware, real-time security mechanism that not only senses possible visual threats but also gives users instant guidance to prevent data leakage. The proposed GATCSA system addresses this need by integrating computer vision, gaze detection, and scene analysis to enhance the end-to-end security of mobile payment transactions in uncertain and dynamic settings.
Research significance
The significance of this research lies in bridging digital security and physical threat detection in the mobile payment setting. With the development of GATCSA, a real-time, gaze-aware threat detection solution, the work addresses a critical vulnerability that is often overlooked by existing security systems: the visual channel of eavesdropping through shoulder spoofing. Compared with current in-app or network-only security measures, GATCSA adds a prudent, preventive security layer that raises transaction-time awareness of environmental security threats. This contribution not only makes mobile payments safer but also sets a new standard for contextual scene perception on personal devices, ultimately promoting the adoption of safe mobile financial technologies, user trust, and privacy preservation.
Key contribution
Introduced GATCSA, a novel real-time method for identifying visual threats such as shoulder spoofing via gaze and scene analysis from the front camera.
Devised a situational risk assessment model that uses gaze direction, proximity, and gaze duration to classify and rank threat levels.
Created a lightweight, privacy-preserving architecture that performs all processing directly on mobile devices without storing or transmitting user data.
Integrated proactive user notification mechanisms that propose real-time countermeasures such as adjusting the screen or temporarily freezing the transaction.
Verified the system in varied real-world settings, showing responsiveness under varying lighting conditions, crowd densities, and device types.
Organization of the paper
The rest of this paper is structured as follows: Sect. “Related works” reviews related work on mobile payment security and visual threat detection methods, pointing out existing shortcomings. Section “GATCSA: Gaze-Aware threat detection with contextual scene analysis” outlines the proposed methodology, including the GATCSA system architecture, data processing, threat classification, and user notification mechanisms. The subsequent sections present the experimental results, system performance, and practical applicability, and the final section concludes the paper with the major findings and discusses future directions for refinement and generalization of the proposed scheme.
Related works
Quincozes et al.8 explain that, with the growing popularity of validating users through mobile applications, interaction with digital platforms has changed notably, opening new possibilities for positive user experience (UX). Although smartphones are convenient because they require minimal interaction to access, this transition carries significant security risks. Static QR codes (based on fixed user data, such as national identification numbers) are used in many organizations for physical authentication operations, including access control at turnstiles. However, the major security risk of static data is that it can easily be compromised. To counter this, One-Time Authentication Codes (OTACs) have emerged as a solution. Although promising, the absence of an integrated framework for implementing OTACs in physical authentication scenarios has led to inconsistent UX across APIs and persistent security issues. To mitigate this, the authors introduce Auth4App, a complete protocol suite intended to support secure identification and authentication using mobile applications. Auth4App has two major protocols: identification and generation of OTACs. The authors illustrate the flexibility and versatility of Auth4App in three real-life applications: a mobile-only deployment, a mobile-turnstile integrated scenario, and a FIDO2-interoperable implementation. To establish the security strength of the protocols, they carry out stringent analysis using automated verification systems and formal proofs, supporting the robustness of the Auth4App framework.
Traditional PINs and passwords have proved susceptible to shoulder surfing and other observational attacks. Many graphical password systems have been proposed in their defense, in the spirit of usable security. Though these techniques may claim resistance to such attacks, they commonly undermine usability by imposing additional burdens, e.g., the need for second devices (phones/PCs), headsets, vibration feedback, or complex gestures. Given these limitations, Ahmad et al.9 introduced PassNum, a grid-based graphical PIN authentication (GPA) scheme proposed to replace traditional PINs on any type of device. PassNum attempts a practical compromise among usability, deployability, and security. It is inexpensive, extremely easy to use, suitable for a wide age range (children to the elderly), secure (high entropy), and designed for both low-sensitivity contexts (social media) and highly sensitive ones (banking apps). In a user study involving 32 participants, PassNum achieved 98% authentication accuracy with an average login time of 10 s and was shown to be 100% memorable over time. Significantly, the fourth variant of PassNum was found to be fully resistant (100%) against repeated shoulder surfing sessions, even when attackers had recordings of three successive login attempts. Qualitative survey responses further supported the system's potential, with participants suggesting that PassNum could be a powerful and user-accepted alternative to traditional PINs and passwords.
As cameras have become ubiquitous, the danger of video-based attacks has increased, compounding the long-standing problem of shoulder surfing. Although two-factor authentication (2FA) has been implemented by many organizations to improve their security posture, it can still be exploited: attackers who acquire usernames and passwords through video recordings or shoulder surfing can run credential stuffing attacks, since users tend to reuse passwords across platforms. Cue-based authentication methods provide high resistance against shoulder surfers by using oblique challenges, but they remain vulnerable to video attacks in which both the cue and the response are observable. In response to these weaknesses, Yang and Kong10 presented Cue-based Two-Factor Authentication (Cue-2FA), a new technique that outclasses conventional cue-based techniques by decoupling cue display from user input. This organizational difference conceals the immediate association between cues and authentication responses. Two user studies evaluated the usability and security of Cue-2FA against standard Time-based One-Time Password 2FA (TOTP-2FA). Findings showed that Cue-2FA was more usable and offered enhanced protection against shoulder surfing. However, the research also revealed that Cue-2FA did not perform significantly better than TOTP-2FA at preventing video attacks when both the response and cue phases were recorded. To address this disadvantage, the authors suggested that misleading operations, i.e., intentional decoys in the response input, would significantly increase Cue-2FA's resistance to video attacks.
2D face presentation attacks are among the most common and dangerous challenges to the security of facial recognition. As encouraging as the results of RGB-based face anti-spoofing (FAS) models have been in recognizing such attacks, they consistently face overfitting problems and fail to generalize reliably to unseen settings. To achieve more robustness, recent efforts have sought additional modalities for detecting face liveness, such as depth imaging and near-infrared imaging. These techniques, however, normally require expensive sensors and specialized equipment, and consequently have limited practical applicability. To accommodate these drawbacks, Kong et al.11 present Echo-FAS, a new and inexpensive face anti-spoofing system that exploits the acoustic channel as a liveness cue. Unlike models that rely on visual signals, Echo-FAS uses designed acoustic probes to examine facial responses. Building on this, the authors create a large-scale, highly diverse, high-quality acoustics-based FAS dataset called Echo-Spoof. This dataset allows them to construct a two-branch architecture that extracts global and local frequency descriptors of acoustic signals and successfully distinguishes between live and spoofed faces. The three major strengths of Echo-FAS are: (1) it is implemented with a simple speaker and microphone, so it avoids the high costs and special requirements of specialized hardware; (2) it recovers 3D geometric details of face shape, resulting in impressive anti-spoofing capabilities; and (3) it can be integrated with compatible RGB-based FAS models to overcome their overfitting drawback, contributing positively to overall performance and robustness. This work opens a new line of research toward on-device, efficient, multimodal face anti-spoofing technologies, especially those suited to mobile-based systems with limited sensors.
Although mobile authentication and face anti-spoofing methods have seen significant breakthroughs, many issues remain unresolved in the available literature. Most existing approaches, including those based on static QR codes and legacy PIN/password systems, have major limitations with respect to data breaches and susceptibility to visual or video attacks12. Although graphical passwords and cue-based authentication were intended to increase security, they usually sacrifice usability or remain ineffective against advanced video surveillance13. Two-factor authentication systems, despite being relatively common, remain vulnerable to credential stuffing through password reuse, and cannot stay secure once both cues and responses are recorded. In face anti-spoofing, RGB models often overfit and fail badly in new environments, while their more robust alternatives using depth or infrared cameras depend on expensive, non-commodity hardware and so are not practically deployable. Even prospective acoustic-based solutions, though cost-efficient and precise, remain at a nascent stage and may need further assessment to justify real-world feasibility and integration into multimodal systems. On the whole, the literature lacks integrated, low-cost, user-friendly, attack-resistant authentication systems that can operate across a variety of environments and mobile platforms.
GATCSA: Gaze-Aware threat detection with contextual scene analysis
The proposed methodology, GATCSA, is a real-time, on-device security solution capable of detecting and countering shoulder spoofing attacks during mobile payment transactions. The process starts with the collection of data from varied real-world scenarios, including different gaze patterns, face directions, and positions of surveillance devices. Preprocessing methods such as frame extraction, noise removal, and normalization prepare the data for analysis. GATCSA then applies lightweight computer vision models to detect faces, infer gaze direction, and locate objects such as surveillance cameras in the user's vicinity. Through contextual scene analysis, the system weighs proximity, gaze duration, and ambient factors to determine risk levels via a rule-based classification model. Depending on the threat severity (low, medium, or high), users are notified in real time with actionable recommendations, including repositioning the phone or suspending the transaction. Everything is processed locally on the device to preserve data privacy and optimize resource usage. The system is thoroughly tested under diverse lighting environments, crowd levels, and device types to guarantee robustness, responsiveness, and flexibility in real-world use. The workflow is shown in Fig. 1.
Fig. 1.
Overall gaze aware threat detection workflow.
Data collection
The proposed GATCSA system is trained and tested for shoulder spoofing threat detection using the synthetic and real-world gaze datasets summarized in Table 1. The UnityEyes synthetic dataset offers labeled gaze images (gaze.h5) with extensive annotations in gaze.csv and gaze.json, captured under controlled simulated settings with pre-determined gaze vectors. The MPII Gaze dataset (real_gaze.h5) provides real-world gaze images recorded under varied, uncontrolled conditions to facilitate robust generalization. Together, these datasets cover a range of mobile payment scenarios, including lighting changes, head pose variations, gaze directions, and the presence of surveillance cameras or observers. Each image is tagged with parameters such as gaze vector, face orientation, and spatial context (e.g., screen proximity or camera angle). The integration of synthetic and real-world data enables supervised learning and domain adaptation methods such as SimGAN, ensuring that the trained model can identify visual threats under realistic mobile usage settings14.
Table 1.
Contextual gaze and environmental parameters for threat analysis.
| Parameter | Description |
|---|---|
| Gaze Direction (Look Vector) | 3D vector representing the direction of the user’s gaze |
| Face Orientation (Head Pose) | Euler angles or quaternion values indicating the rotation of the head |
| Eye Position | Coordinates of the pupils and eyelids in the image |
| Image Source Type | Synthetic (UnityEyes) or Real (MPII Gaze) |
| Lighting Condition | Bright, Dim, or Natural light conditions |
| Observer Presence | Boolean flag indicating whether another person is present in the scene |
| Proximity to Screen | Distance of the gaze source from the mobile device screen |
| Surveillance Device Detected | Flag indicating presence of camera or recording device in the scene |
| Threat Level | Classified as Low, Medium, or High based on gaze duration, proximity, and angle |
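For concreteness, one annotated sample carrying the parameters of Table 1 can be mirrored in a small record type. The sketch below is illustrative only: the field names are our own, and the actual UnityEyes/MPII Gaze files use their own keys and formats.

```python
from dataclasses import dataclass

@dataclass
class GazeSample:
    """One annotated frame, mirroring the parameters in Table 1.
    Field names are illustrative; the datasets' own keys differ."""
    gaze_direction: tuple       # 3D look vector
    head_pose: tuple            # Euler angles (pitch, yaw, roll)
    eye_position: tuple         # pupil coordinates in the image
    source: str                 # "UnityEyes" (synthetic) or "MPIIGaze" (real)
    lighting: str               # "bright" | "dim" | "natural"
    observer_present: bool      # another person visible in the scene
    proximity_m: float          # gaze source to screen distance, meters
    surveillance_detected: bool # camera/recording device in scene
    threat_level: str           # "Low" | "Medium" | "High"

sample = GazeSample(
    gaze_direction=(0.0, 0.1, -0.9), head_pose=(0.0, 0.2, 0.0),
    eye_position=(64, 48), source="UnityEyes", lighting="bright",
    observer_present=False, proximity_m=0.35,
    surveillance_detected=False, threat_level="Low")
```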
Data pre-processing
Preprocessing is important to ensure that both real (MPII Gaze) and synthetic (UnityEyes) gaze datasets are normalized and ready for input into the GATCSA threat detection model. Frame extraction for video-based input data is achieved through breaking down the input sequences into individual frames of images at a defined frame rate (e.g., 30 FPS), utilizing (1):
$$F_t = V(t), \qquad t = \frac{n}{f},\ n = 0, 1, 2, \ldots \tag{1}$$

Here, $F_t$ represents the frame at time $t$, $V$ is the input video stream, and $f$ is the frame rate. Frames are then normalized: every image is resized to a uniform input size (for instance, 128 × 128 or 224 × 224 pixels) to conform to the neural network models' input requirements. Pixel intensity values are also normalized to a common range, e.g. [0, 1], by (2):
$$I_{\text{norm}} = \frac{I - I_{\min}}{I_{\max} - I_{\min}} \tag{2}$$

where $I$ is the raw pixel intensity, and $I_{\min}$ and $I_{\max}$ are the minimum and maximum pixel intensities in the image. After that, noise reduction is performed to enhance detection clarity, particularly in real-world images that tend to include visual clutter or blur. Noise is suppressed while edges are maintained using Gaussian or bilateral filters. A standard 2D Gaussian filter is given by (3):
$$G(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right) \tag{3}$$

where $\sigma$ is the standard deviation of the Gaussian kernel, which determines the amount of smoothing. This filter is convolved with the image to produce a cleaner version. To preserve edges, a bilateral filter may be used instead, minimizing background clutter while retaining the eye-region features important for gaze estimation. Following preprocessing, the cleaned and normalized frames are passed to facial landmark detection and gaze vector prediction in the later phases of the GATCSA system.
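The min-max normalization of Eq. (2) and the Gaussian kernel of Eq. (3) can be sketched in a few lines of plain Python. This is a minimal illustration of the two formulas, not the production pipeline; in practice a library routine such as OpenCV's `GaussianBlur` or bilateral filter would be applied to full frames.

```python
import math

def normalize_pixels(pixels):
    """Min-max normalize raw intensities to the [0, 1] range (Eq. 2)."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:                      # flat image: avoid division by zero
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]

def gaussian_kernel(size, sigma):
    """Build a (size x size) 2D Gaussian smoothing kernel (Eq. 3),
    normalized so its entries sum to 1 before convolution."""
    half = size // 2
    raw = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            for x in range(-half, half + 1)]
           for y in range(-half, half + 1)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]
```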
Face and gaze detection
In the GATCSA framework, real-time face and gaze detection is critical for identifying possible shoulder spoofing attacks during mobile transactions. The system starts with face detection, where lightweight deep learning-based models such as MTCNN (Multi-task Cascaded Convolutional Networks) are utilized to detect faces efficiently in each input frame. These models generate bounding boxes around detected face regions and simultaneously predict facial landmarks, offering a speed-accuracy trade-off suitable for mobile devices.
Subsequent to face detection, the system aligns facial landmarks to obtain accurate coordinates of prominent facial features, particularly the eyes, nose, and mouth. In this step, each eye region is characterized by its inner and outer corners and pupil center, which form the basis for computing the gaze direction. Let $(x_l, y_l)$ and $(x_r, y_r)$ be the coordinates of the left and right eye centers, respectively, and $(x_n, y_n)$ the nose tip. The eye center vector is calculated as (4):

$$E_c = \left(\frac{x_l + x_r}{2}, \; \frac{y_l + y_r}{2}\right) \tag{4}$$
This sets a consistent reference point for gaze estimation. The gaze is then estimated from the relative direction between the eye vector and the camera. The gaze vector is generally computed using geometric methods or deep neural networks trained on annotated databases such as UnityEyes and MPII Gaze. In geometric methods, the gaze direction $\mathbf{g}$ can be computed as the normalized difference between the pupil position and the optical axis of the camera in (5):

$$\mathbf{g} = \frac{P - C}{\lVert P - C \rVert} \tag{5}$$

where $P$ denotes the 3D position of the pupil (estimated from landmarks) and $C$ denotes the camera center or reference viewing axis. In deep learning-based gaze estimation, a CNN model predicts 3D gaze angles, pitch ($\theta$) and yaw ($\phi$), which denote the vertical and horizontal direction of gaze, respectively. These angles can be converted into a unit gaze vector as (6):

$$\mathbf{g} = \left(\cos\theta \sin\phi, \; \sin\theta, \; \cos\theta \cos\phi\right) \tag{6}$$
This gaze direction helps decide whether a detected person is staring at the screen (i.e., a potential threat). A threshold on the angular offset determines whether the gaze lies within a sensitive field-of-view cone centered on the mobile screen. If a persistent gaze is found within this cone, particularly from someone other than the user, it is flagged as suspicious. Through the integration of lightweight face detection and robust gaze vector analysis, the system continuously monitors for would-be shoulder spoofing attacks in real time. This enables proactive notifications to the user whenever another person visually focuses on the screen during a mobile transaction, thereby enhancing the security of mobile payments in public settings.
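A minimal sketch of the pitch/yaw-to-vector conversion of Eq. (6) and the field-of-view cone test described above, in plain Python. The gaze-vector convention and the 20° cone half-angle are assumptions for illustration; the text does not fix these specific values.

```python
import math

def gaze_vector(pitch, yaw):
    """Convert predicted pitch/yaw angles (radians) into a 3D unit gaze
    vector, using one common convention for Eq. (6)."""
    return (math.cos(pitch) * math.sin(yaw),
            math.sin(pitch),
            math.cos(pitch) * math.cos(yaw))

def angle_between(u, v):
    """Angle (radians) between two 3D vectors: the angular offset."""
    dot = sum(a * b for a, b in zip(u, v))
    mag = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return math.acos(max(-1.0, min(1.0, dot / mag)))

def gaze_in_screen_cone(gaze, screen_axis, half_angle=math.radians(20.0)):
    """Flag a gaze whose offset from the screen's viewing axis lies inside
    the sensitive cone (the 20-degree half-angle is an assumed threshold)."""
    return angle_between(gaze, screen_axis) < half_angle
```

A persistent `True` result across consecutive frames, for a face other than the user's, would then be escalated to the risk-evaluation stage.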
Surveillance Camera/Object detection
To make the GATCSA system resistant to shoulder spoofing by both humans and devices, it must identify not only offending glances but also devices in the surroundings that might violate the user's privacy. These may include surveillance cameras, cell phones, or people holding phones along possible lines of sight. To this end, the proposed solution utilizes advanced object detection algorithms such as YOLOv5 (You Only Look Once) and SSD (Single Shot Multibox Detector), which scan each frame in real time to identify pre-configured threat-relevant object classes such as surveillance cameras, mobile phones, and persons.
YOLOv5 is particularly suitable owing to its rapid detection rate and precision, and it can be tailored to mobile and embedded systems. Given an input frame, the model outputs the predicted bounding boxes $B_i$ and class probabilities $C_i$ of all detected objects in (7):

$$\{(B_i, C_i)\}_{i=1}^{N} = f_{\text{YOLO}}(I) \tag{7}$$

where $I$ is the input image, $B_i$ is the bounding box of the $i$-th object, $C_i$ is its object class (e.g., surveillance camera, person, mobile phone), and $f_{\text{YOLO}}$ is the YOLOv5 model function.
Once identified, each object is contextually labeled according to its location and relevance to the mobile user. For example, an individual standing immediately behind or next to the user while holding a device is marked as a mobile snooping threat, whereas a stationary security camera pointed at the screen is marked as a fixed visual threat. To aid classification, the spatial distance between the user's screen and each detected object is calculated, and a threshold angle and distance decide whether the object lies within a threat area. Contextual tagging is further strengthened by temporal consistency checks across several frames: objects repeatedly confirmed within the threat area are assigned higher risk weights.
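The contextual tagging and temporal consistency checks described above can be sketched as follows. The class names, the 1.5 m threat distance, and the three-frame confirmation window are illustrative assumptions, not values taken from the text.

```python
from collections import defaultdict

# Hypothetical label set; a real detector's class names may differ.
THREAT_CLASSES = {"person", "surveillance_camera", "mobile_phone"}

def tag_detection(label, distance_m, threat_distance_m=1.5):
    """Tag one detection: threat-relevant classes inside the distance
    threshold are flagged (the 1.5 m threshold is an assumed value)."""
    if label in THREAT_CLASSES and distance_m <= threat_distance_m:
        return "threat"
    return "benign"

class TemporalConfirmer:
    """Raise an object's risk weight only after it is re-detected in the
    threat area for several consecutive frames, per the temporal
    consistency check."""
    def __init__(self, min_frames=3):
        self.min_frames = min_frames
        self.streak = defaultdict(int)

    def update(self, threat_ids):
        present = set(threat_ids)
        for obj_id in list(self.streak):   # reset objects that vanished
            if obj_id not in present:
                del self.streak[obj_id]
        confirmed = set()
        for obj_id in present:
            self.streak[obj_id] += 1
            if self.streak[obj_id] >= self.min_frames:
                confirmed.add(obj_id)
        return confirmed
```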
Contextual scene analysis
Contextual scene analysis is central to assessing the gravity of shoulder spoofing attacks because it takes several physical and behavioral factors into account. Unlike conventional binary classifiers, it accounts for the intricacy of the real scene through variables such as user-observer distance, the number of detected faces and devices, gaze direction and duration, and image luminance. These context factors are combined to estimate the overall threat level by means of a fuzzy logic system or a rule-based scoring system, which provides a granular, more human-like decision-making process.
The distance between a potential observer and the mobile device is calculated using the Euclidean distance $d$ given in (8):

$$d = \sqrt{(x_o - x_s)^2 + (y_o - y_s)^2} \tag{8}$$

where $(x_o, y_o)$ denote the coordinates of the observer (face or device) and $(x_s, y_s)$ denote the coordinates of the user's screen. The gaze duration $t_g$ is monitored as the number of consecutive frames in which a person's gaze lies within the screen-centered cone of attention; longer durations correspond to higher threat potential. The gaze angle $\theta_g$ is computed from the gaze direction $\mathbf{g}$ with respect to the screen-normal direction $\mathbf{n}$, according to (9):

$$\theta_g = \cos^{-1}\!\left(\frac{\mathbf{g} \cdot \mathbf{n}}{\lVert \mathbf{g} \rVert \, \lVert \mathbf{n} \rVert}\right) \tag{9}$$
Lower values of $\theta_g$ indicate a gaze aimed directly at the user's screen. The system also takes into account the number of faces $N_f$ and devices $N_d$ found in the scene; higher counts indicate busier situations and greater exposure risk. Lighting conditions are grouped into bright, dim, or low-light categories based on average frame luminance, to account for visibility and shadowing effects that influence detection reliability and risk interpretation.
These inputs are passed to a fuzzy logic module that converts them into linguistic variables (such as “close”, “moderate”, “far” for distance, or “short”, “medium”, “long” for gaze duration). Relying on a predefined set of if-then rules, the module then computes a threat score $S_t$ between 0 (no threat) and 1 (high threat). For instance, one rule could be: if the gaze angle is small, the gaze duration long, and the proximity close, then the threat level is high. The total threat score is calculated as (10):

$$S_t = \sum_{i} w_i f_i \tag{10}$$

where $f_i$ are the normalized feature values and $w_i$ are their corresponding weights, learned through rule-based tuning or training on tagged examples. This scene awareness lets GATCSA reason not only about what is in the scene but about how it relates to the user's security, enabling real-time, ranked alerts correlated with situational danger levels. This multi-level intelligence significantly raises the robustness of mobile payment protection in changing and uncertain real-life situations.
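As a sketch, the weighted aggregation of Eq. (10) might look as follows. The specific feature names and weights are illustrative assumptions, since the weights are learned via rule-based tuning rather than fixed in the text.

```python
# Illustrative weights for Eq. (10); these specific values are assumptions.
WEIGHTS = {"proximity": 0.30, "gaze_angle": 0.25, "gaze_duration": 0.25,
           "crowding": 0.10, "surveillance": 0.10}

def threat_score(features, weights=WEIGHTS):
    """Weighted aggregate threat score S_t = sum_i w_i * f_i (Eq. 10).
    Features are assumed pre-normalized to [0, 1] and the weights sum
    to 1, so S_t also lies in [0, 1]."""
    return sum(weights[name] * features[name] for name in weights)

# A nearby observer staring at the screen for a long time:
close_starer = {"proximity": 0.9, "gaze_angle": 0.8, "gaze_duration": 0.9,
                "crowding": 0.3, "surveillance": 0.0}
score = threat_score(close_starer)   # → 0.725, a high-risk score
```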
Threat classification
Upon extracting and processing the contextual parameters, the next critical step is threat classification, which determines the severity of potential shoulder spoofing attacks. This enables real-time system response and prioritized user alerts. Threats are categorized into three tiers (Low, Moderate, and High) based on a combination of gaze behavior, spatial proximity, and the presence of surveillance equipment. The system uses a lightweight Support Vector Machine (SVM), a decision tree, or a heuristic rule-based classifier to efficiently map the environmental and behavioral indicators to risk levels with low computational overhead appropriate for mobile devices. The feature vector $\mathbf{F}$ used during classification comprises the following normalized components in (11):

$$\mathbf{F} = [d,\ \theta,\ t_g,\ n_f,\ n_d,\ c_s,\ l] \tag{11}$$

where: $d$: distance between the viewer and the device; $\theta$: angle of view to the screen; $t_g$: duration of gaze in seconds; $n_f$: number of detected faces; $n_d$: number of adjacent devices (e.g., phones, cameras); $c_s$: binary variable for the presence of a surveillance camera; $l$: level of light (as a numeric value).
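To make the normalization concrete, a minimal sketch of assembling $\mathbf{F}$ from raw sensor readings is shown below. The normalization ranges (maximum distance, angle, counts, and lux) are illustrative assumptions, not values from the paper.

```python
def build_feature_vector(d, theta, t_g, n_f, n_d, c_s, lux,
                         d_max=3.0, theta_max=90.0, t_max=5.0,
                         n_f_max=5, n_d_max=3, lux_max=1000.0):
    """Assemble the normalized feature vector F of Eq. (11)."""
    clip = lambda v: max(0.0, min(1.0, v))
    return [
        clip(d / d_max),         # viewer-to-device distance
        clip(theta / theta_max), # viewing angle to the screen
        clip(t_g / t_max),       # gaze duration in seconds
        clip(n_f / n_f_max),     # number of detected faces
        clip(n_d / n_d_max),     # adjacent devices (phones, cameras)
        1.0 if c_s else 0.0,     # surveillance camera present
        clip(lux / lux_max),     # ambient light level
    ]

F = build_feature_vector(d=1.2, theta=15.0, t_g=3.5, n_f=2, n_d=1,
                         c_s=True, lux=400.0)
```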
For SVM classification, the goal is to identify a hyperplane that discriminates the risk classes in the feature space. The decision function is expressed as (12):

$$f(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b \tag{12}$$

where $\mathbf{w}$ represents the weight vector learned from training data, and $b$ is the bias. The output is then mapped to one of the three risk levels based on calibrated probability thresholds:
Low Risk: P(High) < 0.3.
Moderate Risk: 0.3 ≤ P(High) < 0.6.
High Risk: P(High) ≥ 0.6.
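A sketch of the decision function of Eq. (12) and the probability-threshold mapping follows. How the calibrated probability P(High) is obtained from f(x) (for example via Platt scaling) is not specified in the text, so only the mapping itself is shown.

```python
def svm_decision(x, w, b):
    """Linear SVM decision value f(x) = w^T x + b (Eq. 12)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def risk_from_probability(p_high):
    """Map the calibrated P(High) to a risk tier per the stated thresholds."""
    if p_high < 0.3:
        return "Low Risk"
    if p_high < 0.6:
        return "Moderate Risk"
    return "High Risk"
```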
Alternatively, a decision tree classifier assigns the risk level by traversing a chain of logical tests, such as:
If gaze duration > 3 s and gaze angle below the configured threshold and distance < 1.5 m → High Risk.
If gaze duration > 1 s but distance > 2 m → Moderate Risk.
Else → Low Risk.
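The rule chain above translates directly into code. Because the exact gaze-angle cutoff was lost in the source, a placeholder of 30 degrees is assumed here.

```python
def rule_based_risk(gaze_duration_s, gaze_angle_deg, distance_m,
                    angle_threshold_deg=30.0):
    """Decision-tree style risk assignment from the listed rules.

    angle_threshold_deg is an assumed placeholder; the source does not
    give the exact gaze-angle cutoff.
    """
    if (gaze_duration_s > 3.0
            and gaze_angle_deg < angle_threshold_deg
            and distance_m < 1.5):
        return "High Risk"
    if gaze_duration_s > 1.0 and distance_m > 2.0:
        return "Moderate Risk"
    return "Low Risk"
```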
Within the heuristic rule-based system, the threat score $S_t$ from the contextual scene analysis (Step 5) is used directly in (13):

$$\text{Risk} = \begin{cases} \text{Low}, & S_t < 0.3 \\ \text{Moderate}, & 0.3 \le S_t < 0.6 \\ \text{High}, & S_t \ge 0.6 \end{cases} \tag{13}$$
This low-latency classification mechanism supports quick and precise risk assessment, enabling the system to promptly alert the user with actionable instructions such as repositioning the phone or terminating the transaction. The classification module is optimized for minimal latency and energy consumption, making it highly appropriate for real-time usage on mobile platforms, and it significantly enhances shoulder spoofing protection across varied public environments.
Real-Time alert & notification system
Once a possible shoulder spoofing threat has been classified, the system must warn the user immediately through an efficient real-time alert and notification mechanism. The module is context-sensitive: the nature and intensity of the warning depend on the severity of the detected threat. For low-risk circumstances (e.g., a brief glance from far away), the system can issue a soft pop-up notification or vibrotactile feedback. In contrast, for high-risk situations (e.g., a direct gaze, close proximity, or camera detection), the system provides a stronger alert with suggested actions, for instance:
Reposition the device to shield the screen from visible angles.
Enable a privacy screen filter to obscure display content.
Temporarily pause the transaction until the risk subsides.
If the threat persists or grows beyond a specified limit (e.g., $S_t \ge 0.8$ from Step 5), the system can initiate interrupt handling, suspending or delaying the transaction and prompting the user to explicitly authorize resumption. This ensures critical inputs (e.g., MPIN) are not entered under conditions of compromised visual privacy. The alert infrastructure is based on an event-driven model, with incoming threat classifications triggering notification processes through (14):

$$A = g(S_t, \theta, d) \tag{14}$$

where $A$ is the alert level or suggestion, determined by the threat score and the gaze/distance variables.
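The event-driven dispatch of Eq. (14) can be sketched as a mapping from the classified tier and threat score to an alert action. The message strings and channel names below are illustrative, while the 0.8 interrupt threshold follows the text.

```python
def dispatch_alert(s_t, tier):
    """Select an alert (channel, suggestion) from the threat score and tier."""
    if s_t >= 0.8:
        # Interrupt handling: pause until the user explicitly resumes.
        return ("interrupt", "Transaction paused until you confirm it is safe")
    if tier == "High Risk":
        return ("strong_alert", "Reposition the device or enable a privacy filter")
    if tier == "Moderate Risk":
        return ("popup", "Someone nearby may be watching your screen")
    return ("haptic", "Subtle vibration cue only")
```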
Local processing & privacy assurance
In order to maintain usability and trust, the entire GATCSA pipeline performs its computation on-device, so that sensitive visual and behavioral data never leave the user's smartphone. Low-latency, optimized neural networks such as MobileNet, Tiny-YOLO, or TensorFlow Lite-converted models are used for efficient object detection, gaze estimation, and face detection. These models are pruned and quantized to minimize computational latency and overhead without compromising detection quality. To enhance privacy, each frame of the front camera's video is analyzed in memory and discarded immediately after analysis. No frames are saved or uploaded to remote servers, ensuring zero data persistence. This is GDPR-compliant and preserves user anonymity during sensitive transactions. Furthermore, the system is optimized for low energy consumption to avoid battery drain. Resource-conscious scheduling and frame-skipping methods are implemented, processing only selected frames at intervals instead of every individual frame. Energy consumption $E$ can be represented as (15):

$$E = P_f \cdot T \cdot \frac{N_p}{N_c} \tag{15}$$

where $P_f$ is the power drawn per frame processed, $T$ is the total active monitoring time, $N_p$ is the number of frames processed, and $N_c$ is the total number of frames captured. By varying the processing ratio $N_p/N_c$, the system balances performance against power, and can hence run continuously in the background while the user makes mobile payments.
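The energy model of Eq. (15) makes the benefit of frame skipping easy to quantify. The power and timing numbers below are hypothetical.

```python
def monitoring_energy(p_frame, duration, frames_processed, frames_captured):
    """Energy of Eq. (15): E = P_f * T * (N_p / N_c)."""
    return p_frame * duration * (frames_processed / frames_captured)

# Hypothetical 10 s monitoring window at 30 fps (300 captured frames):
full_rate = monitoring_energy(50.0, 10.0, 300, 300)   # every frame processed
skip_4_of_5 = monitoring_energy(50.0, 10.0, 60, 300)  # one frame in five
# Processing one frame in five cuts energy to 20% of full-rate monitoring.
```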

Pseudocode: GATCSA Algorithm
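The per-transaction monitoring loop described across the preceding steps can be summarized as the following reconstructed outline; the detector and classifier internals are those introduced in Eqs. (10) to (15).

```
Algorithm GATCSA (reconstructed outline)
Input:  live front-camera stream during an active transaction
Output: threat tier and contextual user alert
1.  while the transaction is active:
2.      capture a frame in memory (skip frames per the energy budget, Eq. 15)
3.      detect nearby faces and estimate gaze angle, duration, and distance
4.      detect surveillance devices and adjacent phones in the scene
5.      build the normalized feature vector F (Eq. 11)
6.      compute the threat score S_t = sum(w_i * x_i) (Eq. 10)
7.      classify the risk tier via SVM / decision tree / rules (Eqs. 12-13)
8.      if tier is not Low: issue the contextual alert A (Eq. 14)
9.      if S_t >= 0.8: pause the transaction until the user confirms
10.     discard the frame (no storage or upload)
```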
Results and discussion
The GATCSA system was tested thoroughly under a variety of real-world conditions to determine its resilience and applicability in countering shoulder spoofing attacks on mobile payment transactions. Experiments were performed under different lighting conditions (bright indoor, low light, and outdoor daylight), different crowd densities (from empty to very dense areas), and on various mobile platforms, including Android and iOS devices of different screen sizes and hardware capabilities. The system showed high threat detection accuracy with minimal false positive and false negative rates, proving effective at distinguishing perceived from actual threats. The average system latency was below 180 milliseconds, ensuring real-time performance. Over 87% of users responded promptly to the notifications, within a few seconds, indicating high user engagement and trust in the system. These findings confirm GATCSA as a secure, effective, and easy-to-use solution for augmenting mobile payment security through real-time visual threat detection.
Figure 2 shows the temporal profile of user responses to alerts. At 0 s the response rate is 20%; within 2 s it climbs to 70%, and it reaches 87% at 3 s, after which it plateaus. This rapid response to security signals underscores the system's ability to capture user attention within seconds in critical situations.
Fig. 2.
User responsiveness over time.
Figure 3 uses a confusion matrix to assess the performance of the threat detection model against predicted and actual classifications. The model correctly predicted 82 cases as "No Threat" and 89 cases as "Threat." Across the full 320 test instances, it misclassified only 5 "No Threat" instances as threats (false positives) and 4 "Threat" instances as "No Threat" (false negatives), showing high overall accuracy and balanced prediction performance.
Fig. 3.
Confusion matrix.
Figure 4 compares the model inference time on Android and iOS devices, reflecting the responsiveness of threat detection. Inference on Android averages about 160 milliseconds, while on iOS it rises to about 200 milliseconds. This indicates that while both platforms achieve efficient model execution, Android performs real-time processing slightly faster for security-related tasks.
Fig. 4.

Model inference time on devices.
Figure 5 shows how the identified threat types are distributed in the mobile payment security system. Gaze-based threats account for the highest share at 40%, followed by combined gaze and surveillance-device threats at 35%. Threats from surveillance devices alone make up 25%. This highlights visual observation as the dominant security concern in shoulder spoofing scenarios.
Fig. 5.

Distribution of detection threat type.
Figure 6 illustrates the relationship between alert latency and threat level in the proposed mobile payment security system. The higher the threat severity, the faster the system reacts: latency is 250 milliseconds for low threats, 180 milliseconds for moderate threats, and 120 milliseconds for high threats. This demonstrates the model's responsiveness to critical risks.
Fig. 6.
Alert latency vs. threat severity.
Figure 7 shows the accuracy of the threat detection model under various environmental lighting conditions. The bright environment yields the highest performance at about 92.1%, the dim environment follows at about 90.3%, and the outdoor environment reaches about 88.7%. These findings show that the model remains robust and adaptable across varied real-life situations.
Fig. 7.

Accuracy under varying environmental conditions.
A comparative analysis of the accuracy of various threat detection methods is given in Table 2. PassNum [9] achieves an accuracy of 94%, while Echo-FAS [11] scores slightly higher at 95.18%. The GATCSA technique outperforms both, achieving 97.8% accuracy and showing that it is better suited to recognizing shoulder spoofing threats in mobile payment contexts.
Table 2.
Performance metrics comparison.
Figure 8 presents a bar chart comparing the accuracy of the three methods used to detect mobile payment threats. PassNum [9] reaches 94%, Echo-FAS [11] scores slightly higher at 95.18%, and the proposed GATCSA method achieves a significantly higher 97.8%, surpassing the other two by a clear margin. This demonstrates GATCSA's strength in detecting vulnerabilities associated with shoulder spoofing.
Fig. 8.
Accuracy comparison of models.
The performance comparison in Table 3 covers the accuracy of various approaches on the eye gaze dataset. NeuroSpatialIOT [15] performs well at 94.6% accuracy, whereas MediaPipe [16] shows markedly lower effectiveness at 30%, suggesting a lack of robustness in gaze detection. The proposed GATCSA method offers the highest accuracy at 97.8%, demonstrating superior reliability in eye gaze recognition tasks. This comparison highlights GATCSA's improvement toward near-optimal performance over existing strategies.
Table 3.
Performance metrics comparison of eye gaze dataset.
| Methods | Accuracy (%) |
|---|---|
| NeuroSpatialIOT [15] | 94.6 |
| MediaPipe [16] | 30 |
| GATCSA | 97.8 |
Figure 9 shows the accuracy comparison of the three approaches on the eye gaze dataset. NeuroSpatialIOT [15] attains a high accuracy of 94.6%, while MediaPipe [16] reaches only 30%, indicating low efficacy for the task. The proposed GATCSA approach delivers the best result with the highest accuracy of 97.8%, outperforming the others in eye gaze prediction.
Fig. 9.
Performance metrics comparison eye gaze dataset.
Discussion
The quantitative performance metrics reported in this study (accuracy of 97.8%, false positives/negatives, and average latency < 180 ms) were obtained through a structured evaluation protocol using a combined test set derived from both synthetic and real-world gaze datasets. Specifically, the evaluation employed a total of 320 labelled test instances, constructed from a held-out subset of the UnityEyes (synthetic) and MPII Gaze (real-world) datasets, ensuring diversity in gaze direction, lighting conditions, proximity, and observer presence. Each instance corresponds to a distinct transaction-context frame sequence annotated with ground-truth threat labels (Low, Moderate, High), which were binarized into Threat vs. No Threat for confusion-matrix analysis. The statement “misclassified 5 out of the 320 instances” refers to the total number of incorrect predictions across this complete test set (5 false positives and 4 false negatives). The reference to “82 cases out of 100” denotes a normalized subset-based illustration used solely for visual clarity in Fig. 3, where results were proportionally scaled to a 100-sample representation to improve interpretability; this subset is not independent of the full 320-instance evaluation. Overall accuracy (97.8%) was computed as the ratio of correctly classified instances to the total 320 test instances. Latency measurements were obtained by averaging per-frame inference times over all test instances across Android and iOS devices during live transaction simulations. This clarification ensures consistency between reported metrics, figures, and dataset usage, thereby improving the transparency and reproducibility of the evaluation methodology.
The comparative accuracy statements reported for GATCSA in relation to PassNum and Echo-FAS are intended to provide contextual performance positioning rather than a strict head-to-head benchmark under identical experimental pipelines. While all three methods address security vulnerabilities related to observational or spoofing attacks, they are not originally designed for an identical task definition nor evaluated on the same datasets in their respective publications. To ensure meaningful comparison and avoid overreaching conclusions, the reported accuracy values for PassNum and Echo-FAS were derived from their original peer-reviewed evaluations under their native experimental settings, whereas the 97.8% accuracy for GATCSA was obtained from controlled experiments on a combined synthetic and real-world gaze dataset specifically curated for shoulder spoofing threat detection. The comparison therefore reflects relative effectiveness within a common problem domain (visual attack resistance in mobile interactions) rather than a direct experimental equivalence. Importantly, all methods were evaluated against their respective ground-truth definitions of threat or spoofing detection, and the higher accuracy achieved by GATCSA indicates its suitability for the specific task of gaze-aware, context-based shoulder spoofing detection. To avoid ambiguity, the manuscript clarifies that these results should be interpreted as indicative performance comparison across related security paradigms, and not as a one-to-one replication under identical datasets or training protocols.
To improve clarity regarding result provenance, we explicitly distinguish the roles of the datasets used at different stages of the GATCSA evaluation. The system was trained using a combination of synthetic and real-world gaze datasets, where the UnityEyes dataset provided controlled, densely annotated gaze-direction and head-pose samples for initial model learning, while the MPII Gaze dataset contributed real-world variability in lighting, pose, and environmental noise to enhance generalization. Quantitative metrics such as the reported 97.8% classification accuracy, false positive/negative rates, and confusion-matrix analysis were computed on a held-out test set drawn from this combined dataset, ensuring that no training samples were reused during evaluation. In contrast, the results labelled as “real-world testing conditions” (e.g., latency measurements, robustness under varying lighting and crowd density, and device-specific inference time) were obtained from live transaction simulations conducted on physical Android and iOS devices, rather than from offline datasets. Finally, user responsiveness metrics (e.g., response time to alerts and engagement percentages) were derived exclusively from a user-in-the-loop study, where participants interacted with the deployed GATCSA prototype in controlled yet realistic public environments. This separation clarifies that dataset-driven results, system-level performance evaluations, and user-behaviour measurements originate from distinct but complementary evaluation stages, thereby strengthening the transparency and interpretability of the reported findings.
From a practical deployment perspective, several usability, privacy, and adoption considerations merit explicit discussion. Although GATCSA performs all visual processing locally and does not store or transmit camera data, the requirement for continuous front-camera activation during payment interactions may raise user-perceived privacy concerns and affect acceptance in certain contexts. To mitigate this, GATCSA is designed to operate in a transaction-scoped and user-consented manner, where camera monitoring is activated only during sensitive input phases (e.g., MPIN entry) and is visually indicated to the user, thereby improving transparency and trust. In addition, adaptive sampling strategies—such as reduced frame rates, event-triggered activation, and confidence-based early stopping—can significantly limit camera usage while preserving detection effectiveness. Environmental constraints, including extreme low-light conditions, partial facial occlusions, and crowded scenes, may degrade gaze or object detection accuracy; however, these limitations are partially alleviated through multi-frame temporal aggregation, confidence weighting, and conservative threat classification that favors false negatives over intrusive false positives in ambiguous conditions. Hardware diversity across mobile devices (camera quality, sensor placement, processing capability) is addressed through the use of lightweight, resolution-agnostic models and on-device optimization frameworks, although performance may still vary across low-end devices. Finally, adoption considerations are best addressed through tight integration with mobile operating systems or payment applications, allowing GATCSA to function as an opt-in security layer aligned with existing user workflows rather than as a continuously running background service. 
These considerations highlight that, while GATCSA is technically feasible and effective, real-world deployment benefits from user-centered design choices and platform-level integration to balance security, usability, and privacy.
The proposed GATCSA architecture delivers a notable enhancement in mobile payment security by actively identifying and mitigating shoulder spoofing attacks. The system integrates eye movement tracking, surveillance-object monitoring, context-based scene analysis, and real-time risk classification to offer the user timely alerts and appropriate response options. Its accuracy of 97.8% exceeds that of existing approaches such as PassNum and Echo-FAS, underscoring its power and reliability. The system is effective in various lighting environments (daytime, evening, and outdoors) and offers fast response times with little latency, particularly for high threats. A major advantage of this work is its on-device processing capability, which protects user privacy, combined with very low energy consumption, which is key to mass deployment on mobile products. Moreover, the context-aware alert system enhances user responsiveness with feedback tailored to the degree of threat. However, certain shortcomings should be mentioned. The reliance on front-camera input may raise privacy concerns for many users even when local processing is assured. Extremely dim lighting or heavy occlusion can also make gaze and object detection less stable, impairing performance. The framework additionally requires continuous camera availability during transactions, which may not always be feasible or comfortable for every user. Despite these challenges, the GATCSA framework is a highly versatile and effective means of resisting visual eavesdropping attacks in mobile payment settings, thereby fostering user confidence and e-commerce safety.
Future enhancements will explore cross-device optimization and integration with hardware-level privacy functions to address these drawbacks.
Conclusion and future works
The GATCSA model provides a powerful and intelligent solution to the continuously rising problem of shoulder spoofing in mobile payment systems. By combining real-time gaze estimation, object detection, and analysis of the surroundings, GATCSA can identify a wide range of visual threats posed by nearby observers or cameras during mobile transactions. The system's ability to rank the degree of risk and deliver real-time, contextual alerts significantly boosts user awareness and transaction safety. In contrast to traditional security measures, which offer only passive protection of data, GATCSA takes a proactive approach, giving the mobile user timely advice that helps avoid data exposure in public or semi-public areas. The system runs on standard mobile devices, uses local processing to preserve privacy, and employs energy-efficient algorithms. Evaluation across different lighting settings, crowd densities, and device types demonstrates that the system is accurate and flexible, and the users' prompt reactions demonstrate the value of real-time alerts in prompting secure behavior.
Future work will focus on evolving the system to identify multi-modal threats using audio signals, thermal signatures, or inertial sensor readings to enhance situational awareness. The accuracy of gaze and object recognition in cluttered scenes can also be increased through improved deep learning models. Moreover, native integration of the GATCSA framework into mobile operating systems or payment apps could improve deployment and user experience. Longitudinal user studies are also needed to evaluate the system's acceptability, behavioral effects, and usability in the real world. Overall, GATCSA sets a high standard for next-generation mobile payment security, promoting trust, safety, and digital inclusion in increasingly interconnected environments.
Acknowledgements
This research is supported by a grant (CRPG – 25–3170) under the Cyber Security Research and Innovation Pioneers Initiative, provided by the National Cyber Security Authority (NCA) in the Kingdom of Saudi Arabia.
Author contributions
Omar Alqahtani conceptualized the study and led the overall project design. Dileep M R coordinated the research, supervised methodology development, and served as the corresponding author. Mohamed Ghouse contributed to the system architecture design and experimental validation. Vidya Sagar S D carried out the data preprocessing, model training, and performance evaluation. Sreekanth Rallapalli was responsible for the literature review, related works survey, and drafting the initial sections of the manuscript. Ajit Danti contributed to the theoretical framework, result interpretation, and critical revisions of the manuscript. Omar Alqahtani and Dileep M R wrote the main manuscript text. All authors reviewed, edited, and approved the final version of the manuscript.
Funding
This research was funded by the National Cyber Security Authority (NCA) in the Kingdom of Saudi Arabia under the Cyber Security Research and Innovation Pioneers Initiative, Grant number CRPG – 25–3170.
Data availability
The datasets generated and/or analysed during this study are part of the National Cyber Security Authority (NCA) in the Kingdom of Saudi Arabia under the Cyber Security Research and Innovation Pioneers Initiative, Grant number CRPG – 25–3170 and are not publicly available due to project-specific restrictions. However, they may be provided to researchers upon reasonable request to the corresponding author for the purpose of further scientific collaboration.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Corbett, M., David-John, B., Shang, J. & Ji, B. ShouldAR: Detecting shoulder surfing attacks using multimodal eye tracking and augmented reality. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 8 (3), 1–23 (2024).
- 2. Hasan, S. S. U., Ghani, A., Daud, A., Akbar, H. & Khan, M. F. A review on secure authentication mechanisms for mobile security. Sensors 25 (3), 700 (2025).
- 3. Talbi, A., Abdellaoui, M., Pons-Prats, J. & Kuljanin, J. An overview on computer vision analysis in the airport applications. IEEE Access 13, 120446–120471 (2025).
- 4. Hallal, L., Rhinelander, J. & Venkat, R. Recent trends of authentication methods in extended reality: A survey. Appl. Syst. Innov. 7 (3), 45 (2024).
- 5. Farzand, H., Abraham, M., Brewster, S., Khamis, M. & Marky, K. A systematic deconstruction of human-centric privacy & security threats on mobile phones. Int. J. Human–Computer Interact. 41 (2), 1628–1651 (2025).
- 6. de Melo, L. P. et al. A secure approach out-of-band for e-bank with visual two-factor authorization protocol. Cryptography 8 (4), 51 (2024).
- 7. Li, J., Ray, S., Rajanna, V. & Hammond, T. Evaluating the performance of machine learning algorithms in gaze gesture recognition systems. IEEE Access 10, 1020–1035 (2021).
- 8. Quincozes, V. E., Mansilha, R. B., Kreutz, D., Miers, C. C. & Immich, R. Auth4App: Streamlining authentication for integrated cyber–physical environments. J. Inf. Secur. Appl. 83, 103802 (2024).
- 9. Ahmad, A., Asif, M., Hamid, I. & Aljuaid, H. PassNum: A usable and secure method against repeated shoulder surfing. Behav. Inf. Technol. 44, 1–27 (2025).
- 10. Yang, Z. & Kong, J. Cue-based two factor authentication. Comput. Secur. 146, 104068 (2024).
- 11. Kong, C., Zheng, K., Wang, S., Rocha, A. & Li, H. Beyond the pixel world: A novel acoustic-based face anti-spoofing system for smartphones. IEEE Trans. Inf. Forensics Secur. 17, 3238–3253 (2022).
- 12. Jiang, Y., Zhu, H., Chang, S. & Li, B. Continuous user authentication based on subtle intrinsic muscular tremors. IEEE Trans. Mob. Comput. 23 (2), 1930–1941 (2023).
- 13. Binbeshr, F., Por, L. Y., Kiah, M. M., Zaidan, A. & Imam, M. Secure PIN-entry method using one-time PIN (OTP). IEEE Access 11, 18121–18133 (2023).
- 14. Quant Eye Gaze dataset. [Online]. Available: https://www.kaggle.com/datasets/4quant/eye-gaze
- 15. Wang, W., Wang, K. & Du, H. Design and optimization of human-machine interaction interface for the intelligent internet of things based on deep learning and spatial computing. Egypt. Inf. J. 30, 100685 (2025).
- 16. Dhananjaya, B. S. Eye-Controlled Computer Mouse Using the MediaPipe System. PhD Thesis, Vilnius University (2023).